Sign in to follow this  
venzon

Clamping a value to a range in C++, quickly

Recommended Posts

venzon    256
I need some advice about how to speed up an operation I do a lot as part of a big loop: I need to take a float value and clamp it to a range. This is in C++. Here's what I do right now:
for (int i = 0; i < 512; ++i)
{
    float value = somearray[i];
    float clamp_at = 0.2;
    if (value > clamp_at)
        value = clamp_at;
    if (value < -clamp_at)
        value = -clamp_at;

    //... other code here ...
}
Is there some faster way to get this done? It needs to be portable across 32-bit, 64-bit, little endian, and big endian platforms... so I think that means assembly is probably out. Thanks in advance.

Share this post


Link to post
Share on other sites
Xai    1838
I'm not saying assembly is going to be better or not in this case, but assembly is never out in a case like this ... you would just need to have it be conditional ... and to provided the greatest benefit you would do it in a way that only makes the decision once ... like a function point, or an if outside the loop at least:

if(is x386)
386 optimized assembly
else if (is ...)
etc.
else
strait C++ version(current code maybe)

but of course better would be something like:

having constructed the architecure specific class at program load time, so that when you do:

MyArchOptimizedFunction(xxx)

it calls the right one directly

Share this post


Link to post
Share on other sites
mattnewport    1038
The compiler might be smart enough to optimize this anyway but probably not: you don't need to check if the value is < -clamp if the first check passed. Try this instead, should be a bit faster:

inline float clamp(float x, float a, float b)
{
return x < a ? a : (x > b ? b : x);
}

// in your loop
const float clamp_at = 0.2f;
const float value = clamp(somearray[i], -clamp_at, clamp_at);

Share this post


Link to post
Share on other sites
AIDev    190
You can make a minor improvement by making the second 'if' an 'else if'. You might gain additionally if you make the 0.2 a "const float" if you don't intend on modifying it ... but only if your compiler is exceptionally talented at optimizations! Also, declare the clamp variable outside the loop, and if you want to be very picky also declare a neg_clamp_at value.

Honestly, this is not code that needs to be optimized. It takes virtually no time for the processor to execute, even if it is in a critical loop.

Regarding the previous post, using the trinary operator probably is not faster since a value is assigned EVERY time; here, an assignment only occurs if the clamp is needed (which is likely not often?).


const float clamp_at = 0.2;
const float neg_clamp_at = -clamp_at;

for (int i = 0; i < 512; ++i)
{
float value = somearray[i];

if (value > clamp_at)
{
value = clamp_at;
}
else if (value < neg_clamp_at)
{
value = neg_clamp_at;
}

//... other code here ...
}

Share this post


Link to post
Share on other sites
venzon    256
Hey all, thanks for the quick suggestions. This bit of code is used several times in a critical area. It sounds like my original guess wasn't missing something really huge and obvious. Nevertheless, the implementation AIDev recommended in his last post took my benchmark from 247 ms avg to 203 ms avg (lower is better). With optimizations turned on, it does 120 ms avg. Thanks again!

Share this post


Link to post
Share on other sites
Spoonbender    1258
I'd go with mattnewport's suggestion or something like it.
Conditional assignments (? : operator) will probably perform better here. (Since branches hinder the compiler and CPU pipeline a lot in the ability to reorder and schedule instructions)

Share this post


Link to post
Share on other sites
mattnewport    1038
Quote:
Original post by AIDev
Regarding the previous post, using the trinary operator probably is not faster since a value is assigned EVERY time; here, an assignment only occurs if the clamp is needed (which is likely not often?).

The expensive parts of the code on a modern processor will be fetching the value from the array (which may result in a cache miss) and the conditionals (which will hinder optimal instruction scheduling). An additional assignment is cheap by comparison - it's probably just a register move at worst and would likely be optimized out here anyway. If the target processor supports an fsel type instruction (like PowerPC) then the branch here could be avoided completely and the compilers I've worked with that can make that type of optimization are more likely to do it with a conditional than with an if statement.

To optimize this code significantly though you'd want to know more context for the loop - there's only so much you can do to optimize a clamp operation in isolation but often there's a smarter way to optimize at a higher level.

Share this post


Link to post
Share on other sites
ToohrVyk    1595
My suggestion is:

const float sup = 0.2;
const float inf = -0.2;
float values[512];

for (int i = 0; i < 512; ++i)
{
values[i] = std::min(sup, std::max(inf, somearray[i]));
}

for (int i = 0; i < 512; ++i)
{
const float & value = values(i];
//... other code here ...
}


Hopefully, the min and max will contain compiler intrinsics for the relevant platform, resulting in very few operations. The loop being small, the compiler might decide to unfold it so that any pipeline latency issues disappear, and perhaps even use SSE all by itself.

Depending on what the other code is, you may also benefit from keeping everything in a single loop.

Share this post


Link to post
Share on other sites
iMalc    2466
We might be able to speed it up more if we could see what you're doing with 'value'. Afterall, if this is an inner loop, then quite possibly all the code in this loop has the same potential for speed gains.

Share this post


Link to post
Share on other sites
Jan Wassenberg    999
std::min/max looks good. On x86, what you want generated is FCOMI/FCMOV (FPU) or PFMIN (3dNow) or MINSS (SSE). I think more important than the ternary or if question is whether you have enabled SSE instructions in the code generation options. The Intel compiler is probably better at this as well.

Oh, just thought of another eevil trick. You can mask off the sign bit, compare magnitude via integer instructions, use carry to select the clamp_at or current value (<- hard, but possible to get a C compiler to generate this kind of code), then or-in the sign bit again.
However, this kind of dirty business is probably no longer worthwhile since the advent of FCOMI/FCMOV and the multimedia min instructions.

Share this post


Link to post
Share on other sites
Kylotan    9882
I wonder if splitting it into 2 loops would help: one doing the min clamping and one doing the max clamping. Twice as much loop overhead, some redundant checks made, but half the conditionals in each iteration.

I also wonder if it's implementing the sequential indexing as address increments, or if there's some quirk of C++ that requires it to not do that in case i changes somehow. I expect it handles it just fine, but I always find it interesting to read the assembly.

Share this post


Link to post
Share on other sites
mattnewport    1038
Quote:
Original post by Kylotan
I wonder if splitting it into 2 loops would help: one doing the min clamping and one doing the max clamping. Twice as much loop overhead, some redundant checks made, but half the conditionals in each iteration.


That probably depends on the data set - if it's small enough to fit comfortably in the L1 cache it may be faster as two loops. If it doesn't fit the L1 but fits the L2 then I would expect two loops to be a bit slower. If it doesn't fit in L2 then two loops would be a lot slower. It comes back to needing to know more about the context to offer any better optimization advice I think.

Share this post


Link to post
Share on other sites
ZQJ    496
Well, I'm not sure exactly how difficult this would be in your situation, but what you really want is to get your compiler to generate MINPS and MAXPS instructions. That will allow you to do four elements of the array at a time. For GCC look at X86 builtins in the manual for how to get this (and take a look in pmmintrin.h).

Edit: didn't read that part about portability carefully enough. If you don't want to sacrifice any portability, I think your options are a bit thin beyond what you have. You can always have a backup version that gets compiled if any of the optimized versions aren't good enough.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this