Clamping a floating-point number between 0 and 1
The shortest way to clamp a floating-point number, in terms of code, is max(0f, min(1f, x)), unless you have access to a specialized primitive. If profiling tells you that this operation is too slow, then you can try optimizations:
- Attempt to eliminate some cases (for instance, numbers which are always between 0 and 1, or numbers that are always greater than 0, and so on) to reduce the number of operations.
- Attempt to reduce the number of floating-point numbers you have to process.
- Attempt to use vectorial operations by clamping several numbers at the same time.
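As a rough illustration of the last point, here is a minimal sketch that clamps a whole array in place, four floats at a time, using the SSE `_mm_max_ps` and `_mm_min_ps` intrinsics. The function name `clamp01_array` and the `CLAMP01_HAS_SSE` macro are made up for this example, and the preprocessor check is only an assumption about how to detect SSE; on other targets the code falls back to the scalar max/min form.

```cpp
#include <algorithm>
#include <cstddef>

// Assumption: these predefined macros indicate that SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define CLAMP01_HAS_SSE 1
#endif

// Hypothetical helper: clamps data[0..count) to [0, 1] in place.
inline void clamp01_array(float* data, std::size_t count)
{
    std::size_t i = 0;
#ifdef CLAMP01_HAS_SSE
    const __m128 zero = _mm_setzero_ps();
    const __m128 one  = _mm_set1_ps(1.0f);
    // Process four floats per iteration; unaligned loads keep the sketch simple.
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_min_ps(_mm_max_ps(v, zero), one);
        _mm_storeu_ps(data + i, v);
    }
#endif
    // Scalar tail (or the whole array when SSE is not available).
    for (; i < count; ++i) {
        data[i] = std::max(0.0f, std::min(1.0f, data[i]));
    }
}
```

Whether this is actually faster than a plain loop depends on the compiler and the data, so it is worth profiling before and after.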
I use the following statement (in C++):
float clamped = value > 1.0f ? 1.0f : (value < 0.0f ? 0.0f : value);
It may be the case that max and min are actually implemented this way (I'm not sure) and thus the composition turns out to be something similar to what I posted...
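To make that guess concrete, here is a small sketch. In standard C++, `std::min(a, b)` returns `(b < a) ? b : a` and `std::max(a, b)` returns `(a < b) ? b : a`, so the min/max composition expands into two conditionals much like the ternary form. The function names are made up for the comparison. For ordinary inputs the three agree; for NaN they can differ, since the ternary form passes NaN through while the min/max form does not.

```cpp
#include <algorithm>

// The ternary form from the post above.
inline float clamp01_ternary(float value)
{
    return value > 1.0f ? 1.0f : (value < 0.0f ? 0.0f : value);
}

// The min/max composition.
inline float clamp01_minmax(float value)
{
    return std::max(0.0f, std::min(1.0f, value));
}

// The same composition with the comparisons written out by hand.
inline float clamp01_expanded(float value)
{
    const float upper = (value < 1.0f) ? value : 1.0f; // std::min(1.0f, value)
    return (0.0f < upper) ? upper : 0.0f;              // std::max(0.0f, upper)
}
```

Whether the compiler emits the same machine code for all three is a separate question that only a disassembly can answer.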
template <typename T>
T clamp(T min, T value, T max)
{
    assert(min <= max);
    return std::max(min, std::min(max, value));
}

float clamped = clamp(0.0f, value, 1.0f);
Is how I'd do it.
The std::min/max approach is so very nice and simple, and it's great to make a template out of it. However, VC2005 (release mode, default cflags) ends up generating the following:
004010F8 fldz
004010FA add esp,4
004010FD fst dword ptr [esp+2Ch]
00401101 lea ecx,[esp+24h]
00401105 fld1
00401107 fst dword ptr [esp+28h]
0040110B fcomp dword ptr [esp+24h]
0040110F fnstsw ax
00401111 test ah,41h
00401114 je main+3Ah (40111Ah)
00401116 lea ecx,[esp+28h]
0040111A fcomp dword ptr [ecx]
0040111C fnstsw ax
0040111E test ah,41h
00401121 jne main+47h (401127h)
00401123 lea ecx,[esp+2Ch]
00401127 fld dword ptr [ecx]
This is very slow and does not use the capabilities of newer architectures. With /arch:SSE2 it becomes:
00401908 fld dword ptr [esp+38h]
0040190C xorps xmm0,xmm0
0040190F fld1
00401911 movss xmm1,dword ptr [__real@3f800000 (402114h)]
00401919 add esp,4
0040191C fcomip st,st(1)
0040191E fstp st(0)
00401920 movss dword ptr [esp+3Ch],xmm0
00401926 movss dword ptr [esp+38h],xmm1
0040192C lea eax,[esp+34h]
00401930 ja main+46h (401936h)
00401932 lea eax,[esp+38h]
00401936 comiss xmm0,dword ptr [eax]
00401939 jbe main+4Fh (40193Fh)
0040193B lea eax,[esp+3Ch]
0040193F movss xmm0,dword ptr [eax]
It's now using FCOMI, which is a big win, but the SSE parts are absurd. This code is just laughable.
Now using
f = (f < 0.0f) ? 0.0f : f;
f = (f > 1.0f) ? 1.0f : f;
it generates:
00401908 fld dword ptr [esp+40h]
0040190C add esp,4
0040190F fldz
00401911 fcomip st,st(1)
00401913 fstp st(0)
00401915 jbe main+2Ch (40191Ch)
00401917 xorps xmm0,xmm0
0040191A jmp main+42h (401932h)
0040191C movss xmm0,dword ptr [esp+3Ch]
00401922 movss xmm1,dword ptr [__real@3f800000 (402114h)]
0040192A comiss xmm0,xmm1
0040192D jbe main+42h (401932h)
0040192F movaps xmm0,xmm1
This is not much better. The approach of cignox1 appears to generate the same code.
Since poorly predicted conditional branches are very expensive, it is worth going to considerable trouble to avoid them. If ToohrVyk's vectorization is not possible, then at least the SSE MINSS and MAXSS instructions should be used. VC2005 is apparently not studly enough to manage that by itself, so using the _mm_min_ss etc. intrinsics is advisable.
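For the scalar intrinsic route, a minimal sketch might look like the following. It uses `_mm_set_ss`, `_mm_max_ss`, `_mm_min_ss` and `_mm_cvtss_f32` from `<xmmintrin.h>`; the function name `clamp01_sse` and the `CLAMP01_HAS_SSE` macro are made up for the example, and the preprocessor check is only an assumption about how to detect SSE. A scalar fallback keeps it compiling on other targets.

```cpp
#include <algorithm>

// Assumption: these predefined macros indicate that SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define CLAMP01_HAS_SSE 1
#endif

// Hypothetical helper: clamps one float to [0, 1] without conditional branches
// in the source when SSE is available.
inline float clamp01_sse(float x)
{
#ifdef CLAMP01_HAS_SSE
    __m128 v = _mm_set_ss(x);              // x in the lowest lane
    v = _mm_max_ss(v, _mm_set_ss(0.0f));   // MAXSS: raise to at least 0
    v = _mm_min_ss(v, _mm_set_ss(1.0f));   // MINSS: lower to at most 1
    return _mm_cvtss_f32(v);               // extract the lowest lane
#else
    return std::max(0.0f, std::min(1.0f, x));
#endif
}
```

It is still worth checking the disassembly, since the compiler decides how the values get into and out of the XMM registers.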
If SSE is not available, then standard bit bashing applies. For the < 0 check, AND with a mask populated with the complement of the IEEE-754 sign bit (yes, this turns -0 into 0). For > 1, construct a second mask from the carry bit of the subtraction of 0x3F800000 from the float's representation; use it to select between the 1.0f constant and the previous result.
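The bit-level steps above can be sketched roughly as follows, assuming a 32-bit IEEE-754 `float`. The function name `clamp01_bits` is made up for the example, and `std::memcpy` is used to get at the representation. After the sign step the bits describe a non-negative float, so comparing them as unsigned integers orders them the same way as the floats, and the top bit of the subtraction acts as the borrow.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: branch-free clamp to [0, 1] on the float's bit pattern.
inline float clamp01_bits(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit IEEE-754 float");

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // Step 1 (x < 0): mask is all ones when the sign bit is clear, zero when set.
    // Negative inputs, including -0, become +0.
    const std::uint32_t nonNegative = (bits >> 31) - 1u;
    bits &= nonNegative;

    // Step 2 (x > 1): bits is now at most 0x7FFFFFFF, so the top bit of
    // (bits - 0x3F800000) is set exactly when bits < 0x3F800000, i.e. value < 1.
    const std::uint32_t one = 0x3F800000u;          // representation of 1.0f
    const std::uint32_t borrow = (bits - one) >> 31;
    const std::uint32_t belowOne = 0u - borrow;     // all ones if value < 1

    // Keep the value when it is below 1, otherwise select the 1.0f constant.
    bits = (bits & belowOne) | (one & ~belowOne);

    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Moving a float between floating-point and integer registers has its own cost on many CPUs, so this is only worth it if profiling says so.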
As a side note, on sm_10 graphics chipsets from NVIDIA, the min/max approach compiles simply to:
max.f32 $x, $x, 0.0f;
min.f32 $x, $x, 1.0f;
The comparison-based approach would quite possibly compile to something like this at best, and possibly something more complex:
setp.lt.f32 $p, $x, 0.0f;
@$p mov.f32 $x, 0.0f;
setp.gt.f32 $p, $x, 1.0f;
@$p mov.f32 $x, 1.0f;
I suspect that most shader languages compile to something similar, which makes the min/max-based approach the better choice there.