# shr faster than cmp ?

This topic is 4199 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi,
I want to compare two float values and use the boolean result to index an array. Which do you think is faster?

array[ float1 < float2 ]++;

or

float b = float2 - float1;
array[(*(unsigned int*)&b >> 31)]++;

Thanks for your opinions
Quak

##### Share on other sites
array[ float1 < float2 ]++;

Even if it were ten times slower you should use it, since it shows what you intend to do. Also, if the other one is faster, the compiler will most likely do that for you anyway.

##### Share on other sites
Everything CTar said, plus:

- the other way isn't even guaranteed to work: 'int' doesn't have to be 32 bits, even if 'float' does.

- Why the hell do you want to index using the result of a comparison?

##### Share on other sites
Quote:
 Why the hell do you want to index using the result of a comparison?

heh. Common technique to avoid an if()/else.

The SHR code is fairly slow because data has to be moved to/from FPU reg.
Evil trick: you can directly compare the memory underlying the floats as uint32_t; this avoids the dog-slow floating-point compare/set flags. It's even reliable, assuming comparands are of same sign.

Otherwise, I doubt the compiler is going to manage to emit FCOMI+SBB, so you may be best off writing it in asm if it's truly time-critical.

##### Share on other sites
Quote:
 Original post by Quak
Hi,
I want to compare two float values and use the boolean result to index an array. Which do you think is faster?
array[ float1 < float2 ]++;
or
float b = float2 - float1;
array[(*(unsigned int*)&b >> 31)]++;
Thanks for your opinions
Quak

Ouch, that hurts. Why are you not using 64-bit assembly for that? Float : square circles. Those were some opinions.

if(a < b){
plus++;
} else {
minus++;
}

Is there any reason to use floating point? Even fixed-point arithmetic could allow better chances to optimize it.

##### Share on other sites
Quote:
 Ouch that hurts

Yes, that is fitting. Actually, WTF is more accurate. The two things that are at least comprehensible are flat-out wrong:
1)
Quote:
 Even fixed-point arithmetic could allow better chances to optimize it.

Incorrect on anything except microcontrollers or equivalent.

2)
Quote:
 if(a < b){ plus++; }

Congratulations, that is precisely what needs to be avoided. Conditional branches are murder on everything except abovementioned µCs.

##### Share on other sites
Finish your project, then profile and see where the bottleneck is. Premature optimization will make the code look confusing and may lead to more bugs. My view is that it's important to get it right first, and then make it fast.

##### Share on other sites
Quote:
 Finish your project, then profile and see where the bottleneck is. Premature optimization will make the code look confusing and may lead to more bugs. My view is that it's important to get it right first, and then make it fast.

Waah! Aren't we talking about HOW to make it fast? Why do you assume OP is an idiot and foolishly wasting his time with premature optimizations?

Signal:Noise Ratio too low. /me goes elsewhere.

##### Share on other sites
Your first solution should certainly be quicker than the alternative you provided.

In the alternate version there are a few things which should not be done when optimizing. A shift by 31 is slow on some CPUs. Reading the float as an int will likely force the value out to memory and back into an integer register (to do the shift), a delay that should be avoided. So this version has a dependency chain of subtraction, FP-to-int move, then shift, which is much worse than what the first version should generate.

If you really need the performance, make sure the compiler is generating code that avoids branches. On x86 the worst code you should settle for is CMOV/SETcc setting your index to 0 or 1.

##### Share on other sites
Thanks for your suggestions. I measured the time for both options and found that they are roughly the same.
By the way, the assembly code generated for the cmp version looks like this:

0042541B fld dword ptr [float1]
00425421 fcomp dword ptr [float2]
00425427 fnstsw ax
00425429 test ah,41h
0042542C jne 0042543A
0042542E mov dword ptr [ebp-5CCh],1
00425438 jmp 00425444
0042543A mov dword ptr [ebp-5CCh],0

Quote:
 Original post by Jan Wassenberg
The SHR code is fairly slow because data has to be moved to/from FPU reg. Evil trick: you can directly compare the memory underlying the floats as uint32_t; this avoids the dog-slow floating-point compare/set flags. It's even reliable, assuming comparands are of same sign. Otherwise, I doubt the compiler is going to manage to emit FCOMI+SBB, so you may be best off writing it in asm if it's truly time-critical.

Could you explain the "evil trick" to me in more detail?
Both comparands always have a 0 sign bit.

Thanks
Quak

##### Share on other sites
Ah, some signal :D
Quote:
 Your first solution should certainly be quicker than the alternative you provided.

In an ideal world maybe, but not on x86.

Quote:
 So this version has a dependency chain of subtraction, FP to INT, shift. Much worse than other version should generate.

True, but the other (first) code is also bad. As can be seen in Quak's asm listing, it's doing the FP compare (which is slow due to FNSTSW) and then a conditional jump, which is exactly what the code is trying to avoid.
Sadly this "modern compiler" doesn't have enough mojo to use SBB to construct a mask, thus avoiding the JNE/JMP.

Quote:
 Could you explain the "evil trick" more detailed to me ? Both comparands always have a 0 sign bit.

There's nothing to it, really :) Since the sign bit is 0 and assuming no NaN etc., the IEEE bit representation guarantees *(u32*)&float1 < *(u32*)&float2 if and only if: float1 < float2. That means just compare the floats as if they were 32-bit unsigned ints.

##### Share on other sites
More noise: while I'd expect pretty much every compiler ever written to use true=1, I vaguely remember that the standard only says "non-zero", so true COULD also end up being MAX_UNSIGNED or whatever (I'd probably rule out collecting multiple bools in a single byte behind your back). But unless it's supposed to work with every single exotic compiler... who cares.

##### Share on other sites
Quote:
 Original post by Trienco
More noise: while I'd expect pretty much every compiler ever written to use true=1, I vaguely remember that the standard only says "non-zero", so true COULD also end up being MAX_UNSIGNED or whatever (I'd probably rule out collecting multiple bools in a single byte behind your back). But unless it's supposed to work with every single exotic compiler... who cares.

I can't say whether his code is C or C++, but the C++ standard promises that a conversion from bool to an integral type yields 0 for false and 1 for true. I'm not sure about C (I have never used pure C).
Quote:
 If the destination type is bool, see 4.12. If the source type is bool, the value false is converted to zero and the value true is converted to one.

##### Share on other sites
Quote:
 Original post by CTar
I can't say whether his code is C or C++, but the C++ standard promises that a conversion from bool to an integral type yields 0 for false and 1 for true. I'm not sure about C (I have never used pure C).

I guess there are too many Cs out there. After some googling, it seems that C99 has a _Bool keyword that also uses 0 and 1. "Old" C didn't have real bools anyway.

Quote:
 If the destination type is bool, see 4.12. If the source type is bool, the value false is converted to zero and the value true is converted to one.

Need to keep that around for the next time somebody complains about my code relying on this (I must have made the mistake of not blindly believing it after four people said "wrong" and one said "don't know"). So I guess the only danger would be doing something that doesn't include at least an implicit cast.

##### Share on other sites
Quote:
 Original post by Jan Wassenberg
[...]Why do you assume OP is an idiot and foolishly wasting his time with premature optimizations?[...]
With odd questions[1] like this one, that is often the case.

As far as how to actually improve it - explain what operation you're performing. Context-based optimizations often allow meaningful speed improvements, while context-free situations such as the one stated in the OP rarely do.

[1] It's odd because it demonstrates a knowledge of assembly but apparently not of where to look up timing information or of profiling. If the OP knew how to profile, he/she could have profiled the two different methods in under 5 minutes, would have had an answer right then (as he/she later did, apparently), and could then have asked us how to perform the same abstract operation faster instead of asking which of the options he/she provided is faster. Of course, perhaps the OP is advanced enough to know different CPUs have different timings, but in that case it would be better to make a downloadable benchmark and get many people to run it and report their results.

##### Share on other sites
Quote:
Original post by Trienco
Quote:
 Original post by CTar
If the destination type is bool, see 4.12. If the source type is bool, the value false is converted to zero and the value true is converted to one.

Need to keep that around for the next time somebody is complaining about my code relying on this (I must have made the mistake of blindly believing it after 4 people said "wrong" and one said "don't know"). So I guess the only danger would be doing something that doesn't include at least an implicit cast.

They might have been thinking of the fact that a bool only stores the values "true" and "false", and that their representations are implementation-defined. But during promotion, the values are defined as CTar said.

##### Share on other sites
Quote:
 It's odd because it demonstrates a knowledge of assembly but apparently not of where to look up timing information or of profiling.

You have a point, but in this case, it's not that clear. The main slowdown is architecture-dependent (-> you can measure on your CPU, and find that others perform quite differently) and not really tied to specific instruction latencies as can be looked up in the manual.

Quote:
 Of course, perhaps the OP is advanced enough to know different CPUs have different timings, but in that case it would be better to make a downloadable benchmark and get many people to run it and report their results.

Measuring is important too, but I at least would want to understand why something is slower. (otherwise - measurement or not - you are practicing cargo-cult programming :) )

##### Share on other sites
Quote:
 Original post by Extrarius
As far as how to actually improve it - explain what operation you're performing. Context-based optimizations often allow meaningful speed improvements, while context-free situations such as the one stated in the OP rarely do.

I think my question "how to index an array with the result of a float comparison with maximum speed" has enough context to answer it properly.
Anyway, could you tell me where I can find timing information on assembly instructions? It would be really helpful...

Thanks,
Quak

##### Share on other sites
Quote:
Original post by Quak
Quote:
 Original post by Extrarius
As far as how to actually improve it - explain what operation you're performing. Context-based optimizations often allow meaningful speed improvements, while context-free situations such as the one stated in the OP rarely do.

I think my question "how to index an array with the result of a float comparison with maximum speed" has enough context to answer it properly.

For example, do we need the array, or would it be ok to assume the two values are held in separate variables? Are there more than these two values in the array?

Basically, what is it you want to *do*, without locking into a specific implementation?

As I understand it, you've got two values (currently array[0] and array[1]). Depending on a comparison between two floats (currently float1 and float2) (do you pull these from an array? Pointers? Are they in variables already?), you want to increment one of the two values. And that's it? (Or should the result then be written back to an array or memory location? Or can it simply be held in a variable?)
Is this performed in a loop? If so, what else is in the loop body?

Just a few questions that occur to me. :)
We need context to offer any useful optimization advice.

Quote:
 Anyway could you tell me where I can find timing information on assembly instructions? It would be really helpful...

developer.amd.com
Intel has similar documents on their website.

##### Share on other sites
Quote:
 Original post by Trienco
More noise: while I'd expect pretty much every compiler ever written to use true=1

True being 1 could be considered poor judgement on somebody's part... odd that you would expect every compiler to implement it that way, when clearly there are many, many compilers (admittedly not too many are C) which implement truthfulness as:

False = 0
True = Not False

..which, if booleans are stored as signed two's-complement integers, equates to:

True = -1

Where equality against truth is always a comparison against false, and true comparison ops always return Not False, such that (2 > 1) would be -1 on Intel.

The beauty of this method is not well understood by most C programmers... but that is to be expected :)

##### Share on other sites
Quote:
 Original post by Quak
0042541B fld dword ptr [float1]
00425421 fcomp dword ptr [float2]
00425427 fnstsw ax
00425429 test ah,41h
0042542C jne 0042543A
0042542E mov dword ptr [ebp-5CCh],1
00425438 jmp 00425444
0042543A mov dword ptr [ebp-5CCh],0

This code doesn't achieve what you wanted (code without branches). There is one unconditional jump and one conditional one. Maybe try this: if (a < b) { x++; } else { y++; } and see how it looks in assembly. I don't think there will be more branches than you already have. So why write ugly code, when at the low level they are about the same?

##### Share on other sites
Quote:
 Original post by bubu LV
This code doesn't achieve what you wanted (code without branches). There is one unconditional jump and one conditional one. Maybe try this: if (a < b) { x++; } else { y++; } and see how it looks in assembly. I don't think there will be more branches than you already have. So why write ugly code, when at the low level they are about the same?

Even better, try with the ?: operator instead of if statements. This would typically compile to conditional move operations instead of actual branches.

##### Share on other sites
Maybe you can tune your compiler settings:
void foo(float f1, float f2, int* array)
{
	array[ (f1 < f2) & 1 ]++;
}

compiles with
gcc -O2 -pipe -march=k8 -mfpmath=sse -fomit-frame-pointer -c fast_cmp.cc

down to
00000000 <foo(float, float, int*)>:
   0:   f3 0f 10 44 24 08       movss  0x8(%esp),%xmm0
   6:   8b 44 24 0c             mov    0xc(%esp),%eax
   a:   0f 2e 44 24 04          ucomiss 0x4(%esp),%xmm0
   f:   8d 50 04                lea    0x4(%eax),%edx
  12:   0f 46 d0                cmovbe %eax,%edx
  15:   ff 02                   incl   (%edx)
  17:   c3                      ret

using gcc 3.4.4:
C:\TEMP>gcc -v
Reading specs from c:/MinGW/bin/../lib/gcc/mingw32/3.4.4/specs
Configured with: ../gcc/configure --with-gcc --with-gnu-ld --with-gnu-as --host=mingw32 --target=mingw32 --prefix=/mingw --enable-threads --disable-nls --enable-languages=c,c++,f77,ada,objc,java --disable-win32-registry --disable-shared --enable-sjlj-exceptions --enable-libgcj --disable-java-awt --without-x --enable-java-gc=boehm --disable-libgcj-debug --enable-interpreter --enable-hash-synchronization --enable-libstdcxx-debug
Thread model: win32
gcc version 3.4.4 (mingw special)

As you can see, there are no branches and even SSE instructions are used to perform the calculation.

Compiling the same program with:
gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc

gives
00000000 <foo(float, float, int*)>:
   0:   d9 44 24 08             flds   0x8(%esp)
   4:   8b 44 24 0c             mov    0xc(%esp),%eax
   8:   d9 44 24 04             flds   0x4(%esp)
   c:   d9 c9                   fxch   %st(1)
   e:   8d 50 04                lea    0x4(%eax),%edx
  11:   df e9                   fucomip %st(1),%st
  13:   dd d8                   fstp   %st(0)
  15:   0f 46 d0                cmovbe %eax,%edx
  18:   ff 02                   incl   (%edx)
  1a:   c3                      ret

Again no branch is generated, just a conditional move.

Doing "evil tricks", as Jan suggested, would be something like this:
void foo(float f1, float f2, int* array)
{
	float f = f1 - f2;
	array[ ((*(int*)&f) >> 31) & 1 ]++;
}

Compiled with
gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc

results in
00000000 <foo(float, float, int*)>:
   0:   83 ec 04                sub    $0x4,%esp
   3:   8b 54 24 10             mov    0x10(%esp),%edx
   7:   d9 44 24 0c             flds   0xc(%esp)
   b:   d8 6c 24 08             fsubrs 0x8(%esp)
   f:   d9 1c 24                fstps  (%esp)
  12:   8b 04 24                mov    (%esp),%eax
  15:   c1 e8 1f                shr    $0x1f,%eax
  18:   ff 04 82                incl   (%edx,%eax,4)
  1b:   58                      pop    %eax
  1c:   c3                      ret

Note that the instruction fstps (%esp) stores the float into memory. This will slow things down.

My two cents.

##### Share on other sites
On most PPC CPUs you could use the fsel instruction, with operands whose binary representation equals an integer 0/1 of the same size. If you can find out how that instruction is emulated on x86, that might be a hint to the fastest solution.

##### Share on other sites
Quote:
 Original post by nmi
As you can see, there are no branches and even SSE instructions are used to perform the calculation.

Scalar SSE instructions, which means no big difference.
Quote:
 Note that the instruction fstps (%esp) stores the float into memory. This will slow things down.

True, unless the floats are read from memory immediately before this code, in which case they could just be loaded as ints rather than floats. But we'd need to see more of the code to know if that's possible.
