shr faster than cmp ?

Started by
42 comments, last by GameDev.net 17 years, 10 months ago
Hi,
I want to compare two float values and use the boolean result to index an array.
What do you think is faster?

array[ float1 < float2 ]++;

or

float b = float2 - float1;
array[(*(unsigned int*)&b >> 31)]++;

Thanks for your opinions
Quak
array[ float1 < float2 ]++;

Even if it were ten times slower you should use it, since it shows what you intend to do. Also, if the other one is faster, the compiler will most likely do that for you.
Everything CTar said, plus:

- the other way isn't even guaranteed to work: 'int' doesn't have to be 32 bits, even if 'float' does.

- Why the hell do you want to index using the result of a comparison?
Quote:Why the hell do you want to index using the result of a comparison?

heh. Common technique to avoid an if()/else.
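
For readers unfamiliar with the technique, here is a minimal sketch in C. The helper name tally_less and its parameters are made up for illustration and are not from the original post. It relies only on the fact that a relational expression in C has type int and value 0 or 1. Whether the compiler actually emits branch-free code for it is not guaranteed and has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: tallies how many pairs satisfy a[i] < b[i].
   counts[1] collects "less"; counts[0] collects everything else
   (greater, equal, or unordered because of a NaN). */
static void tally_less(const float *a, const float *b, size_t n,
                       unsigned counts[2])
{
    assert(a != NULL && b != NULL && counts != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* The comparison yields the int value 0 or 1, so it can be
           used directly as the index instead of an if()/else. */
        counts[a[i] < b[i]]++;
    }
}
```

The source contains no conditional statement, which at least gives the compiler the chance to use something like SETcc rather than a jump.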

The SHR code is fairly slow because data has to be moved to/from FPU reg.
Evil trick: you can directly compare the memory underlying the floats as uint32_t; this avoids the dog-slow floating-point compare/set flags. It's even reliable, assuming comparands are of same sign.

Otherwise, I doubt the compiler is going to manage to emit FCOMI+SBB, so you may be best off writing it in asm if it's truly time-critical.
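
A rough sketch of the integer-compare trick described above, under stated assumptions: float is a 32-bit IEEE 754 value, neither input is NaN, and both inputs are non-negative. The function names float_bits and less_nonneg are invented for this example. Note that for two negative values the integer order of the bit patterns is reversed, so "same sign" alone is not quite enough; the comparison would have to be flipped in that case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);  /* assumption: float is 32 bits wide */
    memcpy(&u, &f, sizeof u);      /* well-defined, unlike a pointer cast */
    return u;
}

/* Returns 1 if a < b, else 0.
   Only valid when both inputs are non-negative and neither is NaN. */
static int less_nonneg(float a, float b)
{
    /* For non-negative IEEE 754 floats, a larger value always has a
       larger bit pattern when read as an unsigned integer. */
    return float_bits(a) < float_bits(b);
}
```

If the floats already live in memory, the integer loads can read them directly, which is the point of the trick: no floating-point compare is involved at all.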
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Quote:Original post by Quak
Hi,
I want to compare two float values and use the boolean result to index an array.
What do you think is faster ?

array[ float1 < float2 ]++;

or

float b = float2 - float1;
array[(*(unsigned int*)&b >> 31)]++;

Thanks for your opinions
Quak


Ouch that hurts, why are you not using 64-bit assembly for that? Float : square circles. Those were some opinions.



if(a < b){
    plus++;
} else {
    minus++;
}

Is there any reason to use floating point? Even fixed-point arithmetic could allow better chances to optimize it.

Quote:Ouch that hurts

Yes, that is fitting. Actually WTF is more accurate. The 2 things that are at least comprehensible are flat-out wrong:
1)
Quote:Even fixed-point arithmetic could allow better chances to optimize it.

Incorrect on anything except microcontrollers or equivalent.

2)
Quote:if(a < b){ plus++; }

Congratulations, that is precisely what needs to be avoided. Conditional branches are murder on everything except abovementioned µCs.
Finish your project, then profile and see where the bottleneck is. Premature optimization will make the code look confusing, and may lead to more bugs. My view is that it's important to get it right, and then make it fast.
Quote:Finish your project, then profile and see where the bottleneck is. Premature optimization will make the code look confusing, and may lead to more bugs. My view is that it's important to get it right, and then make it fast.

Waah! Aren't we talking about HOW to make it fast? Why do you assume OP is an idiot and foolishly wasting his time with premature optimizations?

Signal:Noise Ratio too low. /me goes elsewhere.
Your first solution should certainly be quicker than the alternative you provided.

In the alternate version, there are a few things which should not be done for optimization. Shift by 31 is slow on some CPUs. Reading the FP value as an INT will likely force it to be stored to memory (cache) and read back into an integer register to do the shift, a delay that should be avoided. So this version has a dependency chain of subtraction, FP to INT, shift. That is much worse than what the other version should generate.
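
To make the three steps of that dependency chain visible, here is a sketch of the shift variant written with a fixed-width type and memcpy instead of the pointer cast. The function name sign_of_difference is made up for this example, and a 32-bit IEEE 754 float is assumed. Also worth noting: as posted, the shift variant yields 1 when float2 < float1, which is the opposite slot from array[ float1 < float2 ], so the two versions are not interchangeable without swapping the operands.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the sign bit (0 or 1) of float2 - float1. */
static unsigned sign_of_difference(float float1, float float2)
{
    float b = float2 - float1;       /* step 1: FP subtraction */
    uint32_t bits;
    assert(sizeof b == sizeof bits); /* assumption: float is 32 bits wide */
    memcpy(&bits, &b, sizeof bits);  /* step 2: FP value read as an integer */
    return bits >> 31;               /* step 3: shift the sign bit down */
}
```

Each step needs the result of the previous one, so nothing here can overlap, whereas the plain comparison is a single compare followed by setting the index.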

If you really need the performance, make sure the compiler is generating code that avoids branches. On x86 the worst code you should settle for would be using CMOV/SETcc to set your index to 0 or 1.
Thanks for your suggestions. I measured the time for both options and found out that they are roughly the same.
The assembly code generated by the cmp looks like this, btw:

0042541B fld dword ptr [float1]
00425421 fcomp dword ptr [float2]
00425427 fnstsw ax
00425429 test ah,41h
0042542C jne 0042543A
0042542E mov dword ptr [ebp-5CCh],1
00425438 jmp 00425444
0042543A mov dword ptr [ebp-5CCh],0

Quote:Original post by Jan Wassenberg
The SHR code is fairly slow because data has to be moved to/from FPU reg.
Evil trick: you can directly compare the memory underlying the floats as uint32_t; this avoids the dog-slow floating-point compare/set flags. It's even reliable, assuming comparands are of same sign.

Otherwise, I doubt the compiler is going to manage to emit FCOMI+SBB, so you may be best off writing it in asm if it's truly time-critical.


Could you explain the "evil trick" to me in more detail?
Both comparands always have a 0 sign bit.

Thanks
Quak

This topic is closed to new replies.
