shr faster than cmp ?

Started by
42 comments, last by GameDev.net 17 years, 10 months ago
Quote:Original post by Quak
0042541B fld dword ptr [float1]
00425421 fcomp dword ptr [float2]
00425427 fnstsw ax
00425429 test ah,41h
0042542C jne 0042543A
0042542E mov dword ptr [ebp-5CCh],1
00425438 jmp 00425444
0042543A mov dword ptr [ebp-5CCh],0


This code doesn't achieve what you wanted (code without branches). There is one unconditional jump and one conditional. Maybe try this: if (a < b) { x++; } else { y++; } and see how it looks in assembly. I don't think there will be more branches than you already have. So why write ugly code, when at the low level they are about the same?
Quote:Original post by bubu LV
This code doesn't achieve what you wanted (code without branches). There is one unconditional jump and one conditional. Maybe try this: if (a < b) { x++; } else { y++; } and see how it looks in assembly. I don't think there will be more branches than you already have. So why write ugly code, when at the low level they are about the same?


Even better, try with the ?: operator instead of if statements. This would typically compile to conditional move operations instead of actual branches.
Maybe you can tune your compiler settings:
void foo(float f1, float f2, int* array)
{
	array[ (f1 < f2) & 1 ]++;
}

compiles with
gcc -O2 -pipe -march=k8 -mfpmath=sse -fomit-frame-pointer -c fast_cmp.cc

down to
00000000 <foo(float, float, int*)>:
   0:   f3 0f 10 44 24 08       movss  0x8(%esp),%xmm0
   6:   8b 44 24 0c             mov    0xc(%esp),%eax
   a:   0f 2e 44 24 04          ucomiss 0x4(%esp),%xmm0
   f:   8d 50 04                lea    0x4(%eax),%edx
  12:   0f 46 d0                cmovbe %eax,%edx
  15:   ff 02                   incl   (%edx)
  17:   c3                      ret

using gcc 3.4.4:
C:\TEMP>gcc -v
Reading specs from c:/MinGW/bin/../lib/gcc/mingw32/3.4.4/specs
Configured with: ../gcc/configure --with-gcc --with-gnu-ld --with-gnu-as --host=mingw32 --target=mingw32 --prefix=/mingw --enable-threads --disable-nls --enable-languages=c,c++,f77,ada,objc,java --disable-win32-registry --disable-shared --enable-sjlj-exceptions --enable-libgcj --disable-java-awt --without-x --enable-java-gc=boehm --disable-libgcj-debug --enable-interpreter --enable-hash-synchronization --enable-libstdcxx-debug
Thread model: win32
gcc version 3.4.4 (mingw special)

As you can see, there are no branches and even SSE instructions are used to perform the calculation.

Compiling the same program with:
gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc

gives
00000000 <foo(float, float, int*)>:
   0:   d9 44 24 08             flds   0x8(%esp)
   4:   8b 44 24 0c             mov    0xc(%esp),%eax
   8:   d9 44 24 04             flds   0x4(%esp)
   c:   d9 c9                   fxch   %st(1)
   e:   8d 50 04                lea    0x4(%eax),%edx
  11:   df e9                   fucomip %st(1),%st
  13:   dd d8                   fstp   %st(0)
  15:   0f 46 d0                cmovbe %eax,%edx
  18:   ff 02                   incl   (%edx)
  1a:   c3                      ret

Again no branch is generated, just a conditional move.



Doing "evil tricks", as Jan suggested, would be something like this:
void foo(float f1, float f2, int* array)
{
	float f = f1 - f2;
	array[ ((*(int*)&f) >> 31) & 1 ]++;
}


Compiled with
gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc

results in
00000000 <foo(float, float, int*)>:
   0:   83 ec 04                sub    $0x4,%esp
   3:   8b 54 24 10             mov    0x10(%esp),%edx
   7:   d9 44 24 0c             flds   0xc(%esp)
   b:   d8 6c 24 08             fsubrs 0x8(%esp)
   f:   d9 1c 24                fstps  (%esp)
  12:   8b 04 24                mov    (%esp),%eax
  15:   c1 e8 1f                shr    $0x1f,%eax
  18:   ff 04 82                incl   (%edx,%eax,4)
  1b:   58                      pop    %eax
  1c:   c3                      ret

Note that the instruction fstps (%esp) stores the float into memory. This will slow things down.
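As an aside, the pointer cast in the snippet above also breaks C's strict-aliasing rules. A minimal sketch of the same sign-bit trick using memcpy (which compilers compile down to a plain load) might look like this; note it can disagree with (f1 < f2) when the subtraction produces NaN, -0.0, or underflows:

```c
#include <stdint.h>
#include <string.h>

/* Same sign-bit trick, but via memcpy instead of a pointer cast,
 * so it does not violate strict aliasing. The sign bit of f1 - f2
 * selects array[1] (f1 < f2) or array[0] (f1 >= f2). */
static void foo2(float f1, float f2, int *array)
{
    float f = f1 - f2;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* compiles to a plain load */
    array[bits >> 31]++;            /* sign bit is the index */
}
```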



My two cents.
On most PPCs you could use the fsel instruction, with operands binary-equal to integers 0/1 of the same size. If you can find out how that instruction is emulated on x86, that might be a hint toward the fastest solution.
Quote:Original post by nmi
As you can see, there are no branches and even SSE instructions are used to perform the calculation.

Scalar SSE instructions, which means no big difference.
Quote:
Note that the instruction fstps (%esp) stores the float into memory. This will slow things down.

True, unless the floats are read from memory immediately before this code, in which case they could just be loaded as ints rather than floats. But we'd need to see more of the code to know if that's possible.
Quote:False = 0
True = Not False
..which, if booleans are stored as signed two's complement integers, equates to:
True = -1

Not necessarily. A *logical* Not False could just as well be 12; you don't have to insist on bitwise not. (This is the convention adopted in the WinAPI - return values are usually 0 for failure or a nonzero DWORD on success.)

Quote:The beauty of this method is not well understood by most C programmers... but that is to be expected :)

VB coders ragging on us C guys! ROFL :)
More seriously, maybe you could take a shot at explaining this? After all, I see only disadvantages. The WinAPI convention allows better code optimization by letting you return any (non-zero) intermediate value already in registers, rather than insisting on ~0.

nmi: oh good, FUCOMI is much better than FCOMP. UCOMISS is OK, too (requiring SSE isn't all that bad, but it does require moving data to XMM registers). But why is GCC doing CMOV? Simple flag-based tricks such as SBB are apparently too much for today's "modern compilers".


Quote:Doing "evil tricks", as Jan suggested, would be something like this:
Nooo!
The evil trick was comparing floats as uint32_t, not subtracting and extracting the sign bit.
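That trick works because, for non-negative IEEE-754 floats (and no NaNs), the raw bit patterns sort in the same order as the values, so an integer compare gives the same answer as a float compare. A hedged sketch of that property (function names are mine, not from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an unsigned integer. */
static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Integer compare of the bit patterns: matches (a < b)
 * only when both floats are non-negative and not NaN. */
static int less_as_uint(float a, float b)
{
    return bits_of(a) < bits_of(b);
}
```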
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Just to clarify, this is basically the code I want to speed up:

float f1, f2;
unsigned int i1, i2, i3;
//....
if( f1 < f2 )
i1 += i3;
else
i2 += i3;


I tried Jan's suggestion and this is what I get:

array[*(uint*)&float1 > *(uint*)&float2] += buffer[1];

004023DC mov ebx,dword ptr [esp+44h]
004023E0 cmp ebx,dword ptr [esp+28h]
004023E4 sbb ebx,ebx
004023E6 neg ebx
004023E8 add dword ptr [esp+ebx*4+38h],eax
004023EC lea ebx,[esp+ebx*4+38h]

Looks much better, doesn't it ?
Indeed :)
And I'm seeing SBB, which is nice. The NEG could be removed by flipping the meaning of i1/i2 and adding 1 to array index. You could 'encourage' the compiler to do this via idx = (i1 < i2)? ~0 : 0; array[idx+1] += i3.
hm, mighty strange that the compiler is using the long effective address and only then putting it into a register. That looks pretty crappy.
Quote:Original post by Jan Wassenberg

VB coders ragging on us C guys! ROFL :)
More seriously, maybe you could take a shot at explaining this? After all, I see only disadvantages. The WinAPI convention allows better code optimization by letting you return any (non-zero) intermediate value already in registers, rather than insisting on ~0.


The WinAPI convention isn't about booleans. Of course, in any language that implements integer booleans fluently, API return codes look just like booleans of the return size.

I'm not sure what you mean by disadvantages.. a compiler can handily translate equivalence with True as non-equivalence with False with no performance loss. Most everything a compiler does in expressions under the hood is post-operation flag-based, and in that sense neither methodology has any advantage. Either the zero flag is set, or it isn't.

VB6 and earlier did use True as an integer = -1 .. it was almost fluent, but it did not consider equivalence with True as non-equivalence with False. Basically, it *enforced* integer booleans, which is not what a fluent definition does.

If the data type is precisely 1 bit, then true = not false = 1

..everyone can agree there!

But tell me, which compiler implements booleans as single bits? Processors don't work on individual bits. The definition that True always = 1 is arbitrary and offers no actual advantage over any other arbitrary value. True = Not False, on the other hand, does offer some advantages in masking and so forth.. especially important on processors without conditional move instructions.
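The masking advantage mentioned above can be sketched like this - a branchless select that relies on the condition being an all-ones (-1) or all-zeros mask, the kind of trick True = -1 makes convenient on CPUs without conditional moves (function name is mine, for illustration):

```c
#include <stdint.h>

/* Branchless select: returns a if cond is nonzero, else b.
 * Works by expanding cond into an all-ones or all-zeros mask. */
static int32_t select_mask(int cond, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(cond != 0); /* -1 if cond, else 0 */
    return (a & mask) | (b & ~mask);      /* pick a or b by masking */
}
```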
Perhaps this should be another thread, but I'm actually curious:

Why not just do the assembly yourself, by hand, instead of jumping through hoops trying to coax good assembly out of the compiler? I don't see this gaining significant portability (won't you have to jump through different hoops for different compilers?), and readability is long forgotten. High level optimizations were ruled out from the beginning (e.g. trying to find a solution that doesn't even need to do this), and "array[float1<float2]++;" is about as far as you can go at the C level. To me, it looks like y'all're trying to perform surgery with oven mitts on your hands.

