Maybe you can tune your compiler settings:
    void foo(float f1, float f2, int* array)
    {
        array[ (f1 < f2) & 1 ]++;
    }
compiles with
    gcc -O2 -pipe -march=k8 -mfpmath=sse -fomit-frame-pointer -c fast_cmp.cc
down to
    00000000 <foo(float, float, int*)>:
       0: f3 0f 10 44 24 08    movss   0x8(%esp),%xmm0
       6: 8b 44 24 0c          mov     0xc(%esp),%eax
       a: 0f 2e 44 24 04       ucomiss 0x4(%esp),%xmm0
       f: 8d 50 04             lea     0x4(%eax),%edx
      12: 0f 46 d0             cmovbe  %eax,%edx
      15: ff 02                incl    (%edx)
      17: c3                   ret
using gcc 3.4.4:
    C:\TEMP>gcc -v
    Reading specs from c:/MinGW/bin/../lib/gcc/mingw32/3.4.4/specs
    Configured with: ../gcc/configure --with-gcc --with-gnu-ld --with-gnu-as --host=mingw32 --target=mingw32 --prefix=/mingw --enable-threads --disable-nls --enable-languages=c,c++,f77,ada,objc,java --disable-win32-registry --disable-shared --enable-sjlj-exceptions --enable-libgcj --disable-java-awt --without-x --enable-java-gc=boehm --disable-libgcj-debug --enable-interpreter --enable-hash-synchronization --enable-libstdcxx-debug
    Thread model: win32
    gcc version 3.4.4 (mingw special)
As you can see, there are no branches: the comparison is done with SSE instructions (movss, ucomiss) and the array slot is selected with a conditional move (cmovbe).
Compiling the same program with:
    gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc
gives
    00000000 <foo(float, float, int*)>:
       0: d9 44 24 08    flds    0x8(%esp)
       4: 8b 44 24 0c    mov     0xc(%esp),%eax
       8: d9 44 24 04    flds    0x4(%esp)
       c: d9 c9          fxch    %st(1)
       e: 8d 50 04       lea     0x4(%eax),%edx
      11: df e9          fucomip %st(1),%st
      13: dd d8          fstp    %st(0)
      15: 0f 46 d0       cmovbe  %eax,%edx
      18: ff 02          incl    (%edx)
      1a: c3             ret
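Again no branch is generated, just a conditional move. Without -mfpmath=sse the comparison is done on the x87 stack (flds, fucomip) instead of in SSE registers.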
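For context, here is a minimal sketch of how this "index with the comparison result" idiom might be used over a whole array. The helper names count_less and demo_count_less are hypothetical, not from the code above. Since a C++ comparison already yields a bool that converts to 0 or 1, the "& 1" is not strictly needed. Whether a given compiler emits branch-free code for the loop depends on the version and flags, so check the disassembly as above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts[1] receives the number of positions where
// a[i] < b[i], counts[0] receives the number of all other positions.
void count_less(const float* a, const float* b, std::size_t n, int counts[2])
{
    for (std::size_t i = 0; i < n; ++i) {
        // The comparison yields a bool, which converts to 0 or 1 when
        // used as an array index.
        counts[a[i] < b[i]]++;
    }
}

// Small usage example with hand-checked values.
void demo_count_less()
{
    const float a[] = {1.0f, 5.0f, 3.0f};
    const float b[] = {2.0f, 4.0f, 3.0f};
    int counts[2] = {0, 0};
    count_less(a, b, 3, counts);
    assert(counts[1] == 1);  // only 1.0 < 2.0 holds
    assert(counts[0] == 2);  // 5.0 < 4.0 and 3.0 < 3.0 are both false
}
```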
Doing "evil tricks", as Jan suggested, would be something like this:
    void foo(float f1, float f2, int* array)
    {
        float f = f1 - f2;
        array[ ((*(int*)&f) >> 31) & 1 ]++;
    }
Compiled with
    gcc -O2 -pipe -msse -fomit-frame-pointer -c fast_cmp.cc
results in
    00000000 <foo(float, float, int*)>:
       0: 83 ec 04       sub     $0x4,%esp
       3: 8b 54 24 10    mov     0x10(%esp),%edx
       7: d9 44 24 0c    flds    0xc(%esp)
       b: d8 6c 24 08    fsubrs  0x8(%esp)
       f: d9 1c 24       fstps   (%esp)
      12: 8b 04 24       mov     (%esp),%eax
      15: c1 e8 1f       shr     $0x1f,%eax
      18: ff 04 82       incl    (%edx,%eax,4)
      1b: 58             pop     %eax
      1c: c3             ret
Note that the instruction fstps (%esp) stores the float into memory, and the following mov (%esp),%eax immediately reloads it as an integer. This round trip through memory will slow things down.
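If you want to experiment with the sign-bit trick anyway, here is a sketch that reads the bits with std::memcpy instead of the pointer cast, which avoids the aliasing problem of *(int*)&f. The names sign_bit_of_difference, foo_bits and check_edge_cases are hypothetical, and the code assumes 32-bit IEEE 754 floats with the default rounding mode. It also shows one case where the trick does not agree with f1 < f2: the difference -0.0f - 0.0f is -0.0f, so the sign bit is set although -0.0f < 0.0f is false. NaN inputs are another case I would not rely on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns 1 if the sign bit of (f1 - f2) is set, else 0.
// Assumes float is a 32-bit IEEE 754 value.
inline unsigned sign_bit_of_difference(float f1, float f2)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float diff = f1 - f2;
    std::uint32_t bits;
    std::memcpy(&bits, &diff, sizeof bits);  // copy the bits, no pointer cast
    return bits >> 31;                       // unsigned shift leaves 0 or 1
}

// Same shape as foo() above, but using the helper.
void foo_bits(float f1, float f2, int* array)
{
    array[sign_bit_of_difference(f1, f2)]++;
}

// Cases where the trick matches f1 < f2, and one where it does not.
void check_edge_cases()
{
    assert(sign_bit_of_difference(1.0f, 2.0f) == 1u);  // 1 < 2 is true
    assert(sign_bit_of_difference(2.0f, 1.0f) == 0u);  // 2 < 1 is false
    assert(sign_bit_of_difference(3.0f, 3.0f) == 0u);  // difference is +0
    // Disagreement: -0.0f < 0.0f is false, but the difference is -0.0f.
    assert(!(-0.0f < 0.0f));
    assert(sign_bit_of_difference(-0.0f, 0.0f) == 1u);
}
```

Whether the compiler still goes through memory for the memcpy version depends on the target and flags, so it is worth looking at the generated code again.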
My two cents.