# Invoking FCOMI, FCMOV

## Recommended Posts

Does anybody know how to invoke, e.g., FCOMI or FCMOV? (x86 Pentium Pro and up asm instructions.) I'm trying to do it in inline asm (C++, VS2005), and I just can't get the syntax right for that instruction. I keep getting: error C2415: improper operand type. For example, I've been trying many kinds of arguments:
__declspec(noinline) void foo( float a, float b )
{
    __asm {
        FCOMI   DWORD PTR [a], DWORD PTR [b]
    }
}


Not sure if I need to pass the stack registers explicitly...

##### Share on other sites
These instructions can only be used between floating-point registers.
FCOMI compares the ST(0) register with another register, ST(i).
FCMOVcc conditionally moves from a register ST(i) into ST(0).

They can be used as, for example:
__asm { fcomi st(0), st(3) }

##### Share on other sites
PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.

Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility with respect to restoring state after the inline asm, etc.)

##### Share on other sites
EDIT: I'm erasing this comment... I need to think about what I'm even asking. ;) Tired, bbl.

##### Share on other sites
Quote:
Original post by discman1028
PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.
Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility with respect to restoring state after the inline asm, etc.)

Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile in any reasonable calling convention.

##### Share on other sites
I like that convention very much. :) Thanks.

Unfortunately, no matter what I do, I cannot get my branchless version to run any faster than the branching version that the compiler generates for:

return (a >= 0.0f ? b : c);

(where a, b, and c are floats).

In fact, even if I profile only the two instructions I intend to use (leaving out the stack pushes and pops), it's still about twice as slow as the compiler-generated version (Intel Core 2 Quad).

__forceinline float MYFSel(float, float, float)
{
    __asm {
        fcomi   st(0), st(1)
        fcmovbe st(0), st(1)
    }
}

Maybe I should look up the latency of those instructions... but why do some authors (one suggests it near the bottom of his page, another under "conditional move") think that fcmov* could be beneficial in hard-to-predict-branch cases? It doesn't seem to be.

(P.S. My profiling loop iterated over two inlined routines, one containing the pure-C++ return statement you see above, the other containing the inline asm routine you see above.)
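For reference, a complete branchless select with balanced x87 stack handling can be sketched. This version uses GCC extended asm (AT&T operand order) rather than the MSVC inline asm of the thread, and `fsel_fcmov` is a hypothetical name; it is an illustration of the fcomi/fcmov pattern, not the poster's exact routine:

```cpp
// Returns b if a >= 0.0f, else c -- matching (a >= 0.0f ? b : c).
// Sketch in GCC extended asm; assumes an x86/x86-64 target with x87 and
// P6-class conditional moves available.
float fsel_fcmov(float a, float b, float c) {
    float r;
    __asm__ (
        "flds   %3\n\t"              // push c             -> st0=c
        "flds   %2\n\t"              // push b             -> st0=b, st1=c
        "flds   %1\n\t"              // push a             -> st0=a, st1=b, st2=c
        "fldz\n\t"                   // push 0.0           -> st0=0, st1=a, ...
        "fcomip %%st(1), %%st\n\t"   // compare 0 with a, pop; BE holds iff 0 <= a
        "fstp   %%st(0)\n\t"         // discard a          -> st0=b, st1=c
        "fcmovnbe %%st(1), %%st\n\t" // if a < 0 (not BE), copy c over b
        "fstp   %%st(1)\n\t"         // store result over st1 and pop; one value left
        : "=t"(r)                    // result is taken from st(0)
        : "m"(a), "m"(b), "m"(c)
        : "cc");                     // fcomip writes EFLAGS
    return r;
}
```

The stack is left exactly one entry deeper than at entry (four pushes, three pops), which is what the `"=t"` output constraint requires.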

(EDIT: By the way, the asm generated for the above-mentioned C++ was:)

__declspec(noinline) float MYFSel_br(float a, float b, float c)
{
    return (a >= 0.0f ? b : c);
004011E0  fldz
004011E2  fcomp       dword ptr [esp+4]
004011E6  fnstsw      ax
004011E8  test        ah,41h
004011EB  jp          MYFSel_br+1Ah (4011FAh)
004011ED  fld         dword ptr [esp+8]
004011F1  fstp        dword ptr [esp+4]
004011F5  fld         dword ptr [esp+4]
}

##### Share on other sites
And.... I also noticed there's an "fcos" instruction (cool!). So I try this, and the cosf() version is still faster!

__forceinline float MYCOS(float)
{
    __asm {
        fcos
    }
}

__forceinline float THEIRCOS(float a)
{
    return cosf(a);
}

Maybe I'm missing something very important. Optimization on or off, it still takes longer to do fcos than cosf()... all I can think is that fcos is microcoded. And stepping into cosf(), it seems that cosf() uses SSE. So maybe those two facts together let cosf() beat fcos...
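At least the two should agree numerically. A quick sanity check can be sketched in GCC extended asm (the thread's snippet is MSVC syntax; `cos_x87` is a hypothetical name). Note that fcos operates on st(0) in place and is only defined for |x| < 2^63:

```cpp
// Compute cosine via the x87 fcos instruction. The "+t" constraint puts the
// operand in st(0) and reads the result back from st(0).
// Assumes an x86/x86-64 target; fcos is undefined for |x| >= 2^63.
float cos_x87(float a) {
    double r = a;
    __asm__ ("fcos" : "+t"(r));
    return static_cast<float>(r);
}
```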

##### Share on other sites
Quote:
Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile in any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered, except for one important detail: they expect the stack to be empty on entry and on return (or to contain only a floating-point return value).

##### Share on other sites
Quote:
Original post by implicit
Quote:
Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile in any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered, except for one important detail: they expect the stack to be empty on entry and on return (or to contain only a floating-point return value).

As long as he is correct about x87, I like the x87 convention since you may avoid unnecessary pops of the stack.

EDIT: Overpopping the stack seems OK, but overpushing past the eight registers overflows it: the stack fault leaves an indefinite QNaN in ST(0) instead of the value you pushed (with exceptions masked). So, if you don't pop back to the original stack state whenever you leave a function, won't you start getting garbage?

##### Share on other sites
Quote:
Original post by discman1028
As long as he is correct about x87, I like the x87 convention since you may avoid unnecessary pops of the stack.
Uh, no, the 80x87 floating point unit is just a part (and once upon a time an optional add-on) of the 80x86 architecture. We've been using the terms interchangeably as all x86 calling conventions I know of also specify how to deal with the FPU, though I suppose that at some point back when dinosaurs roamed the earth there may have been such a thing as an x87-agnostic calling convention ;)

##### Share on other sites
You're completely missing the boat. If you want to take advantage of fcmov or fcos or fdiv, or any instruction for that matter, you have to 1) issue them early, 2) do meaningful work while you wait for them to complete, and 3) avoid using the execution units they rely on until their issue ports are free. In the case of fcmov, you should also make sure the branch is giving the predictor a hard time.

If you don't respect the latency and throughput characteristics of certain instructions you'll end up misusing them. In your case, calling fcmov or fcos in a tight loop IS EXACTLY THE WRONG WAY TO USE THEM.

That's why assembly is the last step in the process; wait until you've written a routine that you're happy with before you even bother with assembly. Nobody sits down and writes assembly first.

##### Share on other sites
Quote:
Original post by outRider
... Nobody sits down and writes assembly first.

Yeah, this is why I was thinking in my other post that I must be crazy to think I can outperform compiler code, especially in something as fine-grained as float compares. But I had found suggestions that it might be beneficial, at least in code where the branch predictor guesses correctly only 50% of the time.

Still, yes, I had ignored scheduling, pipelining, etc., in the hope that the compiler might help. This isn't a serious optimization undertaking, just a venture.
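One branch-free alternative that sidesteps the x87 stack entirely is an integer sign-mask select, a sketch with a hypothetical name (`fsel_mask`). Caveat: it keys off the sign bit, so it treats -0.0f as negative, unlike `(a >= 0.0f ? b : c)`:

```cpp
#include <cstdint>
#include <cstring>

// Branch-free select: returns b if a >= 0.0f (sign bit clear), else c.
// Hypothetical helper, shown as an alternative to the fcmov approach.
float fsel_mask(float a, float b, float c) {
    std::uint32_t ai, bi, ci;
    std::memcpy(&ai, &a, 4);   // bit-cast without violating aliasing rules
    std::memcpy(&bi, &b, 4);
    std::memcpy(&ci, &c, 4);
    // Arithmetic right shift replicates the sign bit: all-ones if a is
    // negative, all-zeros otherwise.
    std::uint32_t mask =
        static_cast<std::uint32_t>(static_cast<std::int32_t>(ai) >> 31);
    std::uint32_t ri = (bi & ~mask) | (ci & mask);
    float r;
    std::memcpy(&r, &ri, 4);
    return r;
}
```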

##### Share on other sites
Also, are you sure that in your tests the branches are truly hard to predict? Unless you have defined your sequence with a PRNG, they probably aren't so hard to predict.

If you aren't careful, the Core 2 can and will use the last *8* outcomes at a specific branch site as an 8-bit index into a 256-entry table of prior branch behavior. It is really good at finding situations of high coherence: if you recorded the branch behavior as a stream of 1's and 0's (taken and not taken), and an order-8 binary predicting data compressor would do a good job of compressing that stream, then the Core 2 will also do a good job of predicting the branches (it's basically the same algorithm).

Have you profiled to ensure that approximately 50% of the branches are mispredicted?

##### Share on other sites
Quote:
Original post by Rockoon1
Also, are you sure that in your tests the branches are truly hard to predict? Unless you have defined your sequence with a PRNG, they probably aren't so hard to predict.

Thanks for the info... that's good to know about Core 2 prediction.

x = rand();
y = rand();
if (x < y) ...

That kind of branch... hard to predict.

##### Share on other sites
Quote:
Original post by discman1028
About the test, it's essentially:
x = rand();
y = rand();
if (x < y) ...
That kind of branch... hard to predict.

"In theory there is no difference between theory and practice, but in practice there is." - Jan L. A. van de Snepscheut

...your rand() probably doesn't provide such a guarantee.

Fill an array with a large prime number of "random" 0's and 1's, and then shuffle them to increase their dimensionality.
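One way to realize that suggestion can be sketched as follows (all names are hypothetical: a xorshift32 generator stands in for "random", and a Fisher-Yates pass does the shuffling):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal xorshift32 PRNG; state must be nonzero.
static std::uint32_t xorshift32(std::uint32_t& s) {
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

// Pre-generate n hard-to-predict 0/1 branch outcomes, then Fisher-Yates
// shuffle them to destroy any short-range pattern the source had.
std::vector<int> make_branch_stream(std::size_t n, std::uint32_t seed) {
    std::vector<int> bits(n);
    std::uint32_t s = seed ? seed : 1u;
    for (std::size_t i = 0; i < n; ++i)
        bits[i] = static_cast<int>(xorshift32(s) & 1u);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::size_t j = xorshift32(s) % (i + 1);  // uniform-ish index in [0, i]
        std::swap(bits[i], bits[j]);
    }
    return bits;
}
```

The benchmark loop would then branch on `bits[i]` so that taken/not-taken is close to a 50/50 coin flip at the branch site.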
