discman1028

Invoking FCOMI, FCMOV

Does anybody know how to invoke, e.g., FCOMI or FCMOV (the x86 Pentium Pro-and-up asm instructions)? I'm trying to do it in inline asm (C++, VS2005), and I just can't get the syntax right for those instructions. I keep getting: error C2415: improper operand type. For example, I've been trying many kinds of arguments:
__declspec(noinline) void foo( float a, float b )
{
	__asm {
		FCOMI       DWORD PTR [a], DWORD PTR [b]
	}
}

Not sure if I need to pass the stack registers explicitly...

These instructions can only be used between floating-point registers.
FCOMI compares the ST(0) register with another register, ST(i).
FCMOVcc conditionally moves from a register ST(i) into ST(0).

They can be used as, for example:
__asm
{
fcomi st(0), st(3)
}
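For reference, here is a portable C++ sketch of what that instruction pair computes (the function name and flag variables are mine, not a real API): fcomi sets CF and ZF from the compare, and fcmovbe copies st(i) into st(0) when CF or ZF is set, which leaves the maximum in st(0) for ordered inputs.

```cpp
#include <cassert>

// fcomi st(0), st(i) sets EFLAGS like an unsigned integer compare:
//   CF = (st0 < sti), ZF = (st0 == sti)
// fcmovbe st(0), st(i) then copies st(i) into st(0) when CF or ZF is set,
// so the pair leaves max(st0, sti) in st(0) for ordered (non-NaN) inputs.
float fcomi_fcmovbe_equiv(float st0, float sti)
{
    bool cf = st0 < sti;   // "below"
    bool zf = st0 == sti;  // "equal"
    if (cf || zf)          // fcmovbe: move if below-or-equal
        st0 = sti;
    return st0;
}
```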

PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.

Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility, w/respect to restoring state after the inline asm, etc.)

Quote:
Original post by discman1028
PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.

Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility, w/respect to restoring state after the inline asm, etc.)


Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 stack is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.

I like that convention very much. :) Thanks.

Unfortunately, no matter what I do, I cannot get my branchless version to run any faster than the branching version that the compiler generates for:

return (a >= 0.0f ? b : c);

(where a, b, and c are floats).

In fact, even if I profile only the two instructions I intend to use (leaving out the stack pushes and pops), it's still about twice as slow as the compiler-generated version (Intel Core 2 Quad).


__forceinline float MYFSel(float, float, float)
{
// NOTE: the arguments are assumed to already be on the x87 stack
// (loads/stores deliberately omitted for profiling).
__asm {
fcomi st(0), st(1)
fcmovbe st(0), st(1)
}
}

Maybe I should look up the latencies of those instructions... but why do he (see bottom) and he (search "conditional move") think that fcmov* could be beneficial in hard-to-predict-branch cases? It doesn't seem to be.

(P.S. My profiling was for an iteration over 2 inlined routines, one containing the pure-C++ return statement you see above, the other containing the inline asm routine you see above.)

(EDIT: By the way, the asm generated for the above-mentioned C++ was:)


__declspec(noinline) float MYFSel_br(float a, float b, float c)
{
return (a >= 0.0f ? b : c);
004011E0 fldz
004011E2 fcomp dword ptr [esp+4]
004011E6 fnstsw ax
004011E8 test ah,41h
004011EB jp MYFSel_br+1Ah (4011FAh)
004011ED fld dword ptr [esp+8]
004011F1 fstp dword ptr [esp+4]
004011F5 fld dword ptr [esp+4]
}
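For what it's worth, the same branchless select can be written with SSE intrinsics instead of the x87 stack (assuming an x86 target; the function name is mine, a sketch rather than the poster's actual routine):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Branchless float select: returns b if a >= 0.0f, else c.
// The compare produces an all-ones or all-zeros mask in the low lane;
// the and/andnot pair then blends b and c without any branch.
float fsel_sse(float a, float b, float c)
{
    __m128 va   = _mm_set_ss(a);
    __m128 mask = _mm_cmpge_ss(va, _mm_setzero_ps());  // a >= 0 ? ~0 : 0
    __m128 vb   = _mm_set_ss(b);
    __m128 vc   = _mm_set_ss(c);
    __m128 r    = _mm_or_ps(_mm_and_ps(mask, vb),
                            _mm_andnot_ps(mask, vc));
    return _mm_cvtss_f32(r);
}
```

Because the mask is either all ones or all zeros, exactly one of the two blended terms survives, matching the ternary's semantics (including for -0.0f, which compares >= 0).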



And... I also noticed there's an "fcos" instruction (cool!). So I tried this, and the cosf() version is still faster!


__forceinline float MYCOS(float)
{
// NOTE: the argument is assumed to already be in st(0);
// nothing here loads it onto the x87 stack.
__asm {
fcos
}
}

__forceinline float THEIRCOS(float a)
{
return cosf(a);
}




Maybe I'm missing something very important. Optimization on or off, fcos still takes longer than cosf()... all I can think is that fcos is microcoded. And, stepping into cosf(), it seems that cosf() uses SSE. So maybe those two facts help cosf() beat fcos...
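For timing comparisons like this, a minimal portable harness can help keep the measurement honest (the function name is mine; this is a sketch, not the poster's actual benchmark):

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Times one pass of std::cos over `in`, accumulating into `sink` so the
// optimizer cannot discard the work. Returns elapsed microseconds.
long long time_cos_pass(const std::vector<float>& in, float& sink)
{
    auto t0 = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (float x : in)
        acc += std::cos(x);
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;  // publish the result so the loop is not dead code
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Keeping the accumulated result alive matters: if the loop's result is unused, the compiler may delete the whole thing and the "fast" version just measures an empty loop.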

Quote:
Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 stack is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered, except for one important detail: they expect the x87 stack to be empty on entry and on return (or to contain only a floating-point return value).

Quote:
Original post by implicit
Quote:
Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. You can expect that the x87 stack is entirely yours to do with as you please with any of the compilers I have worked with extensively (aside from the fact that functions returning float/double return their result on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered, except for one important detail: they expect the x87 stack to be empty on entry and on return (or to contain only a floating-point return value).


As long as he is correct about x87, I like the x87 convention, since you may avoid unnecessary pops of the stack.

EDIT: Over-popping the stack is OK, but over-pushing it seems to start invalidating newly-pushed values (loading onto a full x87 stack raises a stack fault and, with the invalid-operation exception masked, leaves an indefinite QNaN in st(0)). So, if you don't pop and restore the original stack state whenever you leave a function, won't you start getting garbage?

Quote:
Original post by discman1028
As long as he is correct about x87, I like the x87 convention since you may avoid unnecessary pops of the stack.
Uh, no, the 80x87 floating point unit is just a part (and once upon a time an optional add-on) of the 80x86 architecture. We've been using the terms interchangeably as all x86 calling conventions I know of also specify how to deal with the FPU, though I suppose that at some point back when dinosaurs roamed the earth there may have been such a thing as an x87-agnostic calling convention ;)

You're completely missing the boat. If you want to take advantage of fcmov or fcos or fdiv, or any instruction for that matter, you have to 1) issue them early, 2) do meaningful work while you wait for them to complete, and 3) avoid using the execution units they rely on until their issue ports are free. In the case of fcmov you should also make sure the branch is giving the predictor a hard time.

If you don't respect the latency and throughput characteristics of certain instructions you'll end up misusing them. In your case, calling fcmov or fcos in a tight loop IS EXACTLY THE WRONG WAY TO USE THEM.

That's why assembly is the last step in the process; wait until you've written a routine that you're happy with before you even bother with assembly. Nobody sits down and writes assembly first.

Quote:
Original post by outRider
... Nobody sits down and writes assembly first.


Yeah, this is why I was thinking in my other post that I must be crazy to think I can outperform compiler-generated code, especially in something as fine-grained as float compares. But I had found suggestions that it might be beneficial, at least in code where branch prediction only guesses correctly 50% of the time.

Still, yes, I had ignored scheduling, pipelining, etc., in the hope that the compiler might help. This isn't a serious optimization undertaking, just a venture.

Also, are you sure that in your tests the branches are truly hard to predict? Unless you have defined your sequence with a PRNG, they probably aren't so hard to predict.

If you aren't careful, the Core 2 can and will use the last *8* branch outcomes at a specific branch site as an 8-bit index into a 256-entry table of prior branch behavior... it is really good at finding situations of high coherence. Basically, if you recorded the branch behavior as a stream of 1's and 0's (taken and not taken), and an order-8 binary predicting data compressor would do a good job at compressing that stream, then the Core 2 will also do a good job at predicting the branches (it's basically the same algorithm).

Have you profiled to ensure that approximately 50% of the branches are mispredicted?

Quote:
Original post by Rockoon1
Also, are you sure that in your tests the branches are truly hard to predict? Unless you have defined your sequence with a PRNG, they probably aren't so hard to predict.


Thanks for the info... that's good to know about Core 2 prediction.

About the test, it's essentially:

x = rand();
y = rand();
if (x < y) ...

That kind of branch... hard to predict.

Quote:
Original post by discman1028
About the test, it's essentially:

x = rand();
y = rand();
if (x < y) ...

That kind of branch... hard to predict.


"In theory there is no difference between theory and practice, but in practice there is." - Jan L. A. van de Snepscheut

...your rand() probably doesn't provide such a guarantee.

Fill an array with a large prime number of "random" 0's and 1's, and then shuffle them to increase their dimensionality.
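A sketch of that suggestion in C++ (the function name, array size, and seed are my own choices, not from the post):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Fill an array with exactly half 0s and half 1s, then shuffle it, so the
// branch driven by it has no short-history pattern a predictor can exploit.
std::vector<int> make_hard_branch_stream(std::size_t n, unsigned seed)
{
    std::vector<int> bits(n);
    for (std::size_t i = 0; i < n; ++i)
        bits[i] = static_cast<int>(i & 1);  // alternating 0/1 before shuffling
    std::mt19937 gen(seed);                 // deterministic PRNG, not rand()
    std::shuffle(bits.begin(), bits.end(), gen);
    return bits;
}
```

Driving the branch from a precomputed array like this also keeps the rand() calls themselves out of the timed loop.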

