Invoking FCOMI, FCMOV
Does anybody know how to invoke, e.g., FCOMI or FCMOV (x86 asm instructions, Pentium Pro and up)? I'm trying to do it in inline asm (C++, VS2005), and I just can't get the syntax right for that instruction. I keep getting:
error C2415: improper operand type
Example (I've just been trying many kinds of arguments):
__declspec(noinline) void foo( float a, float b )
{
    __asm {
        FCOMI DWORD PTR [a], DWORD PTR
    }
}
Not sure if I need to pass the stack registers explicitly...
These instructions can only be used between floating-point registers.
FCOMI compares the ST(0) register with another register, ST(i).
FCMOVcc conditionally moves from a register ST(i) into ST(0).
They can be used as, for example:
__asm{ fcomi st(0), st(3)}
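For anyone who wants to reason about what these two instructions do without stepping through a debugger, here is a rough C++ model of their flag behaviour as documented in the Intel manuals. It is only a sketch: Flags, Cond, model_fcomi and model_fcmov are made-up names, doubles stand in for the 80-bit registers, and the exception bits that the real instructions also touch are ignored.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model of the three EFLAGS bits that FCOMI writes.
struct Flags {
    bool zf;
    bool pf;
    bool cf;
};

// Models FCOMI ST(0), ST(i): compare st0 against sti and set ZF/PF/CF.
inline Flags model_fcomi(double st0, double sti) {
    if (std::isnan(st0) || std::isnan(sti)) return Flags{true, true, true};  // unordered
    if (st0 > sti) return Flags{false, false, false};
    if (st0 < sti) return Flags{false, false, true};
    return Flags{true, false, false};  // equal
}

// The eight FCMOVcc conditions.
enum class Cond { B, E, BE, U, NB, NE, NBE, NU };

// Models FCMOVcc ST(0), ST(i): returns the new ST(0).
inline double model_fcmov(Cond cc, const Flags& f, double st0, double sti) {
    bool take = false;
    switch (cc) {
        case Cond::B:   take = f.cf;           break;  // below
        case Cond::E:   take = f.zf;           break;  // equal
        case Cond::BE:  take = f.cf || f.zf;   break;  // below or equal
        case Cond::U:   take = f.pf;           break;  // unordered
        case Cond::NB:  take = !f.cf;          break;  // not below
        case Cond::NE:  take = !f.zf;          break;  // not equal
        case Cond::NBE: take = !f.cf && !f.zf; break;  // not below or equal
        case Cond::NU:  take = !f.pf;          break;  // not unordered
        default: assert(false && "unknown condition"); break;
    }
    return take ? sti : st0;
}
```

With these, model_fcmov(Cond::BE, model_fcomi(1.0, 2.0), 1.0, 2.0) gives 2.0, because ST(0) < ST(i) sets CF and "below or equal" is then true.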
PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.
Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility, w/respect to restoring state after the inline asm, etc.)
Quote:Original post by discman1028
PARENS!!! I shouldn't have been looking at disasm to find the syntax. ;) Thanks.
Now I just need to figure out if there are any quirks to the x87 floating point register stack. (I don't usually play with inline assembly, so I don't know what is the programmer's responsibility, w/respect to restoring state after the inline asm, etc.)
Compilers themselves tend not to keep any x87 state across calls. With any of the compilers I have worked with extensively, you can expect that the x87 is entirely yours to do with as you please (aside from the fact that functions returning float/double return the value on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.
I like that convention very much. :) Thanks.
Unfortunately, no matter what I do, I cannot get my branchless version to run any faster than the branching version that the compiler generates for:
return (a >= 0.0f ? b : c);
(where a, b, and c are floats).
In fact, even if I profile only the two instructions I intend to use (leaving out the stack pushes and pops), it's still about twice as slow as the compiler-generated version (Intel Core 2 Quad).
__forceinline float MYFSel(float, float, float)
{
    __asm {
        fcomi st(0), st(1)
        fcmovbe st(0), st(1)
    }
}
Maybe I should look up the latency of those instructions... but why do he (see bottom) and he (search "conditional move") think that fcmov* could be beneficial in hard-to-predict-branch cases? It doesn't seem to be.
(P.S. My profiling was for an iteration over 2 inlined routines, one containing the pure-C++ return statement you see above, the other containing the inline asm routine you see above.)
(EDIT: By the way, the asm generated for the above-mentioned C++ was:)
__declspec(noinline) float MYFSel_br(float a, float b, float c)
{
    return (a >= 0.0f ? b : c);
004011E0 fldz
004011E2 fcomp dword ptr [esp+4]
004011E6 fnstsw ax
004011E8 test ah,41h
004011EB jp MYFSel_br+1Ah (4011FAh)
004011ED fld dword ptr [esp+8]
004011F1 fstp dword ptr [esp+4]
004011F5 fld dword ptr [esp+4]
}
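A portable way to experiment with the same idea outside of inline asm is a bit-mask select. To be clear, this is not fcmov, just a different branch-free technique, and the names select_branchy, select_mask and check_selects are made up for this sketch. It assumes a 32-bit IEEE float. The compiler is free to turn either version into whatever it likes, so check the disassembly before drawing conclusions, and expect the outcome to depend on how predictable the sign of a is in the test data.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Reference version: the plain C++ expression from the post above.
inline float select_branchy(float a, float b, float c) {
    return (a >= 0.0f ? b : c);
}

// Bit-mask version: builds an all-ones or all-zeros mask from the comparison
// and blends the raw bits of b and c with it.
inline float select_mask(float a, float b, float c) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bb, cb;
    std::memcpy(&bb, &b, sizeof bb);  // copy out the bit patterns
    std::memcpy(&cb, &c, sizeof cb);
    // (a >= 0.0f) is 1 or 0; 0u - 1u wraps to 0xFFFFFFFF, 0u - 0u stays 0.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(a >= 0.0f);
    const std::uint32_t rb = (bb & mask) | (cb & ~mask);
    float r;
    std::memcpy(&r, &rb, sizeof r);
    return r;
}

// Quick self-check that both versions agree, including -0.0f and NaN for a.
inline void check_selects() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    const float tests[] = {1.0f, -1.0f, 0.0f, -0.0f, nan};
    for (float a : tests) {
        assert(select_branchy(a, 2.0f, 3.0f) == select_mask(a, 2.0f, 3.0f));
    }
}
```

Note that a NaN in a fails the >= test in both versions, so both return c in that case.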
And... I also noticed there's an "fcos" instruction (cool!). So I tried this, and the cosf() version is still faster!
__forceinline float MYCOS(float)
{
    __asm {
        fcos
    }
}
__forceinline float THEIRCOS(float a)
{
    return cosf(a);
}
Maybe I'm missing something very important. Optimization on or off, it's still taking longer to do fcos than cosf()... all I can think is that fcos is microcoded. And it seems that, stepping into cosf(), cosf() uses SSE. So maybe those two facts help cosf() beat fcos...
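Since the comparison above hinges on how the timing is done, here is a minimal sketch of a timing helper in portable C++. Everything in it is an assumption for illustration: ns_per_call is a made-up name, std::chrono::steady_clock is just one reasonable clock choice, and a loop like this measures throughput over a whole array rather than the latency of a single instruction.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: average nanoseconds per call of f over the inputs.
// The results are summed into *sink so that the caller can use the value,
// which keeps the calls from being discarded as dead code.
template <typename F>
double ns_per_call(F f, const std::vector<float>& inputs, int repeats, float* sink) {
    assert(!inputs.empty() && repeats > 0 && sink != nullptr);
    float acc = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (float x : inputs) {
            acc += f(x);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    *sink = acc;
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (static_cast<double>(inputs.size()) * repeats);
}
```

A call might look like ns_per_call([](float x) { return std::cos(x); }, inputs, 1000, &sink), run once per candidate routine over the same inputs, with sink printed or otherwise used afterwards.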
Quote:Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. With any of the compilers I have worked with extensively, you can expect that the x87 is entirely yours to do with as you please (aside from the fact that functions returning float/double return the value on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered except for one important detail: they expect the stack to be empty on entry and on return (apart from a floating-point return value, if there is one).
Quote:Original post by implicit
Quote:Original post by Rockoon1
Compilers themselves tend not to keep any x87 state across calls. With any of the compilers I have worked with extensively, you can expect that the x87 is entirely yours to do with as you please (aside from the fact that functions returning float/double return the value on the x87 stack). Quite simply, the x87 stack is considered volatile under any reasonable calling convention.
That's true for any x86 calling convention I've ever encountered except for one important detail: they expect the stack to be empty on entry and on return (apart from a floating-point return value, if there is one).
As long as he is correct about x87, I like the x87 convention since you may avoid unnecessary pops of the stack.
EDIT: Overpopping the stack is OK, but overpushing it seems to start invalidating newly-pushed values. So, if you don't pop and restore the original stack state whenever you leave a function, won't you start getting garbage?
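To make the overpush/overpop question concrete, here is a toy C++ model of the x87 register stack: 8 physical slots, a 3-bit TOP index and an empty/valid tag per slot. It is a sketch under assumptions, not a description of any compiler's behaviour: X87StackModel and its members are made-up names, doubles stand in for the 80-bit registers, the stack-fault exceptions are assumed to be masked, and the status-word bits are reduced to two flags.

```cpp
#include <array>
#include <cassert>
#include <limits>

class X87StackModel {
public:
    // Models FLD: decrement TOP, then write the new ST(0). If that slot is
    // still tagged valid, the stack has overflowed and a NaN is stored instead.
    void push(double v) {
        top_ = (top_ + 7u) & 7u;  // TOP - 1, modulo 8
        if (valid_[top_]) {
            overflowed_ = true;
            v = std::numeric_limits<double>::quiet_NaN();
        }
        slot_[top_] = v;
        valid_[top_] = true;
    }

    // Models FSTP to memory: read ST(0), tag it empty, increment TOP.
    // Reading an empty slot is a stack underflow and yields a NaN.
    double pop() {
        double v = slot_[top_];
        if (!valid_[top_]) {
            underflowed_ = true;
            v = std::numeric_limits<double>::quiet_NaN();
        }
        valid_[top_] = false;
        top_ = (top_ + 1u) & 7u;  // TOP + 1, modulo 8
        return v;
    }

    // Value currently in ST(i), without checking its tag.
    double st(unsigned i) const {
        assert(i < 8u);
        return slot_[(top_ + i) & 7u];
    }

    bool overflowed() const { return overflowed_; }
    bool underflowed() const { return underflowed_; }

private:
    std::array<double, 8> slot_{};
    std::array<bool, 8> valid_{};  // false means the slot is tagged empty
    unsigned top_ = 0;
    bool overflowed_ = false;
    bool underflowed_ = false;
};
```

In this model a ninth push without an intervening pop lands on a slot that is still in use and produces a NaN, while an extra pop leaves every slot empty, so later pushes still behave. That matches the observation in the EDIT above, and it is why values left behind on the stack by one routine can turn a later routine's loads into garbage.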
Quote:Original post by discman1028
As long as he is correct about x87, I like the x87 convention since you may avoid unnecessary pops of the stack.
Uh, no, the 80x87 floating point unit is just a part (and once upon a time an optional add-on) of the 80x86 architecture. We've been using the terms interchangeably as all x86 calling conventions I know of also specify how to deal with the FPU, though I suppose that at some point back when dinosaurs roamed the earth there may have been such a thing as an x87-agnostic calling convention ;)