Branchless math ops (like fsel)

Started by
18 comments, last by Rockoon1 16 years, 6 months ago
Hey all, I'm working on some branchless math ops. When using SSE, selecting a vector based on a mask is easy.

_mm_or_ps( _mm_andnot_ps( maskvect, zerovect ), _mm_and_ps( maskvect, nonzerovect) );




Now, I would like to do the same with scalar floats. It is possible on PowerPC with fsel, but I am trying to do the same on x86. I have seen the MOVC/FMOVC instructions, but I do not see any instrinsics exposing them (and I'd like to avoid x86 asm for now). (Using VS2005.) I am trying to use C++ to generate the best branchless ASM possible. It seems that SSE has a scalar instruction set. But they operate on 128-bit vectors. My questions are: (1) Is it an expensive operation to transfer between SSE vector registers and the x87 floating point stack, in the first place? This assumption is the reason for my goal. It seems like the vector <-> scalar transfers might not be as expensive on x86 as it is on other certain PowerPC+Altivec architectures. (I remember, in fact, that one revision of SSE actually shared its XMM registers with the x87 stack.) (2) (I've wondered this for a while): Why are there scalar SSE instructions that operate on only one component of a 128-bit vector? Does it save processing time? (3) What are your suggestions? The naive implementation of fsel simply using C produces the following in an optimized build:

__declspec(noinline) float MyFSel( float mask, float a, float b )
{
	return (mask >= 0.0f ? a : b);
004010E0  fldz             
004010E2  fcomp       dword ptr [esp+4] 
004010E6  fnstsw      ax   
004010E8  test        ah,41h 
004010EB  jp          MyFSel+1Ah (4010FAh) 
004010ED  fld         dword ptr [esp+8] 
004010F1  fstp        dword ptr [esp+4] 
004010F5  fld         dword ptr [esp+4] 
//}




I realize the sign bit is the leftmost bit in IEEE754, and I'm thinking of ways to use that, but I'm not sure what's the best way on x86. Thanks. (I have browsed GDnet threads one and two before posting). [Edited by - discman1028 on October 10, 2007 6:28:16 PM]
--== discman1028 ==--
Advertisement
Something like this seems to be branchless, but isn't correct: if mask == -0.0f (0x80000000), it should select a, not b.

But I am more interested in knowing if fetches/stores to "dword ptr [esp+14h]" are expensive. If I see disasm with "dword ptr", is my calculation not happening completely in registers? (I know most of what is shown is just non-inlined-function pre- and post- work, but I'm wondering in general.)

__declspec(noinline) float MyFSel( float mask, float a, float b ){	float fArray[2] = { a, b };	return fArray[ *((u32*)(&mask)) >> 31 ];}// Disasm below...__declspec(noinline) float MyFSel( float mask, float a, float b ){004010E0  sub         esp,8 	float fArray[2] = { a, b };004010E3  fld         dword ptr [esp+10h] 	return fArray[ *((u32*)(&mask)) >> 31 ];004010E7  mov         eax,dword ptr [esp+0Ch] 004010EB  fstp        dword ptr [esp] 004010EE  shr         eax,1Fh 004010F1  fld         dword ptr [esp+14h] 004010F5  fstp        dword ptr [esp+4] 004010F9  fld         dword ptr [esp+eax*4] }004010FC  add         esp,8 004010FF  ret       


EDIT: As a final note, I don't really need to emulate fsel. If I can get a boolean comparison to return an int without a branch, that would be nice too:

__declspec(noinline) int MyGTE(float a, float b){	return (int)(a >= b);004010F0  fld         dword ptr [esp+4] 004010F4  fld         dword ptr [esp+8] 004010F8  fcompp           004010FA  fnstsw      ax   004010FC  test        ah,41h 004010FF  jp          MyGTE+17h (401107h) // argh00401101  mov         eax,1 }


[Edited by - discman1028 on October 10, 2007 7:25:34 PM]
--== discman1028 ==--
Today must be assembly day or something on Gamedev. I think someone else started a thread on getting the compiler to generate cmov and wasn't able to either.

1) I haven't done anything with SSE in ages, but IIRC you have to go through memory, just like PowerPC. It won't kill you, but it may be preferable to a hard to predict branch if you really can't get the compiler to generate fcmov.

2) I have no idea. It doesn't save any time on a sensible vector unit.

3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.
1. Yes it can be, loading to the fpu from SSE requires a store to memory and a load to fpu stack. If you are lucky this will only hit the cache.

2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps. Also SSE float ops work differently than fpu ops. See my journal for more details on those differences (.net optimization part 2 iirc).

3. Branch prediction misses arent as expensive as you think (excepting tigh loop conditions, but if you have a hard to predict condition there then you have issues). Profile and see, and try to stick to compiler intrinsics if you can( journal entry on this too)

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Quote:Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.


I understand the first sentence (although I actually don't see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but how does the 2nd sentence have to do with anything?

Quote:Original post by Washu
3. Branch prediction misses arent as expensive as you think...


Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Thanks guys -- I guess I'll keep working at generating an fcmov.. and maybe try inline asm if I gotta.
--== discman1028 ==--
Quote:Original post by discman1028
Quote:Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.


I understand the first sentence (although I actually don't see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but how does the 2nd sentence have to do with anything?

heh, good point, wasn't thinking clearly here. However by providing a general purpose set of single floating point operations, SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE you avoid having to do a write out to memory and a read in from memory (which may hit the cache if you're lucky... if not, then it's expensive).
Quote:
Quote:Original post by Washu
3. Branch prediction misses arent as expensive as you think...


Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Nope, it's not that branch prediction is so awsome on the x86/x64 platform, it's that memory hits are extremely expensive, so much so that the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory. Herb Sutter had a recent talk on this, and it's well worth viewing the video of it (and the slides side by side). This is especially important in high performance computing.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Quote:Original post by Washu
SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE you avoid having to do a write out to memory and a read in from memory.


Of course.. but anything you could do with *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);

Or you could simply do:

D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);

Or, if A and B and C already had the same uniform value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.

Quote:Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.


Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (even worse is a cache miss, but even if a hit). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!
--== discman1028 ==--
Quote:Original post by discman1028
Of course.. but anything you could do with *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);

Or you could simply do:

D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);

Or, if A and B already had the same value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.

SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel. Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the almost FPU entirely (not all functionality is replicated). Due to the number of SSE registers this allows for significantly more parallel calculation capabilities than the FPU can perform, typically. Especially since the destination can be seperated from the source, something that the FPU stack isn't so great at :)
Quote:
Quote:Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.


Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (even worse is a cache miss, but even if a hit). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!

Again, watch his lecture, it's very much worth it.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Quote:Original post by Washu
SS instructions are minorly faster than the PS variety.


Really? Where did you find that, I couldn't find any documentation on that anywhere. :D

Quote:Original post by Washu
Again, watch his lecture, it's very much worth it.


Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.
--== discman1028 ==--
Quote:Original post by discman1028
Quote:Original post by Washu
SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel.


Really? Where did you find that, I couldn't find any documentation on that anywhere. :D

One of the intel reference manuals relating to optimizing code for the Core2Duo platform.
Quote:
Quote:Original post by Washu
Again, watch his lecture, it's very much worth it.


Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.

Eh, lets leave singletons out of this.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

This topic is closed to new replies.

Advertisement