
Branchless math ops (like fsel)


Hey all, I'm working on some branchless math ops. When using SSE, selecting a vector based on a mask is easy.
// Per-component select: where the mask bits are set take nonzerovect, else zerovect.
_mm_or_ps( _mm_andnot_ps( maskvect, zerovect ), _mm_and_ps( maskvect, nonzerovect ) );

Now I would like to do the same with scalar floats. It is possible on PowerPC with fsel, and I am trying to do the same on x86. I have seen the CMOV/FCMOV instructions, but I do not see any intrinsics exposing them (and I'd like to avoid x86 asm for now; I'm using VS2005). I am trying to get C++ to generate the best branchless asm possible. SSE does have scalar instructions, but they operate on 128-bit vector registers. My questions:

(1) Is it an expensive operation to transfer between SSE vector registers and the x87 floating-point stack in the first place? This assumption is the reason for my goal. It seems like vector <-> scalar transfers might not be as expensive on x86 as they are on certain PowerPC+AltiVec architectures. (I remember, in fact, that MMX actually shared its registers with the x87 stack.)

(2) (I've wondered this for a while.) Why are there scalar SSE instructions that operate on only one component of a 128-bit vector? Does it save processing time?

(3) What are your suggestions?

The naive implementation of fsel in plain C produces the following in an optimized build:
__declspec(noinline) float MyFSel( float mask, float a, float b )
{
	return (mask >= 0.0f ? a : b);
004010E0  fldz             
004010E2  fcomp       dword ptr [esp+4] 
004010E6  fnstsw      ax   
004010E8  test        ah,41h 
004010EB  jp          MyFSel+1Ah (4010FAh) 
004010ED  fld         dword ptr [esp+8] 
004010F1  fstp        dword ptr [esp+4] 
004010F5  fld         dword ptr [esp+4] 
//}

I realize the sign bit is the leftmost bit in IEEE754, and I'm thinking of ways to use that, but I'm not sure what's the best way on x86. Thanks. (I have browsed GDnet threads one and two before posting). [Edited by - discman1028 on October 10, 2007 6:28:16 PM]
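For reference, here is the sign-bit idea spelled out with integer ops. A minimal, untested sketch (FSelBits is just an illustrative name); note that, like the array trick below, it treats -0.0f as negative, which a true (mask >= 0.0f) compare would not:

#include <string.h> // memcpy

typedef unsigned int u32; // assuming a 32-bit unsigned int

__declspec(noinline) float FSelBits( float mask, float a, float b ) // sketch, not from the thread
{
	// Type-pun through memcpy to stay out of strict-aliasing trouble.
	u32 m, ua, ub, r;
	memcpy( &m,  &mask, 4 );
	memcpy( &ua, &a,    4 );
	memcpy( &ub, &b,    4 );
	u32 sel = (u32)0 - ( m >> 31 );    // 0x00000000 if sign clear, 0xFFFFFFFF if set
	r = ( ua & ~sel ) | ( ub & sel );  // sign clear -> a, sign set -> b
	float out;
	memcpy( &out, &r, 4 );
	return out;
}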

Something like this seems to be branchless, but isn't quite correct: if mask == -0.0f (bit pattern 0x80000000), it should select a, not b.

But I am more interested in knowing if fetches/stores to "dword ptr [esp+14h]" are expensive. If I see disasm with "dword ptr", is my calculation not happening completely in registers? (I know most of what is shown is just the non-inlined function's prologue/epilogue, but I'm wondering in general.)


__declspec(noinline) float MyFSel( float mask, float a, float b )
{
	typedef unsigned int u32;  // 32-bit unsigned
	float fArray[2] = { a, b };
	// Index by mask's sign bit: 0 -> a, 1 -> b. (The cast breaks strict
	// aliasing rules; MSVC tolerates it, but memcpy is the portable pun.)
	return fArray[ *((u32*)(&mask)) >> 31 ];
}

// Disasm below...

__declspec(noinline) float MyFSel( float mask, float a, float b )
{
004010E0 sub esp,8
float fArray[2] = { a, b };
004010E3 fld dword ptr [esp+10h]
return fArray[ *((u32*)(&mask)) >> 31 ];
004010E7 mov eax,dword ptr [esp+0Ch]
004010EB fstp dword ptr [esp]
004010EE shr eax,1Fh
004010F1 fld dword ptr [esp+14h]
004010F5 fstp dword ptr [esp+4]
004010F9 fld dword ptr [esp+eax*4]
}
004010FC add esp,8
004010FF ret

EDIT: As a final note, I don't really need to emulate fsel. If I can get a boolean comparison to return an int without a branch, that would be nice too:


__declspec(noinline) int MyGTE(float a, float b)
{
return (int)(a >= b);
004010F0 fld dword ptr [esp+4]
004010F4 fld dword ptr [esp+8]
004010F8 fcompp
004010FA fnstsw ax
004010FC test ah,41h
004010FF jp MyGTE+17h (401107h) // argh
00401101 mov eax,1
}
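
For reference, SSE's scalar compare intrinsics might get this without a branch. A sketch (MyGTE_sse is just an illustrative name), assuming the compiler turns _mm_comige_ss into comiss + setae (worth checking the disasm, and NaN handling differs from the x87 sequence above):

#include <xmmintrin.h>

__declspec(noinline) int MyGTE_sse( float a, float b )
{
	// _mm_comige_ss returns 1 if a >= b, else 0; typically comiss + setae, no branch.
	return _mm_comige_ss( _mm_set_ss( a ), _mm_set_ss( b ) );
}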

[Edited by - discman1028 on October 10, 2007 7:25:34 PM]

Today must be assembly day or something on Gamedev. I think someone else started a thread on getting the compiler to generate cmov and wasn't able to either.

1) I haven't done anything with SSE in ages, but IIRC you have to go through memory, just like PowerPC. It won't kill you, and it may still be preferable to a hard-to-predict branch if you really can't get the compiler to generate fcmov.

2) I have no idea. It doesn't save any time on a sensible vector unit.

3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.

1. Yes it can be: loading the FPU from SSE requires a store to memory and a load onto the FPU stack. If you are lucky, this will only hit the cache.

2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps (see the sketch below). Also, SSE float ops work differently than FPU ops. See my journal for more details on those differences (.NET optimization part 2, IIRC).

3. Branch prediction misses aren't as expensive as you think (excepting tight loop conditions, but if you have a hard-to-predict condition there, then you have issues). Profile and see, and try to stick to compiler intrinsics if you can (there's a journal entry on this too).
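
For point 2, here is roughly what the horizontal-add dance looks like on plain SSE1, where haddps doesn't exist (it arrived with SSE3). A sketch; Dot4 is just an illustrative name:

#include <xmmintrin.h>

inline float Dot4( __m128 a, __m128 b )
{
	__m128 v = _mm_mul_ps( a, b );                      // x y z w
	__m128 s = _mm_add_ps( v, _mm_movehl_ps( v, v ) );  // x+z, y+w, ...
	s = _mm_add_ss( s, _mm_shuffle_ps( s, s, 1 ) );     // (x+z)+(y+w) in lane 0
	float out;
	_mm_store_ss( &out, s );
	return out;
}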

Quote:
Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.


I understand the first sentence (although I actually don't see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but what does the 2nd sentence have to do with anything?

Quote:
Original post by Washu
3. Branch prediction misses arent as expensive as you think...


Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Thanks guys -- I guess I'll keep working at generating an fcmov.. and maybe try inline asm if I gotta.

Quote:
Original post by discman1028
Quote:
Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.


I understand the first sentence (although I actually don't see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but what does the 2nd sentence have to do with anything?

heh, good point, wasn't thinking clearly there. However, by providing a general-purpose set of single floating-point operations, SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating-point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE, you avoid a write out to memory and a read back from memory (which may hit the cache if you're lucky... if not, it's expensive).
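For instance (a sketch; SplatLength2 is just an illustrative name), computing a 2D squared length and broadcasting it without ever leaving the XMM registers:

#include <xmmintrin.h>

__m128 SplatLength2( __m128 v )  // v = (x, y, *, *)
{
	__m128 x2 = _mm_mul_ss( v, v );                    // x*x in lane 0
	__m128 y  = _mm_shuffle_ps( v, v, 1 );             // y into lane 0
	__m128 l2 = _mm_add_ss( x2, _mm_mul_ss( y, y ) );  // x*x + y*y
	return _mm_shuffle_ps( l2, l2, 0 );                // splat lane 0 to all four
}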
Quote:
Quote:
Original post by Washu
3. Branch prediction misses arent as expensive as you think...


Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Nope, it's not that branch prediction is so awesome on the x86/x64 platform, it's that memory hits are extremely expensive, so much so that the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory. Herb Sutter had a recent talk on this, and it's well worth viewing the video (with the slides side by side). This is especially important in high performance computing.

Quote:
Original post by Washu
SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating-point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE, you avoid a write out to memory and a read back from memory.


Of course.. but anything you could do with a *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);

Or you could simply do:

D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);

Or, if A and B and C already had the same uniform value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.

Quote:
Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.


Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (a cache miss is even worse, but even a hit costs something). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!

Quote:
Original post by discman1028
Of course.. but anything you could do with a *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);

Or you could simply do:

D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);

Or, if A and B already had the same value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.

SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel. Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the FPU almost entirely (not all functionality is replicated). Due to the number of SSE registers, this typically allows for significantly more parallel calculation than the FPU can manage. Especially since the destination can be separated from the source, something that the FPU stack isn't so great at :)
Quote:

Quote:
Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.


Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (a cache miss is even worse, but even a hit costs something). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!

Again, watch his lecture, it's very much worth it.

Quote:
Original post by Washu
SS instructions are minorly faster than the PS variety.


Really? Where did you find that? I couldn't find any documentation on it anywhere. :D

Quote:
Original post by Washu
Again, watch his lecture, it's very much worth it.


Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.

Quote:
Original post by discman1028
Quote:
Original post by Washu
SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel.


Really? Where did you find that? I couldn't find any documentation on it anywhere. :D

One of the intel reference manuals relating to optimizing code for the Core2Duo platform.
Quote:

Quote:
Original post by Washu
Again, watch his lecture, it's very much worth it.


Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.

Eh, let's leave singletons out of this.

Quote:
Original post by Washu
Eh, let's leave singletons out of this.


But I take your side! :)

Quote:
Original post by discman1028
So I am still adamant about the unusefulness of the *ss() operations.


Hmmm maybe I lied, they could be of some use. Maybe you could store your W component (or any component), by convention*, of your vector4 in the [0] position. Then you could modify W by the [0] component of any other vector4 without affecting the X,Y,Z components. I'm not sure if that's why SSE chose to include those instructions, but it's one valid use I hadn't thought of.

The same thing could be done in any vector architecture, but you would have to generate a mask vector like <0xFFFFFFFF,0,0,0> in order to apply the change to the [0] component only.

Just trying to think of ways to use the darn things... :)

*by convention b/c if you try to optimize by shuffling, then performing a 1-component *ss() op, then shuffling back, you could have just done a packed operation, then shuffled the result and the original together to create the desired result, saving one instruction.
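
A sketch of what I mean (AddW/AddWMasked are just illustrative names):

#include <xmmintrin.h>

// Update only lane [0] (the "W" slot by this convention); lanes 1..3 pass through.
__m128 AddW( __m128 v, __m128 w )
{
	return _mm_add_ss( v, w );  // lane 0 = v0+w0, lanes 1..3 = v1..v3
}

// The packed equivalent needs the <0xFFFFFFFF,0,0,0> select mask:
__m128 AddWMasked( __m128 v, __m128 w )
{
	__m128 zero = _mm_setzero_ps();
	__m128 mask = _mm_cmpeq_ss( zero, zero );  // all-ones in lane 0 only
	__m128 sum  = _mm_add_ps( v, w );
	return _mm_or_ps( _mm_and_ps( mask, sum ), _mm_andnot_ps( mask, v ) );
}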

Quote:
Original post by discman1028
Quote:
Original post by Washu
Eh, let's leave singletons out of this.


But I take your side! :)

Quote:
Original post by discman1028
So I am still adamant about the unusefulness of the *ss() operations.


Hmmm maybe I lied, they could be of some use. Maybe you could store your W component (or any component), by convention*, of your vector4 in the [0] position. Then you could modify W by the [0] component of any other vector4 without affecting the X,Y,Z components. I'm not sure if that's why SSE chose to include those instructions, but it's one valid use I hadn't thought of.

The same thing could be done in any vector architecture, but you would have to generate a mask vector like <0xFFFFFFFF,0,0,0> in order to apply the change to the [0] component only.

Just trying to think of ways to use the darn things... :)

*by convention b/c if you shuffle, then perform, then shuffle back, you could have just done a packed operation, then shuffled the result and the original together to create the desired result, saving one instruction.

I added stuff to my post, you should read it again.

Quote:
Original post by Washu
I added stuff to my post, you should read it again.


Ahhhh! I added stuff to mine too... heh let me go back to read yours, hold on. This almost seems like a chat session.

Quote:
Original post by Washu
Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the FPU almost entirely (not all functionality is replicated). Due to the number of SSE registers, this typically allows for significantly more parallel calculation than the FPU can manage. Especially since the destination can be separated from the source, something that the FPU stack isn't so great at :)


Is this what you added?

OK, so starting with that sentence, you've switched gears to talk about why SS is better than the FPU, rather than how SS differs from PS...

On other architectures, I know staying in vector registers is optimal, but I wasn't sure if that was the case for SSE (I had heard that SSE and the FPU were a little more tightly coupled on x86 than, e.g., PowerPC with AltiVec). But it is good to confirm that that notion was incorrect: I should perform float ops in vector regs (or at least keep floats in vector regs once they are already there), per your advice, and both PS and SS ops can help me do that.

What I had also pondered for a while was whether SS had anything useful to offer over PS. But I guess you could find uses, like the one I mentioned.

Quote:
Original post by Washu
One of the intel reference manuals relating to optimizing code for the Core2Duo platform.


Thanks, this was a very good reference!

Unfortunately, from the following blurb, I'm not sure whether the scalar SS ops are better than the PS ops only on Core Solo and higher:

Quote:

Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors are improved over Pentium M processors by the following enhancements:
-- Micro-op fusion — Scalar SIMD operations on register and memory have single μop flows comparable to x87 flows. Many packed instructions are fused to reduce its μop flow from four to two μops.

...


I'll post some other facts here (about SS ops versus FPU ops), just for everyone's sake (SSE and SSE2, and probably higher):

Quote:

3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs
===================================================
There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2). The following differences should drive decisions about which registers and instructions to use:
-- When an input operand for a SIMD floating-point instruction contains values that are less than the representable range of the data type, a denormal exception occurs. This causes a significant performance penalty. An SIMD floating-point operation has a flush-to-zero mode in which the results will not underflow. Therefore subsequent computation will not face the performance penalty of handling denormal input operands. For example, in the case of 3D applications with low lighting levels, using flush-to-zero mode can improve performance by as much as 50% for applications with large numbers of underflows.
-- Scalar floating-point SIMD instructions have lower latencies than equivalent x87 instructions. The scalar SIMD floating-point multiply instruction may be pipelined, while the x87 multiply instruction is not.
-- Only x87 supports transcendental instructions.
-- x87 supports 80-bit precision, double extended floating point. SSE supports a maximum of 32-bit precision. SSE2 supports a maximum of 64-bit precision.
-- Scalar floating-point registers may be accessed directly, avoiding FXCH and top-of-stack restrictions.
-- The cost of converting from floating point to integer with truncation is significantly lower with Streaming SIMD Extensions 2 and Streaming SIMD Extensions in the processors based on Intel NetBurst microarchitecture than with either changes to the rounding mode or the sequence prescribed in Example 3-45.

Assembly/Compiler Coding Rule 63. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their x87 counterparts, and they eliminate the overhead associated with the management of the x87 register stack.


Unfortunately, it goes on to say that on a Pentium M, it's better to use the FPU. :(

Quote:

3.8.4.1 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors
===================================================
On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and μop fusion allows instructions which were formerly two, three, and four μops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units. On Pentium M processors, scalar SSE/SSE2 code can experience approximately 30% performance degradation relative to x87 code executing through two floating-point units.



And some more general info:

Quote:

6.4 SCALAR FLOATING-POINT CODE
===================================================
There are SIMD floating-point instructions that operate only on the least-significant operand in the SIMD register. These instructions are known as scalar instructions. They allow the XMM registers to be used for general-purpose floating-point computations.
In terms of performance, scalar floating-point code can be equivalent to or exceed x87 floating-point code and has the following advantages:
-- SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack model. Using scalar floating-point code eliminates the need to use FXCH instructions. These have performance limits on the Intel Pentium 4 processor.
-- Mixing with MMX technology code without penalty.
-- Flush-to-zero mode.
-- Shorter latencies than x87 floating-point.


Now I also must wonder why SSE has _mm_unpackhi_ps()/_mm_unpacklo_ps(), when (at least for a register unpacked with itself) the same results can be had from _mm_shuffle_ps() with different immediate arguments, which also seems to have an equal or lower latency according to the appendix. :-/ (EDIT: Actually seems to depend on the processor. :-/ As if different SSE versions alone weren't hard enough to provide fallbacks for..)
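
(To illustrate: for a register unpacked with itself, the two really do coincide; a quick sketch, with illustrative names:)

#include <xmmintrin.h>

__m128 LoUnpack ( __m128 a ) { return _mm_unpacklo_ps( a, a ); }                       // a0 a0 a1 a1
__m128 LoShuffle( __m128 a ) { return _mm_shuffle_ps( a, a, _MM_SHUFFLE(1,1,0,0) ); }  // same result
// With two distinct operands, though, unpacklo's interleave (a0 b0 a1 b1) can't come
// from a single shuffle, whose low two lanes are both taken from the first operand.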

Quote:
Original post by outRider
3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.


I can't seem to get the compiler to generate this! (Can anyone?)

In any case, assuming I can get inline asm + FCMOV working, I think I will be using FCMOV in situations where it's hard to predict what the answer to the comparison will be.

For a loop condition, e.g., I think I would use straight C++, since the branch prediction would be correct for the majority of the iterations, and I have read that the latency of the branch instructions is less than that of FCMOV. (And thus avoiding the branch with FCMOV is only worthwhile if the result is tough to predict.)
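
Here's roughly what I have in mind for the inline asm version. A sketch, untested (FSelFcmov is just an illustrative name); fcmov needs a P6-class CPU, and MSVC will warn about the missing return statement since the result is deliberately left in st(0):

__declspec(noinline) float FSelFcmov( float mask, float a, float b )
{
	__asm {
		fld    dword ptr [b]     // st0 = b
		fld    dword ptr [a]     // st0 = a, st1 = b
		fldz                     // st0 = 0, st1 = a, st2 = b
		fld    dword ptr [mask]  // st0 = mask, st1 = 0, st2 = a, st3 = b
		fcomip st, st(1)         // compare mask with 0 and pop; CF=1 iff mask < 0
		fstp   st(0)             // pop the zero -> st0 = a, st1 = b
		fcmovb st, st(1)         // if mask < 0, st0 = b (no branch)
		fstp   st(1)             // store result over st1 and pop -> st0 = result
	}
	// x86 returns floats in st(0), so there is no explicit return here.
}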

I'm also beginning to feel wary about this optimization. Although it seems beneficial, who am I to think I can write better code than my compiler, for something as fine-grained as a floating-point compare? But, supposedly it really can be done in two branchless instructions...

Maybe compilers just default to the branching compare sequences because those instructions have lower latency, and they assume that branch prediction is *usually* accurate and beneficial. So I am hoping it would be beneficial to expose a few efficient "impossible-to-predict-result" float compare functions.

EDIT: Agner backs me up on this. ;)

[Edited by - discman1028 on October 11, 2007 4:14:33 PM]

Quote:
Original post by discman1028
I'm also beginning to feel wary about this optimization. Although it seems beneficial, who am I to think I can write better code than my compiler, for something as fine-grained as a floating-point compare? But, supposedly it really can be done in two branchless instructions...

Maybe compilers just default to the branching compare sequences because those instructions have lower latency, and they assume that branch prediction is *usually* accurate and beneficial. So I am hoping it would be beneficial to expose a few efficient "impossible-to-predict-result" float compare functions.

EDIT: Agner backs me up on this. ;)


Trust me, compilers generate the dumbest assembly sometimes, things that no human would ever do. That's not to say you'll beat them every time, especially when you get into the hundreds or thousands of instructions, but I've seen my share of questionable code generated by them.

As for your original problem, try using inline assembly first. I've seen compilers do silly things while trying to get the operands you use into registers that undo the benefit of whatever you're trying to do inline, but hopefully it doesn't happen in your case.

Something that should be noted about the SSE packed vs. single instructions:

Very few processors have execution units that can process an entire SSE packed operation in one go, so these instructions may be split into multiple uops within the pipeline.

On the Pentium 3, the basic SSE packed operations (addition, subtraction) are split into two uops that can only be executed through a single execution port.

Even on the Pentium 3, the SSE single operations produce a single uop (to the same port as the packed operations).

The use of a single execution port on the P3 is a bottleneck because a specific port can only receive 1 uop per cycle -> this is why the reciprocal throughput of the basic SSE packed instructions is never better than 2 cycles per instruction (0.5 packed instructions per cycle).

Also, the P3-class machine can only retire 3 uops per clock cycle, so the packed instructions WILL be consuming 66% of this resource on their retirement cycle vs. 33% for the singles.

The P4-class machines are worse, because the trace cache itself cannot handle more than 3 uops per cycle, in addition to the same 3-uop-per-cycle retirement limit. The P4 does have the advantage that both SSE packed and SSE single instructions produce 1 uop, which is a godsend considering the limits coming and going.

AMD64s produce 2 uops for the basic SSE packed and 1 uop for the basic SSE single, but the AMD64 has a different retirement limit -> it's still 3, but it's 3 macro-ops per cycle rather than 3 micro-ops (uops). Nonetheless, 2 uops consume 2 execution units yet there is only 1 FADD unit -> much the same as the P3 in practice.

