# Branchless math ops (like fsel)


## Recommended Posts

Hey all, I'm working on some branchless math ops. When using SSE, selecting a vector based on a mask is easy.
```cpp
_mm_or_ps( _mm_andnot_ps( maskvect, zerovect ), _mm_and_ps( maskvect, nonzerovect ) );
```


Now, I would like to do the same with scalar floats. It is possible on PowerPC with fsel, but I am trying to do the same on x86. I have seen the CMOV/FCMOV instructions, but I do not see any intrinsics exposing them (and I'd like to avoid x86 asm for now). (Using VS2005.) I am trying to use C++ to generate the best branchless ASM possible. It seems that SSE has a scalar instruction set, but those instructions still operate on 128-bit vector registers. My questions are:

(1) Is it an expensive operation to transfer between SSE vector registers and the x87 floating point stack, in the first place? This assumption is the reason for my goal. It seems like vector <-> scalar transfers might not be as expensive on x86 as they are on certain PowerPC+Altivec architectures. (I remember, in fact, that MMX actually shared its registers with the x87 stack.)

(2) (I've wondered this for a while.) Why are there scalar SSE instructions that operate on only one component of a 128-bit vector? Does it save processing time?

(3) What are your suggestions? The naive implementation of fsel simply using C produces the following in an optimized build:
```
__declspec(noinline) float MyFSel( float mask, float a, float b )
{
	return (mask >= 0.0f ? a : b);
004010E0  fldz
004010E2  fcomp       dword ptr [esp+4]
004010E6  fnstsw      ax
004010E8  test        ah,41h
004010EB  jp          MyFSel+1Ah (4010FAh)
004010ED  fld         dword ptr [esp+8]
004010F1  fstp        dword ptr [esp+4]
004010F5  fld         dword ptr [esp+4]
}
```


I realize the sign bit is the leftmost bit in IEEE754, and I'm thinking of ways to use that, but I'm not sure what's the best way on x86. Thanks. (I have browsed GDnet threads one and two before posting). [Edited by - discman1028 on October 10, 2007 6:28:16 PM]
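For what it's worth, the sign-bit idea can be sketched in portable C++ (a hypothetical `fsel_signbit` helper, not code from this thread; `memcpy` replaces the pointer cast to stay within strict-aliasing rules). An arithmetic shift smears the sign bit into an all-zeros or all-ones mask, which then blends the two inputs exactly like the SSE and/andnot/or idiom above:

```cpp
#include <cstdint>
#include <cstring>

// Branchless select sketch: build an all-ones or all-zeros mask from the
// sign bit of `mask`, then blend a and b with bitwise ops. Note: like the
// array-index trick discussed below, -0.0f has its sign bit set and so
// selects b, which does NOT match PowerPC fsel for that one input.
static float fsel_signbit(float mask, float a, float b)
{
    uint32_t m, ua, ub;
    std::memcpy(&m,  &mask, 4);
    std::memcpy(&ua, &a, 4);
    std::memcpy(&ub, &b, 4);
    // Arithmetic right shift of the sign bit yields 0x00000000 or 0xFFFFFFFF.
    uint32_t sel = (uint32_t)((int32_t)m >> 31);
    uint32_t r = (ua & ~sel) | (ub & sel);
    float out;
    std::memcpy(&out, &r, 4);
    return out;
}
```

Whether the compiler keeps this branchless in the emitted asm is its own decision, but there is no branch at the source level.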

##### Share on other sites
The following seems to be branchless, but isn't quite correct: if mask == -0.0f (0x80000000), it should select a, not b.

But I am more interested in knowing whether fetches/stores to "dword ptr [esp+14h]" are expensive. If I see disasm with "dword ptr", does that mean my calculation is not happening completely in registers? (I know most of what is shown is just non-inlined-function pre- and post-work, but I'm wondering in general.)

```
__declspec(noinline) float MyFSel( float mask, float a, float b )
{
	float fArray[2] = { a, b };
	return fArray[ *((u32*)(&mask)) >> 31 ];
}

// Disasm below...
__declspec(noinline) float MyFSel( float mask, float a, float b )
{
004010E0  sub         esp,8
	float fArray[2] = { a, b };
004010E3  fld         dword ptr [esp+10h]
	return fArray[ *((u32*)(&mask)) >> 31 ];
004010E7  mov         eax,dword ptr [esp+0Ch]
004010EB  fstp        dword ptr [esp]
004010EE  shr         eax,1Fh
004010F1  fld         dword ptr [esp+14h]
004010F5  fstp        dword ptr [esp+4]
004010F9  fld         dword ptr [esp+eax*4]
}
004010FC  add         esp,8
004010FF  ret
```

EDIT: As a final note, I don't really need to emulate fsel. If I can get a boolean comparison to return an int without a branch, that would be nice too:

```
__declspec(noinline) int MyGTE(float a, float b)
{
	return (int)(a >= b);
004010F0  fld         dword ptr [esp+4]
004010F4  fld         dword ptr [esp+8]
004010F8  fcompp
004010FA  fnstsw      ax
004010FC  test        ah,41h
004010FF  jp          MyGTE+17h (401107h) // argh
00401101  mov         eax,1
}
```

[Edited by - discman1028 on October 10, 2007 7:25:34 PM]
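A hedged sketch of one way to get a 0/1 comparison result without a branch (a hypothetical `ge_branchless` helper, not code from this thread): subtract and read the sign bit of the difference. NaN inputs and huge-magnitude operands (where a - b can overflow or lose precision) are not handled here:

```cpp
#include <cstdint>
#include <cstring>

// Branchless (a >= b) sketch: the sign bit of a - b drives the answer.
// Assumption: no NaNs, and a - b does not overflow or round to the wrong
// sign; this is an illustration, not a fully robust comparison.
static int ge_branchless(float a, float b)
{
    float d = a - b;                  // sign of d encodes the comparison
    uint32_t bits;
    std::memcpy(&bits, &d, 4);
    return (int)(1u - (bits >> 31));  // 1 when d >= +0.0f, else 0
}
```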

##### Share on other sites
Today must be assembly day or something on Gamedev. I think someone else started a thread on getting the compiler to generate cmov and wasn't able to either.

1) I haven't done anything with SSE in ages, but IIRC you have to go through memory, just like PowerPC. It won't kill you, and it may still be preferable to a hard-to-predict branch if you really can't get the compiler to generate fcmov.

2) I have no idea. It doesn't save any time on a sensible vector unit.

3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.

##### Share on other sites
1. Yes, it can be: moving data from SSE to the FPU requires a store to memory and a load onto the FPU stack. If you are lucky this will only hit the cache.

2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps. Also, SSE float ops work differently from FPU ops. See my journal for more details on those differences (.NET optimization part 2, IIRC).

3. Branch prediction misses aren't as expensive as you think (excepting tight loop conditions, but if you have a hard-to-predict condition there then you have issues). Profile and see, and try to stick to compiler intrinsics if you can (journal entry on this too).
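For reference, the dot product mentioned in point 2 can be sketched without haddps, using shuffles plus the scalar add for the final reduction (a hypothetical `dot4` helper assuming SSE1-only intrinsics, not code from this thread):

```cpp
#include <xmmintrin.h>

// Dot product without haddps: multiply lanewise, then reduce with
// shuffles; the *ss ops carry the final scalar reduction steps.
static float dot4(const float a[4], const float b[4])
{
    __m128 p  = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); // <xx,yy,zz,ww>
    __m128 hi = _mm_movehl_ps(p, p);           // <zz, ww, zz, ww>
    __m128 s  = _mm_add_ps(p, hi);             // lane0 = xx+zz, lane1 = yy+ww
    __m128 s1 = _mm_shuffle_ps(s, s, 1);       // lane0 = yy+ww
    return _mm_cvtss_f32(_mm_add_ss(s, s1));   // (xx+zz) + (yy+ww)
}
```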

##### Share on other sites
Quote:
 Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.

I understand the first sentence (although I don't actually see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but what does the 2nd sentence have to do with anything?

Quote:
 Original post by Washu
3. Branch prediction misses aren't as expensive as you think...

Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Thanks guys -- I guess I'll keep working at generating an fcmov.. and maybe try inline asm if I gotta.

##### Share on other sites
Quote:
Original post by discman1028
Quote:
 Original post by Washu
2. Because sometimes you only need to operate on one of the elements in the vector. Imagine a dot product in SSE without haddps.

I understand the first sentence (although I don't actually see the utility of operating on the 1st element ONLY, except that it saves you a shuffle instruction or two in case you really wanted to pass thru Y, Z, and W to the result), but what does the 2nd sentence have to do with anything?

heh, good point, wasn't thinking clearly here. However by providing a general purpose set of single floating point operations, SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE you avoid having to do a write out to memory and a read in from memory (which may hit the cache if you're lucky... if not, then it's expensive).
Quote:
Quote:
 Original post by Washu
3. Branch prediction misses aren't as expensive as you think...

Is this just generally because the general x86 architecture kicks ass at pipelining & prefetching & prediction, compared to other architectures? Because "expensive" is all relative, and I am sure that branch misprediction is a pipeline flusher that ain't up to no good.

Nope, it's not that branch prediction is so awesome on the x86/x64 platform, it's that memory hits are extremely expensive, so much so that the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory. Herb Sutter had a recent talk on this, and it's well worth viewing the video of it (and the slides side by side). This is especially important in high performance computing.

##### Share on other sites
Quote:
 Original post by Washu
SSE saves you from having to switch back and forth between the FPU and SSE. You might, for instance, perform a series of calculations to end up at a floating point number, then fill an SSE vector register with that number. By not having to switch between the FPU and SSE you avoid having to do a write out to memory and a read in from memory.

Of course... but anything you could do with a *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

```cpp
D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);
```

Or you could simply do:

```cpp
D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);
```

Or, if A and B and C already had the same uniform value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.
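The two routes being compared can be sketched as one hypothetical helper (`splat_mul_x`, an illustration rather than code from this thread, assuming the values of interest sit in lane 0): both the *ss path and the *ps path leave A.x * B.x in every lane after the splat; the ps version merely does redundant work in the upper lanes.

```cpp
#include <xmmintrin.h>

// Compare the *ss-then-splat route with the *ps-then-splat route.
// After the shuffle, both hold ax*bx in all four lanes.
static float splat_mul_x(float ax, float bx)
{
    __m128 A = _mm_set_ss(ax);               // <ax, 0, 0, 0>
    __m128 B = _mm_set_ss(bx);

    __m128 Dss = _mm_mul_ss(A, B);           // only lane 0 multiplied
    Dss = _mm_shuffle_ps(Dss, Dss, 0);       // broadcast lane 0

    __m128 Dps = _mm_mul_ps(A, B);           // all four lanes multiplied
    Dps = _mm_shuffle_ps(Dps, Dps, 0);       // same broadcast

    // The two results agree lane-for-lane; return the ss-path value.
    (void)Dps;
    return _mm_cvtss_f32(Dss);
}
```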

Quote:
 Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.

Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (a cache miss is even worse, but even a hit costs something). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!

##### Share on other sites
Quote:
 Original post by discman1028
Of course... but anything you could do with a *ss() function (and then splat (or "shuffle") into the whole SSE vector reg), you could do with *ps() functions.

e.g.

D = A*B+C

If the float values you care about were in the X components, you could do:

D = _mm_mul_ss(A,B);
D = _mm_add_ss(D,C);
D = _mm_shuffle_ps(D,D,0);

Or you could simply do:

D = _mm_mul_ps(A,B);
D = _mm_add_ps(D,C);
D = _mm_shuffle_ps(D,D,0);

Or, if A and B already had the same value in all four components, the shuffle isn't even needed in case 2.

So I am still adamant about the unusefulness of the *ss() operations.

SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel. Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the FPU almost entirely (not all functionality is replicated). Due to the number of SSE registers, this typically allows for significantly more parallel calculation than the FPU can manage. Especially since the destination can be separated from the source, something the FPU stack isn't so great at :)
Quote:

Quote:
 Original post by Washu
Nope ... the minor amount of time a branch miss will attract is insignificant compared to the amount of time waiting for something from main memory.

Of course... but as I said, everything's relative! Of COURSE we don't want a mem fetch (a cache miss is even worse, but even a hit costs something). But on a more picky level, we also would like to avoid a mispredicted branch (or even a correctly predicted branch on an architecture that doesn't prefetch instructions well)!

Again, watch his lecture, it's very much worth it.

##### Share on other sites
Quote:
 Original post by Washu
SS instructions are minorly faster than the PS variety.

Really? Where did you find that? I couldn't find any documentation on it anywhere. :D

Quote:
 Original post by Washu
Again, watch his lecture, it's very much worth it.

Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.

##### Share on other sites
Quote:
Original post by discman1028
Quote:
 Original post by Washu
SS instructions are minorly faster than the PS variety. Exactly why this is I'm not sure, since they should be operating entirely in parallel.

Really? Where did you find that? I couldn't find any documentation on it anywhere. :D

One of the Intel reference manuals relating to optimizing code for the Core 2 Duo platform.
Quote:

Quote:
 Original post by Washu
Again, watch his lecture, it's very much worth it.

Will do. I also read your archived journal entries.. very good! Die-hard singleton fans/critic rants are always fun to read too.

Eh, let's leave singletons out of this.

##### Share on other sites
Quote:
 Original post by Washu
Eh, let's leave singletons out of this.

But I take your side! :)

Quote:
 Original post by discman1028
So I am still adamant about the unusefulness of the *ss() operations.

Hmmm, maybe I lied; they could be of some use. Maybe you could, by convention*, store the W component (or any component) of your vector4 in the [0] position. Then you could modify W by the [0] component of any other vector4 without affecting the X, Y, Z components. I'm not sure if that's why SSE chose to include those instructions, but it's one valid use I hadn't thought of.

The same thing could be done in any vector architecture, but you would have to generate a mask vector like <0xFFFFFFFF,0,0,0> in order to apply the change to the [0] component only.

Just trying to think of ways to use the darn things... :)

*by convention b/c if you try to optimize by shuffling, then performing a 1-component *ss() op, then shuffling back, you could have just done a packed operation, then shuffled the result and the original together to create the desired result, saving one instruction.
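The [0]-component-only update described above can be sketched with _mm_add_ss (a hypothetical `add_to_lane0` helper, not code from this thread): the scalar op touches only lane 0 and passes the other three lanes through, which is exactly what a generic vector ISA would need the <0xFFFFFFFF,0,0,0> mask for.

```cpp
#include <xmmintrin.h>

// Add `delta` to v[0] only; v[1..3] are passed through untouched,
// because _mm_add_ss copies the upper lanes of its first operand.
static void add_to_lane0(float v[4], float delta)
{
    __m128 a = _mm_loadu_ps(v);
    __m128 d = _mm_set_ss(delta);   // <delta, 0, 0, 0>
    a = _mm_add_ss(a, d);           // only lane 0 changes
    _mm_storeu_ps(v, a);
}
```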

##### Share on other sites
Quote:
 Original post by Washu
I added stuff to my post, you should read it again.

Ahhhh! I added stuff to mine too... heh let me go back to read yours, hold on. This almost seems like a chat session.

##### Share on other sites
Quote:
 Original post by Washu
Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the FPU almost entirely (not all functionality is replicated). Due to the number of SSE registers, this typically allows for significantly more parallel calculation than the FPU can manage. Especially since the destination can be separated from the source, something the FPU stack isn't so great at :)

OK, so starting with that sentence, you've switched gears to talking about why SS is better than the FPU, instead of why SS is different from PS...

On other architectures, I know staying in vector registers is optimal, but I wasn't sure if that was the case for SSE (I had heard that SSE and the FPU were a little more tightly coupled on x86 than PowerPC w/Altivec, e.g.). But it is good to confirm that that is incorrect: per your advice, I should perform float ops in vector regs (or at least keep floats in vector regs once they are already there), and both PS and SS ops can help me do that.

What I had also pondered for a while was whether SS had anything useful to offer over PS. But I guess you could find uses, like the one I mentioned.

##### Share on other sites
Quote:
 Original post by Washu
One of the Intel reference manuals relating to optimizing code for the Core 2 Duo platform.

Thanks, this was a very good reference!

Unfortunately, from the following blurb, I'm not sure if the scalar SS ops are only better than the PS ops on Core Solo and higher:

Quote:
 Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors are improved over Pentium M processors by the following enhancements:
-- Micro-op fusion — Scalar SIMD operations on register and memory have single μop flows comparable to X87 flows. Many packed instructions are fused to reduce its μop flow from four to two μops.
...

I'll post some other facts here (about SS ops versus FPU ops), just for everyone's sake (SSE and SSE2, and probably higher):

Quote:
 3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs
===================================================
There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2). The following differences should drive decisions about which registers and instructions to use:
-- When an input operand for a SIMD floating-point instruction contains values that are less than the representable range of the data type, a denormal exception occurs. This causes a significant performance penalty. An SIMD floating-point operation has a flush-to-zero mode in which the results will not underflow. Therefore subsequent computation will not face the performance penalty of handling denormal input operands. For example, in the case of 3D applications with low lighting levels, using flush-to-zero mode can improve performance by as much as 50% for applications with large numbers of underflows.
-- Scalar floating-point SIMD instructions have lower latencies than equivalent x87 instructions. Scalar SIMD floating-point multiply instruction may be pipelined, while x87 multiply instruction is not.
-- Only x87 supports transcendental instructions.
-- x87 supports 80-bit precision, double extended floating point. SSE support a maximum of 32-bit precision. SSE2 supports a maximum of 64-bit precision.
-- Scalar floating-point registers may be accessed directly, avoiding FXCH and top-of-stack restrictions.
-- The cost of converting from floating point to integer with truncation is significantly lower with Streaming SIMD Extensions 2 and Streaming SIMD Extensions in the processors based on Intel NetBurst microarchitecture than with either changes to the rounding mode or the sequence prescribed in the Example 3-45.

Assembly/Compiler Coding Rule 63. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency then their X87 counterpart and they eliminate the overhead associated with the management of the X87 register stack.

Unfortunately, it goes on to say that on a Pentium M, it's better to use the FPU. :(

Quote:
 3.8.4.1 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors
===================================================
On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and μop fusion allows instructions which were formerly two, three, and four μops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units. On Pentium M processors, scalar SSE/SSE2 code can experience approximately 30% performance degradation relative to x87 code executing through two floating-point units.

And some more general info:

Quote:
 6.4 SCALAR FLOATING-POINT CODE
===================================================
There are SIMD floating-point instructions that operate only on the least-significant operand in the SIMD register. These instructions are known as scalar instructions. They allow the XMM registers to be used for general-purpose floating-point computations. In terms of performance, scalar floating-point code can be equivalent to or exceed x87 floating-point code and has the following advantages:
-- SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack model. Using scalar floating-point code eliminates the need to use FXCH instructions. These have performance limits on the Intel Pentium 4 processor.
-- Mixing with MMX technology code without penalty.
-- Flush-to-zero mode.
-- Shorter latencies than x87 floating-point.

Now I also must wonder why SSE has _mm_unpackhi_ps()/_mm_unpacklo_ps(), when all of those can be performed with _mm_shuffle_ps() w/different immediate val arguments, which also seems to have an equal or lower latency according to the appendix. :-/ (EDIT: Actually seems to depend on the processor. :-/ As if different SSE versions alone weren't hard enough to provide fallbacks for..)

##### Share on other sites
Quote:
 Original post by outRider
3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.

I can't seem to get the compiler to generate this! (Can anyone?)

In any case, assuming I can get inline asm + FCMOV working, I think I will be using FCMOV in situations where it's hard to predict what the answer to the comparison will be.

For, e.g., a loop condition, I think I would use straight C++, since the branch prediction would be correct for the majority of the iterations, and I have read that the latency of the branch instructions is less than that of FCMOV. (And thus avoiding the branch with FCMOV is only worthwhile if it's tough to predict the result.)

##### Share on other sites
I'm also beginning to feel wary about this optimization. Although it seems beneficial, who am I to think I can write better code than my compiler, for something as fine-grained as a floating-point compare? But, supposedly it really can be done in two branchless instructions...

Maybe compilers just default to the Intel branching assembly instructions for compares due to the lower latency of the instructions, and they assume that branch prediction is *usually* accurate and beneficial. So I am hoping it would be beneficial to expose a few efficient "impossible-to-predict-result" float compare functions.

EDIT: agner backs me up on this. ;)

[Edited by - discman1028 on October 11, 2007 4:14:33 PM]
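One concrete example of the kind of branchless float helper being proposed (a hypothetical sketch, not code from this thread, using SSE's minss, which is a single branch-free instruction no matter how unpredictable the operands are):

```cpp
#include <xmmintrin.h>

// Branchless scalar min via SSE: minss has no branch to mispredict,
// so its cost is the same whether or not the comparison is predictable.
static float fmin_branchless(float a, float b)
{
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```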

##### Share on other sites
Quote:
 Original post by discman1028
I'm also beginning to feel wary about this optimization. Although it seems beneficial, who am I to think I can write better code than my compiler, for something as fine-grained as a floating-point compare? But, supposedly it really can be done in two branchless instructions... Maybe compilers just default to the Intel branching assembly instructions for compares due to the lower latency of the instructions, and they assume that branch prediction is *usually* accurate and beneficial. So I am hoping it would be beneficial to expose a few efficient "impossible-to-predict-result" float compare functions. EDIT: agner backs me up on this. ;)

Trust me, compilers generate the dumbest assembly sometimes, things that no human would ever do. That's not to say you'll beat them every time, especially when you get into the hundreds or thousands of instructions, but I've seen my share of questionable code generated by them.

As for your original problem, try using inline assembly first. I've seen compilers do silly things while trying to get the operands you use into registers that undo the benefit of whatever you're trying to do inline, but hopefully it doesn't happen in your case.

##### Share on other sites
Thanks for the advice. I hope there are no problems either.

FYI, I moved to a clean thread to discuss my inline __asm attempts, here. I also started it in the more appropriate General Programming forum.

##### Share on other sites
Something that should be noted about the SSE packed vs single instructions..

..very few processors have execution units that can process an entire SSE packed operation in one go.. so these instructions may be split into multiple uops within the pipeline..

On the Pentium 3, the basic SSE packed operations (addition, subtraction) are split into two uops that can only be executed through a single execution port

Even on the Pentium 3, the SSE single operations produce a single uop (to the same port as the packed operations)

The use of a single execution port on the P3 is a bottleneck because a specific port can only receive 1 uop per cycle -> this is why the reciprocal throughput of the basic SSE Packed instructions is never better than 2 cycles per instruction (0.5 packed instructions per cycle)

Also, the P3 class machine can only retire 3 uops per clock cycle, so the packed instructions WILL be consuming 66% of this resource on their retirement cycle vs 33% for the singles..

The P4 class machines are worse, because the trace cache itself cannot handle more than 3 uops per cycle in addition to the same 3 uop per cycle retirement limit. The P4 does have the advantage in that both SSE Packed and SSE Single instructions produce 1 uop .. which is a godsend considering the limits coming and going..

AMD64s produce 2 uops for the basic SSE Packed and 1 uop for the basic SSE Single, but the AMD64 has a different retirement limit -> it's still 3, but it's 3 macro-ops per cycle rather than 3 micro-ops (uops) .. nonetheless, 2 uops consume 2 execution units yet there is only 1 FADD unit -> much the same as the P3 in practice.
