Branchless math ops (like fsel)

Quote:Original post by Washu
Eh, lets leave singletons out of this.


But I take your side! :)

Quote:Original post by discman1028
So I am still adamant about the uselessness of the *ss() operations.


Hmmm, maybe I lied: they could be of some use. You could, by convention*, store the W component (or any component) of your vector4 in the [0] position. Then you could modify W by the [0] component of any other vector4 without affecting the X, Y, Z components. I'm not sure if that's why the SSE designers included those instructions, but it's one valid use I hadn't thought of.

The same thing could be done on any vector architecture, but you would have to generate a mask vector like <0xFFFFFFFF,0,0,0> in order to apply the change to the [0] component only.

Just trying to think of ways to use the darn things... :)

*by convention, because if you try to optimize by shuffling, then performing a 1-component *ss() op, then shuffling back, you could instead have done a packed operation and then shuffled the result and the original together to create the desired result, saving one instruction. (See the sketch below.)
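
For what it's worth, here's a minimal sketch of both approaches in intrinsics (the function names are mine, and this is untested):

#include <xmmintrin.h>

// Add only the [0] lanes of two vector4s, leaving lanes 1-3 of v untouched.
__m128 add_lane0_scalar(__m128 v, __m128 w)
{
    return _mm_add_ss(v, w);                 // <v0+w0, v1, v2, v3>
}

// Same result via a packed add followed by a merge of result and original.
__m128 add_lane0_packed(__m128 v, __m128 w)
{
    return _mm_move_ss(v, _mm_add_ps(v, w)); // _mm_move_ss(a, b) = <b0, a1, a2, a3>
}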
--== discman1028 ==--
Quote:Original post by discman1028
Just trying to think of ways to use the darn things... :)

I added stuff to my post, you should read it again.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

Quote:Original post by Washu
I added stuff to my post, you should read it again.


Ahhhh! I added stuff to mine too... heh let me go back to read yours, hold on. This almost seems like a chat session.
--== discman1028 ==--
Quote:Original post by Washu
Also, the SS instructions allow you to operate on single floats throughout, thus avoiding the FPU almost entirely (not all functionality is replicated). Due to the number of SSE registers, this typically allows for significantly more parallel calculation than the FPU can manage. Especially since the destination can be separated from the source, something the FPU stack isn't so great at :)


Is this what you added?

OK, so starting with that sentence, you've switched gears to talk about why SS is better than the FPU, instead of how SS differs from PS...

On other architectures, I know staying in vector registers is optimal, but I wasn't sure that was the case for SSE (I had heard that SSE and the FPU were a little more tightly coupled on x86 than, say, AltiVec and the FPU are on PowerPC). It's good to confirm that this is incorrect: per your advice, I should perform float ops in vector registers (or at least keep floats there once they've arrived), and both PS and SS ops can help me do that.

What I had also pondered for a while was whether SS had anything useful to offer over PS. But I guess you could find uses, like the one I mentioned.
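
To make that concrete, here's a sketch (my own, untested) of scalar float math staying in XMM registers the whole way, never touching the x87 stack:

#include <xmmintrin.h>

// Evaluate a*x + b entirely in XMM registers (assuming the compiler
// doesn't spill); only the final extract returns to a plain float.
float madd_scalar(float a, float x, float b)
{
    __m128 r = _mm_add_ss(_mm_mul_ss(_mm_set_ss(a), _mm_set_ss(x)),
                          _mm_set_ss(b));
    return _mm_cvtss_f32(r); // lane [0] back out
}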
--== discman1028 ==--
Quote:Original post by Washu
One of the Intel reference manuals relating to optimizing code for the Core 2 Duo platform.


Thanks, this was a very good reference!

Unfortunately, from the following blurb, I can't tell whether the scalar SS ops beat the PS ops only on Core Solo and later:

Quote:
Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is
improved over Pentium M processors by the following enhancements:
-- Micro-op fusion: Scalar SIMD operations on register and memory have single
μop flows comparable to x87 flows. Many packed instructions are fused to reduce
their μop flow from four to two μops.

...


I'll post some other facts here (about SS ops versus FPU ops), just for everyone's sake (SSE and SSE2, and probably higher):

Quote:
3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs
===================================================
There are a number of differences between x87 floating-point code and scalar
floating-point code (using SSE and SSE2). The following differences should drive
decisions about which registers and instructions to use:
-- When an input operand for a SIMD floating-point instruction contains values that
are less than the representable range of the data type, a denormal exception
occurs. This causes a significant performance penalty. An SIMD floating-point
operation has a flush-to-zero mode in which the results will not underflow.
Therefore subsequent computation will not face the performance penalty of
handling denormal input operands. For example, in the case of 3D applications
with low lighting levels, using flush-to-zero mode can improve performance by as
much as 50% for applications with large numbers of underflows.
-- Scalar floating-point SIMD instructions have lower latencies than equivalent x87
instructions. The scalar SIMD floating-point multiply instruction may be pipelined,
while the x87 multiply instruction is not.
-- Only x87 supports transcendental instructions.
-- x87 supports 80-bit precision, double extended floating point. SSE supports a
maximum of 32-bit precision. SSE2 supports a maximum of 64-bit precision.
-- Scalar floating-point registers may be accessed directly, avoiding FXCH and
top-of-stack restrictions.
-- The cost of converting from floating point to integer with truncation is
significantly lower with Streaming SIMD Extensions 2 and Streaming SIMD
Extensions on processors based on Intel NetBurst microarchitecture than with
either changes to the rounding mode or the sequence prescribed in Example 3-45.
Assembly/Compiler Coding Rule 63. (M impact, M generality) Use Streaming
SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature.
Most SSE2 arithmetic operations have shorter latency than their x87 counterparts,
and they eliminate the overhead associated with the management of the x87
register stack.
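
As an aside, the truncating conversion the manual mentions is exposed directly as an intrinsic; a quick sketch (mine, not from the manual):

#include <xmmintrin.h>

// Truncating float->int conversion via SSE (cvttss2si); no need to
// fiddle with the x87 rounding mode.
int ftoi_trunc(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x)); // truncates toward zero
}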


Unfortunately, the manual goes on to say that on a Pentium M, it's better to use the FPU. :(

Quote:
3.8.4.1 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core
Duo Processors
===================================================
On Intel Core Solo and Intel Core Duo processors, the combination of improved
decoding and μop fusion allows instructions which were formerly two, three, and four
μops to go through all decoders. As a result, scalar SSE/SSE2 code can match the
performance of x87 code executing through two floating-point units. On Pentium M
processors, scalar SSE/SSE2 code can experience approximately 30% performance
degradation relative to x87 code executing through two floating-point units.



And some more general info:

Quote:
6.4 SCALAR FLOATING-POINT CODE
===================================================
There are SIMD floating-point instructions that operate only on the least-significant
operand in the SIMD register. These instructions are known as scalar instructions.
They allow the XMM registers to be used for general-purpose floating-point computations.
In terms of performance, scalar floating-point code can match or exceed
x87 floating-point code and has the following advantages:
-- SIMD floating-point code uses a flat register model, whereas x87 floating-point
code uses a stack model. Using scalar floating-point code eliminates the need to
use FXCH instructions. These have performance limits on the Intel Pentium 4
processor.
-- Mixing with MMX technology code without penalty.
-- Flush-to-zero mode.
-- Shorter latencies than x87 floating-point.
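
Since flush-to-zero mode comes up twice above, here's how it's enabled via the SSE control/status register (a sketch; note this affects SSE math only, and the related denormals-are-zero (DAZ) bit is a separate, later addition):

#include <xmmintrin.h>

// Enable flush-to-zero: SSE results that would underflow to a denormal
// are flushed to 0, avoiding the denormal penalty described above.
void enable_ftz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
}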


Now I also must wonder why SSE has _mm_unpackhi_ps()/_mm_unpacklo_ps(), when (at least for a single source register) the same results can be produced with _mm_shuffle_ps() with different immediate arguments, which also seems to have an equal or lower latency according to the appendix. :-/ (For two distinct sources, though, a single shuffle can't interleave them the way unpack does, since shuffle draws its low two lanes from the first operand only.) (EDIT: Actually the latency seems to depend on the processor. :-/ As if different SSE versions alone weren't hard enough to provide fallbacks for..)
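
For the single-register case, the equivalence looks like this (my sketch, untested):

#include <xmmintrin.h>

// _mm_unpacklo_ps(a, a) = <a0, a0, a1, a1>, reproducible with one shuffle:
__m128 unpacklo_via_shuffle(__m128 a)
{
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 0, 0)); // <a0, a0, a1, a1>
}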
--== discman1028 ==--
Quote:Original post by outRider
3) If you need cmov or fcmov and can't get the compiler to generate it for you, do it yourself with some inline assembly and hope the compiler doesn't do anything stupid trying to get your operands into the right registers.


I can't seem to get the compiler to generate this! (Can anyone?)

In any case, assuming I can get inline asm + FCMOV working, I think I will be using FCMOV in situations where it's hard to predict what the answer to the comparison will be.

For, e.g., a loop condition, I think I would use straight C++, since the branch prediction would be correct for the majority of the iterations, and I have read that the latency of the branch instructions is less than that of FCMOV. (Thus avoiding the branch with FCMOV is only worthwhile if the result is tough to predict.)
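
For reference, a sketch of what the inline asm might look like (MSVC-style x86 assembly; FCMOVcc requires a P6 or later; this is my own untested attempt, not compiler output):

// Branchless fsel via FCMOV: returns (test >= 0.0f) ? a : b.
float fsel_fcmov(float test, float a, float b)
{
    float result;
    __asm {
        fld   test               // st0 = test
        fldz                     // st0 = 0, st1 = test
        fcomip st(0), st(1)      // EFLAGS = compare(0, test); pops st0
        fstp  st(0)              // pop test; x87 stack now empty
        fld   b                  // st0 = b
        fld   a                  // st0 = a, st1 = b
        fcmovnbe st(0), st(1)    // if 0 > test (CF=0 and ZF=0): st0 = b
        fstp  result             // store the selected value, pop
        fstp  st(0)              // pop the leftover
    }
    return result;
}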
--== discman1028 ==--
I'm also beginning to feel wary about this optimization. Although it seems beneficial, who am I to think I can write better code than my compiler for something as fine-grained as a floating-point compare? But supposedly it really can be done in a couple of branchless instructions...

Maybe compilers just default to the branching compare instructions due to their lower latency, assuming that branch prediction is *usually* accurate and beneficial. So I am hoping it will be worthwhile to expose a few efficient float-compare functions for cases where the result is impossible to predict.

EDIT: Agner backs me up on this. ;)
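
For the record, the branchless compare/select I have in mind looks something like this in intrinsics (my own sketch, untested; the core is a cmpss mask plus an and/andnot/or blend):

#include <xmmintrin.h>

// Branchless fsel via SSE: returns (test >= 0.0f) ? a : b.
float fsel_sse(float test, float a, float b)
{
    __m128 mask = _mm_cmpge_ss(_mm_set_ss(test), _mm_setzero_ps()); // all-1s or all-0s in lane 0
    __m128 r    = _mm_or_ps(_mm_and_ps(mask, _mm_set_ss(a)),
                            _mm_andnot_ps(mask, _mm_set_ss(b)));
    return _mm_cvtss_f32(r);
}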

--== discman1028 ==--
Quote:Original post by discman1028
Who am I to think I can write better code than my compiler for something as fine-grained as a floating-point compare? But supposedly it really can be done in a couple of branchless instructions...


Trust me, compilers generate the dumbest assembly sometimes, things that no human would ever do. That's not to say you'll beat them every time, especially when you get into the hundreds or thousands of instructions, but I've seen my share of questionable code generated by them.

As for your original problem, try using inline assembly first. I've seen compilers do silly things getting your operands into registers, undoing the benefit of whatever you're trying to do inline, but hopefully that doesn't happen in your case.
Thanks for the advice. I hope there are no problems either.

FYI, I moved to a clean thread to discuss my inline __asm attempts, here. I also started it in the more appropriate General Programming forum.
--== discman1028 ==--
Something that should be noted about the SSE packed vs single instructions..

..very few processors have execution units that can process an entire SSE packed operation in one go.. so these instructions may be split into multiple uops within the pipeline..

On the Pentium 3, the basic SSE packed operations (addition, subtraction) are split into two uops that can only be executed through a single execution port

Even on the Pentium 3, the SSE single operations produce a single uop (to the same port as the packed operations)

The use of a single execution port on the P3 is a bottleneck because a specific port can only receive 1 uop per cycle -> this is why the reciprocal throughput of the basic SSE Packed instructions is never better than 2 cycles per instruction (0.5 packed instructions per cycle)

Also, the P3 class machine can only retire 3 uops per clock cycle, so the packed instructions WILL be consuming 66% of this resource on their retirement cycle vs 33% for the singles..

The P4 class machines are worse, because the trace cache itself cannot handle more than 3 uops per cycle, in addition to the same 3-uop-per-cycle retirement limit. The P4 does have the advantage in that both SSE Packed and SSE Single instructions produce 1 uop.. which is a god-send considering the limits coming and going..

AMD64s produce 2 uops for the basic SSE Packed and 1 uop for the basic SSE Single, but the AMD64 has a different retirement limit -> it's still 3, but it's 3 macro-ops per cycle rather than 3 micro-ops (uops).. nonetheless, 2 uops consume 2 execution units yet there is only 1 FADD unit -> much the same as the P3 in practice.

