SSE Performance

Started by Jan K
13 comments, last by momotte 14 years, 1 month ago
Hi,

I am currently learning a bit of SSE(2) programming, using intrinsics.

As far as I can tell from searching these forums and Google, it seems to be a common misconception that SSE is generally faster than x87 code. One article mentioned that as long as one doesn't do at least a 4x4 matrix multiplication or something like an inverse square root, most vector/matrix code will be slower. And from my own tests I can confirm that: I have written a small 4x4 matrix class and a 4-component vector class, and in comparison to the old code everything with a square root in it is faster, but everything else is mostly a third slower.

Now, my tests are artificial, in that I just call one function thousands of times. What I would like to know from more experienced people is the following:

1) Might SSE be faster once I do more complex things with my matrix/vector code? I mean, if I have a longer mathematical expression, due to inlining the compiler should be able to get more out of SSE code than out of x87 code. So could it be that SSE is faster in "normal" code, compared to artificially calling one function over and over? (Note: my typical use case is a 3D game/graphics engine.)

2) What exactly is the reason that x87 code is often faster? First I read somewhere that SSE uses the same registers as x87 and that the CPU needs to switch modes, which takes time. But then I read somewhere else that this was true for MMX, but not for SSE anymore. Which is correct? And if the CPU does not need to switch modes, why is SSE not always faster?

For the record: all my SSE data is 16-byte aligned, and when I say something is slower, I am pretty sure it cannot be done with fewer instructions, because I mostly copied well-optimized code (or checked that I had the same solution as others).

Any information and insights from more experienced people are welcome. I'm mostly trying to understand SSE better and learn more about its performance characteristics.

Thanks,
Jan.
1 - Yes, although if you're benchmarking a partially unrolled, vectorized float4 addition in a tight loop without any cache accesses, I doubt you'll get better performance by throwing in some other random code.

2 - That was true for MMX, not SSE; SSE has its own eight 128-bit registers. And SSE is not always faster because some operations are not "vector-friendly": cross or dot products (in SSE2 at least), horizontal component combination, and so on (see the sketch below).
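For instance, a dot product in plain SSE needs a cascade of shuffles just to sum the four products horizontally; haddps only arrived with SSE3, dpps with SSE4.1. A minimal sketch (mine, not from the thread):

#include <xmmintrin.h>

// Dot product of two 4-component vectors in plain SSE:
// one multiply, then shuffles and adds just to sum the
// four products horizontally.
inline float Dot4 (__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps (a, b);
    __m128 t = _mm_add_ps (m, _mm_shuffle_ps (m, m, _MM_SHUFFLE (2, 3, 0, 1)));   // pairwise sums
    __m128 r = _mm_add_ps (t, _mm_shuffle_ps (t, t, _MM_SHUFFLE (1, 0, 3, 2)));   // total in every lane
    return _mm_cvtss_f32 (r);   // extract lane 0
}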

From my experience, most performance drops occur when you use the packed instructions and need excessive shuffling, or have one or more components unused in your vector.
You might get better performance using scalar ops, even if that means more instructions.

(About unused components, the first example that comes to mind is using 4-vector ops on 3-vectors. You can consider packing four float3s (xyz xyz xyz xyz) into three float4s (xxxx yyyy zzzz) and using 3 vector instructions that operate on all components, instead of 4 that each ignore one component. But doing so will generally end up eating more registers: for example, you'll need 3 broadcast float4 registers to add a simple float3 constant to the three float4s above, instead of the single register you'd need with the xyz layout. So using structures of arrays is not always a good idea; it depends a lot on the context. A sketch of both layouts follows below.)
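Something like this (a hedged sketch, all names mine):

#include <xmmintrin.h>

// AoS: one xyz vector per register; the whole float3 constant
// fits into a single register (the w lane carries a 0 and is ignored).
inline __m128 AddConstantAoS (__m128 xyz_, __m128 k_xyz0)
{
    return _mm_add_ps (xyz_, k_xyz0);   // one add, one lane wasted
}

// SoA: four float3s transposed into xxxx / yyyy / zzzz.
// Every lane does useful work, but the same float3 constant
// now needs three broadcast registers.
inline void AddConstantSoA (__m128& xs, __m128& ys, __m128& zs,
                            float kx, float ky, float kz)
{
    xs = _mm_add_ps (xs, _mm_set1_ps (kx));
    ys = _mm_add_ps (ys, _mm_set1_ps (ky));
    zs = _mm_add_ps (zs, _mm_set1_ps (kz));
}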

And I suppose you already know this, but having fewer instructions is not necessarily a good thing. What matters is good instruction ordering: keeping the different CPU units busy, minimising data dependencies to avoid pipeline stalls, and overlapping instruction latencies as much as possible.

SSE is also extremely good for conversion operations: not only float/int, but things like float[0,1]/u16 and float[0,1]/u8 (which can be blazingly fast once you start processing 8 or 16 values at a time). String operations can get quite a speedup too (and I'm not talking about the string-specific instructions in the later SSE versions). An example conversion is sketched below.
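For instance, a sketch (mine; it assumes the inputs are in [0,1] and the pointer is 16-byte aligned) that converts 16 floats to 16 unsigned bytes in one pass:

#include <emmintrin.h>   // SSE2

// The pack instructions saturate, so slightly out-of-range
// values clamp for free.
inline __m128i Floats01ToU8 (const float* src)   // 16 floats, 16-byte aligned
{
    const __m128 scale = _mm_set1_ps (255.0f);
    __m128i a = _mm_cvtps_epi32 (_mm_mul_ps (_mm_load_ps (src +  0), scale));
    __m128i b = _mm_cvtps_epi32 (_mm_mul_ps (_mm_load_ps (src +  4), scale));
    __m128i c = _mm_cvtps_epi32 (_mm_mul_ps (_mm_load_ps (src +  8), scale));
    __m128i d = _mm_cvtps_epi32 (_mm_mul_ps (_mm_load_ps (src + 12), scale));
    return _mm_packus_epi16 (_mm_packs_epi32 (a, b),    // 8 x i16 each, signed saturation
                             _mm_packs_epi32 (c, d));   // -> 16 x u8, unsigned saturation
}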
CPUs are really good at data-parallel operations (which is a crappy explanation, admittedly); SIMD and GPUs are all about that on beefier hardware.

SIMD is just a CPU-friendly way to formulate an algorithm as data parallel.

Many problems cannot be expressed that way. Some are naturally data parallel; some are even called embarrassingly parallel. For those that can be, intrinsics are a way of encoding such an algorithm in the most hardware-friendly way.

Quote:What exactly is the reason, that x87 code is often faster

- The algorithm is not suitable for data parallelism
- The algorithm isn't applied properly

Code which relies heavily on conditional execution, or which has highly non-deterministic control flow, is not suitable for most forms of data parallelism, with the exception of certain types of hardware or an abundance of memory. SIMD has neither; GPUs are only barely better. In SIMD, a per-element branch typically becomes mask arithmetic, as sketched below.
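A minimal sketch (mine, not from the thread) of that mask idiom:

#include <xmmintrin.h>

// SIMD has no per-lane branching: evaluate both paths, then
// blend them with a comparison mask (cmp / and / andnot / or).
inline __m128 SelectGreater (__m128 x, __m128 threshold,
                             __m128 ifGreater, __m128 otherwise)
{
    __m128 mask = _mm_cmpgt_ps (x, threshold);            // all-ones where x > threshold
    return _mm_or_ps (_mm_and_ps    (mask, ifGreater),    // lanes where the test passed
                      _mm_andnot_ps (mask, otherwise));   // lanes where it failed
}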

Quote:that i just call one function thousands of times
The overhead here is a killer.

The point of SIMD (especially the Streaming part) is to call a function once and have it process a million elements in that same call. Same for GPUs and batching. Minimize the number of calls, and especially data conversions. Ideally, you compute the square root of a million values in one call, not in a million calls of one square root each.
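As a sketch (function and array names are mine), the two styles look like this:

#include <cmath>
#include <cstddef>
#include <xmmintrin.h>

// One call per element: mostly call overhead.
void SqrtScalar (const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::sqrt (in[i]);
}

// One call for the whole array: four square roots per instruction,
// data streaming through aligned loads and stores.
void SqrtBatched (const float* in, float* out, std::size_t n)   // n multiple of 4, pointers 16-byte aligned
{
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps (out + i, _mm_sqrt_ps (_mm_load_ps (in + i)));
}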

Think Formula 1. It needs to be brought to the race track by truck, it needs to be tuned, ... but then it keeps going 300 mph for the 3 hours it is on the track. That is SIMD: it takes time to prepare, but then it's bitchin' fast.
Calling a function for each element is a Formula 1 in downtown New York: 50 mph, red light, 50 mph, red light, 50 mph, red light, 50 mph, red light.
Quote:Original post by Jan K
Hi

I am currently learning a bit of SSE(2) programming, using intrinsics.

As far as I can tell from searching these forums and Google, it seems to be a common misconception that SSE is generally faster than x87 code. One article mentioned that as long as one doesn't do at least a 4x4 matrix multiplication or something like an inverse square root, most vector/matrix code will be slower. And from my own tests I can confirm that: I have written a small 4x4 matrix class and a 4-component vector class, and in comparison to the old code everything with a square root in it is faster, but everything else is mostly a third slower.

Now, my tests are artificial, in that I just call one function thousands of times. What I would like to know from more experienced people is the following:

1) Might SSE be faster once I do more complex things with my matrix/vector code? I mean, if I have a longer mathematical expression, due to inlining the compiler should be able to get more out of SSE code than out of x87 code. So could it be that SSE is faster in "normal" code, compared to artificially calling one function over and over? (Note: my typical use case is a 3D game/graphics engine.)

2) What exactly is the reason that x87 code is often faster? First I read somewhere that SSE uses the same registers as x87 and that the CPU needs to switch modes, which takes time. But then I read somewhere else that this was true for MMX, but not for SSE anymore. Which is correct? And if the CPU does not need to switch modes, why is SSE not always faster?

For the record: all my SSE data is 16-byte aligned, and when I say something is slower, I am pretty sure it cannot be done with fewer instructions, because I mostly copied well-optimized code (or checked that I had the same solution as others).

Any information and insights from more experienced people are welcome. I'm mostly trying to understand SSE better and learn more about its performance characteristics.

Thanks,
Jan.


Matrix * matrix, vector * matrix, clamping a vector, taking the component-wise minimum or maximum of a vector, even vector comparison: all of these are faster using SSE. Try showing the code you're using; you're probably making some mistake.

I personally get more than twice the speed doing matrix * matrix, for example; the usual shape of that code is sketched below.
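For reference (a sketch of mine, assuming the matrix columns are stored as four __m128 values; the poster's actual code may differ):

#include <xmmintrin.h>

// Column-major 4x4 multiply: each result column is a linear
// combination of a's columns, weighted by one column of b.
inline void Mat4Mul (const __m128 a[4], const __m128 b[4], __m128 out[4])
{
    for (int i = 0; i < 4; ++i)
    {
        __m128 bi = b[i];
        __m128 c = _mm_mul_ps (a[0], _mm_shuffle_ps (bi, bi, _MM_SHUFFLE (0, 0, 0, 0)));
        c = _mm_add_ps (c, _mm_mul_ps (a[1], _mm_shuffle_ps (bi, bi, _MM_SHUFFLE (1, 1, 1, 1))));
        c = _mm_add_ps (c, _mm_mul_ps (a[2], _mm_shuffle_ps (bi, bi, _MM_SHUFFLE (2, 2, 2, 2))));
        c = _mm_add_ps (c, _mm_mul_ps (a[3], _mm_shuffle_ps (bi, bi, _MM_SHUFFLE (3, 3, 3, 3))));
        out[i] = c;
    }
}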
Hi Jan K,
I have been through the same thing learning SSE. I tried to speed up my 4D vector class with it and made multiple benchmarks comparing my code against the compiler-generated version.

I measured the time needed to do the operation in a loop a million times. The best result I got was being 1.1x faster than the compiler. But during the testing something was weird: the factor of 1.1 did not change, but when I swapped the order of the last two mov instructions in my SSE code, it ran 10 times slower/faster. The crazy thing was that the compiler's code did too. So I found out that my processor (an Intel Core 2 Duo) analyzes the data flow and automatically prefetches to speed things up. Apparently my processor learned to optimize my code and did the same with the compiler's code that ran after mine.

Finally, I must say: the processor's automatic prefetching makes measurements in loops unpredictable, because it will not behave the same in your real program, which is surely not supposed to run one single operation a million times but a series of operations.

In my final program the speedup from SSE was between 2x and 6x. So it might be worth testing!

One last word on the "faster x87": if you check the asm code of your program, you will notice that modern compilers emit no x87 instructions at all, but instead do scalar floating-point operations with SSE registers. So the only slowdown between the scalar and packed operations comes not from the arithmetic itself but from the slower data movement between memory and the CPU (which, on my processor, is hidden behind the automatic data prefetch).
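To illustrate the difference (a sketch of mine, not the poster's benchmark): scalar SSE uses one lane of an XMM register, packed SSE all four.

#include <xmmintrin.h>

// Scalar SSE: what compilers emit for plain float math today
// (addss/mulss); only one lane of the XMM register is used.
inline float MulScalar (float a, float b)
{
    return _mm_cvtss_f32 (_mm_mul_ss (_mm_set_ss (a), _mm_set_ss (b)));
}

// Packed SSE: the same multiply on all four lanes at once (mulps).
inline __m128 MulPacked (__m128 a, __m128 b)
{
    return _mm_mul_ps (a, b);
}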

I hope this far too long post clarified your problem.
Hello,
I have written a matrix library in SSE2 with assembler. Maybe if you post some samples I can help improve them.
Also, optimization is quite different for each processor type. What chipset are you targeting?
Thanks for all the insights! I wanted to post yesterday already, but then gamedev was down.

OK, here is some code. My matrix * matrix is faster than the non-SSE version; matrix * vector is slower, though, so I'll post that.

The matrix is in column-major order. The vector is a 3-component vector, so the 4th component is always filled with a zero. Since other operations depend on it being zero, the matrix * vector operator needs to make sure that it stays zero.

inline const Vec3SSE operator* (const MatrixSSE& m, const Vec3SSE& v)
{
    // copy the data into a register
    __m128 Data = v;

    // mask, later used to zero out the 4th component
    // (note: _mm_set_ps takes its arguments in w,z,y,x order,
    // so the 0 must come first to land in the w lane)
    const __m128 mask = _mm_set_ps (0, 1, 1, 1);

    // broadcast each of the 3 components of the vector
    __m128 t0 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (0, 0, 0, 0));
    __m128 t1 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (1, 1, 1, 1));
    __m128 t2 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (2, 2, 2, 2));

    // multiply each component with the matching matrix column
    t0 = _mm_mul_ps (m.m_Data[0], t0);
    t1 = _mm_mul_ps (m.m_Data[1], t1);
    t2 = _mm_mul_ps (m.m_Data[2], t2);

    t0 = _mm_add_ps (t0, t1);
    // also add the matrix' 4th column (as if the vector had a 1 in the 4th component)
    t2 = _mm_add_ps (t2, m.m_Data[3]);
    t0 = _mm_add_ps (t0, t2);

    // multiply with the mask and return the result
    return (Vec3SSE (_mm_mul_ps (t0, mask)));
}
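(Side note, a hedged sketch rather than anything tested in this thread: the same zeroing can be done with a bitwise AND instead of a floating-point multiply, which keeps the multiply off the critical path. Needs <emmintrin.h> for the SSE2 cast.)

// w lane all zeros, x/y/z lanes all bits set
static const __m128 xyzMask = _mm_castsi128_ps (_mm_set_epi32 (0, -1, -1, -1));

// ...then the last line of the operator becomes:
// return (Vec3SSE (_mm_and_ps (t0, xyzMask)));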


Thanks,
Jan.
Ah, I forgot: I am currently working on an Intel 2.6 GHz Core 2 Quad. I think it supports SSE3, but I wanted to stay compatible with all x64 CPUs, so SSE2 must suffice.

Jan.
I'm not sure about VC++ 2010, but I know the intrinsics in 2008 were quite badly implemented. If you take a look at your intrinsic code, it rarely resembles the asm equivalent; the compiler can't seem to optimize around it and adds a lot of spurious movs to compensate. If you want good SSE performance you'll have to code it by hand in assembly; fortunately, that's pretty easy.

For example, just a few months ago I wrote a simple raytracer in C++ and then converted it to assembly; I got around a 4x speed increase with hand-optimized SSE. It's not just doing 4 ops at the same time: simply having extra room in the registers (i.e. a vector taking 1 register instead of 4) meant a lot less caching of temporary results, reading things in advance, etc.
Quote:Original post by Ryan_001
If you want good SSE performance you'll have to code it by hand in assembly; fortunately, that's pretty easy.
I cannot say much about the particular compiler you named (VC 2008), since I'm not using that one, but in general I would strongly recommend against "hand optimizing" code in assembler. If your compiler does not properly optimize intrinsics, then either you forgot to turn on optimizations or the compiler is broken.

In general, assembler code is more or less a black box for the compiler. Some compilers (gcc, most notably) are a bit more intelligent and can still perform minor optimizations (scheduling and register coloring) even on inline assembly, but most will just treat your code as-is and add some extra instructions around it.

Since assembler is a "black box", the compiler cannot prove certain things or even make assumptions, and thus is unable to perform most (or all) optimizations inside and around it. With intrinsic functions, on the other hand, a compiler will normally perform every valid optimization it is capable of just fine. A decent optimizer will also interleave your SSE code with non-SSE code where possible, which is just awesome, since that code will run "for free".

This topic is closed to new replies.
