Jan K

SSE Performance


Hi

I am currently learning a bit of SSE(2) programming, using intrinsics.

As far as i can tell from searching these forums and Google, it seems to be a common misconception, that SSE is generally faster than x87 code. One article mentioned, that as long as one doesn't do at least a 4x4 matrix multiplication or something like inverse-sqrt, most vector/matrix stuff will be slower. And from my own tests i can confirm that. I have written a small 4x4 matrix class and a 4-component vector class and in comparison to the old code everything with a square-root in it is faster, but everything else is mostly 1/3 slower.

Now, i do artificial tests, in that i just call one function thousands of times. What i would like to know from more experienced people are the following things:

1) Might SSE be faster, once i do more complex things with my matrix/vector code? I mean, if i have a longer mathematical expression, due to inlining the compiler should be able to get more out of SSE code, than from x87 code. So, could it be that SSE is faster in "normal" code, compared to artificially calling one function over and over (note: my typical use-case is a 3D game/graphics engine).

2) What exactly is the reason, that x87 code is often faster? First i read somewhere, that SSE uses the same registers as x87 and that the CPU needs to switch modes, which takes time. But then i read somewhere else, that this was true for MMX, but not for SSE anymore. What is correct? And if the CPU does not need to switch modes, why is SSE then not always faster.

For the record: All my SSE data is 16-byte aligned and when i say it is slower, i am pretty sure it cannot be done with less instructions, because i mostly copied well optimized code (or checked that i had the same solution as others).

Any information and insights from more experienced people is welcome. I'm mostly trying to understand SSE better and know more about its performance characteristics.

Thanks,
Jan.

1- yes. although if you're benchmarking a partially unrolled vectorized float4 addition in a tight loop without any cache-access, I doubt you'll get better performance by throwing in some other random code inside.

2- this was true for MMX, not SSE, SSE has its own 8 128-bit registers. and SSE is not always faster because some operations are not "vector-friendly", like cross or dot products (in SSE2 at least), horizontal component combination, etc..

from my experience, most performance drops occur when you use the packed instructions, and need excessive shuffling, or have one or more components unused in your vector.
you might get better performance using scalar ops, even if you have more instructions.
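
To make the shuffle cost concrete, here is a minimal sketch of a 4-component dot product restricted to SSE/SSE2. The function name dot4_sse2 is made up for illustration. The multiply is one packed instruction, but collapsing the four products into one value takes two data-movement instructions and two adds, which is the kind of overhead described above.

```cpp
#include <emmintrin.h> // SSE2 intrinsics (also pulls in the SSE ones)

// Hypothetical helper: 4-component dot product using only SSE/SSE2.
inline float dot4_sse2(__m128 a, __m128 b)
{
    __m128 prod = _mm_mul_ps(a, b);                                    // p0 p1 p2 p3
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)); // p1 p0 p3 p2
    __m128 sums = _mm_add_ps(prod, swap);                              // p0+p1 p0+p1 p2+p3 p2+p3
    __m128 high = _mm_movehl_ps(swap, sums);                           // p2+p3 in element 0
    sums = _mm_add_ss(sums, high);                                     // p0+p1+p2+p3 in element 0
    return _mm_cvtss_f32(sums);
}
```

Only the first line does "vector" work; the rest is a horizontal reduction, so a scalar version is not necessarily slower for a single dot product.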

(about unused components, the first example that comes to mind is using 4-vector ops on 3-vectors. you can consider packing 4 float3 (xyz xyz xyz xyz) into 3 float4 (xxxx yyyy zzzz), and use 3 vector instructions operating on all components instead of 4 ignoring one component, but doing so will generally end up eating more registers (for example, you'll need 3 broadcasted float4 registers to add a simple constant to the 3 float4 above, instead of a single register if you kept the xyz layout. so using structures of arrays might not always be a good idea, it depends a lot on the context)).
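
As a rough sketch of the xxxx yyyy zzzz layout mentioned above (the struct Vec3x4 and the function dot3x4 are hypothetical names, not from any library), four 3-component dot products can be computed at once with no shuffles and no wasted lane:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical SoA packet: four 3-vectors stored as xxxx yyyy zzzz.
struct Vec3x4 { __m128 x, y, z; };

// Four dot products at once; lane i of the result is dot(a_i, b_i).
inline __m128 dot3x4(const Vec3x4& a, const Vec3x4& b)
{
    __m128 r = _mm_mul_ps(a.x, b.x);
    r = _mm_add_ps(r, _mm_mul_ps(a.y, b.y));
    r = _mm_add_ps(r, _mm_mul_ps(a.z, b.z));
    return r;
}
```

The trade-off described above still applies: each packet occupies three registers, and any constant has to be broadcast per component before it can be combined with the packet.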

and I suppose you already know this, but having less instructions is not necessarily a good thing. what's important is a good instruction ordering, to keep the different CPU units busy, minimise data-dependencies to avoid pipeline stalls, and overlap instruction latencies as much as possible.
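
One common way to reduce data dependencies, sketched here under the assumption that the element count is a multiple of 8 (the function name is hypothetical), is to keep two independent accumulators so that consecutive adds do not each wait on the previous result:

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: sums `count` floats; count must be a multiple of 8.
inline float sum_two_accumulators(const float* data, std::size_t count)
{
    assert(count % 8 == 0);
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 8) {
        // The two adds are independent of each other, so their latencies can overlap.
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(data + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(data + i + 4));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(acc0, acc1));
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

This version uses more instructions than a single-accumulator loop with a final reduction, yet it can run faster; whether it does on a given CPU is something to measure rather than assume.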

SSE is also extremely good for conversion operations. not only float/int, but stuff like float[0,1]/u16, float[0,1]/u8 (which can be blazingly fast once you start processing 8 or 16 values at a time). string operations can get quite a speedup too (I'm not talking about the string-specific instructions in the later SSE models)
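
A sketch of the float[0,1] to u8 case, processing 16 values per call with SSE2 (the function name is hypothetical; rounding follows the current MXCSR rounding mode, and out-of-range inputs are clamped by the saturating packs):

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: converts 16 floats in [0,1] to 16 bytes in [0,255].
inline void float01_to_u8_x16(const float* src, std::uint8_t* dst)
{
    const __m128 scale = _mm_set1_ps(255.0f);
    __m128i i0 = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + 0), scale));
    __m128i i1 = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + 4), scale));
    __m128i i2 = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + 8), scale));
    __m128i i3 = _mm_cvtps_epi32(_mm_mul_ps(_mm_loadu_ps(src + 12), scale));
    __m128i s0 = _mm_packs_epi32(i0, i1);     // 8 x int16, signed saturation
    __m128i s1 = _mm_packs_epi32(i2, i3);     // 8 x int16, signed saturation
    __m128i bytes = _mm_packus_epi16(s0, s1); // 16 x uint8, unsigned saturation
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), bytes);
}
```

The scalar equivalent needs a multiply, a conversion and a clamp per element; here the clamp comes for free from the pack instructions.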

CPUs are really good at data parallel (which is a crappy explanation) operations. SIMD and GPUs are all about that on beefier hardware. A Tutorial.

SIMD is just a CPU-friendly way to formulate an algorithm as data parallel.

Many problems cannot be expressed as such. Some are naturally data parallel, some are called embarrassingly parallel. For those that can be, intrinsics are a way of encoding such an algorithm in the most hardware-friendly way.

Quote:
What exactly is the reason, that x87 code is often faster

- The algorithm is not suitable for data parallelism
- The algorithm isn't applied properly

Code which relies heavily on conditional execution, or one which has highly non-deterministic flow is not suitable for most forms of data parallelism, with exception of certain types of hardware, or abundance of memory. SIMD has neither, GPUs are just barely better.
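
For simple per-element conditions, the usual SSE workaround is to compute both outcomes and blend them with a comparison mask instead of branching. A minimal sketch (select_less is a hypothetical name):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: per lane, returns a where a < limit, otherwise b.
inline __m128 select_less(__m128 a, __m128 b, __m128 limit)
{
    __m128 mask = _mm_cmplt_ps(a, limit);     // all ones where a < limit, else all zeros
    return _mm_or_ps(_mm_and_ps(mask, a),     // keep a where the mask is set
                     _mm_andnot_ps(mask, b)); // keep b where it is not
}
```

This only pays off when both sides are cheap to compute. With heavily divergent control flow every lane pays for every path, which is why such code is a poor fit for SIMD.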

Quote:
that i just call one function thousands of times
The overhead here is a killer.

The point of SIMD (especially the Streaming part) is to call a function once and have the intrinsics process a million elements in that same call. Same for GPUs and batching. Minimize the number of calls, and especially data conversions. Ideally, you compute square root

Think Formula 1. It needs to be brought to race track by truck, it needs to be tuned, ... Then it keeps going 300mph for 3 hours it is on the track. That is SIMD. It takes time to prepare, but then it's bitchin' fast.
Calling function for each element is Formula 1 in New York downtown. 50mph, red light, 50 mph, red light, 50 mph, red light, 50 mph, red light.
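
A sketch of the "call once, process many" shape, using square root as the operation (sqrt_batch is a hypothetical name; unaligned loads are used so the caller does not have to guarantee alignment):

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: out[i] = sqrt(in[i]) for i in [0, count).
void sqrt_batch(const float* in, float* out, std::size_t count)
{
    std::size_t i = 0;
    // Main loop: four square roots per iteration.
    for (; i + 4 <= count; i += 4)
        _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));
    // Scalar tail for the last count % 4 elements.
    for (; i < count; ++i)
        out[i] = std::sqrt(in[i]);
}
```

The call overhead and loop setup are paid once per batch instead of once per element.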

Quote:
Original post by Jan K
Hi

I am currently learning a bit of SSE(2) programming, using intrinsics.

As far as i can tell from searching these forums and Google, it seems to be a common misconception, that SSE is generally faster than x87 code. One article mentioned, that as long as one doesn't do at least a 4x4 matrix multiplication or something like inverse-sqrt, most vector/matrix stuff will be slower. And from my own tests i can confirm that. I have written a small 4x4 matrix class and a 4-component vector class and in comparison to the old code everything with a square-root in it is faster, but everything else is mostly 1/3 slower.

Now, i do artificial tests, in that i just call one function thousands of times. What i would like to know from more experienced people are the following things:

1) Might SSE be faster, once i do more complex things with my matrix/vector code? I mean, if i have a longer mathematical expression, due to inlining the compiler should be able to get more out of SSE code, than from x87 code. So, could it be that SSE is faster in "normal" code, compared to artificially calling one function over and over (note: my typical use-case is a 3D game/graphics engine).

2) What exactly is the reason, that x87 code is often faster? First i read somewhere, that SSE uses the same registers as x87 and that the CPU needs to switch modes, which takes time. But then i read somewhere else, that this was true for MMX, but not for SSE anymore. What is correct? And if the CPU does not need to switch modes, why is SSE then not always faster.


For the record: All my SSE data is 16-byte aligned and when i say it is slower, i am pretty sure it cannot be done with less instructions, because i mostly copied well optimized code (or checked that i had the same solution as others).


Any information and insights from more experienced people is welcome. I'm mostly trying to understand SSE better and know more about its performance characteristics.

Thanks,
Jan.


matrix * matrix or vector * matrix, clamp of a vector, minimizing a vector, maximizing a vector, or even vector comparison is faster using SSE. Try showing the code you're using, you're probably making some mistake.

I personally have more than twice the speed doing matrix * matrix for example.
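
As an illustration of why clamp, min and max map well to SSE, a clamp of all four components is just two instructions with no shuffles. This is a sketch with a hypothetical function name, not code from the poster's library:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: clamps each component of v to [lo, hi].
inline __m128 clamp_ps(__m128 v, __m128 lo, __m128 hi)
{
    return _mm_min_ps(_mm_max_ps(v, lo), hi);
}
```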

Hi Jan K,
I have been through the same as you with learning SSE.
I tried to speed up my 4d vector class with it and made
multiple benchmarks comparing my code to the compiler
generated.

I measured the time needed for doing the operation in a loop
a million times.
The best result I got was being 1.1x faster than the compiler.
But during the testing something was weird.
The factor of 1.1 did not change, but when I changed the
order of the last two mov instructions in my SSE code it ran
10 times slower/faster. The crazy thing was that the compiler's
code did too.
So I found out that my processor (Intel Core2Duo) is analyzing
the dataflow and automatically prefetching to speed things up.
Apparently my processor learned to optimize my code and
did the same with the compiler's code that ran after mine.

Finally I must say: the processor's automatic prefetch makes
measurements in loops unpredictable, because it will not behave
the same in your later program, which is surely not supposed to
run one single operation a million times but a series of operations.

In my final program the speedup from SSE was between 2 and 6 times.
So it might be worth testing!

One last word on the faster x87:
If you check the asm code of your program you will notice that modern
compilers use no x87 instructions but do scalar floating-point
operations in SSE registers. Therefore the only slowdown going from
scalar to packed operations is not caused by the arithmetic but
by the slower data movement between memory and CPU (which is hidden behind
the automatic data prefetch of my processor).

I hope this far too long post clarified your problem.

Hello
I have written a matrix library in SSE2 with assembler.
Maybe if you post some samples i can help improve them.
Also, optimization for each processor type is quite different.
What chipset are you targeting?

Thanks for all the insights! Wanted to post yesterday already, but then gamedev was down.

Ok, here some code:
My Matrix*Matrix is faster than the non-SSE version. Matrix*Vector is slower, though, so i'll post that.

The matrix is in column-major order. The vector is a 3-component vector, so the 4th component is always filled with a zero. Since other operations depend on it being zero, the matrix*vector operator needs to make sure, that it stays zero.



inline const Vec3SSE operator* (const MatrixSSE& m, const Vec3SSE& v)
{
    // copy the data into a register
    __m128 Data = v;
    // mask, later used to zero out the 4th component
    // (_mm_setr_ps takes its arguments in memory order, so element 3 is the zero;
    // _mm_set_ps (1, 1, 1, 0) would zero element 0, the x component, instead)
    const __m128 mask = _mm_setr_ps (1, 1, 1, 0);

    // the 3 components of the vector, each broadcast to all four lanes
    __m128 t0 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (0, 0, 0, 0));
    __m128 t1 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (1, 1, 1, 1));
    __m128 t2 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (2, 2, 2, 2));

    t0 = _mm_mul_ps (m.m_Data[0], t0);
    t1 = _mm_mul_ps (m.m_Data[1], t1);
    t2 = _mm_mul_ps (m.m_Data[2], t2);

    t0 = _mm_add_ps (t0, t1);
    // also add the matrix' 4th vector (as if the vector had a 1 in the 4th component)
    t2 = _mm_add_ps (t2, m.m_Data[3]);

    t0 = _mm_add_ps (t0, t2);

    // multiply with the mask and return the result
    return (Vec3SSE (_mm_mul_ps (t0, mask)));
}



Thanks,
Jan.

Ah, I forgot: I am currently working on an Intel 2.6 GHz Core 2 Quad. I think it supports SSE3, but i wanted to stay compatible with all x64 CPUs, so SSE2 must suffice.

Jan.

I'm not sure about VC++ 2010, but I know intrinsics for 2008 were quite badly implemented. If you take a look at the code generated for your intrinsics, it rarely resembles the asm equivalent; the compiler can't seem to optimize around it and adds a lot of spurious movs to compensate. If you want good SSE performance you'll have to code it by hand in assembly; fortunately that's pretty easy.

For example, just a few months ago I wrote a simple raytracer in C++ and then converted it to assembly, and I got around a 4x speed increase with hand-optimized SSE. It's not just doing 4 ops at the same time; just having extra room in the registers (i.e. a vector taking 1 register instead of 4) meant a lot less caching of temporary results, reading things in advance, etc...

Quote:
Original post by Ryan_001
If you want good SSE performance you'll have to code it by hand in assembly; fortunately that's pretty easy.
I cannot say much about the particular compiler you named (VC 2008), since I'm not using that one, but in general I would strongly recommend against "hand optimizing" code in assembler. If your compiler does not properly optimize intrinsics, then you either forgot to turn on optimizations or the compiler is broken.

In general, assembler code is more or less a black box for the compiler. Some (gcc, most notably) are a bit more intelligent and can still perform minor optimizations (scheduling and register coloring) even on inline assembly, but most compilers will just treat your code as-is and add some extra instructions around it.

Since assembler is a "black box", the compiler cannot prove certain things or even make assumptions, and thus is unable to do most (or all) optimizations inside and around it. On the other hand, a compiler will normally perform all valid optimizations that it is capable of doing with intrinsic functions just fine. A decent optimizer will also interleave your SSE code with non-SSE code when it is possible, which is just awesome, since that code will run "for free".

Quote:
Original post by samoth
I cannot say much about the particular compiler you named (VC 2008), since I'm not using that one, but in general I would strongly recommend against "hand optimizing" code in assembler. If your compiler does not properly optimize intrinsics, then you either forgot to turn on optimizations or the compiler is broken.


What compiler do you use? I'm mainly familiar with Visual C++, GCC, or Intel (which costs money). I know there are others out there, but I haven't had the opportunity to work with them. I'd be interested in playing around with any out there that do intrinsics well.

Quote:
In general, assembler code is more or less a black box for the compiler. Some (gcc, most notably) are a bit more intelligent and can still perform minor optimizations (scheduling and register coloring) even on inline assembly, but most compilers will just treat your code as-is and add some extra instructions around it.

Since assembler is a "black box", the compiler cannot prove certain things or even make assumptions, and thus is unable to do most (or all) optimizations inside and around it. On the other hand, a compiler will normally perform all valid optimizations that it is capable of doing with intrinsic functions just fine. A decent optimizer will also interleave your SSE code with non-SSE code when it is possible, which is just awesome, since that code will run "for free".


In theory... in practice I've found that to not be quite so true. And I'm not the only one:

http://www.virtualdub.org/blog/pivot/entry.php?id=162

It's a bit old, but sadly still true for the most part.

Also consider that a fair number of crucial intrinsics are just plain missing. For example, an intrinsic for the 64-bit long divide, whose asm mnemonic happens to be simply idiv, just doesn't exist (or at least it didn't when I checked a year ago). And I'm not even talking about the many obscure SSE "packed add with swizzled kittens" type instructions. I was just trying to do fixed point math.

The whole "optimize around it" idea never seemed to pan out in real code. Take a look at some disassembled intrinsics. I personally was surprised at just how badly VC++ 2008 mangled it. Now, most of my experience was with the 64-bit compiler, so maybe it's because it was rather new. I don't know. But even simple vector math had all sorts of mov and pack instructions.

On top of that, I'm no asm guru by any stretch, I just dabble in it from time to time. And sure, debugging asm can be a bit annoying, and there's very little documentation, but I've found it very easy to beat the VC++ 2008 compiler in performance.

In my experience smaller functions, or ones that will be inlined, are best left as C++ code, and larger performance-critical ones are best written as hand-tuned assembly. Intrinsics so far have left me quite disappointed performance-wise, simply because they're not properly supported. I'd love to use them in, say, a vector library or whatnot (where they would be perfect). I'm hoping VC++ 2010 fixes things. But if you're playing with intrinsics and getting slow code, take a look at the disassembly; the problem might not lie where you expect it.

Quote:
Original post by Antheus
The point of SIMD (especially the Streaming part) is to call a function once and have the intrinsics process a million elements in that same call.


Emphasising what Antheus has already said: a lot of people (a search of the forums will find roughly the same question asked many times) think they can re-implement their Vector4 class with SSE and suddenly everything is 4 times faster. You will quickly find that this is not the case, and at best you may get a small speedup (I would guess less than 10%) if you are careful about 16-byte aligning all your vectors, making sure your operations inline, and carefully writing the SSE.

The real win from SSE comes from processing a large number of elements in one go, where you have determined through profiling that this operation is a bottleneck. This way the setup work of SSE is amortised by processing many elements all at once, you get more efficient cache usage and ultimately, hopefully, a much improved performance boost.
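
A sketch of what such a batched operation might look like for the matrix * vector case discussed in this thread. The names Mat4 and transform_batch are hypothetical, the matrix is assumed to be column-major with one __m128 per column, and points are assumed to be stored as contiguous x, y, z, w floats:

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical layout: column-major 4x4 matrix, one __m128 per column.
struct Mat4 { __m128 col[4]; };

// Transforms `count` 4-component points by the same matrix.
void transform_batch(const Mat4& m, const float* in, float* out, std::size_t count)
{
    // The matrix columns are read once, outside the loop.
    const __m128 c0 = m.col[0], c1 = m.col[1], c2 = m.col[2], c3 = m.col[3];
    for (std::size_t i = 0; i < count; ++i) {
        __m128 v = _mm_loadu_ps(in + 4 * i);
        // Broadcast each component to all four lanes.
        __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
        __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
        // Two independent sums, combined at the end.
        __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
                              _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

Compared with calling a per-vector operator in a loop, the call overhead and the matrix loads are paid once per batch, and the input is read sequentially, which is friendly to the cache.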

Quote:
Original post by Ryan_001
What compiler do you use?
gcc and gcc/MinGW, both version 4.4

Quote:
Original post by Ryan_001
In theory... in practice I've found that to not be quite so true.
I stopped considering *ever* touching inline assembly again after gcc completely optimized out a checksum calculation by streamlining it with intrinsic function SSE code.
I had written an OFB block codec with SSE2 intrinsics. Someone will now inevitably point out that one should be using AES, but hey. The goal here was not so much to provide nuclear weapon grade security for the next 700,000 years (though I believe the OFB codec doesn't perform much worse than most "real" encryptions), but to make network packets unreadable in reasonable time to someone with normal resources while staying close to zero overhead, maintaining minimum state, zero-cost key switching, no huge lookup tables, and allowing a "seek" operation. Portability beyond x86 was not a concern. Enter SSE2.

Once the codec was done, I decided to add a checksum and wrote that in C++. The compiler peeled off one iteration and blended the checksum instructions working on the last-but-one block with the SSE2 instructions from the codec. Except for the case where only 1 or 2 blocks are encoded, the checksum version runs at exactly the same speed as the no-checksum version. I could never have written anything like that by hand.

I'm still checking the assembly output every now and then when I write something with intrinsics, just to be 110% sure. So far, it has always come out as good as or better than I would have anticipated, and I have not been tempted to write assembly by hand again. In fact, with all optimizations turned on, simple C++ is almost always as good as you can get or very close (and a hell of a lot easier to maintain/modify). It is amazing what modern compilers can do sometimes.

My code needs to be portable across GCC and VC++, most of the time i compile it as 64 Bit. AFAIK inline assembly is neither portable across compilers, nor supported in 64 Bit (at least VC++). Also i like intrinsics better, because i'm really not that experienced in real assembler programming.

Jan.

Quote:
Original post by samoth
...
GCC does awesome stuff
...

It is amazing what modern compilers can do sometimes.


I'll have to play with it. Kinda sad an open source compiler beating the pants off VC++ so badly. I've heard its auto-vectorization is pretty good too (which IMO is about time!!).

Now if only I can get it to work with the visual studio IDE...

Quote:
Original post by d00fus
The real win from SSE comes from processing a large number of elements in one go


seconded.
you'll see some real wins if you want to transform a batch of, for example, 300 vectors, by the same matrix. this will typically be blazingly fast with SIMD.

however, just a few comments/questions about the code you posted (Jan K)

inline const Vec3SSE operator* (const MatrixSSE& m, const Vec3SSE& v)
{
    // copy the data into a register
    __m128 Data = v;
    // mask, later used to zero out the 4th component
    // (_mm_setr_ps takes its arguments in memory order, so element 3 is the zero;
    // _mm_set_ps (1, 1, 1, 0) would zero element 0, the x component, instead)
    const __m128 mask = _mm_setr_ps (1, 1, 1, 0);

    // the 3 components of the vector, each broadcast to all four lanes
    __m128 t0 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (0, 0, 0, 0));
    __m128 t1 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (1, 1, 1, 1));
    __m128 t2 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (2, 2, 2, 2));

    t0 = _mm_mul_ps (m.m_Data[0], t0);
    t1 = _mm_mul_ps (m.m_Data[1], t1);
    t2 = _mm_mul_ps (m.m_Data[2], t2);

    t0 = _mm_add_ps (t0, t1);
    // also add the matrix' 4th vector (as if the vector had a 1 in the 4th component)
    t2 = _mm_add_ps (t2, m.m_Data[3]);

    t0 = _mm_add_ps (t0, t2);

    // multiply with the mask and return the result
    return (Vec3SSE (_mm_mul_ps (t0, mask)));
}


first thing that comes to mind is.. enforcing that 'last component must be zero' forces you to use extra ops. do you _really_ need it? (out of curiosity: what do you need it for?)
if you need it anyway, dont mm_mul_ps with the mask! use mm_and_ps, with a mask set to { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }
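
A minimal sketch of the AND variant, building the mask in a register with SSE2 integer intrinsics rather than loading it from memory (zero_w is a hypothetical name):

```cpp
#include <emmintrin.h> // SSE2: _mm_setr_epi32, _mm_castsi128_ps

// Hypothetical helper: clears the 4th component with a bitwise AND.
inline __m128 zero_w(__m128 v)
{
    // Elements 0..2 are all ones (kept), element 3 is all zeros (cleared).
    const __m128 mask = _mm_castsi128_ps(_mm_setr_epi32(-1, -1, -1, 0));
    return _mm_and_ps(v, mask);
}
```

Besides being cheaper than a multiply, the AND also clears the lane when it holds NaN or infinity, whereas multiplying those values by 0 yields NaN.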

then, you're declaring this mask as a const __m128... depending on the compiler's mood (if you're using visual studio, even with all optimizations set to the max, it still seems to be doing stupid stuff randomly, like not optimizing some trivial stuff, messing-up its registers allocations, etc.. whatever). yes, so, depending on the compiler's mood, this might not be externalized to a constant static memory, and you might get a shitload of garbage instructions that will, as the scalars are not equal, load each x, y, z, w component separately into an xmm register.
this would seriously suck, but it might happen. you really need to check the generated assembly.
you can avoid this by declaring it static const:

#ifdef _MSC_VER
#define ALIGN(a) __declspec(align(a))
#else // assuming __GNUC__/__GNUG__
#define ALIGN(a) __attribute__((aligned(a)))
#endif

//...

// mask, later used to zero out the 4th component
// (an array of 4 values, 16-byte aligned so it can be loaded as one __m128)
ALIGN(0x10) static const unsigned int mask[4] = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 };

//...

// the casts must keep the const qualifier, since mask is const
t0 = _mm_and_ps(t0, _mm_load_ps(reinterpret_cast<const float*>(mask)));
// OR:
t0 = _mm_and_ps(t0, *reinterpret_cast<const __m128*>(mask));

//...


the last version, with a cast to __m128*, will be better: the compiler might generate the form of the 'ANDPS' instruction that takes a memory operand as its second parameter, which would save you an instruction. note that in a batched transform, where you process hundreds of elements, you can keep this mask in a register, loaded only once before the main loop. a clever compiler that inlines the whole transform function into an external loop will be able to do this on its own. I've seen VC++ 2008 do this sometimes, but sometimes not. whatever...

and finally, what is Vec3SSE? does it directly contain an __m128 member? is the constructor that takes an __m128 declared inline? does the compiler really inline it, or does it generate a shitload of extra instructions? (same goes for the members of the matrix class)

is your whole multiplication function inlined? or do you incur a function call that might not happen when you benchmark your scalar code?
