SSE Performance

13 comments, last by momotte 14 years, 1 month ago
Quote:Original post by samoth
I cannot say much about the particular compiler you named (VC 2008), since I'm not using that one, but in general I would strongly recommend against "hand optimizing" code in assembler. If your compiler does not properly optimize intrinsics, then you either forgot to turn on optimizations or the compiler is broken.


What compiler do you use? I'm mainly familiar with Visual C++, GCC, and Intel's compiler (which costs money). I know there are others out there, but I haven't had the opportunity to work with them. I'd be interested in playing around with any that do intrinsics well.

Quote:In general, assembler code is more or less a black box for the compiler. Some (gcc, most notably) are a bit more intelligent and can still perform minor optimizations (scheduling and register coloring) even on inline assembly, but most compilers will just treat your code as-is and add some extra instructions around it.

Since assembler is a "black box", the compiler cannot prove certain things or even make assumptions, and thus is unable to do most (or all) optimizations inside and around it. On the other hand, a compiler will normally perform all valid optimizations that it is capable of doing with intrinsic functions just fine. A decent optimizer will also interleave your SSE code with non-SSE code when it is possible, which is just awesome, since that code will run "for free".


In theory... in practice I've found that to not be quite so true. And I'm not the only one:

http://www.virtualdub.org/blog/pivot/entry.php?id=162

It's a bit old, but sadly still true for the most part.

Also consider that a fair number of crucial intrinsics are just plain missing. One example is the 64-bit long divide, whose asm mnemonic happens to be simply idiv: the intrinsic just doesn't exist (or at least it didn't when I checked a year ago). And I'm not even talking about the many obscure SSE "packed add with swizzled kittens" type instructions. I was just trying to do fixed point math.

The whole "optimize around it" idea never seemed to pan out in real code. Take a look at some disassembled intrinsics. I personally was surprised at just how badly VC++ 2008 mangled them. Most of my experience was with the 64-bit compiler, so maybe it's because it was rather new. I don't know. But even simple vector math had all sorts of extra mov and pack instructions.

On top of that, I'm no asm guru by any stretch; I just dabble in it from time to time. Sure, debugging asm can be a bit annoying, and there's very little documentation, but I've found it very easy to beat the VC++ 2008 compiler in performance.

In my experience, smaller functions or ones that will be inlined are best left as C++ code, and larger performance-critical ones are best written as hand-tuned assembly. Intrinsics have so far left me quite disappointed performance-wise, simply because they're not properly supported. I'd love to use them in, say, a vector library or whatnot (where they would be perfect). I'm hoping VC++ 2010 fixes things. But if you're playing with intrinsics and getting slow code, take a look at the disassembly; the problem might not lie where you expect it.
Maybe it's worth trying a bit of inline assembly?
I'm away from my dev PC, but here are a few snippets for now...
http://board.flatassembler.net/topic.php?t=10928
Quote:Original post by Antheus
The point of SIMD (especially the Streaming part) is to call a function once and have the intrinsics process a million elements in that same call.


Emphasising what Antheus has already said: a lot of people (a search of the forums will find roughly the same question asked many times) think they can re-implement their Vector4 class with SSE and suddenly everything is 4 times faster. You will quickly find that this is not the case, and at best you may get a small speedup (I would guess less than 10%) if you are careful about 16-byte aligning all your vectors, making sure your operations inline, and carefully writing the SSE.

The real win from SSE comes from processing a large number of elements in one go, where you have determined through profiling that this operation is a bottleneck. This way the setup work of SSE is amortised over many elements, you get more efficient cache usage and ultimately, hopefully, much improved performance.
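
A minimal sketch of what "processing many elements in one go" can look like, assuming plain float arrays. The function name scale_add_batch and its parameters are made up for illustration and do not come from the thread. The scale factor is broadcast into a register once, and the loop then handles four floats per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: out[i] = a[i] * scale + b[i] for i in [0, count).
// Unaligned loads/stores are used so the sketch works on any float array.
void scale_add_batch(const float* a, const float* b, float* out,
                     std::size_t count, float scale)
{
    assert(count == 0 || (a && b && out));

    // Setup cost paid once, outside the loop.
    const __m128 s = _mm_set1_ps(scale);

    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, s), vb));
    }

    // Scalar tail for the last 0..3 elements.
    for (; i < count; ++i)
        out[i] = a[i] * scale + b[i];
}
```

With 16-byte aligned arrays, _mm_load_ps and _mm_store_ps could replace the unaligned variants. Whether that matters is something to measure rather than assume.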
Quote:Original post by Ryan_001
What compiler do you use?
gcc and gcc/MinGW, both version 4.4

Quote:Original post by Ryan_001
In theory... in practice I've found that to not be quite so true.
I stopped considering *ever* touching inline assembly again after gcc made the cost of a checksum calculation disappear completely by interleaving it with SSE code written with intrinsic functions.
I had written an OFB block codec with SSE2 intrinsics. Someone will now inevitably point out that one should be using AES, but hey. The goal here was not so much to provide nuclear weapon grade security for the next 700,000 years (though I believe the OFB codec doesn't perform much worse than most "real" encryptions), but to make network packets unreadable in reasonable time to someone with normal resources while staying close to zero overhead, maintaining minimum state, zero-cost key switching, no huge lookup tables, and allowing a "seek" operation. Portability beyond x86 was not a concern. Enter SSE2.

Once the codec was done, I decided to add a checksum and wrote that in C++. The compiler peeled off one iteration and blended the checksum instructions working on the last-but-one block with the SSE2 instructions from the codec. Except for the case where only 1 or 2 blocks are encoded, the checksum version runs at exactly the same speed as the no-checksum version. I could never have written anything like that by hand.

I'm still checking the assembly output every now and then when I write something with intrinsics, just to be 110% sure. So far, it has always come out as good as or better than I would have anticipated, and I have not been tempted to write assembly by hand again. In fact, with all optimizations turned on, simple C++ is almost every time as good as you can get or very close (and a hell of a lot easier to maintain/modify). It is amazing what modern compilers can do sometimes.
My code needs to be portable across GCC and VC++, and most of the time I compile it as 64-bit. AFAIK inline assembly is neither portable across compilers nor supported in 64-bit builds (at least with VC++). Also, I like intrinsics better, because I'm really not that experienced in real assembler programming.

Jan.
Quote:Original post by samoth
...
GCC does awesome stuff
...

It is amazing what modern compilers can do sometimes.


I'll have to play with it. Kinda sad an open source compiler beating the pants off VC++ so badly. I've heard its auto-vectorization is pretty good too (which IMO is about time!!).

Now if only I can get it to work with the visual studio IDE...

Quote:Original post by d00fus
The real win from SSE comes from processing a large number of elements in one go


seconded.
you'll see some real wins if you want to transform a batch of, for example, 300 vectors, by the same matrix. this will typically be blazingly fast with SIMD.
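
a rough sketch of such a batched transform, assuming the same column layout that the quoted operator* below appears to use (four __m128 columns, w treated as 1). the names Mat4Columns and transform_points are hypothetical, not from the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical layout: four matrix columns, each one __m128 (x, y, z, w).
struct Mat4Columns {
    __m128 col[4];
};

// Transforms `count` points stored as packed (x, y, z, w) floats; the input
// w is ignored and treated as 1. Both arrays must be 16-byte aligned because
// _mm_load_ps and _mm_store_ps are used.
void transform_points(const Mat4Columns& m, const float* in, float* out,
                      std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    // The matrix is read once and reused for every point.
    const __m128 c0 = m.col[0], c1 = m.col[1], c2 = m.col[2], c3 = m.col[3];

    for (std::size_t i = 0; i < count; ++i) {
        const __m128 p = _mm_load_ps(in + 4 * i);
        // Broadcast each component across a full register.
        const __m128 x = _mm_shuffle_ps(p, p, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));

        // r = c0*x + c1*y + c2*z + c3
        __m128 r = _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y));
        r = _mm_add_ps(r, _mm_add_ps(_mm_mul_ps(c2, z), c3));
        _mm_store_ps(out + 4 * i, r);
    }
}
```

the per-call overhead (matrix loads, function call) is paid once for the whole batch instead of once per vector, which is where the amortisation comes from.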

however, just a few comments/questions about the code you posted (Jan K)

    inline const Vec3SSE operator* (const MatrixSSE& m, const Vec3SSE& v)
    {
        // copy the data into a register
        __m128 Data = v;

        // mask, later used to zero out the 4th component
        // (_mm_set_ps takes its arguments from the highest element down to the
        // lowest, so the 0 for the 4th component has to come first)
        const __m128 mask = _mm_set_ps (0, 1, 1, 1);

        // the 3 components of the vector
        __m128 t0 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (0, 0, 0, 0));
        __m128 t1 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (1, 1, 1, 1));
        __m128 t2 = _mm_shuffle_ps (Data, Data, _MM_SHUFFLE (2, 2, 2, 2));

        t0 = _mm_mul_ps (m.m_Data[0], t0);
        t1 = _mm_mul_ps (m.m_Data[1], t1);
        t2 = _mm_mul_ps (m.m_Data[2], t2);

        t0 = _mm_add_ps (t0, t1);

        // also add the matrix' 4th vector (as if the vector had a 1 in the 4th component)
        t2 = _mm_add_ps (t2, m.m_Data[3]);
        t0 = _mm_add_ps (t0, t2);

        // multiply with the mask and return the result
        return (Vec3SSE (_mm_mul_ps (t0, mask)));
    }


first thing that comes to mind is.. enforcing that 'last component must be zero' forces you to use extra ops. do you _really_ need it? (out of curiosity: what do you need it for?)
if you need it anyway, don't _mm_mul_ps with the mask! use _mm_and_ps, with a mask set to { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }

then, you're declaring this mask as a const __m128... if you're using visual studio, even with all optimizations set to the max, it still seems to do stupid stuff randomly (not optimizing some trivial code, messing up its register allocation, etc.). so, depending on the compiler's mood, the mask might not be placed in static constant memory, and you might get a shitload of garbage instructions that, since the scalars are not all equal, load each x, y, z, w component separately into an xmm register.
this would seriously suck, but it might happen. you really need to check the generated assembly.
you can avoid this by declaring it static const:

    #ifdef _MSC_VER
    #define ALIGN(a) __declspec(align(a))
    #else // assuming __GNUC__/__GNUG__
    #define ALIGN(a) __attribute__((aligned(a)))
    #endif

    //...

    // mask, later used to zero out the 4th component
    // (an array of four 32-bit words, 16-byte aligned)
    ALIGN(0x10) static const unsigned int mask[4] = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 };

    //...

    t0 = _mm_and_ps(t0, _mm_load_ps(reinterpret_cast<const float*>(mask)));

    // OR:
    t0 = _mm_and_ps(t0, *reinterpret_cast<const __m128*>(mask));

    //...


the last version, with the cast to an __m128 pointer, will be better: the compiler might generate the form of the 'ANDPS' instruction that takes a memory operand as its second parameter, which would save you an instruction. note that in a batched transform, where you process hundreds of elements, you can keep this mask in a register, loaded only once before the main loop. a clever compiler that inlines the whole transform function into an outer loop will be able to do this on its own. I've seen VC++ 2008 do this sometimes, but sometimes not. whatever...
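
a small sketch of the "load the mask once before the main loop" idea, assuming packed (x, y, z, w) floats and SSE2 for building the mask. the helper name clear_w_batch is made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2: _mm_set_epi32, _mm_castsi128_ps

// Hypothetical helper: clears the 4th component of `count` packed
// (x, y, z, w) vectors in place. `data` must be 16-byte aligned.
void clear_w_batch(float* data, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);

    // Built once before the loop. _mm_set_epi32 lists lanes from highest
    // to lowest, so the leading 0 lands in the w lane.
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));

    for (std::size_t i = 0; i < count; ++i) {
        const __m128 v = _mm_load_ps(data + 4 * i);
        _mm_store_ps(data + 4 * i, _mm_and_ps(v, mask));
    }
}
```

besides saving a multiply, the bitwise AND always yields an exact 0 in the masked lane, whereas multiplying by 0 turns an infinity or NaN in that lane into NaN.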

and finally, what is Vec3SSE? does it directly contain an __m128 member? is the constructor that takes an __m128 declared inline? does the compiler really inline it, or does it generate a shitload of extra instructions? (same goes for the members of the matrix class)
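
for reference, a thin wrapper of the kind these questions are aiming at might look roughly like this. it is only a guess at the shape: the real Vec3SSE is not shown in the thread, and everything except the class name and the m_Data member name (borrowed from the matrix code above) is an assumption.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical sketch: the real Vec3SSE from the thread is not shown.
class Vec3SSE {
public:
    // Member functions defined inside the class body are implicitly inline.
    Vec3SSE() : m_Data(_mm_setzero_ps()) {}
    explicit Vec3SSE(__m128 v) : m_Data(v) {}
    Vec3SSE(float x, float y, float z) : m_Data(_mm_set_ps(0.0f, z, y, x)) {}

    // Allows `__m128 Data = v;` and passing the object straight to intrinsics.
    operator __m128() const { return m_Data; }

    // Slow path for debugging and tests: goes through memory.
    float get(int i) const
    {
        assert(i >= 0 && i < 3);
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, m_Data);
        return tmp[i];
    }

private:
    __m128 m_Data; // sole data member
};

inline Vec3SSE operator+(const Vec3SSE& a, const Vec3SSE& b)
{
    return Vec3SSE(_mm_add_ps(a, b));
}

// The wrapper should add no storage on top of the raw register type.
static_assert(sizeof(Vec3SSE) == sizeof(__m128),
              "Vec3SSE must wrap exactly one __m128");
```

even with a wrapper this thin, whether the constructor and operators actually get inlined still has to be checked in the generated assembly.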

is your whole multiplication function inlined? or do you incur a function call that might not happen when you benchmark your scalar code?

