Awesome, this works right out of the box! This code is around 30% faster than the native C++ code. Thanks for the fast response, Zoner!
[quote name='Zoner' timestamp='1327960607' post='4907781']
The loop will likely need to be unrolled 2-4 more times as to pipeline better (i.e. use more registers until it starts spilling over onto the stack)
If the data is aligned, the load and store can use the aligned 'non-u' versions instead.
I am using the non-u versions, but it didn't make much difference. Unrolling the loop (4 times) also didn't have a significant impact, although I re-used the same variables. By "use more registers", did you mean I should introduce a separate set of variables for each unrolled iteration?
[/quote]
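For reference, here's the kind of unrolled loop I'm asking about — a minimal sketch, where the scale-by-2.0f kernel and the function name are made up for illustration, assuming 16-byte-aligned float arrays so the aligned 'non-u' load/store can be used, and with a separate `__m128` variable per unrolled step:

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical kernel: dst[i] = src[i] * 2.0f, unrolled 4x.
// Each unrolled step uses its own __m128 variable so the compiler can
// assign each step its own register instead of serializing on one.
void scale2x_unrolled(const float* src, float* dst, std::size_t n)
{
    const __m128 k = _mm_set1_ps(2.0f);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128 a = _mm_load_ps(src + i);       // aligned 'non-u' loads
        __m128 b = _mm_load_ps(src + i + 4);
        __m128 c = _mm_load_ps(src + i + 8);
        __m128 d = _mm_load_ps(src + i + 12);
        a = _mm_mul_ps(a, k);
        b = _mm_mul_ps(b, k);
        c = _mm_mul_ps(c, k);
        d = _mm_mul_ps(d, k);
        _mm_store_ps(dst + i,      a);         // aligned stores
        _mm_store_ps(dst + i + 4,  b);
        _mm_store_ps(dst + i + 8,  c);
        _mm_store_ps(dst + i + 12, d);
    }
    for (; i < n; ++i)                         // scalar tail
        dst[i] = src[i] * 2.0f;
}
```

Is that what you meant, or should the unrolled copies also be interleaved differently?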
SIMD intrinsics can only be audited by looking at optimized code (unoptimized SIMD code is pretty horrific). Basically, when an algorithm gets too complicated, the compiler has to spill various XMM registers onto the stack. So you have to build the code, check out the asm in a debugger, and see whether it is doing that or not. This is much less of a problem with 64-bit code, as there are twice as many registers to work with.
Re-using the same variables should work for a lot of code, although marking the pointers __restrict will probably be necessary so the compiler can schedule the code more aggressively. If the restrict is helping, the resulting asm should look something like:
read A
do work A
read B
do work B
store A
do more work on B
read C
store B
do work C
store C
vs
read A
do work A
store A
read B
do work B
store B
read C
do work C
store C
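As a concrete sketch of where the qualifier goes (the scale-by-2 kernel and names are made up; the only point is __restrict on the pointers, which MSVC, GCC, and Clang all accept as an extension):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// __restrict promises the compiler that src and dst never alias, so it is
// free to hoist the next iteration's load above the previous iteration's
// store and emit the interleaved read/work/store schedule shown above.
// Handles n in multiples of 4 only, for brevity.
void scale2x_restrict(const float* __restrict src,
                      float* __restrict dst,
                      std::size_t n)
{
    const __m128 k = _mm_set1_ps(2.0f);
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(src + i);       // aligned load
        _mm_store_ps(dst + i, _mm_mul_ps(v, k));
    }
}
```

Without the restrict, the compiler has to assume a store through dst might change what the next load from src reads, which forces the strictly sequential read/work/store pattern.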