C++ SIMD/SSE optimization

Hello,

I would like to optimize a short piece of C++ code using SIMD SSE2 or SSE3 instructions. Could someone get me up to speed, or ideally (if anyone would be so kind) provide the converted code?

The function I need optimized simply reads a graphics frame in ARGB format and writes another frame in aAAA format (where the "a" component is set to 0xff and the remaining three channels are filled with the original alpha value).

unsigned int* f = (unsigned int*)frame;
unsigned int* k = (unsigned int*)alphaKey;

for (unsigned i = mFrameHeight * mFrameWidth; i != 0; i--)
{
    unsigned int a = *f++;
    a &= 0xff000000;
    unsigned int v = (a >> 8) | (a >> 16) | (a >> 24) | 0xff000000;
    *k++ = v;
}

#include <emmintrin.h> // SSE2 intrinsics

// Make this a global, not a local inside the function.
static const __m128i GAlphaMask = _mm_set_epi32(0xFF000000, 0xFF000000, 0xFF000000, 0xFF000000);

void foo()
{
    unsigned int* f = (unsigned int*)frame;
    unsigned int* k = (unsigned int*)alphaKey;

    size_t numitems  = mFrameHeight * mFrameWidth;
    size_t numloops  = numitems / 4;            // 4 pixels per 128-bit vector
    size_t remainder = numitems - numloops * 4;
    for (size_t index = 0; index < numloops; ++index)
    {
        __m128i val       = _mm_loadu_si128((__m128i*)f);    // load 4 ARGB pixels
        __m128i valmasked = _mm_and_si128(val, GAlphaMask);  // keep only the alpha bytes
        __m128i shiftA    = _mm_srli_epi32(valmasked, 8);
        __m128i shiftB    = _mm_srli_epi32(valmasked, 16);
        __m128i shiftC    = _mm_srli_epi32(valmasked, 24);
        __m128i result    = _mm_or_si128(_mm_or_si128(shiftA, shiftB), _mm_or_si128(shiftC, GAlphaMask));
        _mm_storeu_si128((__m128i*)k, result);                // store 4 aAAA pixels
        f += 4;
        k += 4;
    }
    // TODO - finish remainder with non-SIMD code
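    // One possible scalar tail for the 0-3 leftover pixels (a sketch, reusing
    // the per-pixel logic from the original loop; f and k already point past
    // the vectorized part):
    for (size_t i = 0; i < remainder; ++i)
    {
        unsigned int a = *f++ & 0xff000000;
        *k++ = (a >> 8) | (a >> 16) | (a >> 24) | 0xff000000;
    }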
}


The loop will likely need to be unrolled 2-4 more times so it pipelines better (i.e. it uses more registers, up until values start spilling over onto the stack).

If the data is aligned, the load and store can use the aligned 'non-u' versions instead.
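For example, a minimal sketch of the aligned variant, reusing the names from foo() above and assuming both pointers happen to be 16-byte aligned (which the code above does not guarantee):

// Only valid if f and k are 16-byte aligned, e.g. ((uintptr_t)f % 16) == 0.
for (size_t index = 0; index < numloops; ++index)
{
    __m128i val       = _mm_load_si128((const __m128i*)f);   // aligned load instead of _mm_loadu_si128
    __m128i valmasked = _mm_and_si128(val, GAlphaMask);
    __m128i result    = _mm_or_si128(
        _mm_or_si128(_mm_srli_epi32(valmasked, 8),  _mm_srli_epi32(valmasked, 16)),
        _mm_or_si128(_mm_srli_epi32(valmasked, 24), GAlphaMask));
    _mm_store_si128((__m128i*)k, result);                     // aligned store instead of _mm_storeu_si128
    f += 4;
    k += 4;
}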
Awesome, this works right out of the box! This code is around 30% faster than the original C++ code. Thanks for the fast response, Zoner!


[quote name='Zoner' timestamp='1327960607' post='4907781']
The loop will likely need to be unrolled 2-4 more times so it pipelines better (i.e. it uses more registers, up until values start spilling over onto the stack).

If the data is aligned, the load and store can use the aligned 'non-u' versions instead.
[/quote]

I am using the non-u version, but it didn't make much difference. Also, unrolling the loop (4 times) didn't have a significant impact, although I re-used the same variables. By "use more registers", did you mean I should introduce more variables for each unrolled loop sequence?


SIMD intrinsics can really only be audited by looking at the optimized code (unoptimized SIMD code is pretty horrific): when an algorithm gets too complicated, the compiler has to spill various XMM registers onto the stack. So you have to build the code, check out the asm in a debugger and see whether it is doing that or not. This is much less of a problem with 64-bit code, as there are twice as many registers to work with.

Re-using the same variables should work for a lot of code, although making the pointers __restrict will probably be necessary so the compiler can schedule the code more aggressively. If the restrict is helping, the resulting asm should look something like:

read A
do work A
read B
do work B
store A
do more work on B
read C
store B
do work C
store C


vs

read A
do work A
store A
read B
do work B
store B
read C
do work C
store C
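As an illustration only (this is not code from this thread; the function and variable names are made up), a 2x-unrolled pass with __restrict-qualified pointers (the MSVC/GCC spelling) and separate variables per vector, so the compiler is free to interleave loads, ALU work and stores as sketched above:

#include <emmintrin.h> // SSE2
#include <cstddef>     // size_t

static inline __m128i ReplicateAlpha(__m128i v, __m128i mask)
{
    // Same per-vector transform as in foo(): keep alpha, smear it into the
    // low three bytes, force the alpha byte to 0xFF.
    v = _mm_and_si128(v, mask);
    return _mm_or_si128(_mm_or_si128(_mm_srli_epi32(v, 8), _mm_srli_epi32(v, 16)),
                        _mm_or_si128(_mm_srli_epi32(v, 24), mask));
}

static void ConvertUnrolled2(const unsigned int* __restrict src,
                             unsigned int* __restrict dst,
                             size_t numPairs) // number of 8-pixel (2-vector) groups
{
    const __m128i mask = _mm_set1_epi32((int)0xFF000000);
    for (size_t i = 0; i < numPairs; ++i)
    {
        __m128i a  = _mm_loadu_si128((const __m128i*)src + 0); // two independent loads
        __m128i b  = _mm_loadu_si128((const __m128i*)src + 1);
        __m128i ra = ReplicateAlpha(a, mask);                   // two independent chains of work
        __m128i rb = ReplicateAlpha(b, mask);
        _mm_storeu_si128((__m128i*)dst + 0, ra);
        _mm_storeu_si128((__m128i*)dst + 1, rb);
        src += 8;   // 2 vectors * 4 pixels each
        dst += 8;
    }
}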
Zoner, thanks for the detailed explanation! I am quite happy with the current speed; if I should need to squeeze some more out of it I will fire up the disassembler, but I doubt that will be necessary any time soon. Thanks again!
This is interesting.
What book or online resource would you recommend for properly learning SIMD/SSE asm? And does g++ support the same keywords?

[quote]
This is interesting.
What book or online resource would you recommend for properly learning SIMD/SSE asm? And does g++ support the same keywords?
[/quote]


Zoner's code uses compiler vector intrinsics, which are supported by Visual C++, GCC and Intel's compiler. You can get quite far just using MSDN's documentation, though it can be annoyingly spread out at times.
It's easier to read the various *mmintrin.h headers (there are 7 or 8 of them now) in the MSVC include directory and see what's there.

The MSDN docs are a jumbled mess, split across multiple sections (SSE, SSE2, SSE4 and some AVX are documented fairly separately).

// Note: MATH_LEVEL and the MATH_LEVEL_* flags below are project-specific configuration macros, not standard ones.
#include <mmintrin.h> // MMX
#include <xmmintrin.h> // SSE1
#include <emmintrin.h> // SSE2

#if (MATH_LEVEL & MATH_LEVEL_SSE3)
#include <pmmintrin.h> // Intel SSE3
#endif
#if (MATH_LEVEL & MATH_LEVEL_SSSE3)
#include <tmmintrin.h> // Intel SSSE3 (the extra S is not a typo)
#endif
#if (MATH_LEVEL & MATH_LEVEL_SSE4_1)
#include <smmintrin.h> // Intel SSE4.1
#endif
#if (MATH_LEVEL & MATH_LEVEL_SSE4_2)
#include <nmmintrin.h> // Intel SSE4.2
#endif
#if (MATH_LEVEL & MATH_LEVEL_AES)
#include <wmmintrin.h> // Intel AES instructions
#endif
#if (MATH_LEVEL & (MATH_LEVEL_AVX_128|MATH_LEVEL_AVX_256))
#include <immintrin.h> // Intel AVX instructions
#endif
//#include <intrin.h> // Includes all MSVC intrinsics, all of the above plus the crt and win32/win64 platform intrinsics
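To the earlier g++ question: yes, GCC ships the same intrinsics and the same *mmintrin.h headers; you just enable the instruction set on the command line (SSE2 is already the default for x86-64 targets). A minimal sketch, with a made-up file name:

// sse2_check.cpp -- build with e.g.:  g++ -O2 -msse2 sse2_check.cpp
#include <emmintrin.h> // SSE2, same header name as with MSVC

int main()
{
    __m128i mask = _mm_set1_epi32((int)0xFF000000);
    __m128i v    = _mm_set1_epi32(0x00123456);
    __m128i r    = _mm_and_si128(v, mask); // _mm_and_si128 works identically under g++ (maps to pand)
    (void)r;
    return 0;
}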
Great, thanks for the material!

[quote]
This is interesting.
What book or online resource would you recommend for properly learning SIMD/SSE asm? And does g++ support the same keywords?
[/quote]


Agner's optimization manuals are quite good; start with the first one (in particular, the part entitled "Using vector operations"): http://www.agner.org/optimize/
As background/further grounding you can also take a look at "Practical x64 Assembly and C++" here -- http://www.youtube.c...ser/WhatsACreel -- the series does SIMD mostly in asm, but there is an obvious (once you're familiar with it) mapping to the intrinsics available in C and C++.
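As a rough illustration of that asm-to-intrinsic mapping (a sketch, using only the SSE2 operations that appear earlier in this thread; the instruction each intrinsic maps to is noted in the comments):

#include <emmintrin.h>

__m128i MappingExample(__m128i a, __m128i b)
{
    __m128i x = _mm_and_si128(a, b); // pand
    x = _mm_srli_epi32(x, 8);        // psrld
    x = _mm_or_si128(x, b);          // por
    return x;
}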
