Hi,
I have a project at SourceForge called libSIMDx86 that provides functions very similar to this. I suggest anyone who is considering writing their own vector/matrix/quaternion library check it out first.
http://simdx86.sourceforge.net
My question is this: MMX? That instruction set is quite dated. Shoot for SSE2 if you can; otherwise try MMX+SSE, and failing that, plain MMX.
Here is some 2-cent SSE2 code. It processes 32 pixels per loop iteration, so the images must be aligned to a 16-byte boundary. It is possible to relax the 16-byte alignment requirement, but then you process half as much data per loop.
int Remainder = NumPixelsToProcess % 32;
if(Remainder != 0)
{
	//Process extra pixels
	SourcePtr += Remainder;
	DestPtr += Remainder;
}
__asm {
	//setup esi = source, edi = dest, ecx = NumPixels >> 5
Process32MorePixels:
	prefetchnta [esi+128]	//fetch two cache lines ahead
	prefetchnta [edi+128]	//fetch two cache lines ahead
	movdqa xmm0, [edi]	//4
	movdqa xmm1, [edi+16]	//4
	movdqa xmm2, [edi+32]	//4
	movdqa xmm3, [edi+48]	//4, first 64 byte cache line
	movdqa xmm4, [edi+64]	//4
	movdqa xmm5, [edi+80]	//4
	movdqa xmm6, [edi+96]	//4
	movdqa xmm7, [edi+112]	//4, second 64 byte cache line
	paddusb xmm0, [esi]	//4
	paddusb xmm1, [esi+16]	//4
	paddusb xmm2, [esi+32]	//4
	paddusb xmm3, [esi+48]	//4, first 64 byte cache line, 16 pixels
	paddusb xmm4, [esi+64]	//4
	paddusb xmm5, [esi+80]	//4
	paddusb xmm6, [esi+96]	//4
	paddusb xmm7, [esi+112]	//4, second 64 byte cache line, 32 pixels
	//streaming store
	movntdq [edi], xmm0
	movntdq [edi+16], xmm1
	movntdq [edi+32], xmm2
	movntdq [edi+48], xmm3
	movntdq [edi+64], xmm4
	movntdq [edi+80], xmm5
	movntdq [edi+96], xmm6
	movntdq [edi+112], xmm7
	add esi, 128
	add edi, 128
	dec ecx
	jnz Process32MorePixels
	sfence
	//Cleanup
}
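The "//Process extra pixels" step above is left out of the asm. As a sketch (not from the original post, and `add_saturate_u8` is a hypothetical name), the leftover pixels can be handled with a plain scalar loop that does the same saturating byte add that PADDUSB performs:

```c
#include <stddef.h>

/* Saturating 8-bit add over n bytes: the scalar equivalent of PADDUSB.
   Suitable for cleaning up the Remainder bytes before the SIMD loop. */
static void add_saturate_u8(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)dst[i] + (unsigned)src[i];
        dst[i] = (unsigned char)(sum > 255u ? 255u : sum);
    }
}
```

Call it on the first `Remainder` bytes of each image, then advance the pointers as the snippet above does.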
Even if you can't do SSE2 or don't have 16-byte-aligned images, this MMX code will still work:
int Remainder = NumPixelsToProcess % 16;
if(Remainder != 0)
{
	//Process extra pixels manually.
	SourcePtr += Remainder;
	DestPtr += Remainder;
}
__asm {
	//setup esi = source, edi = dest, ecx = NumPixels >> 4
Process16MorePixels:
	movq mm0, [edi]	//2
	movq mm1, [edi+ 8]	//2
	movq mm2, [edi+16]	//2
	movq mm3, [edi+24]	//2
	movq mm4, [edi+32]	//2
	movq mm5, [edi+40]	//2
	movq mm6, [edi+48]	//2
	movq mm7, [edi+56]	//2, end of 64 byte cache line
	paddusb mm0, [esi]	//2
	paddusb mm1, [esi+ 8]	//2
	paddusb mm2, [esi+16]	//2
	paddusb mm3, [esi+24]	//2
	paddusb mm4, [esi+32]	//2
	paddusb mm5, [esi+40]	//2
	paddusb mm6, [esi+48]	//2
	paddusb mm7, [esi+56]	//2, end of 64 byte cache line
	//store
	movq [edi], mm0
	movq [edi+ 8], mm1
	movq [edi+16], mm2
	movq [edi+24], mm3
	movq [edi+32], mm4
	movq [edi+40], mm5
	movq [edi+48], mm6
	movq [edi+56], mm7
	add esi, 64
	add edi, 64
	dec ecx
	jnz Process16MorePixels
	emms	//clear the MMX state so x87 FPU code works afterwards
	//Cleanup
}
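If you would rather not maintain inline asm at all, the same saturating byte add can be written with SSE2 intrinsics and the compiler will handle register allocation and scheduling. This is a sketch, not the original code; `add_saturate_sse2` is a hypothetical name, and it assumes both pointers are 16-byte aligned and the byte count is a multiple of 16:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Add src into dst, 16 bytes at a time, with unsigned saturation.
   _mm_adds_epu8 is the intrinsic form of PADDUSB; the aligned
   load/store intrinsics correspond to MOVDQA. n16 = number of
   16-byte blocks to process. */
static void add_saturate_sse2(unsigned char *dst, const unsigned char *src,
                              size_t n16)
{
    for (size_t i = 0; i < n16; ++i) {
        __m128i d = _mm_load_si128((const __m128i *)(dst + 16 * i));
        __m128i s = _mm_load_si128((const __m128i *)(src + 16 * i));
        _mm_store_si128((__m128i *)(dst + 16 * i), _mm_adds_epu8(d, s));
    }
}
```

Swapping `_mm_store_si128` for `_mm_stream_si128` would give you the MOVNTDQ streaming-store behavior of the asm version (remember the SFENCE afterwards, via `_mm_sfence` from xmmintrin.h).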
If you are sure that the processor has MMX and SSE (but not SSE2), you can change the final set of MOVQ stores into MOVNTQ and add back the PREFETCHNTA instructions. That will probably give a performance boost. And don't forget: PROFILE! Don't just copy/paste code without knowing whether it actually helps. Maybe the SSE2 version is slower than the plain MMX one, maybe not; TRY it before you consider one better than the other.
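To make "PROFILE!" concrete, here is a minimal timing sketch (my own addition, not from the original post; `blend_fn`, `time_blend`, and `blend_scalar` are hypothetical names). It assumes `clock()` resolution is coarse, so each candidate must be repeated many times to get a meaningful number:

```c
#include <time.h>
#include <stddef.h>

/* A scalar saturating add used here as the profiling subject;
   the SSE2 and MMX routines would be timed the same way. */
static void blend_scalar(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)dst[i] + (unsigned)src[i];
        dst[i] = (unsigned char)(sum > 255u ? 255u : sum);
    }
}

typedef void (*blend_fn)(unsigned char *, const unsigned char *, size_t);

/* Time `iters` repetitions of one candidate routine over the same
   buffers and return elapsed wall-clock seconds. */
static double time_blend(blend_fn fn, unsigned char *dst,
                         const unsigned char *src, size_t n, int iters)
{
    clock_t start = clock();
    for (int i = 0; i < iters; ++i)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run each candidate (scalar, MMX, SSE2) through `time_blend` on identical, realistically sized images and compare the numbers on your target CPU rather than trusting instruction counts.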
I (personally) suggest that you use my library, libSIMDx86, as it has been profiled quite a bit by me. It isn't perfect, but it gets closer with each release. Send any questions to baggett.patrick@gmail.com