Hi,
I have a project at SourceForge called libSIMDx86 that provides functions very similar to this. I suggest anyone who is considering writing their own vector/matrix/quaternion library check it out first.
http://simdx86.sourceforge.net
My question is this: MMX? That instruction set is quite dated. Shoot for SSE2 if you can; otherwise try MMX+SSE, and failing that, plain MMX.
Here is some 2-cent SSE2 code. It processes 32 pixels per loop iteration, so the images must be aligned to a 16-byte boundary. It is possible to relax the 16-byte alignment requirement, but then you process half as much data per loop.
int Remainder = NumPixelsToProcess % 32;
if(Remainder != 0)
{
	//Process extra pixels
	SourcePtr += Remainder;
	DestPtr += Remainder;
}
__asm {
	//setup esi = source, edi = dest, ecx = NumPixels >> 5
Process32MorePixels:
	prefetchnta [esi+128]	//fetch two cache lines ahead
	prefetchnta [edi+128]	//fetch two cache lines ahead
	movdqa xmm0, [edi]	//4
	movdqa xmm1, [edi+16]	//4
	movdqa xmm2, [edi+32]	//4
	movdqa xmm3, [edi+48]	//4, first 64 byte cache line
	movdqa xmm4, [edi+64]	//4
	movdqa xmm5, [edi+80]	//4
	movdqa xmm6, [edi+96]	//4
	movdqa xmm7, [edi+112]	//4, second 64 byte cache line
	paddusb xmm0, [esi]	//4
	paddusb xmm1, [esi+16]	//4
	paddusb xmm2, [esi+32]	//4
	paddusb xmm3, [esi+48]	//4, first 64 byte cache line, 16 pixels
	paddusb xmm4, [esi+64]	//4
	paddusb xmm5, [esi+80]	//4
	paddusb xmm6, [esi+96]	//4
	paddusb xmm7, [esi+112]	//4, second 64 byte cache line, 32 pixels
	//streaming store
	movntdq [edi], xmm0
	movntdq [edi+16], xmm1
	movntdq [edi+32], xmm2
	movntdq [edi+48], xmm3
	movntdq [edi+64], xmm4
	movntdq [edi+80], xmm5
	movntdq [edi+96], xmm6
	movntdq [edi+112], xmm7
	add esi, 128
	add edi, 128
	dec ecx
	jnz Process32MorePixels
	sfence
	//Cleanup
}
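The "//Process extra pixels" step above is left out of the asm. As a sketch (not from the original post, and `add_saturate_u8` is a hypothetical name), the leftover pixels can be handled with a plain scalar loop that does the same saturating byte add that PADDUSB performs:

```c
#include <stddef.h>

/* Saturating 8-bit add over n bytes: the scalar equivalent of PADDUSB.
   Suitable for cleaning up the Remainder bytes before the SIMD loop. */
static void add_saturate_u8(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)dst[i] + (unsigned)src[i];
        dst[i] = (unsigned char)(sum > 255u ? 255u : sum);
    }
}
```

Call it on the first `Remainder` bytes of each image, then advance the pointers as the snippet above does.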
Even if you can't do SSE2 or don't have 16-byte-aligned images, this MMX code will still work:
int Remainder = NumPixelsToProcess % 16;
if(Remainder != 0)
{
	//Process extra pixels manually.
	SourcePtr += Remainder;
	DestPtr += Remainder;
}
__asm {
	//setup esi = source, edi = dest, ecx = NumPixels >> 4
Process16MorePixels:
	movq mm0, [edi]	//2
	movq mm1, [edi+ 8]	//2
	movq mm2, [edi+16]	//2
	movq mm3, [edi+24]	//2
	movq mm4, [edi+32]	//2
	movq mm5, [edi+40]	//2
	movq mm6, [edi+48]	//2
	movq mm7, [edi+56]	//2, end of 64 byte cache line
	paddusb mm0, [esi]	//2
	paddusb mm1, [esi+ 8]	//2
	paddusb mm2, [esi+16]	//2
	paddusb mm3, [esi+24]	//2
	paddusb mm4, [esi+32]	//2
	paddusb mm5, [esi+40]	//2
	paddusb mm6, [esi+48]	//2
	paddusb mm7, [esi+56]	//2, end of 64 byte cache line
	//store
	movq [edi], mm0
	movq [edi+ 8], mm1
	movq [edi+16], mm2
	movq [edi+24], mm3
	movq [edi+32], mm4
	movq [edi+40], mm5
	movq [edi+48], mm6
	movq [edi+56], mm7
	add esi, 64
	add edi, 64
	dec ecx
	jnz Process16MorePixels
	emms	//clear the MMX state so x87 FPU code works afterwards
	//Cleanup
}
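If you would rather not maintain inline asm at all, the same saturating byte add can be written with SSE2 intrinsics and the compiler will handle register allocation and scheduling. This is a sketch, not the original code; `add_saturate_sse2` is a hypothetical name, and it assumes both pointers are 16-byte aligned and the byte count is a multiple of 16:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Add src into dst, 16 bytes at a time, with unsigned saturation.
   _mm_adds_epu8 is the intrinsic form of PADDUSB; the aligned
   load/store intrinsics correspond to MOVDQA. n16 = number of
   16-byte blocks to process. */
static void add_saturate_sse2(unsigned char *dst, const unsigned char *src,
                              size_t n16)
{
    for (size_t i = 0; i < n16; ++i) {
        __m128i d = _mm_load_si128((const __m128i *)(dst + 16 * i));
        __m128i s = _mm_load_si128((const __m128i *)(src + 16 * i));
        _mm_store_si128((__m128i *)(dst + 16 * i), _mm_adds_epu8(d, s));
    }
}
```

Swapping `_mm_store_si128` for `_mm_stream_si128` would give you the MOVNTDQ streaming-store behavior of the asm version (remember the SFENCE afterwards, via `_mm_sfence` from xmmintrin.h).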
If you are sure that the processor has MMX and SSE (but not SSE2), you can change the final set of MOVQ stores into MOVNTQ and add back the PREFETCHNTA instructions. That will probably give a performance boost. And don't forget: PROFILE! Don't just copy/paste code without knowing whether it actually helps. Maybe the SSE2 version is slower than the plain MMX one, maybe not; TRY it before you consider one better than the other.
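To make "PROFILE!" concrete, here is a minimal timing sketch (my own addition, not from the original post; `blend_fn`, `time_blend`, and `blend_scalar` are hypothetical names). It assumes `clock()` resolution is coarse, so each candidate must be repeated many times to get a meaningful number:

```c
#include <time.h>
#include <stddef.h>

/* A scalar saturating add used here as the profiling subject;
   the SSE2 and MMX routines would be timed the same way. */
static void blend_scalar(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)dst[i] + (unsigned)src[i];
        dst[i] = (unsigned char)(sum > 255u ? 255u : sum);
    }
}

typedef void (*blend_fn)(unsigned char *, const unsigned char *, size_t);

/* Time `iters` repetitions of one candidate routine over the same
   buffers and return elapsed wall-clock seconds. */
static double time_blend(blend_fn fn, unsigned char *dst,
                         const unsigned char *src, size_t n, int iters)
{
    clock_t start = clock();
    for (int i = 0; i < iters; ++i)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run each candidate (scalar, MMX, SSE2) through `time_blend` on identical, realistically sized images and compare the numbers on your target CPU rather than trusting instruction counts.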
I (personally) suggest that you use my library, libSIMDx86, as it has been profiled quite a bit by me. It isn't perfect, but it gets closer with each release. Send any questions to baggett.patrick@gmail.com