# Help optimize my s/w additive blending [Edit: MMX assembly now]

This topic is 4206 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Today I downloaded SDL to play around with for little projects. It's just what the doctor ordered for simple 2D games. For some pyrotechnic effects, though, I need additive blending, not just the general (and somewhat slow) alpha blending included in SDL. So I whipped up a simple additive blending blit function, which assumes 24-bpp surfaces. Of course, I had overflow problems. So, rather than have several conditionals in the very inner loop of the function, I pre-compute a 256x256 lookup table of capped 8-bit additions for use in the function. Ignoring the bounds-checking and rectangle clipping, here's the meat of the function:
```cpp
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		//Look up the new values for each of R,G, and B
		pDst[0] = pBlend[pSrc[0]][pDst[0]];
		pDst[1] = pBlend[pSrc[1]][pDst[1]];
		pDst[2] = pBlend[pSrc[2]][pDst[2]];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}
```

It's really a pain to have that many dereferences in the very inner loop. At first, I did this

```cpp
*pDst = pBlend[*pSrc][*pDst];
pSrc++;
pDst++;
```

three times rather than the way I have it now. Changing it to the current method improved speed by about 10% in both debug and release mode (VC++ 2005). Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once? [Edited by - BeanDog on June 12, 2006 8:26:32 PM]
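For reference, the pre-computed 256x256 table described above might be initialized like this — a sketch, assuming the `pBlend[src][dst]` indexing used in the loop (the function name is illustrative):

```cpp
#include <algorithm>

// 256x256 table of capped 8-bit additions: pBlend[a][b] == min(a + b, 255)
static unsigned char pBlend[256][256];

void InitBlendTable()
{
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            pBlend[a][b] = (unsigned char)std::min(a + b, 255);
}
```

Since saturated addition is symmetric, the index order does not matter.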

##### Share on other sites
No need for MMX, although it can certainly help with the saturation at 255 ("capping").

Some things to consider:

Currently your table represents a function: f(x, y)

..but wouldn't it be better if your table represented f(x + y)? The addition itself is very efficient.. certainly no worse than calculating a 2D table index (in practice it's actually a lot better than 2D indexing on Intel/AMD).

The table would then only need 512 entries instead of 65536, and it would perform the only inefficient operation: saturation. Being so small, you can expect very few cache misses, which is probably going to be your biggest benefit.

So something like:

```cpp
pDst[0] = SATURATE[pSrc[0] + pDst[0]];
pDst[1] = SATURATE[pSrc[1] + pDst[1]];
pDst[2] = SATURATE[pSrc[2] + pDst[2]];
```
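A sketch of how that 512-entry table might be initialized (the init function name is illustrative; `SATURATE` matches the snippet above):

```cpp
#include <algorithm>

// One saturation table indexed by the plain byte sum:
// SATURATE[a + b] == min(a + b, 255), valid for sums in [0, 510].
static unsigned char SATURATE[512];

void InitSaturateTable()
{
    for (int s = 0; s < 512; ++s)
        SATURATE[s] = (unsigned char)std::min(s, 255);
}
```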

##### Share on other sites
Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Failing any of those, how is your table stored? Consider that even with the trimmest representation (plain array, both dimensions static), the code for the lookup really can't be optimized any further than:

```cpp
*pDst = *(lookup + (*pSrc << 8) + *pDst);
```

Which ought to already be what's happening. Modern compilers are very, very good with array-indexing-vs-pointer-arithmetic-equivalence, and frankly I'm very surprised that you see any difference at all (maybe it just didn't think of the 'transformation' to how the index variable is used?)

Did you look at the assembly output?

(EDIT: ::bonks self for not thinking of what Rockoon1 did.::)

##### Share on other sites
Quote:
 Original post by Zahlman:
Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Yes, yes, and yes. This is a very high-traffic piece of code (function called hundreds of times per frame).

Thanks, Rockoon, that gave me another 10% speedup. Any other ideas are welcome!

##### Share on other sites
Whoops, above post is me.

##### Share on other sites

I'd say that your 64k lookup table is a bit too big for the purpose.
Practically, with MMX you could do the trick without any lookup tables, and all the additions in one instruction.

Cheers

##### Share on other sites
I'm quite sure this doesn't provide additional speed-up, as the compiler should be smart enough for this. Then again, this might even be the plain wrong thing to do (it's been some time since my last micro-optimization session ;)). However, here's my contribution:

The idea is not to write the result straight to pDst at line [1], as that would dirty the cache line (right?). Thus the next read, at line [2], would issue a read from memory (as opposed to a read from cache, i.e., a cache miss). Therefore, setting up three 8-bit temporary variables inside the loop might hint the compiler to keep those in registers (perhaps one should also use the register keyword: register char tmp).

Worth a try at least..

```cpp
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;
unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		char tmp[3];

		//Look up the new values for each of R,G, and B
		tmp[0] = pBlend[pSrc[0]][pDst[0]]; //[1]
		tmp[1] = pBlend[pSrc[1]][pDst[1]]; //[2]
		tmp[2] = pBlend[pSrc[2]][pDst[2]];

		pDst[0] = tmp[0];
		pDst[1] = tmp[1];
		pDst[2] = tmp[2];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}
```

EDIT: Just tried to compile it. GCC 4 produces two different sets of code for the two versions, so the optimization, whether or not beneficial, is at least not trivial to the compiler. And, what's more important, the code seems to put the three tmps in registers. However, the registers are all 8-bit (al, bl, cl), and it might work faster to just use 32-bit temporaries (somehow I recall that 16-bit and 8-bit operations are slow on 32-bit x86).

[Edited by - Winograd on June 12, 2006 4:39:33 AM]

##### Share on other sites
Try replacing your pointer arithmetic with array notation. Compilers generally prefer that, as they can more easily keep track of memory aliasing. (Especially since you already update the two variables x and y in every iteration but don't actually *use* them — try using them to do your array indexing instead of separate pointer arithmetic.) Not sure if it'll make a difference in this particular case, but it might be worth a try.

Other than that, I'd try the following:
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.

[Edited by - Spoonbender on June 12, 2006 5:08:38 AM]
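That load-ahead idea might look like this on the inner loop — a sketch with illustrative names, where the saturating table lookup is abstracted into a `sat_add` helper:

```cpp
#include <algorithm>

// Stand-in for the saturating lookup: min(a + b, 255).
static inline unsigned char sat_add(unsigned char a, unsigned char b)
{
    return (unsigned char)std::min((int)a + (int)b, 255);
}

// Blend one row of w 24-bpp pixels. The destination bytes of the *next*
// pixel are loaded before the current pixel is stored, so the store and
// the following load don't form a dependency chain.
void BlendRowPipelined(unsigned char* dst, const unsigned char* src, int w)
{
    if (w <= 0) return;
    unsigned char d0 = dst[0], d1 = dst[1], d2 = dst[2];
    for (int x = 0; x < w; ++x)
    {
        unsigned char n0 = 0, n1 = 0, n2 = 0;
        if (x + 1 < w)  // preload the next pixel's destination bytes
        {
            n0 = dst[3]; n1 = dst[4]; n2 = dst[5];
        }
        dst[0] = sat_add(src[0], d0);
        dst[1] = sat_add(src[1], d1);
        dst[2] = sat_add(src[2], d2);
        d0 = n0; d1 = n1; d2 = n2;
        dst += 3; src += 3;
    }
}
```

Whether this beats the plain loop depends on how well the compiler was already scheduling the loads; profile both.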

##### Share on other sites
Quote:
 Original post by Spoonbender:
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down. That might give you a decent performance boost.

Smart. Somehow I forgot the pipelining problems altogether :). Your suggestion is probably better than what I proposed (at least in theory).

##### Share on other sites
Back when I was working on my software renderer, I noticed that memory accesses were VERY slow. Try this out, maybe?

```cpp
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;
unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;
unsigned long rh = srect->h, rw = srect->w;

for(unsigned long y = 0; y < rh; ++y)
{
	for(unsigned long x = 0; x < rw; ++x)
	{
		register unsigned long t1, t2, t3;

		//Look up the new values for each of R,G, and B
		t1 = (unsigned long) pSrc[0] + (unsigned long) pDst[0];
		t2 = (unsigned long) pSrc[1] + (unsigned long) pDst[1];
		t3 = (unsigned long) pSrc[2] + (unsigned long) pDst[2];

		pDst[0] = t1 | ((t1 >> 9) * 255);
		pDst[1] = t2 | ((t2 >> 9) * 255);
		pDst[2] = t3 | ((t3 >> 9) * 255);

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}
```

##### Share on other sites
my mistake, it was supposed to be (>> 8)
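The corrected (>> 8) version works because, for sums in 0..510, (t >> 8) is 1 exactly when the sum overflowed a byte, and multiplying by 255 turns that into a mask whose low 8 bits are all ones. A quick sanity check of the identity (a sketch; the final & 0xFF stands in for the truncation that happens anyway when storing into an unsigned char):

```cpp
// Branchless saturation for t in [0, 510]: equals min(t, 255).
unsigned long sat255(unsigned long t)
{
    return (t | ((t >> 8) * 255)) & 0xFF;
}
```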

##### Share on other sites
Quote:
 Original post by BeanDog:
Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once?

Well, there is PADDUSB, which performs 8 8-bit saturated adds in parallel. (As well as an SSE2 version which does 16 8-bit adds.)

##### Share on other sites
Since we are micro-optimizing things, did you try using a 32-bit pixel structure? When I was working on our film scanner 2 years ago, I was using something along the lines of

```cpp
struct pixel
{
	union
	{
		struct
		{
			unsigned char r, g, b, a;
		};
		unsigned int rgba;
	};
};
```

It was mostly for the sake of writing readable code, but 32-bit aligned reads/writes + structures that are easily manipulated by the compiler == good generated code, so it might be a good idea to test it.

Another idea is to remove your "x" - you can do the loop on pDst:

```cpp
// was "for(int x = 0; x < srect->w; x++)"
pDstEnd = pDst + srect->w*3;
for (; pDst < pDstEnd; pDst+=3, pSrc+=3)
{
	// blah blah
}
```

Of course, this is very similar to
```cpp
pixel *beginsrc = reinterpret_cast<pixel*>(pSrc);
pixel *endsrc = beginsrc + srect->w;
pixel *begindst = reinterpret_cast<pixel*>(pDst);

//             First1    Last1   First2    Out       FuncObject
std::transform(beginsrc, endsrc, begindst, begindst, transformator);
```

(with the correct transformator object (ie, not a function pointer), of course).
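Such a transformator might look like this — a sketch assuming the 32-bit pixel struct shown earlier, with per-channel saturated addition (the functor name is illustrative):

```cpp
#include <algorithm>

struct pixel
{
    unsigned char r, g, b, a;
};

// Function object for std::transform: per-channel saturated add.
struct AddSaturated
{
    pixel operator()(const pixel& s, const pixel& d) const
    {
        pixel out;
        out.r = (unsigned char)std::min((int)s.r + d.r, 255);
        out.g = (unsigned char)std::min((int)s.g + d.g, 255);
        out.b = (unsigned char)std::min((int)s.b + d.b, 255);
        out.a = (unsigned char)std::min((int)s.a + d.a, 255);
        return out;
    }
};
```

Being a function object rather than a function pointer, the call can be inlined by the compiler inside std::transform's loop.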

HTH,

##### Share on other sites
I've upgraded my routine to use MMX (based on an earlier post on GameDev I found via Google). It seems to work OK. As this is my first attempt at assembler code ever, please let me know what I can do to speed this bad boy up. It's now down to about 4.5ms to fill an 800x600 window via a few dozen calls to the function on my Athlon X2 3800+, while the code in my original post took about 11ms. I have a feeling I can do something to get this code going in both pipelines, but I don't even know where to start.
```cpp
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*4 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*4;
unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*4 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*4;
unsigned int len = (unsigned int)srect->w;

for(int y = 0; y < srect->h; y++)
{
	__asm
	{
		mov esi,pSrc	//esi = pointer to beginning of source line
		mov edi,pDst	//edi = pointer to beginning of dest line
		mov ecx,len

		mov edx,ecx
		and edx,1		//edx = parity (1 for odd # of pixels)
		shr ecx,1		//ecx = number of 2-pixel groups to do
		jz checkodd		//guard: a 1-pixel-wide rect has no 2-pixel groups

	addblitloop:
		movq mm0,[esi]	//load 2 pixels
		movq mm1,[edi]	//load 2 pixels
		paddusb mm0,mm1	//add them (all 8 bytes at once)
		movq [edi],mm0	//store 2 pixels
		add esi,8		//increase loop pointers
		add edi,8
		dec ecx			//Count down remaining 2-pixel pairs
		jnz addblitloop

	checkodd:
		cmp edx,0
		jz skipfinishadd //If there was no odd pixel, skip the following code.
		movd mm0,[esi]	//Copy the last pixel (note movd instead of movq)
		movd mm1,[edi]
		paddusb mm0,mm1
		movd [edi],mm0
	skipfinishadd:
	}
	pSrc += src->pitch;
	pDst += dst->pitch;
}

__asm
{
	emms
}
```

##### Share on other sites
ewww, inline assembler... use the intrinsics instead [smile]

##### Share on other sites
Just my two cents.
It's not guaranteed to make a difference but anyway...

If the width of the image is always a multiple of 4 pixels, you could unroll the loop to do 4 pixels at a time. Something like this:

```cpp
movq mm0,[esi]		//load 2 pixels
movq mm1,[edi]		//load 2 pixels
movq mm2,[esi + 8]	//load 2 pixels
movq mm3,[edi + 8]	//load 2 pixels
paddusb mm0,mm1		//add them (all 8 bytes at once)
paddusb mm2,mm3		//add them (all 8 bytes at once)
movq [edi],mm0		//store 2 pixels
movq [edi + 8],mm2	//store 2 pixels
```

Also you could try loading the data into the cache using prefetch.

If this is completely wrong, someone please correct me.

HellRaiZer

##### Share on other sites
Quote:
 Original post by phantom:
ewww, inline assembler... use the intrinsics instead [smile]

Agreed [smile]

If you want to get the best performance out of this, start by optimizing *everything* you can in plain C++.

Then use intrinsics to enable MMX or SSE.

And then, if you think there's something to gain by it, you can consider using assembly.

But in that order, please. :)

##### Share on other sites
I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.

HellRaiZer: I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?

##### Share on other sites
Quote:
 Original post by BeanDog:
I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.

Maybe so, but it has two advantages:

1) The compiler is still doing the bulk of the work; this includes managing registers and reordering code.
2) It'll work on x64 targets, whereas your inline assembly will cause the compiler to cry and reject your code.

Once you get used to it, it's not that bad; it's basically all the advantages of assembler with the added bonus of the compiler helping you (I played with SSE for about a week, and the VC8 compiler is indeed very good when it comes to optimising things with intrinsics).
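For comparison, here is roughly what an intrinsics version could look like — a sketch using the SSE2 intrinsic _mm_adds_epu8 (the 128-bit form of PADDUSB); the function name and the assumption that len is a multiple of 16 are mine, and the unaligned load/store variants avoid the 16-byte alignment requirement at some cost:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Saturated byte-wise add of len bytes of src into dst, 16 bytes per step.
// len is assumed to be a multiple of 16 here; real code would add a
// scalar tail loop for the remainder.
void AddSaturateSSE2(unsigned char* dst, const unsigned char* src, int len)
{
    for (int i = 0; i < len; i += 16)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i*)(dst + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_adds_epu8(s, d));
    }
}
```

Note there is no explicit register management or emms bookkeeping; the compiler handles all of that.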

##### Share on other sites
Quote:
 I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?

Unfortunately, no. There may be several reasons for that. Instruction pairing and jump prediction are two candidates.

Try adding these two lines at the beginning of the assembly loop (the inner loop):

```cpp
prefetch [esi + 0x100]
prefetch [edi + 0x100]
```

Play around with the constant to find the optimal value. See if this helps a little.

What phantom said about intrinsics sounds reasonable. I haven't written any code using them, but if the compiler can take care of register traffic and instruction scheduling, it sounds like a good choice.

One last thing. I don't code in assembly (mostly I read assembly), but I learned it (or rather, am still learning it) because I wanted to know how things work at a lower level. That said, anything in my post may be completely wrong, so... :)

HellRaiZer

##### Share on other sites
Quote:
 Original post by phantom:
1) The compiler is still doing the bulk of the work; this includes managing registers and reordering code.
2) It'll work on x64 targets, whereas your inline assembly will cause the compiler to cry and reject your code.

3) you're able to mix SIMD instructions with ordinary C++-level optimizations, which is damn hard once you go assembly...

##### Share on other sites
Hi,

I have a project at Sourceforge called libSIMDx86 that does functions very similar to this. I suggest anyone who considers using their own vector/matrix/quaternion library check this out first.

http://simdx86.sourceforge.net

My question is this: MMX? That instruction set is quite dated. Shoot for SSE2 if you can; otherwise try MMX+SSE, else plain MMX.

Here is some 2-cent SSE2 code... it processes 32 pixels per loop, so the images must be 16-byte aligned. It is possible to relax the 16-byte alignment requirement, but then you process half the data per loop.

```cpp
int Remainder = NumPixelsToProcess % 32;
if(Remainder != 0)
{
	//Process extra pixels
	SourcePtr += Remainder;
	DestPtr += Remainder;
}

__asm
{
	//setup esi = source, edi = dest, ecx = NumPixels >> 5
Process32MorePixels:
	prefetchnta [esi+128]	//fetch two cache lines ahead
	prefetchnta [edi+128]	//fetch two cache lines ahead

	movdqa xmm0, [edi]     //4
	movdqa xmm1, [edi+16]  //4
	movdqa xmm2, [edi+32]  //4
	movdqa xmm3, [edi+48]  //4, first 64 byte cache line
	movdqa xmm4, [edi+64]  //4
	movdqa xmm5, [edi+80]  //4
	movdqa xmm6, [edi+96]  //4
	movdqa xmm7, [edi+112] //4, second 64 byte cache line

	paddusb xmm0, [esi]     //4
	paddusb xmm1, [esi+16]  //4
	paddusb xmm2, [esi+32]  //4
	paddusb xmm3, [esi+48]  //4, first 64 byte cache line, 16 pixels
	paddusb xmm4, [esi+64]  //4
	paddusb xmm5, [esi+80]  //4
	paddusb xmm6, [esi+96]  //4
	paddusb xmm7, [esi+112] //4, second 64 byte cache line, 32 pixels

	//streaming stores
	movntdq [edi], xmm0
	movntdq [edi+16], xmm1
	movntdq [edi+32], xmm2
	movntdq [edi+48], xmm3
	movntdq [edi+64], xmm4
	movntdq [edi+80], xmm5
	movntdq [edi+96], xmm6
	movntdq [edi+112], xmm7

	add esi, 128
	add edi, 128
	dec ecx
	jnz Process32MorePixels

	sfence
	//Cleanup
}
```

Even if you can't do SSE2 or don't have 16-byte aligned images, this MMX code will work too:

```cpp
int Remainder = NumPixelsToProcess % 16;
if(Remainder != 0)
{
	//Process extra pixels manually.
	SourcePtr += Remainder;
	DestPtr += Remainder;
}

__asm
{
	//setup esi = source, edi = dest, ecx = NumPixels >> 4
Process16MorePixels:
	movq mm0, [edi]     //2
	movq mm1, [edi+ 8]  //2
	movq mm2, [edi+16]  //2
	movq mm3, [edi+24]  //2
	movq mm4, [edi+32]  //2
	movq mm5, [edi+40]  //2
	movq mm6, [edi+48]  //2
	movq mm7, [edi+56]  //2, end of 64 byte cache line

	paddusb mm0, [esi]     //2
	paddusb mm1, [esi+ 8]  //2
	paddusb mm2, [esi+16]  //2
	paddusb mm3, [esi+24]  //2
	paddusb mm4, [esi+32]  //2
	paddusb mm5, [esi+40]  //2
	paddusb mm6, [esi+48]  //2
	paddusb mm7, [esi+56]  //2, end of 64 byte cache line

	//store
	movq [edi], mm0
	movq [edi+ 8], mm1
	movq [edi+16], mm2
	movq [edi+24], mm3
	movq [edi+32], mm4
	movq [edi+40], mm5
	movq [edi+48], mm6
	movq [edi+56], mm7

	add esi, 64
	add edi, 64
	dec ecx
	jnz Process16MorePixels
	//Cleanup
}
```

If you are sure that the processor has MMX and SSE (but not SSE2), you can change the final set of MOVQ instructions into MOVNTQ and add back the PREFETCHNTA instructions. This will probably give a performance boost. And don't forget: PROFILE! Don't just copy/paste code without knowing whether it helps. Maybe the SSE2 is slower than the plain MMX, maybe not. TRY it before you consider one better than the other.
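One way to follow that advice is to wrap each candidate in a tiny timing harness and compare numbers — a sketch using std::chrono (not available to the original VC++ 2005 posters, who would have reached for QueryPerformanceCounter); the function name is illustrative:

```cpp
#include <chrono>

// Runs fn(dst, src, len) `iters` times and returns the total elapsed
// wall-clock time in milliseconds.
template <typename Fn>
double TimeBlendMs(Fn fn, unsigned char* dst, const unsigned char* src,
                   int len, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        fn(dst, src, len);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Run each variant over the same buffers and iteration count, and keep the one with the smaller number on the hardware you actually target.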

I (personally) suggest that you use my library, libSIMDx86, has it has been profiled quite a bit by me. It isn't perfect, but it gets closer each release. Send any questions to baggett.patrick@gmail.com

