Help optimize my s/w additive blending [Edit: MMX assembly now]

Started by
20 comments, last by Figgles 17 years, 10 months ago
Today I downloaded SDL to play around with for little projects. It's just what the doctor ordered for simple 2D games. For some pyrotechnic effects, though, I need additive blending, not just the general (and somewhat slow) alpha blending included in SDL. So I whipped up a simple additive blending blit function, which assumes 24-bpp surfaces. Of course, I had overflow problems. So, rather than have several conditionals in the very inner loop of the function, I pre-compute a 256x256 lookup table of capped 8-bit additions for use in the function. Ignoring the bounds-checking and rectangle clipping, here's the meat of the function:

unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		//Look up the new values for each of R,G, and B
		pDst[0] = pBlend[pSrc[0]][pDst[0]];
		pDst[1] = pBlend[pSrc[1]][pDst[1]];
		pDst[2] = pBlend[pSrc[2]][pDst[2]];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}
It's really a pain to have that many dereferences in the very inner loop. At first, I did this

*pDst = pBlend[*pSrc][*pDst];
pSrc++;
pDst++;
three times rather than the way I have it now. Changing it to this method improved speed by about 10% in both debug and release mode (VC++ 2005). Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once? [Edited by - BeanDog on June 12, 2006 8:26:32 PM]
Advertisement
No need for MMX although it can certainly help with the saturation at 255 ("capping")

Some things to consider:

Currently your table represents a function: f(x, y)


..but wouldnt it be better if your table represented f(x + y) .. the addition itself is very efficient.. certainly no worse than calculating a 2D table index (in practice its actualy a lot better than 2D indexing on Intel/AMD)

The table would then only need to have 512 entries instead of 65536 entries, and it would perform the only inefficient operation.. saturation(). Being so small, you can expect very few cache misses which is probably going to be your biggest benefit

So something like:

pDst[0] = SATURATE[pSrc[0] + pDst[0]];
pDts[1] = SATURATE[pSrc[1] + pDst[1]];
pDst[2] = SATURATE[pSrc[2] + pDst[2]];

Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Failing any of those, how is your table stored? Consider that even with the trimmest representation (plain array, both dimensions static), the code for the lookup really can't be optimized any further than:

*pDst = *(lookup + *pSrc << 8 + *pDst);


Which ought to already be what's happening. Modern compilers are very, very good with array-indexing-vs-pointer-arithmetic-equivalence, and frankly I'm very surprised that you see any difference at all (maybe it just didn't think of the 'transformation' to how the index variable is used?)

Did you look at the assembly output?

(EDIT: ::bonks self for not thinking of what Rockoon1 did.::)
Quote:Original post by Zahlman
Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Yes, yes, and yes. This is a very high-traffic piece of code (function called hundreds of times per frame).

Thanks, Rockoon, that gave me another 10% speedup. Any other ideas are welcome!
Whoops, above post is me.

I'd say that your 64k lookup table is a bit too big for the purpose.
Practically, with MMX you could do the trick without any lookup tables, and all the additions in one instruction.

Cheers
I'm quite sure this doesn't provide additional speed-up as the compiler should be smart enough for this. Then again this mighteven be plain wrong thing to do (there's some time from my last micro-optimization session ;)). However, here's my contribution:

The idea is not to write the result straight to pDst at line [1] as it would dirty the cache line (right?). Thus the next read, at line [2], would issue an read from memory (as opposed to read from cache, i.e., cache miss). Therefore, setting up 3 8-bit temporary variables inside the loop might hint the compiler to keep those in registers (perhaps one should also use the register keyword, register char tmp).

Worth a try at least..


unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;int iPadDst = dst->pitch - srect->w*3;unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;int iPadSrc = src->pitch - srect->w*3;for(int y = 0; y < srect->h; y++){	for(int x = 0; x < srect->w; x++)	{		char tmp[3];		//Look up the new values for each of R,G, and B		tmp[0] = pBlend[pSrc[0]][pDst[0]]; //[1]		tmp[1] = pBlend[pSrc[1]][pDst[1]]; //[2]		tmp[2] = pBlend[pSrc[2]][pDst[2]];		pDst[0] = tmp[0];		pDst[1] = tmp[1];		pDst[2] = tmp[2];		//Advance the pointers to the next pixel		pDst+=3;		pSrc+=3;	}	//Advance the pointers to the next line	pDst += iPadDst;	pSrc += iPadSrc;}


EDIT: Just tried to compile it. GCC 4 produces two different sets of code for the two versions. Thus the optimization, whether or not beneficial, is not at least trivial to the compiler. An whats more important code seems to put the three tmps in registers. However, the registers are all 8-bit (al,bl,cl) and it might work faster if one just uses 32-bit temporaries (somehow I recall that 16-bit or 8-bit operations are slow on 32-bit x86).

[Edited by - Winograd on June 12, 2006 4:39:33 AM]
Try replacing your pointer arithmetics with array notation. Compilers generally prefer that, as they can easier keep track of memory aliasing. (especially since you already update the two variables x and y in every iteration, but don't actually *use* them. Try using them to do your array indexing instead of separate pointer arithmetics?) Not sure if it'll make a difference in this particular case though, but might be worth a try.

Other than that, I'd try the following:
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.

[Edited by - Spoonbender on June 12, 2006 5:08:38 AM]
Quote:Original post by Spoonbender
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.


Smart. Somehow I forgot the pipelining problems all together :). Your suggestion is probably better what I proposed (at least in theory).
Back when i was working on my software renderer, i noticed that memory accesses were VERY slow, try this out maybe?

unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;int iPadDst = dst->pitch - srect->w*3;unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;int iPadSrc = src->pitch - srect->w*3;unsigned long rh = srect->h, rw = srect->w;for(unsigned long y = 0; y < rh; ++y){	for(unsigned long x = 0; x < rw; ++x)	{		register unsigned long t1, t2, t3;		//Look up the new values for each of R,G, and B		t1 = (unsigned long) pSrc[0] + (unsigned long) pDst[0];		t2 = (unsigned long) pSrc[1] + (unsigned long) pDst[1];		t3 = (unsigned long) pSrc[2] + (unsigned long) pDst[2];		pDst[0] = t1 | ((t1 >> 9) * 255);		pDst[1] = t2 | ((t2 >> 9) * 255);		pDst[2] = t3 | ((t3 >> 9) * 255);		//Advance the pointers to the next pixel		pDst+=3;		pSrc+=3;	}	//Advance the pointers to the next line	pDst += iPadDst;	pSrc += iPadSrc;}[\source]

This topic is closed to new replies.

Advertisement