Back to General and Gameplay Programming

Help optimize my s/w additive blending [Edit: MMX assembly now]

General and Gameplay Programming Programming

Started by BeanDog June 11, 2006 08:27 PM

20 comments, last by Figgles 17 years, 10 months ago

BeanDog

1,065

Author

June 11, 2006 08:27 PM

Today I downloaded SDL to play around with for little projects. It's just what the doctor ordered for simple 2D games. For some pyrotechnic effects, though, I need additive blending, not just the general (and somewhat slow) alpha blending included in SDL. So I whipped up a simple additive blending blit function, which assumes 24-bpp surfaces. Of course, I had overflow problems. So, rather than have several conditionals in the very inner loop of the function, I pre-compute a 256x256 lookup table of capped 8-bit additions for use in the function. Ignoring the bounds-checking and rectangle clipping, here's the meat of the function:


unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		//Look up the new values for each of R,G, and B
		pDst[0] = pBlend[pSrc[0]][pDst[0]];
		pDst[1] = pBlend[pSrc[1]][pDst[1]];
		pDst[2] = pBlend[pSrc[2]][pDst[2]];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}

It's really a pain to have that many dereferences in the very inner loop. At first, I did this


*pDst = pBlend[*pSrc][*pDst];
pSrc++;
pDst++;

three times rather than the way I have it now. Changing it to this method improved speed by about 10% in both debug and release mode (VC++ 2005). Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once? [Edited by - BeanDog on June 12, 2006 8:26:32 PM]

Rockoon1

104

June 11, 2006 10:12 PM

No need for MMX although it can certainly help with the saturation at 255 ("capping")

Some things to consider:

Currently your table represents a function: f(x, y)

..but wouldnt it be better if your table represented f(x + y) .. the addition itself is very efficient.. certainly no worse than calculating a 2D table index (in practice its actualy a lot better than 2D indexing on Intel/AMD)

The table would then only need to have 512 entries instead of 65536 entries, and it would perform the only inefficient operation.. saturation(). Being so small, you can expect very few cache misses which is probably going to be your biggest benefit

So something like:

pDst[0] = SATURATE[pSrc[0] + pDst[0]];
pDts[1] = SATURATE[pSrc[1] + pDst[1]];
pDst[2] = SATURATE[pSrc[2] + pDst[2]];

Zahlman

1,682

June 11, 2006 10:19 PM

Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Failing any of those, how is your table stored? Consider that even with the trimmest representation (plain array, both dimensions static), the code for the lookup really can't be optimized any further than:

*pDst = *(lookup + *pSrc << 8 + *pDst);

Which ought to already be what's happening. Modern compilers are very, very good with array-indexing-vs-pointer-arithmetic-equivalence, and frankly I'm very surprised that you see any difference at all (maybe it just didn't think of the 'transformation' to how the index variable is used?)

Did you look at the assembly output?

(EDIT: ::bonks self for not thinking of what Rockoon1 did.::)

Anonymous

June 12, 2006 12:11 AM

Quote:Original post by Zahlman
Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Yes, yes, and yes. This is a very high-traffic piece of code (function called hundreds of times per frame).

Thanks, Rockoon, that gave me another 10% speedup. Any other ideas are welcome!

BeanDog

1,065

Author

June 12, 2006 12:12 AM

Whoops, above post is me.

Demus79

362

June 12, 2006 03:36 AM

I'd say that your 64k lookup table is a bit too big for the purpose.
Practically, with MMX you could do the trick without any lookup tables, and all the additions in one instruction.

Cheers

Winograd

380

June 12, 2006 04:39 AM

I'm quite sure this doesn't provide additional speed-up as the compiler should be smart enough for this. Then again this mighteven be plain wrong thing to do (there's some time from my last micro-optimization session ;)). However, here's my contribution:

The idea is not to write the result straight to pDst at line [1] as it would dirty the cache line (right?). Thus the next read, at line [2], would issue an read from memory (as opposed to read from cache, i.e., cache miss). Therefore, setting up 3 8-bit temporary variables inside the loop might hint the compiler to keep those in registers (perhaps one should also use the register keyword, register char tmp).

Worth a try at least..

unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;int iPadDst = dst->pitch - srect->w*3;unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;int iPadSrc = src->pitch - srect->w*3;for(int y = 0; y < srect->h; y++){	for(int x = 0; x < srect->w; x++)	{		char tmp[3];		//Look up the new values for each of R,G, and B		tmp[0] = pBlend[pSrc[0]][pDst[0]]; //[1]		tmp[1] = pBlend[pSrc[1]][pDst[1]]; //[2]		tmp[2] = pBlend[pSrc[2]][pDst[2]];		pDst[0] = tmp[0];		pDst[1] = tmp[1];		pDst[2] = tmp[2];		//Advance the pointers to the next pixel		pDst+=3;		pSrc+=3;	}	//Advance the pointers to the next line	pDst += iPadDst;	pSrc += iPadSrc;}

EDIT: Just tried to compile it. GCC 4 produces two different sets of code for the two versions. Thus the optimization, whether or not beneficial, is not at least trivial to the compiler. An whats more important code seems to put the three tmps in registers. However, the registers are all 8-bit (al,bl,cl) and it might work faster if one just uses 32-bit temporaries (somehow I recall that 16-bit or 8-bit operations are slow on 32-bit x86).

[Edited by - Winograd on June 12, 2006 4:39:33 AM]

Spoonbender

1,258

June 12, 2006 05:08 AM

Try replacing your pointer arithmetics with array notation. Compilers generally prefer that, as they can easier keep track of memory aliasing. (especially since you already update the two variables x and y in every iteration, but don't actually *use* them. Try using them to do your array indexing instead of separate pointer arithmetics?) Not sure if it'll make a difference in this particular case though, but might be worth a try.

Other than that, I'd try the following:
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.

[Edited by - Spoonbender on June 12, 2006 5:08:38 AM]

Winograd

380

June 12, 2006 05:44 AM

Quote:Original post by Spoonbender
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.

Smart. Somehow I forgot the pipelining problems all together :). Your suggestion is probably better what I proposed (at least in theory).

Anonymous

June 12, 2006 07:45 AM

Back when i was working on my software renderer, i noticed that memory accesses were VERY slow, try this out maybe?

unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;int iPadDst = dst->pitch - srect->w*3;unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * dst->pitch;int iPadSrc = src->pitch - srect->w*3;unsigned long rh = srect->h, rw = srect->w;for(unsigned long y = 0; y < rh; ++y){	for(unsigned long x = 0; x < rw; ++x)	{		register unsigned long t1, t2, t3;		//Look up the new values for each of R,G, and B		t1 = (unsigned long) pSrc[0] + (unsigned long) pDst[0];		t2 = (unsigned long) pSrc[1] + (unsigned long) pDst[1];		t3 = (unsigned long) pSrc[2] + (unsigned long) pDst[2];		pDst[0] = t1 | ((t1 >> 9) * 255);		pDst[1] = t2 | ((t2 >> 9) * 255);		pDst[2] = t3 | ((t3 >> 9) * 255);		//Advance the pointers to the next pixel		pDst+=3;		pSrc+=3;	}	//Advance the pointers to the next line	pDst += iPadDst;	pSrc += iPadSrc;}[\source]

Help optimize my s/w additive blending [Edit: MMX assembly now]

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Help optimize my s/w additive blending [Edit: MMX assembly now]

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines