Help optimize my s/w additive blending [Edit: MMX assembly now]


Today I downloaded SDL to play around with for little projects. It's just what the doctor ordered for simple 2D games. For some pyrotechnic effects, though, I need additive blending, not just the general (and somewhat slow) alpha blending included in SDL. So I whipped up a simple additive blending blit function, which assumes 24-bpp surfaces. Of course, I had overflow problems. So, rather than have several conditionals in the very inner loop of the function, I pre-compute a 256x256 lookup table of capped 8-bit additions for use in the function. Ignoring the bounds-checking and rectangle clipping, here's the meat of the function:
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		//Look up the new values for each of R,G, and B
		pDst[0] = pBlend[pSrc[0]][pDst[0]];
		pDst[1] = pBlend[pSrc[1]][pDst[1]];
		pDst[2] = pBlend[pSrc[2]][pDst[2]];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}
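
For reference, the table itself is built once up front; a minimal sketch of that precomputation (assuming pBlend is declared as a static unsigned char[256][256]):

for(int s = 0; s < 256; s++)
	for(int d = 0; d < 256; d++)
		pBlend[s][d] = (unsigned char)(s + d > 255 ? 255 : s + d);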
It's really a pain to have that many dereferences in the very inner loop. At first, I did this
*pDst = pBlend[*pSrc][*pDst];
pSrc++;
pDst++;
three times, rather than the way I have it now. Changing to the indexed version improved speed by about 10% in both debug and release mode (VC++ 2005). Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once? [Edited by - BeanDog on June 12, 2006 8:26:32 PM]

No need for MMX, although it can certainly help with the saturation at 255 ("capping").

Some things to consider:

Currently your table represents a function: f(x, y)


...but wouldn't it be better if your table represented f(x + y)? The addition itself is very efficient; certainly no worse than calculating a 2D table index (in practice it's actually a lot better than 2D indexing on Intel/AMD).

The table would then only need 512 entries instead of 65536, and it would handle the only inefficient operation: saturation. Being so small, it should cause very few cache misses, which is probably going to be your biggest benefit.

So something like:

pDst[0] = SATURATE[pSrc[0] + pDst[0]];
pDst[1] = SATURATE[pSrc[1] + pDst[1]];
pDst[2] = SATURATE[pSrc[2] + pDst[2]];
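
Building that table is a one-time loop; a minimal sketch, assuming SATURATE is an unsigned char[512]:

for(int i = 0; i < 512; i++)
	SATURATE[i] = (unsigned char)(i < 256 ? i : 255);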

Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?
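
For comparison, a sketch of the conditional version this refers to, per channel:

int sum = pSrc[0] + pDst[0];
pDst[0] = (unsigned char)(sum > 255 ? 255 : sum);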

Failing any of those, how is your table stored? Consider that even with the trimmest representation (plain array, both dimensions static), the code for the lookup really can't be optimized any further than:


*pDst = *(lookup + (*pSrc << 8) + *pDst);


Which ought to already be what's happening. Modern compilers are very, very good with the equivalence between array indexing and pointer arithmetic, and frankly I'm very surprised that you see any difference at all (maybe it just didn't think of the 'transformation' because of how the index variable is used?).

Did you look at the assembly output?

(EDIT: ::bonks self for not thinking of what Rockoon1 did.::)

Guest Anonymous Poster
Quote:
Original post by Zahlman
Did you *try* just using the conditionals? Did you try playing with your compiler optimization settings? Do you *know* that "it's too slow" as is? Do you *know* that the problem is with *this* loop?

Yes, yes, and yes. This is a very high-traffic piece of code (function called hundreds of times per frame).

Thanks, Rockoon, that gave me another 10% speedup. Any other ideas are welcome!


I'd say that your 64k lookup table is a bit too big for the purpose.
In practice, with MMX you could do the trick without any lookup table at all, with all the additions done in one instruction.

Cheers

I'm quite sure this doesn't provide additional speed-up, as the compiler should be smart enough to do it itself. Then again, this might even be the plain wrong thing to do (it's been some time since my last micro-optimization session ;)). However, here's my contribution:

The idea is not to write the result straight to pDst at line [1], as that would dirty the cache line (right?). Thus the next read, at line [2], would issue a read from memory (as opposed to a read from cache, i.e., a cache miss). Therefore, setting up three 8-bit temporary variables inside the loop might hint the compiler to keep those in registers (perhaps one should also use the register keyword, register char tmp).

Worth a try, at least...



unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;

for(int y = 0; y < srect->h; y++)
{
	for(int x = 0; x < srect->w; x++)
	{
		unsigned char tmp[3];

		//Look up the new values for each of R,G, and B
		tmp[0] = pBlend[pSrc[0]][pDst[0]]; //[1]
		tmp[1] = pBlend[pSrc[1]][pDst[1]]; //[2]
		tmp[2] = pBlend[pSrc[2]][pDst[2]];

		pDst[0] = tmp[0];
		pDst[1] = tmp[1];
		pDst[2] = tmp[2];

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}




EDIT: Just tried to compile it. GCC 4 produces two different sets of code for the two versions, so the optimization, whether or not beneficial, is at least not trivial to the compiler. And, more importantly, the code seems to put the three tmps in registers. However, the registers are all 8-bit (al, bl, cl), and it might work faster if one just uses 32-bit temporaries (somehow I recall that 8-bit and 16-bit operations are slow on 32-bit x86).
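
That is, something like this (an untested sketch):

unsigned int t0 = pBlend[pSrc[0]][pDst[0]];
unsigned int t1 = pBlend[pSrc[1]][pDst[1]];
unsigned int t2 = pBlend[pSrc[2]][pDst[2]];

pDst[0] = (unsigned char)t0;
pDst[1] = (unsigned char)t1;
pDst[2] = (unsigned char)t2;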

[Edited by - Winograd on June 12, 2006 4:39:33 AM]

Try replacing your pointer arithmetic with array notation. Compilers generally prefer that, as they can more easily keep track of memory aliasing. (Especially since you already update the two variables x and y in every iteration but don't actually *use* them; try using them to do your array indexing instead of separate pointer arithmetic?) Not sure if it'll make a difference in this particular case, but it might be worth a try.

Other than that, I'd try the following:
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.
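
A rough sketch of the idea applied to the original loop (untested; note it reads three bytes past the last blended pixel, so the final iteration would need special handling):

unsigned char d0 = pDst[0], d1 = pDst[1], d2 = pDst[2];
for(int x = 0; x < srect->w; x++)
{
	//Start the loads for the *next* pixel before the stores below,
	//so the stores don't stall the loads that follow them
	unsigned char n0 = pDst[3], n1 = pDst[4], n2 = pDst[5];

	pDst[0] = pBlend[pSrc[0]][d0];
	pDst[1] = pBlend[pSrc[1]][d1];
	pDst[2] = pBlend[pSrc[2]][d2];

	d0 = n0; d1 = n1; d2 = n2;
	pDst += 3;
	pSrc += 3;
}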

[Edited by - Spoonbender on June 12, 2006 5:08:38 AM]

Quote:
Original post by Spoonbender
Load the values you need from pDst one or two iterations before you need them, and save them into temp variables. That way, you won't have a load-store dependency slowing you down.

That might give you a decent performance boost.


Smart. Somehow I forgot the pipelining problems altogether :). Your suggestion is probably better than what I proposed (at least in theory).

Guest Anonymous Poster
Back when I was working on my software renderer, I noticed that memory accesses were VERY slow. Try this out, maybe?


unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;

unsigned long rh = srect->h, rw = srect->w;

for(unsigned long y = 0; y < rh; ++y)
{
	for(unsigned long x = 0; x < rw; ++x)
	{
		register unsigned long t1, t2, t3;

		//Compute the raw sums for each of R, G, and B
		t1 = (unsigned long) pSrc[0] + (unsigned long) pDst[0];
		t2 = (unsigned long) pSrc[1] + (unsigned long) pDst[1];
		t3 = (unsigned long) pSrc[2] + (unsigned long) pDst[2];

		//Branchless saturation: (t >> 8) is 1 iff t > 255, so OR-ing in
		//(t >> 8) * 255 forces the low byte to 255 on overflow
		pDst[0] = (unsigned char)(t1 | ((t1 >> 8) * 255));
		pDst[1] = (unsigned char)(t2 | ((t2 >> 8) * 255));
		pDst[2] = (unsigned char)(t3 | ((t3 >> 8) * 255));

		//Advance the pointers to the next pixel
		pDst+=3;
		pSrc+=3;
	}
	//Advance the pointers to the next line
	pDst += iPadDst;
	pSrc += iPadSrc;
}

Quote:
Original post by BeanDog
Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once?


Well, there is PADDUSB, which performs 8 8-bit saturated adds in parallel. (As well as an SSE2 version which does 16 8-bit adds.)

Since we are micro-optimizing things, did you try using a 32-bit pixel structure? When I was working on our film scanner 2 years ago, I was using something along the lines of
struct pixel
{
	union
	{
		struct
		{
			unsigned char r, g, b, a;
		};
		unsigned int rgba;
	};
};

It was mostly for the sake of writing readable code, but 32-bit aligned reads/writes + structures that are easily manipulated by the compiler == good generated code, so it might be a good idea to test it.

Another idea is to remove your "x" - you can do the loop on pDst:
  // was "for(int x = 0; x < srect->w; x++)"
pDstEnd = pDst + dst->pitch;
for (; pDst < pDstEnd; pDst+=3, pSrc+=3) {
// blah blah
}


Of course, this is very similar to
pixel *beginsrc = reinterpret_cast<pixel*>(pSrc);
pixel *endsrc = beginsrc + srect->w;
pixel *begindst = reinterpret_cast<pixel*>(pDst);
// First1 , Last1 , First2 , Out , FuncObject
std::transform(beginsrc, endsrc, begindst, begindst, tranformator);

(with a suitable transformator function object (i.e., not a function pointer), of course).
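
For example, a sketch of such a transformator, reusing the 512-entry SATURATE table suggested earlier (this assumes the 32-bit pixel struct above):

struct AddSaturate
{
	pixel operator()(const pixel& s, const pixel& d) const
	{
		pixel out;
		out.r = SATURATE[s.r + d.r];
		out.g = SATURATE[s.g + d.g];
		out.b = SATURATE[s.b + d.b];
		out.a = d.a; //leave destination alpha untouched
		return out;
	}
};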

HTH,

I've upgraded my routine to use MMX (based on an earlier post on GameDev I found via Google). It seems to work OK. As this is my first attempt at assembly code ever, please help me figure out what I can do to speed this bad boy up. (Note the surfaces are 32-bpp now, so two pixels fit in one MMX register.) It's now down to about 4.5ms to fill an 800x600 window via a few dozen calls to the function on my Athlon X2 3800+, whereas the code in my original post took about 11ms. I have a feeling I can do something to get this code going in both pipelines, but I don't even know where to start.

unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*4 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*4;

unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*4 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*4;

unsigned int len = (unsigned int)srect->w;

for(int y = 0; y < srect->h; y++)
{
__asm
{
mov esi,pSrc //esi = pointer to beginning of source line
mov edi,pDst //edi = pointer to beginning of dest line

mov ecx,len
mov edx,ecx

and edx,1 //edx = parity (1 for odd # of pixels)
shr ecx,1 //ecx = number of 2-pixel groups to do
jz checkoddpixel //guard: width < 2 means no pairs, skip straight to the odd-pixel check

addblitloop:
movq mm0,[esi] //load 2 pixels
movq mm1,[edi] //load 2 pixels

paddusb mm0,mm1 //add them (all 8 bytes at once)
movq [edi],mm0 //store 2 pixels

add esi,8 //increase loop pointers
add edi,8
dec ecx //Count down remaining 2-pixel pairs
jnz addblitloop

checkoddpixel:
cmp edx,0
jz skipfinishadd //If there was no odd pixel, skip the following code.

movd mm0,[esi] //Copy the last pixel (note movd instead of movq)
movd mm1,[edi]
paddusb mm0,mm1
movd [edi],mm0

skipfinishadd:
}
pSrc += src->pitch;
pDst += dst->pitch;
}

__asm
{
emms
}

Just my two cents.
It's not guaranteed to make a difference but anyway...

If the width of the image is always a multiple of 4 (an even number of 2-pixel groups), you could unroll the loop to do 4 pixels at a time. Something like this:


movq mm0,[esi] //load 2 pixels
movq mm1,[edi] //load 2 pixels
movq mm2,[esi + 8] //load 2 more pixels
movq mm3,[edi + 8] //load 2 more pixels

paddusb mm0,mm1 //add them (all 8 bytes at once)
paddusb mm2,mm3 //add them (all 8 bytes at once)
movq [edi],mm0 //store 2 pixels
movq [edi + 8],mm2 //store 2 more pixels

add esi,16 //advance by 4 pixels (16 bytes) per iteration now
add edi,16




Also you could try loading the data into the cache using prefetch.

If this is completely wrong, someone please correct me.

HellRaiZer

Quote:
Original post by phantom
ewww, inline assembler... use the intrinsics instead [smile]


Agreed [smile]

If you want to get the best performance out of this, start by optimizing *everything* you can in plain C++.

Then use intrinsics to enable MMX or SSE.

And then, if you think there's something to gain by it, you can consider using assembly.

But in that order, please. :)

I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.

HellRaiZer: I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?

Quote:
Original post by BeanDog
I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.


Maybe so, but it has two advantages:

1) the compiler is still doing the bulk of the work; this includes managing registers and reordering code.
2) It'll work on x64 targets, whereas your inline assembly will cause the compiler to cry and reject your code.

Once you get used to it it's not that bad; it's basically all the advantages of assembler with the added bonus of the compiler helping you. (I played with SSE for about a week, and the VC8 compiler is indeed very good when it comes to optimising things with intrinsics.)
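
For example, here is a sketch of the additive blit inner loop written with SSE2 intrinsics; _mm_adds_epu8 compiles to the SSE2 form of PADDUSB, doing 16 saturated byte adds (4 pixels) per instruction. The function name and the assumption of 32-bpp pixels with a width that's a multiple of 4 are mine; the unaligned loads mean the surfaces don't need 16-byte alignment:

#include <emmintrin.h> //SSE2 intrinsics

void AddBlitRow(unsigned char *dst, const unsigned char *src, int widthInPixels)
{
	for(int x = 0; x < widthInPixels; x += 4)
	{
		__m128i s = _mm_loadu_si128((const __m128i*)(src + x*4)); //load 4 source pixels
		__m128i d = _mm_loadu_si128((const __m128i*)(dst + x*4)); //load 4 dest pixels
		_mm_storeu_si128((__m128i*)(dst + x*4), _mm_adds_epu8(s, d)); //saturated add, store
	}
}

The compiler assigns the registers and schedules the loads and stores itself, and the same code builds for x64.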

Quote:

I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?


Unfortunately, no. There may be several reasons for that; instruction pairing and branch prediction are two candidates.

Try adding these two lines at the beginning of the assembly loop (the inner loop):


prefetch [esi + 0x100]
prefetch [edi + 0x100]


Play around with the constant to find the optimal value, and see if it helps a little.

What phantom said about intrinsics sounds reasonable. I haven't written any code using them, but if the compiler can take care of register traffic and instruction scheduling, it sounds like a good choice.

One last thing. I don't code in assembly (mostly I read assembly), but I learned it (or rather, am still learning it) because I wanted to know how things work at a lower level. That said, anything in my post may be completely wrong, so... :)

HellRaiZer

Quote:
Original post by phantom
1) the compiler is still doing the bulk of the work; this includes managing registers and reordering code.
2) It'll work on x64 targets, whereas your inline assembly will cause the compiler to cry and reject your code.

3) you're able to mix SIMD instructions with ordinary C++-level optimizations, which is damn hard once you go assembly...

Hi,

I have a project at SourceForge called libSIMDx86 that provides functions very similar to this. I suggest anyone considering writing their own vector/matrix/quaternion library check it out first.

http://simdx86.sourceforge.net


My question is this: MMX? That instruction set is quite dated. Shoot for SSE2 if you can; otherwise try MMX+SSE, else plain MMX.

Here is some two-cent SSE2 code... it processes 32 pixels per loop iteration, so the images must be 16-byte aligned. It is possible to relax the 16-byte alignment requirement, but then you process half the data per loop.

int Remainder = NumPixelsToProcess % 32;
if(Remainder != 0)
{
	//Process extra pixels here, then skip past them

	SourcePtr += Remainder;
	DestPtr += Remainder;
}

__asm {

//setup esi = source, edi = dest, ecx = NumPixels >> 5

Process32MorePixels:
prefetchnta [esi+128] //fetch two cache lines ahead
prefetchnta [edi+128] //fetch two cache lines ahead

movdqa xmm0, [edi] //4
movdqa xmm1, [edi+16] //4
movdqa xmm2, [edi+32] //4
movdqa xmm3, [edi+48] //4, first 64 byte cache line
movdqa xmm4, [edi+64] //4
movdqa xmm5, [edi+80] //4
movdqa xmm6, [edi+96] //4
movdqa xmm7, [edi+112] //4, second 64 byte cache line

paddusb xmm0, [esi] //4
paddusb xmm1, [esi+16] //4
paddusb xmm2, [esi+32] //4
paddusb xmm3, [esi+48] //4, first 64 byte cache line, 16 pixels
paddusb xmm4, [esi+64] //4
paddusb xmm5, [esi+80] //4
paddusb xmm6, [esi+96] //4
paddusb xmm7, [esi+112] //4, second 64 byte cache line, 32 pixels

//streaming store
movntdq [edi], xmm0
movntdq [edi+16], xmm1
movntdq [edi+32], xmm2
movntdq [edi+48], xmm3
movntdq [edi+64], xmm4
movntdq [edi+80], xmm5
movntdq [edi+96], xmm6
movntdq [edi+112], xmm7

add esi, 128
add edi, 128
dec ecx
jnz Process32MorePixels
sfence

//Cleanup
}


Even if you can't do SSE2 or don't have 16-byte aligned images, this MMX code will work too:



int Remainder = NumPixelsToProcess % 16;
if(Remainder != 0)
{
	//Process extra pixels manually, then skip past them

	SourcePtr += Remainder;
	DestPtr += Remainder;
}

__asm {

//setup esi = source, edi = dest, ecx = NumPixels >> 4

Process16MorePixels:
movq mm0, [edi] //2
movq mm1, [edi+ 8] //2
movq mm2, [edi+16] //2
movq mm3, [edi+24] //2
movq mm4, [edi+32] //2
movq mm5, [edi+40] //2
movq mm6, [edi+48] //2
movq mm7, [edi+56] //2, end of 64 byte cache line

paddusb mm0, [esi] //2
paddusb mm1, [esi+ 8] //2
paddusb mm2, [esi+16] //2
paddusb mm3, [esi+24] //2
paddusb mm4, [esi+32] //2
paddusb mm5, [esi+40] //2
paddusb mm6, [esi+48] //2
paddusb mm7, [esi+56] //2, end of 64 byte cache line

//store
movq [edi], mm0
movq [edi+ 8], mm1
movq [edi+16], mm2
movq [edi+24], mm3
movq [edi+32], mm4
movq [edi+40], mm5
movq [edi+48], mm6
movq [edi+56], mm7

add esi, 64
add edi, 64
dec ecx
jnz Process16MorePixels

//Cleanup
}




If you are sure that the processor has MMX and SSE (but not SSE2), you can change the final set of MOVQ instructions into MOVNTQ and add back the PREFETCHNTA instructions. This will probably give a performance boost. And don't forget: PROFILE! Don't just copy/paste code without knowing whether it helps. Maybe the SSE2 version is slower than the plain MMX one, maybe not. TRY it before you consider one better than the other.
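
For instance, a minimal Win32 timing harness sketch (the names here are illustrative, not from any library):

#include <windows.h>

//Returns average seconds per call for one candidate blit routine
double TimeBlit(void (*blit)(void), int iterations)
{
	LARGE_INTEGER freq, t0, t1;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&t0);
	for(int i = 0; i < iterations; ++i)
		blit(); //the routine under test
	QueryPerformanceCounter(&t1);
	return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart) / iterations;
}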

I (personally) suggest that you use my library, libSIMDx86, as it has been profiled quite a bit by me. It isn't perfect, but it gets closer each release. Send any questions to baggett.patrick@gmail.com
