Today I downloaded SDL to play around with for little projects. It's just what the doctor ordered for simple 2D games. For some pyrotechnic effects, though, I need additive blending, not just the general (and somewhat slow) alpha blending included in SDL. So I whipped up a simple additive blending blit function, which assumes 24-bpp surfaces. Of course, I had overflow problems: adding two 8-bit channel values can exceed 255 and wrap around. So, rather than put several conditionals in the innermost loop, I pre-compute a 256x256 lookup table (pBlend) of capped 8-bit additions for the function to use. Ignoring the bounds-checking and rectangle clipping, here's the meat of the function:
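//Point at the first pixel of the destination rectangle (3 bytes per pixel)
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*3 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*3;
//Same for the source rectangle, using the source surface's pitch
unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*3 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*3;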
for(int y = 0; y < srect->h; y++)
{
    for(int x = 0; x < srect->w; x++)
    {
        //Look up the new values for each of R, G, and B
        pDst[0] = pBlend[pSrc[0]][pDst[0]];
        pDst[1] = pBlend[pSrc[1]][pDst[1]];
        pDst[2] = pBlend[pSrc[2]][pDst[2]];
        //Advance the pointers to the next pixel
        pDst+=3;
        pSrc+=3;
    }
    //Advance the pointers to the next line
    pDst += iPadDst;
    pSrc += iPadSrc;
}
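For reference, here's roughly how a table like pBlend could be filled in. This is only a minimal sketch: the global array, the [src][dst] index order, and the BuildBlendTable name are assumptions for illustration, not necessarily how my actual setup code looks.

```cpp
// Assumed layout: pBlend[src][dst] holds min(src + dst, 255).
// The array and function names here are hypothetical.
static unsigned char pBlend[256][256];

// Fill the table once at startup with capped (saturating) 8-bit sums.
void BuildBlendTable()
{
    for (int a = 0; a < 256; a++)
    {
        for (int b = 0; b < 256; b++)
        {
            int sum = a + b;  // at most 510, so it fits in an int
            pBlend[a][b] = (unsigned char)(sum > 255 ? 255 : sum);
        }
    }
}
```

The table costs 64 KB, and since addition is symmetric the index order doesn't change the result. After calling BuildBlendTable() once, the inner loop needs no conditionals at all.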
It's really a pain to have that many dereferences in the very inner loop. At first, I did this:
*pDst = pBlend[*pSrc][*pDst];
pSrc++;
pDst++;
three times per pixel, rather than the indexed version shown above. Switching to the indexed version improved speed by about 10% in both debug and release builds (VC++ 2005). Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once?
[Edited by - BeanDog on June 12, 2006 8:26:32 PM]