Help optimize my s/w additive blending [Edit: MMX assembly now]

20 comments, last by Figgles 17 years, 10 months ago
my mistake, it was supposed to be (>> 8)
Quote:Original post by BeanDog
Any ideas out there for speeding this up further? Am I missing some sweet MMX instruction for doing 3 one-byte capped additions at once?


Well, there is PADDUSB, which performs 8 8-bit saturated adds in parallel. (As well as an SSE2 version which does 16 8-bit adds.)
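For anyone who has not met saturated addition before, here is a minimal scalar sketch of what PADDUSB computes in each of its 8 byte lanes. The names `add_saturate_u8`, `Bytes8` and `paddusb_model` are made up for illustration; this only models the instruction's behaviour, it is not how you would invoke it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One lane: unsigned 8-bit add that clamps at 255 instead of wrapping around.
inline std::uint8_t add_saturate_u8(std::uint8_t a, std::uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Eight lanes: a scalar model of one PADDUSB on two 64-bit MMX registers.
using Bytes8 = std::array<std::uint8_t, 8>;

inline Bytes8 paddusb_model(const Bytes8& a, const Bytes8& b) {
    Bytes8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = add_saturate_u8(a[i], b[i]);
    }
    return out;
}

// Example: 200 + 100 clamps to 255, while 10 + 20 stays 30.
inline void paddusb_model_example() {
    Bytes8 r = paddusb_model(Bytes8{{200, 10, 0, 0, 0, 0, 0, 0}},
                             Bytes8{{100, 20, 0, 0, 0, 0, 0, 0}});
    assert(r[0] == 255 && r[1] == 30);
}
```

With 32-bit pixels, 8 lanes means two whole RGBA pixels per instruction, which is why the 4-bytes-per-pixel layout suits it better than packed 24-bit RGB.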
Since we are micro-optimizing, did you try using a 32-bit pixel structure? When I was working on our film scanner 2 years ago, I was using something along the lines of
struct pixel {
  union {
    struct {
      unsigned char r, g, b, a;
    };
    unsigned int rgba;
  };
};

It was mostly for the sake of writing readable code, but 32-bit aligned reads/writes + structures that are easily manipulated by the compiler == good generated code, so it might be a good idea to test it.

Another idea is to remove your "x" - you can do the loop on pDst:
  // was "for(int x = 0; x < srect->w; x++)"
  pDstEnd = pDst + srect->w * 3;  // end of this row's pixels (3 bytes each); dst->pitch would also cover padding
  for (; pDst < pDstEnd; pDst+=3, pSrc+=3) {
    // blah blah
  }


Of course, this is very similar to
pixel *beginsrc = reinterpret_cast<pixel*>(pSrc);
pixel *endsrc = beginsrc + srect->w;
pixel *begindst = reinterpret_cast<pixel*>(pDst);
//             First1,   Last1,  First2,   Out,      FuncObject
std::transform(beginsrc, endsrc, begindst, begindst, transformator);

(with the correct transformator object (ie, not a function pointer), of course).
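A minimal sketch of what such a transformator could look like, assuming a plain 4-byte pixel and saturated addition per channel. `Pixel`, `AddSaturate` and `blend_row` are hypothetical names for this example, not something from the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 32-bit pixel; a plain struct is enough for this sketch.
struct Pixel {
    std::uint8_t r, g, b, a;
};

// Function object (not a function pointer), so the compiler can inline the call.
struct AddSaturate {
    static std::uint8_t clamp_add(std::uint8_t x, std::uint8_t y) {
        unsigned sum = static_cast<unsigned>(x) + y;
        return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
    Pixel operator()(const Pixel& s, const Pixel& d) const {
        return Pixel{clamp_add(s.r, d.r), clamp_add(s.g, d.g),
                     clamp_add(s.b, d.b), clamp_add(s.a, d.a)};
    }
};

// Blend one row of `width` pixels from src into dst (dst is both input and output).
inline void blend_row(const Pixel* src, Pixel* dst, std::size_t width) {
    assert(src != nullptr && dst != nullptr);
    std::transform(src, src + width, dst, dst, AddSaturate());
}
```

Calling `blend_row` once per scanline, and advancing both pointers by their pitch in between, would replace the hand-written inner loop.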

HTH,
I've upgraded my routine to use MMX (based on an earlier post on GameDev I found via Google). It seems to work OK. As this is my first attempt at assembler code ever, please help me figure out what I can do to speed this bad boy up. It's now down to about 4.5ms to fill an 800x600 window via a few dozen calls to the function on my Athlon X2 3800+, while the code in my original post took about 11ms. I have a feeling I can do something to get this code going in both pipelines, but I don't even know where to start.
unsigned char *pDst = (unsigned char*)dst->pixels + drect->x*4 + drect->y * dst->pitch;
int iPadDst = dst->pitch - srect->w*4;
unsigned char *pSrc = (unsigned char*)src->pixels + srect->x*4 + srect->y * src->pitch;
int iPadSrc = src->pitch - srect->w*4;
unsigned int len = (unsigned int)srect->w;
for(int y = 0; y < srect->h; y++)
{
	__asm
	{
		mov esi,pSrc	//esi = pointer to beginning of source line
		mov edi,pDst	//edi = pointer to beginning of dest line
		mov ecx,len
		mov edx,ecx
		and edx,1		//edx = parity (1 for odd # of pixels)
		shr ecx,1		//ecx = number of 2-pixel groups to do
		jz addblitodd	//width < 2: no full pair, so skip the loop (dec ecx would wrap otherwise)
addblitloop:
		movq mm0,[esi]	//load 2 pixels
		movq mm1,[edi]	//load 2 pixels
		paddusb mm0,mm1 //add them (all 8 bytes at once)
		movq [edi],mm0	//store 2 pixels
		add esi,8		//increase loop pointers
		add edi,8
		dec ecx			//Count down remaining 2-pixel pairs
		jnz addblitloop
addblitodd:
		cmp edx,0
		jz skipfinishadd //If there was no odd pixel, skip the following code.
		movd mm0,[esi]	//Copy the last pixel (note movd instead of movq)
		movd mm1,[edi]
		paddusb mm0,mm1
		movd [edi],mm0
skipfinishadd:
	}
	pSrc += src->pitch;
	pDst += dst->pitch;
}
__asm
{
	emms
}
ewww, inline assembler... use the intrinsics instead [smile]
Just my two cents.
It's not guaranteed to make a difference but anyway...

If the width of the image is always a multiple of 4 you could unroll the loop to do 4 pixels at a time. Something like this:

		movq mm0,[esi]	//load 2 pixels
		movq mm1,[edi]	//load 2 pixels
		movq mm2,[esi + 8]	//load 2 pixels
		movq mm3,[edi + 8]	//load 2 pixels
		paddusb mm0,mm1 //add them (all 8 bytes at once)
		paddusb mm2,mm3 //add them (all 8 bytes at once)
		movq [edi],mm0	//store 2 pixels
		movq [edi + 8],mm2	//store 2 pixels


Also you could try loading the data into the cache using prefetch.

If this is completely wrong, someone please correct me.

HellRaiZer
Quote:Original post by phantom
ewww, inline assembler... use the intrinsics instead [smile]


Agreed [smile]

If you want to get the best performance out of this, start by optimizing *everything* you can in plain C++.

Then use intrinsics to enable MMX or SSE.

And then, if you think there's something to gain by it, you can consider using assembly.

But in that order, please. :)
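As one example of how far plain C++ can go here, below is a sketch of a well-known bit trick (SWAR, "SIMD within a register") that does four capped byte additions in one 32-bit word. It is not from the original routine, and `add_saturate_u8x4` is a hypothetical helper name; treat it as something to benchmark, not a guaranteed win.

```cpp
#include <cassert>
#include <cstdint>

// Saturating add of four packed unsigned bytes held in one 32-bit word.
inline std::uint32_t add_saturate_u8x4(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t kHigh = 0x80808080u;  // top bit of every byte
    // Per-byte sum without letting carries cross byte boundaries.
    std::uint32_t sum = ((a & ~kHigh) + (b & ~kHigh)) ^ ((a ^ b) & kHigh);
    // A carry out of a byte's top bit marks a lane that overflowed.
    std::uint32_t carry = ((a & b) | ((a | b) & ~sum)) & kHigh;
    // Expand each carry bit into a 0xFF byte mask and clamp those lanes.
    return sum | ((carry >> 7) * 0xFFu);
}

// Example lanes (high to low): 0x10+0x20, 0xF0+0x20, 0x80+0x80, 0x01+0x7F.
inline void add_saturate_u8x4_example() {
    assert(add_saturate_u8x4(0x10F08001u, 0x2020807Fu) == 0x30FFFF80u);
}
```

This processes one whole 32-bit pixel per call with no branches, so it only applies if the surface uses 4 bytes per pixel.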
I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.

HellRaiZer: I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?
Quote:Original post by BeanDog
I'd never even heard of intrinsics before. I looked at some sample intrinsics code on CodeProject and elsewhere via Google, and it looks messier and more confusing to me than a few lines of assembly code.


Maybe so, but it has two advantages:

1) The compiler is still doing the bulk of the work; this includes managing registers and reordering code.
2) It'll work on x64 targets, whereas your inline assembly will cause the compiler to cry and reject your code.

Once you get used to it, it's not that bad; it's basically all the advantages of assembler with the added bonus of the compiler helping you (I played with SSE for about a week, and the VC8 compiler is indeed very good when it comes to optimising things with intrinsics).
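To give an idea of what the intrinsics route looks like, here is a rough sketch of the per-row blend using SSE2 (16 bytes per step) instead of MMX. `add_blend_row` and the `BLEND_HAS_SSE2` macro are names invented for this example; the intrinsics themselves (`_mm_loadu_si128`, `_mm_adds_epu8`, `_mm_storeu_si128`) come from `<emmintrin.h>`. The preprocessor guard is only there so the sketch still compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define BLEND_HAS_SSE2 1
#else
#define BLEND_HAS_SSE2 0
#endif

// Additively blend `count` bytes of src into dst with saturation.
inline void add_blend_row(std::uint8_t* dst, const std::uint8_t* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    std::size_t i = 0;
#if BLEND_HAS_SSE2
    // 16 bytes (4 RGBA pixels) per iteration; unaligned loads work for any pitch.
    for (; i + 16 <= count; i += 16) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(s, d));
    }
#endif
    // Scalar tail (and the whole row on targets without SSE2).
    for (; i < count; ++i) {
        unsigned sum = static_cast<unsigned>(dst[i]) + src[i];
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

You would call it once per scanline with `count = srect->w * 4`. Because it uses the XMM registers and not the MMX ones, no `emms` is needed afterwards.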

Quote:
I tried that exact code before I posted my assembly above, but I got zero speed increase. Any idea why?


Unfortunately, no. There may be several reasons for that; instruction pairing and branch prediction are two candidates.

Try adding these two lines at the beginning of the assembly loop (the inner loop):

prefetch [esi + 0x100]
prefetch [edi + 0x100]


Play around with the constant to find the optimal value.
See if this helps a little.
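If you end up going the intrinsics way, the same experiment could look roughly like the sketch below. It uses the SSE `_mm_prefetch` intrinsic from `<xmmintrin.h>` as a stand-in for the `prefetch` mnemonic above, so it is not an exact equivalent. `add_blend_row_prefetch`, `BLEND_PREFETCH` and `kPrefetchDistance` are made-up names, and the 64-byte step is an assumed cache line size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // _mm_prefetch
#define BLEND_PREFETCH(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
#define BLEND_PREFETCH(p) ((void)0)  // no-op where the intrinsic is unavailable
#endif

// Hypothetical tuning knob: how many bytes ahead to prefetch (0x100 as suggested above).
const std::size_t kPrefetchDistance = 0x100;

inline void add_blend_row_prefetch(std::uint8_t* dst, const std::uint8_t* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        // One hint per assumed 64-byte line, and only while the target is inside the row.
        if ((i & 63u) == 0 && i + kPrefetchDistance < count) {
            BLEND_PREFETCH(src + i + kPrefetchDistance);
            BLEND_PREFETCH(dst + i + kPrefetchDistance);
        }
        unsigned sum = static_cast<unsigned>(dst[i]) + src[i];
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

A prefetch is only a hint, so the result is identical with or without it; the only thing to measure is the timing as you vary `kPrefetchDistance`.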

What phantom said about intrinsics sounds reasonable. I haven't written any code using them, but if the compiler can take care of register traffic and instruction scheduling, it sounds like a good choice.

One last thing. I don't code in assembly (mostly I read assembly), but I learned it (or better, am still learning it) because I wanted to know how things work at a lower level. That said, anything in my post may be completely wrong, so... :)

HellRaiZer

