# Help needed for MMX alpha blending using inline assembly


JNT    148
Greetings, fellow programmers! I’m looking for some help with MMX alpha blending (color blending) using inline assembly in a C/C++ environment. I’m pretty much new to assembly language, so not surprisingly I hit a snag pretty close to the finish line. In C++ syntax, I’m using the following algorithm for blending colors:
dst_component += (src_alpha * (src_component - dst_component) >> 8)

… for each component in ‘dst’. Just for the record: the color components are stored in 32-bit unsigned integers (DWORD), one byte per component. Using the MMX unpack instructions, I unpack every component into a WORD inside a QWORD. The following code is what I’ve come up with:
unsigned int ZERO = 0;
__asm{

movd	  mm0,dst  ; mm0 = dst
punpcklbw mm0,ZERO ; unpack mm0
movd	  mm1,src  ; mm1 = src
punpcklbw mm1,ZERO ; unpack mm1

psubusw	  mm1,mm0  ; mm1 -= mm0
; multiply by alpha – PROBLEM HERE
psrlw	  mm1,8	   ; mm1 >>= 8

packuswb  mm1,ZERO ; pack mm1
movd	  dst,mm1  ; dst = mm1

}

My main problem for the time being is how to insert the ‘src’ alpha into a 64-bit unsigned integer (QWORD) four times, as one 16-bit unsigned integer (WORD) each, like so:

    ---
    Byte 1: src alpha } Word 1
    Byte 2: 0
    Byte 3: src alpha } Word 2
    Byte 4: 0
    Byte 5: src alpha } Word 3
    Byte 6: 0
    Byte 7: src alpha } Word 4
    Byte 8: 0
    ---

Quite frankly, I have no idea how to do this with existing MMX unpack instructions. I guess I could do it in C++, but I can’t find any elegant/efficient way to do it there either.

Also, I’m not sure about the MMX multiply instructions. Do you have to make two multiplications per 64-bit value in order to get a correct result, i.e. one multiplication per DWORD?

I don’t even know if I’m going about this entire problem correctly, so feel free to give me suggestions! Would much appreciate your help!
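For what it's worth, here is one way the layout above could be built in plain C++. It is only a sketch: the function names are made up for illustration, and it assumes the alpha sits in the top byte of the 32-bit pixel. Multiplying the alpha byte by 0x0001000100010001 copies it into each 16-bit lane. On the multiplication question, PMULLW (as I understand its documented behavior, modeled below in scalar code) multiplies the four words independently and keeps the low 16 bits of each product. Since 255 * 255 = 65025 still fits in 16 bits, a single multiply should cover the whole QWORD.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Copy an 8-bit alpha into all four 16-bit lanes of a 64-bit value.
// Assumption: alpha lives in the top byte of the 32-bit pixel.
u64 broadcast_alpha(std::uint32_t src) {
    const u64 alpha = src >> 24;
    return alpha * 0x0001000100010001ULL;
}

// Scalar model of PMULLW: four independent 16x16 multiplies,
// keeping the low 16 bits of each product.
u64 pmullw(u64 a, u64 b) {
    u64 r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const u64 x = (a >> (16 * lane)) & 0xFFFF;
        const u64 y = (b >> (16 * lane)) & 0xFFFF;
        r |= ((x * y) & 0xFFFF) << (16 * lane);
    }
    return r;
}

// Small self-check of both helpers.
void check_broadcast() {
    // Alpha 0x80 ends up in the low byte of every word.
    assert(broadcast_alpha(0x80112233u) == 0x0080008000800080ULL);
    // 0x00FF * 0x0080 = 0x7F80 in every lane, no overflow.
    assert(pmullw(0x00FF00FF00FF00FFULL, 0x0080008000800080ULL)
           == 0x7F807F807F807F80ULL);
}
```

The resulting 64-bit value could then be loaded into an MMX register with a single MOVQ, though doing the replication in registers avoids the extra memory round trip.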

DonnieDarko    251
To get the src alpha channel into a usable format in mm2, when mm1 holds the unpacked src rgba, you can do:
movq	  mm2, mm1
punpcklbw mm2, mm2
punpcklbw mm2, mm2
punpcklbw mm2, ZERO

I would suggest that you replace the ZERO int with a pxor'ed register to avoid unnecessary memory accesses. Also it might be faster to do all your mem loads as the first thing to avoid stalls for the following punpcklbw instructions.
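A scalar C++ model of the unpack instructions can make it easier to see which byte the sequence above ends up replicating. This is only a sketch: the helpers are my own models of PUNPCKLBW, PUNPCKHWD and PUNPCKHDQ as I understand their documented behavior, and `high_word_broadcast` is a hypothetical variant, not something proposed in the thread. In this model the three-PUNPCKLBW sequence replicates the lowest byte of the pixel, so it yields the alpha only if alpha is stored in byte 0. If alpha sits in the top byte (word 3 after unpacking), interleaving the high words instead appears to do the job.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Model of PUNPCKLBW dst, src: interleave the low four bytes of each,
// with the dst bytes landing in the even byte positions.
u64 punpcklbw(u64 dst, u64 src) {
    u64 r = 0;
    for (int i = 0; i < 4; ++i) {
        r |= ((dst >> (8 * i)) & 0xFF) << (16 * i);
        r |= ((src >> (8 * i)) & 0xFF) << (16 * i + 8);
    }
    return r;
}

// Model of PUNPCKHWD dst, src: interleave the two high words of each.
u64 punpckhwd(u64 dst, u64 src) {
    const u64 d2 = (dst >> 32) & 0xFFFF, d3 = (dst >> 48) & 0xFFFF;
    const u64 s2 = (src >> 32) & 0xFFFF, s3 = (src >> 48) & 0xFFFF;
    return d2 | (s2 << 16) | (d3 << 32) | (s3 << 48);
}

// Model of PUNPCKHDQ dst, src: high dword of dst, then high dword of src.
u64 punpckhdq(u64 dst, u64 src) {
    return (dst >> 32) | (src & 0xFFFFFFFF00000000ULL);
}

// The sequence from the post: MOVQ + three PUNPCKLBW.
u64 low_byte_broadcast(u64 unpacked) {
    u64 m = punpcklbw(unpacked, unpacked);
    m = punpcklbw(m, m);
    return punpcklbw(m, 0);
}

// Hypothetical variant for alpha in word 3: PUNPCKHWD + PUNPCKHDQ.
u64 high_word_broadcast(u64 unpacked) {
    const u64 m = punpckhwd(unpacked, unpacked);
    return punpckhdq(m, m);
}

void check_unpack() {
    // Pixel 0x80112233 unpacked to words: 0x0033, 0x0022, 0x0011, 0x0080.
    const u64 unpacked = punpcklbw(0x80112233ULL, 0);
    assert(unpacked == 0x0080001100220033ULL);
    assert(low_byte_broadcast(unpacked) == 0x0033003300330033ULL);   // byte 0
    assert(high_word_broadcast(unpacked) == 0x0080008000800080ULL);  // word 3
}
```

Which variant is appropriate depends on where the pixel format keeps its alpha, which the thread does not pin down.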

JNT    148
Very much appreciate the help! A few questions though:

1) Exactly what type of register should I use instead of the ZERO variable? A 32-bit general purpose register, or a 64-bit MMX register? As stated earlier, I'm pretty much an ASM newbie, but judging from the aforementioned PXOR instruction, you're expecting me to use an MMX register, right?

2) Is my way of implementing the algorithm efficient enough, or should I look to other implementations (or maybe other algorithms altogether)?

DonnieDarko    251
Quote:
 Original post by JNT
 1) Exactly what type of register should I use instead of the ZERO variable? A 32-bit general purpose register, or a 64-bit MMX register? As stated earlier, I'm pretty much an ASM newbie, but judging from the aforementioned PXOR instruction, you're expecting me to use an MMX register, right?

Yes - xor'ing a register with itself is an easy way to set all bits of that register to 0.

Quote:
 2) Is my way of implementing the algorithm efficient enough, or should I look to other implementations (or maybe other algorithms altogether)?

It looks right to me, and I think you will get a worthwhile speedup compared to the C++ version that works on each color channel separately. I guess you are going to use this on an array of colors, so you could read two colors in one instruction, move them into separate registers, and then also write two colors at once.

If you want to go further and use SSE instructions, you can double that and read/write four colors at once. I'm not sure that it's worth it though, since I *think* reading 128 bits from memory takes twice as many cycles as reading 64 bits.

The algorithm requires only a few instructions, so after a certain amount of optimization you are going to be memory bandwidth limited anyway. But please time your implementations and report the findings back.

JNT    148
Quote:
 Original post by DonnieDarko
 But please time your implementations and report the findings back.

I wasn't all too happy about the results. With the compiler set to debug (MSVC compiler) the performance increase is about 25% - which is all well and good - but with the compiler set to release the performance increase is only one or two percent. Maybe I'm doing something wrong? The full code of the test program is posted below:

#include <iostream>
#include <ctime>

typedef unsigned int uint32;
typedef unsigned char byte;

void CVer( void );
void AsmVer( void );

int main(int argc, char ** argv)
{
  std::cout << "Algorithm iterations per second" << std::endl;
  std::cout << "---" << std::endl;
  CVer();
  AsmVer();
  std::cout << "---" << std::endl;
  std::cout << "program terminated" << std::endl;
  std::cin.get();
  return 0;
}

void CVer( void )
{
  std::cout << "Standard C/C++: ";
  const uint32 RED = 0, GREEN = 1, BLUE = 2, ALPHA = 3; //let's just assume this is the correct byte order
  uint32 it = 0;
  uint32 dst = 0xff000000, src = 0xffffffff;
  byte * dst_ptr = reinterpret_cast<byte*>(&dst), *src_ptr = reinterpret_cast<byte*>(&src);
  long _time = clock();
  while (clock() - _time < CLOCKS_PER_SEC){
    dst_ptr[RED] += (src_ptr[ALPHA] * (src_ptr[RED] - dst_ptr[RED]) >> 8);
    dst_ptr[GREEN] += (src_ptr[ALPHA] * (src_ptr[GREEN] - dst_ptr[GREEN]) >> 8);
    dst_ptr[BLUE] += (src_ptr[ALPHA] * (src_ptr[BLUE] - dst_ptr[BLUE]) >> 8);
    dst_ptr[ALPHA] += (src_ptr[ALPHA] * (src_ptr[ALPHA] - dst_ptr[ALPHA]) >> 8);
    ++it;
  }
  std::cout << it << std::endl;
}

void AsmVer( void )
{
  std::cout << "Assembly using MMX: ";
  uint32 it = 0;
  uint32 dst = 0xff000000, src = 0xffffffff;
  long _time = clock();
  while (clock() - _time < CLOCKS_PER_SEC){
    __asm{
      ; clear a register for zero use
      pxor	mm3, mm3
      ; unpack src and dst
      movd	mm0, dst
      movd	mm1, src
      punpcklbw	mm0, mm3
      punpcklbw	mm1, mm3
      ; unpack src alpha
      movq	mm2, mm1
      punpcklbw	mm2, mm2
      punpcklbw	mm2, mm2
      punpcklbw	mm2, mm3
      ; arithmetic
      psubusw	mm1, mm0
      pmullw	mm1, mm2 ; not sure about the multiplication
      pmulhuw	mm1, mm2 ; not sure about the multiplication
      psrlw	mm1, 8
      paddusw	mm1, mm0
      ; assign new value to dst
      packuswb	mm1, mm3
      movd	dst, mm1
      ; increment iterator
      inc	it
    }
  }
  __asm { emms }
  std::cout << it << std::endl;
}
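On the "not sure about the multiplication" comments: a per-channel scalar model may help compare what the posted sequence computes with the C/C++ formula. This is my reading of the instruction semantics rather than a measured result, and `channel_single_mul` is a hypothetical variant, not code from the thread. In this model, PMULLW followed by PMULHUW multiplies by alpha twice, and the high half of the second product is always below 256, so the following PSRLW 8 leaves zero and dst comes out unchanged. PSUBUSW also clamps negative differences to zero, which the C version does not. A wrapping subtract with a single PMULLW matches the C formula modulo 256.

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;

// The C/C++ formula for one channel, truncated to a byte as in the test
// program. Right-shifting a negative int is arithmetic on mainstream
// compilers (and guaranteed since C++20).
u32 channel_c(u32 src, u32 dst, u32 alpha) {
    const int delta = (static_cast<int>(alpha) *
                       (static_cast<int>(src) - static_cast<int>(dst))) >> 8;
    return static_cast<u32>(static_cast<int>(dst) + delta) & 0xFF;
}

// Model of the posted sequence for one 16-bit lane.
u32 channel_posted(u32 src, u32 dst, u32 alpha) {
    u32 x = src > dst ? src - dst : 0;  // PSUBUSW: unsigned saturating subtract
    x = (x * alpha) & 0xFFFF;           // PMULLW: low 16 bits of the product
    x = (x * alpha) >> 16;              // PMULHUW: high 16 bits of a second product
    x >>= 8;                            // PSRLW 8
    x += dst;                           // PADDUSW (no saturation in this range)
    return x > 255 ? 255 : x;           // PACKUSWB: clamp to a byte
}

// Hypothetical variant: PSUBW, one PMULLW, PSRLW 8, PADDW, keep the low byte.
u32 channel_single_mul(u32 src, u32 dst, u32 alpha) {
    u32 x = (src - dst) & 0xFFFF;       // PSUBW: wrapping 16-bit subtract
    x = (x * alpha) & 0xFFFF;           // PMULLW
    x >>= 8;                            // PSRLW 8
    return (x + dst) & 0xFF;            // PADDW, then mask each word to 0x00FF
}

void check_channels() {
    // Values from the test program: src = 0xFF, dst = 0x00, alpha = 0xFF.
    assert(channel_c(0xFF, 0x00, 0xFF) == 254);
    assert(channel_posted(0xFF, 0x00, 0xFF) == 0);    // dst unchanged
    assert(channel_single_mul(0xFF, 0x00, 0xFF) == 254);
    // A case where src < dst.
    assert(channel_c(0x10, 0x80, 0x80) == 72);
    assert(channel_posted(0x10, 0x80, 0x80) == 0x80); // dst unchanged
    assert(channel_single_mul(0x10, 0x80, 0x80) == 72);
}
```

If this model is right, the MMX loop in the test program was doing no useful blending, which is worth ruling out before reading much into the timings.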

DonnieDarko    251
Hmm, it's been quite some years since I timed MMX-optimized code. At the time there was a worthwhile speed gain, but it might be different with current compilers/CPUs.

With that said, I believe you would get a more reliable test by doing the alpha blend on an image-sized buffer, and you should definitely remove the call to clock() from the while loop. For each version of the alpha blend, take the time once before and once after processing an equal-sized buffer. Also, I'm not sure about the accuracy of clock(), so QueryPerformanceCounter might be a more precise choice for timing.
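A rough sketch of that kind of measurement, with the clock read only before and after the whole run. The assumptions are mine: `std::chrono::steady_clock` stands in as a portable substitute for QueryPerformanceCounter, the scalar `blend_pixel` is a placeholder for whichever version is being timed, alpha is taken from the top byte, and all names are made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference blend of one 32-bit pixel, channel by channel.
// Assumption: alpha is the top byte of src. The arithmetic is done
// modulo 2^16 and then modulo 256, which matches the C formula.
std::uint32_t blend_pixel(std::uint32_t dst, std::uint32_t src) {
    const std::uint32_t a = src >> 24;
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t d = (dst >> shift) & 0xFF;
        const std::uint32_t s = (src >> shift) & 0xFF;
        const std::uint32_t c = (d + (((a * (s - d)) & 0xFFFF) >> 8)) & 0xFF;
        out |= c << shift;
    }
    return out;
}

// Blend src over dst `passes` times and return the elapsed seconds.
// The clock is read twice in total, not once per pixel.
double time_blend(std::vector<std::uint32_t>& dst,
                  const std::vector<std::uint32_t>& src,
                  int passes) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < dst.size() && i < src.size(); ++i)
            dst[i] = blend_pixel(dst[i], src[i]);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling it with, say, two 640 * 480 buffers and a few hundred passes for each implementation would give directly comparable times; the buffer size and pass count here are arbitrary.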

JNT    148
Thanks for the link, S1CA! It'll come in handy!

DonnieDarko:
I'll take your suggestions under advisement and probably make the changes by tomorrow. I'll report back when I have some figures.

JNT    148
"I'll /.../ probably make the changes by tomorrow" kind of seems like a laughable thing to say in hindsight. Law studies have picked up again so I'll simply report back when I'm done implementing the changes. I am underway, though.