Help needed for MMX alpha blending using inline assembly

This topic is 3392 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Greetings, fellow programmers! I’m looking for some help with MMX alpha blending (color blending) using inline assembly in a C/C++ environment. I’m pretty much new to assembly language, so not surprisingly I hit a snag pretty close to the finish line. In C++ syntax, I’m using the following algorithm for blending colors:
dst_component += (src_alpha * (src_component - dst_component) >> 8)
… for each component in ‘dst’. Just for the record: the color components are stored in 32-bit unsigned integers (DWORD), one byte per component. Using the MMX unpack instructions, I unpack every component into a WORD inside a QWORD. The following code is what I’ve come up with:
unsigned int ZERO = 0;
__asm{
    movd      mm0,dst  ; mm0 = dst
    punpcklbw mm0,ZERO ; unpack mm0
    movd      mm1,src  ; mm1 = src
    punpcklbw mm1,ZERO ; unpack mm1

    psubusw   mm1,mm0  ; mm1 -= mm0
    ; multiply by alpha - PROBLEM HERE
    psrlw     mm1,8    ; mm1 >>= 8
    paddusw   mm1,mm0  ; mm1 += mm0

    packuswb  mm1,ZERO ; pack mm1
    movd      dst,mm1  ; dst = mm1
}
My main problem for the time being is how to insert the ‘src’ alpha into a 64-bit unsigned integer (QWORD) four times, one 16-bit unsigned integer (WORD) each, like so:

---
Byte 1: src alpha } Word 1
Byte 2: 0
Byte 3: src alpha } Word 2
Byte 4: 0
Byte 5: src alpha } Word 3
Byte 6: 0
Byte 7: src alpha } Word 4
Byte 8: 0
---

Quite frankly, I have no idea how to do this with existing MMX unpack instructions. I guess I could do it in C++, but I can’t find any elegant/efficient way to do it there either.

Also, I’m not sure about the MMX multiply instructions. Do you have to make two multiplications per 64-bit value in order to get a correct result, i.e. one multiplication per DWORD?

I don’t even know if I’m going about this entire problem correctly, so feel free to give me suggestions! Would much appreciate your help!
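On the C++ side, one possible way to build that QWORD is a single multiplication by a constant with a 1 in every 16-bit lane. This is only a sketch; the helper name broadcast_alpha is made up for illustration, and it is not claimed to be faster than an unpack sequence.

```cpp
#include <cstdint>

// Hypothetical helper: replicate an 8-bit alpha into four 16-bit lanes.
// Each lane holds alpha in its low byte and 0 in its high byte.
inline std::uint64_t broadcast_alpha(std::uint8_t alpha)
{
    return static_cast<std::uint64_t>(alpha) * 0x0001000100010001ULL;
}
```

For example, broadcast_alpha(0x80) yields 0x0080008000800080. As for the multiply question: pmullw multiplies the four 16-bit lanes independently and keeps the low 16 bits of each product, so one instruction covers the whole QWORD.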

To get the src alpha channel into a usable format in mm2 when mm1 holds the unpacked src rgba, with alpha as the lowest component, you can do:

movq mm2, mm1
punpcklbw mm2, mm2
punpcklbw mm2, mm2
punpcklbw mm2, ZERO

I would suggest that you replace the ZERO int with a pxor'ed register to avoid unnecessary memory accesses. Also, it might be faster to do all your memory loads first, to avoid stalling the punpcklbw instructions that follow.
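To see what the three-unpack sequence produces, it can be modelled in plain C++. The sketch below emulates punpcklbw on 64-bit integers (byte 0 is the least significant byte); the function names are invented for this illustration and are not part of any library.

```cpp
#include <cstdint>

// Scalar model of MMX "punpcklbw a, b": interleave the low four bytes of
// a and b as a0 b0 a1 b1 a2 b2 a3 b3, lowest byte first.
inline std::uint64_t punpcklbw(std::uint64_t a, std::uint64_t b)
{
    std::uint64_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t from_a = (a >> (8 * i)) & 0xFF;
        const std::uint64_t from_b = (b >> (8 * i)) & 0xFF;
        r |= from_a << (16 * i);
        r |= from_b << (16 * i + 8);
    }
    return r;
}

// The sequence from the post: two self-unpacks, then one unpack with zero.
inline std::uint64_t broadcast_low_word(std::uint64_t unpacked)
{
    std::uint64_t m = punpcklbw(unpacked, unpacked);
    m = punpcklbw(m, m);
    return punpcklbw(m, 0);
}
```

Feeding in the unpacked words 0x0011, 0x0022, 0x0033, 0x0044 (lowest first) gives 0x0011 in all four lanes, so the sequence replicates byte 0 of the pixel. If alpha lives in the highest byte instead, punpckhwd mm2, mm2 followed by punpckhdq mm2, mm2 on the unpacked value is one alternative.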

Very much appreciate the help! A few questions though:

1) Exactly what type of register should I use instead of the ZERO variable? A 32-bit general purpose register, or a 64-bit MMX register? As stated earlier, I'm pretty much an ASM newbie, but judging from the aforementioned PXOR instruction, you're expecting me to use an MMX register, right?

2) Is my way of implementing the algorithm efficient enough, or should I look to other implementations (or maybe other algorithms altogether)?

Quote:
Original post by JNT
1) Exactly what type of register should I use instead of the ZERO variable? A 32-bit general purpose register, or a 64-bit MMX register? As stated earlier, I'm pretty much an ASM newbie, but judging from the aforementioned PXOR instruction, you're expecting me to use an MMX register, right?

Yes - xor'ing a register with itself is an easy way to set all bits of that register to 0.

Quote:

2) Is my way of implementing the algorithm efficient enough, or should I look to other implementations (or maybe other algorithms altogether)?

It looks right to me, and I think you will get a worthwhile speedup compared to the C++ version that works on each color channel separately. I guess you are going to use this on an array of colors, so you could read two colors in one instruction, move them into separate registers, and then also write two colors at once.

If you want to go further and use SSE instructions, you can double that and read/write four colors at once. I'm not sure that it's worth it though, since I *think* reading 128 bits from memory takes twice as many cycles as reading 64 bits.

The algorithm requires only a few instructions, so after a certain amount of optimization you are going to be memory bandwidth limited anyway. But please time your implementations and report the findings back.
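The "two colors per access" idea can be sketched in portable C++ before committing to assembly. In this illustration the blend itself stays scalar and only the loads and stores are 64 bits wide; the function names are hypothetical, and taking alpha from the top byte of src is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar blend of one 32-bit pixel, channel by channel.
// Assumption: alpha is the top byte of src.
inline std::uint32_t blend_pixel(std::uint32_t dst, std::uint32_t src)
{
    const int alpha = static_cast<int>(src >> 24);
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const int d = static_cast<int>((dst >> shift) & 0xFF);
        const int s = static_cast<int>((src >> shift) & 0xFF);
        const int r = d + ((alpha * (s - d)) >> 8);
        out |= static_cast<std::uint32_t>(r & 0xFF) << shift;
    }
    return out;
}

// Blend count pixels, fetching and storing two pixels per 64-bit access.
inline void blend_pairs(std::uint32_t* dst, const std::uint32_t* src, std::size_t count)
{
    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        std::uint64_t d, s;
        std::memcpy(&d, dst + i, sizeof d);   // one 64-bit load (cf. movq)
        std::memcpy(&s, src + i, sizeof s);
        const std::uint32_t lo = blend_pixel(static_cast<std::uint32_t>(d),
                                             static_cast<std::uint32_t>(s));
        const std::uint32_t hi = blend_pixel(static_cast<std::uint32_t>(d >> 32),
                                             static_cast<std::uint32_t>(s >> 32));
        const std::uint64_t out = (static_cast<std::uint64_t>(hi) << 32) | lo;
        std::memcpy(dst + i, &out, sizeof out);  // one 64-bit store
    }
    if (i < count)  // odd pixel left over
        dst[i] = blend_pixel(dst[i], src[i]);
}
```

In an MMX version the two halves would typically go through punpcklbw and punpckhbw rather than shifts, but the loop structure would be the same.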

Quote:
Original post by DonnieDarko
But please time your implementations and report the findings back.


I wasn't all too happy about the results. With the compiler set to debug (MSVC compiler) the performance increase is about 25% - which is all well and good - but with the compiler set to release the performance increase is only one or two percent. Maybe I'm doing something wrong? The full code of the test program is posted below:


#include <iostream>
#include <ctime>

typedef unsigned int uint32;
typedef unsigned char byte;

void CVer( void );
void AsmVer( void );

int main(int argc, char ** argv)
{
    std::cout << "Algorithm iterations per second" << std::endl;
    std::cout << "---" << std::endl;

    CVer();
    AsmVer();

    std::cout << "---" << std::endl;
    std::cout << "program terminated" << std::endl;
    std::cin.get();

    return 0;
}

void CVer( void )
{
    std::cout << "Standard C/C++: ";

    const uint32 RED = 0, GREEN = 1, BLUE = 2, ALPHA = 3; //let's just assume this is the correct byte order
    uint32 it = 0;
    uint32 dst = 0xff000000, src = 0xffffffff;
    byte * dst_ptr = reinterpret_cast<byte*>(&dst), *src_ptr = reinterpret_cast<byte*>(&src);
    long _time = clock();

    while (clock() -_time < CLOCKS_PER_SEC){
        dst_ptr[RED] += (src_ptr[ALPHA] *(src_ptr[RED] -dst_ptr[RED]) >> 8);
        dst_ptr[GREEN] += (src_ptr[ALPHA] *(src_ptr[GREEN] -dst_ptr[GREEN]) >> 8);
        dst_ptr[BLUE] += (src_ptr[ALPHA] *(src_ptr[BLUE] -dst_ptr[BLUE]) >> 8);
        dst_ptr[ALPHA] += (src_ptr[ALPHA] *(src_ptr[ALPHA] -dst_ptr[ALPHA]) >> 8);
        ++it;
    }

    std::cout << it << std::endl;
}

void AsmVer( void )
{
    std::cout << "Assembly using MMX: ";

    uint32 it = 0;
    uint32 dst = 0xff000000, src = 0xffffffff;
    long _time = clock();

    while (clock() -_time < CLOCKS_PER_SEC){
        __asm{
            ; clear a register for use as zero
            pxor mm3, mm3

            ; unpack src and dst
            movd mm0, dst
            movd mm1, src
            punpcklbw mm0, mm3
            punpcklbw mm1, mm3

            ; unpack src alpha (replicates the lowest word of mm1)
            movq mm2, mm1
            punpcklbw mm2, mm2
            punpcklbw mm2, mm2
            punpcklbw mm2, mm3

            ; arithmetic
            psubw mm1, mm0   ; signed difference; psubusw would clamp negatives to 0
            pmullw mm1, mm2  ; low 16 bits of each product; one multiply is enough
            psrlw mm1, 8
            paddb mm1, mm0   ; wrapping byte add, so negative differences still work

            ; assign new value to dst
            packuswb mm1, mm3
            movd dst, mm1

            ; increment iterator
            inc it
        }
    }

    __asm { emms }

    std::cout << it << std::endl;
}
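On the multiplication question, a scalar model of one 16-bit lane can be checked against the C++ formula. The sketch below assumes the sequence psubw, pmullw, psrlw 8, paddb; the function names are invented for this illustration.

```cpp
#include <cstdint>

// Reference: the per-component formula, evaluated in int like the C++ loop.
// Assumes >> on a negative int is an arithmetic shift (guaranteed since
// C++20, and what mainstream compilers did before that).
inline std::uint8_t blend_ref(std::uint8_t dst, std::uint8_t src, std::uint8_t alpha)
{
    return static_cast<std::uint8_t>(dst + ((alpha * (src - dst)) >> 8));
}

// Model of one 16-bit lane going through psubw, pmullw, psrlw 8, paddb.
inline std::uint8_t blend_lane(std::uint8_t dst, std::uint8_t src, std::uint8_t alpha)
{
    const std::uint16_t diff = static_cast<std::uint16_t>(src - dst);     // psubw (wraps)
    const std::uint16_t prod = static_cast<std::uint16_t>(diff * alpha);  // pmullw: low 16 bits
    const std::uint16_t shifted = static_cast<std::uint16_t>(prod >> 8);  // psrlw 8
    return static_cast<std::uint8_t>(dst + shifted);                      // paddb (wraps)
}

// Exhaustive comparison over every dst, src and alpha value.
inline bool lanes_match_reference()
{
    for (int d = 0; d < 256; ++d)
        for (int s = 0; s < 256; ++s)
            for (int a = 0; a < 256; ++a)
                if (blend_lane(static_cast<std::uint8_t>(d), static_cast<std::uint8_t>(s),
                               static_cast<std::uint8_t>(a)) !=
                    blend_ref(static_cast<std::uint8_t>(d), static_cast<std::uint8_t>(s),
                              static_cast<std::uint8_t>(a)))
                    return false;
    return true;
}
```

Under these assumptions a single pmullw is enough, because the shift and the wrapping byte add only need the low 16 bits of each product. A second multiply by alpha, such as pmulhuw applied to the product, would scale by alpha twice; pmulhuw is also an SSE-era addition rather than part of the original MMX set.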

Hmm, it's been quite some years since I timed MMX-optimized code. At the time there was a worthwhile speed gain, but it might be different with current compilers and CPUs.

With that said, I believe you should make the test more reliable by running the alpha blend on an image-sized buffer, and you should definitely remove the call to clock() from the while loop. For each version, record the time before and after blending an equal-sized buffer. Also, I'm not sure how accurate clock() is, so QueryPerformanceCounter might be a more precise way to time it.
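A minimal sketch of that timing pattern follows. It uses std::chrono::steady_clock as a portable stand-in for QueryPerformanceCounter, which is a substitution made for this example; the helper name seconds_taken is hypothetical.

```cpp
#include <chrono>
#include <utility>

// Run a piece of work once and return the elapsed wall time in seconds.
// The clock is read only twice, outside whatever loop the work contains.
template <typename Work>
double seconds_taken(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Each version of the blend would be wrapped in a lambda that processes the same image-sized buffer, so the two measurements cover identical work and the timer cost does not end up inside the loop being measured.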

Thanks for the link, S1CA! It'll come in handy!

DonnieDarko:
I'll take your suggestions under advisement and probably make the changes by tomorrow. I'll report back when I have some figures.

"I'll /.../ probably make the changes by tomorrow" kind of seems like a laughable thing to say in hindsight. Law studies have picked up again so I'll simply report back when I'm done implementing the changes. I am underway, though.
