# My SSE Alpha Blending Routine


## Recommended Posts

I just wrote an SSE alpha blending routine which blends 16 byte pairs in parallel (drawback: it uses the same alpha value for all 16 pairs), and I wanted to share it with you because its performance seems quite sweet to me and I would like to hear your opinions on it. I wrote a small single-threaded program that blends approximately 2 billion byte pairs per second on my 2.66 GHz Core 2 Duo (actually, the same 16 pairs, 125,000,000 times...). It uses the SSE PAVGB instruction. This is actually the first assembly routine I have written that does something "useful" :P

EDIT: see below for improved versions.

The algorithm used for each byte is the following:
uint8_t in[2];
uint16_t out = 0;
uint8_t a;

for(int32_t i = 0; i < 8; i++)
{
out += (uint16_t)in[(a >> i) & 1] + 1;
out >>= 1;
}
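If I'm reading that loop right, the cascade approximates (a*in[1] + (255-a)*in[0]) / 256 with a small upward rounding drift below one, which is why the result can be a shade off versus true alpha blending. Here is a quick scalar check of that claim (my own sketch; the function names are mine, not part of the routine):

```cpp
#include <cstdint>

// Scalar model of the 8-step PAVGB cascade above: each step is a
// round-half-up average, which is exactly what pavgb does per byte.
uint8_t cascade(uint8_t in0, uint8_t in1, uint8_t a)
{
    unsigned out = 0;
    for (int i = 0; i < 8; i++) {
        unsigned pick = ((a >> i) & 1) ? in1 : in0; // bit i of alpha picks the input
        out = (out + pick + 1) >> 1;                // pavgb: (x + y + 1) >> 1
    }
    return (uint8_t)out;
}

// Exhaustive-ish check: the cascade equals the weighted sum
// (a*in1 + (255-a)*in0) / 256 plus an upward rounding drift below 1.
// Note the divisor is 256, not 255, which is one source of "a shade off".
bool cascade_is_close()
{
    for (int a = 0; a < 256; a++)
        for (int in0 = 0; in0 < 256; in0 += 5)
            for (int in1 = 0; in1 < 256; in1 += 5) {
                double exact = (a * in1 + (255 - a) * in0) / 256.0;
                double got = cascade((uint8_t)in0, (uint8_t)in1, (uint8_t)a);
                if (got - exact < 0.0 || got - exact >= 1.0)
                    return false;
            }
    return true;
}
```

With alpha 255 every step averages toward in1, so e.g. blending 0 toward 128 converges to exactly 128 after the eight steps.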

#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

int main()
{
asm(".intel_syntax noprefix \n");

asm(" push r15 \n");
asm(" fxsave save \n");

asm(".att_syntax noprefix \n");

for(long i = 0; i < 125000000; i++)
{
asm(".intel_syntax noprefix \n");

asm("  movdqu xmm0, out \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 0 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 1 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 2 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 3 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 4 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 5 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 6 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  mov r15, [a] \n");
asm("  shr r15, 7 \n");
asm("  and r15, 1 \n");
asm("  shl r15, 4 \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  movdqu out, xmm0 \n");

asm(".att_syntax noprefix \n");
}

asm(".intel_syntax noprefix \n");

asm(" fxrstor save \n");
asm(" pop r15 \n");

asm(".att_syntax noprefix \n");

for(int i = 0; i < 16; i++)
std::cout << (int32_t)out[i] << '\t';

std::cout << std::endl;

return 0;
}


It's basically untested, but it seems to work judging by some sample values I tried... It may be a shade off in some cases. Two possible optimizations I have thought of are: 1) precomputing the middle three instructions of each pavgb chunk into an 8-entry array, or even keeping the values in the 8 upper GPRs of x86-64; 2) it might be possible to trade accuracy for speed by using fewer alpha bits and reducing the number of pavgb chunks. The timing was done by just using the time command in the Linux terminal; I guess that's OK for measuring times as large as a second.

EDIT: I changed movdqu to movdqa and did the precompute thing, so the time got down to 0.649 secs (0.8 with unaligned loads, a huge difference). This is the new version:
#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

uint64_t pc[8] = {0, 0, 0, 0, 0, 0, 0, 0};

int main()
{
for(long i = 0; i < 8; i++)
pc[i] = ((a >> i) & 1) << 4;

asm(".intel_syntax noprefix \n");

asm(" push r15 \n");
asm(" push r14 \n");
asm(" push r13 \n");
asm(" push r12 \n");
asm(" push r11 \n");
asm(" push r10 \n");
asm(" push r9 \n");
asm(" push r8 \n");
asm(" fxsave save \n");

asm("mov r8,  [pc]     \n");
asm("mov r9,  [pc + 8] \n");
asm("mov r10, [pc + 16]\n");
asm("mov r11, [pc + 24]\n");
asm("mov r12, [pc + 32]\n");
asm("mov r13, [pc + 40]\n");
asm("mov r14, [pc + 48]\n");
asm("mov r15, [pc + 56]\n");

asm(".att_syntax noprefix \n");

for(long i = 0; i < 125000000; i++)
{
asm("  .intel_syntax noprefix \n");
asm("  movdqa xmm0, out \n");

asm("  pavgb xmm0, [in + r8] \n");
asm("  pavgb xmm0, [in + r9] \n");
asm("  pavgb xmm0, [in + r10] \n");
asm("  pavgb xmm0, [in + r11] \n");
asm("  pavgb xmm0, [in + r12] \n");
asm("  pavgb xmm0, [in + r13] \n");
asm("  pavgb xmm0, [in + r14] \n");
asm("  pavgb xmm0, [in + r15] \n");

asm("  movdqa out, xmm0 \n");

asm("  .att_syntax noprefix \n");
}

asm(".intel_syntax noprefix \n");

asm(" fxrstor save \n");
asm(" pop r8 \n");
asm(" pop r9 \n");
asm(" pop r10 \n");
asm(" pop r11 \n");
asm(" pop r12 \n");
asm(" pop r13 \n");
asm(" pop r14 \n");
asm(" pop r15 \n");

asm(".att_syntax noprefix \n");

for(int i = 0; i < 16; i++)
std::cout << (int32_t)out[i] << '\t';

std::cout << std::endl;

return 0;
}


[Edited by - D_Tr on May 31, 2009 2:11:18 AM]

##### Share on other sites
out[0]  = ( (255-a)*in[0]  + a*in[16] ) / 255;
out[1]  = ( (255-a)*in[1]  + a*in[17] ) / 255;
...
out[15] = ( (255-a)*in[15] + a*in[31] ) / 255;

runs 3.5 times faster than your sse version for me. Test yourself:
#include <iostream>
#include <ctime>
#include <stdint.h>
#include <boost/timer.hpp>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a;

uint8_t save[512];

const int N = 125000000;

void sse()
{
    asm(".intel_syntax noprefix \n");
    asm(" push r15 \n");
    asm(" fxsave save \n");
    asm(".att_syntax noprefix \n");

    for(long i = 0; i < N; i++)
    {
        asm(".intel_syntax noprefix \n");
        asm("  movdqu xmm0, out \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 0 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 1 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 2 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 3 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 4 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 5 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 6 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  mov r15, [a] \n");
        asm("  shr r15, 7 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");

        asm("  movdqu out, xmm0 \n");
        asm(".att_syntax noprefix \n");
    }

    asm(".intel_syntax noprefix \n");
    asm(" fxrstor save \n");
    asm(" pop r15 \n");
    asm(".att_syntax noprefix \n");
}

void cpp()
{
    for(long j = 0; j < N; j++)
    {
        out[0]  = ( (255-a)*in[0]  + a*in[16] ) / 255;
        out[1]  = ( (255-a)*in[1]  + a*in[17] ) / 255;
        out[2]  = ( (255-a)*in[2]  + a*in[18] ) / 255;
        out[3]  = ( (255-a)*in[3]  + a*in[19] ) / 255;
        out[4]  = ( (255-a)*in[4]  + a*in[20] ) / 255;
        out[5]  = ( (255-a)*in[5]  + a*in[21] ) / 255;
        out[6]  = ( (255-a)*in[6]  + a*in[22] ) / 255;
        out[7]  = ( (255-a)*in[7]  + a*in[23] ) / 255;
        out[8]  = ( (255-a)*in[8]  + a*in[24] ) / 255;
        out[9]  = ( (255-a)*in[9]  + a*in[25] ) / 255;
        out[10] = ( (255-a)*in[10] + a*in[26] ) / 255;
        out[11] = ( (255-a)*in[11] + a*in[27] ) / 255;
        out[12] = ( (255-a)*in[12] + a*in[28] ) / 255;
        out[13] = ( (255-a)*in[13] + a*in[29] ) / 255;
        out[14] = ( (255-a)*in[14] + a*in[30] ) / 255;
        out[15] = ( (255-a)*in[15] + a*in[31] ) / 255;
    }
}

int main()
{
    // Don't let the compiler make any assumptions about the
    // initial values
    srand(time(NULL));
    for(int i = 0; i < 32; i++)
        in[i] = rand() % 256;
    std::cin >> a;

    boost::timer t;

    t.restart();
    sse();
    std::cout << "sse : " << t.elapsed() << std::endl;
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out[i] << '\t';
    std::cout << std::endl;

    t.restart();
    cpp();
    std::cout << "cpp : " << t.elapsed() << std::endl;
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out[i] << '\t';
    std::cout << std::endl;

    return 0;
}

I use g++ 4.3.2 .
the output looks like this:
sse : 3.3488
5	29	102	219	141	161	127	37	109	218	21	114	178	47	103
cpp : 0.9387
4	28	101	218	140	160	126	36	108	217	20	113	177	46	102

Am I doing something wrong?

Edit: The second version is still 50% slower than the cpp version.

##### Share on other sites
Fun learning exercise.

Your timings are not going to be accurate for the real world because all of that data fits within the CPU cache. Doing 16 pairs is nothing. Any large amount of data would require keeping the CPU fed, which adds significant overhead. Try running the alpha blend over an 8 MB dataset (roughly 1600x1200) for input and output to check your performance.

Finally, it doesn't seem very useful to me unless you are implementing a software rasterizer. Alpha blending on graphics cards is done in a much more massively parallel fashion.
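As a rough illustration of frob's suggestion (buffer sizes and function names are my own, not from the thread), a streaming version of the benchmark would blend buffers large enough to fall out of cache, so memory bandwidth rather than ALU speed dominates:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <chrono>

// Scalar per-byte blend over a whole buffer:
// out[i] = ((255-a)*dst[i] + a*src[i]) / 255 for every byte.
void blend(const uint8_t* src, const uint8_t* dst, uint8_t* out,
           size_t n, unsigned a)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(((255 - a) * dst[i] + a * src[i]) / 255);
}

// Time one pass over n bytes; with n around 8 MB the working set no
// longer fits in cache, so this measures memory-bound throughput.
double bench(size_t n, unsigned a)
{
    std::vector<uint8_t> src(n), dst(n), out(n);
    for (size_t i = 0; i < n; i++) { src[i] = (uint8_t)i; dst[i] = (uint8_t)(i * 7); }

    auto t0 = std::chrono::steady_clock::now();
    blend(src.data(), dst.data(), out.data(), n, a);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `bench(8 * 1024 * 1024, a)` and comparing against the 16-byte loop should show how much of the measured speed was really just L1 cache.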

##### Share on other sites
@frob: yep it was fun :D

@Kambiz: I tried your code on my machine and got like 0.9-0.95 secs for both the cpp and sse code. When I used the second version, though (see the first edit in my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my CPU... What's yours?)

[Edited by - D_Tr on December 6, 2008 3:39:25 AM]

##### Share on other sites
For that cpp code, the accuracy problem is simply rounding differences.

For speed, replace the divisions by 255 in the cpp with a multiplication and a shift:

round(x / 255)

(floating point division)

is approximately equivalent to

(x * 258) >> 16

This should actually have quite a bit less rounding error (vs. the expected "ideal" result) than the division-by-255 method within this domain space (8-bit alpha, 8-bit color channels).
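The claim is easy to brute-force over the whole domain an 8-bit blend can produce. In my own quick check (not from the thread), the multiply-and-shift never lands more than one step away from the rounded division:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cmath>

// x = (255-a)*c0 + a*c1 with 8-bit alpha and 8-bit channels
// ranges over 0 .. 255*255, so test every value in that domain.
int max_shift_error()
{
    int worst = 0;
    for (uint32_t x = 0; x <= 255u * 255u; x++) {
        int approx = (int)((x * 258u) >> 16);     // multiply-and-shift
        int exact  = (int)std::lround(x / 255.0); // round(x / 255)
        int err = std::abs(approx - exact);
        if (err > worst) worst = err;
    }
    return worst;
}
```

So "approximately equivalent" here means within one intensity level; e.g. at x = 383 the shift gives 1 while round(383/255) is 2.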

##### Share on other sites
Quote:
 Original post by D_Tr: @Kambiz: I tried your code on my machine and got like 0.9-0.95 secs for both the cpp and sse code. When I used the second version, though (see the first edit in my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my CPU... What's yours?)

It may be CPU specific (I'm testing on an AMD Athlon 64 X2 4200+) or compiler specific (try g++ or llvm-g++ with -O3). The point is that hand-written assembly is often not better than optimized C/C++.

Quote:
 Original post by Rockoon1: For that cpp code, the accuracy problem is simply rounding differences. For speed, replace the divisions by 255 in the cpp with a multiplication and a shift: round(x / 255) (floating-point division) is approximately equivalent to (x * 258) >> 16. This should actually have quite a bit less rounding error (vs. the expected "ideal" result) than the division-by-255 method within this domain space (8-bit alpha, 8-bit color channels).

Originally I was using the approximation >> 8 instead of /255, but then I noticed that both versions run at the same speed. Thus I decided to use /255. These old tricks don't seem to be that useful on newer hardware.

##### Share on other sites
Quote:
 Original post by Kambiz: Originally I was using the approximation >> 8 instead of /255, but then I noticed that both versions run at the same speed. Thus I decided to use /255. These old tricks don't seem to be that useful on newer hardware.

Because your version is evaluated completely at compile time, it reduces to writing constant values into the destination.

Look at disassembly.

What you have is actually:
void cpp()
{
    for(long j = 0; j < N; j++)
    {
        out[0] = 217;
        out[1] = 20;
        out[2] = 113;
        ...
    }
}

##### Share on other sites
Quote:
Original post by Antheus
Quote:
 Original post by Kambiz: Originally I was using the approximation >> 8 instead of /255, but then I noticed that both versions run at the same speed. Thus I decided to use /255. These old tricks don't seem to be that useful on newer hardware.

Because your version is evaluated completely at compile time, it reduces to writing constant values into the destination.

Look at disassembly.

What you have is actually:
void cpp()
{
    for(long j = 0; j < N; j++)
    {
        out[0] = 217;
        out[1] = 20;
        out[2] = 113;
        ...
    }
}

I initialize in with random values and read a from cin before calling cpp() :
int main()
{
    // Don't let the compiler make any assumptions about the
    // initial values
    srand(time(NULL));
    for(int i = 0; i < 32; i++)
        in[i] = rand() % 256;
    ...
}

EDIT:
Sorry, you are right. I think the program does the calculations just once and then just uses the constants in the assignments. The SSE version is much faster if I change a in the loop to force the calculations.
I'm sorry for the confusion.

[Edited by - Kambiz on December 6, 2008 7:07:27 AM]

##### Share on other sites
Optimisation option 1) won't work unless you rewrite it in intrinsics...

##### Share on other sites
Quote:
 Original post by RobTheBloke: Optimisation option 1) won't work unless you rewrite it in intrinsics...

What exactly do you mean? It works for me in gcc by using r8 - r15

Also, the "movdqa xmm0, out" at the start of the loop eats up a lot of time (nearly 40% on my machine) and is completely redundant, so it should be removed... The register can be cleared with asm(" pxor xmm0, xmm0 "); instead. With this final change the routine is 15.0 times faster than plain C++ (if I shift alpha by one in the cpp version at each iteration to prevent the compiler from "cheating"). The difference might be a bit smaller if I ran the routine over a large array, though.

[Edited by - D_Tr on December 9, 2008 12:33:54 PM]

##### Share on other sites
From what I understand (and in my limited experience), you are likely to get better performance if you use intrinsics rather than trying to hand-roll the whole routine in asm. With intrinsics the compiler can still perform its own optimizations.
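As a sketch of what that could look like here, the PAVGB cascade maps almost 1:1 onto _mm_avg_epu8 from &lt;emmintrin.h&gt;. This is my own hypothetical rewrite (the function name and argument layout are assumptions, not code from the thread), and the compiler remains free to schedule and allocate registers around it:

```cpp
#include <emmintrin.h>
#include <cstdint>

// Blend 16 source/destination byte pairs with one shared 8-bit alpha,
// using the same 8-step averaging cascade as the inline-asm routine.
void blend16(const uint8_t src0[16], const uint8_t src1[16],
             uint8_t dst[16], uint8_t alpha)
{
    __m128i lo  = _mm_loadu_si128((const __m128i*)src0); // picked when an alpha bit is 0
    __m128i hi  = _mm_loadu_si128((const __m128i*)src1); // picked when an alpha bit is 1
    __m128i acc = _mm_setzero_si128();

    for (int bit = 0; bit < 8; ++bit)                    // LSB first, as in the scalar loop
        acc = _mm_avg_epu8(acc, ((alpha >> bit) & 1) ? hi : lo);

    _mm_storeu_si128((__m128i*)dst, acc);
}
```

In real use the loop would be unrolled (or the eight lo/hi picks precomputed per alpha value, like the pc[] table in the second listing), but the point is that the intrinsic version is portable across compilers and still optimizable.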

##### Share on other sites
Quote:
 Original post by D_Tr: What exactly do you mean? It works for me in gcc by using r8 - r15.

Inline asm is banned on 64bit windows, and I suspect others will follow suit soon. You really should be using intrinsics, because your 'optimised' asm is non-portable, and secondly you're only optimising for your CPU - i.e. an optimisation on a P4, is usually not an optimisation on a core 2 for example. Intrinsics avoid both of those problems; are easier to read; quicker to write; and will produce faster results for 10% of the effort.

##### Share on other sites
Quote:
 Original post by RobTheBloke: Inline asm is banned on 64bit windows, and I suspect others will follow suit soon. You really should be using intrinsics, because your 'optimised' asm is non-portable, and secondly you're only optimising for your CPU - i.e. an optimisation on a P4, is usually not an optimisation on a core 2 for example. Intrinsics avoid both of those problems; are easier to read; quicker to write; and will produce faster results for 10% of the effort.

Not just that: I found (at least with MSVC 2005) that inline assembly isn't touched by the compiler; it is left exactly as you wrote it. No matter how well optimized your assembly code is, you miss opportunities for better instruction pairing, register reuse, and so on, because you can't touch what comes before and what comes after. The only solution would be to write the entire app in assembly.

When using intrinsics, the compiler optimizes not only your routine, but globally, processing some instructions earlier and deferring others for later execution when possible, to achieve better cache usage, instruction pairing, and less register pressure.

Cheers
Dark Sylinc

PS: I don't agree, though, that intrinsics are easier to read. But I guess that's because I'm very used to assembly SIMD. Nor do I agree that others will ban inline asm from their compilers as well, but that one is just a worthless opinion.

Edit: By the way, have you seen the article "MMX Enhanced Alpha Blending" here on GD.Net?
It's rather old, but very useful.

##### Share on other sites
Quote:
 Original post by RobTheBloke: Inline asm is banned on 64bit windows

No.

Microsoft simply doesn't support it in their 64-bit compilers, but that hasn't stopped other compilers.

It is not "banned" on anything.

Quote:
 Original post by RobTheBloke: You really should be using intrinsics, because your 'optimised' asm is non-portable

Intrinsics are portable?? Since when?

Quote:
 Original post by RobTheBloke: …and secondly you're only optimising for your CPU - i.e. an optimisation on a P4, is usually not an optimisation on a core 2 for example. Intrinsics avoid both of those problems…

No they don't.

Quote:
 Original post by RobTheBloke: …are easier to read…

For who?

Quote:
 Original post by RobTheBloke: …quicker to write…

For who?

Quote:
 Original post by RobTheBloke: …and will produce faster results for 10% of the effort.

You have a strange definition of the term 'faster.' Generally, suboptimal is not faster; it's generally suboptimal. Now, perhaps you don't have the knowledge necessary to beat a crappy compiler like Microsoft's, or the drive to learn to do exactly that, but that hasn't stopped other people.

##### Share on other sites
Quote:
Original post by Rockoon1
Quote:
 Original post by RobTheBloke: Inline asm is banned on 64bit windows

No.
Microsoft simply doesn't support it in their 64-bit compilers, but that hasn't stopped other compilers.
It is not "banned" on anything.

I agree

Quote:
Original post by Rockoon1
Quote:
 Original post by RobTheBloke: You really should be using intrinsics, because your 'optimised' asm is non-portable

Intrinsics are portable?? since when?

They should be. And in the worst case, you can rewrite the intrinsics as inline C++ functions, without needing to rewrite the routine.

Quote:
Original post by Rockoon1
Quote:
 Original post by RobTheBloke: …and secondly you're only optimising for your CPU - i.e. an optimisation on a P4, is usually not an optimisation on a core 2 for example. Intrinsics avoid both of those problems…

No they don't.

Yes they do.

Quote:
Original post by Rockoon1
Quote:
 Original post by RobTheBloke: …are easier to read…

For who?
Quote:
 Original post by RobTheBloke: …quicker to write…

For who?

I agree, but some may think so.

Quote:
Original post by Rockoon1
Quote:
 Original post by RobTheBloke: …and will produce faster results for 10% of the effort.

You have a strange definition of the term 'faster.' Generally, suboptimal is not faster; it's generally suboptimal. Now, perhaps you don't have the knowledge necessary to beat a crappy compiler like Microsoft's, or the drive to learn to do exactly that, but that hasn't stopped other people.

First, it's been a long time since Microsoft's compilers were crappy. If there's something I have to admire from MS, it's DirectX and Visual Studio (note I didn't include the Windows OS [lol]).
Second, when intrinsics appeared they produced broken code, suboptimal code, and many other flaws. That was like 5 years ago. Now they're more mature and produce better results, with some exceptions from time to time, just as hand-written assembly can be wrong too.
Also see my previous post about global optimization; intrinsics allow that too.

The thing is, intrinsics are written in a C-style syntax, and thus can be compiled on any platform, even if there's no SIMD or it isn't x86, as long as there is a library implementation to emulate that functionality.
Assembly syntax is not part of C (is it?) and cannot be ported anywhere, not to mention each compiler has its own assembly syntax, which makes it non-portable even on the same architecture.

Cheers
Dark Sylinc
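DarkSylinc's emulation point can be sketched like so (my own illustration; the helper name is made up): wrap the intrinsic in a small inline function with a plain-C++ fallback, so the same call site compiles even where SSE2 is unavailable:

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Average 16 byte pairs with rounding, pavgb-style: (x + y + 1) >> 1.
inline void avg16(const uint8_t* x, const uint8_t* y, uint8_t* out)
{
#if defined(__SSE2__)
    // Hardware path: one pavgb over all 16 bytes.
    __m128i a = _mm_loadu_si128((const __m128i*)x);
    __m128i b = _mm_loadu_si128((const __m128i*)y);
    _mm_storeu_si128((__m128i*)out, _mm_avg_epu8(a, b));
#else
    // Portable fallback: same arithmetic, one byte at a time.
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)((x[i] + y[i] + 1) >> 1);
#endif
}
```

Both paths produce bit-identical results, so code built on avg16 stays correct whether or not the target has SSE2.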

##### Share on other sites
So I decided to use intrinsics, because they really seem very useful. Even if they do not ALWAYS produce the best code, they let me quickly test whether the algorithm actually works without writing assembly code, be it inline or with a standalone assembler. However, I encountered a little problem:

When I include tmmintrin.h, I get the following preprocessor error:
/usr/lib/gcc/x86_64-linux-gnu/4.3.2/include/tmmintrin.h:34:3: error: #error "SSSE3 instruction set not enabled"

I have the same problem with pmmintrin.h:
/usr/lib/gcc/x86_64-linux-gnu/4.3.2/include/pmmintrin.h:34:3: error: #error "SSE3 instruction set not enabled"

As I stated in my original post, I have a Core 2 Duo @ 2.66 GHz (65 nm), so does anyone know why I am getting these messages? I have no problem with mmintrin.h, xmmintrin.h and emmintrin.h, which are for MMX, SSE and SSE2 respectively.

##### Share on other sites
Quote:
 Original post by D_Tr: When I include tmmintrin.h, I get the following preprocessor error:
/usr/lib/gcc/x86_64-linux-gnu/4.3.2/include/tmmintrin.h:34:3: error: #error "SSSE3 instruction set not enabled"
I have the same problem with pmmintrin.h ("SSE3 instruction set not enabled"). As I stated in my original post, I have a Core 2 Duo @ 2.66 GHz (65 nm), so does anyone know why I am getting these messages? I have no problem with mmintrin.h, xmmintrin.h and emmintrin.h, which are for MMX, SSE and SSE2 respectively.

Run GCC with -march=core2 (Intel Core 2 Duo) or -march=k8-sse3 (Athlon 64); -mtune alone only tunes instruction scheduling and does not enable the instruction sets.
For a generic (vendor- and model-agnostic) build, use the -msse3 and -mssse3 options.

Note that SSSE3 (the one with the extra 'S') is mostly supported on Intel processors, not AMD's.

Cheers
Dark Sylinc

##### Share on other sites
Thanks a lot, -mssse3 did it!