My SSE Alpha Blending Routine

Started by
16 comments, last by asdfg__12 15 years, 4 months ago
I just wrote an SSE alpha blending routine which blends 16 byte pairs in parallel (drawback: uses the same alpha value for all 16 pairs), and I wanted to share it with you because its performance seems quite sweet to me and I would like to hear your opinions on it. I wrote a small single threaded program that blends apx 2 billion byte pairs in a second on my 2.66 ghz core2 duo (actually, the same 16 pairs * 125000000 times...). It uses the SSE PAVGB instruction. This is actually the first assembly routine I wrote that does something "useful" :P -EDIT: see below for improved versions The algorithm used for each byte is the following:

uint8_t in[2];
uint16_t out = 0;
uint8_t a;

for(int32_t i = 0; i < 8; i++)
{
    out += (uint16_t)in[(a >> i) & 1] + 1;
    out >>= 1;
}

#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};
                   
uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

int main()
{    
    asm(".intel_syntax noprefix \n");
    
    asm(" push r15 \n");
    asm(" fxsave save \n");
    
    asm(".att_syntax noprefix \n");
    
    for(long i = 0; i < 125000000; i++)
    {   
        asm(".intel_syntax noprefix \n");
    
        asm("  movdqu xmm0, out \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 0 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 1 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 2 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 3 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 4 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 5 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 6 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 7 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
               
        asm("  movdqu out, xmm0 \n");   
        
        asm(".att_syntax noprefix \n");
    }
    
    asm(".intel_syntax noprefix \n");
    
    asm(" fxrstor save \n");
    asm(" pop r15 \n");
    
    asm(".att_syntax noprefix \n");
    
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out << '\t';
    
    std::cout << std::endl;
        
    return 0;
}








It's basically untested but it seems to work judging on some sample values i tried... It may be a shade off in some cases. Two possible optimizations I have thought are: 1) precomputing the middle 3 instructions of each pavgb chunk in an 8 byte array, or even using the 8 upper gprs in x86-64 2) it might be possible to trade off accuracy for speed by using fewer alpha bits and reducing the number of pavgb chunks The timing was done by just using the time command on the Linux terminal, I guess it's OK for measuring times as large as a second.. -EDIT: I changed movdqu to movdqa and did the precompute thing, so the time got down to 0.649 secs (0.8 with unaligned loads, huge difference). This is the new version:

#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};
                   
uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

uint64_t pc[8] = {0, 0, 0, 0, 0, 0, 0, 0};

int main()
{   
    for(long i = 0; i < 8; i++)
        pc = ((a >> i) & 1) << 4;
    
    asm(".intel_syntax noprefix \n");
    
    asm(" push r15 \n");
    asm(" push r14 \n");
    asm(" push r13 \n");
    asm(" push r12 \n");
    asm(" push r11 \n");
    asm(" push r10 \n");
    asm(" push r9 \n");
    asm(" push r8 \n");
    asm(" fxsave save \n");
        
    asm("mov r8,  [pc]     \n");
    asm("mov r9,  [pc + 8] \n");
    asm("mov r10, [pc + 16]\n");
    asm("mov r11, [pc + 24]\n");
    asm("mov r12, [pc + 32]\n");
    asm("mov r13, [pc + 40]\n");
    asm("mov r14, [pc + 48]\n");
    asm("mov r15, [pc + 56]\n");
    
    asm(".att_syntax noprefix \n");
    
    for(long i = 0; i < 125000000; i++)
    {   
        asm("  .intel_syntax noprefix \n");
        asm("  movdqa xmm0, out \n");      
         
        asm("  pavgb xmm0, [in + r8] \n");      
        asm("  pavgb xmm0, [in + r9] \n");      
        asm("  pavgb xmm0, [in + r10] \n");       
        asm("  pavgb xmm0, [in + r11] \n");       
        asm("  pavgb xmm0, [in + r12] \n");        
        asm("  pavgb xmm0, [in + r13] \n");      
        asm("  pavgb xmm0, [in + r14] \n");        
        asm("  pavgb xmm0, [in + r15] \n");
               
        asm("  movdqa out, xmm0 \n");   
        
        asm("  .att_syntax noprefix \n");
    }
        
    asm(".intel_syntax noprefix \n");
    
    asm(" fxrstor save \n");
    asm(" pop r8 \n");
    asm(" pop r9 \n");
    asm(" pop r10 \n");
    asm(" pop r11 \n");
    asm(" pop r12 \n");
    asm(" pop r13 \n");
    asm(" pop r14 \n");
    asm(" pop r15 \n");
    
    asm(".att_syntax noprefix \n");
    
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out << '\t';
    
    std::cout << std::endl;
        
    return 0;
}








[Edited by - D_Tr on May 31, 2009 2:11:18 AM]
Advertisement
out[0] = ( (255-a)*in[0]+a*in[16] ) /255 ;out[1] = ( (255-a)*in[1]+a*in[17] ) /255 ;...out[15] = ( (255-a)*in[15]+a*in[31] ) /255;

runs 3.5 times faster than your sse version for me. Test yourself:
#include <iostream>#include <ctime>#include <stdint.h>#include <boost/timer.hpp>uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};uint64_t a ;uint8_t save[512];const int N = 125000000;void sse() {    asm(".intel_syntax noprefix \n");    asm(" push r15 \n");    asm(" fxsave save \n");    asm(".att_syntax noprefix \n");    for(long i = 0; i < N; i++)    {        asm(".intel_syntax noprefix \n");        asm("  movdqu xmm0, out \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 0 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 1 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 2 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 3 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 4 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 5 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 6 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  mov r15, [a] \n");        asm("  shr r15, 7 \n");        asm("  and r15, 1 \n");        asm("  shl r15, 4 \n");        asm("  pavgb xmm0, [in + r15] \n");        asm("  movdqu out, xmm0 \n");        asm(".att_syntax noprefix \n");    }    asm(".intel_syntax noprefix \n");    asm(" fxrstor save \n");    asm(" pop r15 \n");    asm(".att_syntax noprefix \n");}void cpp() {    for(long j = 0; j < N; j++)    {        out[0] = ( (255-a)*in[0]+a*in[16] ) /255 ;        out[1] = ( (255-a)*in[1]+a*in[17] ) /255 ;        out[2] = ( (255-a)*in[2]+a*in[18] ) /255 ;        out[3] = ( (255-a)*in[3]+a*in[19] ) /255 ;        out[4] = ( (255-a)*in[4]+a*in[20] ) /255 ;        out[5] = ( (255-a)*in[5]+a*in[21] ) /255 ;        out[6] = ( (255-a)*in[6]+a*in[22] ) /255 ;        out[7] = ( (255-a)*in[7]+a*in[23] ) /255 ;        out[8] = ( (255-a)*in[8]+a*in[24] ) /255 ;        out[9] = ( (255-a)*in[9]+a*in[25] ) /255 ;        out[10] = ( (255-a)*in[10]+a*in[26] ) /255 ;        out[11] = ( (255-a)*in[11]+a*in[27] ) /255 ;        out[12] = ( (255-a)*in[12]+a*in[28] ) /255 ;        out[13] = ( (255-a)*in[13]+a*in[29] ) /255 ;        out[14] = ( (255-a)*in[14]+a*in[30] ) /255 ;        out[15] = ( (255-a)*in[15]+a*in[31] ) /255 ;    }}int main() {    // Don't let the compiler make any assumptions about the     // initial values    srand ( time(NULL) );    for(int i=0;i<32;i++)        in=rand()%256;    std::cin >> a;        boost::timer t;    t.restart();    sse();    t.elapsed();    std::cout << "sse : " << t.elapsed() << std::endl;    for(int i = 0; i < 16; i++)        std::cout << (int32_t)out << '\t';    std::cout << std::endl;    t.restart();    cpp();    t.elapsed();    std::cout << "cpp : " << t.elapsed() << std::endl;    for(int i = 0; i < 16; i++)        std::cout << (int32_t)out << '\t';    std::cout << std::endl;    return 0;}

I use g++ 4.3.2 .
the output looks like this:
sse : 3.3488	5	29	102	219	141	161	127	37	109	218	21	114	178	47	103	cpp : 0.9387	4	28	101	218	140	160	126	36	108	217	20	113	177	46	102	

Am I doing something wrong?

Edit: The second version is still 50% slower than the cpp version.
Fun learning exercise.

Your timings are not going to be accurate for the real world because all of that data fits within the cpu cache. Doing 16 pairs is nothing. If you have any large amount of data it would require keeping the CPU fed, which adds significant overhead. Try running it on alpha blending over an 8MB dataset (roughly 1600x1200) for input and output to check your performance.

Finally, it doesn't seem very useful to me unless you are implementing a software rasterizer. Alpha blending on the graphics cards is done in a much more massively parallel fashion.
@frob: yep it was fun :D

@Kampiz: I tried your code on my machine and got like 0.9-0.95 secs for both cpp and sse code. When i used the second version, though (see first edit on my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my cpu... What's yours?)


[Edited by - D_Tr on December 6, 2008 3:39:25 AM]
For that cpp code, the accuracy problem is simply rounding differences.

For speed, replace the divisions by 255 in the cpp with a multiplication and a shift:

round(x / 255)

(floating point division)

is approximately equivalent to

(x * 258) >> 16

This should actually have quite a bit less rounding error (vs the expected "ideal" result) than the division by 255 method within this domain space (8-bit alpha, 8-bit color channels)
Quote:Original post by D_Tr
@Kampiz: I tried your code on my machine and got like 0.9-0.95 secs for both cpp and sse code. When i used the second version, though (see first edit on my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my cpu... What's yours?)

It may be cpu specific (I'm testing on AMD Athlon 4200+ X2) or compiler specific (Try g++ or llvm-g++ with -O3). The point is hand written assembly is often not better than optimized c/c++.

Quote:Original post by Rockoon1
For that cpp code, the accuracy problem is simply rounding differences.

For speed, replace the divisions by 255 in the cpp with a multiplication and a shift:

round(x / 255)

(floating point division)

is approximately equivalent to

(x * 258) >> 16

This should actually have quite a bit less rounding error (vs the expected "ideal" result) than the division by 255 method within this domain space (8-bit alpha, 8-bit color channels)


Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.
Quote:Original post by Kambiz

Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.


Because your version is evaluated completely at compile-time, so it involves writing constant value into destination.

Look at disassembly.

What you have is actually:
void cpp() {    for(long j = 0; j < N; j++)    {        out[0] = 217;        out[1] = 20;        out[2] = 113;...    }}
Quote:Original post by Antheus
Quote:Original post by Kambiz

Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.


Because your version is evaluated completely at compile-time, so it involves writing constant value into destination.

Look at disassembly.

What you have is actually:
void cpp() {    for(long j = 0; j < N; j++)    {        out[0] = 217;        out[1] = 20;        out[2] = 113;...    }}

I initialize in with random values and read a from cin before calling cpp() :
int main() {    // Don't let the compiler make any assumptions about the     // initial values    srand ( time(NULL) );    for(int i=0;i<32;i++)        in=rand()%256;...}


EDIT:
Sorry, you are right. I think the program will do the calculations just once and then just use the constants in the assignment. the sse version is much faster if I change a in the loop to force calculations.
I'm sorry for the confusion.

[Edited by - Kambiz on December 6, 2008 7:07:27 AM]
optimisation option 1) wont work unless you re-write it in intrinsics...
Quote:Original post by RobTheBloke
optimisation option 1) wont work unless you re-write it in intrinsics...


What exactly do you mean? It works for me in gcc by using r8 - r15

Also, the "movdqa xmm0, out" at the start of the loop eats up a lot of time (nearly 40% on my machine) and is completely redundant, should be removed... The register can be cleared with asm(" pxor xmm0, xmm0 "); instead. This final version the routine is 15.0 times faster than plain cpp (if I shift alpha in the cpp version by one at each iteration to prevent the compiler from "cheating"). The difference might be a bit smaller if I run the routine over a large array though.

[Edited by - D_Tr on December 9, 2008 12:33:54 PM]

This topic is closed to new replies.

Advertisement