Jump to content
  • Advertisement
Sign in to follow this  
D_Tr

My SSE Alpha Blending Routine

This topic is 3541 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I just wrote an SSE alpha blending routine which blends 16 byte pairs in parallel (drawback: uses the same alpha value for all 16 pairs), and I wanted to share it with you because its performance seems quite sweet to me and I would like to hear your opinions on it. I wrote a small single threaded program that blends apx 2 billion byte pairs in a second on my 2.66 ghz core2 duo (actually, the same 16 pairs * 125000000 times...). It uses the SSE PAVGB instruction. This is actually the first assembly routine I wrote that does something "useful" :P -EDIT: see below for improved versions The algorithm used for each byte is the following:
uint8_t in[2];
uint16_t out = 0;
uint8_t a;

for(int32_t i = 0; i < 8; i++)
{
    out += (uint16_t)in[(a >> i) & 1] + 1;
    out >>= 1;
}
#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};
                   
uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

int main()
{    
    asm(".intel_syntax noprefix \n");
    
    asm(" push r15 \n");
    asm(" fxsave save \n");
    
    asm(".att_syntax noprefix \n");
    
    for(long i = 0; i < 125000000; i++)
    {   
        asm(".intel_syntax noprefix \n");
    
        asm("  movdqu xmm0, out \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 0 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 1 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 2 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 3 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 4 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 5 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 6 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
        
        asm("  mov r15, [a] \n");
        asm("  shr r15, 7 \n");
        asm("  and r15, 1 \n");
        asm("  shl r15, 4 \n");
        asm("  pavgb xmm0, [in + r15] \n");
               
        asm("  movdqu out, xmm0 \n");   
        
        asm(".att_syntax noprefix \n");
    }
    
    asm(".intel_syntax noprefix \n");
    
    asm(" fxrstor save \n");
    asm(" pop r15 \n");
    
    asm(".att_syntax noprefix \n");
    
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out << '\t';
    
    std::cout << std::endl;
        
    return 0;
}








It's basically untested but it seems to work judging on some sample values i tried... It may be a shade off in some cases. Two possible optimizations I have thought are: 1) precomputing the middle 3 instructions of each pavgb chunk in an 8 byte array, or even using the 8 upper gprs in x86-64 2) it might be possible to trade off accuracy for speed by using fewer alpha bits and reducing the number of pavgb chunks The timing was done by just using the time command on the Linux terminal, I guess it's OK for measuring times as large as a second.. -EDIT: I changed movdqu to movdqa and did the precompute thing, so the time got down to 0.649 secs (0.8 with unaligned loads, huge difference). This is the new version:
#include <iostream>
#include <ctime>
#include <cstdint>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 0,  0,  0,  0,   0, 0, 0, 0, 0, 0, 0, 0, 0};
                   
uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a = 255;

uint8_t save[512];

uint64_t pc[8] = {0, 0, 0, 0, 0, 0, 0, 0};

int main()
{   
    for(long i = 0; i < 8; i++)
        pc = ((a >> i) & 1) << 4;
    
    asm(".intel_syntax noprefix \n");
    
    asm(" push r15 \n");
    asm(" push r14 \n");
    asm(" push r13 \n");
    asm(" push r12 \n");
    asm(" push r11 \n");
    asm(" push r10 \n");
    asm(" push r9 \n");
    asm(" push r8 \n");
    asm(" fxsave save \n");
        
    asm("mov r8,  [pc]     \n");
    asm("mov r9,  [pc + 8] \n");
    asm("mov r10, [pc + 16]\n");
    asm("mov r11, [pc + 24]\n");
    asm("mov r12, [pc + 32]\n");
    asm("mov r13, [pc + 40]\n");
    asm("mov r14, [pc + 48]\n");
    asm("mov r15, [pc + 56]\n");
    
    asm(".att_syntax noprefix \n");
    
    for(long i = 0; i < 125000000; i++)
    {   
        asm("  .intel_syntax noprefix \n");
        asm("  movdqa xmm0, out \n");      
         
        asm("  pavgb xmm0, [in + r8] \n");      
        asm("  pavgb xmm0, [in + r9] \n");      
        asm("  pavgb xmm0, [in + r10] \n");       
        asm("  pavgb xmm0, [in + r11] \n");       
        asm("  pavgb xmm0, [in + r12] \n");        
        asm("  pavgb xmm0, [in + r13] \n");      
        asm("  pavgb xmm0, [in + r14] \n");        
        asm("  pavgb xmm0, [in + r15] \n");
               
        asm("  movdqa out, xmm0 \n");   
        
        asm("  .att_syntax noprefix \n");
    }
        
    asm(".intel_syntax noprefix \n");
    
    asm(" fxrstor save \n");
    asm(" pop r8 \n");
    asm(" pop r9 \n");
    asm(" pop r10 \n");
    asm(" pop r11 \n");
    asm(" pop r12 \n");
    asm(" pop r13 \n");
    asm(" pop r14 \n");
    asm(" pop r15 \n");
    
    asm(".att_syntax noprefix \n");
    
    for(int i = 0; i < 16; i++)
        std::cout << (int32_t)out << '\t';
    
    std::cout << std::endl;
        
    return 0;
}








[Edited by - D_Tr on May 31, 2009 2:11:18 AM]

Share this post


Link to post
Share on other sites
Advertisement

out[0] = ( (255-a)*in[0]+a*in[16] ) /255 ;
out[1] = ( (255-a)*in[1]+a*in[17] ) /255 ;
...
out[15] = ( (255-a)*in[15]+a*in[31] ) /255;

runs 3.5 times faster than your sse version for me. Test yourself:

#include <iostream>
#include <ctime>
#include <stdint.h>
#include <boost/timer.hpp>

uint8_t in[32] = {1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint8_t out[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

uint64_t a ;

uint8_t save[512];

const int N = 125000000;

void sse() {
asm(".intel_syntax noprefix \n");

asm(" push r15 \n");
asm(" fxsave save \n");

asm(".att_syntax noprefix \n");

for(long i = 0; i < N; i++)
{
asm(".intel_syntax noprefix \n");

asm(" movdqu xmm0, out \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 0 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 1 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 2 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 3 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 4 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 5 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 6 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" mov r15, [a] \n");
asm(" shr r15, 7 \n");
asm(" and r15, 1 \n");
asm(" shl r15, 4 \n");
asm(" pavgb xmm0, [in + r15] \n");

asm(" movdqu out, xmm0 \n");

asm(".att_syntax noprefix \n");
}

asm(".intel_syntax noprefix \n");

asm(" fxrstor save \n");
asm(" pop r15 \n");

asm(".att_syntax noprefix \n");


}

void cpp() {
for(long j = 0; j < N; j++)
{
out[0] = ( (255-a)*in[0]+a*in[16] ) /255 ;
out[1] = ( (255-a)*in[1]+a*in[17] ) /255 ;
out[2] = ( (255-a)*in[2]+a*in[18] ) /255 ;
out[3] = ( (255-a)*in[3]+a*in[19] ) /255 ;
out[4] = ( (255-a)*in[4]+a*in[20] ) /255 ;
out[5] = ( (255-a)*in[5]+a*in[21] ) /255 ;
out[6] = ( (255-a)*in[6]+a*in[22] ) /255 ;
out[7] = ( (255-a)*in[7]+a*in[23] ) /255 ;

out[8] = ( (255-a)*in[8]+a*in[24] ) /255 ;
out[9] = ( (255-a)*in[9]+a*in[25] ) /255 ;
out[10] = ( (255-a)*in[10]+a*in[26] ) /255 ;
out[11] = ( (255-a)*in[11]+a*in[27] ) /255 ;
out[12] = ( (255-a)*in[12]+a*in[28] ) /255 ;
out[13] = ( (255-a)*in[13]+a*in[29] ) /255 ;
out[14] = ( (255-a)*in[14]+a*in[30] ) /255 ;
out[15] = ( (255-a)*in[15]+a*in[31] ) /255 ;
}
}

int main() {
// Don't let the compiler make any assumptions about the
// initial values
srand ( time(NULL) );
for(int i=0;i<32;i++)
in=rand()%256;

std::cin >> a;

boost::timer t;
t.restart();
sse();
t.elapsed();
std::cout << "sse : " << t.elapsed() << std::endl;
for(int i = 0; i < 16; i++)
std::cout << (int32_t)out << '\t';
std::cout << std::endl;

t.restart();
cpp();
t.elapsed();
std::cout << "cpp : " << t.elapsed() << std::endl;
for(int i = 0; i < 16; i++)
std::cout << (int32_t)out << '\t';
std::cout << std::endl;

return 0;
}




I use g++ 4.3.2 .
the output looks like this:

sse : 3.34
88 5 29 102 219 141 161 127 37 109 218 21 114 178 47 103
cpp : 0.93
87 4 28 101 218 140 160 126 36 108 217 20 113 177 46 102

Am I doing something wrong?

Edit: The second version is still 50% slower than the cpp version.

Share this post


Link to post
Share on other sites
Fun learning exercise.

Your timings are not going to be accurate for the real world because all of that data fits within the cpu cache. Doing 16 pairs is nothing. If you have any large amount of data it would require keeping the CPU fed, which adds significant overhead. Try running it on alpha blending over an 8MB dataset (roughly 1600x1200) for input and output to check your performance.

Finally, it doesn't seem very useful to me unless you are implementing a software rasterizer. Alpha blending on the graphics cards is done in a much more massively parallel fashion.

Share this post


Link to post
Share on other sites
@frob: yep it was fun :D

@Kampiz: I tried your code on my machine and got like 0.9-0.95 secs for both cpp and sse code. When i used the second version, though (see first edit on my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my cpu... What's yours?)


[Edited by - D_Tr on December 6, 2008 3:39:25 AM]

Share this post


Link to post
Share on other sites
For that cpp code, the accuracy problem is simply rounding differences.

For speed, replace the divisions by 255 in the cpp with a multiplication and a shift:

round(x / 255)

(floating point division)

is approximately equivalent to

(x * 258) >> 16

This should actually have quite a bit less rounding error (vs the expected "ideal" result) than the division by 255 method within this domain space (8-bit alpha, 8-bit color channels)

Share this post


Link to post
Share on other sites
Quote:
Original post by D_Tr
@Kampiz: I tried your code on my machine and got like 0.9-0.95 secs for both cpp and sse code. When i used the second version, though (see first edit on my original post), I get 0.65 with sse and 0.85-0.9 with the cpp code, so there is some gain with the second version :) (at least on my cpu... What's yours?)

It may be cpu specific (I'm testing on AMD Athlon 4200+ X2) or compiler specific (Try g++ or llvm-g++ with -O3). The point is hand written assembly is often not better than optimized c/c++.

Quote:
Original post by Rockoon1
For that cpp code, the accuracy problem is simply rounding differences.

For speed, replace the divisions by 255 in the cpp with a multiplication and a shift:

round(x / 255)

(floating point division)

is approximately equivalent to

(x * 258) >> 16

This should actually have quite a bit less rounding error (vs the expected "ideal" result) than the division by 255 method within this domain space (8-bit alpha, 8-bit color channels)


Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.

Share this post


Link to post
Share on other sites
Quote:
Original post by Kambiz

Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.


Because your version is evaluated completely at compile-time, so it involves writing constant value into destination.

Look at disassembly.

What you have is actually:
void cpp() {
for(long j = 0; j < N; j++)
{
out[0] = 217;
out[1] = 20;
out[2] = 113;
...
}
}

Share this post


Link to post
Share on other sites
Quote:
Original post by Antheus
Quote:
Original post by Kambiz

Originally I was using the approximation >> 8 instead of /255 but then I noticed that both version run at the same speed. Thus I decided to use /255. These old tricks doesn't seem to be that useful on newer hardware.


Because your version is evaluated completely at compile-time, so it involves writing constant value into destination.

Look at disassembly.

What you have is actually:
void cpp() {
for(long j = 0; j < N; j++)
{
out[0] = 217;
out[1] = 20;
out[2] = 113;
...
}
}

I initialize in with random values and read a from cin before calling cpp() :

int main() {
// Don't let the compiler make any assumptions about the
// initial values
srand ( time(NULL) );
for(int i=0;i<32;i++)
in=rand()%256;
...
}


EDIT:
Sorry, you are right. I think the program will do the calculations just once and then just use the constants in the assignment. the sse version is much faster if I change a in the loop to force calculations.
I'm sorry for the confusion.

[Edited by - Kambiz on December 6, 2008 7:07:27 AM]

Share this post


Link to post
Share on other sites
Quote:
Original post by RobTheBloke
optimisation option 1) wont work unless you re-write it in intrinsics...


What exactly do you mean? It works for me in gcc by using r8 - r15

Also, the "movdqa xmm0, out" at the start of the loop eats up a lot of time (nearly 40% on my machine) and is completely redundant, should be removed... The register can be cleared with asm(" pxor xmm0, xmm0 "); instead. This final version the routine is 15.0 times faster than plain cpp (if I shift alpha in the cpp version by one at each iteration to prevent the compiler from "cheating"). The difference might be a bit smaller if I run the routine over a large array though.

[Edited by - D_Tr on December 9, 2008 12:33:54 PM]

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!