kzyczynski

SSE performance question


Hi, I was going to use SSE instructions in my procedural sky renderer, via MSVC 2003 intrinsics:
...
int CFloatUtils::mulOP2(float* op1, float op2, float* d, int size)
{
    register int cnt = size / 4;

    assert((size % 4) == 0);

    __m128 op2_128 = _mm_load1_ps(&op2); // broadcast op2 into all four lanes

    while (cnt) {

        _mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));

        op1 += 4;
        d += 4;
        --cnt;
    }

    return QE_SUCCESS;
}

int CFloatUtils::add(float* op1, float* op2, float* d, int size)
{
    register int cnt = size / 4;

    assert((size % 4) == 0);

    while (cnt) {

        _mm_store_ps(d, _mm_add_ps(_mm_load_ps(op1), _mm_load_ps(op2)));

        op1 += 4;
        op2 += 4;
        d += 4;
        --cnt;
    }

    return QE_SUCCESS;
}
...
// And here are my Octaves interpolation functions:

int COctavesMng::interpolateOctaves(CVTex* prev, CVTex* last, CVTex* out)
{
    static float z = 0.0f; // temp

    assert(prev->getRect() && last->getRect() && out->getRect());
    assert(prev->memr->width == last->memr->width);
    assert(prev->memr->width == out->memr->width);

    float _1mz = 1 - z;
    int t1;
    for (int y = 0; y < out->memr->width; ++y) {
        t1 = y*out->memr->width;
        for (int x = 0; x < out->memr->width; ++x) {
            out->memr->data[t1 + x] = 
                prev->memr->data[t1 + x]*_1mz + last->memr->data[t1 + x]*z;
        }
    }
    
    z += 0.001f;
    
    return QE_SUCCESS;
}

int COctavesMng::interpolateOctavesSIMD(CVTex* prev, CVTex* last, CVTex* out)
{
    static float z = 0.0f; // temp

    int texsize = out->memr->width*out->memr->width;
    float _1mz = 1 - z;

    fu->mulOP2(prev->memr->data, _1mz, a1->memr->data, texsize);
    fu->mulOP2(last->memr->data, z, a2->memr->data, texsize);
    fu->add(a1->memr->data, a2->memr->data, out->memr->data, texsize);

    z += 0.001f;
    
    return QE_SUCCESS;
}




The big problem is that the SSE version is as slow as the non-SSE version, or even slower! I've made some assumptions about SSE, and it seems I will have to change many things in my code. But maybe I'm doing something fundamentally wrong? I googled a little and some people say that using intrinsics is not the best idea, but how much of a speed gain can I achieve by switching to pure asm? I'm a little desperate. Help! Thanks.
[Edited by - kzyczynski on July 26, 2005 5:46:39 AM]

deffer
First, look at the disassembly. You'll see what's wrong.

I'm using intrinsics too, but I've never noticed a speed decrease; it was always an increase (not as much as with asm, though). Sometimes I had to unroll my loop a little, or decompose an intrinsic expression into pieces, to shape it like the asm code I expected:


int CFloatUtils::mulOP2(float* op1, float op2, float* d, int size)
{
    assert((size % 4) == 0);

    __m128 op2_128 = _mm_load1_ps(&op2);

    int cnt = size / 4;
    register int cnt2 = cnt / 2;

    while (cnt2) {
        const __m128 xop1a = _mm_load_ps(op1);
        op1 += 4;
        const __m128 xop1b = _mm_load_ps(op1);
        op1 += 4;

        const __m128 xmula = _mm_mul_ps(xop1a, op2_128);
        const __m128 xmulb = _mm_mul_ps(xop1b, op2_128);

        _mm_store_ps(d, xmula);
        d += 4;
        _mm_store_ps(d, xmulb);
        d += 4;

        --cnt2;
    }

    if (cnt & 1) {
        const __m128 xop1a = _mm_load_ps(op1);
        const __m128 xmula = _mm_mul_ps(xop1a, op2_128);
        _mm_store_ps(d, xmula);
    }

    return QE_SUCCESS;
}



For the time being, the MS compiler is not too good at generating SSE code, but it will get better. Many people just can't wait and write the asm by hand, but I prefer the more portable solution, and trust the compiler to inline the generated code, make better use of registers, and so on. It just needs a little help for now.

After all, do you see anybody writing code (not just SSE) in asm?

Guest Anonymous Poster
I would like to add the obvious, and remind you that your float* data has to be 16-byte aligned in order for SSE to be any faster than a traditional implementation.
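(A minimal sketch of one way to get such 16-byte-aligned buffers, assuming the MSVC CRT's _aligned_malloc/_aligned_free; the buffer size is just an example:)

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)

// 16-byte-aligned allocation, so movaps / _mm_load_ps / _mm_store_ps
// can be used on the buffer safely.
float* data = (float*)_aligned_malloc(1024 * sizeof(float), 16);
if (data) {
    // ... fill and process with SSE ...
    _aligned_free(data);
}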

kzyczynski
Quote:
Original post by deffer
First, look at the disassembly. You'll see what's wrong.
I'm using intrinsics too, but I've never noticed a speed decrease; it was always an increase (not as much as with asm, though). Sometimes I had to unroll my loop a little, or decompose an intrinsic expression into pieces, to shape it like the asm code I expected


Well, I tried your version and I still don't see any significant performance gain. Maybe it depends on what CPU one is using. My development machine is a PIII; it's possible that newer processors have a better SIMD implementation.

Quote:

For the time being, the MS compiler is not too good at generating SSE code, but it will get better. Many people just can't wait and write the asm by hand, but I prefer the more portable solution, and trust the compiler to inline the generated code, make better use of registers, and so on. It just needs a little help for now.


I think I will have to put SIMD coding aside, because there seems to be no way to get a really significant performance gain using intrinsics. After finishing the sky system I will try to write a few routines in asm.

Quote:

After all, do you see anybody writing code (not just SSE) in asm?


Of course not, but writing a little image-processing library in asm using SIMD may be worth the effort.

Quote:
Original post by Anonymous Poster
I would like to add the obvious, and remind you that your float* data has to be 16-byte aligned in order for SSE to be any faster than a traditional implementation.


Yes, my memory is 16-byte aligned.

Thanks,
Chris

deffer
Quote:
Original post by Anonymous Poster
I would like to add the obvious, and remind you that your float* data has to be 16-byte aligned in order for SSE to be any faster than a traditional implementation.


Actually, if the data weren't aligned, it would have caused an access violation and thrown an exception.


And for the original problem:
I tested several versions, and there was no significant improvement over mine (although it's 3 times faster than using the plain FPU, so how can you say it's not worth it?).

After a small improvement...

void mul_sse2(float* op1, float op2, float* d, int size)
{
    int cnt = size / 4;
    register int cnt2 = cnt & ~1;
    cnt2 /= 2;

    const __m128 op2_128 = _mm_load1_ps(&op2);

    while (cnt2) {
        _mm_store_ps(d,   _mm_mul_ps(_mm_load_ps(op1),   op2_128));
        _mm_store_ps(d+4, _mm_mul_ps(_mm_load_ps(op1+4), op2_128));

        op1 += 8;
        d += 8;
        cnt2 -= 1;
    }

    if (cnt & 1) {
        _mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));
    }
}



...I got a speed increase of about 20% (notice the d+4 and op1+4 lines).

Also, I could not change cnt2 = cnt & ~1; to a simple cnt2 = cnt; without a 15% speed decrease, lol.


The biggest drawback of using intrinsics is that you cannot pass arguments to mul and friends directly from memory (you have to load first), which you can do in asm. I tried to make the compiler generate the arg-from-memory code, but failed.
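(To illustrate the point, a sketch not from the original post, reusing op1 and op2_128 from the functions above:)

// In asm, mulps can take its second operand directly from memory:
//     mulps xmm0, xmmword ptr [esi]
// With intrinsics, the load has to be spelled out first:
__m128 x = _mm_load_ps(op1);   // explicit load into a register
x = _mm_mul_ps(x, op2_128);    // register-to-register multiply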

kzyczynski
Quote:

And for the original problem:
I tested several versions, and there was no significant improvement over mine (although it's 3 times faster than using the plain FPU, so how can you say it's not worth it?).


Wow, 3x faster?? That would be perfect for my application. I don't know what's going on; maybe there is some problem with my frame-rate counter or something. I will do some more accurate tests tonight. For now I get ~179 fps using the plain FPU and ~152 fps using SSE.

deffer
Quote:
Original post by kzyczynski
I don't know what's going on; maybe there is some problem with my frame-rate counter or something. I will do some more accurate tests tonight. For now I get ~179 fps using the plain FPU and ~152 fps using SSE.


I didn't test it with an fps counter, but with rdtsc, and on a small piece of code.
Here's the test I used:

#include <stdio.h>
#include <tchar.h>
#include <xmmintrin.h>
#include "TStampCounter.h"

// Time: 33100
void __cdecl mul_fpu(float* op1, float op2, float* d, int size)
{
    while (size) {
        *d = *op1 * op2;
        ++d;
        ++op1;
        --size;
    }
}

// Time: 10700
void __cdecl mul_sse(float* op1, float op2, float* d, int size)
{
    register int cnt = size / 4;

    __m128 op2_128 = _mm_load1_ps(&op2);

    while (cnt) {
        _mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));

        op1 += 4;
        d += 4;
        --cnt;
    }
}

// Time: 8750
void __cdecl mul_sse2(float* op1, float op2, float* d, int size)
{
    int cnt = size / 4;
    // register int cnt2 = cnt/2; // For some reason with this (instead of the next two lines) it goes to ~9900...
    register int cnt2 = cnt & ~1;
    cnt2 /= 2;

    const __m128 op2_128 = _mm_load1_ps(&op2);

    while (cnt2) {
        _mm_store_ps(d,   _mm_mul_ps(_mm_load_ps(op1),   op2_128));
        _mm_store_ps(d+4, _mm_mul_ps(_mm_load_ps(op1+4), op2_128));

        op1 += 8;
        d += 8;
        cnt2 -= 1;
    }

    register int cntrest = cnt & 1;
    if (cntrest) {
        _mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));
    }
}


#define ALIGN16_CHAR(p) ( (p) + (0x10 - (((unsigned int)(p)) & 0xF)) )
#define ALIGN16(p)      ALIGN16_CHAR((char*)(p))


#define TEST_SIZE (8192)
#define SAFE_SIZE 4

//#define MUL mul_fpu
//#define MUL mul_sse
#define MUL mul_sse2


int _tmain(int argc, _TCHAR* argv[])
{
    CTStampCounter tsCounter;
    tsCounter.Init();

    // Get 16-byte aligned memory.
    float *src = new float[TEST_SIZE + SAFE_SIZE];
    float *dst = new float[TEST_SIZE + SAFE_SIZE];
    float *asrc = (float*) ALIGN16(src);
    float *adst = (float*) ALIGN16(dst);

    // Init with some values.
    static const float base = 0.01f;
    float accum = 0.0f;
    for (int i = 0; i < TEST_SIZE; ++i) {
        asrc[i] = accum;
        accum += base;
    }

    // Transform and measure time.
    tsCounter.CatchTime0();
    MUL(asrc, 1.234567f, adst, TEST_SIZE);
    tsCounter.CatchTime1();

    // Results (time1 - time0).
    DWORD time = tsCounter.GetDuration();
    printf("time = %u\n", time);

    delete[] src;
    delete[] dst;
    getchar();
    return 0;
}




You can see the timings above each function.

Now, if you really don't see a difference between those versions, then it must be something else that's your bottleneck.
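(The CTStampCounter class isn't shown here; as a stand-in, a minimal rdtsc wrapper along these lines would do. This is a sketch, assuming MSVC inline asm on x86; ReadTSC is a hypothetical helper, not part of the original test.)

// Minimal stand-in for CTStampCounter (sketch; the real class is not shown).
// rdtsc reads the CPU's 64-bit time-stamp counter into EDX:EAX.
inline unsigned __int64 ReadTSC()
{
    unsigned __int64 t;
    __asm {
        rdtsc                    // EDX:EAX = time-stamp counter
        mov dword ptr t, eax     // low dword
        mov dword ptr t+4, edx   // high dword
    }
    return t;
}

// Usage:
//   unsigned __int64 t0 = ReadTSC();
//   MUL(asrc, 1.234567f, adst, TEST_SIZE);
//   unsigned __int64 t1 = ReadTSC();
//   printf("time = %I64u\n", t1 - t0);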

Guest Anonymous Poster
deffer, could you post the asm that gets spit out there?

kzyczynski, a good rule in SSE is that you need a high ratio of SSE calculation to data consumption. Here, you are loading and storing your entire data set to do one operation at a time. That is a very bad choice. asm would help some, but you need to re-examine your method (since all you're doing is linearly interpolating each pixel of the image).

Also, you should work not just with one group of 4 pixels at a time, but with multiple groups (preferably 4). Another good rule in SSE is to use calculations on one set of registers to hide the latency of moving data in and out of the other registers.

If I were writing your code, I would go back to interpolateOctavesSIMD and write it in assembly. Something like:


int count = texsize / 16;
int remainder = texsize % 16;

__declspec(align(16)) const float a[4] = { _1mz, _1mz, _1mz, _1mz };
__declspec(align(16)) const float b[4] = { z, z, z, z };

const float* data0 = prev->memr->data;
const float* data1 = last->memr->data;
float* output = out->memr->data;

_asm {
    mov esi, count
    mov ecx, data0
    mov edx, data1
    mov edi, output

loopers:

    movaps xmm0, [ecx]
    movaps xmm1, [edx]
    movaps xmm2, [ecx + 10h]
    movaps xmm3, [edx + 10h]

    movaps xmm4, [ecx + 20h]
    movaps xmm5, [edx + 20h]
    movaps xmm6, [ecx + 30h]
    movaps xmm7, [edx + 30h]

    mulps xmm0, a        // prev * (1 - z)
    mulps xmm1, b        // last * z
    mulps xmm2, a
    mulps xmm3, b

    mulps xmm4, a
    mulps xmm5, b
    mulps xmm6, a
    mulps xmm7, b

    addps xmm0, xmm1
    addps xmm2, xmm3
    addps xmm4, xmm5
    addps xmm6, xmm7

    movaps [edi], xmm0
    movaps [edi + 10h], xmm2
    movaps [edi + 20h], xmm4
    movaps [edi + 30h], xmm6

    add edi, 40h
    add ecx, 40h
    add edx, 40h

    sub esi, 1

    jnz loopers
}

// process remainder




I didn't try to compile this, but it should work okay. Pretty simple. It should really fly, though. I highly recommend you download CodeAnalyst from AMD and study these snippets in "Pipeline Simulation" mode. It will show you how the instructions move through each stage of the pipeline, and it marks stalls with big red blocks! Otherwise you are just flying blind, and SSE is far too fragile for that.


--ajas

deffer
Quote:
Original post by Anonymous Poster
deffer, could you post the asm that gets spit out there?


You mean, the asm produced by the compiler?
Here it is for mul_sse2:

const __m128 op2_128 = _mm_load1_ps(&op2);
004010B0 lea ecx,[op2]
004010B3 movss xmm0,dword ptr [ecx]
004010B7 mov ecx,dword ptr [op1]
004010BA sar eax,1

while (cnt2) {
004010BC test eax,eax
004010BE shufps xmm0,xmm0,0
004010C2 je mul_sse2+5Fh (4010EFh)
004010C4 mov edx,eax
004010C6 mov eax,dword ptr [d]
004010C9 lea esp,[esp]

_mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));
004010D0 movaps xmm1,xmmword ptr [ecx]
004010D3 mulps xmm1,xmm0
004010D6 movaps xmmword ptr [eax],xmm1
_mm_store_ps(d+4, _mm_mul_ps(_mm_load_ps(op1+4), op2_128));
004010D9 movaps xmm1,xmmword ptr [ecx+10h]
004010DD mulps xmm1,xmm0
004010E0 movaps xmmword ptr [eax+10h],xmm1

op1 += 8;
004010E4 add ecx,20h
d += 8;
004010E7 add eax,20h
004010EA dec edx
004010EB jne mul_sse2+40h (4010D0h)

while (cnt2) {
004010ED jmp mul_sse2+62h (4010F2h)
004010EF mov eax,dword ptr [d]
cnt2 -= 1;
};

register int cntrest = cnt & 1;
004010F2 test bl,1
if (cntrest) {
004010F5 je mul_sse2+70h (401100h)
_mm_store_ps(d, _mm_mul_ps(_mm_load_ps(op1), op2_128));
004010F7 movaps xmm1,xmmword ptr [ecx]
004010FA mulps xmm1,xmm0
004010FD movaps xmmword ptr [eax],xmm1
};




Quote:
Original post by Anonymous Poster
I didn't try to compile this, but it should work okay. Pretty simple. It should really fly, though.


I think he's right. If you are going to use just a simple loop to process a large amount of data, then you don't need the compiler to adapt your asm code (by choosing the proper registers, for example). Just write your big loop in asm.
And the ability to write:
    mulps xmm0, a
is really useful. The compiler doesn't generate that form at all (maybe in the next version).

Quote:
Original post by Anonymous Poster
I highly recommend you download CodeAnalyst from AMD, and study these snippets in "Pipeline Simulation" mode.


Oh, that's for some high-end CPUs, as far as I remember (not for my AMD Duron 1GHz, at least). And the OP is using a Pentium III.

Here is your function in assembler.

int mulOP2( float * op1, float op2, float * d, int size )
{
    __asm
    {
        mov ecx, size
        shr ecx, 2              // size / 4
        mov esi, op1
        mov edi, d
        movss xmm7, op2
        shufps xmm7, xmm7, 0x00 // op2 op2 op2 op2

__LOOP:
        prefetchnta [esi+16]    // this is *the key*
        movaps xmm0, [esi]
        mulps xmm0, xmm7
        movaps [edi], xmm0
        add esi, 16             // 4*sizeof(float)
        add edi, 16
        dec ecx
        jnz __LOOP
    }
    return QE_SUCCESS;
}

The key to performance is the prefetch. You need to work with the cache with SSE, or it will suck. You also need to align your data, or it will suck.

If you ask me, that is just as simple as C, and why you would want portability with a processor-specific instruction set is a mystery to me... portability to what? GCC and Intel C++, I guess.
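(For completeness, a usage sketch for the routine above, assuming 16-byte-aligned buffers from the MSVC CRT's _aligned_malloc; N and the multiplier are example values:)

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)

const int N = 8192;   // must be a multiple of 4, as the routine assumes
float* src = (float*)_aligned_malloc(N * sizeof(float), 16);
float* dst = (float*)_aligned_malloc(N * sizeof(float), 16);

for (int i = 0; i < N; ++i)
    src[i] = (float)i;

// Aligned buffers, so the movaps/prefetchnta loop is safe.
mulOP2(src, 1.234567f, dst, N);

_aligned_free(src);
_aligned_free(dst);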

