# SSE confusion !


I'm a newbie in SIMD programming, so as a first exercise I implemented a dot product in SSE. The problem is that the SSE code is always slower than the normal code. Here is the source:
#define __SIMD_ASM
//#define __NO_SIMD

#define SIMD_SHUFFLE( srch, src1, desth, dest1 ) ( ((srch)<<6) | ((src1)<<4) | ((desth)<<2) | ((dest1)) )
__declspec(align(16)) float g_vr[4];

class vect
{
public:
union	{
__declspec(align(16)) float v[3];

struct	{
float x;
float y;
float z;
};
};

public:
vect() : x(0.0f), y(0.0f), z(0.0f) {}
vect( float nx, float ny, float nz ) : x(nx), y(ny), z(nz) {}

#ifdef __NO_SIMD
float operator*( const vect& v ) const
{
return (x*v.x + y*v.y + z*v.z);
}

#endif

#ifdef __SIMD_ASM
inline float operator*( const vect& v ) const
{
_asm	{
mov esi, this
mov edi, v
movaps xmm0, [esi]
mulps xmm0, [edi]
// xmm0 = x*v.x, y*v.y, z*v.z

movaps xmm1, xmm0
shufps xmm1, xmm0, SIMD_SHUFFLE(0x01, 0x00, 0x03, 0x02)
addps xmm1, xmm0
// xmm1[0] = x*v.x + z*v.z

shufps xmm0, xmm1, SIMD_SHUFFLE(0x02, 0x03, 0x00, 0x01)
addps xmm0, xmm1
// xmm0[0] = x*v.x + y*v.y + z*v.z

movaps g_vr, xmm0
}

return g_vr[0];
}
#endif
};

The code compiled with __NO_SIMD is always faster than the code compiled with __SIMD_ASM, measured over 1 million dot products of randomly generated vectors. Is there a problem with the code? Am I missing something here? (The compiler is VC7.1.) Thanks.

##### Share on other sites
Much depends on how you measure performance.
For example, look here:
http://www.gamedev.net/community/forums/topic.asp?topic_id=380420

Also, when you write __declspec(align(16)) float v[3], your vec3 will occupy 16 bytes of memory, just like a vec4. So if memory is of no concern to you, it would be better to implement only vec4. And if you want it to be 12 bytes, you shouldn't use movaps to load data from vec3.xyz into xmm registers.
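
The size claim can be checked at compile time. This is a minimal sketch that assumes C++11 `alignas` in place of the `__declspec(align(16))` used in the original post; the type names `Vec3Aligned` and `Vec3Packed` and the helper `aligned_array_bytes` are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the two layouts discussed above.
struct alignas(16) Vec3Aligned { float x, y, z; };  // forced to 16-byte alignment
struct Vec3Packed { float x, y, z; };               // natural 4-byte alignment

// The alignment requirement pads the aligned struct up to a multiple of
// 16 bytes, so it costs as much memory as a four-float vector.
static_assert(sizeof(Vec3Aligned) == 16, "padded to 16 bytes");
static_assert(alignof(Vec3Aligned) == 16, "16-byte aligned");
static_assert(sizeof(Vec3Packed) == 12, "three tightly packed floats");

// Size in bytes of an array of n aligned vectors.
constexpr std::size_t aligned_array_bytes(std::size_t n) {
    return n * sizeof(Vec3Aligned);
}
```

With the packed layout, a 16-byte aligned load such as movaps is no longer safe, because consecutive elements start on 12-byte boundaries.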

##### Share on other sites
I'm not an expert on SSE, but here is a common reason why VPU code might be slower. The FPU implementation can be faster if the vector's elements are already in floating point registers. In that case, the FPU executes only 5 FP instructions to do the dot product. Compare that to your code which loads the data from memory and executes 10 SSE instructions.

##### Share on other sites
SSE does not work well horizontally. For calculating the dot product, you have to add X, Y, and Z across a register, which can only be done by shuffling. If you have SSE3, there is a single opcode which does this for you, though. (can't remember it off the top of my head...)

You will find SSE to be a better performer with more vertical operations, such as vector by matrix, or matrix by matrix.

Also, you would be much better off using the SIMD floating-point intrinsics rather than writing assembly code. When you use intrinsics, the compiler can automatically prepare for register usage and even realign instructions for better pipelining. Definitely something you can't do with hand-written assembly.

One last thing (as Soth pointed out): your vect class is aligning its data to 16 bytes (good), but only reserving 12 bytes of space (bad!). When you write back from an XMM register into memory, you'll be writing over the 4 bytes after the 'z' variable in memory, since movaps _always_ works with 16 bytes of data. Add something like a "float w;" after the z -- even if you don't use it.
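
As a rough sketch of the two suggestions above (intrinsics instead of inline assembly, plus an explicit padding component), something like the following could be tried. The `Vec4` type and the `dot3_sse` name are assumptions for illustration, not code from this thread, and whether it beats the plain scalar version would still have to be measured.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical padded vector: w is unused, but it keeps 16-byte loads in bounds.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Dot product of the x, y, z components; the w lanes are ignored.
inline float dot3_sse(const Vec4& a, const Vec4& b) {
    // _mm_load_ps requires 16-byte-aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);

    __m128 p  = _mm_mul_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x));  // (xx, yy, zz, ww)
    __m128 zz = _mm_movehl_ps(p, p);                               // (zz, ww, zz, ww)
    __m128 yy = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));      // (yy, yy, yy, yy)
    __m128 sum = _mm_add_ss(_mm_add_ss(p, yy), zz);                // lane 0 = xx + yy + zz
    return _mm_cvtss_f32(sum);
}
```

Because the function is ordinary C++, the compiler is free to inline it and schedule the loads, which is the advantage described above.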

##### Share on other sites
Quote:
 Also, you would be much better off using the SIMD floating-point intrinsics rather than writing assembly code. When you use intrinsics, the compiler can automatically prepare for register usage and even realign instructions for better pipelining. Definitely something you can't do with hand-written assembly

This is probably why your code runs so much slower - MSVC has to put all kinds of extra stuff around your function call because it can't control the optimizations within it. If you use intrinsics as suggested you should(!) lose this overhead.

If you want to prove this to yourself, write the entire 1 million dp loop in assembly, and you should notice that it is faster.

Keep in mind the MSVC8 compiler does a decent job of optimizing code these days, so you don't get that much pay-off for SSE anymore.

##### Share on other sites
It's not clear that a SIMD dot product will run any faster than a regular x87 dot product. The reason is that SIMD cannot do cross-wise operations until the latest generation of CPUs, so the sum will kill you.

The Pentium IV can only issue one SIMD add per 2 cycles, and one SIMD multiply (or other op) per 2 cycles, so if you have perfect multiply/add balance, you get one instruction per cycle. That also depends on the data being in cache... (L1 cache has about a 3 cycle latency).

The shuffles are more expensive than the multiplies -- I think about 3 cycles. Thus, a dot product that depends on two additional shuffles won't be very fast.

SIMD works much better when you have actual parallel data to work on. A bunch of vertices multiplied by a matrix, for example.

##### Share on other sites
Well, I am an SSE expert! What John Bolton said is true, so I'll go into detail about it all. Not only are you using 10 SSE instructions, the *ps versions are more expensive than the *ss ones, and mov??ps instructions are cheaper than shuffles.

For instance, you could write:

float ret;

_asm {
mov esi, this
mov edi, v
movaps xmm0, [esi]
mulps xmm0, [edi]
// xmm0 = x*v.x, y*v.y, z*v.z

movhlps xmm1, xmm0
addss xmm1, xmm0
// xmm1[0] = x*v.x + z*v.z

// only way to get y-component into x is still a shuffle.
shufps xmm0, xmm1, SIMD_SHUFFLE(0x02, 0x03, 0x00, 0x01)
addss xmm0, xmm1
// xmm0[0] = x*v.x + y*v.y + z*v.z

movss ret, xmm0
}

return ret;

and this should perform more closely to your regular FPU dot product... but it's still worse than FPU. Okay, let's try a little harder then.

float ret;

_asm {
mov esi, this
mov edi, v
movss xmm0, [esi]vect.x
movss xmm1, [esi]vect.y
movss xmm2, [esi]vect.z

mulss xmm0, [edi]vect.x
mulss xmm1, [edi]vect.y
mulss xmm2, [edi]vect.z

addss xmm1, xmm0
addss xmm2, xmm1
// xmm2 = x*v.x + y*v.y + z*v.z

movss ret, xmm2
}

So, no more shuffles, all ops are scalar, it's as fast as SSE can make this exact operation... unfortunately it's still a little bit slower than the FPU! Is there anything that can be done? Well, yes. We're still bound mainly by latency stalling-- particularly at the very end where each output register is used as input to the next instruction-- and by loop/function call overhead (your SSE function isn't *really* inlined, but the FPU is!)

From the perspective that we're optimizing, and we're doing something in a loop, we can unroll once to mitigate the effects of register latency stalls and loop overhead. We also pull the code out of a member function.

inline void dot_SSE_scalar( float* result, const vect* v0, const vect* v1 )
{
_asm {
mov esi, v0
mov edi, v1
mov eax, result

movss xmm0, [esi]vect.x
movss xmm1, [esi]vect.y
movss xmm2, [esi]vect.z

mulss xmm0, [edi]vect.x
mulss xmm1, [edi]vect.y
mulss xmm2, [edi]vect.z

movss xmm3, [esi + 10h]vect.x
movss xmm4, [esi + 10h]vect.y
movss xmm5, [esi + 10h]vect.z

mulss xmm3, [edi + 10h]vect.x
mulss xmm4, [edi + 10h]vect.y
mulss xmm5, [edi + 10h]vect.z

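// sum each set of products; the totals end up in xmm2 and xmm5
addss xmm1, xmm0
addss xmm4, xmm3
addss xmm2, xmm1
addss xmm5, xmm4

movss [eax], xmm2
movss [eax + 4], xmm5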
}

return;
}

And this should definitely beat FPU code, even if the FPU is unrolled similarly. Actually, the compiler is almost definitely already inlining and unrolling the FPU version, so doing it manually is pointless. When MSVC sees the _asm keyword it freaks out and turns off all optimizations in the surrounding code block, so you should definitely not use it for small blocks of code. Theoretically, you could use intrinsics, but MSVC sucks terribly at that also... so you're best either with big chunks of SSE assembly or pure compiled FPU code.

But there's a much bigger problem here. Doing a single dot product is just simply a very poor use of SSE. You want to do at least 4 at once, because then you can use the *ps instructions, and no shuffling is required! (The dot is performed by adding "vertically" rather than "horizontally") The caveat is that you have to store your vectors in a special way. You want to be able to put 4 x-components in a register, all the y-components in a register, all the z-'s in a third. In order to do that, you need to store like: float x[4]; float y[4]; float z[4]; in a single structure. This is called "Structure of Arrays" (SOA). If that seems weird, remember this is only for pieces of code that are identified as math bottlenecks... not part of a general-purpose math library.

So... taking everything into account, here is a once-unrolled SOA dot product, calculating 8 dot-products per call (one in each of wzyx components for 2 consecutive vec3_SOAs):

struct vec3_SOA
{
__declspec(align(16))
float x[4];
float y[4];
float z[4];
};

inline void dot_SSE_soa( float* result, const vec3_SOA* soa0, const vec3_SOA* soa1 )
{
_asm {
mov esi, soa0
mov edi, soa1
mov eax, result

movaps xmm0, [esi]
movaps xmm1, [esi + 10h]
movaps xmm2, [esi + 20h]

mulps xmm0, [edi]
mulps xmm1, [edi + 10h]
mulps xmm2, [edi + 20h]

movaps xmm3, [esi + 30h]
movaps xmm4, [esi + 40h]
movaps xmm5, [esi + 50h]

mulps xmm3, [edi + 30h]
mulps xmm4, [edi + 40h]
mulps xmm5, [edi + 50h]

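// add vertically; xmm2 and xmm5 each end up holding 4 dot products
addps xmm1, xmm0
addps xmm4, xmm3
addps xmm2, xmm1
addps xmm5, xmm4

movaps [eax], xmm2
movaps [eax + 10h], xmm5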
}
}
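
For comparison, the same vertical pattern can be sketched with intrinsics. This is only an illustration under assumptions: the `Vec3x4` type and the `dot4_soa` name are hypothetical, it handles one SOA block (4 dot products) per call rather than two, and it has not been benchmarked against the assembly version.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical SOA block holding four 3D vectors.
struct alignas(16) Vec3x4 {
    float x[4];
    float y[4];
    float z[4];
};

// Computes four dot products at once: out[i] = a[i] . b[i].
// `out` must point to 16-byte-aligned storage for four floats.
inline void dot4_soa(float* out, const Vec3x4& a, const Vec3x4& b) {
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    __m128 xx = _mm_mul_ps(_mm_load_ps(a.x), _mm_load_ps(b.x));
    __m128 yy = _mm_mul_ps(_mm_load_ps(a.y), _mm_load_ps(b.y));
    __m128 zz = _mm_mul_ps(_mm_load_ps(a.z), _mm_load_ps(b.z));

    // Vertical adds only: no shuffles are needed in this layout.
    _mm_store_ps(out, _mm_add_ps(_mm_add_ps(xx, yy), zz));
}
```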

At this point, you're totally memory-bound. No further attempt to optimize the math will have any benefit. Remember that an L2-miss will cost several hundred cycles, while this loop will take maybe 20-30. As a test you could try running it over the same piece of data again and again vs consuming a continuous data set.

I should point out that I didn't actually verify this code! But I've written similar stuff many times, and this is doing all the necessary operations, even if there is a typo or two.
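
The cache test suggested above (same data again and again vs. a continuous data set) could be set up roughly like this. It is a sketch under assumptions: the helper names `sum_dots` and `time_sum_dots_ms` are made up, the vectors are stored as packed xyz triples, and the dot product is plain scalar code so that only the memory access pattern changes between runs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sums the dot products of `count` packed xyz triples, repeated `passes` times.
// Returning the sum keeps the compiler from discarding the work.
inline float sum_dots(const std::vector<float>& a, const std::vector<float>& b,
                      std::size_t count, std::size_t passes) {
    assert(a.size() >= 3 * count && b.size() >= 3 * count);
    float total = 0.0f;
    for (std::size_t p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < count; ++i) {
            total += a[3 * i] * b[3 * i]
                   + a[3 * i + 1] * b[3 * i + 1]
                   + a[3 * i + 2] * b[3 * i + 2];
        }
    }
    return total;
}

// Wall-clock milliseconds for one call to sum_dots; the sum goes to *checksum.
inline double time_sum_dots_ms(const std::vector<float>& a, const std::vector<float>& b,
                               std::size_t count, std::size_t passes, float* checksum) {
    auto t0 = std::chrono::steady_clock::now();
    *checksum = sum_dots(a, b, count, passes);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

One way to use it: time 1024 vectors with 1024 passes (a small buffer that is reused) against 1024*1024 vectors with 1 pass (a large buffer that is streamed once). Both runs do the same number of dot products, so a large gap between them points at memory rather than arithmetic.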

##### Share on other sites
Quote:
 if memory is of no concern to you, it would be better to implement only vec4. And if you want it to be 12 bytes, you shouldn't use movaps to load data from vec3.xyz into xmm registers

Actually, I don't think adding another (w) component will help, because the class is already 16-byte aligned (I tested it with the extra component anyway, and it didn't help).

Quote:
 This is probably why your code runs so much slower - MSVC has to put all kinds of extra stuff around your function call because it can't control the optimizations within it. If you use intrinsics as suggested you should(!) lose this overhead.

I think that's right, because when I look at the assembly code generated by the compiler there is some unnecessary overhead outside the asm block!
I should try using the intrinsics, but I'm not sure it will gain significant performance.

ajas95: the SoA structure is a good idea, but as you said, it is complicated to integrate that layout into specific algorithms efficiently, and a single dot product is not a good fit for SSE.
I've also tried the other code you recommended (everything except the last one), and none of it improved performance, so I guess I'll stay with the normal scalar code.

My main bottlenecks are in PVS calculation (ray tracing), which could benefit from the SoA structure and dot products.
The other bottlenecks in the engine are mainly matrix-matrix and vector-matrix multiplication, and also the dot product for BSP tree traversal.

Anyway, "hplus0603" mentioned that SSE works better for multiplying a vector by a matrix.
How is that possible? Multiplying a vector by a matrix involves 3 or 4 dot products, which should be even less efficient than a single dot product.

thanks for the comments.

##### Share on other sites
Quote:
 Original post by sepul: another one for the engine is mainly matrix-matrix and vector-matrix multiplication and also dot product for bsp tree traversal. ... multiplying a vector into matrix involves 3 or 4 dot products which is even less efficient than a single dot product.

Matrix-matrix, vector-matrix are good for SSE, bsp traversal is not. BSP is mostly bound by the fact that you need to do several unpredictable branches, and jump to potentially random places in memory for the next iteration. Only worry about SSE once you've figured out how to beat those bottlenecks.

Matrix multiply requires that the matrix be stored column-major. So, if X, Y, and Z are the ordinal axes and T is the translation:

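struct matrix
{
__declspec(align(16))
float x[4], y[4], z[4]; // X, Y and Z axes, one column each
float t[4]; // translation column
};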

So, now you start to see how SOA helps. Except that your original vector doesn't need to be SOA; you just want to create 4 copies of each component by loading it into its own register and shuffling it across all 4 lanes, so that now you have:

xmm0: v.x v.x v.x v.x
xmm1: v.y v.y v.y v.y
xmm2: v.z v.z v.z v.z

and your routine looks like:

mulps v.x, mat.x
mulps v.y, mat.y
mulps v.z, mat.z
addps v.x, v.y // v.x += v.y
addps v.z, mat.t // v.z += mat.t
addps v.x, v.z // v.x now contains the transformed vector.
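
Written with intrinsics, the routine above might look roughly like this. It is a sketch under assumptions: the `Mat4` type and the `transform_point` name are hypothetical, the matrix is stored column-major as described, and `_mm_set1_ps` stands in for the load-and-shuffle broadcast of each component.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical column-major matrix: each member is one column
// (the X, Y and Z axes, then the translation T).
struct alignas(16) Mat4 {
    float x[4];
    float y[4];
    float z[4];
    float t[4];
};

// Transforms the point (vx, vy, vz) and writes four floats to `out`,
// which must be 16-byte aligned.
inline void transform_point(float* out, const Mat4& m, float vx, float vy, float vz) {
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    __m128 rx = _mm_mul_ps(_mm_set1_ps(vx), _mm_load_ps(m.x));  // v.x * X column
    __m128 ry = _mm_mul_ps(_mm_set1_ps(vy), _mm_load_ps(m.y));  // v.y * Y column
    __m128 rz = _mm_mul_ps(_mm_set1_ps(vz), _mm_load_ps(m.z));  // v.z * Z column

    rx = _mm_add_ps(rx, ry);                 // v.x += v.y
    rz = _mm_add_ps(rz, _mm_load_ps(m.t));   // v.z += mat.t
    _mm_store_ps(out, _mm_add_ps(rx, rz));   // transformed vector
}
```

Note that there is no horizontal add anywhere: each output component is accumulated in its own lane, which is why this maps onto SSE better than a single dot product does.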

Once you've been doing SSE awhile, you start to see patterns emerge. It can be tricky at first though. I recommend AMD's CodeAnalyst profiler in "Pipeline Simulation" mode to see why the optimization you're trying does or doesn't work.

##### Share on other sites
thanks for the comments ajas
I'm currently working on it to optimize the matrix/vector multiplies. and i will profile the code after that.

##### Share on other sites
Quote:
 Original post by bpoint: SSE does not work well horizontally. For calculating the dot product, you have to add X, Y, and Z across a register, which can only be done by shuffling. If you have SSE3, there is a single opcode which does this for you, though. (can't remember it off the top of my head...)

##### Share on other sites
You mean that by using SSE3 instructions, we can optimize a single dot product to something like this?
(This assumes the vector's 4th value is zero, and please excuse my lack of skill in SIMD programming.)

inline float operator*( const vect& v ) const
{
float r;
_asm {
mov esi, this
mov edi, v
movaps xmm0, [esi]
mulps xmm0, [edi]
// xmm0 = (x*v.x, y*v.y, z*v.z, 0)

haddps xmm0, xmm0
// xmm0 = (x*v.x + y*v.y, z*v.z, x*v.x + y*v.y, z*v.z)

haddps xmm0, xmm0
// xmm0 = (x*v.x + y*v.y + z*v.z, ...)

movss r, xmm0
}
return r;
}

I don't have an SSE3 processor, so I can't test it, but do you think this piece of code can perform better than the normal dot product code?
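
For anyone who wants to try the idea, here is an intrinsics sketch of the same two-step horizontal add. It makes several assumptions: the `dot3_hadd` name and the `DOT3_HAVE_SSE3` macro are made up, both inputs are 16-byte-aligned `float[4]` with the 4th element zero, and SSE3 is detected through the `__SSE3__` / `__AVX__` compiler macros, with an SSE1 fallback so the code still builds when SSE3 is not enabled. Whether it is faster than scalar code would have to be measured.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics
#if defined(__SSE3__) || defined(__AVX__)
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps
#define DOT3_HAVE_SSE3 1
#else
#define DOT3_HAVE_SSE3 0
#endif

// Dot product of two 16-byte-aligned float[4] vectors whose 4th element is zero.
inline float dot3_hadd(const float* a, const float* b) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    __m128 p = _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b));  // (xx, yy, zz, 0)
#if DOT3_HAVE_SSE3
    p = _mm_hadd_ps(p, p);  // (xx+yy, zz+0, xx+yy, zz+0)
    p = _mm_hadd_ps(p, p);  // (xx+yy+zz, ...)
#else
    // SSE1 fallback: bring yy and zz down to lane 0 and add them there.
    __m128 zz = _mm_movehl_ps(p, p);                            // (zz, 0, zz, 0)
    __m128 yy = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));   // (yy, yy, yy, yy)
    p = _mm_add_ss(_mm_add_ss(p, yy), zz);                      // lane 0 = xx+yy+zz
#endif
    return _mm_cvtss_f32(p);
}
```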
