SSE confusion!

10 comments, last by sepul 18 years, 1 month ago
I'm a newbie in SIMD programming, so for my first attempt I implemented a dot product in SSE, but the problem is that the SSE code is always slower than the normal code. Here is the source:

#define __SIMD_ASM
//#define __NO_SIMD

#define SIMD_SHUFFLE( srch, src1, desth, dest1 ) ( ((srch)<<6) | ((src1)<<4) | ((desth)<<2) | ((dest1)) )
__declspec(align(16)) float g_vr[4];

class vect
{
public:
	union	{
		__declspec(align(16)) float v[3];

		struct	{
			float x;
			float y;
			float z;
		};
	};

public:
	vect() : x(0.0f), y(0.0f), z(0.0f) {}
	vect( float nx, float ny, float nz ) : x(nx), y(ny), z(nz) {}

#ifdef __NO_SIMD
	float operator*( const vect& v ) const
	{
		return (x*v.x + y*v.y + z*v.z);
	}

#endif

#ifdef __SIMD_ASM
	inline float operator*( const vect& v ) const
	{
		_asm	{
			mov esi, this
			mov edi, v
			movaps xmm0, [esi]
			mulps xmm0, [edi]
			// xmm0 = x*v.x, y*v.y, z*v.z
			
			movaps xmm1, xmm0
            shufps xmm1, xmm0, SIMD_SHUFFLE(0x01, 0x00, 0x03, 0x02)
			
			addps xmm1, xmm0
			shufps xmm0, xmm1, SIMD_SHUFFLE(0x02, 0x03, 0x00, 0x01)

			addps xmm0, xmm1

			movaps g_vr, xmm0	
		}

		return g_vr[0];
	}
#endif
};


The code compiled with __NO_SIMD is always faster than __SIMD_ASM, over 1 million dot products on randomly created vectors! Is there any problem with the code? Am I missing something here? (The compiler is VC7.1.) Thanks.

dark-hammer engine - http://www.hmrengine.com

Much depends on how you measure performance.
For example look here:
http://www.gamedev.net/community/forums/topic.asp?topic_id=380420


Also, when you write __declspec(align(16)) v[3], your vec3 will occupy 16 bytes of memory, just like a vec4. So if memory is of no concern to you, in this case it would be better to implement only a vec4. And if you want it to stay 12 bytes, you shouldn't use movaps to load the data from vec3.xyz into xmm registers.
I'm not an expert on SSE, but here is a common reason why VPU code might be slower. The FPU implementation can be faster if the vector's elements are already in floating point registers. In that case, the FPU executes only 5 FP instructions to do the dot product. Compare that to your code which loads the data from memory and executes 10 SSE instructions.
John Bolton, Locomotive Games (THQ). Current Project: Destroy All Humans (Wii). IN STORES NOW!
SSE does not work well horizontally. For calculating the dot product, you have to add X, Y, and Z across a register, which can only be done by shuffling. If you have SSE3, there is a single opcode which does this for you, though. (can't remember it off the top of my head...)

You will find SSE to be a better performer with more vertical operations, such as vector by matrix, or matrix by matrix.

Also, you would be much better off using the SIMD floating-point intrinsics rather than writing assembly code. When you use intrinsics, the compiler can automatically prepare for register usage and even realign instructions for better pipelining. Definitely something you can't do with hand-written assembly.

One last thing (as Soth pointed out): your vect class is aligning its data to 16 bytes (good), but only reserving 12 bytes of space (bad!). When you write back from an XMM register into memory, you'll be writing over the 4 bytes after the 'z' variable in memory, since movaps _always_ works with 16 bytes of data. Add something like a "float w;" after the z -- even if you don't use it.
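
For illustration, an intrinsics version of the padded dot product might look roughly like this (an untested sketch; it assumes the vect has been padded to four 16-byte-aligned floats with w kept at 0, and dot_sse_intrin is just an illustrative name):

#include <xmmintrin.h>  // SSE intrinsics

inline float dot_sse_intrin( const float* a, const float* b )
{
	__m128 va  = _mm_load_ps( a );            // x  y  z  w   (w assumed to be 0)
	__m128 vb  = _mm_load_ps( b );
	__m128 mul = _mm_mul_ps( va, vb );        // x*x'  y*y'  z*z'  0

	// horizontal add: fold the upper pair onto the lower pair, then lane 1 onto lane 0
	__m128 t   = _mm_add_ps( mul, _mm_movehl_ps( mul, mul ) );   // lane 0 = x*x'+z*z', lane 1 = y*y'+0
	__m128 sum = _mm_add_ss( t, _mm_shuffle_ps( t, t, 0x55 ) );  // lane 0 = full dot product

	float ret;
	_mm_store_ss( &ret, sum );
	return ret;
}

On SSE3, the shuffle/add pair collapses into the single horizontal-add opcode alluded to above (haddps, or _mm_hadd_ps from <pmmintrin.h>).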
Quote:
Also, you would be much better off using the SIMD floating-point intrinsics rather than writing assembly code. When you use intrinsics, the compiler can automatically prepare for register usage and even realign instructions for better pipelining. Definitely something you can't do with hand-written assembly

This is probably why your code runs so much slower - MSVC has to put all kinds of extra stuff around your function call because it can't control the optimizations within it. If you use intrinsics as suggested you should(!) lose this overhead.

If you want to prove this to yourself, write the entire 1 million dp loop in assembly, and you should notice that it is faster.

Keep in mind the MSVC8 compiler does a decent job of optimizing code these days, so you don't get that much pay-off for SSE anymore.
It's not clear that a SIMD dot product will run any faster than a regular x87 dot product. The reason is that SIMD cannot do cross-wise operations until the latest generation of CPUs, so the sum will kill you.

The Pentium IV can only issue one SIMD add per 2 cycles, and one SIMD multiply (or other op) per 2 cycles, so if you have perfect multiply/add balance, you get one instruction per cycle. That also depends on the data being in cache... (L1 cache has about a 3 cycle latency).

The shuffles are more expensive than the multiplies -- I think about 3 cycles. Thus, a dot product that depends on two additional shuffles won't be very fast.

SIMD works much better when you have actual parallel data to work on. A bunch of vertices multiplied by a matrix, for example.
enum Bool { True, False, FileNotFound };
Well, I am an SSE expert! What John Bolton said is true, so I'll go into detail about it all. Not only are you using 10 SSE instructions, the *ps versions are more expensive than the *ss ones, and mov??ps instructions are cheaper than shuffles.

For instance, you could write:
float ret;
_asm	{
	mov	esi,	this
	mov	edi,	v
	movaps	xmm0,	[esi]
	mulps	xmm0,	[edi]
	// xmm0 = x*v.x, y*v.y, z*v.z
	movhlps	xmm1,	xmm0
	addss	xmm1,	xmm0
	// only way to get y-component into x is still a shuffle.
	shufps	xmm0,	xmm1, SIMD_SHUFFLE(0x02, 0x03, 0x00, 0x01)
	addss	xmm0,	xmm1
	movss	ret,	xmm0
}
return ret;

and this should perform more closely to your regular FPU dot product... but it's still worse than FPU. Okay, let's try a little harder then.

float ret;
_asm	{
	mov	esi,	this
	mov	edi,	v
	movss	xmm0,	[esi]vect.x
	movss	xmm1,	[esi]vect.y
	movss	xmm2,	[esi]vect.z
	mulss	xmm0,	[edi]vect.x
	mulss	xmm1,	[edi]vect.y
	mulss	xmm2,	[edi]vect.z
	addss	xmm0,	xmm1
	addss	xmm2,	xmm0
	movss	ret,	xmm2
}
return ret;


So, no more shuffles, all ops are scalar, it's as fast as SSE can make this exact operation... unfortunately it's still a little bit slower than the FPU! Is there anything that can be done? Well, yes. We're still bound mainly by latency stalling-- particularly at the very end where each output register is used as input to the next instruction-- and by loop/function call overhead (your SSE function isn't *really* inlined, but the FPU is!)

From the perspective that we're optimizing, and we're doing something in a loop, we can unroll once to mitigate the effects of register latency stalls and loop overhead. We also pull the code out of a member function.

inline void dot_SSE_scalar( float* result, const vect* v0, const vect* v1 )
{
	_asm {
		mov	esi,	v0
		mov	edi,	v1
		mov	eax,	result
		movss	xmm0,	[esi]vect.x
		movss	xmm1,	[esi]vect.y
		movss	xmm2,	[esi]vect.z
		mulss	xmm0,	[edi]vect.x
		mulss	xmm1,	[edi]vect.y
		mulss	xmm2,	[edi]vect.z
		addss	xmm0,	xmm1
		movss	xmm3,	[esi + 10h]vect.x
		movss	xmm4,	[esi + 10h]vect.y
		movss	xmm5,	[esi + 10h]vect.z
		addss	xmm2,	xmm0
		mulss	xmm3,	[edi + 10h]vect.x
		mulss	xmm4,	[edi + 10h]vect.y
		mulss	xmm5,	[edi + 10h]vect.z
		addss	xmm3,	xmm4
		movss	[eax],	xmm2
		addss	xmm5,	xmm3
		movss	[eax + 4], xmm5
	}
	return;
}


And this should definitely beat FPU code, even if the FPU is unrolled similarly. Actually, the compiler is almost definitely already inlining and unrolling the FPU version, so doing it manually is pointless. When MSVC sees the _asm keyword it freaks out and turns off all optimizations in the surrounding code block, so you should definitely not use it for small blocks of code. Theoretically, you could use intrinsics, but MSVC sucks terribly at that also... so you're best either with big chunks of SSE assembly or pure compiled FPU code.

But there's a much bigger problem here. Doing a single dot product is just simply a very poor use of SSE. You want to do at least 4 at once, because then you can use the *ps instructions, and no shuffling is required! (The dot is performed by adding "vertically" rather than "horizontally") The caveat is that you have to store your vectors in a special way. You want to be able to put 4 x-components in a register, all the y-components in a register, all the z-'s in a third. In order to do that, you need to store like: float x[4]; float y[4]; float z[4]; in a single structure. This is called "Structure of Arrays" (SOA). If that seems weird, remember this is only for pieces of code that are identified as math bottlenecks... not part of a general-purpose math library.

So... taking everything into account, here is a once-unrolled SOA dot product, calculating 8 dot-products per call (one in each of wzyx components for 2 consecutive vec3_SOAs):
struct vec3_SOA
{
	__declspec(align(16))
	float x[4];
	float y[4];
	float z[4];
};

inline void dot_SSE_soa( float* result, const vec3_SOA* soa0, const vec3_SOA* soa1 )
{
	_asm {
		mov		esi,	soa0
		mov		edi,	soa1
		mov		eax,	result
		movaps	xmm0,	[esi]
		movaps	xmm1,	[esi + 10h]
		movaps	xmm2,	[esi + 20h]
		mulps	xmm0,	[edi]
		mulps	xmm1,	[edi + 10h]
		mulps	xmm2,	[edi + 20h]
		movaps	xmm3,	[esi + 30h]
		movaps	xmm4,	[esi + 40h]
		movaps	xmm5,	[esi + 50h]
		addps	xmm0,	xmm1
		mulps	xmm3,	[edi + 30h]
		mulps	xmm4,	[edi + 40h]
		mulps	xmm5,	[edi + 50h]
		addps	xmm3,	xmm4
		addps	xmm2,	xmm0
		addps	xmm5,	xmm3
		movaps	[eax],	xmm2
		movaps	[eax + 10h], xmm5
	}
}


At this point, you're totally memory-bound. No further attempt to optimize the math will have any benefit. Remember that an L2-miss will cost several hundred cycles, while this loop will take maybe 20-30. As a test you could try running it over the same piece of data again and again vs consuming a continuous data set.

I should point out that I didn't actually verify this code! But I've written similar stuff many times, and this is doing all the necessary operations, even if there is a typo or two.
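
For comparison, a rough intrinsics rendering of the same SOA idea (again an unverified sketch; dot4_soa_intrin is just an illustrative name, and it assumes result points to 16-byte-aligned memory). The point to notice is that no shuffles appear anywhere: the four dot products are built purely from vertical multiplies and adds.

#include <xmmintrin.h>

// Four dot products at once from the vec3_SOA layout above.
inline void dot4_soa_intrin( float* result, const vec3_SOA* a, const vec3_SOA* b )
{
	__m128 x = _mm_mul_ps( _mm_load_ps( a->x ), _mm_load_ps( b->x ) );  // 4 x-products
	__m128 y = _mm_mul_ps( _mm_load_ps( a->y ), _mm_load_ps( b->y ) );  // 4 y-products
	__m128 z = _mm_mul_ps( _mm_load_ps( a->z ), _mm_load_ps( b->z ) );  // 4 z-products
	_mm_store_ps( result, _mm_add_ps( _mm_add_ps( x, y ), z ) );        // 4 dots, no shuffles
}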
Quote:if memory is of no concern to you, in this case it would be better to implement only a vec4. And if you want it to stay 12 bytes, you shouldn't use movaps to load the data from vec3.xyz into xmm registers

Actually, I don't think that adding another (w) component will help, because the class is already 16-byte aligned (although I did test it with an added component and it didn't help).

Quote:This is probably why your code runs so much slower - MSVC has to put all kinds of extra stuff around your function call because it can't control the optimizations within it. If you use intrinsics as suggested you should(!) lose this overhead.

I think that's right, because when I look at the assembly code generated by the compiler, there is some unnecessary overhead outside the asm block!
I should try using the intrinsics, but I'm not sure they will gain significant performance.

ajas95: that was a good idea on the SoA structure, but as you said, it is very complicated to integrate this design into special algorithms efficiently.
And a single dot product is not suitable for SSE. I've also tried the other code you recommended (except for the last one); none of it improved the performance, so I guess I'll leave it with the normal scalar code.

My main bottleneck is in PVS calculation (ray tracing), which could benefit from the SoA structure and dot products.
Another one for the engine is mainly matrix-matrix and vector-matrix multiplication, and also the dot products for BSP tree traversal.

anyway, "hplus0603" mentioned the SSE works better for multiplying vector into matrix,
how is it possible? multiplying a vector into matrix involves 3or4 dot products which is even less efficient that a single dot product.

thanks for the comments.

dark-hammer engine - http://www.hmrengine.com

Quote:Original post by sepul
Another one for the engine is mainly matrix-matrix and vector-matrix multiplication, and also the dot products for BSP tree traversal.
...

Multiplying a vector by a matrix involves 3 or 4 dot products, which is even less efficient than a single dot product.


Matrix-matrix, vector-matrix are good for SSE, bsp traversal is not. BSP is mostly bound by the fact that you need to do several unpredictable branches, and jump to potentially random places in memory for the next iteration. Only worry about SSE once you've figured out how to beat those bottlenecks.

Matrix multiply requires that the matrix be stored column-major. So, if X, Y, and Z are the ordinal axes and T is the translation:
struct matrix
{
    float X[3], padX;
    float Y[3], padY;
    float Z[3], padZ;
    float T[3], padT;
};


So, now you start to see how SOA helps. Except that your original vector doesn't need to be SOA; you just want to create 4 versions of the same vect in SOA form by copying each component into a separate vector and shuffling it across all 4 components, so that you now have:
xmm0: v.x   v.x   v.x   v.x
xmm1: v.y   v.y   v.y   v.y
xmm2: v.z   v.z   v.z   v.z


and your routine looks like:

mulps  v.x,  mat.x
mulps  v.y,  mat.y
mulps  v.z,  mat.z
addps  v.x,  v.y    // v.x += v.y
addps  v.z,  mat.t  // v.z += mat.t
addps  v.x,  v.z    // v.x now contains the transformed vector.
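
In case the register-name shorthand above is hard to follow, here is a rough intrinsics sketch of the same transform (unverified; transform_intrin is just an illustrative name, and it assumes both the matrix struct above and the output pointer are 16-byte aligned):

#include <xmmintrin.h>

// v transformed by the matrix above: one splat per component, no shuffles after that.
inline void transform_intrin( float* out, const float* v, const matrix* m )
{
	__m128 x = _mm_set1_ps( v[0] );                  // v.x  v.x  v.x  v.x
	__m128 y = _mm_set1_ps( v[1] );                  // v.y  v.y  v.y  v.y
	__m128 z = _mm_set1_ps( v[2] );                  // v.z  v.z  v.z  v.z

	__m128 r = _mm_mul_ps( x, _mm_load_ps( m->X ) ); // v.x * X axis
	r = _mm_add_ps( r, _mm_mul_ps( y, _mm_load_ps( m->Y ) ) );
	r = _mm_add_ps( r, _mm_mul_ps( z, _mm_load_ps( m->Z ) ) );
	r = _mm_add_ps( r, _mm_load_ps( m->T ) );        // + translation

	_mm_store_ps( out, r );                          // the 4th lane just carries the pads
}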


Once you've been doing SSE a while, you start to see patterns emerge. It can be tricky at first, though. I recommend AMD's CodeAnalyst profiler in "Pipeline Simulation" mode to see why whatever optimization you're trying does or doesn't work.
Thanks for the comments, ajas.
I'm currently working on optimizing the matrix/vector multiplies, and I will profile the code after that.

dark-hammer engine - http://www.hmrengine.com

