SSE - help plz

Started by
10 comments, last by MadMax1992 15 years, 11 months ago
Hi I have two versions of the same function, one of them is written with SSE instructions and should be faster than the other but it isn't. Could someone take a look and give me some advice why that is, please? Also I would appreciate if someone would direct me to some good books and sites about SSE. Here are the funcs. (SSE assembly code is pasted from some book and looks fine):

inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2)
{
    result.i = op1.j*op2.k - op2.j*op1.k;
    result.j = op2.i*op1.k - op1.i*op2.k;
    result.k = op1.i*op2.j - op2.i*op1.j;
}

inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2)
{
    __asm
    {
        MOV EAX, op1                // Load pointers into CPU regs
        MOV EBX, op2
        MOV ECX, result
        MOVUPS XMM0, [EAX]          // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]
        MOVAPS XMM2, XMM0           // Make a copy of vector A
        MOVAPS XMM3, XMM1           // Make a copy of vector B
        SHUFPS XMM0, XMM0, 0xD8     // 11 01 10 00 Flip the middle elements of A
        SHUFPS XMM1, XMM1, 0xE1     // 11 10 00 01 Flip first two elements of B
        MULPS XMM0, XMM1            // Multiply the modified register vectors
        SHUFPS XMM2, XMM2, 0xE1     // 11 10 00 01 Flip first two elements of the A copy
        SHUFPS XMM3, XMM3, 0xD8     // 11 01 10 00 Flip the middle elements of the B copy
        MULPS XMM2, XMM3            // Multiply the modified register vectors
        SUBPS XMM0, XMM2            // Subtract the two resulting register vectors
        MOVUPS [ECX], XMM0          // Save the return vector
    }
}
It's moving unaligned vectors into the SSE registers. You could try defining your vector like this:

class __declspec(align(16)) Vector3{... etc.


That'll make the code much faster. Secondly, try to group instructions as much as possible. This is my code for my vector cross product (performance is tested):

inline void ShtrVector3::Cross(const ShtrVector3 &vc1, const ShtrVector3 &vc2)
{
    //12-15% faster with SSE
    if (!g_bSSE) {
        x = vc1.y * vc2.z - vc1.z * vc2.y;
        y = vc1.z * vc2.x - vc1.x * vc2.z;
        z = vc1.x * vc2.y - vc1.y * vc2.x;
    } else {
        __asm {
            //move everything
            mov     esi,    vc1
            mov     edi,    vc2
            movaps  xmm0,   [esi]
            movaps  xmm1,   [edi]
            movaps  xmm2,   xmm0
            movaps  xmm3,   xmm1
            //shuffle everything
            shufps  xmm0,   xmm0,   0x09
            shufps  xmm1,   xmm1,   0x12
            shufps  xmm2,   xmm2,   0x12
            shufps  xmm3,   xmm3,   0x09
            //multiply everything
            mulps   xmm0,   xmm1
            mulps   xmm2,   xmm3
            //subtract
            subps   xmm0,   xmm2
            mov     esi,    this
            movaps  [esi],  xmm0
        }
    }
}


Lastly, have you tried intrinsics? You should, because most of the time the compiler generates code with much more effective memory management!
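
For reference, here is a rough sketch of what the same cross product could look like with intrinsics instead of inline assembly. The names Vec4, CrossIntrin and HAVE_SSE are made up for this example, and it assumes a vector padded to four floats and aligned to 16 bytes. On targets without SSE it falls back to the scalar formula.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1   // hypothetical guard macro, only for this sketch
#else
#define HAVE_SSE 0
#endif

// Hypothetical 16-byte vector type; w is padding so a full SSE register
// can be loaded and stored safely.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Cross product a x b using the identity
//   a.yzx * b.zxy - a.zxy * b.yzx
inline Vec4 CrossIntrin(const Vec4& a, const Vec4& b) {
    Vec4 r;
#if HAVE_SSE
    __m128 va = _mm_load_ps(&a.x);   // aligned loads, like MOVAPS
    __m128 vb = _mm_load_ps(&b.x);
    // _MM_SHUFFLE(3, 0, 2, 1) picks (y, z, x, w); _MM_SHUFFLE(3, 1, 0, 2) picks (z, x, y, w)
    __m128 a_yzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_zxy = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 1, 0, 2));
    __m128 a_zxy = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 1, 0, 2));
    __m128 b_yzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 res = _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
    _mm_store_ps(&r.x, res);         // aligned store
#else
    // Scalar fallback for targets without SSE
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    r.w = 0.0f;
#endif
    return r;
}
```

Because this is an ordinary inline function, the compiler is free to inline it and keep the operands in registers across calls, which it cannot do with an __asm block.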

I hope my poor English is readable!

- Max
What exactly does it mean that vectors are aligned?
Here's a good link about alignment.
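
In short, a 16-byte aligned vector is one whose address is a multiple of 16. MOVAPS requires that and faults on anything else, while MOVUPS accepts any address. A small sketch of how you could check it; AlignedVec and IsAligned16 are made-up names, and alignas(16) is the standard C++11 spelling used here in place of __declspec(align(16)):

```cpp
#include <cstdint>

// Hypothetical vector type; alignas(16) asks the compiler to place every
// instance at an address that is a multiple of 16.
struct alignas(16) AlignedVec {
    float x, y, z, w;
};

// Returns true if p is a multiple of 16, i.e. safe for aligned SSE loads.
inline bool IsAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

For a local `AlignedVec v;`, `IsAligned16(&v)` holds, while `IsAligned16(&v.y)` does not, because y sits 4 bytes past the start of the struct.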
Also note that the optimizer will be able to get at that inline C++ function and optimize it alongside the calling code, which could for example allow it to avoid reading and writing the variables to memory, or even optimize away the whole thing if it knows the parameter values at compile time. However, it won't touch your assembly version.

For that reason short functions written in assembly can be significantly slower than the equivalent written in C++. Intrinsics can help somewhat, but the compiler doesn't do a very good job optimizing those either.

Posting the code you're testing the performance of might be helpful.

For the biggest speedup I'd recommend rewriting whole loops in assembly, and not small individual functions.
There isn't any special test code, I just call these funcs in a for loop. I also tested them in a program I wrote, a cloth simulation (a lot of vector operations), and the SSE func slows it down.

By the way, when I loop the C++ func it seems somehow "too fast", as if the compiler knows that the operands never change and skips the whole loop?!?

What do you mean by "rewriting whole loops in assembly", to write assembly directly in the loop instead of calling the func.?

[Edited by - butthead_82 on May 16, 2008 5:16:16 AM]
The compiler can optimize away whole loops if it realizes the code does nothing. It can also move code to the outside of the loop where it does the same thing on every iteration. To prevent that make sure you actually compute a result and output it after the loop. Examining the disassembly of the generated code can be useful to see what the compiler is actually doing.
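
A minimal sketch of a test loop built that way, assuming a plain three-float vector. Vec3, RunCrossLoop and TimeCrossLoopMicros are made-up names for this example. Each result is fed back in as an input, so no iteration is loop-invariant, and the checksum is printed after the loop so the work cannot simply be discarded.

```cpp
#include <chrono>
#include <cstdio>

struct Vec3 { float i, j, k; };  // hypothetical stand-in for Vektor

inline void CrossVec(Vec3& result, const Vec3& op1, const Vec3& op2) {
    result.i = op1.j * op2.k - op2.j * op1.k;
    result.j = op2.i * op1.k - op1.i * op2.k;
    result.k = op1.i * op2.j - op2.i * op1.j;
}

// Runs `iterations` cross products. The operands change every iteration
// (x, y, z axes cycle), and every result contributes to the checksum.
inline float RunCrossLoop(int iterations) {
    Vec3 a{1.0f, 0.0f, 0.0f};
    Vec3 b{0.0f, 1.0f, 0.0f};
    Vec3 r{0.0f, 0.0f, 0.0f};
    float checksum = 0.0f;
    for (int n = 0; n < iterations; ++n) {
        CrossVec(r, a, b);
        checksum += r.i + r.j + r.k;
        a = b;   // feed the outputs back in as the next inputs
        b = r;
    }
    return checksum;
}

// Times the loop and prints the checksum after it, so the result is used.
inline long long TimeCrossLoopMicros(int iterations) {
    auto start = std::chrono::steady_clock::now();
    float checksum = RunCrossLoop(iterations);
    auto stop = std::chrono::steady_clock::now();
    std::printf("checksum = %f\n", checksum);
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Even with this, it is worth looking at the disassembly, since an aggressive optimizer may still simplify more than you expect.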

Some compilers (the Intel one springs to mind) can also make use of SSE automatically in some cases.

Quote:What do you mean by "rewriting whole loops in assembly", to write assembly directly in the loop instead of calling the func?


What I mean is that instead of writing a loop in C++ that calls individual assembly functions, you rewrite the whole loop in assembly, which you should be able to optimize much better. For example, you can move one-off setup code outside of the loop and leave intermediate results in registers rather than writing them to memory and reading them back in again.

However doing it that way is obviously more work for the programmer, so you should only do that after profiling tells you where to optimize.
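
To show the shape of that idea without committing to a particular assembler syntax, here is a sketch in plain C++ rather than assembly. Vec3 and SumCrossWithAxis are made-up names, and whether the locals really end up in registers is up to the compiler. The fixed operand is read once before the loop, and the running sums stay in locals and are written out only once at the end.

```cpp
#include <cstddef>

struct Vec3 { float i, j, k; };  // hypothetical stand-in for Vektor

// Sums cross(axis, v[n]) over the whole array in one function, instead of
// calling a cross-product function and an add function per element.
inline Vec3 SumCrossWithAxis(const Vec3& axis, const Vec3* v, std::size_t count) {
    // One-off setup, hoisted out of the loop: read the fixed operand once.
    const float ai = axis.i, aj = axis.j, ak = axis.k;
    // Intermediate results live in locals for the duration of the loop.
    float si = 0.0f, sj = 0.0f, sk = 0.0f;
    for (std::size_t n = 0; n < count; ++n) {
        si += aj * v[n].k - ak * v[n].j;
        sj += ak * v[n].i - ai * v[n].k;
        sk += ai * v[n].j - aj * v[n].i;
    }
    // Single write of the final result.
    return Vec3{si, sj, sk};
}
```

A hand-written assembly or intrinsics version of the loop would follow the same structure: load the constants before the loop, accumulate in registers, store once after it.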
Thanx for info.

I aligned the vector struct as MadMax said but compiler reports error when using MOVAPS. How come?


inline void SumVec ( Vektor& rezultat, Vektor& op1, Vektor& op2)
{
    __asm
    {
        MOV EAX, op1            // Load pointers into CPU regs
        MOV EBX, op2
        MOV ECX, rezultat
        MOVAPS XMM0, [EAX]      // Move aligned vectors to SSE regs
        MOVAPS XMM1, [EBX]
        ADDPS XMM0, XMM1        // Add vector elements
        MOVAPS [ECX], XMM0      // Save the return vector
    }
}


[Edited by - butthead_82 on May 16, 2008 8:50:17 AM]
Have you made a union struct?

I don't know if it makes a difference, but my vector class looks like this:

class __declspec(align(16)) ShtrVector3 {
public:
    //vector values
    union {
        struct {
            float x;
            float y;
            float z;
        };
        float vc[3];
    };
};


Try it... I don't know if it makes a difference
Quote:Original post by butthead_82
I aligned the vector struct as MadMax said but compiler reports error when using MOVAPS. How come?


Let's play the read-your-mind game.

Syntax error?
Missing operand error?
No keyboard error?


hum.. so many things to choose from

