# SSE - help plz

This topic is 3497 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hi, I have two versions of the same function. One of them is written with SSE instructions and should be faster than the other, but it isn't. Could someone take a look and give me some advice on why that is, please? I would also appreciate it if someone could direct me to some good books and sites about SSE. Here are the functions (the SSE assembly code is pasted from a book and looks fine).

The plain C++ version:

    inline void CrossVec(Vektor& result, Vektor& op1, Vektor& op2)
    {
        result.i = op1.j*op2.k - op2.j*op1.k;
        result.j = op2.i*op1.k - op1.i*op2.k;
        result.k = op1.i*op2.j - op2.i*op1.j;
    }

The SSE version:

    inline void CrossVec(Vektor& result, Vektor& op1, Vektor& op2)
    {
        __asm
        {
            MOV EAX, op1              // Load pointers into CPU regs
            MOV EBX, op2
            MOV ECX, result
            MOVUPS XMM0, [EAX]        // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]
            MOVAPS XMM2, XMM0         // Make a copy of vector A
            MOVAPS XMM3, XMM1         // Make a copy of vector B
            SHUFPS XMM0, XMM0, 0xD8   // 11 01 10 00 Flip the middle elements of A
            SHUFPS XMM1, XMM1, 0xE1   // 11 10 00 01 Flip first two elements of B
            MULPS XMM0, XMM1          // Multiply the modified register vectors
            SHUFPS XMM2, XMM2, 0xE1   // 11 10 00 01 Flip first two elements of the A copy
            SHUFPS XMM3, XMM3, 0xD8   // 11 01 10 00 Flip the middle elements of the B copy
            MULPS XMM2, XMM3          // Multiply the modified register vectors
            SUBPS XMM0, XMM2          // Subtract the two resulting register vectors
            MOVUPS [ECX], XMM0        // Save the return vector
        }
    }

##### Share on other sites
It's moving the unaligned vectors into the SSE registers. You could try defining your vectors like this:

    class __declspec(align(16)) Vector3{... etc.

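That'll make the code much faster, because the aligned move (MOVAPS) can then be used instead of MOVUPS. Secondly, try to group instructions of the same kind together as much as possible. This is my code for my vector cross product (its performance has been tested):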

    inline void ShtrVector3::Cross(const ShtrVector3 &vc1, const ShtrVector3 &vc2)
    {
        // 12-15% faster with SSE
        if (!g_bSSE) {
            x = vc1.y * vc2.z - vc1.z * vc2.y;
            y = vc1.z * vc2.x - vc1.x * vc2.z;
            z = vc1.x * vc2.y - vc1.y * vc2.x;
        } else {
            __asm {
                // move everything
                mov     esi,  vc1
                mov     edi,  vc2
                movaps  xmm0, [esi]
                movaps  xmm1, [edi]
                movaps  xmm2, xmm0
                movaps  xmm3, xmm1
                // shuffle everything
                shufps  xmm0, xmm0, 0x09
                shufps  xmm1, xmm1, 0x12
                shufps  xmm2, xmm2, 0x12
                shufps  xmm3, xmm3, 0x09
                // multiply everything
                mulps   xmm0, xmm1
                mulps   xmm2, xmm3
                // subtract
                subps   xmm0, xmm2
                mov     esi,  this
                movaps  [esi], xmm0
            }
        }
    }

Lastly, have you tried intrinsics? You should, because most of the time the compiler produces code with much more effective memory management!
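As a rough illustration of the intrinsics route, here is a minimal sketch of the same cross product using the shuffle masks from the assembly above (0x09 and 0x12). `Vec4` and `crossSSE` are hypothetical names, not taken from the thread, and a four-float layout with an unused `w` lane is assumed. It targets x86 / x86-64 compilers that ship `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical 16-byte-aligned vector; w is an unused padding lane (assumption).
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Cross product of the x/y/z lanes of a and b.
inline Vec4 crossSSE(const Vec4& a, const Vec4& b) {
    // _mm_load_ps is the aligned load (MOVAPS); misaligned input would fault.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);

    const __m128 va = _mm_load_ps(&a.x);
    const __m128 vb = _mm_load_ps(&b.x);

    // Mask 0x09 selects lanes (y, z, x, x); mask 0x12 selects (z, x, y, x).
    const __m128 a_yzx = _mm_shuffle_ps(va, va, 0x09);
    const __m128 b_zxy = _mm_shuffle_ps(vb, vb, 0x12);
    const __m128 a_zxy = _mm_shuffle_ps(va, va, 0x12);
    const __m128 b_yzx = _mm_shuffle_ps(vb, vb, 0x09);

    // (a.y*b.z, a.z*b.x, a.x*b.y) - (a.z*b.y, a.x*b.z, a.y*b.x)
    const __m128 r = _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy),
                                _mm_mul_ps(a_zxy, b_yzx));

    Vec4 out;
    _mm_store_ps(&out.x, r);  // aligned store
    return out;
}
```

Because this is an ordinary inline function, the compiler is free to inline it and keep the operands in registers across calls, which it cannot do with an `__asm` block.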

I hope my poor English is readable!

- Max

##### Share on other sites
What exactly does it mean for vectors to be aligned?
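For reference, "aligned" here means that the vector's starting address is a multiple of 16 bytes, which is what the aligned SSE moves such as MOVAPS require. A minimal sketch of how this can be requested and checked follows. `Vec4`, `isAligned` and `requireAligned16` are hypothetical names, and standard C++11 `alignas(16)` is used in place of MSVC's `__declspec(align(16))`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical four-float vector (not the thread's Vektor or ShtrVector3).
// alignas(16) is the standard C++11 counterpart of MSVC's __declspec(align(16)).
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Returns true when the address p is a multiple of `boundary` bytes.
inline bool isAligned(const void* p, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}

// Debug-build guard one could place in front of an aligned load or store.
inline void requireAligned16(const void* p) {
    assert(isAligned(p, 16));
}
```

A local or global `Vec4` declared this way starts on a 16-byte boundary. Memory obtained some other way (for example, from an allocator that only guarantees 8-byte alignment) may not, so a check like `isAligned(ptr, 16)` can help track down crashes in aligned moves.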

##### Share on other sites
Also note that the optimizer will be able to get at that inline C++ function and optimize it alongside the calling code, which could for example allow it to avoid reading and writing the variables to memory, or even optimize away the whole thing if it knows the parameter values at compile time. However, it won't touch your assembly version.

For that reason short functions written in assembly can be significantly slower than the equivalent written in C++. Intrinsics can help somewhat, but the compiler doesn't do a very good job optimizing those either.

Posting the code you're testing the performance of might be helpful.

For the biggest speedup I'd recommend rewriting whole loops in assembly, and not small individual functions.

##### Share on other sites
There isn't any special test code; I just call these functions in a for loop. I also tested them in a program I wrote, a cloth simulation (a lot of vector operations), and the SSE function slows it down.

By the way, when I loop the C++ function it seems somehow "too fast", as if the compiler knows that the operands never change and skips the whole loop?!

What do you mean by "rewriting whole loops in assembly"? Writing the assembly directly in the loop instead of calling the function?

[Edited by - butthead_82 on May 16, 2008 5:16:16 AM]

##### Share on other sites
The compiler can optimize away whole loops if it realizes the code does nothing. It can also move code to the outside of the loop where it does the same thing on every iteration. To prevent that make sure you actually compute a result and output it after the loop. Examining the disassembly of the generated code can be useful to see what the compiler is actually doing.
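A minimal sketch of such a test loop, under some assumptions: `benchmarkCross` is a hypothetical helper, the `Vektor` layout (four floats) is guessed from the thread, and the timing uses `std::chrono`. One input changes on every iteration so the call cannot be hoisted out of the loop, and every result feeds a checksum that the caller is expected to print or otherwise use.

```cpp
#include <cassert>
#include <chrono>

// Assumed layout of the thread's Vektor: four floats, the last one unused.
struct Vektor {
    float i, j, k, w;
};

// Scalar cross product as posted at the top of the thread.
inline void CrossVec(Vektor& result, const Vektor& op1, const Vektor& op2) {
    result.i = op1.j * op2.k - op2.j * op1.k;
    result.j = op2.i * op1.k - op1.i * op2.k;
    result.k = op1.i * op2.j - op2.i * op1.j;
}

// Runs `iterations` cross products and returns a checksum of all results.
// The elapsed wall-clock time in milliseconds is written to *elapsedMs.
inline float benchmarkCross(int iterations, double* elapsedMs) {
    assert(iterations >= 0 && elapsedMs != nullptr);

    Vektor a{1.0f, 2.0f, 3.0f, 0.0f};
    Vektor b{4.0f, 5.0f, 6.0f, 0.0f};
    Vektor r{};
    float checksum = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iterations; ++n) {
        a.i = static_cast<float>(n & 7);  // input differs between iterations
        CrossVec(r, a, b);
        checksum += r.i + r.j + r.k;      // result is consumed, not discarded
    }
    const auto stop = std::chrono::steady_clock::now();

    *elapsedMs = std::chrono::duration<double, std::milli>(stop - start).count();
    return checksum;
}
```

Printing the returned checksum after the loop (for example with `printf`) gives the program an observable dependency on the computed values, so the optimizer cannot drop the work as dead code.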

Some compilers (the Intel one springs to mind) can also make use of SSE automatically in some cases.

Quote:
 What do you mean by "rewriting whole loops in assembly", to write assembly directly in the loop instead of calling the func?

What I mean is that if, instead of writing a loop in C++ that calls individual assembly functions, you rewrite the whole loop in assembly, you should be able to optimize it much better. For example, you can move one-off setup code outside of the loop, and leave intermediate results in registers rather than writing them to memory and reading them back in again.

However doing it that way is obviously more work for the programmer, so you should only do that after profiling tells you where to optimize.
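To illustrate the idea of hoisting setup out of the loop and keeping intermediate results in a register, here is a minimal sketch that uses SSE intrinsics rather than hand-written assembly (a substitution made for readability, not what was described above). `Vec4` and `sumAll` are hypothetical names; the code assumes an x86 / x86-64 compiler with `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical 16-byte-aligned four-float vector.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Adds up `count` vectors. The running total lives in one SSE register for
// the whole loop and is written to memory only once, after the loop.
inline Vec4 sumAll(const Vec4* v, std::size_t count) {
    assert(v != nullptr || count == 0);

    __m128 acc = _mm_setzero_ps();                    // one-off setup, outside the loop
    for (std::size_t n = 0; n < count; ++n) {
        acc = _mm_add_ps(acc, _mm_load_ps(&v[n].x));  // one aligned load per element
    }

    Vec4 out;
    _mm_store_ps(&out.x, acc);                        // single store at the end
    return out;
}
```

Compare this with calling a function like `SumVec` once per element from C++: each call loads both operands and stores the result, so the running total makes a round trip through memory on every iteration.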

##### Share on other sites
Thanks for the info.

I aligned the vector struct as MadMax said, but the compiler reports an error when using MOVAPS. How come?

    inline void SumVec(Vektor& rezultat, Vektor& op1, Vektor& op2)
    {
        __asm
        {
            MOV EAX, op1          // Load pointers into CPU regs
            MOV EBX, op2
            MOV ECX, rezultat
            MOVAPS XMM0, [EAX]    // Move aligned vectors to SSE regs
            MOVAPS XMM1, [EBX]
            ADDPS XMM0, XMM1      // Add vector elements
            MOVAPS [ECX], XMM0    // Save the return vector
        }
    }

[Edited by - butthead_82 on May 16, 2008 8:50:17 AM]

##### Share on other sites
Have you made a union struct?

I don't know if it makes a difference, but my vector class looks like this:

    class __declspec(align(16)) ShtrVector3
    {
    public:
        // vector values
        union {
            struct {
                float x;
                float y;
                float z;
            };
            float vc[3];
        };
    };

Try it... I don't know if it makes a difference

##### Share on other sites
Quote:
 Original post by butthead_82: I aligned the vector struct as MadMax said, but the compiler reports an error when using MOVAPS. How come?

Syntax error?
Missing operand error?
No keyboard error?

hum.. so many things to choose from

##### Share on other sites
Sorry, I didn't describe what happened correctly. Compilation runs OK with no errors; the program crashes when I run it.

I didn't use unions, I just have four floats in the struct. I'll try as you suggested.

I don't want to bother everyone any further. Thanks, everyone, for the help.
What I really need is a good source to learn SSE from, so if somebody could direct me to one I would appreciate it.

##### Share on other sites
snippet from my blog:

So I asked the friendly community over at gamedev.net if they knew of any assembly tutorials (mainly concerning SSE and MMX)... SSE isn't all that hard; it's pretty easy. If you really want to start learning it, read through these tutorials very quickly:

* http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html

And then, use this guide as a reference to available instructions:

* http://www.intel80386.com/simd/mmx2-doc.html
