butthead_82

SSE - help plz

Hi, I have two versions of the same function. One of them is written with SSE instructions and should be faster than the other, but it isn't. Could someone take a look and tell me why that is, please? I would also appreciate it if someone could direct me to some good books and sites about SSE. Here are the functions (the SSE assembly code is pasted from a book and looks fine):

inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2){
    result.i = op1.j*op2.k - op2.j*op1.k;
    result.j = op2.i*op1.k - op1.i*op2.k;
    result.k = op1.i*op2.j - op2.i*op1.j;
}

inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2){
    __asm {
        MOV EAX, op1 // Load pointers into CPU regs
        MOV EBX, op2
        MOV ECX, result
        MOVUPS XMM0, [EAX] // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]
        MOVAPS XMM2, XMM0 // Make a copy of vector A
        MOVAPS XMM3, XMM1 // Make a copy of vector B
        SHUFPS XMM0, XMM0, 0xD8 // 11 01 10 00 Flip the middle elements of A
        SHUFPS XMM1, XMM1, 0xE1 // 11 10 00 01 Flip first two elements of B
        MULPS XMM0, XMM1 // Multiply the modified register vectors
        SHUFPS XMM2, XMM2, 0xE1 // 11 10 00 01 Flip first two elements of the A copy
        SHUFPS XMM3, XMM3, 0xD8 // 11 01 10 00 Flip the middle elements of the B copy
        MULPS XMM2, XMM3 // Multiply the modified register vectors
        SUBPS XMM0, XMM2 // Subtract the two resulting register vectors
        MOVUPS [ECX], XMM0 // Save the return vector
    }
}

The code is moving unaligned vectors into the SSE registers. You could try defining your vectors like this:

class __declspec(align(16)) Vector3
{
... etc.


That'll make the code much faster. Secondly, try to group similar instructions together as much as possible. This is my code for my vector cross product (performance is tested):

inline void ShtrVector3::Cross(const ShtrVector3 &vc1, const ShtrVector3 &vc2)
{
    //12-15% faster with SSE
    if (!g_bSSE) {
        x = vc1.y * vc2.z - vc1.z * vc2.y;
        y = vc1.z * vc2.x - vc1.x * vc2.z;
        z = vc1.x * vc2.y - vc1.y * vc2.x;
    } else {
        __asm {
            //move everything
            mov esi, vc1
            mov edi, vc2
            movaps xmm0, [esi]
            movaps xmm1, [edi]
            movaps xmm2, xmm0
            movaps xmm3, xmm1

            //shuffle everything
            shufps xmm0, xmm0, 0x09
            shufps xmm1, xmm1, 0x12
            shufps xmm2, xmm2, 0x12
            shufps xmm3, xmm3, 0x09

            //multiply everything
            mulps xmm0, xmm1
            mulps xmm2, xmm3

            //subtract
            subps xmm0, xmm2
            mov esi, this
            movaps [esi], xmm0
        }
    }
}


Lastly, have you tried intrinsics? You should, because most of the time the compiler generates code with much more effective memory management!

I hope my poor English is readable!

- Max
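
For anyone who wants to try the intrinsics route, here is a minimal sketch of the same cross product written with SSE intrinsics. It is not Max's code: the `Vec4` struct and the `CrossSSE` name are made up for illustration, the fourth float is assumed to be padding, and it only builds on x86 / x86-64 compilers that ship `<xmmintrin.h>`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical 16-byte-aligned vector; the fourth float is only padding.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Cross product of the x/y/z lanes, using the shuffle-multiply-subtract
// idea from the assembly versions in this thread.
inline Vec4 CrossSSE(const Vec4& a, const Vec4& b) {
    __m128 va = _mm_load_ps(&a.x);  // aligned load, the intrinsic form of MOVAPS
    __m128 vb = _mm_load_ps(&b.x);

    // _MM_SHUFFLE(3, 0, 2, 1) yields (y, z, x, w);
    // _MM_SHUFFLE(3, 1, 0, 2) yields (z, x, y, w).
    __m128 a_yzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_zxy = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 1, 0, 2));
    __m128 a_zxy = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 1, 0, 2));
    __m128 b_yzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));

    // (ay*bz - az*by, az*bx - ax*bz, ax*by - ay*bx, aw*bw - aw*bw)
    __m128 r = _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy),
                          _mm_mul_ps(a_zxy, b_yzx));

    Vec4 out;
    _mm_store_ps(&out.x, r);  // aligned store
    return out;
}
```

Because this is ordinary C++ as far as the compiler is concerned, it can be inlined into the caller, which inline `__asm` blocks generally cannot.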

Also note that the optimizer will be able to get at that inline C++ function and optimize it alongside the calling code, which could for example allow it to avoid reading and writing the variables to memory, or even optimize away the whole thing if it knows the parameter values at compile time. However, it won't touch your assembly version.

For that reason short functions written in assembly can be significantly slower than the equivalent written in C++. Intrinsics can help somewhat, but the compiler doesn't do a very good job optimizing those either.

Posting the code you're testing the performance of might be helpful.

For the biggest speedup I'd recommend rewriting whole loops in assembly, and not small individual functions.

There isn't any special test code; I just call these functions in a for loop. I also tested them in a program I wrote, a cloth simulation (a lot of vector operations), and the SSE function slows it down.

By the way, when I loop the C++ function it seems somehow "too fast", as if the compiler knows that the operands never change and skips the whole loop?!

What do you mean by "rewriting whole loops in assembly"? Writing the assembly directly in the loop instead of calling the function?

[Edited by - butthead_82 on May 16, 2008 5:16:16 AM]

The compiler can optimize away whole loops if it realizes the code does nothing. It can also hoist code out of the loop when it does the same thing on every iteration. To prevent that, make sure you actually compute a result and output it after the loop. Examining the disassembly of the generated code can be useful to see what the compiler is actually doing.
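
A rough sketch of what such a test loop could look like is below. The `Vektor` layout, the `RunCrossLoop` and `TimeCrossLoopMs` names, and the feedback constant are assumptions made up for this example, not the original poster's test code. Each iteration feeds part of the result back into an operand, and the caller gets a checksum to print, so the work cannot simply be thrown away.

```cpp
#include <chrono>

// Assumed layout: three components plus one float of padding.
struct Vektor { float i, j, k, w; };

inline void CrossVec(Vektor& result, const Vektor& op1, const Vektor& op2) {
    result.i = op1.j * op2.k - op2.j * op1.k;
    result.j = op2.i * op1.k - op1.i * op2.k;
    result.k = op1.i * op2.j - op2.i * op1.j;
}

// Runs n dependent iterations and returns a checksum of the last result.
float RunCrossLoop(int n) {
    Vektor a{1.0f, 2.0f, 3.0f, 0.0f};
    Vektor b{0.5f, -1.0f, 2.0f, 0.0f};
    Vektor r{};
    for (int it = 0; it < n; ++it) {
        CrossVec(r, a, b);
        // Feed the result back so every iteration sees different input
        // and the call cannot be hoisted out of the loop.
        a.i += r.k * 1e-6f;
    }
    return r.i + r.j + r.k;
}

// Times the loop in milliseconds; the checksum must be used by the caller
// (for example printed) so the loop is not removed as dead code.
double TimeCrossLoopMs(int n, float& checksum) {
    auto t0 = std::chrono::steady_clock::now();
    checksum = RunCrossLoop(n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Printing both the time and the checksum after the call is usually enough, but checking the disassembly is still the only way to be sure what was actually measured.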

Some compilers (the Intel one springs to mind) can also make use of SSE automatically in some cases.

Quote:
What do you mean by "rewriting whole loops in assembly"? Writing the assembly directly in the loop instead of calling the function?


What I mean is that instead of writing a loop in C++ that calls individual assembly functions, you rewrite the whole loop in assembly, which should let you optimize it much better. For example, you can move one-off setup code outside of the loop and leave intermediate results in registers rather than writing them to memory and reading them back in again.
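
To illustrate the idea without a full assembly listing, here is a small sketch that uses SSE intrinsics as a stand-in for hand-written assembly. The `SumScaledVectors` function and its parameters are hypothetical, and both pointers are assumed to be 16-byte aligned. The scale factor is set up once before the loop, and the accumulator is a local `__m128` that the compiler can keep in a register until the single store at the end.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Sums `count` packed 4-float vectors from `data`, each scaled by `s`,
// and writes the 4-float total to `out`.
// Assumption: `data` and `out` are both 16-byte aligned.
void SumScaledVectors(const float* data, std::size_t count, float s, float* out) {
    const __m128 scale = _mm_set1_ps(s);  // one-off setup, done outside the loop
    __m128 acc = _mm_setzero_ps();        // running total, no memory traffic per step

    for (std::size_t n = 0; n < count; ++n) {
        __m128 v = _mm_load_ps(data + 4 * n);        // one aligned load per vector
        acc = _mm_add_ps(acc, _mm_mul_ps(v, scale)); // intermediate stays in acc
    }

    _mm_store_ps(out, acc);  // single store after the loop
}
```

Calling a separate assembly function per element would instead load the operands and store the result on every iteration.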

However, doing it that way is obviously more work for the programmer, so you should only do it after profiling tells you where to optimize.

Thanks for the info.

I aligned the vector struct as MadMax said but compiler reports error when using MOVAPS. How come?


inline void SumVec ( Vektor& rezultat, Vektor& op1, Vektor& op2) {

    __asm
    {
        MOV EAX, op1 // Load pointers into CPU regs
        MOV EBX, op2
        MOV ECX, rezultat

        MOVAPS XMM0, [EAX] // Move aligned vectors to SSE regs (addresses must be 16-byte aligned)
        MOVAPS XMM1, [EBX]

        ADDPS XMM0, XMM1 // Add vector elements
        MOVAPS [ECX], XMM0 // Save the return vector
    }

}


[Edited by - butthead_82 on May 16, 2008 8:50:17 AM]

Have you made a union struct?

I don't know if it makes a difference, but my vector class looks like this:

class __declspec(align(16)) ShtrVector3
{
public:
    //vector values
    union {
        struct {
            float x;
            float y;
            float z;
        };
        float vc[3];
    };
};


Try it... I don't know if it makes a difference

Quote:
Original post by butthead_82
I aligned the vector struct as MadMax said but compiler reports error when using MOVAPS. How come?


Let's play the read-your-mind game.

Syntax error?
Missing operand error?
No keyboard error?


hum.. so many things to choose from

Sorry, I didn't describe correctly what happened. It doesn't report errors at compile time; the program crashes when I run it. Compilation runs OK.

I didn't use unions, I just have four floats in the struct. I'll try as you suggested.

I don't want to bother everyone any further. Thanks, everyone, for the help.
What I really need is a good source to learn SSE from, so if somebody could direct me to one I would appreciate it.
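
A run-time crash on MOVAPS is what you would expect if one of the addresses is not a multiple of 16, since MOVAPS faults on unaligned memory operands. Whether that is the cause here is only a guess. A small check like the sketch below can help narrow it down; the `Vektor` layout and the `IsAligned16` helper are hypothetical, written for this example.

```cpp
#include <cstdint>

// Assumed layout: four floats, with 16-byte alignment requested from the compiler.
struct alignas(16) Vektor {
    float i, j, k, w;
};

// Returns true if the pointer is on a 16-byte boundary, which is what
// MOVAPS requires for its memory operand.
inline bool IsAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

Asserting `IsAligned16(&op1)` and so on at the top of `SumVec` in a debug build would show which operand is misaligned, if any. Objects that reach the function through a cast from a raw float buffer, or from an allocator that does not honor the requested alignment, are the usual suspects.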

Snippet from my blog:

So I asked the friendly community over at gamedev.net if they knew of any assembly tutorials (mainly concerning SSE and MMX)... SSE isn't all that hard; it's pretty easy. If you really want to start learning it, read through these tutorials very quickly:

* http://www.neilkemp.us/v3/tutorials/SSE_Tutorial_1.html
* http://www.3dbuzz.com/vbforum/showthread.php?t=104753

And then, use this guide as a reference to available instructions:

* http://www.intel80386.com/simd/mmx2-doc.html
