SSE - help plz
Hi
I have two versions of the same function. One of them is written with SSE instructions and should be faster than the other, but it isn't. Could someone take a look and give me some advice on why that is, please?
I would also appreciate it if someone could point me to some good books and sites about SSE.
Here are the functions (the SSE assembly code is pasted from a book and looks fine to me):
inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2){
result.i = op1.j*op2.k - op2.j*op1.k;
result.j = op2.i*op1.k - op1.i*op2.k;
result.k = op1.i*op2.j - op2.i*op1.j;
}
inline void CrossVec ( Vektor& result, Vektor& op1, Vektor& op2){
__asm
{
MOV EAX, op1 // Load pointers into CPU regs
MOV EBX, op2
MOV ECX, result
MOVUPS XMM0, [EAX] // Move unaligned vectors to SSE regs
MOVUPS XMM1, [EBX]
MOVAPS XMM2, XMM0 // Make a copy of vector A
MOVAPS XMM3, XMM1 // Make a copy of vector B
SHUFPS XMM0, XMM0, 0xD8 // 11 01 10 00 Flip the middle elements of A
SHUFPS XMM1, XMM1, 0xE1 // 11 10 00 01 Flip first two elements of B
MULPS XMM0, XMM1 // Multiply the modified register vectors
SHUFPS XMM2, XMM2, 0xE1 // 11 10 00 01 Flip first two elements of the A copy
SHUFPS XMM3, XMM3, 0xD8 // 11 01 10 00 Flip the middle elements of the B copy
MULPS XMM2, XMM3 // Multiply the modified register vectors
SUBPS XMM0, XMM2 // Subtract the two resulting register vectors
MOVUPS [ECX], XMM0 // Save the return vector
}
}
The problem is the MOVUPS: it's moving the unaligned vectors into the special SSE registers. You could try defining your vectors like this:
class __declspec(align(16)) Vector3{... etc.
That'll make the code much faster. Secondly: try to group instructions as much as possible. This is my code for my vector cross product (performance is tested):
inline void ShtrVector3::Cross(const ShtrVector3 &vc1, const ShtrVector3 &vc2)
{
    // 12-15% faster with SSE
    if (!g_bSSE)
    {
        x = vc1.y * vc2.z - vc1.z * vc2.y;
        y = vc1.z * vc2.x - vc1.x * vc2.z;
        z = vc1.x * vc2.y - vc1.y * vc2.x;
    }
    else
    {
        __asm
        {
            // move everything
            mov esi, vc1
            mov edi, vc2
            movaps xmm0, [esi]
            movaps xmm1, [edi]
            movaps xmm2, xmm0
            movaps xmm3, xmm1
            // shuffle everything
            shufps xmm0, xmm0, 0x09
            shufps xmm1, xmm1, 0x12
            shufps xmm2, xmm2, 0x12
            shufps xmm3, xmm3, 0x09
            // multiply everything
            mulps xmm0, xmm1
            mulps xmm2, xmm3
            // subtract
            subps xmm0, xmm2
            mov esi, this
            movaps [esi], xmm0
        }
    }
}
Lastly, have you tried intrinsics? You should, because the compiler usually generates code with much more effective memory management!
I hope my poor English is readable!
- Max
Also note that the optimizer will be able to get at that inline C++ function and optimize it alongside the calling code, which could for example allow it to avoid reading and writing the variables to memory, or even optimize away the whole thing if it knows the parameter values at compile time. However, it won't touch your assembly version.
For that reason short functions written in assembly can be significantly slower than the equivalent written in C++. Intrinsics can help somewhat, but the compiler doesn't do a very good job optimizing those either.
Posting the code you're testing the performance of might be helpful.
For the biggest speedup I'd recommend rewriting whole loops in assembly, and not small individual functions.
There is no special test code; I just call these functions in a for loop. I also tested them in a program I wrote, a cloth simulation (lots of vector operations), and the SSE function slows it down.
By the way, when I loop the C++ function it seems somehow "too fast", as if the compiler knows that there were no changes to the operands and skips the whole loop?!
What do you mean by "rewriting whole loops in assembly": to write the assembly directly in the loop instead of calling the function?
[Edited by - butthead_82 on May 16, 2008 5:16:16 AM]
The compiler can optimize away whole loops if it realizes the code does nothing. It can also move code outside the loop when it does the same thing on every iteration. To prevent that, make sure you actually compute a result and output it after the loop. Examining the disassembly of the generated code can be useful to see what the compiler is actually doing.
Some compilers (the Intel one springs to mind) can also make use of SSE automatically in some cases.
Quote:What do you mean by "rewriting whole loops in assembly", to write assembly directly in the loop instead of calling the func?
What I mean is that instead of writing a loop in C++ where you call individual assembly functions, if you rewrite the whole loop in assembly you should be able to optimize it much better. For example, you can move one-off setup code outside of the loop, and leave intermediate results in registers rather than writing them to memory and reading them back in again.
However, doing it that way is obviously more work for the programmer, so you should only do it after profiling tells you where to optimize.
Thanks for the info.
I aligned the vector struct as MadMax said, but the compiler reports an error when using MOVAPS. How come?
inline void SumVec ( Vektor& rezultat, Vektor& op1, Vektor& op2)
{
    __asm
    {
        MOV EAX, op1        // Load pointers into CPU regs
        MOV EBX, op2
        MOV ECX, rezultat
        MOVAPS XMM0, [EAX]  // Move aligned vectors to SSE regs
        MOVAPS XMM1, [EBX]
        ADDPS XMM0, XMM1    // Add vector elements
        MOVAPS [ECX], XMM0  // Save the return vector
    }
}
[Edited by - butthead_82 on May 16, 2008 8:50:17 AM]
Have you made a union struct?
I don't know if it makes a difference, but my vector class looks like this:
class __declspec(align(16)) ShtrVector3
{
public:
    // vector values
    union
    {
        struct
        {
            float x;
            float y;
            float z;
        };
        float vc[3];
    };
};
Try it... I don't know if it makes a difference
Quote:Original post by butthead_82
I aligned the vector struct as MadMax said, but the compiler reports an error when using MOVAPS. How come?
Let's play the read-your-mind game.
Syntax error?
Missing operand error?
No keyboard error?
Hmm... so many things to choose from.
This topic is closed to new replies.