optimizing my matrix/vector library using assembler


Hello. Why are my vectorized assembly language routines so much slower than VC++'s FPU computations? I am currently writing a matrix/vector engine in assembly and C++, and for whatever reason, the C++ version is at least 2.5 times faster than my vectorized version.
inline Matrix3 operator+ ( const Matrix3& mat )
{
    ALIGN16 Matrix3 matReturn;

    __asm
    {
        mov eax, dword ptr [mat]        // eax -> mat
        mov ecx, dword ptr [this]       // ecx -> this
        movaps xmm0, [ecx]              // elements 0-3 (movaps always moves 128 bits)
        movaps xmm1, [ecx + 0x10]       // elements 4-7
        addps xmm0, [eax]
        addps xmm1, [eax + 0x10]
        movaps matReturn, xmm0
        movaps matReturn + 0x10, xmm1
        // add the oddball ninth element using the FPU
        fld dword ptr [ecx + 0x20]
        fadd dword ptr [eax + 0x20]
        fstp dword ptr [matReturn + 0x20]
    }

    return matReturn;
}

inline Matrix3 operator+ ( const Matrix3& mat )
{
    Matrix3 matReturn;

    matReturn.m_fMatrix9[0] = mat.m_fMatrix9[0] + m_fMatrix9[0];
    matReturn.m_fMatrix9[1] = mat.m_fMatrix9[1] + m_fMatrix9[1];
    matReturn.m_fMatrix9[2] = mat.m_fMatrix9[2] + m_fMatrix9[2];
    matReturn.m_fMatrix9[3] = mat.m_fMatrix9[3] + m_fMatrix9[3];
    matReturn.m_fMatrix9[4] = mat.m_fMatrix9[4] + m_fMatrix9[4];
    matReturn.m_fMatrix9[5] = mat.m_fMatr9[5] + m_fMatrix9[5];
    matReturn.m_fMatrix9[6] = mat.m_fMatrix9[6] + m_fMatrix9[6];
    matReturn.m_fMatrix9[7] = mat.m_fMatrix9[7] + m_fMatrix9[7];
    matReturn.m_fMatrix9[8] = mat.m_fMatrix9[8] + m_fMatrix9[8];

    return matReturn;
}

Thanks for any assistance, exorcist_bob

There could be many reasons: the C++ compiler has an optimizer, the compiler places some invisible instructions around your assembly blocks, the compiler can't inline your function, etc.

For simple additions and assignments there is no reason to use assembly; it would be like replacing a += b with __asm { add a, b }. Also, I have heard that SIMD instructions carry a small performance penalty when you start using them, so they should only be used for more complex operations (Matrix4x4 * Matrix4x4, for instance).
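For instance, a 4x4 * 4x4 multiply is the kind of operation where the packed instructions can really pay off. A rough, untested sketch with intrinsics (the Mat44 type and its row-major float[16] layout are made up here for illustration):

#include <xmmintrin.h>

struct Mat44 { __declspec(align(16)) float m[16]; };  // row-major

inline Mat44 Mul( const Mat44& a, const Mat44& b )
{
    Mat44 r;
    __m128 b0 = _mm_load_ps(b.m +  0);   // rows of b
    __m128 b1 = _mm_load_ps(b.m +  4);
    __m128 b2 = _mm_load_ps(b.m +  8);
    __m128 b3 = _mm_load_ps(b.m + 12);
    for (int i = 0; i < 4; ++i)
    {
        // r.row[i] = a[i][0]*b0 + a[i][1]*b1 + a[i][2]*b2 + a[i][3]*b3
        __m128 row = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a.m[4*i + 0]), b0),
                       _mm_mul_ps(_mm_set1_ps(a.m[4*i + 1]), b1)),
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a.m[4*i + 2]), b2),
                       _mm_mul_ps(_mm_set1_ps(a.m[4*i + 3]), b3)));
        _mm_store_ps(r.m + 4*i, row);
    }
    return r;
}

Sixteen multiply-adds per output row amortize the load/store cost far better than a lone addition does.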

Short answer: because assembly isn't automatically faster than C++, and because you need to know *a lot* more about your CPU to write efficient assembly than you need to know to write good C++.

So, the obvious question: why are you using asm? Is it to get access to SSE instructions?
If so, you don't need asm for that. The compiler supplies intrinsics you can use to do SSE in your C++ code, allowing the compiler to properly schedule and optimize your code.
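For example, your Matrix3 addition could be written along these lines with intrinsics (untested sketch; it assumes Matrix3 itself is declared 16-byte aligned so m_fMatrix9 is an aligned float[9], making the aligned loads at +0 and +4 legal, with the ninth element handled by the scalar _ss forms):

#include <xmmintrin.h>

inline Matrix3 Matrix3::operator+ ( const Matrix3& mat )
{
    Matrix3 matReturn;
    __m128 a0 = _mm_load_ps(m_fMatrix9);         // this, elements 0-3
    __m128 a1 = _mm_load_ps(m_fMatrix9 + 4);     // this, elements 4-7
    __m128 a2 = _mm_load_ss(m_fMatrix9 + 8);     // this, element 8
    _mm_store_ps(matReturn.m_fMatrix9,     _mm_add_ps(a0, _mm_load_ps(mat.m_fMatrix9)));
    _mm_store_ps(matReturn.m_fMatrix9 + 4, _mm_add_ps(a1, _mm_load_ps(mat.m_fMatrix9 + 4)));
    _mm_store_ss(matReturn.m_fMatrix9 + 8, _mm_add_ss(a2, _mm_load_ss(mat.m_fMatrix9 + 8)));
    return matReturn;
}

The compiler is then free to reorder the loads and adds however it sees fit.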

Or is it because you want to avoid overhead?
If so, what overhead? C++ doesn't automatically imply overhead. However, inline ASM sometimes does.
For simple code like this, I doubt there'll be any overhead in properly written C++ code.

Anyway, an obvious reason might be that in asm you have to do the optimization yourself. You have to understand instruction latency and CPU and cache architecture to write the most efficient code. In C++, you just say what you want done and trust the compiler to handle that stuff (which it's pretty damn good at).

Quote:
Original post by CTar
For simple additions and assignments there is no reason to use assembly; it would be like replacing a += b with __asm { add a, b }. Also, I have heard that SIMD instructions carry a small performance penalty when you start using them, so they should only be used for more complex operations (Matrix4x4 * Matrix4x4, for instance).


You can use SIMD for Vector4 + Vector4 operations as well; it's just not as big a win. In my testing it was about 1.5x faster than FPU code, as opposed to matrix operations, which were more than 3x faster. But yeah, there is some overhead, I guess, especially when using SIMD for Vector4 + scalar operations.
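A Vector4 + Vector4 is a single packed add however you write it, which is why there isn't much left for SIMD to win (sketch; a Vector4 holding a 16-byte-aligned float v[4] is assumed, not from the post):

#include <xmmintrin.h>

inline Vector4 Add( const Vector4& a, const Vector4& b )
{
    Vector4 r;
    // one addps covers all four components
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return r;
}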

I am basically writing these assembly routines to expand my knowledge of assembly, plus whatever speed increases follow. As far as overhead goes, viewing the disassembly shows a lot more overhead in the asm version than in the C++ version. To help with the overhead, should I try to use MASM and link it with the class somehow? If so, where can I get some info on how to do this?
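Something like this is what I have in mind: a separate .asm file assembled with ML and linked in, with an extern "C" prototype on the C++ side (untested sketch, names just illustrative):

// C++ side
extern "C" void Matrix3Add( float* pResult, const float* pA, const float* pB );

; matrix_asm.asm -- assemble with: ml /c /coff matrix_asm.asm
.686
.XMM
.MODEL FLAT, C
.CODE

; all three pointers are assumed 16-byte aligned
Matrix3Add PROC pResult:DWORD, pA:DWORD, pB:DWORD
    mov    edx, pA
    mov    ecx, pB
    mov    eax, pResult
    movaps xmm0, [edx]                ; elements 0-3
    movaps xmm1, [edx+16]             ; elements 4-7
    addps  xmm0, [ecx]
    addps  xmm1, [ecx+16]
    movaps [eax], xmm0
    movaps [eax+16], xmm1
    movss  xmm2, dword ptr [edx+32]   ; element 8
    addss  xmm2, dword ptr [ecx+32]
    movss  dword ptr [eax+32], xmm2
    ret
Matrix3Add ENDP
END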

PS It is actually inlining it [smile].

Benchmarks:
ASM using SSE: 64 Seconds
ASM using FPU: 59 Seconds
C++: 32 Seconds

These are based on 1,000,000,000 additions

Thanks for your help,
exorcist_bob


I've been experimenting a bit with assembly hacking myself, and I'm curious to know more about what you're going through...

What sort of code does the compiler generate? (And which compiler is it?) Is the compiler generating SSE code or FPU code?

Also, have you tried using movss and addss to do the final add, instead of using old-style FPU instructions?
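I.e., something like this in place of the fld/fadd/fstp triplet (untested):

// last element with SSE scalar ops instead of going through the x87 stack
movss xmm2, dword ptr [ecx + 0x20]
addss xmm2, dword ptr [eax + 0x20]
movss dword ptr [matReturn + 0x20], xmm2

That keeps the whole routine in the SSE domain instead of mixing SSE and x87.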

You should review this thread, where I went into some detail about the situation you're in.

The reality is that Microsoft has no intention of providing good asm support. GCC does, but you need to learn a strange and bizarre syntax. Visual C++ 7.1 absolutely sucks at generating code from intrinsics, but 2005 may show some improvement. And in the end, any effort in this direction is severely crippled by the fact that SSE is a horribly conceived instruction set.

Now, for a specific math-heavy algorithm that I can design from the start around the intent to optimize it with SSE assembly, I typically see a 2.5x speed-up over an FPU implementation. However, micro-optimizing small pieces of math routines will only benefit you if the compiler is brave enough to schedule properly. Which it's not. So it doesn't.

It is generating FPU code; the compiler is not smart enough to vectorize (MSVC++ 2005). Here is the complete disassembly of the inlined C++ Matrix3 operator+ overload:


fld dword ptr [esp+10h]
mov ecx,9
fadd st,st(6)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
sub eax,1
fstp dword ptr [esp+54h]
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+9Fh (40109Fh)
fstp st(5)



As you can see, not a ton of overhead. Now here is my version using SSE:

lea eax,[esp+0A0h]
lea ecx,[esp+40h]
mov dword ptr [esp+10h],eax
mov dword ptr [esp+14h],ecx
mov ebx,3B9ACA00h
mov eax,dword ptr [esp+10h]
mov ecx,dword ptr [esp+14h]
movaps xmm0,xmmword ptr [ecx]
movaps xmm1,xmmword ptr [ecx+10h]
addps xmm0,xmmword ptr [eax]
addps xmm1,xmmword ptr [eax+10h]
movaps xmmword ptr [esp+100h],xmm0
movaps xmmword ptr [esp+110h],xmm1
fld dword ptr [ecx+20h]
fadd dword ptr [eax+20h]
fstp dword ptr [esp+120h]
lea edx,[esp+100h]
push edx
lea ecx,[esp+74h]
call SkepWorks::Math::AsmFloat::Matrix3::Matrix3 (401050h)
sub ebx,1
mov ecx,0Ch
lea esi,[esp+70h]
lea edi,[esp+40h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+150h (4011C0h)



Straight from the disassembly output in Visual C++. The original source:


inline Matrix3 operator+ ( const Matrix3& mat )
{
    ALIGN16 Matrix3 matReturn;

    __asm
    {
        mov eax, dword ptr [mat]
        mov ecx, dword ptr [this]
        movaps xmm0, [ecx]              // elements 0-3
        movaps xmm1, [ecx + 0x10]       // elements 4-7
        addps xmm0, [eax]
        addps xmm1, [eax + 0x10]
        movaps matReturn, xmm0
        movaps matReturn + 0x10, xmm1
        // add the oddball ninth element using the FPU
        fld dword ptr [ecx + 0x20]
        fadd dword ptr [eax + 0x20]
        fstp dword ptr [matReturn + 0x20]

        /*
        // all-FPU version, element by element:
        mov eax, dword ptr [mat]
        mov ecx, dword ptr [this]
        fld dword ptr [ecx]
        fadd dword ptr [eax]
        fstp dword ptr [matReturn]
        fld dword ptr [ecx + 0x4]
        fadd dword ptr [eax + 0x4]
        fstp dword ptr [matReturn + 0x4]
        fld dword ptr [ecx + 0x8]
        fadd dword ptr [eax + 0x8]
        fstp dword ptr [matReturn + 0x8]
        fld dword ptr [ecx + 0xC]
        fadd dword ptr [eax + 0xC]
        fstp dword ptr [matReturn + 0xC]
        fld dword ptr [ecx + 0x10]
        fadd dword ptr [eax + 0x10]
        fstp dword ptr [matReturn + 0x10]
        fld dword ptr [ecx + 0x14]
        fadd dword ptr [eax + 0x14]
        fstp dword ptr [matReturn + 0x14]
        fld dword ptr [ecx + 0x18]
        fadd dword ptr [eax + 0x18]
        fstp dword ptr [matReturn + 0x18]
        fld dword ptr [ecx + 0x1C]
        fadd dword ptr [eax + 0x1C]
        fstp dword ptr [matReturn + 0x1C]
        fld dword ptr [ecx + 0x20]
        fadd dword ptr [eax + 0x20]
        fstp dword ptr [matReturn + 0x20]
        */
    }

    return matReturn;
}




inline Matrix3 operator+ ( const Matrix3& mat )
{
    Matrix3 matReturn;

    matReturn.m_fMatrix9[0] = mat.m_fMatrix9[0] + m_fMatrix9[0];
    matReturn.m_fMatrix9[1] = mat.m_fMatrix9[1] + m_fMatrix9[1];
    matReturn.m_fMatrix9[2] = mat.m_fMatrix9[2] + m_fMatrix9[2];
    matReturn.m_fMatrix9[3] = mat.m_fMatrix9[3] + m_fMatrix9[3];
    matReturn.m_fMatrix9[4] = mat.m_fMatrix9[4] + m_fMatrix9[4];
    matReturn.m_fMatrix9[5] = mat.m_fMatrix9[5] + m_fMatrix9[5];
    matReturn.m_fMatrix9[6] = mat.m_fMatrix9[6] + m_fMatrix9[6];
    matReturn.m_fMatrix9[7] = mat.m_fMatrix9[7] + m_fMatrix9[7];
    matReturn.m_fMatrix9[8] = mat.m_fMatrix9[8] + m_fMatrix9[8];

    return matReturn;
}



Heck, my FPU asm version is twice as slow, yet it's not much different from the C++ version. So I deduce that it must be overhead issues, which brings me to the conclusion that I should use MASM and link it with my DLL.

Thanks,
exorcist_bob

Why is the constructor being called in this code?


//overhead here
lea eax,[esp+0A0h]
lea ecx,[esp+40h]
mov dword ptr [esp+10h],eax
mov dword ptr [esp+14h],ecx
mov ebx,3B9ACA00h
//through here
//my code
mov eax,dword ptr [esp+10h]
mov ecx,dword ptr [esp+14h]
movaps xmm0,xmmword ptr [ecx]
movaps xmm1,xmmword ptr [ecx+10h]
addps xmm0,xmmword ptr [eax]
addps xmm1,xmmword ptr [eax+10h]
movaps xmmword ptr [esp+100h],xmm0
movaps xmmword ptr [esp+110h],xmm1
fld dword ptr [ecx+20h]
fadd dword ptr [eax+20h]
fstp dword ptr [esp+120h]
//overhead here
lea edx,[esp+100h]
push edx
lea ecx,[esp+74h]
call SkepWorks::Math::AsmFloat::Matrix3::Matrix3 (401050h) // right here
sub ebx,1
mov ecx,0Ch
lea esi,[esp+70h]
lea edi,[esp+40h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+150h (4011C0h)
// through here



Thanks,
exorcist_bob

Guest Anonymous Poster
Remember, if you use MASM to make a separate function, the compiler will have no way of inlining it... The compiler will also probably have problems inlining if you use __declspec(naked). Your best bet is probably to use intrinsics.

Guest Anonymous Poster
Quote:
Original post by exorcist_bob
Why is the constructor being called in this code?

*** Source Snippet Removed ***

Thanks,
exorcist_bob


Because you create a temporary Matrix3 object?

If you're interested in speed, and perhaps a learning challenge, then how about looking into expression templates rather than getting down and dirty with asm?
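The gist, for anyone who hasn't run into them: operator+ returns a tiny proxy object instead of a temporary, and the real loop runs once at assignment, so chains like a + b + c fuse into a single pass. A bare-bones sketch (the Vec/AddExpr names are invented for illustration):

// expression-template sketch: Vec + Vec builds a proxy; the loop runs
// once when the result is assigned, so a + b + c creates no Vec temporaries
struct Vec;

template <typename L, typename R>
struct AddExpr
{
    const L& l;
    const R& r;
    AddExpr( const L& l_, const R& r_ ) : l(l_), r(r_) {}
    float operator[]( int i ) const { return l[i] + r[i]; }
};

struct Vec
{
    float v[4];
    float operator[]( int i ) const { return v[i]; }
    template <typename E> Vec& operator=( const E& e )
    {
        for (int i = 0; i < 4; ++i) v[i] = e[i];   // the one fused loop
        return *this;
    }
};

inline AddExpr<Vec, Vec> operator+( const Vec& a, const Vec& b )
{
    return AddExpr<Vec, Vec>(a, b);
}

template <typename L, typename R>
inline AddExpr<AddExpr<L, R>, Vec> operator+( const AddExpr<L, R>& a, const Vec& b )
{
    return AddExpr<AddExpr<L, R>, Vec>(a, b);
}

// usage: Vec d; d = a + b + c;   // one loop, no Vec temporaries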

Quote:
Original post by iMalc
If you're interested in speed, and perhaps a learning challenge, then how about looking into expression templates rather than getting down and dirty with asm?


Whoa, what an interesting way to do things! I'll have to check that out, although I would still like to use assembler. Thanks for the link, though!

Quote:
Original post by exorcist_bob
As you can see, not a ton of overhead. Now here is my version using SSE:
*** Source Snippet Removed ***

Straight from the disassembly output in Visual C++. The original source:

*** Source Snippet Removed ***

*** Source Snippet Removed ***

Heck, my FPU asm version is twice as slow, yet it's not much different from the C++ version. So I deduce that it must be overhead issues, which brings me to the conclusion that I should use MASM and link it with my DLL.

Generally speaking, scheduling and instruction latency tend to have a much bigger effect on performance than any overhead that may or may not exist.
That's why writing ASM that performs well can be such a pain... Scheduling is one of the few complex tasks a compiler is *really* good at, and one of the many, many things that humans suck badly at. [wink]

So yeah, try using intrinsics, which allows the compiler to do the hard work of scheduling the code. Or look into expression templates or other sneaky tricks.

If you *want* to use an assembler, then you should obviously stick with writing asm. Do it as a learning experience, though, and not out of some vague hope that "it'll be faster".
If you just want the fastest possible code, it's quite possible you could achieve it a lot more easily in C++. At the very least, you should try that before getting into ASM.

Well, when I ran the program using intrinsics, I got speeds roughly midway between the C++ speed and my old speed. I have no idea why it's faster despite all the extra data shuffling.


row0 = _mm_load_ps(m_fMatrix9);
mov eax,dword ptr [ebp-0Ch]
movaps xmm0,xmmword ptr [eax]
movaps xmmword ptr [ebp-3B0h],xmm0
movaps xmm0,xmmword ptr [ebp-3B0h]
movaps xmmword ptr [ebp-70h],xmm0
row1 = _mm_load_ps(m_fMatrix9+4);
mov eax,dword ptr [ebp-0Ch]
movaps xmm0,xmmword ptr [eax+10h]
movaps xmmword ptr [ebp-390h],xmm0
movaps xmm0,xmmword ptr [ebp-390h]
movaps xmmword ptr [ebp-90h],xmm0
row2 = _mm_load_ss(m_fMatrix9+8);
mov eax,dword ptr [ebp-0Ch]
movss xmm0,dword ptr [eax+20h]
movaps xmmword ptr [ebp-370h],xmm0
movaps xmm0,xmmword ptr [ebp-370h]
movaps xmmword ptr [ebp-0B0h],xmm0
//row3 = _mm_load_ps(m_fMatrix16+12);

base0 = _mm_load_ps(mat.m_fMatrix9);
mov eax,dword ptr [ebx+0Ch]
movaps xmm0,xmmword ptr [eax]
movaps xmmword ptr [ebp-350h],xmm0
movaps xmm0,xmmword ptr [ebp-350h]
movaps xmmword ptr [ebp-0F0h],xmm0
base1 = _mm_load_ps(mat.m_fMatrix9+4);
mov eax,dword ptr [ebx+0Ch]
movaps xmm0,xmmword ptr [eax+10h]
movaps xmmword ptr [ebp-330h],xmm0
movaps xmm0,xmmword ptr [ebp-330h]
movaps xmmword ptr [ebp-110h],xmm0
base2 = _mm_load_ss(mat.m_fMatrix9+8);
mov eax,dword ptr [ebx+0Ch]
movss xmm0,dword ptr [eax+20h]
movaps xmmword ptr [ebp-310h],xmm0
movaps xmm0,xmmword ptr [ebp-310h]
movaps xmmword ptr [ebp-130h],xmm0
//base3 = _mm_load_ps(mat.m_fMatrix16+12);

result0 = _mm_add_ps(row0, base0);
movaps xmm0,xmmword ptr [ebp-0F0h]
movaps xmm1,xmmword ptr [ebp-70h]
addps xmm1,xmm0
movaps xmmword ptr [ebp-2F0h],xmm1
movaps xmm0,xmmword ptr [ebp-2F0h]
movaps xmmword ptr [ebp-170h],xmm0
result1 = _mm_add_ps(row1, base1);
movaps xmm0,xmmword ptr [ebp-110h]
movaps xmm1,xmmword ptr [ebp-90h]
addps xmm1,xmm0
movaps xmmword ptr [ebp-2D0h],xmm1
movaps xmm0,xmmword ptr [ebp-2D0h]
movaps xmmword ptr [ebp-190h],xmm0
result2 = _mm_add_ss(row2, base2);
movaps xmm0,xmmword ptr [ebp-130h]
movaps xmm1,xmmword ptr [ebp-0B0h]
addss xmm1,xmm0
movaps xmmword ptr [ebp-2B0h],xmm1
movaps xmm0,xmmword ptr [ebp-2B0h]
movaps xmmword ptr [ebp-1B0h],xmm0
//result3 = _mm_add_ps(row3, base3);

_mm_store_ps(matResult.m_fMatrix9, result0);
movaps xmm0,xmmword ptr [ebp-170h]
movaps xmmword ptr [ebp-50h],xmm0
_mm_store_ps(matResult.m_fMatrix9+4,result1);
movaps xmm0,xmmword ptr [ebp-190h]
movaps xmmword ptr [ebp-40h],xmm0
_mm_store_ss(matResult.m_fMatrix9+8,result2);
movaps xmm0,xmmword ptr [ebp-1B0h]
movss dword ptr [ebp-30h],xmm0



ASM straight from the disassembly. As you can see, the data gets shuffled around A LOT. Wouldn't memory latency come into play with all this?

Thanks,
exorcist_bob

Guest Anonymous Poster
To take full advantage of SIMD instruction sets, you need to blow away the idea of a single vector. You've got a lot of vectors, and you need to treat the entire set as one big pool of vectors, then structure and arrange them in memory for efficient SSE.

While the traditional vector is a structure, and a big pool of them is an array of vectors ("Array of Structures"), SIMD computation benefits most when you arrange things a little differently.

You should use a "Structure of Arrays" approach instead if you are shooting for the best performance. Only then can you really do what SSE SIMD was intended for, which is to repeatedly perform the same SEQUENCE of operations on groups of like data in parallel. With the AoS approach you often need to do some swizzling to arrange the data correctly for efficient SIMD; swizzling is extra overhead that only very rarely rears its head in the SoA approach.
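A sketch of what that looks like in practice (names and layout invented for illustration; count is assumed to be a multiple of 4 and the arrays 16-byte aligned):

#include <xmmintrin.h>

// SoA: instead of an array of {x,y,z} structs, keep separate x/y/z arrays
struct VecSoA
{
    float* x;
    float* y;
    float* z;
};

void AddSoA( VecSoA& r, const VecSoA& a, const VecSoA& b, int count )
{
    for (int i = 0; i < count; i += 4)   // four vectors per iteration, no swizzling
    {
        _mm_store_ps(r.x + i, _mm_add_ps(_mm_load_ps(a.x + i), _mm_load_ps(b.x + i)));
        _mm_store_ps(r.y + i, _mm_add_ps(_mm_load_ps(a.y + i), _mm_load_ps(b.y + i)));
        _mm_store_ps(r.z + i, _mm_add_ps(_mm_load_ps(a.z + i), _mm_load_ps(b.z + i)));
    }
}

Every lane of every instruction does useful work, and the data streams through memory linearly.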

Quote:
Original post by exorcist_bob
ASM straight from the disassembly. As you can see, the data gets shuffled around A LOT. Wouldn't memory latency come into play with all this?


Wow, that is terrible! Are you positive that's not a Debug build? I can't think of any other reason it would be using the stack. If it is actually a Release build on VC 2005... ugh, that's disgusting.



Agreed :)

The VC optimizer is still a friggin' joke, after all. Compiler 101: how about a simple peephole optimizer (running in debug mode or not), for crying out loud?

In the meantime, CRAP like
movaps xmmword ptr [ebp-390h], xmm0
movaps xmm0, xmmword ptr [ebp-390h]
is quite funny. Folks, (get a skilled human to) write your time-critical parts in asm, and life is good.

Quote:
Original post by exorcist_bob
Well, when I ran the program using intrinsics, I got speeds roughly midway between the C++ speed and my old speed. I have no idea why it's faster despite all the extra data shuffling.

*** Source Snippet Removed ***

ASM straight from the disassembly. As you can see, the data gets shuffled around A LOT. Wouldn't memory latency come into play with all this?

Thanks,
exorcist_bob


This shuffling of data around is why ASM is such a pain to hand-write. Each instruction has a certain latency (which has nothing to do with memory latency), and the CPU can typically execute up to three instructions per cycle, so to get the best performance the compiler has to arrange the code so that on every single cycle there are instructions that can execute without waiting for earlier ones to finish. If it can fill all three instruction slots that way, you're lucky, but less will do too. To achieve this, instructions have to be reordered *a lot*, and this reordering is one of the (few) kinds of optimization a compiler is actually better at than humans.
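A contrived example: both sequences below leave the sum of xmm0-xmm3 in xmm0, but in the first one every addps has to wait for the previous result, while in the second the first two adds are independent and can be in flight at the same time:

; serial: a three-deep dependency chain on xmm0
addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3

; tree-shaped: the first two adds have no dependency on each other
addps xmm0, xmm1      ; a + b
addps xmm2, xmm3      ; c + d, can issue in parallel
addps xmm0, xmm2      ; (a + b) + (c + d)

Multiply that by a few hundred instructions and it's clear why you want the compiler doing the bookkeeping.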

But as pointed out above, there are plenty of other things the compiler is horrible at optimizing... [wink]

But yeah, I really really hope this is a debug build. Try the release version and see what happens then.

The library routine was in release mode, but the test program was in debug mode. Also, since the function was inline, the test program 'stole' it and compiled its own 'debug' version of it. How strange.

Quote:
Original post by Spoonbender
But to achieve this, instructions have to be reordered *a lot*. This reordering is one of the (few) kinds of optimization a compiler is actually better at than humans.


The thing a compiler is bad at is register economization. So the reason it sucks so hard at intrinsics (though, granted, as the poster discovered, the above doesn't seem to be a Release build) is that SSE has just 8 registers. For some operations you can barely squeak by with 8, but most require 16 or 32 or 64 to do without stalling. I would be astounded to find a compiler that could take intrinsics code and do common stuff like a 4x4 transpose, a 4x3 matrix-vector multiply, or a 3x3 determinant anything close to optimally in SSE.
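For reference, the 4x4 transpose in question is only a few lines of intrinsics; the hard part is what the compiler does with the eight registers behind them. xmmintrin.h even ships a _MM_TRANSPOSE4_PS macro that expands to the shuffle sequence (m is assumed here to be a 16-byte-aligned, row-major float[16]):

#include <xmmintrin.h>

void Transpose4x4( float* m )
{
    __m128 row0 = _mm_load_ps(m +  0);
    __m128 row1 = _mm_load_ps(m +  4);
    __m128 row2 = _mm_load_ps(m +  8);
    __m128 row3 = _mm_load_ps(m + 12);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);   // transposes the four rows in-register
    _mm_store_ps(m +  0, row0);
    _mm_store_ps(m +  4, row1);
    _mm_store_ps(m +  8, row2);
    _mm_store_ps(m + 12, row3);
}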

OTOH, I've worked on platforms like the Cell, where the SPEs have 128 vector registers, and there it becomes simply a matter of feeding the compiler enough data-parallel intrinsics (and, for anyone interested in performance programming, I have no idea why you would try to stick to SSE while Cell chips are starting to spread around!).

So a prerequisite to any of this optimizing is a compiler that will adequately schedule and economize registers. It doesn't exist on x86. So your best bet is to hand-optimize large chunks of SSE assembly for specific problems... go with FPU C++ code that is quick and dirty... or write intrinsics code and hope that compilers will suddenly become more frugal with their consumption of SSE registers.
