optimizing my matrix/vector library using assembler


Hello. Why are my vectorized assembly language routines so much slower than VC++ FPU computations? I am currently writing a matrix/vector engine in assembly and C++, and for whatever reason the C++ version is at least 2.5 times faster than my vectorized version.
inline Matrix3 operator+ ( const Matrix3& mat )
{
ALIGN16 Matrix3 matReturn;

__asm
{

mov eax, dword ptr[mat]
mov ecx, dword ptr[this]
movaps xmm0, dword ptr[ecx]
movaps xmm1, dword ptr[ecx + 0x10]
addps xmm0, dword ptr[eax]
addps xmm1, dword ptr[eax + 0x10]
movaps matReturn, xmm0
movaps matReturn + 0x10, xmm1
// add the oddball using the fpu
fld dword ptr [ecx+0x20]
fadd dword ptr [eax+0x20]
fstp dword ptr [matReturn + 0x20]

};

return matReturn;
};

inline Matrix3 operator+ ( const Matrix3& mat )
{
Matrix3 matReturn;
matReturn.m_fMatrix9[0] = mat.m_fMatrix9[0] + m_fMatrix9[0];
matReturn.m_fMatrix9[1] = mat.m_fMatrix9[1] + m_fMatrix9[1];
matReturn.m_fMatrix9[2] = mat.m_fMatrix9[2] + m_fMatrix9[2];
matReturn.m_fMatrix9[3] = mat.m_fMatrix9[3] + m_fMatrix9[3];
matReturn.m_fMatrix9[4] = mat.m_fMatrix9[4] + m_fMatrix9[4];
matReturn.m_fMatrix9[5] = mat.m_fMatrix9[5] + m_fMatrix9[5];
matReturn.m_fMatrix9[6] = mat.m_fMatrix9[6] + m_fMatrix9[6];
matReturn.m_fMatrix9[7] = mat.m_fMatrix9[7] + m_fMatrix9[7];
matReturn.m_fMatrix9[8] = mat.m_fMatrix9[8] + m_fMatrix9[8];

return matReturn;
};

Thanks for any assistance, exorcist_bob

There could be many reasons: the C++ compiler has an optimizer, the compiler places some invisible instructions around your assembly blocks, the compiler can't inline your function, etc.

For simple additions and assignments there is no reason to use assembly; it would be like replacing a += b with __asm{add a, b}. Also, I have heard that SIMD instructions carry a small performance penalty when you start to use them, so they should only be used for more complex operations (Matrix4x4 * Matrix4x4, for instance).

Short answer: Because assembly isn't faster than C++. And because you need to know *a lot* more about your CPU to write efficient assembly than you need to write good C++.

So, obvious question, why do you use asm? Is it to get access to SSE instructions?
If so, you don't need asm for that. The compiler supplies intrinsics you can use to do SSE in your C++ code, allowing the compiler to properly schedule and optimize your code.

Or is it because you want to avoid overhead?
If so, what overhead? C++ doesn't automatically imply overhead. However, inline ASM sometimes does.
For simple code like this, I doubt there'll be any overhead in properly written C++ code.

Anyway, the obvious reason is that in asm you have to do the optimization yourself. You have to understand instruction latencies and CPU and cache architecture to write the most efficient code. In C++ you just say what you want done and trust the compiler to handle that stuff (which it's pretty damn good at).
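
For illustration, here is a minimal sketch of what the intrinsics route could look like. It assumes a Matrix3 holding nine floats in a 16-byte-aligned m_fMatrix9 array, as in the original post, but uses __declspec(align(16)) directly instead of the poster's ALIGN16 macro:

#include <xmmintrin.h>

struct Matrix3
{
    __declspec(align(16)) float m_fMatrix9[9];

    Matrix3 operator+ ( const Matrix3& mat ) const
    {
        Matrix3 matReturn;
        // two packed adds cover elements 0-7 (both loads stay 16-byte aligned) ...
        _mm_store_ps( matReturn.m_fMatrix9,
                      _mm_add_ps( _mm_load_ps( m_fMatrix9 ),
                                  _mm_load_ps( mat.m_fMatrix9 ) ) );
        _mm_store_ps( matReturn.m_fMatrix9 + 4,
                      _mm_add_ps( _mm_load_ps( m_fMatrix9 + 4 ),
                                  _mm_load_ps( mat.m_fMatrix9 + 4 ) ) );
        // ... and the ninth element is done as a plain scalar add
        matReturn.m_fMatrix9[8] = m_fMatrix9[8] + mat.m_fMatrix9[8];
        return matReturn;
    }
};

Because the compiler sees ordinary expressions instead of an __asm block, it is free to inline this, keep values in registers across calls, and schedule the instructions itself.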

Quote:
Original post by CTar
For simple additions and assignments there is no reason to use assembly; it would be like replacing a += b with __asm{add a, b}. Also, I have heard that SIMD instructions carry a small performance penalty when you start to use them, so they should only be used for more complex operations (Matrix4x4 * Matrix4x4, for instance).


You can use SIMD for Vector4 + Vector4 operations as well; the gain is just smaller. In my testing it was about 1.5x faster than FPU code, as opposed to matrix operations, which were more than 3x faster. But yeah, there is some overhead, I guess, especially when using SIMD for Vector4 + scalar operations.
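
To illustrate the Vector4 + scalar case: the scalar has to be broadcast into all four lanes before the packed add, which is extra work compared to four plain scalar adds. A minimal sketch, assuming a 16-byte-aligned Vector4 with a float v[4] member (my layout, not necessarily the poster's):

#include <xmmintrin.h>

struct Vector4
{
    __declspec(align(16)) float v[4];
};

inline Vector4 operator+ ( const Vector4& vec, float s )
{
    Vector4 r;
    // splat the scalar into all four lanes, then do one packed add
    _mm_store_ps( r.v, _mm_add_ps( _mm_load_ps( vec.v ), _mm_set1_ps( s ) ) );
    return r;
}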

I am basically writing these assembly routines to expand my knowledge of assembly, plus whatever speed increases follow. As for overhead, the disassembly shows a lot more of it in the asm version than in the C++ version. To help with that, should I try to use MASM and link it with the class somehow? If so, where can I get some info on how to do this?

PS It is actually inlining it [smile].

Benchmarks:
ASM using SSE: 64 Seconds
ASM using FPU: 59 Seconds
C++: 32 Seconds

These are based on 1,000,000,000 additions.

Thanks for your help,
exorcist_bob


I've been experimenting a bit with assembly hacking myself, and I'm curious to know more about what you're going through...

What sort of code does the compiler generate? (And which compiler is it?) Is the compiler generating SSE code or FPU code?

Also, have you tried using movss and addss to do the final add, instead of using old-style FPU instructions?

You should review this thread, where I went into some detail about the situation you're in.

The reality is that Microsoft has no intention of providing good asm support. GCC does, but you need to learn a strange and bizarre syntax. Visual C++ 7.1 absolutely sucks at generating code from intrinsics, but 2005 may show some improvement. And in the end, any effort in this direction is severely crippled by the fact that SSE is a horribly conceived instruction set.

Now, for a specific math-heavy algorithm that I can design from the start around the intent to optimize it with SSE assembly, I typically see a 2.5x speed-up over an FPU implementation. However, micro-optimizing small pieces of math routines will only benefit you if the compiler is brave enough to schedule properly. Which it's not. So it doesn't.

It is generating FPU code; the compiler is not smart enough to vectorize it (MSVC++ 2005). Here is the complete disassembly of the inlined C++ Matrix3 operator+ overload:


fld dword ptr [esp+10h]
mov ecx,9
fadd st,st(6)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
sub eax,1
fstp dword ptr [esp+54h]
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
fadd st,st(4)
mov ecx,9
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
fld dword ptr [esp+10h]
fadd st,st(6)
fstp dword ptr [esp+34h]
fld dword ptr [esp+14h]
fadd st,st(5)
fstp dword ptr [esp+38h]
fld dword ptr [esp+18h]
mov ecx,9
fadd st,st(4)
lea esi,[esp+34h]
lea edi,[esp+10h]
fstp dword ptr [esp+3Ch]
fld dword ptr [esp+1Ch]
fadd st,st(3)
fstp dword ptr [esp+40h]
fld dword ptr [esp+20h]
fadd st,st(6)
fstp dword ptr [esp+44h]
fld dword ptr [esp+24h]
fadd st,st(5)
fstp dword ptr [esp+48h]
fld dword ptr [esp+28h]
fadd st,st(2)
fstp dword ptr [esp+4Ch]
fld dword ptr [esp+2Ch]
fadd st,st(1)
fstp dword ptr [esp+50h]
fld dword ptr [esp+30h]
fadd st,st(6)
fstp dword ptr [esp+54h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+9Fh (40109Fh)
fstp st(5)



As you can see, not a ton of overhead. Now here is my version using SSE:

lea eax,[esp+0A0h]
lea ecx,[esp+40h]
mov dword ptr [esp+10h],eax
mov dword ptr [esp+14h],ecx
mov ebx,3B9ACA00h
mov eax,dword ptr [esp+10h]
mov ecx,dword ptr [esp+14h]
movaps xmm0,xmmword ptr [ecx]
movaps xmm1,xmmword ptr [ecx+10h]
addps xmm0,xmmword ptr [eax]
addps xmm1,xmmword ptr [eax+10h]
movaps xmmword ptr [esp+100h],xmm0
movaps xmmword ptr [esp+110h],xmm1
fld dword ptr [ecx+20h]
fadd dword ptr [eax+20h]
fstp dword ptr [esp+120h]
lea edx,[esp+100h]
push edx
lea ecx,[esp+74h]
call SkepWorks::Math::AsmFloat::Matrix3::Matrix3 (401050h)
sub ebx,1
mov ecx,0Ch
lea esi,[esp+70h]
lea edi,[esp+40h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+150h (4011C0h)



Straight from the disassembly output in Visual C++. The original source:


inline Matrix3 operator+ ( const Matrix3& mat )
{
ALIGN16 Matrix3 matReturn;

__asm
{
mov eax, dword ptr[mat]
mov ecx, dword ptr[this]
movaps xmm0, dword ptr[ecx]
movaps xmm1, dword ptr[ecx + 0x10]
addps xmm0, dword ptr[eax]
addps xmm1, dword ptr[eax + 0x10]
movaps matReturn, xmm0
movaps matReturn + 0x10, xmm1
// add the oddball using fpu
fld dword ptr [ecx+0x20]
fadd dword ptr [eax+0x20]
fstp dword ptr [matReturn + 0x20]

/*
mov eax, dword ptr[mat]
mov ecx, dword ptr[this]
fld dword ptr [ecx]
fadd dword ptr [eax]
fstp dword ptr [matReturn]
fld dword ptr [ecx+0x4]
fadd dword ptr [eax+0x4]
fstp dword ptr [matReturn + 0x4]
fld dword ptr [ecx+0x8]
fadd dword ptr [eax+0x8]
fstp dword ptr [matReturn + 0x8]
fld dword ptr [ecx+0xC]
fadd dword ptr [eax+0xC]
fstp dword ptr [matReturn + 0xC]
fld dword ptr [ecx+0x10]
fadd dword ptr [eax+0x10]
fstp dword ptr [matReturn + 0x10]
fld dword ptr [ecx+0x14]
fadd dword ptr [eax+0x14]
fstp dword ptr [matReturn + 0x14]
fld dword ptr [ecx+0x18]
fadd dword ptr [eax+0x18]
fstp dword ptr [matReturn + 0x18]
fld dword ptr [ecx+0x1C]
fadd dword ptr [eax+0x1C]
fstp dword ptr [matReturn + 0x1C]
fld dword ptr [ecx+0x20]
fadd dword ptr [eax+0x20]
fstp dword ptr [matReturn + 0x20]
*/

};

return matReturn;
};




inline Matrix3 operator+ ( const Matrix3& mat )
{
Matrix3 matReturn;

matReturn.m_fMatrix9[0] = mat.m_fMatrix9[0] + m_fMatrix9[0];
matReturn.m_fMatrix9[1] = mat.m_fMatrix9[1] + m_fMatrix9[1];
matReturn.m_fMatrix9[2] = mat.m_fMatrix9[2] + m_fMatrix9[2];
matReturn.m_fMatrix9[3] = mat.m_fMatrix9[3] + m_fMatrix9[3];
matReturn.m_fMatrix9[4] = mat.m_fMatrix9[4] + m_fMatrix9[4];
matReturn.m_fMatrix9[5] = mat.m_fMatrix9[5] + m_fMatrix9[5];
matReturn.m_fMatrix9[6] = mat.m_fMatrix9[6] + m_fMatrix9[6];
matReturn.m_fMatrix9[7] = mat.m_fMatrix9[7] + m_fMatrix9[7];
matReturn.m_fMatrix9[8] = mat.m_fMatrix9[8] + m_fMatrix9[8];

return matReturn;
};



Heck, my FPU version is twice as slow, and not much different from the C++ version. So I deduce that it must be an overhead issue, which brings me to the conclusion that I should use MASM and link it with my DLL.

Thanks,
exorcist_bob

Why is the constructor being called in this code?


//overhead here
lea eax,[esp+0A0h]
lea ecx,[esp+40h]
mov dword ptr [esp+10h],eax
mov dword ptr [esp+14h],ecx
mov ebx,3B9ACA00h
//through here
//my code
mov eax,dword ptr [esp+10h]
mov ecx,dword ptr [esp+14h]
movaps xmm0,xmmword ptr [ecx]
movaps xmm1,xmmword ptr [ecx+10h]
addps xmm0,xmmword ptr [eax]
addps xmm1,xmmword ptr [eax+10h]
movaps xmmword ptr [esp+100h],xmm0
movaps xmmword ptr [esp+110h],xmm1
fld dword ptr [ecx+20h]
fadd dword ptr [eax+20h]
fstp dword ptr [esp+120h]
//overhead here
lea edx,[esp+100h]
push edx
lea ecx,[esp+74h]
call SkepWorks::Math::AsmFloat::Matrix3::Matrix3 (401050h) // right here
sub ebx,1
mov ecx,0Ch
lea esi,[esp+70h]
lea edi,[esp+40h]
rep movs dword ptr es:[edi],dword ptr [esi]
jne wmain+150h (4011C0h)
// through here



Thanks,
exorcist_bob
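
Part of what shows up in that disassembly is the return-by-value temporary: matReturn gets filled by the __asm block, then the return value is constructed from it (the Matrix3::Matrix3 call) and copied into the caller's object (the rep movs of 0Ch dwords). A minimal sketch of one way around that, assuming the same Matrix3 as above and a 16-byte-aligned, caller-provided destination (the free-function name is mine, not the poster's API):

inline void Matrix3Add ( const Matrix3& lhs, const Matrix3& rhs, Matrix3& out )
{
    __asm
    {
        mov eax, dword ptr [rhs]      // reference parameters arrive as pointers
        mov ecx, dword ptr [lhs]
        mov edx, dword ptr [out]
        movaps xmm0, [ecx]            // elements 0-3
        movaps xmm1, [ecx + 0x10]     // elements 4-7
        addps xmm0, [eax]
        addps xmm1, [eax + 0x10]
        movaps [edx], xmm0
        movaps [edx + 0x10], xmm1
        fld  dword ptr [ecx + 0x20]   // ninth element on the FPU, as before
        fadd dword ptr [eax + 0x20]
        fstp dword ptr [edx + 0x20]
    }
}

With no temporary to construct and copy back, the __asm body is essentially all that remains in the loop, which should make it easier to tell whether the SSE instructions themselves are paying off.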

Guest Anonymous Poster
Remember, if you use MASM to make a separate function, the compiler will have no way of inlining it. The compiler will also probably have problems inlining if you use __declspec(naked). Your best bet is probably to use intrinsics.
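
For what it's worth, a separate MASM routine would look something like the sketch below: a stand-alone .asm file assembled on its own (e.g. ml.exe /c /coff matrix3_add.asm, a hypothetical file name) and linked into the project. The layout and alignment assumptions are mine, not anything from the original code.

; matrix3_add.asm -- hypothetical stand-alone MASM version of the addition
.686
.model flat, c
.xmm
.code

; void Matrix3Add( const float* pA, const float* pB, float* pOut )
; assumes all three nine-float arrays are 16-byte aligned
Matrix3Add proc pA:ptr dword, pB:ptr dword, pOut:ptr dword
    mov    eax, pA
    mov    ecx, pB
    mov    edx, pOut
    movaps xmm0, [eax]            ; elements 0-3
    addps  xmm0, [ecx]
    movaps [edx], xmm0
    movaps xmm1, [eax+16]         ; elements 4-7
    addps  xmm1, [ecx+16]
    movaps [edx+16], xmm1
    fld    dword ptr [eax+32]     ; ninth element on the FPU
    fadd   dword ptr [ecx+32]
    fstp   dword ptr [edx+32]
    ret
Matrix3Add endp
end

On the C++ side it would be declared as extern "C" void Matrix3Add( const float*, const float*, float* ); and called from the class. As noted above, though, a function living in a separate object file can never be inlined, so every nine-float addition pays a full call, which is exactly the kind of overhead this thread is about.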
