optimizing my matrix/vector library using assembler

Started by exorcist_bob
19 comments, last by ajas95 17 years, 9 months ago
Quote:Original post by exorcist_bob
Why is the constructor being called in this code?

*** Source Snippet Removed ***

Thanks,
exorcist_bob


Because you create a temporary Matrix3 object?
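Since the snippet was removed, here's the usual culprit as a hypothetical reconstruction (not the actual code from the post):

struct Matrix3 { float m[9]; };   // stand-in for the real class

// operator+ returns by value, so a temporary Matrix3 must be
// constructed to hold the result:
Matrix3 operator+(const Matrix3& a, const Matrix3& b)
{
    Matrix3 result;                          // constructor runs here, for the temporary
    for (int i = 0; i < 9; ++i)
        result.m[i] = a.m[i] + b.m[i];
    return result;
}

// Matrix3 c = a + b;  -> the temporary is what you're seeing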
If you're interested in speed, and perhaps a learning challenge, then how about simply looking into expression templates rather than getting down and dirty with asm?
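For the curious, the core trick looks roughly like this. This is a minimal sketch with made-up names, not code from any real library (a real one, such as Blitz++, constrains the operator templates far more carefully):

#include <cstddef>

// An addition is represented as a lightweight expression object
// instead of being evaluated into a temporary:
template <typename L, typename R>
struct AddExpr {
    const L& l;
    const R& r;
    AddExpr(const L& l_, const R& r_) : l(l_), r(r_) {}
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec3 {
    float v[3];
    float operator[](std::size_t i) const { return v[i]; }

    // The whole expression tree is evaluated in one fused loop here,
    // with no intermediate temporaries:
    template <typename E>
    Vec3& operator=(const E& e)
    {
        for (std::size_t i = 0; i < 3; ++i) v[i] = e[i];
        return *this;
    }
};

// Sketch only: a real library restricts L and R to vector/expression types.
template <typename L, typename R>
AddExpr<L, R> operator+(const L& l, const R& r) { return AddExpr<L, R>(l, r); }

// Usage: given Vec3 a, b, c, d, the statement d = a + b + c compiles
// down to a single loop over the three components.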
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
If you're interested in speed, and perhaps a learning challenge, then how about simply looking into expression templates rather than getting down and dirty with asm?


Whoa, what an interesting way to do things! I'll have to check that out, although I would still like to use assembler. Thanks for the link, though!
Quote:Original post by exorcist_bob
As you can see, not a ton of overhead. Now here is my version using SSE:
*** Source Snippet Removed ***

Straight from the disassembly output in Visual C++. The original source:

*** Source Snippet Removed ***

*** Source Snippet Removed ***

Heck, my FPU version is twice as slow, and not much different from the C++ version. So I deduce that it must be an overhead issue, which brings me to the conclusion that I should use MASM and link it with my DLL.

Generally speaking, scheduling and instruction latency tend to have a much bigger effect on performance than any overhead that may or may not exist.
That's why writing ASM that performs well can be such a pain... It's one of the few complex tasks a compiler is *really* good at, and one of the many, many things that humans suck badly at. [wink]

So yeah, try using intrinsics, which let the compiler do the hard work of scheduling the code. Or look into expression templates or other sneaky tricks.

If you *want* to use an assembler, then you should obviously stick with writing asm. Do it as a learning experience though, and not out of some vague hope that "it'll be faster".
If you just want the fastest possible code, it's quite possible you could achieve it much more easily in C++. At the very least, you should try that before getting into ASM.
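For comparison, here's roughly what an intrinsics version could look like when the loads, adds and stores are kept together so a release-build compiler can hold everything in registers. The layout (rows padded to four floats, 16-byte aligned) is guessed from the disassembly below, and the names are only illustrative:

#include <xmmintrin.h>

struct Matrix3  // guessed layout: three rows, each padded to 16 bytes
{
    __declspec(align(16)) float m_fMatrix9[12];
};

inline void Add(const Matrix3& a, const Matrix3& b, Matrix3& out)
{
    // With no round trips through local __m128 variables on the stack,
    // an optimizing build should keep all of this in xmm registers.
    // Adding the fourth (padding) float of each row is harmless.
    _mm_store_ps(out.m_fMatrix9 + 0,
                 _mm_add_ps(_mm_load_ps(a.m_fMatrix9 + 0), _mm_load_ps(b.m_fMatrix9 + 0)));
    _mm_store_ps(out.m_fMatrix9 + 4,
                 _mm_add_ps(_mm_load_ps(a.m_fMatrix9 + 4), _mm_load_ps(b.m_fMatrix9 + 4)));
    _mm_store_ps(out.m_fMatrix9 + 8,
                 _mm_add_ps(_mm_load_ps(a.m_fMatrix9 + 8), _mm_load_ps(b.m_fMatrix9 + 8)));
}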
Well, when I ran the program using intrinsics, I got speeds about midway between the C++ version and my old version. I have no idea why it's actually faster, given all the extra data shuffling.

row0 = _mm_load_ps(m_fMatrix9);
  mov         eax,dword ptr [ebp-0Ch]
  movaps      xmm0,xmmword ptr [eax]
  movaps      xmmword ptr [ebp-3B0h],xmm0
  movaps      xmm0,xmmword ptr [ebp-3B0h]
  movaps      xmmword ptr [ebp-70h],xmm0
row1 = _mm_load_ps(m_fMatrix9+4);
  mov         eax,dword ptr [ebp-0Ch]
  movaps      xmm0,xmmword ptr [eax+10h]
  movaps      xmmword ptr [ebp-390h],xmm0
  movaps      xmm0,xmmword ptr [ebp-390h]
  movaps      xmmword ptr [ebp-90h],xmm0
row2 = _mm_load_ss(m_fMatrix9+8);
  mov         eax,dword ptr [ebp-0Ch]
  movss       xmm0,dword ptr [eax+20h]
  movaps      xmmword ptr [ebp-370h],xmm0
  movaps      xmm0,xmmword ptr [ebp-370h]
  movaps      xmmword ptr [ebp-0B0h],xmm0
//row3 = _mm_load_ps(m_fMatrix16+12);
base0 = _mm_load_ps(mat.m_fMatrix9);
  mov         eax,dword ptr [ebx+0Ch]
  movaps      xmm0,xmmword ptr [eax]
  movaps      xmmword ptr [ebp-350h],xmm0
  movaps      xmm0,xmmword ptr [ebp-350h]
  movaps      xmmword ptr [ebp-0F0h],xmm0
base1 = _mm_load_ps(mat.m_fMatrix9+4);
  mov         eax,dword ptr [ebx+0Ch]
  movaps      xmm0,xmmword ptr [eax+10h]
  movaps      xmmword ptr [ebp-330h],xmm0
  movaps      xmm0,xmmword ptr [ebp-330h]
  movaps      xmmword ptr [ebp-110h],xmm0
base2 = _mm_load_ss(mat.m_fMatrix9+8);
  mov         eax,dword ptr [ebx+0Ch]
  movss       xmm0,dword ptr [eax+20h]
  movaps      xmmword ptr [ebp-310h],xmm0
  movaps      xmm0,xmmword ptr [ebp-310h]
  movaps      xmmword ptr [ebp-130h],xmm0
//base3 = _mm_load_ps(mat.m_fMatrix16+12);
result0 = _mm_add_ps(row0, base0);
  movaps      xmm0,xmmword ptr [ebp-0F0h]
  movaps      xmm1,xmmword ptr [ebp-70h]
  addps       xmm1,xmm0
  movaps      xmmword ptr [ebp-2F0h],xmm1
  movaps      xmm0,xmmword ptr [ebp-2F0h]
  movaps      xmmword ptr [ebp-170h],xmm0
result1 = _mm_add_ps(row1, base1);
  movaps      xmm0,xmmword ptr [ebp-110h]
  movaps      xmm1,xmmword ptr [ebp-90h]
  addps       xmm1,xmm0
  movaps      xmmword ptr [ebp-2D0h],xmm1
  movaps      xmm0,xmmword ptr [ebp-2D0h]
  movaps      xmmword ptr [ebp-190h],xmm0
result2 = _mm_add_ss(row2, base2);
  movaps      xmm0,xmmword ptr [ebp-130h]
  movaps      xmm1,xmmword ptr [ebp-0B0h]
  addss       xmm1,xmm0
  movaps      xmmword ptr [ebp-2B0h],xmm1
  movaps      xmm0,xmmword ptr [ebp-2B0h]
  movaps      xmmword ptr [ebp-1B0h],xmm0
//result3 = _mm_add_ps(row3, base3);
_mm_store_ps(matResult.m_fMatrix9, result0);
  movaps      xmm0,xmmword ptr [ebp-170h]
  movaps      xmmword ptr [ebp-50h],xmm0
_mm_store_ps(matResult.m_fMatrix9+4, result1);
  movaps      xmm0,xmmword ptr [ebp-190h]
  movaps      xmmword ptr [ebp-40h],xmm0
_mm_store_ss(matResult.m_fMatrix9+8, result2);
  movaps      xmm0,xmmword ptr [ebp-1B0h]
  movss       dword ptr [ebp-30h],xmm0


ASM straight from the disassembly. As you can see, the data is shuffled around A LOT. Wouldn't memory latency come into play here?

Thanks,
exorcist_bob
To take full advantage of SIMD instruction sets, you need to blow away the idea of a single vector. You've got a lot of vectors, and you need to treat the entire set as one big pool, then structure and arrange it in memory for efficient SSE.

While the traditional vector is a structure and a big pool of them is an array of vectors ("Array of Structures"), SIMD computation benefits most when you arrange things a little differently.

You should use a "Structure of Arrays" approach instead if you are shooting for the best performance. Only there can you really do what SSE SIMD was intended to do, which is to repeatedly perform the same SEQUENCE of operations on groups of like data in parallel. With the AoS approach, you often need to do some swizzling to arrange the data correctly for efficient SIMD; swizzling is extra overhead that only very rarely rears its head in the SoA approach.
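As a rough illustration of the SoA idea (made-up layout and names, assuming 16-byte-aligned allocations and a count that is a multiple of four):

#include <xmmintrin.h>

// "Structure of Arrays": one array per component. Four vectors are
// processed per instruction, and no swizzling is needed.
struct Vec3SoA
{
    float* x;   // all x components, 16-byte aligned
    float* y;   // all y components
    float* z;   // all z components
};

void AddAll(const Vec3SoA& a, const Vec3SoA& b, Vec3SoA& out, int count)
{
    for (int i = 0; i < count; i += 4)  // count assumed to be a multiple of 4
    {
        _mm_store_ps(out.x + i, _mm_add_ps(_mm_load_ps(a.x + i), _mm_load_ps(b.x + i)));
        _mm_store_ps(out.y + i, _mm_add_ps(_mm_load_ps(a.y + i), _mm_load_ps(b.y + i)));
        _mm_store_ps(out.z + i, _mm_add_ps(_mm_load_ps(a.z + i), _mm_load_ps(b.z + i)));
    }
}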
Quote:Original post by exorcist_bob
ASM straight from the disassembly. As you can see, the data is shuffled around A LOT. Wouldn't memory latency come into play here?


Wow, that is terrible! Are you positive that's not a Debug build? I can't think of any other reason it would be using the stack. If it is actually a Release build on VC 2005... ugh, that's disgusting.



Agreed :)

The VC optimizer is still a friggin joke, after all. Compiler 101: how about a simple peephole optimizer (running in debug mode or not), for crying out loud?

In the meantime, CRAP like
  movaps      xmmword ptr [ebp-390h],xmm0
  movaps      xmm0,xmmword ptr [ebp-390h]
is quite funny. Folks, (get a skilled human to) write your time-critical parts in asm, and life is good.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Quote:Original post by exorcist_bob
Well, when I ran the program using intrinsics, I got speeds about midway between the C++ version and my old version. I have no idea why it's actually faster, given all the extra data shuffling.

*** Source Snippet Removed ***

ASM straight from the disassembly. As you can see, the data is shuffled around A LOT. Wouldn't memory latency come into play here?

Thanks,
exorcist_bob


This shuffling of data around is why ASM is such a pain to hand-write. Each instruction has a certain latency (which has nothing to do with memory latency), and the CPU can typically execute up to three instructions per cycle. To get the best performance, the compiler has to arrange the code so that on every single cycle there are instructions ready to execute without waiting for the previous ones to finish. If it can fill all three instruction slots this way, you're lucky, but less will do too. To achieve this, instructions have to be reordered *a lot*. This reordering is one of the (few) kinds of optimization a compiler is actually better at than humans.
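To make the latency point concrete, here is the kind of arrangement a good compiler (or a very patient human) aims for. This is an illustrative sketch, not code from the thread; it assumes aligned data and a count that is a multiple of 16:

#include <xmmintrin.h>

// Summing with four independent accumulators: each addps has a latency
// of several cycles, but because these four chains don't depend on each
// other, a new add can start before the previous one has finished.
__m128 Sum(const float* data, int count)
{
    __m128 s0 = _mm_setzero_ps();
    __m128 s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps();
    __m128 s3 = _mm_setzero_ps();
    for (int i = 0; i < count; i += 16)
    {
        s0 = _mm_add_ps(s0, _mm_load_ps(data + i));
        s1 = _mm_add_ps(s1, _mm_load_ps(data + i + 4));
        s2 = _mm_add_ps(s2, _mm_load_ps(data + i + 8));
        s3 = _mm_add_ps(s3, _mm_load_ps(data + i + 12));
    }
    // Combine the four partial sums at the end.
    return _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
}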

But as pointed out above, there are plenty of other things the compiler is horrible at optimizing... [wink]

But yeah, I really, really hope this is a debug build. Try the release version and see what happens then.
The library routine was built in release mode, but the test program was in debug mode. And since the function was inline, the test program 'stole' it and compiled a 'debug' version of it. How strange.
