
conman

searching math libs to beat


I have written a matrix class using expression templates which works really well, I guess, and I'm now looking for other matrix libraries so I can compare the generated assembler code and calculation times. I've heard of one called Blitz++, but the links are broken... So I hope someone here knows some good (non-commercial) ones I can beat.

constantin
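(For readers who haven't seen the technique: below is a minimal, hypothetical sketch of what an expression-template matrix sum can look like. The names Matrix, AddExpr and value_type are illustrative, not conman's actual code; the point is only that "A + B" builds a lightweight expression object and the elements are computed inside operator=, so no temporary matrix is created.)

[source]
// Illustrative sketch only -- not conman's class.  "A + B" returns a tiny
// AddExpr node holding references; the assignment then runs as one loop nest
// with no temporary Matrix being materialised.
#include <cstddef>

template<std::size_t Rows, std::size_t Cols, typename T>
struct Matrix
{
    typedef T value_type;
    T data[Rows][Cols];

    T operator()(std::size_t r, std::size_t c) const { return data[r][c]; }

    template<typename Expr>
    Matrix& operator=(const Expr& e)        // evaluate the expression element by element
    {
        for (std::size_t r = 0; r < Rows; ++r)
            for (std::size_t c = 0; c < Cols; ++c)
                data[r][c] = e(r, c);
        return *this;
    }
};

template<typename L, typename R>
struct AddExpr                              // node for "lhs + rhs", holds references only
{
    typedef typename L::value_type value_type;
    const L& lhs;
    const R& rhs;
    AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    value_type operator()(std::size_t r, std::size_t c) const { return lhs(r, c) + rhs(r, c); }
};

// Deliberately unconstrained for brevity; a real library would restrict this
// overload to its own matrix and expression types.
template<typename L, typename R>
AddExpr<L, R> operator+(const L& l, const R& r) { return AddExpr<L, R>(l, r); }

int main()
{
    Matrix<2, 2, double> A = {}, B = {}, C;
    C = A + B;                              // single loop nest, no temporary Matrix
    return 0;
}
[/source]

A multiplication node works the same way, except that its operator()(r, c) accumulates over the inner dimension.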

This link to Blitz++ works just fine for me.
You can also try the Matrix Template Library
Or the BLAS routines, including Boost's C++ implementation of those.
Or you could try LAPACK.
Or even the Template Numerical Toolkit.

A Fortran 77 compiler may be necessary to use some of those libraries. No aliasing = better performance, which is one of the reasons why Fortran is still used in numerical computations (that, and the sheer number of libraries available).
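(As a rough illustration of the aliasing point: in C and C++ the compiler must assume that two pointer arguments may overlap, so it has to be conservative about keeping values in registers and reordering the loop; Fortran forbids such overlap between array arguments. A sketch using the non-standard __restrict extension that both MSVC and gcc accept -- the function names are made up:)

[source]
// Sketch of why aliasing hurts (illustrative only).  In axpy_plain the
// compiler must assume y and x might overlap, so every store to y[i] can
// invalidate a previously loaded x value.  With __restrict (a common compiler
// extension, 'restrict' in C99) we promise there is no overlap -- roughly the
// guarantee Fortran gets for free on its array arguments.
void axpy_plain(double* y, const double* x, double a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void axpy_restrict(double* __restrict y, const double* __restrict x, double a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
[/source]

Whether that difference is measurable for such a simple loop is exactly the question raised in the next reply.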



[ Start Here ! | How To Ask Smart Questions | Recommended C++ Books | C++ FAQ Lite | Function Ptrs | CppTips Archive ]
[ Header Files | File Format Docs | LNK2001 | C++ STL Doc | STLPort | Free C++ IDE | Boost C++ Lib | MSVC6 Lib Fixes ]


[edited by - Fruny on September 22, 2003 8:17:35 PM]

> No aliasing = better performance, which is one of the reasons why Fortran is still used in numerical computations.

Does aliasing really matter if you do something as simple as matrix-vector products or BLAS operations (vector additions etc.)? I've never looked into this, but writing out simple BLAS routines in ASM gave me code that was 0% faster than what VC++ produced.

But granted, there are more complicated problems in which aliasing might be a bit of a burden...

>(That and the sheer number of libraries available).

That's another thing to beat then, I guess.

- Mikko Kauppila

What exactly does 'aliasing' mean?
My matrix class is fast because it performs no unnecessary memory operations with temporary objects when evaluating an expression like C = A+(B*D).


[edited by - conman on September 23, 2003 9:14:38 AM]

Can you post an asm sample that your template method produces? I can probably tell directly how far it is from the optimal asm solution.

For instance, a cross product and a matrix multiplication.

OK, this is the assembler output for a matrix multiplication:

Matrix<2, 2, int> A,B,C;
A = B*C;

produces:

[source]
mov eax, DWORD PTR _B$[esp+76]
mov edx, DWORD PTR _C$[esp+68]
mov ecx, DWORD PTR _C$[esp+72]
mov esi, DWORD PTR _B$[esp+68]
mov edi, eax
imul edi, ecx
mov ebx, edx
imul ebx, esi
add edi, ebx
mov ebx, DWORD PTR _B$[esp+72]
mov DWORD PTR _A$[esp+68], edi
mov edi, DWORD PTR _B$[esp+80]
mov ebp, edi
imul ebp, ecx
mov ecx, ebx
imul ecx, edx
add ebp, ecx
mov ecx, DWORD PTR _C$[esp+80]
mov edx, ecx
imul ecx, edi
imul edx, eax
mov eax, DWORD PTR _C$[esp+76]
mov DWORD PTR _A$[esp+72], ebp
mov ebp, eax
imul eax, ebx
imul ebp, esi
add edx, ebp
add eax, ecx
mov DWORD PTR _A$[esp+76], edx
mov DWORD PTR _A$[esp+80], eax
[/source]

the same one with double:

Matrix<2, 2, double> A,B,C;
A = B*C;

produces:
[source]
fld QWORD PTR _B$[esp+104]
fmul QWORD PTR _C$[esp+104]
fld QWORD PTR _C$[esp+112]
fmul QWORD PTR _B$[esp+120]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+104]
fld QWORD PTR _B$[esp+128]
fmul QWORD PTR _C$[esp+112]
fld QWORD PTR _B$[esp+112]
fmul QWORD PTR _C$[esp+104]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+112]
fld QWORD PTR _C$[esp+120]
fmul QWORD PTR _B$[esp+104]
fld QWORD PTR _C$[esp+128]
fmul QWORD PTR _B$[esp+120]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+120]
fld QWORD PTR _C$[esp+128]
fmul QWORD PTR _B$[esp+128]
fld QWORD PTR _C$[esp+120]
fmul QWORD PTR _B$[esp+112]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+128]
[/source]


What do you say?

I won't comment on the int routine, since it's probably not used in 3D games. There are some unnecessary variable "renamings" (like mov edi, eax, probably).

Well, this obviously gets rid of the stupid temp variables on the stack (Temp = B+C; A = Temp;) generated by the usual C++ operator implementations.

However, the parallelism is still not exploited. The best code runs 2, 3 or 4 separate calculation "threads" (like A[0][0] and A[0][1]) in parallel and groups the fstps to avoid waiting states and stalls. You probably still lose about 20% of the performance here, but that's much better than typical code. I don't know for sure, but the Intel compiler would probably recombine the FPU instructions better; I suppose you used Visual C++ or gcc here.

Also, in general, don't use doubles as local variables (on the stack). There is a 50% chance that the stack is not aligned on a 64-bit boundary, which kills performance. Prefer float, or give your double matrices a custom, aligned allocation on the heap.
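(As a rough sketch of such a custom allocation, assuming the MSVC CRT's _aligned_malloc/_aligned_free; the helper names are made up:)

[source]
#include <cstddef>
#include <malloc.h>   // _aligned_malloc / _aligned_free in the MSVC CRT

// Hypothetical helpers, not from conman's class: put the double matrix on the
// heap with 8-byte alignment so every element sits on a 64-bit boundary,
// regardless of how the caller's stack happens to be aligned.
double* alloc_matrix(std::size_t rows, std::size_t cols)
{
    return static_cast<double*>(_aligned_malloc(rows * cols * sizeof(double), 8));
}

void free_matrix(double* p)
{
    _aligned_free(p);
}
[/source]

(Rounding the alignment up to 16 bytes also keeps the data SSE2-friendly.)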


fld QWORD PTR _B$[esp+104]
fmul QWORD PTR _C$[esp+104]
fld QWORD PTR _C$[esp+112]
fmul QWORD PTR _B$[esp+120]
; stall (1 or 2 cycles)
faddp ST(1), ST(0)
; stall (2 cycles ?)
fstp QWORD PTR _A$[esp+104]
etc...
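(To make the "separate calculation threads" idea concrete, here is a hand-interleaved sketch of just the first two stores of the double version above: the two independent sums are interleaved so the faddp/fstp of one chain overlaps the loads and multiplies of the other. This changes only the instruction scheduling, not the arithmetic.)

[source]
; interleaved sketch: chain 1 -> _A$[esp+104], chain 2 -> _A$[esp+112]
fld   QWORD PTR _B$[esp+104]   ; chain 1, first product
fmul  QWORD PTR _C$[esp+104]
fld   QWORD PTR _B$[esp+128]   ; chain 2, first product
fmul  QWORD PTR _C$[esp+112]
fld   QWORD PTR _C$[esp+112]   ; chain 1, second product
fmul  QWORD PTR _B$[esp+120]
fld   QWORD PTR _B$[esp+112]   ; chain 2, second product
fmul  QWORD PTR _C$[esp+104]
faddp ST(2), ST(0)             ; sum chain 2 (its products sit at ST(0) and ST(2))
faddp ST(2), ST(0)             ; sum chain 1 (now at ST(0) and ST(2))
fstp  QWORD PTR _A$[esp+112]   ; chain 2 result is on top of the stack
fstp  QWORD PTR _A$[esp+104]   ; chain 1 result
[/source]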
