searching math libs to beat
I have written a matrix class using expression templates which works really well, I guess, and I'm now looking for other matrix libraries so I can compare the generated assembler code and computation times. I've heard of one called Blitz++, but the links are broken...
So I hope someone here knows some good (non-commercial) ones I can beat
constantin
This link to Blitz++ works just fine for me.
You can also try the Matrix Template Library
Or the BLAS routines, including Boost's C++ implementation of those.
Or you could try LAPACK.
Or even the Template Numerical Toolkit.
A Fortran77 compiler may be necessary to use some of those libraries. No aliasing = better performance, which is one of the reasons why Fortran is still used in numerical computing. (That, and the sheer number of libraries available.)
[ Start Here ! | How To Ask Smart Questions | Recommended C++ Books | C++ FAQ Lite | Function Ptrs | CppTips Archive ]
[ Header Files | File Format Docs | LNK2001 | C++ STL Doc | STLPort | Free C++ IDE | Boost C++ Lib | MSVC6 Lib Fixes ]
[edited by - Fruny on September 22, 2003 8:17:35 PM]
>No aliasing = better performance Which is one of the
>reasons why Fortran is still used in numerical
>computations.
Does aliasing really matter if you do something as simple as matrix-vector products or BLAS operations (vector additions etc.)? I've never looked into this, but writing out simple BLAS routines in ASM resulted in code 0% faster than what VC++ produced.
But granted, there are more complicated problems where aliasing might be a bit of a burden...
>(That and the sheer number of libraries available).
That's another thing to beat then, I guess
- Mikko Kauppila
What exactly does 'aliasing' mean?
My matrix class is fast because it does no unnecessary memory operations with temporary objects when evaluating an expression like C = A+(B*D).
[edited by - conman on September 23, 2003 9:14:38 AM]
Can you post an asm sample that your template method produces? I can probably tell directly how far it is from the optimal asm solution.
For instance, a cross product and a matrix multiply.
Ok, this is my assembler output of a matrix multiplication:
[source]
Matrix<2, 2, int> A,B,C;
A = B*C;
[/source]
produces:
[source]
mov eax, DWORD PTR _B$[esp+76]
mov edx, DWORD PTR _C$[esp+68]
mov ecx, DWORD PTR _C$[esp+72]
mov esi, DWORD PTR _B$[esp+68]
mov edi, eax
imul edi, ecx
mov ebx, edx
imul ebx, esi
add edi, ebx
mov ebx, DWORD PTR _B$[esp+72]
mov DWORD PTR _A$[esp+68], edi
mov edi, DWORD PTR _B$[esp+80]
mov ebp, edi
imul ebp, ecx
mov ecx, ebx
imul ecx, edx
add ebp, ecx
mov ecx, DWORD PTR _C$[esp+80]
mov edx, ecx
imul ecx, edi
imul edx, eax
mov eax, DWORD PTR _C$[esp+76]
mov DWORD PTR _A$[esp+72], ebp
mov ebp, eax
imul eax, ebx
imul ebp, esi
add edx, ebp
add eax, ecx
mov DWORD PTR _A$[esp+76], edx
mov DWORD PTR _A$[esp+80], eax
[/source]
The same one with double:
[source]
Matrix<2, 2, double> A,B,C;
A = B*C;
[/source]
produces:
[source]
fld QWORD PTR _B$[esp+104]
fmul QWORD PTR _C$[esp+104]
fld QWORD PTR _C$[esp+112]
fmul QWORD PTR _B$[esp+120]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+104]
fld QWORD PTR _B$[esp+128]
fmul QWORD PTR _C$[esp+112]
fld QWORD PTR _B$[esp+112]
fmul QWORD PTR _C$[esp+104]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+112]
fld QWORD PTR _C$[esp+120]
fmul QWORD PTR _B$[esp+104]
fld QWORD PTR _C$[esp+128]
fmul QWORD PTR _B$[esp+120]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+120]
fld QWORD PTR _C$[esp+128]
fmul QWORD PTR _B$[esp+128]
fld QWORD PTR _C$[esp+120]
fmul QWORD PTR _B$[esp+112]
faddp ST(1), ST(0)
fstp QWORD PTR _A$[esp+128]
[/source]
What do you say?
Aliasing: when more than one variable can refer to a given memory location.
I won't comment on the int routine, since it's probably not used in 3D games. There are some unnecessary variable "renamings" (like mov edi, eax, probably).
Well, this obviously gets rid of the stupid temporary variables on the stack (Temp=B+C; A=Temp;) generated by the usual C++ operator implementations.
However, the parallelism is still not exploited. The best code runs 2, 3 or 4 separate calculation "threads" (like A[0][0] and A[0][1]) in parallel and groups the fstps to avoid waiting states and stalls. You probably still lose 20% of the performance here, but that's much better than typical code. The Intel compiler would probably recombine the FPU instructions better; I suppose you used Visual C++ or gcc here.
Hmm, also, in general don't use doubles as local variables (on the stack). There is a 50% chance that the stack is not aligned on a 64-bit boundary, which kills performance. Prefer float, or customize your double matrix allocations on the heap.
[source]
fld QWORD PTR _B$[esp+104]
fmul QWORD PTR _C$[esp+104]
fld QWORD PTR _C$[esp+112]
fmul QWORD PTR _B$[esp+120]
; stall (1 or 2 cycles)
faddp ST(1), ST(0)
; stall (2 cycles ?)
fstp QWORD PTR _A$[esp+104]
etc...
[/source]