Horse power math lib (2)

Started by
16 comments, last by Charles B 19 years, 10 months ago
@Melekor
Hi, happy to see you back.

1) Why don't you create a project for it at sourceforge.net ?

Because with time I felt the starting job, namely layer A, was too complex and hazardous to build as a team. Anyone working with me would have gone mad seeing layer A constantly rewritten. It took a lot of time before I could find a stable way to design the lowest level, which is the key to the upper abstraction layers, considering the multiple constraints: compilers, compiler options, systems, CPUs, ultimate performance, side effects, syntax, user options.

Now I am proud that this works perfectly with gcc. The bases are sound and validated to me. So this may effectively be the time to make it actually open source, not for public users, since it's not the 1.0, but for testers and contributors. I am also a noob at handling CVS projects, so some help with that would be welcome.

2) Can you post some examples that compare source code to asm output ?

I'll put that > here < tonight (in one hour) because the code is on another PC and this one freezes when both the modem and a USB key are plugged in. Dinner time anyway.

3) Will this mean we must compile a separate executable for each target architecture ?
It's a generic problem, not only for this library, apart from the C# solution. But I doubt C# will ever compete with performant C++ implementations. Since speed is the target, a separate executable per architecture is my preferred option. But there are many other options, based on virtual tables, which I already explained here before.
"Coding math tricks in asm is more fun than Java"
how much syntactic sugar will you be pouring on?
Well, SIMD standardization stinks; it had to be that way. If ISO had guided the vendors and compiler writers more, this would have been less painful. Anyway, the truly user-level part is just the C++ stuff. But math also comes into play. Ever had a brain attack during a math course ? Hehe

classes have lots of overloaded operators. If you haven't done too much of that yet, maybe it's an area where programmers who don't understand the asm bit could help out ?

Truly, that's why it's not 1.0. I preferred to concentrate on the vital and harder A layer. But with the basic vector classes, most 3D classes have enough operators available to be implemented with the fully portable layers.

I have spent a lot of time layering the architecture, keeping in mind how different people may cooperate. The perspective is a kind of big tree, which would transform the library into an "encyclopedia". But it does not need to reach its leaves at the start. I have the seed and its DNA done. It just has to grow balanced now.

Yes, there can be contributors of different levels. You can write a new class with operators, etc., as usual. You don't need to bother with specific implementations (v2f32 stuff) at the start. Later there should be enough users around to add more and more specialized implementations of methods and routines to make this class full speed.

For those interested, I could develop the implementation of quaternion multiplication on 3DNow to show why vectorizers practically cannot do it for us. The complexity of reasoning needed to find the best code is not what an automaton likes. This shows why the human touch and the open source strategy are relevant. The number of common routines to write is not that big for many hands. What counts is having a code base able to merge the efforts efficiently. And once they are done, many people will benefit from huge speedups painlessly.

I'll probably organize speed "competitions" to refine some important routines. Maybe some hardware vendors could be interested in giving small prizes.

ODE distribution has a test_ode.exe
OK, very well, thanks. I'll also try to contact the author. Maybe we can make a win/win deal between open source authors.

"Coding math tricks in asm is more fun than Java"
quote:Original post by Charles B
ODE distribution has a test_ode.exe
OK, very well, thanks. I'll also try to contact the author. Maybe we can make a win/win deal between open source authors.


It might be best if you join the mailing list and check out the comments people have made in this thread. The topic of using an external maths lib has recently been discussed, but I don't think anyone's going ahead with it.

[teamonkey] [blog] [tinyminions]
Most linear algebra routines can exploit layer B (or C if you want to test your routine quickly in C++ only).
Example (C++) :
xScalar xAABox::DistPlane(const xPlane &N)
{
    // _Pmin, _Pmax are the two members of xAABox
    xVectori Pos = cmpgt(N, 0);
    xVectori Neg = cmple(N, 0);
    xVector nearestVertex = (_Pmin & Pos) | (_Pmax & Neg);
    return nearestVertex * N;
}

will result in 2 pfcmp, 2 pand, 1 por, and the dot product, which is implemented differently in xANSI, x3DNOW, xSSE, etc.
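For readers without the SIMD background, here is a scalar sketch of what the mask selection above computes (the function name and the three-float plane representation are assumptions for illustration, not the library's actual API):

```cpp
#include <cassert>

// Where a normal component is positive, the cmpgt mask selects the _Pmin
// component, otherwise the cmple mask selects the _Pmax component; the
// selected vertex is then dotted with the normal.
float distPlaneScalar(const float Pmin[3], const float Pmax[3], const float N[3]) {
    float dot = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float nearest = (N[i] > 0.0f) ? Pmin[i] : Pmax[i];  // branch = the SIMD mask
        dot += nearest * N[i];
    }
    return dot;
}
```

The SIMD version does the same selection branch-free with compare masks and bitwise ops, which is why it costs only five packed instructions plus the dot product.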


Now this shows the source implementation of quaternion multiplication for 3DNow. For actual efficiency this function requires going back to VSI64 (Virtual SIMD 64-bit) asm-style code: the swizzling of the components in the formula requires taking the size of the packets into account, and coded with pure linear algebra the code would be less efficient, so I preferred the asm style. Note that
_upkl_2f(a, b); (asm-style macro function) is the same as
a = unpackl_2f(a, b);
#define vQMul_64(Qr, Qs, Qt) \
{ \
    register v2f32 mm0, mm1, mm2, mm3; \
    register v2f32 mm4, mm5, mm6, mm7; \
    mm0 = _xy(Qs); mm2 = _zw(Qs); \
    mm5 = _xy(Qt); mm6 = _zw(Qt); \
    mm1 = mm0; mm3 = mm2; mm4 = mm0; mm7 = mm6; \
    _upkl_2f(mm0, mm0); /* Ax (A is Qs) */ \
    _upkl_2f(mm2, mm2); /* Az */ \
    _upkh_2f(mm1, mm1); /* Ay */ \
    _upkh_2f(mm3, mm3); /* Aw */ \
    _upkl_2f(mm6, mm6); /* Bz (B is Qt) */ \
    _upkh_2f(mm7, mm7); /* Bw */ \
                           /* Left thread         Right thread    */ \
    _mul_2f(mm2, mm5);     /*                     Az*Bxy          */ \
    _mul_2f(mm4, mm6);     /*                     Bz*Axy          */ \
    _mul_2f(mm0, mm5);     /*                     Ax*Bxy          */ \
    _mul_2f(mm6, _zw(Qs)); /*                     Bz*Azw          */ \
    _mul_2f(mm1, mm5);     /* Ay*Bxy                              */ \
    _mul_2f(mm3, mm5);     /* Aw*Bxy                              */ \
    _mov_2f(mm5, _zw(Qs)); /* Azw                                 */ \
    _mul_2f(mm5, mm7);     /* Bw*Azw                              */ \
    _mul_2f(mm7, _xy(Qs)); /* Bw*Axy                              */ \
    _sub_2f(mm2, mm4);     /*                      Az*Bxy-Bz*Axy  */ \
    _add_2f(mm0, mm6);     /*                      Ax*Bxy+Bz*Azw  */ \
    _add_2f(mm1, mm5);     /* Ay*Bxy+Bw*Azw                       */ \
    _add_2f(mm3, mm7);     /* Aw*Bxy+Bw*Axy                       */ \
    _swap_2f(mm2, mm2);    /*                      Az*Byx-Bz*Ayx  */ \
    _swap_2f(mm0, mm0);    /*                      Ax*Bxy+Bz*Azw  */ \
    _negl_2f(mm2);         /*                    ^(Az*Bxy-Bz*Axy) */ \
    _negl_2f(mm0);         /*                    ^(Ax*Bxy+Bz*Azw) */ \
    _add_2f(mm3, mm2);  /* Cxy = Aw*Bxy+Bw*Axy + ^(Az*Bxy-Bz*Axy) */ \
    _sub_2f(mm1, mm0);  /* Czw = Ay*Bxy+Bw*Azw - ^(Ax*Bxy+Bz*Azw) */ \
    _xy(Qr) = mm3; \
    _zw(Qr) = mm1; \
}


Since the code is asm-like, it is no wonder that it results in the equivalent machine code inside my benchmark loop :

.stabn 68,0,352,LM124-__Z8testLoopv
LM124:
	movq	mm2, QWORD PTR [ebx+32]	 #  mm0,  .fxy
	movq	mm1, QWORD PTR [ebx+40]	 #  mm2,  .fzw
	movq	mm7, QWORD PTR [edi+32]	 #  mm5,  .fxy
	movq	mm3, QWORD PTR [edi+40]	 #  mm6,  .fzw
	movq	mm5, mm2	 #  mm1,  mm0
	movq	mm4, mm1	 #  mm3,  mm2
	movq	mm6, mm2	 #  mm4,  mm0
	movq	mm0, mm3	 #  mm7,  mm6
/APP
	punpckldq  mm2, mm2	 #  mm0,  mm0
	punpckldq  mm1, mm1	 #  mm2,  mm2
	punpckhdq  mm5, mm5	 #  mm1,  mm1
	punpckhdq  mm4, mm4	 #  mm3,  mm3
	punpckldq  mm3, mm3	 #  mm6,  mm6
	punpckhdq  mm0, mm0	 #  mm7,  mm7
/NO_APP
	pfmul	mm1, mm7	 #  mm2,  mm5
	pfmul	mm6, mm3	 #  mm4,  mm6
	pfmul	mm2, mm7	 #  mm0,  mm5
	pfmul	mm3, QWORD PTR [ebx+40]	 #  mm6,  .fzw
	pfmul	mm5, mm7	 #  mm1,  mm5
	pfmul	mm4, mm7	 #  mm3,  mm5
	movq	mm7, QWORD PTR [ebx+40]	 #  mm5,  .fzw
	pfmul	mm7, mm0	 #  mm5,  mm7
	pfmul	mm0, QWORD PTR [ebx+32]	 #  mm7,  .fxy
	pfsub	mm1, mm6	 #  mm2,  mm4
	pfadd	mm2, mm3	 #  mm0,  mm6
	pfadd	mm5, mm7	 #  mm1,  mm5
	pfadd	mm4, mm0	 #  mm3,  mm7
	pswapd	mm1, mm1	 #  mm2,  mm2
	pswapd	mm2, mm2	 #  mm0,  mm0
	movq	mm0, QWORD PTR _ctNegLo	 #  ctNegLo
/APP
	pxor mm1, mm0	 #  mm2
	pxor mm2, mm0	 #  mm0
/NO_APP
	pfadd	mm4, mm1	 #  mm3,  mm2
	pfsub	mm5, mm2	 #  mm1,  mm0
	movq	QWORD PTR [esi+32], mm4	 #  .fxy,  mm3
	movq	QWORD PTR [esi+40], mm5	 #  .fzw,  mm1
LBE61:
.stabn 68,0,353,LM125-__Z8testLoopv
LM125:
	add	ebx, 64	 #  pIn0
	add	edi, 64	 #  pIn1
	add	esi, 64	 #  pOut
LBE59:
.stabn 68,0,354,LM126-__Z8testLoopv
LM126:
	cmp	esi, DWORD PTR [ebp-424]	 #  pOut,  pEnd
	jb	L39



[edited by - Charles B on June 6, 2004 6:08:34 AM]
"Coding math tricks in asm is more fun than Java"
Now some comments about the previous code sample (quat mul).
I'll present this as a Q&A since I expect some criticism :

Q : Why write it in asm style (layer A) ?
- It's not a necessity. Anyone wanting to create this class might (must) start with a hardware-independent routine, and write a compatible version with xVector operations. For instance, quaternion multiplication can be expressed with one cross product, one dot product and two scalings. It can be written in two minutes. Once that's done, your function works with any compiler, system or hardware.
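That portable version might look like the following sketch (the struct and function names are illustrative, not the library's actual xVector/xQuaternion API):

```cpp
#include <cassert>

// q = (v, w) with v the vector part. The product is:
//   (w1*v2 + w2*v1 + v1 x v2,  w1*w2 - v1.v2)
// i.e. two scalings, one cross product, one dot product.
struct Quat { float x, y, z, w; };

Quat qMul(const Quat &a, const Quat &b) {
    Quat r;
    // vector part: w1*v2 + w2*v1 + cross(v1, v2)
    r.x = a.w * b.x + b.w * a.x + (a.y * b.z - a.z * b.y);
    r.y = a.w * b.y + b.w * a.y + (a.z * b.x - a.x * b.z);
    r.z = a.w * b.z + b.w * a.z + (a.x * b.y - a.y * b.x);
    // scalar part: w1*w2 - dot(v1, v2)
    r.w = a.w * b.w - (a.x * b.x + a.y * b.y + a.z * b.z);
    return r;
}
```

Written with generic vector operators like this, it compiles anywhere; the SIMD-specific version below only exists to beat it on speed.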

Alas, due to the swizzling in the original formula, I benchmarked it for 3DNow and it gave around 40 cycles, which is a bit slower than the floating-point version. This means a more accurate and specific version is required for VSI64 (3DNow) and SSE. My predicate is that 3DNow should always be at least equal to floating point. For any routine, ANSI < 3DNow < SSE. In practice, 3DNow will win on average with functions like = (move), +, -, * (scaling) and even the dot product. Same thing for SSE.


Q : If one has to write this kind of asm-like code, why is it any better than the traditional ways, using intrinsics or inline asm ?

- Inline asm syntax is very different between Visual and gcc. It's highly inefficient for small routines in Visual: you are obliged to call functions or freeze register roles. C intrinsics leave far more optimization opportunities to the compiler.

- My "intrinsics" are compiler independent and normalized. The same code works with Visual C++ or gcc.

- The instruction sets are extended. For instance, swap_2f exists (simulated) in 3DNow (K6) while it does not exist in the original instruction set, only in 3DNowExt (Athlon).

- I have corrected the slow builtins of the MMX instructions in gcc.

- If another 2x-float SIMD exists apart from 3DNow, the code will also work for that platform. This argument is more relevant for AltiVec and SSE, merged as a single technology : VSI128.
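To make the "normalized intrinsics" point above concrete, here is a hedged sketch of the dispatch idea (macro and type names are illustrative, not the library's real headers; only the portable fallback branch is real code here):

```cpp
#include <cassert>

// One normalized macro name per virtual instruction; the preprocessor
// picks a compiler/ISA-specific body. Under gcc with 3DNow enabled the
// macro would expand to a builtin instead (placeholder branch below).
struct v2f32 { float x, y; };

#if defined(USE_3DNOW_GCC)
  /* would expand to a gcc 3DNow builtin, e.g. a pfadd, here */
#else
  // portable ANSI fallback: add the two lanes component-wise
  #define _add_2f(a, b) ((a).x += (b).x, (a).y += (b).y)
#endif
```

User code writes `_add_2f(a, b);` once and gets whichever body the build selects, which is why the same source works with Visual C++ or gcc.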
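And the swap_2f extension mentioned above, sketched on a plain 64-bit integer to show the bit movement that pswapd (3DNowExt) performs in one instruction and that must be simulated on a plain K6 (this scalar version is only an illustration of the data movement, not the library's real implementation):

```cpp
#include <cassert>
#include <cstdint>

// Exchange the two 32-bit halves of a 64-bit packet, i.e. swap the
// two floats of a v2f32: shift each half to the other side and or them.
uint64_t swap_2f_scalar(uint64_t packet) {
    return (packet >> 32) | (packet << 32);
}
```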


[edited by - Charles B on June 7, 2004 8:26:43 AM]
"Coding math tricks in asm is more fun than Java"
I want to comment to bump this thread up, but really most of what you're saying is way over my head. Don't take the lack of discussion to mean lack of interest, just lack of understanding!
Well, I am checking for a CVS at SourceForge, and there will be HTML docs explaining it all there. It's true there are a few concepts specific to this library.

Someone who already knows the Intel intrinsics should not be astonished by the lower layers. And the C++ layer is really typical of linear algebra classes.

That should not be too difficult to handle for most users with tutorials and samples.
"Coding math tricks in asm is more fun than Java"
quote:Original post by mrbastard
I want to comment to bump this thread up, but really most of what you're saying is way over my head. Don't take the lack of discussion to mean lack of interest, just lack of understanding!


Same thing here

I hope that soon something will be here which even I can use
AngelForce--"When I look back I am lost." - Daenerys Targaryen

