Horse power math lib (2)

Started by
16 comments, last by Charles B 19 years, 10 months ago
@Melekor
Hi, happy to see you back.

1) Why don't you create a project for it at sourceforge.net ?

Because with time I felt the starting job, namely layer A, was too complex and hazardous to build as a team. Anyone working with me would have gone mad seeing layer A constantly rewritten. It took a lot of time before I could find a stable way to design the lowest level, which is the key to the upper abstraction layers, considering the multiple constraints: compilers, compiler options, systems, CPUs, ultimate performance, side effects, syntax, user options.

Now I am proud that this works perfectly with gcc. The bases are sound and validated to me. So this may effectively be the time to make it actually open source, not for public users, since it's not the 1.0, but for testers and contributors. I am also a noob at handling CVS projects, so some help with that would be welcome.

2) Can you post some examples that compare source code to asm output ?

I'll put that > here < tonight (in one hour) because the code is on another PC and this one freezes when both the modem and a USB key are plugged in. Dinner time anyway.

3) Will this mean we must compile a separate executable for each target architecture ?
It's a generic problem, not only for this library, apart from the C# solution. But I doubt C# will ever compete with performant C++ implementations. Since speed is the target, a separate executable per architecture is my preferred option. But there are many other options, based on virtual tables, which I already explained here before.
"Coding math tricks in asm is more fun than Java"
how much syntactic sugar will you be pouring on?
Well, SIMD standardization stinks; it had to be that way. If ISO had guided the vendors and compiler writers more, this would have been less painful. Anyway, the truly user-level part is just the C++ stuff. But math also comes into play. Ever had a brain attack during a math course ? Hehe

classes have lots of overloaded operators. If you haven't done too much of that yet, maybe it's an area where programmers who don't understand the asm bit could help out ?

Truly, that's why it's not 1.0. I preferred to concentrate on the vital and harder A layer. But with the basic vector classes, most 3D classes have enough operators available to be implemented with the fully portable layers.

I have spent a lot of time layering the architecture, keeping in mind how different people may cooperate. The perspective is a kind of big tree, which would transform the library into an "encyclopedia". But it does not need to reach its leaves at the start. I have the seed and its DNA done. It just has to grow balanced now.

Yes, there can be contributors of different levels. You can write a new class with operators, etc., as usual. You don't need to bother with specific implementations (v2f32 stuff) at the start. Later there should be enough users around to add more and more specialized implementations of methods and routines to make this class full speed.

For those interested, I could develop the implementation of quaternion multiplication on 3DNow to show why vectorizers practically cannot do it for us. The complexity of reasoning needed to find the best code is not what an automaton likes. This shows why the human touch and the open source strategy are relevant. The number of common routines to write is not that big for many hands. What counts is having a code base able to merge the efforts efficiently. And once they are done, many people will benefit from huge speedups painlessly.

I'll probably organize speed "competitions" to refine some important routines. Maybe some hardware vendors could be interested in giving small prizes.

ODE distribution has a test_ode.exe
OK, very well, thanks. I'll also try to contact the author. Maybe we can make a win/win deal between open source authors.

"Coding math tricks in asm is more fun than Java"
quote:Original post by Charles B
ODE distribution has a test_ode.exe
OK, very well, thanks. I'll also try to contact the author. Maybe we can make a win/win deal between open source authors.


It might be best if you join the mailing list and check out the comments people have made in this thread. The topic of using an external maths lib has recently been discussed, but I don't think anyone's going ahead with it.

[teamonkey] [blog] [tinyminions]
Most linear algebra routines can exploit layer B (or C if you want to test your routine quickly in C++ only).
Example (C++) :
xScalar xAABox::DistPlane(const xPlane &N)
{
    // _Pmin, _Pmax are the two members of xAABox
    xVectori Pos = cmpgt(N, 0);
    xVectori Neg = cmple(N, 0);
    xVector nearestVertex = (_Pmin & Pos) | (_Pmax & Neg);
    return nearestVertex * N;
}

will result in 2 pfcmp, 2 pand, 1 por, and the dot product, which is implemented differently in xANSI, x3DNOW, xSSE, etc.
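For readers without the SIMD background, here is a scalar sketch of what the mask selection above computes (the function name and the three-float plane representation are assumptions for illustration, not the library's actual API):

```cpp
#include <cassert>

// Where a normal component is positive, the cmpgt mask selects the _Pmin
// component, otherwise the cmple mask selects the _Pmax component; the
// selected vertex is then dotted with the normal.
float distPlaneScalar(const float Pmin[3], const float Pmax[3], const float N[3]) {
    float dot = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float nearest = (N[i] > 0.0f) ? Pmin[i] : Pmax[i];  // branch = the SIMD mask
        dot += nearest * N[i];
    }
    return dot;
}
```

The SIMD version does the same selection branch-free with compare masks and bitwise ops, which is why it costs only five packed instructions plus the dot product.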


Now this shows the source implementation of quaternion multiplication for 3DNow. For actual efficiency this function requires going back to VSI64 (Virtual SIMD 64-bit) asm-style code: the swizzling of the components in the formula requires taking the size of the packets into account, and coded with pure linear algebra the code would be less efficient, so I preferred the asm style. Note that
_upkl_2f(a, b); (asm-style macro function) is the same as
a = unpackl_2f(a, b);
#define vQMul_64(Qr, Qs, Qt) \
{ \
    register v2f32 mm0, mm1, mm2, mm3; \
    register v2f32 mm4, mm5, mm6, mm7; \
    mm0 = _xy(Qs); mm2 = _zw(Qs); \
    mm5 = _xy(Qt); mm6 = _zw(Qt); \
    mm1 = mm0; mm3 = mm2; mm4 = mm0; mm7 = mm6; \
    _upkl_2f(mm0, mm0); /* Ax (A is Qs) */ \
    _upkl_2f(mm2, mm2); /* Az */ \
    _upkh_2f(mm1, mm1); /* Ay */ \
    _upkh_2f(mm3, mm3); /* Aw */ \
    _upkl_2f(mm6, mm6); /* Bz (B is Qt) */ \
    _upkh_2f(mm7, mm7); /* Bw */ \
                           /* Left thread         Right thread    */ \
    _mul_2f(mm2, mm5);     /*                     Az*Bxy          */ \
    _mul_2f(mm4, mm6);     /*                     Bz*Axy          */ \
    _mul_2f(mm0, mm5);     /*                     Ax*Bxy          */ \
    _mul_2f(mm6, _zw(Qs)); /*                     Bz*Azw          */ \
    _mul_2f(mm1, mm5);     /* Ay*Bxy                              */ \
    _mul_2f(mm3, mm5);     /* Aw*Bxy                              */ \
    _mov_2f(mm5, _zw(Qs)); /* Azw                                 */ \
    _mul_2f(mm5, mm7);     /* Bw*Azw                              */ \
    _mul_2f(mm7, _xy(Qs)); /* Bw*Axy                              */ \
    _sub_2f(mm2, mm4);     /*                      Az*Bxy-Bz*Axy  */ \
    _add_2f(mm0, mm6);     /*                      Ax*Bxy+Bz*Azw  */ \
    _add_2f(mm1, mm5);     /* Ay*Bxy+Bw*Azw                       */ \
    _add_2f(mm3, mm7);     /* Aw*Bxy+Bw*Axy                       */ \
    _swap_2f(mm2, mm2);    /*                      Az*Byx-Bz*Ayx  */ \
    _swap_2f(mm0, mm0);    /*                      Ax*Bxy+Bz*Azw  */ \
    _negl_2f(mm2);         /*                    ^(Az*Bxy-Bz*Axy) */ \
    _negl_2f(mm0);         /*                    ^(Ax*Bxy+Bz*Azw) */ \
    _add_2f(mm3, mm2);  /* Cxy = Aw*Bxy+Bw*Axy + ^(Az*Bxy-Bz*Axy) */ \
    _sub_2f(mm1, mm0);  /* Czw = Ay*Bxy+Bw*Azw - ^(Ax*Bxy+Bz*Azw) */ \
    _xy(Qr) = mm3; \
    _zw(Qr) = mm1; \
}


Since the code is asm-like, it is no wonder that it results in the equivalent machine code inside my benchmark loop :

.stabn 68,0,352,LM124-__Z8testLoopv
LM124:
	movq	mm2, QWORD PTR [ebx+32]	 #  mm0,  .fxy
	movq	mm1, QWORD PTR [ebx+40]	 #  mm2,  .fzw
	movq	mm7, QWORD PTR [edi+32]	 #  mm5,  .fxy
	movq	mm3, QWORD PTR [edi+40]	 #  mm6,  .fzw
	movq	mm5, mm2	 #  mm1,  mm0
	movq	mm4, mm1	 #  mm3,  mm2
	movq	mm6, mm2	 #  mm4,  mm0
	movq	mm0, mm3	 #  mm7,  mm6
/APP
	punpckldq  mm2, mm2	 #  mm0,  mm0
	punpckldq  mm1, mm1	 #  mm2,  mm2
	punpckhdq  mm5, mm5	 #  mm1,  mm1
	punpckhdq  mm4, mm4	 #  mm3,  mm3
	punpckldq  mm3, mm3	 #  mm6,  mm6
	punpckhdq  mm0, mm0	 #  mm7,  mm7
/NO_APP
	pfmul	mm1, mm7	 #  mm2,  mm5
	pfmul	mm6, mm3	 #  mm4,  mm6
	pfmul	mm2, mm7	 #  mm0,  mm5
	pfmul	mm3, QWORD PTR [ebx+40]	 #  mm6,  .fzw
	pfmul	mm5, mm7	 #  mm1,  mm5
	pfmul	mm4, mm7	 #  mm3,  mm5
	movq	mm7, QWORD PTR [ebx+40]	 #  mm5,  .fzw
	pfmul	mm7, mm0	 #  mm5,  mm7
	pfmul	mm0, QWORD PTR [ebx+32]	 #  mm7,  .fxy
	pfsub	mm1, mm6	 #  mm2,  mm4
	pfadd	mm2, mm3	 #  mm0,  mm6
	pfadd	mm5, mm7	 #  mm1,  mm5
	pfadd	mm4, mm0	 #  mm3,  mm7
	pswapd	mm1, mm1	 #  mm2,  mm2
	pswapd	mm2, mm2	 #  mm0,  mm0
	movq	mm0, QWORD PTR _ctNegLo	 #  ctNegLo
/APP
	pxor mm1, mm0	 #  mm2
	pxor mm2, mm0	 #  mm0
/NO_APP
	pfadd	mm4, mm1	 #  mm3,  mm2
	pfsub	mm5, mm2	 #  mm1,  mm0
	movq	QWORD PTR [esi+32], mm4	 #  .fxy,  mm3
	movq	QWORD PTR [esi+40], mm5	 #  .fzw,  mm1
LBE61:
.stabn 68,0,353,LM125-__Z8testLoopv
LM125:
	add	ebx, 64	 #  pIn0
	add	edi, 64	 #  pIn1
	add	esi, 64	 #  pOut
LBE59:
.stabn 68,0,354,LM126-__Z8testLoopv
LM126:
	cmp	esi, DWORD PTR [ebp-424]	 #  pOut,  pEnd
	jb	L39



[edited by - Charles B on June 6, 2004 6:08:34 AM]
"Coding math tricks in asm is more fun than Java"
Now some comments about the previous code sample (quat mul).
I'll present this as a Q&A since I expect some criticism :

Q : Why write it in asm style (layer A) ?
- It's not a necessity. Anyone wanting to create this class might (must) start with a hardware-independent routine, and write a compatible version with xVector operations. For instance, quaternion multiplication can be expressed with one cross product, one dot product and two scalings. It can be written in two minutes. Once that's done, your function works with any compiler, system or hardware.
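That portable version might look like the following sketch (the struct and function names are illustrative, not the library's actual xVector/xQuaternion API):

```cpp
#include <cassert>

// q = (v, w) with v the vector part. The product is:
//   (w1*v2 + w2*v1 + v1 x v2,  w1*w2 - v1.v2)
// i.e. two scalings, one cross product, one dot product.
struct Quat { float x, y, z, w; };

Quat qMul(const Quat &a, const Quat &b) {
    Quat r;
    // vector part: w1*v2 + w2*v1 + cross(v1, v2)
    r.x = a.w * b.x + b.w * a.x + (a.y * b.z - a.z * b.y);
    r.y = a.w * b.y + b.w * a.y + (a.z * b.x - a.x * b.z);
    r.z = a.w * b.z + b.w * a.z + (a.x * b.y - a.y * b.x);
    // scalar part: w1*w2 - dot(v1, v2)
    r.w = a.w * b.w - (a.x * b.x + a.y * b.y + a.z * b.z);
    return r;
}
```

Written with generic vector operators like this, it compiles anywhere; the SIMD-specific version below only exists to beat it on speed.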

Alas, due to the swizzling in the original formula, I benchmarked it for 3DNow and it gave around 40 cycles, which is a bit slower than the floating-point version. This means a more accurate and specific version is required for VSI64 (3DNow) and SSE. My predicate is that 3DNow should always be at least equal to floating point. For any routine, ANSI < 3DNow < SSE. In practice, 3DNow will win on average with functions like = (move), +, -, * (scaling) and even the dot product. Same thing for SSE.


Q : If one has to write this kind of asm-like code, why is it any better than the traditional ways, using intrinsics or inline asm ?

- Inline asm syntax is very different between Visual and gcc. It's highly inefficient for small routines in Visual: you are obliged to call functions or freeze register roles. C intrinsics leave far more optimization opportunities to the compiler.

- My "intrinsics" are compiler independent and normalized. The same code works with Visual C++ or gcc.

- The instruction sets are extended. For instance, swap_2f exists (simulated) in 3DNow (K6) while it does not exist in the original instruction set, only in 3DNowExt (Athlon).

- I have corrected the slow builtins of the MMX instructions in gcc.

- If another 2x-float SIMD exists apart from 3DNow, the code will also work for that platform. This argument is more relevant for AltiVec and SSE, merged as a single technology : VSI128.
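To make the "normalized intrinsics" point above concrete, here is a hedged sketch of the dispatch idea (macro and type names are illustrative, not the library's real headers; only the portable fallback branch is real code here):

```cpp
#include <cassert>

// One normalized macro name per virtual instruction; the preprocessor
// picks a compiler/ISA-specific body. Under gcc with 3DNow enabled the
// macro would expand to a builtin instead (placeholder branch below).
struct v2f32 { float x, y; };

#if defined(USE_3DNOW_GCC)
  /* would expand to a gcc 3DNow builtin, e.g. a pfadd, here */
#else
  // portable ANSI fallback: add the two lanes component-wise
  #define _add_2f(a, b) ((a).x += (b).x, (a).y += (b).y)
#endif
```

User code writes `_add_2f(a, b);` once and gets whichever body the build selects, which is why the same source works with Visual C++ or gcc.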
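And the swap_2f extension mentioned above, sketched on a plain 64-bit integer to show the bit movement that pswapd (3DNowExt) performs in one instruction and that must be simulated on a plain K6 (this scalar version is only an illustration of the data movement, not the library's real implementation):

```cpp
#include <cassert>
#include <cstdint>

// Exchange the two 32-bit halves of a 64-bit packet, i.e. swap the
// two floats of a v2f32: shift each half to the other side and or them.
uint64_t swap_2f_scalar(uint64_t packet) {
    return (packet >> 32) | (packet << 32);
}
```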


[edited by - Charles B on June 7, 2004 8:26:43 AM]
"Coding math tricks in asm is more fun than Java"
I want to comment to bump this thread up, but really most of what you're saying is way over my head. Don't take the lack of discussion to mean lack of interest, just lack of understanding!
Well, I am checking for a CVS at SourceForge, and there will be HTML docs explaining it all there. It's true there are a few concepts specific to this library.

Someone who already knows the Intel intrinsics should not be astonished by the lower layers. And the C++ layer is really typical of linear algebra classes.

That should not be too difficult to handle for most users with tutorials and samples.
"Coding math tricks in asm is more fun than Java"
quote:Original post by mrbastard
I want to comment to bump this thread up, but really most of what you're saying is way over my head. Don't take the lack of discussion to mean lack of interest, just lack of understanding!


Same thing here

I hope that soon something will be here which even I can use
AngelForce--"When I look back I am lost." - Daenerys Targaryen

