Charles B

Horse power math lib (2)


Preamble:
- "Horse power" is not the real name; it will be disclosed at the release of the product, to avoid trouble.
- It has been discussed previously here.
- Some details from that discussion are now deprecated.

Intro:

This math lib is still under development, but I may release the first public rushes and launch a web site soon, or at least send parts of the code to those who wish to test a few things. I still consider it a prototype, in case I need to modify a few details after the discussions here. I hope to find contributors soon because the task is enormous.

Briefly, it is an open source, portable C/C++ and ultimately fast math lib, specially designed for the needs of game and 3D programmers.

Currently implemented or tested, entirely or partially, for:
- Windows, Linux, MacOSX
- Visual C++
- gcc (DevCpp, Project Builder (Mac))
- ANSI C, 386, 3DNow, SSE, Altivec

The initial observation is that once every high level optimization is done, what remains is thousands of math functions everywhere in 3D code: in the physics, in the graphics, in the AI. Processors have not unleashed their full power for years because SIMD is not standardized. On one hand, knowing how most classes are typically written and how compilers or vectorizers work, potential 2x to 10x speedups are common. My thousands of benchmarks say that the legend that compilers are so wise that they will do everything for you is false; they need to be firmly controlled. On the other hand, writing asm code or using specific intrinsics (gcc builtins) is a pain and only works for one target hardware. Moreover, people who really know how to schedule asm code and know enough math tricks to actually optimize code are rare.

So the goal is to generate optimal code, equal to or even better than what an assembly coder may produce, and far better than what vectorizers or compilers can do alone. However, it is not a competition; it is complementary.

My library uses the intrinsics and some of the free libraries and resources provided by hardware vendors, and it could certainly also benefit from some features of the vectorizers. To me they are primarily compilers that take special care of SIMD. In short, here is how it works:

A) VSIMD : the lowest level, the first abstraction layer
----------------------------------------------------------
- It is the normalized equivalent of asm, intrinsics or gcc builtins.
- The various compilers are tweaked; for instance, some inefficient gcc builtins are replaced by inline asm.
- I normalize the common SIMD instruction sets.
- I expand them to their least common multiple (the superset of their features).
  ex1 : madd_4f exists for SSE, even though it is normally Altivec specific.
  ex2 : swap_2f exists even for a 3DNow target, even though it is a 3DNowExt feature.
- I add type checking.
- I handle various default bits of precision per file at compile time.
  ex3 : xDEFAULT_PRECISION(14) before the headers lets rsqrt_2f() use 3DNow pfrsqrt without the Newton-Raphson iteration.
- There is a C++ wrapper with overloaded operators. It works at full speed in release (sometimes even faster than the C intrinsics !?).

Examples :
// C
v2f32 e,f,g;
e=unpackl_2f(f,g);
// C++
v4u16 a,b,c,d;
a=b+c*d; // equivalent to a=madd_4u16(c,d,b);

B) VSIMD - linear algebra
--------------------------
- Low level, C and macro functions :
ex4 :
vDecl(A); // A very special model to force temp data into registers
vBegin();
vQSlerp(A, vLd(*pIn1), vLd(*pIn2)); // Quaternion slerp
VQMul(A, A, vLd(*pIn3)); // Quaternion mul
vEnd();

C) The C++ linear algebra library
----------------------------------
There is namespace protection, which can be disabled with an environment macro. This layer is fully portable, from ANSI C to SSE2. There is a special class called xScalar that replaces float for more speed.
ex5:
// Collision detection or culling probably.
xAABox Box;
xPoint A,B,C;
...
xPlane N(A,B,C);
xScalar d = N*Box; // smallest distance plane, box
// This is very speedy with SIMD code !!!
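For readers who have not met the plane/box test used in ex5, here is a minimal scalar sketch of one common way to compute it, in plain C++ with no SIMD. The names Vec3, Plane, AABox and planeBoxDistance are hypothetical and are not the library's API; the real operator on xPlane and xAABox may be defined differently (it might return a signed distance, for instance). The sketch assumes a unit-length plane normal.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical stand-ins for illustration only; not the library's types.
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };          // points p with dot(n, p) + d == 0, n unit length
struct AABox { Vec3 center, halfExtent; };  // axis-aligned box

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Smallest distance between the plane and the box; 0 if they intersect.
inline float planeBoxDistance(const Plane& p, const AABox& b) {
    // Signed distance from the box center to the plane.
    const float s = dot(p.n, b.center) + p.d;
    // Radius of the box once projected onto the plane normal.
    const float r = b.halfExtent.x * std::fabs(p.n.x)
                  + b.halfExtent.y * std::fabs(p.n.y)
                  + b.halfExtent.z * std::fabs(p.n.z);
    return std::max(0.0f, std::fabs(s) - r);
}
```

The whole test is a few absolute values, multiplies and adds across three components, which is the kind of work that packs well into SIMD registers.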
ex6: (benchmarking the C++ quaternion multiplication)
t=xReadTimeStampCounter();
for(i=0; i < n; i+=4){
   pQout[ i+0 ] = pQin0[ i+0 ] * pQin1[ i+0 ];
   pQout[ i+1 ] = pQin0[ i+1 ] * pQin1[ i+1 ];
   pQout[ i+2 ] = pQin0[ i+2 ] * pQin1[ i+2 ];
   pQout[ i+3 ] = pQin0[ i+3 ] * pQin1[ i+3 ];
}
t=xReadTimeStampCounter()-t;
...
       
Gives 18 cycles per output with gcc and 24 with Visual C++ 6.0, for an x3DNOWEXT target on an Athlon, and 28 in xANSI (exploits the FPU). Try to beat this with the library you use ! This proves that 3DNow or SSE can beat the FPU, even when there is a lot of swizzling (the quaternion multiplication contains a cross product).

D) SoA (Structure of Arrays)
---------------------------
mint8, mint16, etc... mfloat contains 1, 2 or 4 packed floats. It inherits from float, v2f32, or v4f32. Templates are used in C++.
ex7:
struct xALIGN(16) mySoA {
    mfloat x,y,z;
    // your methods here
};

void myFunc(mySoA *a, const mySoA *b, const mySoA *c, int n)
{
    int i;
    for(i=0; i < n; i++){
        a[ i ].x = (b[ i ].x+c[ i ].x)/c[ i ].z;
        a[ i ].y = (b[ i ].y+c[ i ].y)/c[ i ].z;
        a[ i ].z = 0;
    }
}
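
To show what the SoA idea in ex7 means in practice, here is a rough portable sketch where each struct element carries four points at once. Float4, splat, PointSoA4 and addAndDivide are hypothetical names chosen for this illustration; they only imitate the 4-wide case of mfloat with a plain array, whereas a SIMD build would presumably map each Float4 to one hardware register.

```cpp
#include <cstddef>

// Hypothetical portable stand-in for a 4-wide mfloat.
struct Float4 {
    float lane[4];
};

inline Float4 operator+(const Float4& a, const Float4& b) {
    Float4 r;
    for (int k = 0; k < 4; ++k) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

inline Float4 operator/(const Float4& a, const Float4& b) {
    Float4 r;
    for (int k = 0; k < 4; ++k) r.lane[k] = a.lane[k] / b.lane[k];
    return r;
}

// Broadcast one scalar into all four lanes.
inline Float4 splat(float v) { return Float4{{v, v, v, v}}; }

// Each element holds x, y and z for four points at once.
struct alignas(16) PointSoA4 {
    Float4 x, y, z;
};

// Same computation as ex7, applied to n packets of four points.
void addAndDivide(PointSoA4* a, const PointSoA4* b, const PointSoA4* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i].x = (b[i].x + c[i].x) / c[i].z;
        a[i].y = (b[i].y + c[i].y) / c[i].z;
        a[i].z = splat(0.0f);
    }
}
```

The user-level loop reads exactly like scalar code; only the element type decides how many points are processed per iteration.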
     
I will gladly answer any comment, question or suggestion. [edited by - Charles B on June 8, 2004 11:34:35 AM]

Guest Anonymous Poster
In your previous thread you mentioned concern about getting people to actually use it once you distribute it. The best way would probably be to use your math lib to speed up some existing lib that people frequently use... like ODE or something. Re-write the low-level solver with your own math routines, and then you can say "Here, look, Horsepower ODE is a drop-in replacement for ODE except that it's 20% faster!"

If you could do something like that, then people would take notice.

I figure this thread can't be left with just two replies.

So here are a few questions:
- How feasible will it be to implement low level parts of your lib into an existing program ? I guess optimizing a few chosen functions wouldn't be too much work, but what about substituting all maths code with something from your lib ?

- How many/which higher-level constructs will be available ?

- Using your lib I'll get multiple executables if I wanted my application to take advantage of for example SSE2 and I still wanted it to run on something like a PII, right ?

[edited by - Eternal on June 5, 2004 12:58:27 PM]

@Anonymous (1)

that people frequently use... like ODE or something.

Right, it's precisely something I had in mind, especially as I find the ODE initiative remarkable. I have coded physics for games as a pro. In the past I wanted to make a physics engine, and I found that someone else had already started an open source one.

The library has been specially designed with physics in mind: collision detection, constraint solvers, integrators, ... I consider that physics is not at the level of graphics today, and people do not consider the CPU enough. Everything is oriented towards the GPUs and their shaders. But I predict that future standards in video games will use a lot of CPU power, and use it wisely.

Since I am quite experienced in math and physics, I find your estimate quite accurate. I suppose that in a 3D program where speed is highly conditioned by the physics (constraints or many collisions), where the CPU (not the GPU) mostly determines the FPS, 20% is realistic for 3DNow and 30%-40% for SSE. I assume here that the code was already "well" written with floating point; otherwise the gains can be much higher. "Well" means with the right compiler settings, actual inlining, etc...

But to do this I must complete my high level classes (AABox and so on), which is not the most difficult part considering how I have designed the project.

The painful stuff is in the A layer. That's why I'd like to find people to help me implement and test layer A) for various systems, hardware and compilers.

My priority atm is to expand the compatibility base of this layer. I want to validate the cross-platform premise as soon as possible. Nonetheless, it's true I could use my windows-gcc (DevCpp) version and try it with the ODE code.

If anyone has a performance demo that uses ODE, that could serve as a reference for benchmarks, this would help.





[edited by - Charles B on June 5, 2004 1:33:38 PM]

Holy cow, this seems like a huge effort, and you really seem to know what you are talking about.
However, you didn't seem to answer eternal's questions, which are also my own:

1) Who is this library aimed at? I mean, is this designed for mortal game/3D programmers such as myself who don't really know much about low level optimizations? If I have to know how a processor works at the assembly level in order to use this thing, I would probably rather stick with my ghetto 3D math stuff.

2) How hard is it going to be to actually use this library? If you have to do jumping jacks to get it to work, then most people won't bother. For example, don't you have to check for the existence of built-in hardware optimizations such as SIMD before you can use it, and if it doesn't exist, somehow the code has to fall back on a default mechanism so that slightly older computers can also run your application?



[edited by - shadow12345 on June 5, 2004 1:50:53 PM]

Hey again Charles B

This math lib of yours sounds totally awesome!
Here are some questions for you:

1) Why don't you create a project for it at sourceforge.net?

2) Can you post some examples that compare source code to asm output?

3) Will this mean we must compile a separate executable for each target architecture?

Sounds very cool.

How much syntactic sugar will you be pouring on? I use a little math lib called math3d, and it's very easy to use as all the classes have lots of overloaded operators. If you haven't done too much of that yet, maybe it's an area where programmers who don't understand the asm bit could help out?

quote:
Original post by Charles B
If anyone has a performance demo that uses ODE, that could serve as


the ODE distribution has a test_ode.exe - it seems to be just a console app that tests the timing of the math code ODE uses, so no visuals but might be useful for you to benchmark your lib against the ODE math code.



@Eternal
but what about substituting all maths code with something from your lib ?

It's a more common situation when the library predates the project. The intrinsic nature of SIMD requires it to be planned for from the start of a project to be used at its best. That's why vectorizers are popular: people do not have to change their habits. Nonetheless, here is how I would try to port an existing 3D engine :

- search/replace the high level classes (vectors, matrices, etc...). Or replace your previous header by a wrapper with typedefs. Profile.

- now try to port the most time-consuming specific routines. Hope they are not too big. For instance, some special AI routine you use based on huge data sets. Analyse whether you might change your structures to make them SoA compatible, or whether you can find something useful already in the lib.

Example : image processing routines, FFT, special collision detection routines.

- possibly propose and contribute your routine to the lib if it is generic enough, because someone might do the same for another routine, and you might benefit from it. It's not a direct answer, but as you know it's a side benefit of the Open Source system. You do not need to be a communist for that.

So all in all, porting preexisting code is not tedious, and it should give many benefits too. I already answered the ODE question; it might be a good proof if I can do it myself. I can also publish a report about this work and the problems encountered.
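
The first porting step, replacing an engine's old math header with a typedef wrapper, could look roughly like the sketch below. Everything in it is hypothetical: the hp namespace, the member layout of xVector, and the legacy names Vector3 and Vec3Dot are invented for illustration and do not come from the library.

```cpp
// --- What the math library might provide (hypothetical names) ---
namespace hp {
struct xVector {
    float x, y, z;
};
inline xVector operator+(const xVector& a, const xVector& b) {
    return xVector{a.x + b.x, a.y + b.y, a.z + b.z};
}
inline float dot(const xVector& a, const xVector& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
}  // namespace hp

// --- The engine's old math header, reduced to a thin wrapper ---
typedef hp::xVector Vector3;  // the old engine name now aliases the library type
inline float Vec3Dot(const Vector3& a, const Vector3& b) {
    return hp::dot(a, b);
}

// --- Existing engine code keeps compiling unchanged ---
inline float lengthSquared(const Vector3& v) {
    return Vec3Dot(v, v);
}
```

Once the wrapper is in place, profiling shows which remaining routines are worth porting by hand.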

- How many/which higher-level constructs will be available ?

Frankly, I plan to have 95% of the common classes and routines one can use in a game. It's not the first math lib I have written, so I know the most efficient formulas and tricks for nearly everything that might be discussed in the graphics or math forums here. But my strategy is to implement the most widely used classes and methods first, so that people start to test the lib and find the motivation to contribute. The ultimate goal is to transform the library into an encyclopedia.

One of the key 'quasi-philosophical' concepts of this lib is to find the right balance between human and silicon intelligence to reach the most optimized code possible with the least effort.

What will come first (non exhaustive) is :
xVector, xPoint, xPlane, xAABox, xQuat, xMatrix, xSphere, etc...

I have decided to focus on xAABox and xQuat first, because the first uses packed comparisons (favors SIMD) and the second has a lot of swizzling (favors a 387 compatible FPU). This lets me see the effects of compiler settings and how my lower layers behave in practice.
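
To make the swizzling remark concrete, here is a plain scalar sketch of the quaternion (Hamilton) product. Quat and mul are hypothetical names for this illustration and are not the library's xQuat. Note how every output component mixes all four components of both inputs in a different order, and how the last two terms of x, y and z form a cross product; on a SIMD unit that means shuffles before each multiply.

```cpp
// Hypothetical scalar quaternion; w is the real part.
struct Quat { float x, y, z, w; };

// Hamilton product a * b.
inline Quat mul(const Quat& a, const Quat& b) {
    Quat r;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y + a.y * b.w + a.z * b.x - a.x * b.z;
    r.z = a.w * b.z + a.z * b.w + a.x * b.y - a.y * b.x;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    return r;
}
```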


- Using your lib I'll get multiple executables if I wanted my application to take advantage of for example SSE2 and I still wanted it to run on something like a PII, right ?

There are two main strategies :
- one product (dll or exe) per target hardware.
- or use of "CPUID dynamic linking". It's not about dlls, else go back to strategy one. It happens inside your exe : CPUID is run at startup, and virtual tables are filled so that medium and big functions (ex: array processing) are dispatched to the version matching the detected CPU. Small functions use the compatible hardware you selected by default.

There are two parameters you can control at compile time :

- the minimum compatible technology you target. It might be x387, x3DNowExt or xSSE. Inline or macro functions will finally point to this version

- a local, file specific, technology target
This way you can also implement 'CPUID dynamic linking' for some of your most important routines.

example :

File 1:
// myInnerLoopSSE.c
// The same code for SSE and Altivec here
#define xCPU_128
// If you use the compatible layers, you write
// only one source
// #include "routine.h"

File 2:
// myInnerLoopAthlon.c
#define xCPU_3DNowExt
// #include "routine.h"

File 3:
// myInnerLoop.c
#define xCPU_ANSI
// #include "routine.h"

Then you'll have goodies to make your own virtual tables in C, or goodies in C++.
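
As an illustration of what such a hand-made virtual table might look like, here is a small sketch. It is not the library's mechanism: CpuLevel, KernelTable, makeKernelTable and the scale* kernels are hypothetical names, and the three kernels share one plain C++ body so the sketch runs anywhere. In a real build each kernel would presumably live in its own file, compiled with a different target macro as in the three files above.

```cpp
#include <cstddef>

// Hypothetical CPU levels, loosely mirroring the xCPU_* macros above.
enum class CpuLevel { Ansi, ThreeDNowExt, Simd128 };

// One entry per routine that is worth dispatching at run time.
struct KernelTable {
    void (*scale)(float* data, std::size_t n, float factor);
};

// Stand-ins for the per-target versions of the same routine.
void scaleAnsi(float* d, std::size_t n, float f) {
    for (std::size_t i = 0; i < n; ++i) d[i] *= f;
}
void scale3DNowExt(float* d, std::size_t n, float f) { scaleAnsi(d, n, f); }
void scaleSimd128(float* d, std::size_t n, float f)  { scaleAnsi(d, n, f); }

// Fill the table once at startup from the detected CPU level.
KernelTable makeKernelTable(CpuLevel level) {
    switch (level) {
        case CpuLevel::Simd128:      return KernelTable{&scaleSimd128};
        case CpuLevel::ThreeDNowExt: return KernelTable{&scale3DNowExt};
        case CpuLevel::Ansi:         break;
    }
    return KernelTable{&scaleAnsi};
}
```

The indirect call costs a little, which fits the advice above to dispatch only medium and big functions and to let small inline functions use the compile-time default.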


[edited by - Charles B on June 5, 2004 2:27:27 PM]

@shadow12345
Holy cow, this seems like a huge effort.
Yep, already more than 6 months struggling with Visual, gcc, MacOSX, docs, etc... But I am now paid back in clock cycles, especially with gcc. I cannot beat my own C++ code anymore with inline asm.

1) Who is this library aimed at ?
You. At least I hope.
mortal game developer
I consider that the typical user level really starts with the C++ classes. You can see from the plane/box sample that it's a very standard and common syntax.

If I have to know how a processor works at the assembly level in order to use this thing.
The pain is for the contributors : the closer they are to level A, the closer their knowledge has to be to that of an asm coder. However, this remains easier than pure asm in most cases.


2) How hard ... jumping jacks to get it to work

The default options (environment macros) let you work as usual.
You can still use special floating point routines beside the technology if you don't have everything you need in the C++ classes.

check for the existance of built-in hardware optimizations such as SIMD ...

If you choose a default compatible hardware above xANSI or x387, the only thing you have to do is assert it where your program starts (main, WinMain, or your own entry point). CPUID is part of the lib. The form is if( !xIsCPUCapable(xSSE) ) exit(1); for instance. You have to do this anyway whenever you use SIMD, even if you do not use my lib and only use compiler settings or a vectorizer. You must tell the user that he downloaded the wrong version or that your software has some requirements. To me, that is fairer to the end user.
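
Outside the library, the same startup check can be sketched with a compiler builtin. This is an assumption-laden illustration, not the library's xIsCPUCapable: it relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, which exist only on x86 targets, and it conservatively reports "no SSE" everywhere else. The names cpuHasSse and checkCpuRequirement are hypothetical.

```cpp
#include <cstdio>

// Assumption: GCC or Clang on x86. On other compilers or CPUs this
// sketch conservatively reports that SSE is not available.
inline bool cpuHasSse() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") != 0;
#else
    return false;
#endif
}

// Returns true if the program may continue; prints a message otherwise.
inline bool checkCpuRequirement(bool requiresSse, bool hasSse) {
    if (requiresSse && !hasSse) {
        std::fprintf(stderr, "This build requires a CPU with SSE support.\n");
        return false;
    }
    return true;
}

// Typical use at the top of main():
//   if (!checkCpuRequirement(true, cpuHasSse())) return 1;
```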

Somehow the code has to fall back on a default mechanism so that slightly older computers can also run your application ?
It does if you choose the "one product, multiple hardwares" option. I'll probably set it as the default to avoid noobish insults. I explained it in more detail elsewhere before.
