Archived

This topic is now archived and is closed to further replies.

ArnoAtWork

maths Library


ArnoAtWork    138
I am trying to write a maths library for vectors, matrices and so on, and I would like to include SSE and SSE2 support. The problem is designing the library correctly. I would first like to test whether or not the CPU supports SSE instructions. After that, I would like to call the best implementation for each mathematical operation. But how should I implement that? Should I test in each function, like this:

    Vector3 operator+ (const Vector3& v1, const Vector3& v2)
    {
        if (SSEsupport)
            return addVector3SSE(v1, v2);
        else
            return addVector3C(v1, v2);
    }

Or should I use a virtual interface?

    MathsInterface* ptrOfMathsInterface;
    ...
    if (SupportSSE())
        ptrOfMathsInterface = new MathsInterfaceSSE();
    else
        ptrOfMathsInterface = new MathsInterfaceC();
    ...
    Vector3 operator+ (const Vector3& v1, const Vector3& v2)
    {
        return ptrOfMathsInterface->addVector3(v1, v2);
    }

Thanks a lot. [edited by - arnoatwork on January 22, 2004 5:26:23 PM]

Drag0n    186
Hi!

Do you want to use SSE/SSE2 for optimization or just to get familiar with it? I ask because I think that checking for SSE support every time you do an operation adds more overhead than SSE would save (I might be wrong, I have never really used SSE).

cya,
Drag0n

-----------------------------
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning..." -- Rich Cook

My future web presence: Dave's Programming Resources

ArnoAtWork    138
It's to be used for optimisation.

The problem is finding the best approach. As the program doesn't know what the CPU supports before it runs, I can't use templates or anything "pre-defined". I need something decided at run time...

So apart from testing a flag or using a function pointer (or a virtual class), I don't know what I could use...
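
For the "test once at startup" step mentioned above, a minimal sketch could look like the following. The names CpuFeatures, detect_cpu_features and cpu_features are made up for illustration. The sketch assumes an x86 target and uses the __cpuid intrinsic on MSVC or __builtin_cpu_supports on GCC/Clang; on any other compiler or architecture it simply reports no SSE.

```cpp
#include <cassert>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// Hypothetical result type: which SIMD extensions the running CPU offers.
struct CpuFeatures {
    bool sse;
    bool sse2;
};

// Queries the CPU. Reports "no SSE" on compilers/architectures this
// sketch does not know about.
inline CpuFeatures detect_cpu_features()
{
    CpuFeatures f = { false, false };
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = { 0, 0, 0, 0 };
    __cpuid(regs, 1);                      // CPUID leaf 1; regs[3] holds EDX
    f.sse  = (regs[3] & (1 << 25)) != 0;   // EDX bit 25: SSE
    f.sse2 = (regs[3] & (1 << 26)) != 0;   // EDX bit 26: SSE2
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
    f.sse  = __builtin_cpu_supports("sse") != 0;
    f.sse2 = __builtin_cpu_supports("sse2") != 0;
#endif
    assert(!f.sse2 || f.sse);              // sanity check: SSE2 implies SSE
    return f;
}

// Runs the detection on first use only and caches the answer.
inline const CpuFeatures& cpu_features()
{
    static const CpuFeatures cached = detect_cpu_features();
    return cached;
}
```

The cached result can then feed whatever dispatch mechanism is chosen (flag, function pointer, separate library), so the CPU is queried only once per run.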

Drag0n    186
Well, I would go and implement two dynamic libraries, one with SSE support and one without. Then you check once before loading the corresponding library.

cya,
Drag0n

Drag0n    186
Sure, you could use function pointers. But I would not use virtual functions because they add overhead.
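
A rough sketch of the function pointer idea, reusing the addVector3C / addVector3SSE names from the first post. The Vector3 layout, the AddVector3Fn typedef, g_addVector3 and initMaths are assumptions made for this example. The SSE body is only compiled when the compiler targets SSE; otherwise it falls back to the plain version.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MATHS_HAVE_SSE_BUILD 1
#endif

struct Vector3 { float x, y, z; };   // assumed layout, for illustration

// Plain C++ version.
static Vector3 addVector3C(const Vector3& a, const Vector3& b)
{
    Vector3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

// SSE version: packs each vector into a 128-bit register (4th lane unused).
static Vector3 addVector3SSE(const Vector3& a, const Vector3& b)
{
#ifdef MATHS_HAVE_SSE_BUILD
    const __m128 va = _mm_set_ps(0.0f, a.z, a.y, a.x);
    const __m128 vb = _mm_set_ps(0.0f, b.z, b.y, b.x);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
    Vector3 r = { out[0], out[1], out[2] };
    return r;
#else
    return addVector3C(a, b);        // no SSE intrinsics in this build
#endif
}

typedef Vector3 (*AddVector3Fn)(const Vector3&, const Vector3&);

// Set once at startup; every later call goes through this pointer.
static AddVector3Fn g_addVector3 = 0;

inline void initMaths(bool sseSupported)
{
    g_addVector3 = sseSupported ? addVector3SSE : addVector3C;
}

inline Vector3 operator+(const Vector3& a, const Vector3& b)
{
    assert(g_addVector3 != 0);       // initMaths() must have been called
    return g_addVector3(a, b);
}
```

The CPU test happens once in initMaths, but each operator+ still pays for an indirect call, and the compiler cannot inline through the pointer.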

cya,
Drag0n

Drag0n    186
You're welcome!

cya,
Drag0n

Guest Anonymous Poster   
The trick is that for SSE(/2) to make a worthwhile difference, you need to do a reasonable amount of processing per call (i.e. if you're just doing one packed add instruction per call or something like that, you'd be better off skipping the call and writing it in traditional FPU code).

Jan Wassenberg    999
Not easy. First, the worst drawback: you end up penalizing the machine without SSE (which is the one that needs help the most!), because it takes the normal code path, but with some overhead due to the runtime decision. Are there any hard limits for what the non-SSE machine must accomplish? Can you just decide not to support P-II / plain Athlon or below CPUs, or is it an 'I wanna make it as fast as possible' thing?
With that decision made, we can continue.
What is overhead? Simple: the difference from statically compiled, perfectly inlined code. Any overhead in a math library hurts if it applies to individual operations like addVector3. We can combat this in 3 ways:

1) reduce how often we pay the penalty: choose (via if, function pointer, whatever) SSE / FPU at the algorithm level, i.e. write transform_all_verts_sse and transform_all_verts_fpu, instead of splitting at the multiply_vector level. This is fairly annoying, especially if the algorithm is complex, or there are several such places, but workable.

2) compile your app at install time, taking into account the target system - no more runtime decision at all. This would be ideal IMO, but not everyone has a (good) compiler around.

3) go hardcore: patch your application at runtime, so that the math functions end up where they should. Of course, this ends up non-portable and may not work in the next Windows (*sigh*, if only they would write simple, secure code, instead of locking everything down..), but here we go!
Somehow at init time, we want to replace all occurrences of addVector3() with addVector3SSE(), if the CPU supports it. Our overhead would then consist of a static function call, which is worse than the inlined version, but much better than an indirect call. Here's how to do it: addVector3 is a stub routine, which makes note of the place from which it is called (basically, the instruction before the return address on the stack), and patches this call instruction to point to addVector3SSE or addVector3FPU, according to which CPU is detected.
Is there a way to actually inline the function, instead of paying the cost of a function call and copying args onto the stack? Spoiler: yes. Here's how: addVector3 will patch the function the first time it is called, but it is declared __forceinline, and consists of a call to the patcher function plus padding (enough so that both SSE and FPU versions would fit). The init function doesn't patch the call *site* as before, but replaces each occurrence of itself with the SSE / FPU code. The remaining overhead is executing the padding (which can be cut further by using large instructions instead of NOPs) and reduced instruction cache efficiency (due to the padding); both are small compared to an indirect jump.
Note: both ways may require you to VirtualProtect away write protection.
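
Option 1 above (deciding once per algorithm rather than once per vector operation) might be sketched like this. transform_all_verts_sse and transform_all_verts_fpu are the names used above; the Vec4 / Mat4 layouts, the column-major convention and the transform_all_verts wrapper are assumptions for the example. The SSE routine is only compiled when the build targets SSE.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MATHS_HAVE_SSE_BUILD 1
#endif

struct Vec4 { float v[4]; };
struct Mat4 { Vec4 col[4]; };   // column-major: col[i] is the i-th column

// Scalar path: out[i] = m * in[i] for every vertex.
void transform_all_verts_fpu(const Mat4& m, const Vec4* in, Vec4* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const Vec4 s = in[i];   // copy first so in == out is safe
        for (int r = 0; r < 4; ++r) {
            out[i].v[r] = m.col[0].v[r] * s.v[0] + m.col[1].v[r] * s.v[1]
                        + m.col[2].v[r] * s.v[2] + m.col[3].v[r] * s.v[3];
        }
    }
}

#ifdef MATHS_HAVE_SSE_BUILD
// SSE path: the matrix columns stay in registers for the whole batch.
void transform_all_verts_sse(const Mat4& m, const Vec4* in, Vec4* out, std::size_t n)
{
    const __m128 c0 = _mm_loadu_ps(m.col[0].v);
    const __m128 c1 = _mm_loadu_ps(m.col[1].v);
    const __m128 c2 = _mm_loadu_ps(m.col[2].v);
    const __m128 c3 = _mm_loadu_ps(m.col[3].v);
    for (std::size_t i = 0; i < n; ++i) {
        __m128 r = _mm_mul_ps(c0, _mm_set1_ps(in[i].v[0]));
        r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(in[i].v[1])));
        r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(in[i].v[2])));
        r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(in[i].v[3])));
        _mm_storeu_ps(out[i].v, r);
    }
}
#endif

// One decision per batch, not one per vector operation.
void transform_all_verts(const Mat4& m, const Vec4* in, Vec4* out,
                         std::size_t n, bool useSSE)
{
#ifdef MATHS_HAVE_SSE_BUILD
    if (useSSE) {
        transform_all_verts_sse(m, in, out, n);
        return;
    }
#else
    (void)useSSE;               // this build has no SSE path
#endif
    transform_all_verts_fpu(m, in, out, n);
}
```

The cost of the branch is paid once per call to transform_all_verts, however many vertices are processed, and everything inside each loop can be inlined normally.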

BTW, it's likely that a jump correctly predicted as not taken (your condition isn't going to change, and if the function is called often enough to warrant optimization, it will be in the branch history) will be faster than an indirect jump (be it function pointer or virtual function), but you'll have to test this on your CPU.
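
One rough way to run that test is a small timing harness like the one below. Everything in it (DispatchTiming, compare_dispatch, the add_fpu / add_sse stand-ins) is made up for the example, and it assumes a C++11 compiler for std::chrono. The two stand-in bodies are identical, so an optimizer may collapse the branch entirely; substitute the real routines before trusting any numbers.

```cpp
#include <chrono>
#include <cstddef>

// Stand-ins for the two real code paths (identical here on purpose).
static float add_fpu(float a, float b) { return a + b; }
static float add_sse(float a, float b) { return a + b; }

typedef float (*AddFn)(float, float);

struct DispatchTiming {
    double branch_ms;    // time spent with an if-style test per call
    double pointer_ms;   // time spent calling through a function pointer
    float  branch_sum;   // results, kept so the loops are not optimised away
    float  pointer_sum;
};

DispatchTiming compare_dispatch(std::size_t iterations, bool useSSE)
{
    typedef std::chrono::steady_clock Clock;
    DispatchTiming t = { 0.0, 0.0, 0.0f, 0.0f };

    // volatile: the flag is re-read every iteration, like a global flag would be.
    volatile bool flag = useSSE;
    Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        t.branch_sum = flag ? add_sse(t.branch_sum, 1.0f)
                            : add_fpu(t.branch_sum, 1.0f);
    }
    t.branch_ms = std::chrono::duration<double, std::milli>(Clock::now() - start).count();

    // volatile pointer: forces a genuine indirect call on every iteration.
    AddFn volatile fn = useSSE ? add_sse : add_fpu;
    start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        t.pointer_sum = fn(t.pointer_sum, 1.0f);
    }
    t.pointer_ms = std::chrono::duration<double, std::milli>(Clock::now() - start).count();

    return t;
}
```

Calling compare_dispatch with a few million iterations and comparing branch_ms against pointer_ms gives a first impression only; results depend heavily on the CPU, the compiler and the optimisation level.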

Sander    1332
What I do is provide multiple binaries for different platforms. Just wrap the relevant parts of the code in #ifdef / #endif. Do make sure, however, that you still check for SSE/SSE2 support before you run it. People could very easily download the wrong version, of course. If they did download the wrong version, quit gently with an error message explaining where to get the right one.

It's a bit more work and you will have to deal with multiple binary versions, but you'll get the fastest code path.
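
A small sketch of how such a build might be organised. MATHS_BUILD_SSE is a hypothetical macro that the SSE build would define (for example on the compiler command line); Vector4, binaryMatchesCpu and checkBinaryOrExit are likewise invented for the example, and the result of the runtime CPU check is simply passed in as a bool.

```cpp
#include <cstdio>
#include <cstdlib>

// MATHS_BUILD_SSE: hypothetical macro, defined only when building the SSE binary.
#ifdef MATHS_BUILD_SSE
#include <xmmintrin.h>
#endif

struct Vector4 { float v[4]; };

// The code path is fixed at compile time, so this can be inlined freely.
inline Vector4 operator+(const Vector4& a, const Vector4& b)
{
    Vector4 r;
#ifdef MATHS_BUILD_SSE
    _mm_storeu_ps(r.v, _mm_add_ps(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v)));
#else
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] + b.v[i];
#endif
    return r;
}

// True when this binary can run on a CPU with the given SSE support.
inline bool binaryMatchesCpu(bool cpuHasSSE)
{
#ifdef MATHS_BUILD_SSE
    return cpuHasSSE;
#else
    (void)cpuHasSSE;    // the plain build runs everywhere
    return true;
#endif
}

// Call once at startup, before any maths code runs.
inline void checkBinaryOrExit(bool cpuHasSSE)
{
    if (!binaryMatchesCpu(cpuHasSSE)) {
        std::fprintf(stderr,
            "This build requires a CPU with SSE support.\n"
            "Please download the non-SSE version instead.\n");
        std::exit(EXIT_FAILURE);
    }
}
```

The startup check is the only runtime decision left; every maths operation afterwards is ordinary, statically compiled code.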

Sander Maréchal
[Lone Wolves Game Development][RoboBlast][Articles][GD Emporium][Webdesign][E-mail]

petewood    819
Put the platform specific code in its own functions
Write a generic version of the functions too.
Build your class using templates
Choose the different platform functions depending upon the template specialisation
typedef the specialisations to vector3D in separate platform specific headers
Change the include paths for different builds for different platforms
No need for #ifdefs
Nice and neat
Sorry not my usual detailed reply. Stuff to do
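
The steps above might look roughly like this when squeezed into one file. All names here (GenericPlatform, SsePlatform, Vector3Ops, BasicVector3) are invented for the sketch. In a real project each specialisation and its typedef would sit in its own platform-specific header, so the preprocessor guard below would not be needed; it is only there so this single file compiles on non-SSE targets.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MATHS_HAVE_SSE_BUILD 1
#endif

// Tag types naming each platform.
struct GenericPlatform {};
struct SsePlatform {};

// Platform-specific functions live in specialisations of this template.
template <typename Platform>
struct Vector3Ops;

template <>
struct Vector3Ops<GenericPlatform> {
    static void add(const float* a, const float* b, float* out)
    {
        for (int i = 0; i < 3; ++i)
            out[i] = a[i] + b[i];
    }
};

#ifdef MATHS_HAVE_SSE_BUILD
template <>
struct Vector3Ops<SsePlatform> {
    // Expects arrays of 4 floats (the class below pads to 4).
    static void add(const float* a, const float* b, float* out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
};
#endif

// The vector class itself is written once, against the Ops interface.
template <typename Platform>
class BasicVector3 {
public:
    BasicVector3() { v_[0] = v_[1] = v_[2] = v_[3] = 0.0f; }
    BasicVector3(float x, float y, float z) { v_[0] = x; v_[1] = y; v_[2] = z; v_[3] = 0.0f; }

    float x() const { return v_[0]; }
    float y() const { return v_[1]; }
    float z() const { return v_[2]; }

    friend BasicVector3 operator+(const BasicVector3& a, const BasicVector3& b)
    {
        BasicVector3 r;
        Vector3Ops<Platform>::add(a.v_, b.v_, r.v_);
        return r;
    }

private:
    float v_[4];   // padded to 4 floats so an SSE load/store fits
};

// This line would live in the generic platform header; the SSE header
// would instead contain: typedef BasicVector3<SsePlatform> vector3D;
typedef BasicVector3<GenericPlatform> vector3D;
```

Client code only ever mentions vector3D, so switching platform means switching the include path. Like the multi-binary approach, this is a compile-time choice and produces one binary per platform.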

Pete

mishikel    148
I've written a math library and I'm pretty pleased that everything works correctly. However, I have some similar optimization questions:

SSE
Jan and Sander, it sounds like you guys have working SSE code in your math libs. How much speed improvement have you noticed? In which areas are SSE optimizations most crucial?

Inlining
Which functions did you choose to inline (everything, nothing, something in between)? How much of a speed improvement did you notice?



Thanks,
Matt

Jan Wassenberg    999
Sander: hehe, I didn't consider that, because of the trouble for the user - many people don't know what SSE is, or at least that you need a PIII or Athlon XP ("what's that?") to run it. I guess it's workable with a 'you installed the wrong version' check, but that's still a hassle.

I actually don't think any of these suggestions is worth the trouble, unless you find that your math code is demonstrably too slow, and further, that it would be improved by SSE. mishikel, I don't SSE-optimize stuff unless it really, really matters (see the CLOD terrain engine on my page for one example), and a math lib isn't one of those cases, IMO. For a few odd matrix ops, SSE doesn't make a difference at all. If you do enough that it would, I'd write the whole thing in asm, doing register allocation myself. It's kind of silly to load stuff from memory, do a few SSE ops on it, and write it back out to memory.
That said, if you have lots of fsqrt(), you still win by replacing fsqrt with rsqrtss & mulss, even with parameter passing overhead.
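
The rsqrtss & mulss replacement can be written with intrinsics as in the sketch below, using sqrt(x) = x * (1 / sqrt(x)). fast_sqrt is a made-up name. rsqrtss is only an approximation (roughly 12 bits of precision), and this version does not handle x = 0, where 0 * infinity yields NaN. On builds without SSE the sketch falls back to std::sqrt.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MATHS_HAVE_SSE_BUILD 1
#endif

// Approximate square root for x > 0.
inline float fast_sqrt(float x)
{
#ifdef MATHS_HAVE_SSE_BUILD
    const __m128 vx    = _mm_set_ss(x);         // x in the lowest lane
    const __m128 rsqrt = _mm_rsqrt_ss(vx);      // rsqrtss: approx. 1 / sqrt(x)
    return _mm_cvtss_f32(_mm_mul_ss(vx, rsqrt)); // mulss: x * (1 / sqrt(x))
#else
    return std::sqrt(x);                         // exact fallback without SSE
#endif
}
```

If the reduced precision is a problem, one Newton-Raphson refinement step on the reciprocal square root is the usual remedy, at the cost of a few more instructions.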

Sander    1332
mishikel:
I can't tell you the speed improvement from inlining, since we have never NOT inlined the functions we use. That is the entire reason we went for a multi-binary approach. Inlining + SSE(2) = max speed.

The SSE functions take approximately 20%-25% of the time of the normal C functions (when operating on arrays of 4D vectors). SSE2 has yet to be profiled properly. If you use other vectors (like 3D ones), the SSE speed improvement is less than that.

Jan:
We are eliminating the binary hassle via an installer/launcher. Our game will be an online-multiplayer-only game, so the latest binaries are always available via the internet. At startup, the launcher checks for SSE(2) support (or AltiVec on Macintosh) and, if the installed version is not the optimal one, the user is prompted to download the optimal version. Zero hassle for the user.


