Maths Library

Started by ArnoAtWork
14 comments, last by ArnoAtWork 20 years, 2 months ago
I'm trying to write a mathematical library for vectors, matrices, and so on, and I would like to include SSE and SSE2 support. The problem is designing the library correctly. I would like to run a test first to determine whether or not the CPU supports SSE instructions, and after that call the best instruction for each mathematical need. But how should I implement that? Should I test inside each function, like:

```cpp
Vector3 operator+(const Vector3& v1, const Vector3& v2)
{
    if (SSEsupport)
        return addVector3SSE(v1, v2);
    else
        return addVector3C(v1, v2);
}
```

Or should I use a virtual interface?

```cpp
MathsInterface* ptrOfMathsInterface;
...
if (SupportSSE())
    ptrOfMathsInterface = new MathsInterfaceSSE();
else
    ptrOfMathsInterface = new MathsInterfaceC();
...
Vector3 operator+(const Vector3& v1, const Vector3& v2)
{
    return ptrOfMathsInterface->addVector3(v1, v2);
}
```

Thanks a lot.

[edited by - arnoatwork on January 22, 2004 5:26:23 PM]
Hi!

Do you want to use SSE/2 for optimization or just to get familiar with it? Because I think that checking for SSE support every time you do an operation adds more overhead than what would be removed by using SSE (I might be wrong, never really used SSE).

cya,
Drag0n

-----------------------------
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning..." -- Rich Cook

My future web presence: Dave's Programming Resources
It's to be used for optimisation.

The problem is finding the best approach. Since the program doesn't know what the CPU supports until it runs, I can't use templates or anything "pre-defined". I need something at run time...

So apart from testing a flag or using a function pointer (or a virtual class), I don't know what else I could use...
Well, I would implement two dynamic libraries, one with SSE support and one without. Then you check once at startup and load the corresponding library.

cya,
Drag0n

But is it quicker at run time than a function pointer? Using a dynamic library is really the same approach...
Sure, you could use function pointers. But I would not use virtual functions because they add overhead.

cya,
Drag0n

Yeah, it's true...

Thanks.
You're welcome!

cya,
Drag0n

The trick is that for SSE(/2) to make a worthwhile difference, you need to do a reasonable amount of processing per call (i.e. if you're just doing one packed add instruction per call, you'd be better off not making the call at all and coding it in traditional FPU code).
Not easy. First, the worst drawback: you end up penalizing the machine without SSE (which is the one that needs help the most!), because it goes down the normal code path, but with extra overhead from the runtime decision. Are there any hard limits on what the non-SSE machine must accomplish? Can you simply decide not to support P-II / plain Athlon or older CPUs, or is it an 'I wanna make it as fast as possible' thing?
With that decision made, we can continue.
What is overhead? Simple: the difference from statically compiled, perfectly inlined code. Any overhead in a math library hurts if it is paid on individual operations like addVector3. We can combat this in three ways:

1) reduce how often we pay the penalty: choose (via if, function pointer, whatever) SSE / FPU at the algorithm level, i.e. write transform_all_verts_sse and transform_all_verts_fpu, instead of splitting at the multiply_vector level. This is fairly annoying, especially if the algorithm is complex, or there are several such places, but workable.

2) compile your app at install time, taking into account the target system - no more runtime decision at all. This would be ideal IMO, but not everyone has a (good) compiler around.

3) go hardcore: patch your application at runtime, so that the math functions end up where they should. Of course, this ends up non-portable and may not work in the next Windows (*sigh*, if only they would write simple, secure code, instead of locking everything down..), but here we go!
Somehow, at init time, we want to replace all calls to addVector3() with calls to addVector3SSE(), if the CPU supports it. Our overhead would then consist of a static function call, which is worse than the inlined version, but much better than an indirect call. Here's how to do it: addVector3 is a stub routine which notes the place from which it was called (basically, the call instruction just before the return address on the stack), and patches that call instruction to point to addVector3SSE or addVector3FPU, according to the CPU detected.
Is there a way to actually inline the function, instead of paying the cost of a function call and copying arguments onto the stack? Spoiler: yes. Here's how: addVector3 still patches on its first call, but it is declared __forceinline and consists of a call to the patcher function plus padding (enough that either the SSE or the FPU version would fit). The patcher doesn't patch the call *site* as before, but replaces each occurrence of the stub with the SSE / FPU code. We have thereby reduced the overhead to executing the padding (which can be shrunk further by using large instructions instead of NOPs) and to decreased instruction-cache efficiency (due to the padding); both are small compared to an indirect jump.
Note: both ways may require you to VirtualProtect away write protection.

BTW, a branch that is correctly predicted not-taken (your condition isn't going to change, and if the function is called often enough to warrant optimization, it will be in the branch history) is likely faster than an indirect jump (be it a function pointer or a virtual function), but you'll have to test this on your CPU.

This topic is closed to new replies.
