Is it still worth using SIMD for fast maths?

Started by
13 comments, last by adder_noir 13 years, 1 month ago
Hi,

My chosen engine uses a fast math library which uses SIMD stuff if the processors in the hardware are up to the job. Makes a lot of sense I suppose, and the chapter explaining the source code sounded dandy. The only minor-ish issue is that it's written in Intel syntax, so I'm having to convert it to AT&T syntax as I'm compiling with the Code::Blocks IDE, which uses GCC. No big deal so far, just a question of learning the different syntax. Most of the commands are pretty much identical anyway.

But.... I noticed when actually looking at the code that he only uses the SIMD stuff in two files: one for vectors, one for matrices. OK, fair enough, most of the stuff in the polygon classes and the like is maybe not too math-heavy, whereas vectors and matrices are obviously used thousands of times a frame, or an awful lot anyway. The thing is that at least half - maybe more like two thirds! - of the alternative SIMD code paths are commented out!!! He says in the comments that for some reason they run more slowly than the ordinary C++ versions!

So my questions are in light of that information:

1) Should I bother using the SIMD parts of the math library? It would mean finishing the job I've started of converting the Intel-syntax stuff into GCC's extended asm, but that's not really a big deal. It would also mean updating the code to accommodate the latest processor types from Intel and AMD. Again, perfectly feasible, given that I understand exactly how he acquires their 'personal information' ;) All that said - is it worth it? My games aren't likely to be very high-poly; I'm aiming for gameplay more than graphics.

2) Will writing the SIMD stuff in AT&T syntax and compiling with GCC be any faster than the Intel-syntax stuff used with Visual C++'s _asm inline assembly extension? Maybe that might solve the problem?
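
To illustrate what I mean, here's the same packed add written in each dialect (just a sketch with made-up variable names, assuming 16-byte-aligned arrays; MSVC's inline assembly only exists for 32-bit builds):

[code]
alignas(16) float a[4], b[4], r[4];

void add_msvc()
{
#if defined(_MSC_VER) && !defined(_M_X64)
    __asm {                    // Intel syntax: destination comes first
        movaps xmm0, a
        addps  xmm0, b
        movaps r, xmm0
    }
#endif
}

void add_gcc()
{
#if defined(__GNUC__)
    asm("movaps %1, %%xmm0\n\t"   // AT&T syntax: source comes first
        "addps  %2, %%xmm0\n\t"
        "movaps %%xmm0, %0"
        : "=m"(r)
        : "m"(a), "m"(b)
        : "xmm0");
#endif
}
[/code]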

I am a little horrified to see half this code commented out, especially after how good the chapter sounded, and I really can't think why the Intel-syntax assembly would run slowly. Maybe something in Visual C++ is ruining the implementation?

Anyways.... thanks for any help offered as always :)
Well I can't comment on much there but:

I'm aiming for gameplay more than graphics.
[/quote]
Definitely don't bother until after you've got the gameplay done and performance is an issue (if it ever is an issue).

Interested in Fractals? Check out my App, Fractal Scout, free on the Google Play store.


You shouldn't be writing this stuff in assembly; most compilers provide intrinsic functions, and you should call those, as the compiler can then compile these statements into SSE2 code. Let the compiler do the painful work for you; this way your SIMD instructions should work regardless of which CPU is being used, after a recompile for that target CPU of course.
Here is a good article about vector math and why it is faster than normal single- or double-precision floating-point arithmetic: http://www.gamasutra...tform_simd_.php He even goes as far as telling you which compilers produce fast code; the SN Systems compiler is a fork of GCC tweaked for different platforms.

1. Gameplay code can sometimes be vector-intensive - physics and audio, for example - and SIMD instructions can speed things up by a factor of 3 or more, so yeah, it makes sense to spend some time on it. However, it doesn't make sense to write this stuff in asm; as I said before, there are intrinsic (builtin) functions in compilers that can do this for you and make your coding go a lot faster.

2. You shouldn't be writing any assembler at all; the compiler is better at optimizing code than you are, most of the time. Occasionally it's worth letting the compiler do its thing and then hand-optimizing the generated assembler to run faster.
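
For example, a four-float add with intrinsics looks like this (a minimal sketch; the __m128 type and the _mm_* functions come from <xmmintrin.h>, which both MSVC and GCC ship):

[code]
#include <xmmintrin.h> // SSE intrinsics, same names in MSVC and GCC

// Adds two 4-float vectors. The compiler picks the registers and
// schedules the instructions itself, so the same source works on
// any SSE-capable CPU after a recompile.
void vec4_add(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);            // unaligned load of a[0..3]
    __m128 vb = _mm_loadu_ps(b);            // unaligned load of b[0..3]
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); // out[i] = a[i] + b[i]
}
[/code]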

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion


2) Will writing the SIMD stuff in AT&T syntax and compiling with GCC be any faster than the Intel-syntax stuff used with Visual C++'s _asm inline assembly extension? Maybe that might solve the problem?
[/quote]

If the data wasn't designed properly, you can waste a tonne of cycles just shuffling data into SIMD-friendly formats in the registers. That's probably where the slowdown came from. If anything, I'd drop the assembly entirely and use the _mm_mul_ps() style intrinsics (they are the same on VS and GCC), as that gives the compiler the chance to optimize the actual output, whereas with _asm blocks it relies on you to have written them optimally.

That said, it might not even be worth it until you hit an actual performance problem.
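
To make the data-layout point concrete, here's a hypothetical sketch (not the engine's actual structures): with an array-of-structures layout every vector has to be shuffled into SIMD lanes one component at a time, while a structure-of-arrays layout lets a single load grab four x components at once:

[code]
#include <xmmintrin.h>

// Array-of-structures: x/y/z interleaved, so SIMD code must shuffle.
struct Vec3AoS { float x, y, z; };

// Structure-of-arrays: four x's sit side by side, ready for one load.
struct Vec3SoA { float x[4], y[4], z[4]; };

// Scales four vectors at once with no shuffling at all.
void scale4(Vec3SoA& v, float s)
{
    __m128 vs = _mm_set1_ps(s); // broadcast s into all four lanes
    _mm_storeu_ps(v.x, _mm_mul_ps(_mm_loadu_ps(v.x), vs));
    _mm_storeu_ps(v.y, _mm_mul_ps(_mm_loadu_ps(v.y), vs));
    _mm_storeu_ps(v.z, _mm_mul_ps(_mm_loadu_ps(v.z), vs));
}
[/code]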
Wow, this is great advice. It seems like the engine's author has the right idea in theory but maybe not in practice. I think I'll go the route you've all suggested, then. It did occur to me that 'doing it manually' might actually result in poorer performance than letting the compiler do this stuff itself. Until now, though, I wasn't aware of such a thing.

Here's a question though: do I still need to write a separate code path for the SIMD stuff even if I'm not manually writing the assembly code and I'm using the intrinsic functions?

I haven't read that article yet; looks like that's the next big thing for me to do. Having said that, a large - or at least considerable - part of this math library seems to rely on values returned from specific assembly code in order to function. The engine's author has, for example, written a struct which stores the CPU's manufacturer name in a char* string. I wonder: by using these intrinsic functions which I'm not yet aware of, can I still request and return specific data about the CPU into variables held within this struct, or will using these intrinsic functions abstract me away from being able to do this?

If not, then I may have to settle for the reality of rewriting the part of the math library which is oriented around acquiring CPU-specific data. This may take some time, but I'm already settled on the fact that this is a big enough problem that it will take a while to resolve, and it may mean I end up having more to do with the math library code than I really want to :rolleyes:
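
For reference, this is the kind of thing I mean - a minimal sketch of grabbing the vendor string with GCC's <cpuid.h> (my guess at the mechanism, not the author's actual code):

[code]
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#include <cstring>
#include <cstdio>

int main()
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        // The 12-byte vendor string comes back in EBX, EDX, ECX.
        std::memcpy(vendor + 0, &ebx, 4);
        std::memcpy(vendor + 4, &edx, 4);
        std::memcpy(vendor + 8, &ecx, 4);
        std::printf("%s\n", vendor); // e.g. "GenuineIntel" or "AuthenticAMD"
    }
    return 0;
}
[/code]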

Thanks so much for the help so far :D
I rewrote most of my math stuff in SSE a while ago and there was a pretty nice benefit. Converting my math code over was pretty simple; the matrix/vector stuff didn't take long at all. The amount of arithmetic was basically cut to a quarter of what it was in some cases.

Restructuring some other things took a little longer. It's a software renderer, and one of the biggest gains in really geometry-heavy cases was redoing the final transformation stuff for SSE. I ended up changing how I fill my vertex buffer, padding it with dummy vertices to keep alignment and trying to keep things ordered in a way that doesn't require me to shuffle data around. It just goes in and I'm off plowing through data. My first 'get it working' attempt was to move from transforming one vertex at a time to the entire triangle. Alpha-blended pictures worked nicely too, since I can do multiple pixels at once.

I haven't really had any issues with SSE yielding slower results on my Q6600. I ended up just ditching the plain x86 float paths and keeping the SSE code, because I had nothing but nice boosts.


VS 2008 generated some stuff for intrinsics, though, that was beyond earthly logic. 2008 would generate downright awful code, like somehow creating tons of loads and moves instead of a single movaps, or optimizing nothing at all. I had some stuff in assembly, but that went out the window with 2010, because the compiler started beating me nicely. Even trying to improve on what it generated didn't work well, so it was clearly playing into some kind of black magic that I couldn't hope to match myself.



Just try to do as much work as you possibly can while you're in SSE land. Don't be afraid to saturate the entire SSE register file: squeeze in as much as you can and do as much as you can before leaving.
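
To give a flavour of the batch idea, here's a stripped-down sketch (hypothetical names, not my actual renderer code; assumes xyzw vertices and a column-major 4x4 matrix, both 16-byte aligned):

[code]
#include <xmmintrin.h>

// Transforms `count` xyzw vertices by a column-major 4x4 matrix.
// Each output vertex is a sum of the matrix columns scaled by the
// input components, so no shuffling of the result is ever needed.
void transform(const float* m /* 16 floats */, const float* in,
               float* out, int count)
{
    __m128 c0 = _mm_load_ps(m + 0);
    __m128 c1 = _mm_load_ps(m + 4);
    __m128 c2 = _mm_load_ps(m + 8);
    __m128 c3 = _mm_load_ps(m + 12);
    for (int i = 0; i < count; ++i, in += 4, out += 4) {
        __m128 x = _mm_set1_ps(in[0]);
        __m128 y = _mm_set1_ps(in[1]);
        __m128 z = _mm_set1_ps(in[2]);
        __m128 w = _mm_set1_ps(in[3]);
        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
            _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_store_ps(out, r); // aligned store into the vertex buffer
    }
}
[/code]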
Even trying to improve on what it generated didn't work well, so it was clearly playing into some kind of black magic that I couldn't hope to match myself.
[/quote]


Thanks, that's about what I'm beginning to think. I think I'll just learn as much as I can about this intrinsics stuff and leave it at that. Realistically, I don't think I'd ever find enough step-by-step advice about this on the internet to get too involved with it. It would mean buying a specific book and then spending a load of time going through it. Not good, as I'm already working through a 900-page book - I'm on page 500.

I'll have a good read of that Gamasutra article, but to be frank, if I can't digest it I'll just strip the SIMD code paths out of the math library and move on, *maybe* coming back to it when I'm through the book and have got a load of other more pressing concerns working, such as terrain, animation and the like. I'm not getting bogged down with a side quest to the point where it becomes as big a chore as the main drive :rolleyes:

Thanks for the information sounds good - if you've got the patience ;)
Just use Eigen if you want efficient math code.
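
For instance (a minimal sketch; Eigen's fixed-size types vectorize automatically when SSE is enabled):

[code]
#include <Eigen/Dense>
#include <cstdio>

int main()
{
    Eigen::Matrix4f m = Eigen::Matrix4f::Identity();
    Eigen::Vector4f v(1.0f, 2.0f, 3.0f, 1.0f);
    Eigen::Vector4f r = m * v; // compiled down to SSE when available
    std::printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}
[/code]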

VS 2008 generated some stuff for intrinsics, though, that was beyond earthly logic. 2008 would generate downright awful code, like somehow creating tons of loads and moves instead of a single movaps, or optimizing nothing at all.
[/quote]

That will be because you didn't return by value, or you tried to move stuff back into float registers. When moving from SSE to float and the other way around, you create a load-hit-store stall, as the value has to be pumped through the entire CPU pipeline to move from an SSE register into a normal float register. This is why most SIMD-enabled math libs always return an SSE register, even from getX() on the vector or matrix class.
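
A sketch of the pattern (a hypothetical class, just to show the shape of it):

[code]
#include <xmmintrin.h>

struct Vec4
{
    __m128 v;

    // Splat X across all four lanes and hand it back still in an SSE
    // register. Returning a plain float instead would force a store
    // followed by a reload - the load-hit-store stall described above.
    __m128 GetX() const
    {
        return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    }
};
[/code]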

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

If you don't mind potentially less accurate results, try setting Properties | C++ | Code Generation | Floating Point Model to "Fast". You'll get heavily optimized FP and SSE code, since the compiler can then use algebraic identities that don't strictly hold for finite-precision FP values and lift some restrictions imposed by the language standard. Games don't normally require high accuracy, so the tradeoff is most likely worthwhile. (MSDN has a detailed explanation of this.)
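
For reference, the rough command-line equivalents (assuming recent MSVC and GCC; the exact semantics differ between compilers):

[code]
cl  /O2 /fp:fast /arch:SSE2 math.cpp
g++ -O2 -ffast-math -msse2  math.cpp
[/code]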
