x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?
I honestly have no idea what the old sse flag does on x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).
You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the new local register allocator (yay for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.
thank you for the reply
soooo what you're saying is that _specifically_ optimizing for sse on 64 bit is completely useless?
so is this whole 'use sse' stuff because people are still shipping 32-bit executables on windows (and now on linux too, thx steam)?
edit: I measured it on 64 bit only. It was the 'with sse' code before optimization vs after; by optimization I mean vectorizing stuff, which you can see in the git history. I'm using sse intrinsics specifically, and I have a define to enable them, see line 6 of https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h
edit2: so I measured with a 32 bit compilation on 64 bit linux. Same thing happens: 1.8 seconds without sse, 4 seconds with.
edit3: so I tried gcc 4.8.2:
64 bit with sse: 2.116
64 bit without sse: 0.915
32 bit with sse: 2.052
32 bit without sse: 1.7
If I understood the code correctly from a quick glance, it looks like you're measuring the time to do ten million matrix inverses? In the fastest result, that corresponds to 170 nanoseconds per single inverse. If you're interested in a data point for comparison, here are the benchmark results of a 4x4 sse inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db
Look for "float4x4::Inverse". On those bots, the best result comes on a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with rdtsc instruction).
In general, the 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE path does end up ahead. In a lot of problems you just don't have four matrices to invert simultaneously, so using SoA, while more effective since it's practically shuffle-free, is somewhat utopian.
only one million, which means 1700 nanoseconds :S
But this wouldn't be a fair comparison :) my pc has an A8-4500M APU, which is far from the fastest :)
But I'll try MathGeoLib, and see how it performs on equal terms (possibly way faster)
I was thinking the same.