# mat4 inverse sse optimization


## Recommended Posts

Hi,

I'm trying to optimize the inverse function of my maths lib. I've already implemented everything using SSE, and I'm now trying to validate that it works and make sure it's faster. Most of the time it's really great: I get 5-10x speedups everywhere, except in the mat4 inverse function.

code here: https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_mat_func.h (line 397)

here's the testcase:

```cpp
#include <iostream>
#include <cstdlib> // for atoi

#include "mymath/mymath.h"
#include "SFML/System.hpp"

int main( int argc, char** args )
{
  using namespace mymath;
  using namespace std;

  sf::Clock clock;
  mat4 m1;

  const int size = 4;

  // expects the 16 matrix elements as command-line arguments
  if( argc < size * size + 1 )
    return 1;

  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m1[x][y] = atoi( args[x * size + y + 1] );

  clock.restart();

  for( int c = 0; c < 10000001; ++c ) // ten million inverses
    m1 = inverse( m1 );

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m1 << endl;

  return 0;
}
```



With SSE enabled I get 5 seconds of execution time; without it, only 1 second. So I tried to look at what is causing this, and used objdump to grab the assembly from the executables.

without sse (line 393):

http://pastebin.com/nVYJSp56

with sse (line 763):

http://pastebin.com/VNzvJr4U

As you can see, when I disable the SSE functionality the compiler somehow recognizes the optimization opportunities and does a much better job at generating efficient code.
If you look at the SSE version, you can see that it's poor assembly with lots of non-SSE instructions that hinder speed.

Any idea what I'm doing wrong?

ps.: I'm not that good at assembly; I can barely read it. I think the 'without SSE' code is better because no non-SSE instructions are called.

ps2.: I have no idea why the compiler isn't inlining the inverse function, or why it compiled other functions into the exe even though they're not even used.

I'm on 64-bit Linux 12.04 (kernel 3.2.0-58-generic), using gcc 4.6.3 with "-O3 -Wall -Wno-long-long -ansi -pedantic -std=c++0x", and objdump to get the assembly.

best regards,

Yours3!f

Edited by Yours3!f

##### Share on other sites

x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?

I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).

You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yey for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.

##### Share on other sites

> x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?
>
> I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).
>
> You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yay for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.

Soooo, what you're saying is that _specifically_ optimizing for SSE on 64-bit is completely useless?
So is this whole 'use SSE' stuff around because people are still shipping 32-bit executables on Windows (and now on Linux too, thanks to Steam)?
edit: I measured it on 64-bit only. It was the 'with SSE' code before optimization vs. after. By optimization here I mean vectorizing stuff; you can see that in the git history. I'm using SSE intrinsics specifically, and I have a define to enable it, see (line 6):

https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h

edit2: I measured with a 32-bit compilation on 64-bit Linux. The same thing happens: 1.8 seconds without SSE, 4 seconds with.

edit3: so I tried gcc 4.8.2:

64-bit with SSE: 2.116

64-bit without SSE: 0.915

32-bit with SSE: 2.052

32-bit without SSE: 1.7

Edited by Yours3!f

##### Share on other sites

I think I realize where I got confused: I think you were using the -msse flag instead of -mfpmath. On x86-64, -mfpmath=sse is enabled by default, so explicitly setting it should be redundant on x64. On 32-bit x86, the default is 387 (80-bit temporary floats), so you want to override it whenever you don't need 80-bit precision. You could also try -mfpmath=sse,387 to see if it gives better performance, but it probably won't on most CPUs.

The -msse flag says to use SSE1; SSE2 is -msse2, SSE3 is -msse3, etc. You should probably just use -mtune instead of messing with all that. On x86 you can use -march=i686 -mtune=nocona -mfpmath=sse when targeting SSE3+. That means the code runs on any 686, is tuned for MMX/SSE1-3, and uses 64-bit SSE math instead of 80-bit 387 math. Just change to -march=x86-64 for x64 (at least I think x86-64 is the generic). You can see the different architecture choices like nocona in the gcc i386/x64 options. Since you use Linux, you've probably seen nocona in the names of pre-built binaries before.
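To make the flag combinations concrete, here are hypothetical gcc invocations reflecting the advice above (the source file name is made up):

```shell
# 32-bit: runs on any i686, tuned for an SSE3-era chip, SSE scalar float math
gcc -m32 -O3 -march=i686 -mtune=nocona -mfpmath=sse -c mm_test.cpp

# 64-bit: x86-64 is the generic arch; -mfpmath=sse is already the default here
gcc -m64 -O3 -march=x86-64 -mtune=nocona -c mm_test.cpp
```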

Vectorizing source code often pays off, but don't specifically use SSE intrinsics in C/C++ unless the compiler just won't generate the right code. When vectorizing, try to make the data width as wide as makes sense. AVX (Sandy Bridge+/Bulldozer+ CPUs, Linux 2.6.30+, Win7+) has 256-bit width, and AVX-512 is around the corner. If you explicitly used SSE intrinsics, you'd have to rewrite the code as soon as you want to target AVX2 instead of just setting -mtune=core-avx2.

Edited by richardurich

##### Share on other sites

> I think I realize where I got confused. I think you were using msse flag instead of mfpmath. For x86-64, -mfpmath=sse is enabled by default. Explicitly setting that should be redundant on x64. On 32-bit x86, the default is 387 (80-bit temporary floats), so you want to override it whenever you don't need 80-bit precision. You could also try mfpmath=sse,387 to see if it gives better performance, but it probably won't for most CPUs.
>
> Flag -msse is saying to use SSE1. SSE2 is -msse2, SSE3 is -msse3, etc. You should probably just use mtune instead of messing with all that. On x86, you can have -march=i686 -mtune=nocona -mfpmath=sse when targeting sse3+. That means the code runs on any 686, is optimized for MMX, SSE1-3, and uses 64-bit SSE math instead of 80-bit 387 math. Just change to -march=x86-64 for x64 (at least I think x86-64 is the generic). You can see the different architecture choices like nocona in the gcc i386/x64 options. Since you use linux, you've probably seen nocona in the names of pre-built binaries before.
>
> Vectorizing source code often pays off, but don't specifically use SSE intrinsics in C/C++ unless the compiler just won't generate the right code. When vectorizing, try to make the data width as wide as makes sense. AVX (Sandy Bridge+/Bulldozer+ CPUs, Linux 2.6.30+, Win7+) has 256-bit width, and AVX-512 is around the corner. If you explicitly used SSE intrinsics, you'd have to rewrite the code as soon as you want to target AVX2 instead of just setting -mtune=core-avx2.

Thank you, this cleared up a lot of things. I explicitly used SSE2 intrinsics to squeeze out as much performance as possible, since I'd like to learn that kind of stuff. I'm targeting up to SSE3. SSE2/3 works great for vec4s since they're 128 bits wide, and I can't think of how AVX would give me benefits except for matrix code. But since I can disable all the hacking I did in SSE2 with just a compiler switch, why not go ahead and explore that?

Now with this information in hand, I rewrote the cmake file. Can you please take a look at it to see if I'm doing it right?
https://github.com/Yours3lf/libmymath/blob/master/CMakeLists.txt

Edit: I think I'm starting to do it right:

32-bit with explicit SSE2: 2.028 seconds

32-bit without: 1.877 seconds

64-bit with explicit SSE2: 2.05 seconds

64-bit without: 1.637 seconds

The motivation is that the sine function takes 0.3 seconds with explicit SSE2 and 0.5 without, on 64-bit :)

Edited by Yours3!f

##### Share on other sites

This is a bad application of SIMD and is unlikely to show significant benefits due to the overhead of loading scalar data into vector registers. If you look at the generated assembly, a significant number of instructions are either shuffles or loads. These aren't doing useful work and are the reason why your vectorized implementation is actually slower. I wouldn't count on the compiler to vectorize anything other than trivial computations.

You need to look into formatting your data into aligned Structures of Arrays (SoA) format. For instance, rather than having a Vector4 or Matrix4 class that uses SIMD instructions to operate on a single vector or matrix, have a SIMDVector4 that operates on 4 different 4-component vectors at once, one component at a time.
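A minimal sketch of that SoA idea might look like the following (the type and function names are made up for illustration, not from the library): one structure holds the same component of four different vec4s per register, so component-wise arithmetic and even dot products need no shuffles at all.

```cpp
#include <xmmintrin.h> // SSE1 intrinsics

// Four 4-component vectors stored component-wise:
// x = {x0,x1,x2,x3}, y = {y0,y1,y2,y3}, etc.
struct soa_vec4x4
{
    __m128 x, y, z, w;
};

// Adds four vec4 pairs at once, completely shuffle-free.
inline soa_vec4x4 add( const soa_vec4x4& a, const soa_vec4x4& b )
{
    soa_vec4x4 r;
    r.x = _mm_add_ps( a.x, b.x );
    r.y = _mm_add_ps( a.y, b.y );
    r.z = _mm_add_ps( a.z, b.z );
    r.w = _mm_add_ps( a.w, b.w );
    return r;
}

// Four dot products at once: four multiplies and three adds, no shuffles.
inline __m128 dot( const soa_vec4x4& a, const soa_vec4x4& b )
{
    return _mm_add_ps( _mm_add_ps( _mm_mul_ps( a.x, b.x ),
                                   _mm_mul_ps( a.y, b.y ) ),
                       _mm_add_ps( _mm_mul_ps( a.z, b.z ),
                                   _mm_mul_ps( a.w, b.w ) ) );
}
```

Compare that with a single-vec4 dot product in AoS layout, which burns several shuffle/horizontal-add instructions to collapse one register.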

##### Share on other sites

Do you mean you're explicitly using sse2 intrinsics? If so, try just using them in the sine function (and anywhere else they help) and let the compiler do the magic elsewhere.

On use_32_bit -msse3 is redundant if you're specifying -mtune=nocona on both 32 and 64.

For 64-bit, you have to specify -march=x86-64 (the generic for 64-bit, like i686 for 32-bit) if you want the binaries to have a fallback code path for pre-nocona 64-bit chips.

The use_explicit_sse2 else ("-mtune=nocona -mfpmath=sse") should always be set instead of only when not using intrinsics (guessing purpose of flag). That might get you the missing 0.15/0.4 seconds back.

I don't normally use cmake, so my apologies if I made any mistakes on that front. It seemed straightforward enough though.

##### Share on other sites

> Do you mean you're explicitly using sse2 intrinsics? If so, try just using them in the sine function (and anywhere else they help) and let the compiler do the magic elsewhere.
>
> On use_32_bit -msse3 is redundant if you're specifying -mtune=nocona on both 32 and 64.
>
> For 64-bit, you have to specify -march=x86-64 (the generic for 64-bit, like i686 for 32-bit) if you want the binaries to have a fallback code path for pre-nocona 64-bit chips.
>
> The use_explicit_sse2 else ("-mtune=nocona -mfpmath=sse") should always be set instead of only when not using intrinsics (guessing purpose of flag). That might get you the missing 0.15/0.4 seconds back.
>
> I don't normally use cmake, so my apologies if I made any mistakes on that front. It seemed straightforward enough though.

Yeah, if you take a look at the files, there are fvec versions of the vec2/3/4 files, and there's even an sse file filled with arithmetic etc. functions (all containing explicit SSE instructions). I have actually implemented them everywhere, because I'd like it to be as fast as possible (as in the case of the sine function). Also, as Aressera mentioned, the compiler may or may not do a good job.

I turned on -msse3 because the explicit instructions need it, but I've now moved that to the switch. (I've updated the cmake file again :) )
Actually, enabling it everywhere gave me a 0.04-0.05 second speed penalty :( but that's not too much for ten million runs.

cmake is great!!! ;) it is as straightforward as it seems

##### Share on other sites

> This is a bad application of SIMD and is unlikely to show significant benefits due to the overhead of loading scalar data into vector registers. If you look at the generated assembly, a significant number of instructions are either shuffles or loads. These aren't doing useful work and are the reason why your vectorized implementation is actually slower. I wouldn't count on the compiler to vectorize anything other than trivial computations.
>
> You need to look into formatting your data into aligned Structures of Arrays (SoA) format. For instance, rather than having a Vector4 or Matrix4 class that uses SIMD instructions to operate on a single vector or matrix, have a SIMDVector4 that operates on 4 different 4-component vectors at once, one component at a time.

Yeah, I suspected that, but now I can be sure, thanks!

Edit: I checked again. It seems that after the compiler settings richardurich advised, the compiler now recognizes that it should do all this in SSE (there are still some non-SSE instructions though), and the context-switching hell doesn't happen anymore:
http://pastebin.com/jsDS3U5c (line 689)

I actually looked into it today; it's this, right?

```cpp
struct vec4 { float x, y, z, w; };
std::vector<vec4> data;                             // AoS
struct vec4_soa { float *x, *y, *z, *w; } data_soa; // SoA
```

Also, it's used for optimizing cache access, and it's highly dependent on the access patterns, right?
Now, usually on the CPU side (and on the GPU too, in the context of gamedev) people are dealing with vectors. So why would it be good to store the individual components so far away from each other?
Like

```cpp
float a[1024];
float b[1024];
```

In this case individual components would be pretty far away from each other, right? And if we'd like to act on each component in parallel, that would mean cache misses, right?
So I came to the conclusion that in this case AoS would be better, but prove me wrong.
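To make the access pattern concrete, here's a minimal sketch of acting on two per-component arrays with SSE (the names are illustrative, not from the library). Each array is walked linearly, so every cache line fetched contains 16 useful floats, and four elements load straight into one register with no shuffling:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Assumes 16-byte-aligned arrays and n divisible by 4.
void add_soa( const float* a, const float* b, float* out, int n )
{
    for( int i = 0; i < n; i += 4 )
    {
        __m128 va = _mm_load_ps( a + i );  // aligned load, 4 floats
        __m128 vb = _mm_load_ps( b + i );
        _mm_store_ps( out + i, _mm_add_ps( va, vb ) );
    }
}
```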

Edited by Yours3!f

##### Share on other sites

> x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?
>
> I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).
>
> You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yay for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.

> soooo what you're saying is that _specifically_ optimizing for sse on 64 bit is completely useless?
> so is this whole 'use sse' stuff is because people are still shipping with 32 executables on windows (and now on linux too thx steam)?
> edit: I measured it on 64 bit only. It was 'with sse' code before optimization vs after. By optimization here I mean vectorizing stuff. But you can see that in the git history. I'm using sse intrinsics specifically, and I have a define for it to enable, see (line 6):
>
> https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h
>
> edit2: so I measured with 32 bit compilation on a 64 bit linux. Same thing happens, 1.8 seconds for without sse, 4 seconds for with.
>
> edit3: so I tried gcc 4.8.2:
> 64 bit with sse: 2.116
> 64 bit without sse: 0.915
> 32 bit with sse: 2.052
> 32 bit without sse: 1.7

If I understood the code correctly from a quick glance, it looks like you're measuring the time to do ten million matrix inverses? In the fastest result, that corresponds to 170 nanoseconds per single inverse. If you are interested in a data point for comparison, here are the benchmark results of a 4x4 SSE inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db

Look for "float4x4::Inverse". On those bots, the best result comes from a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with the rdtsc instruction).

In general, the 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE instruction path does end up ahead. In a lot of problems you just don't have four matrices to invert simultaneously, so using SoA, while more effective since it's practically shuffle-free, is somewhat utopian.
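A per-call cycle measurement in the spirit of the rdtsc numbers above could be sketched like this (the helper name is made up; a serious harness would also pin the thread and discard warm-up iterations):

```cpp
#include <x86intrin.h> // __rdtsc
#include <cstdint>

// Returns the average number of timestamp-counter ticks per call of f.
template <typename F>
uint64_t cycles_per_call( F f, int iterations )
{
    uint64_t start = __rdtsc();
    for( int i = 0; i < iterations; ++i )
        f();
    return ( __rdtsc() - start ) / iterations;
}
```

Usage would be something like `cycles_per_call( [&]{ m = inverse( m ); }, 1000000 )`, keeping the result observable so the compiler can't optimize the loop away.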
