x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?
I honestly have no idea what the old sse flag does on x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).
You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the new local register allocator (yay for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.
thank you for the reply
soooo what you're saying is that _specifically_ optimizing for sse on 64 bit is completely useless?
so is this whole 'use sse' stuff because people are still shipping 32-bit executables on windows (and now on linux too, thx steam)?
edit: I measured it on 64 bit only. It was the 'with sse' code before optimization vs after; by optimization I mean vectorizing stuff, which you can see in the git history. I'm using sse intrinsics specifically, and I have a define to enable them, see line 6 of https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h
edit2: so I measured with a 32 bit compilation on 64 bit linux. Same thing happens: 1.8 seconds without sse, 4 seconds with.
edit3: so I tried gcc 4.8.2:
64 bit with sse: 2.116
64 bit without sse: 0.915
32 bit with sse: 2.052
32 bit without sse: 1.7
If I understood the code correctly from a quick glance, it looks like you're measuring the time to do ten million matrix inverses? In the fastest result, that corresponds to 170 nanoseconds per single inverse. If you're interested in a data point for comparison, here are the benchmark results of a 4x4 sse inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db
Look for "float4x4::Inverse". On those bots, the best result comes on a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with rdtsc instruction).
In general, the 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE path does end up ahead. In a lot of problems you just don't have four matrices to invert simultaneously, so using SoA, while more effective since it's practically shuffle-free, is somewhat utopian.
only one million, which means 1700 nanoseconds :S
But this wouldn't be a fair comparison :) my pc has an A8-4500M APU, which is far from the fastest :)
But I'll try MathGeoLib, and see how it performs on equal terms (possibly way faster)
I was thinking the same.