mat4 inverse sse optimization

11 comments, last by clb 10 years, 2 months ago

x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?

I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).

You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yey for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.
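For anyone who wants to check the point about x86-64 having SSE by default, here is a minimal sketch that reports which SSE feature macros the compiler predefines for the current build. It assumes GCC or Clang (which predefine `__x86_64__`, `__SSE__` and `__SSE2__`); the constant and function names are made up for this illustration and are not part of libmymath.

```cpp
// Compile-time view of the SSE feature macros predefined by GCC and Clang.
// All names below are illustrative only.

#if defined(__x86_64__)
constexpr bool kTargetIsX86_64 = true;
#else
constexpr bool kTargetIsX86_64 = false;
#endif

#if defined(__SSE__)
constexpr bool kSseEnabled = true;
#else
constexpr bool kSseEnabled = false;
#endif

#if defined(__SSE2__)
constexpr bool kSse2Enabled = true;
#else
constexpr bool kSse2Enabled = false;
#endif

// Returns a short human-readable description of the build configuration.
inline const char* sse_build_summary()
{
  if( kSse2Enabled )
    return "SSE and SSE2 are enabled for this build";
  if( kSseEnabled )
    return "SSE is enabled, SSE2 is not";
  return "no SSE support enabled for this build";
}
```

Printing `sse_build_summary()` from a 64-bit and a 32-bit build (with and without the sse flag) should show whether the flag changes what the compiler considers available, as opposed to changing how it generates code.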

thank you for the reply :)
soooo what you're saying is that _specifically_ optimizing for sse on 64 bit is completely useless?
so is this whole 'use sse' stuff because people are still shipping 32 bit executables on windows (and now on linux too thx steam)?
edit: I measured it on 64 bit only. It was 'with sse' code before optimization vs after. By optimization here I mean vectorizing stuff. But you can see that in the git history. I'm using sse intrinsics specifically, and I have a define to enable them, see (line 6):

https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h

edit2: so I measured with 32 bit compilation on a 64 bit linux. Same thing happens, 1.8 seconds for without sse, 4 seconds for with.

edit3: so I tried gcc 4.8.2 (times in seconds):
64 bit with sse: 2.116
64 bit without sse: 0.915
32 bit with sse: 2.052
32 bit without sse: 1.7

If I understood the code correctly at a quick glance, it looks like you're measuring the time to do ten million matrix inverses? For the 1.7 second result, that corresponds to 170 nanoseconds per inverse. If you are interested in a data point for comparison, here are the benchmark results of a 4x4 sse inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db

Look for "float4x4::Inverse". On those bots, the best result comes on a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with rdtsc instruction).

In general, optimizing a 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE code path does end up ahead. In a lot of problems, you just don't have four matrices to invert simultaneously, so using SoA (while being more efficient, since it's practically shuffle-free) is somewhat utopian.
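To illustrate why the SoA layout is practically shuffle-free, here is a minimal sketch that inverts four matrices at once. To keep it short it uses 2x2 matrices instead of 4x4, so it is not the MathGeoLib or libmymath routine; the `Mat2x4` type and `inverse_soa` function are hypothetical names, and the code assumes an x86 target with SSE and non-zero determinants.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Four 2x2 matrices [a b; c d] stored as structure-of-arrays:
// lane i of every register belongs to matrix i.
struct Mat2x4
{
  __m128 a, b, c, d;
};

// Inverts all four matrices at once using inverse = adjugate / determinant.
// Every lane is independent, so only vertical operations are needed
// (no shuffles). Assumes all four determinants are non-zero.
inline Mat2x4 inverse_soa( const Mat2x4& m )
{
  // det = a*d - b*c, computed for all four matrices in parallel
  __m128 det = _mm_sub_ps( _mm_mul_ps( m.a, m.d ), _mm_mul_ps( m.b, m.c ) );
  __m128 inv_det = _mm_div_ps( _mm_set1_ps( 1.0f ), det );
  __m128 neg_inv_det = _mm_sub_ps( _mm_setzero_ps(), inv_det );

  Mat2x4 r;
  r.a = _mm_mul_ps( m.d, inv_det );      //  d / det
  r.b = _mm_mul_ps( m.b, neg_inv_det );  // -b / det
  r.c = _mm_mul_ps( m.c, neg_inv_det );  // -c / det
  r.d = _mm_mul_ps( m.a, inv_det );      //  a / det
  return r;
}
```

The catch is the one described above: this only pays off when four independent matrices are available at the same time and already stored element-wise, which is rarely how the data arrives in practice.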

only one million, which means 1700 nanoseconds :S
But this wouldn't be a fair comparison :) my pc has an A8-4500M APU, which is far from the fastest :)
But I'll try MathGeoLib and see how it performs on equal terms (possibly way faster)

I was thinking the same.

so I tried out mathgeolib running on battery:


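#include <cstdlib>   // atoi
#include <iostream>  // cout, cerr, endl

#include "mymath/mymath.h"
#include "math_geo_lib/MathGeoLib.h"

#include "SFML/System.hpp"

int main( int argc, char** args )
{
  using namespace mymath;
  using namespace std;

  const int size = 4;

  // the 16 matrix elements are read from the command line
  if( argc < size * size + 1 )
  {
    cerr << "usage: " << args[0] << " <16 matrix elements>" << endl;
    return 1;
  }

  // same count as the original 10e6 + 1; it is odd, so each loop ends on
  // the inverse of the input rather than on the input itself
  const int iterations = 10000001;

  sf::Clock clock;
  mat4 m1;

  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m1[x][y] = atoi( args[x * size + y + 1] );

  // time mymath
  clock.restart();

  for( int c = 0; c < iterations; ++c )
    m1 = inverse( m1 );

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m1 << endl; // printing the result keeps the loop from being optimized away

  math::float4x4 m2;

  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m2[x][y] = atoi( args[x * size + y + 1] );

  // time MathGeoLib
  clock.restart();

  for( int c = 0; c < iterations; ++c )
    m2 = m2.Inverted();

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m2 << endl;

  return 0;
}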

and the results:

mymath w/ sse: 3 seconds

mathgeolib: 19 seconds

should I enable some compiler switches?
I just ran cmake specifying the new g++ as the compiler and setting CMAKE_BUILD_TYPE to Release

The CMakeLists.txt does not enable SSE by default. In that mode, running .Inverted() depends on Gaussian elimination for best numerical performance. To enable SSE for that function, enable the MATH_AUTOMATIC_SSE and MATH_SSE flags on the build. See https://github.com/juj/MathGeoLib/blob/master/src/Math/float4x4.cpp#L1371
