Yours3!f

mat4 inverse sse optimization

12 posts in this topic

hi,

 

I'm trying to optimize the inverse function of my maths lib. I've already implemented everything using SSE, and now I'm validating that it works and making sure it's actually faster... most of the time it's really good, I get 5-10x speedups everywhere, except in the mat4 inverse function.

 

code here: https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_mat_func.h (line 397)

 

here's the testcase:
 

#include "mymath/mymath.h"

#include "SFML/System.hpp"

int main( int argc, char** args )
{
  using namespace mymath;
  using namespace std;

  sf::Clock clock;
  mat4 m1;

  int size = 4;

  // fill the matrix from the command line,
  // so the values aren't known at compile time
  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m1[x][y] = atoi( args[x * size + y + 1] );

  clock.restart();

  for( int c = 0; c < 10e6 + 1; ++c )
    m1 = inverse( m1 );

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m1 << endl;

  return 0;
}

With SSE enabled I get 5 seconds of execution time; without it, only 1 second.
So I tried to find out what is causing this. I used objdump to grab the assembly from the executables.

 

without sse (line 393):

http://pastebin.com/nVYJSp56

 

with sse (line 763):

http://pastebin.com/VNzvJr4U

 

As you can see, when I disable the SSE functionality the compiler somehow recognizes the optimization opportunities and does a much better job of generating efficient code.
If you look at the SSE version you can see that it's poor assembly code, with lots of non-SSE instructions that hinder speed.

 

Any idea what I'm doing wrong?

 

PS: I'm not that good at assembly, I can barely read it. I think the 'without SSE' code is better because no non-SSE instructions are called.

PS2: I have no idea why the compiler isn't inlining the inverse function, or why it compiled other functions into the executable even though they're not even used.

I used 64-bit Ubuntu 12.04 Linux (kernel 3.2.0-58-generic) and gcc 4.6.3 with "-O3 -Wall -Wno-long-long -ansi -pedantic -std=c++0x",

and objdump to get the assembly.

 

best regards,

Yours3!f

Edited by Yours3!f

x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?

 

I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).

 

You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yey for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.


x86-64 has SSE by default, so what exactly do you mean by SSE giving you a 5-10x speedup? Did you measure that speedup on x86 instead of x86-64?

 

I honestly have no idea what the old sse flag does to x86-64. I would have guessed it just gets silently ignored since it's redundant, but apparently you've proven it doesn't get ignored (at least in 4.6.3).

 

You might try compiling your code with gcc 4.8.x if at all possible. It has a lot of optimization improvements, including the local register allocator (yey for replacing 20+ year old tech!). If you still have problems with the optimizations in a current gcc, you might try asking the gcc folks about it. I'm pretty sure I can guess the response you'd get if you complain about optimizations in 4.6.3 though.

 

Thank you for the reply :)
So what you're saying is that specifically optimizing for SSE on 64-bit is completely useless?
So is this whole 'use SSE' thing only around because people are still shipping 32-bit executables on Windows (and now on Linux too, thanks to Steam)?
Edit: I measured it on 64-bit only. It was the 'with SSE' code before optimization vs. after; by optimization I mean vectorizing things (you can see that in the git history). I'm using SSE intrinsics specifically, and I have a define to enable it, see line 6 of:

https://github.com/Yours3lf/libmymath/blob/master/mymath/mm_common.h
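
Roughly, the gating looks like this (a sketch only; the macro and function here are made-up names, the real define is the one on line 6 of mm_common.h):

// sketch only: MYMATH_USE_SSE2 and add4() are made-up illustration names
#if defined( MYMATH_USE_SSE2 ) && defined( __SSE2__ )
  #include <emmintrin.h>
#endif

inline void add4( const float* a, const float* b, float* out )
{
#if defined( MYMATH_USE_SSE2 ) && defined( __SSE2__ )
  // SSE2 path: one 128-bit add covers all four components
  _mm_storeu_ps( out, _mm_add_ps( _mm_loadu_ps( a ), _mm_loadu_ps( b ) ) );
#else
  // scalar fallback (which the compiler may still auto-vectorize)
  for( int i = 0; i < 4; ++i )
    out[i] = a[i] + b[i];
#endif
}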

Edit 2: I measured with a 32-bit compilation on 64-bit Linux. The same thing happens: 1.8 seconds without SSE, 4 seconds with.

Edit 3: I tried gcc 4.8.2:
64-bit with SSE: 2.116 s
64-bit without SSE: 0.915 s
32-bit with SSE: 2.052 s
32-bit without SSE: 1.7 s

Edited by Yours3!f

I think I realize where I got confused. I think you were using the -msse flag instead of -mfpmath. For x86-64, -mfpmath=sse is enabled by default, so explicitly setting it should be redundant on x64. On 32-bit x86, the default is 387 (80-bit temporary floats), so you want to override it whenever you don't need 80-bit precision. You could also try -mfpmath=sse,387 to see if it gives better performance, but it probably won't for most CPUs.

 

The -msse flag says to use SSE1; SSE2 is -msse2, SSE3 is -msse3, etc. You should probably just use -mtune instead of messing with all that. On x86, you can have -march=i686 -mtune=nocona -mfpmath=sse when targeting SSE3+. That means the code runs on any 686, is optimized for MMX and SSE1-3, and uses 64-bit SSE math instead of 80-bit 387 math. Just change to -march=x86-64 for x64 (at least I think x86-64 is the generic). You can see the different architecture choices like nocona in the gcc i386/x64 options. Since you use Linux, you've probably seen nocona in the names of pre-built binaries before.

 

Vectorizing source code often pays off, but don't specifically use SSE intrinsics in C/C++ unless the compiler just won't generate the right code. When vectorizing, try to make the data width as wide as makes sense. AVX (Sandy Bridge+/Bulldozer+ CPUs, Linux 2.6.30+, Win7+) has 256-bit width, and AVX-512 is around the corner. If you explicitly used SSE intrinsics, you'd have to rewrite the code as soon as you want to target AVX2 instead of just setting -mtune=core-avx2.
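
To illustrate (a sketch, not your code): a plain loop over contiguous floats like this is what the auto-vectorizer handles well for whatever -march/-mtune/-mfpmath you pick, no intrinsics needed:

// plain C++ the compiler can auto-vectorize: contiguous data, simple loop,
// __restrict telling gcc the arrays don't alias
void scale_add( float* __restrict out, const float* __restrict a,
                const float* __restrict b, float s, int n )
{
  for( int i = 0; i < n; ++i )
    out[i] = a[i] + s * b[i];
}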

Edited by richardurich

I think I realize where I got confused. I think you were using the -msse flag instead of -mfpmath. For x86-64, -mfpmath=sse is enabled by default, so explicitly setting it should be redundant on x64. On 32-bit x86, the default is 387 (80-bit temporary floats), so you want to override it whenever you don't need 80-bit precision. You could also try -mfpmath=sse,387 to see if it gives better performance, but it probably won't for most CPUs.

 

The -msse flag says to use SSE1; SSE2 is -msse2, SSE3 is -msse3, etc. You should probably just use -mtune instead of messing with all that. On x86, you can have -march=i686 -mtune=nocona -mfpmath=sse when targeting SSE3+. That means the code runs on any 686, is optimized for MMX and SSE1-3, and uses 64-bit SSE math instead of 80-bit 387 math. Just change to -march=x86-64 for x64 (at least I think x86-64 is the generic). You can see the different architecture choices like nocona in the gcc i386/x64 options. Since you use Linux, you've probably seen nocona in the names of pre-built binaries before.

 

Vectorizing source code often pays off, but don't specifically use SSE intrinsics in C/C++ unless the compiler just won't generate the right code. When vectorizing, try to make the data width as wide as makes sense. AVX (Sandy Bridge+/Bulldozer+ CPUs, Linux 2.6.30+, Win7+) has 256-bit width, and AVX-512 is around the corner. If you explicitly used SSE intrinsics, you'd have to rewrite the code as soon as you want to target AVX2 instead of just setting -mtune=core-avx2.

 

Thank you, this cleared up a lot of things. I explicitly used SSE2 intrinsics to squeeze out as much performance as possible, since I'd like to learn that kind of stuff. I'm targeting up to SSE3. SSE2/3 works great for vec4s since they're 128 bits wide, and I can't think of how AVX would give me benefits except for matrix code. But again, since I can disable all the hacking I did in SSE2 with just a compiler switch, why not go ahead and explore that?

Now, with this information in hand, I rewrote the CMake file. Can you please take a look at it and check whether I'm doing it right?
https://github.com/Yours3lf/libmymath/blob/master/CMakeLists.txt

Edit: I think I'm starting to do it right :)
On 32-bit with explicit SSE2: 2.028 seconds
On 32-bit without: 1.877 seconds
On 64-bit with explicit SSE2: 2.05 seconds
On 64-bit without: 1.637 seconds

The motivation is that the sine function takes 0.3 seconds with explicit SSE2 and 0.5 without, on 64-bit :)

Edited by Yours3!f

This is a bad application of SIMD and is unlikely to show significant benefits due to the overhead of loading scalar data into vector registers. If you look at the generated assembly, a significant number of instructions are either shuffles or loads. These aren't doing useful work and are the reason why your vectorized implementation is actually slower. I wouldn't count on the compiler to vectorize anything other than trivial computations.

 

You need to look into formatting your data into aligned Structures of Arrays (SoA) format. For instance, rather than having a Vector4 or Matrix4 class that uses SIMD instructions to operate on a single vector or matrix, have a SIMDVector4 that operates on 4 different 4-component vectors at once, one component at a time.
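
As a rough sketch of the idea (not from any particular library), each SSE register holds the same component of four different vectors:

#include <xmmintrin.h>

// SoA "four vec4s at once": x holds x0..x3, y holds y0..y3, etc.
struct SIMDVector4
{
  __m128 x, y, z, w;
};

// four dot products computed at once, no shuffles required
inline __m128 dot4( const SIMDVector4& a, const SIMDVector4& b )
{
  return _mm_add_ps( _mm_add_ps( _mm_mul_ps( a.x, b.x ),
                                 _mm_mul_ps( a.y, b.y ) ),
                     _mm_add_ps( _mm_mul_ps( a.z, b.z ),
                                 _mm_mul_ps( a.w, b.w ) ) );
}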


Do you mean you're explicitly using sse2 intrinsics? If so, try just using them in the sine function (and anywhere else they help) and let the compiler do the magic elsewhere.

 

In use_32_bit, -msse3 is redundant if you're specifying -mtune=nocona on both 32-bit and 64-bit.

For 64-bit, you have to specify -march=x86-64 (the generic for 64-bit, like i686 for 32-bit) if you want the binaries to have a fallback code path for pre-nocona 64-bit chips.

The flags in the use_explicit_sse2 else branch ("-mtune=nocona -mfpmath=sse") should always be set, not only when you're not using intrinsics (I'm guessing at the purpose of that option). That might get you the missing 0.15/0.4 seconds back.

 

I don't normally use cmake, so my apologies if I made any mistakes on that front. It seemed straightforward enough though.


Do you mean you're explicitly using sse2 intrinsics? If so, try just using them in the sine function (and anywhere else they help) and let the compiler do the magic elsewhere.

 

In use_32_bit, -msse3 is redundant if you're specifying -mtune=nocona on both 32-bit and 64-bit.

For 64-bit, you have to specify -march=x86-64 (the generic for 64-bit, like i686 for 32-bit) if you want the binaries to have a fallback code path for pre-nocona 64-bit chips.

The flags in the use_explicit_sse2 else branch ("-mtune=nocona -mfpmath=sse") should always be set, not only when you're not using intrinsics (I'm guessing at the purpose of that option). That might get you the missing 0.15/0.4 seconds back.

 

I don't normally use cmake, so my apologies if I made any mistakes on that front. It seemed straightforward enough though.

Yeah, if you take a look at the files, there are fvec versions of the vec2/3/4 files, and there's even an sse file full of arithmetic etc. functions (all containing explicit SSE instructions). I've actually implemented them everywhere, because I'd like it to be as fast as possible (as in the case of the sine function). Also, as Aressera mentioned, the compiler may or may not do a good job.

 

I turned on -msse3 because the explicit instructions need it, but I've now moved that to that switch (I've updated the CMake file again :) ).
Actually, enabling it everywhere gave me a 0.04-0.05 second speed penalty :( but that's not too much for a million runs.

CMake is great!!! ;) It is as straightforward as it seems.


This is a bad application of SIMD and is unlikely to show significant benefits due to the overhead of loading scalar data into vector registers. If you look at the generated assembly, a significant number of instructions are either shuffles or loads. These aren't doing useful work and are the reason why your vectorized implementation is actually slower. I wouldn't count on the compiler to vectorize anything other than trivial computations.

 

You need to look into formatting your data into aligned Structures of Arrays (SoA) format. For instance, rather than having a Vector4 or Matrix4 class that uses SIMD instructions to operate on a single vector or matrix, have a SIMDVector4 that operates on 4 different 4-component vectors at once, one component at a time.

 

Yeah I suspected that, but now I can be sure, thanks!

Edit: I checked again; it seems that with the compiler settings richardurich advised, the compiler now recognizes that it should do all this in SSE (there are still some non-SSE instructions though), and the context-switching hell doesn't happen anymore :)
http://pastebin.com/jsDS3U5c (line 689)

 

I actually looked into it today, it's this, right?

struct vec4 { float x, y, z, w; };
std::vector<vec4> data; // AoS
struct vec4_soa { float *x, *y, *z, *w; } data_soa; // SoA

Also, it's used for optimizing cache access, and it's highly dependent on the access patterns, right?
Now, usually on the CPU side (and on the GPU too, in the context of gamedev) people are dealing with vectors, so why would it be good to store the individual components so far away from each other?
Like
float a[1024]; float b[1024];
In this case the individual components would be pretty far away from each other, right? And if we'd like to act on each component in parallel, that would mean cache misses, right?
So I came to the conclusion that in this case AoS would be better, but prove me wrong :)
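
For reference, this is roughly how a single SSE load maps onto the two layouts (a sketch only, the helper names are made up):

#include <xmmintrin.h>

struct vec4     { float x, y, z, w; };                        // AoS element
struct vec4_soa { float *x, *y, *z, *w; };                    // SoA arrays

// AoS: one load grabs all four components of ONE vector
inline __m128 load_aos( const vec4* v, int i )
{
  return _mm_loadu_ps( &v[i].x );
}

// SoA: one load grabs the x components of FOUR consecutive vectors,
// so walking a big array still reads each component array sequentially
inline __m128 load_soa_x( const vec4_soa& v, int i )
{
  return _mm_loadu_ps( &v.x[i] );
}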

Edited by Yours3!f

 

Edit 3: I tried gcc 4.8.2:
64-bit with SSE: 2.116 s
64-bit without SSE: 0.915 s
32-bit with SSE: 2.052 s
32-bit without SSE: 1.7 s

 

 

If I understood the code correctly via a quick glance, it looks like you're measuring the time to do ten million matrix inverses? In the fastest result, that corresponds to 170 nanoseconds per single inverse. If you are interested in a data point for comparison, here are the benchmark results of a 4x4 SSE inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db

 

Look for "float4x4::Inverse". On those bots, the best result comes on a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with rdtsc instruction). 

 

In general optimizing 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE instruction path does end up ahead. In a lot of problems, you just don't have four matrices to invert simultaneously, so using SoA - while being more effective since it's practically shuffle-free - is somewhat utopistic.
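
For reference, the cycle measurements are conceptually just this (a sketch, not the actual MathGeoLib benchmark code; __rdtsc() comes from <x86intrin.h> on gcc, and serialization and proper averaging are omitted):

#include <x86intrin.h> // __rdtsc() on gcc/clang

// rough cycles-per-call estimate; real benchmarks serialize (cpuid/lfence)
// and take the best of many runs
template <typename F>
unsigned long long cycles_per_call( F f, int iterations )
{
  unsigned long long start = __rdtsc();
  for( int i = 0; i < iterations; ++i )
    f();
  return ( __rdtsc() - start ) / iterations;
}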


 

 


 

 

If I understood the code correctly via a quick glance, it looks like you're measuring the time to do ten million matrix inverses? In the fastest result, that corresponds to 170 nanoseconds per single inverse. If you are interested in a data point for comparison, here are the benchmark results of a 4x4 SSE inverse on the MathGeoLib buildbots: http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=da780cb4df817c75e6333557b311c33897c598db

 

Look for "float4x4::Inverse". On those bots, the best result comes on a Mac Mini, where a matrix inverse takes on average 17 nanoseconds / 40.7 clock cycles (measured with rdtsc instruction). 

 

In general optimizing 4x4 matrix inverse is a very good application for SSE. Even though there are lots of scalar parts to the algorithm, the SSE instruction path does end up ahead. In a lot of problems, you just don't have four matrices to invert simultaneously, so using SoA - while being more effective since it's practically shuffle-free - is somewhat utopistic.

 

Only one million, which means 1700 nanoseconds :S
But this wouldn't be a fair comparison :) my PC has an A8-4500M APU, which is by far not the fastest :)
But I'll try MathGeoLib and see how it performs on equal terms (possibly way faster).

I was thinking the same.


So I tried out MathGeoLib, running on battery:

#include "mymath/mymath.h"
#include "math_geo_lib/MathGeoLib.h"

#include "SFML/System.hpp"

int main( int argc, char** args )
{
  using namespace mymath;
  using namespace std;

  sf::Clock clock;
  mat4 m1;

  int size = 4;

  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m1[x][y] = atoi( args[x * size + y + 1] );

  clock.restart();

  for( int c = 0; c < 10e6 + 1; ++c )
    m1 = inverse( m1 );

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m1 << endl;

  math::float4x4 m2;

  for( int x = 0; x < size; ++x )
    for( int y = 0; y < size; ++y )
      m2[x][y] = atoi( args[x * size + y + 1] );

  clock.restart();

  for( int c = 0; c < 10e6 + 1; ++c )
    m2 = m2.Inverted();

  cout << clock.getElapsedTime().asMilliseconds() * 0.001f << endl;
  cout << m2 << endl;

  return 0;
}

and the results:

mymath w/ sse: 3 seconds

mathgeolib: 19 seconds

Should I enable some compiler switches? I just ran CMake, specifying the new g++ as the compiler and setting CMAKE_BUILD_TYPE to Release.

