Jump to content
  • Advertisement

Fast 4x4 Matrix Inverse with SSE SIMD, Explained

Recommended Posts

Hi guys,

I'm writing my math library and implemented some matrix inverse function I would like to share.

The SIMD version I got is more than twice as fast as non-SIMD version (which is what Unreal is using). It is also faster than some other math libraries like Eigen or DirectX Math.


(result from my test, the first 3 columns are my methods)


If you are interested in either theory or implementation, I put together my math derivation and source code in this post:


I would appreciate any feedback :)

Edited by lxjk

Share this post

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Advertisement
  • Advertisement
  • Popular Tags

  • Popular Now

  • Advertisement
  • Similar Content

    • By cgaish
      I am trying to implement a custom texture atlas creator tool in C++, need suggestion regarding any opensource fast API or library for image import and export?
      Also this tool will compress the final output atlas image into multiple formats like DXT5, PVRTC and ETC based on user input, what should be the best way to implement this?
    • By lxjk
      Hi guys,
      There are many ways to do light culling in tile-based shading. I've been playing with this idea for a while, and just want to throw it out there.
      Because tile frustums are general small compared to light radius, I tried using cone test to reduce false positives introduced by commonly used sphere-frustum test.
      On top of that, I use distance to camera rather than depth for near/far test (aka. sliced by spheres).
      This method can be naturally extended to clustered light culling as well.
      The following image shows the general ideas

      Performance-wise I get around 15% improvement over sphere-frustum test. You can also see how a single light performs as the following: from left to right (1) standard rendering of a point light; then tiles passed the test of (2) sphere-frustum test; (3) cone test; (4) spherical-sliced cone test

      I put the details in my blog post (https://lxjk.github.io/2018/03/25/Improve-Tile-based-Light-Culling-with-Spherical-sliced-Cone.html), GLSL source code included!
    • By akshayMore
      I am trying to make a GeometryUtil class that has methods to draw point,line ,polygon etc. I am trying to make a method to draw circle.  
      There are many ways to draw a circle.  I have found two ways, 
      The one way:
      public static void drawBresenhamCircle(PolygonSpriteBatch batch, int centerX, int centerY, int radius, ColorRGBA color) { int x = 0, y = radius; int d = 3 - 2 * radius; while (y >= x) { drawCirclePoints(batch, centerX, centerY, x, y, color); if (d <= 0) { d = d + 4 * x + 6; } else { y--; d = d + 4 * (x - y) + 10; } x++; //drawCirclePoints(batch,centerX,centerY,x,y,color); } } private static void drawCirclePoints(PolygonSpriteBatch batch, int centerX, int centerY, int x, int y, ColorRGBA color) { drawPoint(batch, centerX + x, centerY + y, color); drawPoint(batch, centerX - x, centerY + y, color); drawPoint(batch, centerX + x, centerY - y, color); drawPoint(batch, centerX - x, centerY - y, color); drawPoint(batch, centerX + y, centerY + x, color); drawPoint(batch, centerX - y, centerY + x, color); drawPoint(batch, centerX + y, centerY - x, color); drawPoint(batch, centerX - y, centerY - x, color); } The other way:
      public static void drawCircle(PolygonSpriteBatch target, Vector2 center, float radius, int lineWidth, int segments, int tintColorR, int tintColorG, int tintColorB, int tintColorA) { Vector2[] vertices = new Vector2[segments]; double increment = Math.PI * 2.0 / segments; double theta = 0.0; for (int i = 0; i < segments; i++) { vertices[i] = new Vector2((float) Math.cos(theta) * radius + center.x, (float) Math.sin(theta) * radius + center.y); theta += increment; } drawPolygon(target, vertices, lineWidth, segments, tintColorR, tintColorG, tintColorB, tintColorA); } In the render loop:
      polygonSpriteBatch.begin(); Bitmap.drawBresenhamCircle(polygonSpriteBatch,500,300,200,ColorRGBA.Blue); Bitmap.drawCircle(polygonSpriteBatch,new Vector2(500,300),200,5,50,255,0,0,255); polygonSpriteBatch.end(); I am trying to choose one of them. So I thought that I should go with the one that does not involve heavy calculations and is efficient and faster.  It is said that the use of floating point numbers , trigonometric operations etc. slows down things a bit.  What do you think would be the best method to use?  When I compared the code by observing the time taken by the flow from start of the method to the end, it shows that the second one is faster. (I think I am doing something wrong here ).
      Please help!  
      Thank you.  
    • By CPPapprentice
      Hi Forum,
      in terms of rendering a tiled game level, lets say the level is 3840x2208 pixels using 16x16 tiles. which method is recommended;
      method 1- draw the whole level, store it in a texture-object, and only render whats in view, each frame.
      method 2- on each frame, loop trough all tiles, and only draw and render it to the window if its in view.
      are both of these methods valid? is there other ways? i know method 1 is memory intensive  but method 2 is processing heavy.
      thanks in advance
    • By wobes
      Hi there. I am really sorry to post this, but I would like to clarify the delta compression method. I've read Quake 3 Networking Model: http://trac.bookofhook.com/bookofhook/trac.cgi/wiki/Quake3Networking, but still have some question. First of all, I am using LiteNetLib as networking library, it works pretty well with Google.Protobuf serialization. But then I've faced with an issue when the server pushes a lot of data, let's say 10 players, and server pushes 250kb/s of data with 30hz tickrate, so I realized that I have to compress it, let's say with delta compression. As I understood, the client and server both use unreliable channel. LiteNetLib meta file says that unreliable packet can be dropped, or duplicated; while sequenced channel says that packet can be dropped but never duplicated, so I think I have to use the sequenced channel for Delta compression? And do I have to use reliable channel for acknowledgment, or I can just go with sequenced, and send the StateId with a snapshot and not separately? 
      Thank you. 
  • Advertisement

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!