Fast 4x4 Matrix Inverse with SSE SIMD, Explained


Hi guys,

I'm writing my own math library and implemented a matrix inverse function I'd like to share.

The SIMD version I got is more than twice as fast as the non-SIMD version (which is the approach Unreal uses). It is also faster than the equivalents in some other math libraries, such as Eigen and DirectXMath.

[Benchmark chart - results from my test; the first 3 columns are my methods]

If you are interested in either the theory or the implementation, I put together my math derivation and source code in this post:

https://lxjk.github.io/2017/09/03/Fast-4x4-Matrix-Inverse-with-SSE-SIMD-Explained.html
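
To give a quick taste of the approach: the core idea is to treat the 4x4 matrix as a 2x2 matrix of 2x2 sub-matrices, each of which fits in a single __m128. A 2x2 matrix multiply then costs only a few shuffles, multiplies and adds. Below is a simplified sketch of that building block (illustrative naming; the full, tested code is in the post linked above), assuming each 2x2 block is packed row-major as (m00, m01, m10, m11):

#include <xmmintrin.h> // SSE intrinsics

// 2x2 matrix multiply A*B, with one 2x2 block per __m128,
// packed row-major as (m00, m01, m10, m11).
__m128 Mat2Mul(__m128 a, __m128 b)
{
    // result = (a00*b00 + a01*b10, a00*b01 + a01*b11,
    //           a10*b00 + a11*b10, a10*b01 + a11*b11)
    return _mm_add_ps(
        _mm_mul_ps(_mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 2, 0, 0)),   // (a00, a00, a10, a10)
                   _mm_shuffle_ps(b, b, _MM_SHUFFLE(1, 0, 1, 0))),  // (b00, b01, b00, b01)
        _mm_mul_ps(_mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1)),   // (a01, a01, a11, a11)
                   _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 2, 3, 2)))); // (b10, b11, b10, b11)
}

With this and its adjugate-multiply variants, the whole 4x4 inverse reduces to a handful of 2x2 operations; the derivation in the post shows where each one comes from.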

I would appreciate any feedback :)




  • Similar Content

    • By pabloreda
       
      I am coding triangle rasterization using the barycentric-coordinate (edge function) method.
      I have looked at a lot of code and tutorials on the web about optimizing this algorithm, and I found an optimization that I have not seen anywhere.
      I only code the filling of triangles, without a Z-buffer and without textures. I am not comparing speeds and I am not interested in doing so; I am simply reducing the number of instructions executed in the inner loop.
      The idea is simple, so someone must have done it before but not published the code - or maybe hardware already works this way.
      The key observation: for each horizontal line drawn, of the three edge functions you only need to watch one for the negative-to-positive change that tells you where to start drawing, and only one for the positive-to-negative change that tells you where to stop (see the sketch after this post).
      I tried it and it works well. Now I am implementing a regular version with textures and a Z-buffer to figure out how to combine them with this optimization.
      Does anyone know if this optimization has already been done?
      The code is in https://github.com/phreda4/reda4/blob/master/r4/Dev/graficos/rasterize2.txt
      From line 92 to 155
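      To make the idea concrete, here is a minimal illustrative sketch in C++ (phreda4's actual code, linked above, is in his reda4 language; this is my own simplified version, not his). It shows only the simpler half of the trick: because a triangle is convex, each scanline covers one contiguous span, so after the first inside pixel, the first pixel that tests outside ends the scanline. The posted optimization goes further and watches only the single relevant edge function for each sign change:

      struct Vec2 { float x, y; };

      // Edge function: >= 0 when the point (px, py) is on the interior
      // side of edge a->b, assuming counter-clockwise winding.
      static float Edge(const Vec2& a, const Vec2& b, float px, float py)
      {
          return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
      }

      // Flat-fill a CCW triangle inside its bounding box: no Z-buffer, no textures.
      void FillTriangle(Vec2 v0, Vec2 v1, Vec2 v2,
                        int minX, int maxX, int minY, int maxY,
                        void (*plot)(int x, int y))
      {
          for (int y = minY; y <= maxY; ++y)
          {
              bool entered = false;
              for (int x = minX; x <= maxX; ++x)
              {
                  float cx = x + 0.5f, cy = y + 0.5f; // sample at the pixel center
                  bool inside = Edge(v0, v1, cx, cy) >= 0.0f
                             && Edge(v1, v2, cx, cy) >= 0.0f
                             && Edge(v2, v0, cx, cy) >= 0.0f;
                  if (inside)
                  {
                      plot(x, y);
                      entered = true;
                  }
                  else if (entered)
                  {
                      break; // the span ended; nothing else on this scanline
                  }
              }
          }
      }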
       
    • By FFA702
      Hi. It's been a while since I posted here, and my last posts were almost all about this exact same subject. I mention that just to show how long this has been bugging me.
      Here is the problem: I'm trying to make a decent raycaster in C#. The main issue is that for this to happen, I need pixel-by-pixel drawing. My previous raycaster used GDI+, and through several tricks involving pointers and filling a bitmap byte by byte, I was able to get half-decent results and make an online server-client style 3D engine, complete with a map builder and several really cool features. Unfortunately, I wasn't able to expand the project further due to poorly written code (I am a hobbyist; I study Business Administration at uni) and the fact that my quick performance hack could barely carry the bare minimum needed to make a very bare-bones raycaster possible. This came with very real sadness: the realization that the project I had spent almost 2 years on was essentially useless, bloated and impossible to expand on.
      Enough background. Starting anew, my main concern is to find a way to get fast pixel-by-pixel control over the screen. I'm using SFML and C#. My current test bench is pretty simple; it uses a method I found on the internet written for C++, which I adapted for C#. I fill a Color[,] array (each Color is a pixel), then copy the RGB values into a byte[] array before moving them into the texture buffer, and finally display the texture on the screen. I'm not sure what the bottleneck is; the application is faster than my previous one, but it's still too slow for my liking. Raycasters work by redrawing stuff on top of other stuff, and I fear that adding more drawing would grind it to a halt. I'm posting what I have as a test bench right now; any help would be greatly appreciated. Keep in mind that I am not a professional programmer by any means. I am pretty uncomfortable with pointers and true 3D stuff, but I will use them if I must.
      using System;
      using System.Collections.Generic;
      using System.Linq;
      using System.Text;
      using System.Threading.Tasks;
      using SFML.Audio;
      using SFML.Graphics;
      using SFML.System;
      using SFML.Window;

      namespace RayCastFF
      {
          class Program
          {
              public static int TestCounter = 0;
              // An array holding the color of every pixel; this is intended to be
              // the main target of all manipulation and draw calls.
              public static Color[,] ScreenBuffer = new Color[640, 360];
              public static Texture MainViewPort = new Texture(640, 360); // main screen texture

              unsafe static void Main(string[] args)
              {
                  // MAIN WINDOW SETUP
                  RenderWindow window = new RenderWindow(new VideoMode(640, 360), "RayCaster");

                  while (window.IsOpen) // MAIN GAME LOOP
                  {
                      // CALL FOR UPDATE
                      Update();

                      // DRAW
                      window.Clear();
                      window.DispatchEvents();
                      Sprite mainviewport = new Sprite(MainViewPort);
                      window.Draw(mainviewport); // draw the texture over the screen
                      window.Display();

                      // TAKE INPUT
                  }
              }

              static void Update()
              {
                  TestCounter++;
                  if (TestCounter > 639) { TestCounter = 0; }

                  // RESET THE BUFFER (COULD BE REMOVED LATER I GUESS)
                  for (int x = 0; x < 640; x++)
                  {
                      for (int y = 0; y < 360; y++)
                      {
                          ScreenBuffer[x, y] = Color.Black;
                      }
                  }

                  // DO STUFF (for this test, I simply draw a moving line)
                  DrawLine(Color.Red, TestCounter, 200, 100);

                  // WRITING THE BUFFER INTO THE IMAGE
                  // THIS SHOULD ALWAYS BE THE LAST STEP OF THE UPDATE METHOD
                  byte[] pixels = new byte[640 * 360 * 4]; // 640 x 360 pixels x 4 bytes per pixel
                  Color[] cpixels = new Color[640 * 360];  // intermediary step to keep everything clear
                  for (int x = 0; x < 640; x++)
                  {
                      for (int y = 0; y < 360; y++)
                      {
                          // Arrange the pixels in row-major order, the layout the texture expects.
                          cpixels[x + (640 * y)] = ScreenBuffer[x, y];
                      }
                  }
                  for (int i = 0; i < 640 * 360 * 4; i += 4) // fill the byte array
                  {
                      pixels[i + 0] = cpixels[i / 4].R;
                      pixels[i + 1] = cpixels[i / 4].G;
                      pixels[i + 2] = cpixels[i / 4].B;
                      pixels[i + 3] = cpixels[i / 4].A;
                  }
                  MainViewPort.Update(pixels); // update the texture with the array
              }

              // Simple draw method making a vertical line at column Wpos from Ybottom to Ytop.
              static void DrawLine(Color color, int Wpos, int Ytop, int Ybottom)
              {
                  for (int y = Ybottom; y < Ytop; y++)
                  {
                      ScreenBuffer[Wpos, y] = color;
                  }
              }
          }
      }

      What I'd like to end up with is a very fast way to draw pixels to the window through the abstraction of a single 640x360 2D array that I can easily and simply manipulate. However, while this is simple, it's also somewhat slow, and for some reason it uses 30% GPU load on a GTX 1070 8GB. Again, any help would be greatly appreciated.
      Thanks in advance.
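      A few things in the test bench above may be worth measuring in isolation (suggestions, not a confirmed diagnosis): pixels, cpixels and the Sprite are recreated every frame, and every pixel is copied twice (ScreenBuffer to cpixels to pixels). A common pattern is to keep one RGBA byte array alive for the whole program and have the draw methods write into it directly, so the per-frame work is a single texture upload. Here is a minimal sketch of that pattern against SFML's native C++ API (SFML.Net mirrors it closely; Texture.Update(pixels) above is the same call):

      #include <SFML/Graphics.hpp>
      #include <cstring>
      #include <vector>

      int main()
      {
          const unsigned W = 640, H = 360;
          sf::RenderWindow window(sf::VideoMode(W, H), "RayCaster");
          sf::Texture texture;
          texture.create(W, H);
          sf::Sprite sprite(texture); // created once, not every frame

          // One persistent RGBA buffer; draw code writes straight into it,
          // so there is no intermediate color array and no second copy.
          std::vector<sf::Uint8> pixels(W * H * 4, 0);

          unsigned counter = 0;
          while (window.isOpen())
          {
              sf::Event event;
              while (window.pollEvent(event))
                  if (event.type == sf::Event::Closed)
                      window.close();

              std::memset(pixels.data(), 0, pixels.size()); // clear to black

              // Moving vertical red line, as in the test bench.
              counter = (counter + 1) % W;
              for (unsigned y = 100; y < 200; ++y)
              {
                  sf::Uint8* p = &pixels[(y * W + counter) * 4]; // row-major, 4 bytes per pixel
                  p[0] = 255; // R
                  p[3] = 255; // A (G and B stay 0)
              }

              texture.update(pixels.data());
              window.clear();
              window.draw(sprite);
              window.display();
          }
      }

      The same structure carries over to C# directly: one long-lived byte[], draw methods that compute the (y * 640 + x) * 4 offset themselves, and a single Texture.Update per frame.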
    • By Tanzan
      Hello all,
      My question is a bit hard to describe, but hopefully this will do...
      I just wonder what you guys think is the 'best' way of getting info about the model into your view(s).
      To clarify (I hope ;-) ):
      If the model updates itself every game cycle, and the (deeply) nested objects all do their jobs, how do you get the info that only the view is interested in?
      So my question is not how to do it, but rather: what do you think is the best way to do it?
       
      Regards,
       
      Alex   
    • By aejt
      Sorry for making a new thread about this, but I have a specific question which I couldn't find an answer to in any of the other threads I've looked at.
      I've been trying to get the method shown here to work for several days now, and I've run out of things to try.
      I've more or less resorted to using the bare-bones example shown there (with some very minor modifications, as it wouldn't run otherwise), but I still can't get it to work. Either I have misunderstood something completely, or there's a mistake somewhere.
       
      My shader code looks like this:
      Vertex shader:
      #version 330 core
      // Vertex shader

      // Half the size of the near plane: {tan(fovy/2.0) * aspect, tan(fovy/2.0)}
      uniform vec2 halfSizeNearPlane;

      layout (location = 0) in vec3 clipPos;
      // UV for the depth buffer/screen access.
      // (0,0) in bottom left corner, (1,1) in top right corner
      layout (location = 1) in vec2 texCoord;

      out vec3 eyeDirection;
      out vec2 uv;

      void main()
      {
          uv = texCoord;
          eyeDirection = vec3((2.0 * halfSizeNearPlane * texCoord) - halfSizeNearPlane, -1.0);
          gl_Position = vec4(clipPos.xy, 0, 1);
      }

      Fragment shader:
      #version 330 core
      // Fragment shader

      layout (location = 0) out vec3 fragColor;

      in vec3 eyeDirection;
      in vec2 uv;

      uniform mat4 persMatrix;
      uniform vec2 depthrange;
      uniform sampler2D depth;

      vec4 CalcEyeFromWindow(in float windowZ, in vec3 eyeDirection, in vec2 depthrange)
      {
          float ndcZ = (2.0 * windowZ - depthrange.x - depthrange.y) / (depthrange.y - depthrange.x);
          float eyeZ = persMatrix[3][2] / ((persMatrix[2][3] * ndcZ) - persMatrix[2][2]);
          return vec4(eyeDirection * eyeZ, 1);
      }

      void main()
      {
          vec4 eyeSpace = CalcEyeFromWindow(texture(depth, uv).x, eyeDirection, depthrange);
          fragColor = eyeSpace.rbg;
      }

      Where my camera settings are:
      float fov = glm::radians(60.0f);
      float aspect = 800.0f / 600.0f;
      And my uniforms equal:
      uniform mat4 persMatrix = glm::perspective(fov, aspect, 0.1f, 100.0f)
      uniform vec2 halfSizeNearPlane = glm::vec2(glm::tan(fov/2.0) * aspect, glm::tan(fov/2.0))
      uniform vec2 depthrange = glm::vec2(0.0f, 1.0f)
      uniform sampler2D depth is a GL_DEPTH24_STENCIL8 texture which has depth values from an earlier pass (if I linearize it and set fragColor = vec3(linearizedZ), it shows up like it should, so nothing seems wrong there).
      I can confirm that it's wrong because it doesn't give results similar to what saving the position in the G-buffer or reconstructing it using inverse matrices gives.
      Is there something obvious I'm missing? To me the logic seems sound, and from the description on the Khronos wiki I can't see where I go wrong.
      Thanks!
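      One step that is easy to lose in that wiki sample is the sign convention of eyeZ, so it may be worth working through what it evaluates to for glm::perspective (a worked check, not a confirmed diagnosis). With GLSL/glm column-major indexing, persMatrix[2][2] = -(f+n)/(f-n), persMatrix[3][2] = -2fn/(f-n) and persMatrix[2][3] = -1. Projecting an eye-space point with z-component z_e gives clip.w = -z_e, so:

      z_ndc = (persMatrix[2][2] * z_e + persMatrix[3][2]) / (-z_e)

      and solving for z_e:

      z_e = persMatrix[3][2] / (persMatrix[2][3] * z_ndc - persMatrix[2][2])

      which is exactly the eyeZ computed in CalcEyeFromWindow. So eyeZ is negative for everything in front of the camera (OpenGL eye space looks down -z), while eyeDirection was built with a z-component of -1; eyeDirection * eyeZ therefore comes out with a positive z, i.e. the true position negated. Comparing against eyeDirection * (-eyeZ) might be a quick sanity check. (The .rbg swizzle on the output also swaps the green and blue channels, which may or may not be intentional for visualization.)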
    • By JakNow
      Hello everyone!
      Right now I am writing my own physics engine in Java for an LWJGL 3 3D game, and I would like to run my ideas past you guys. It's not about writing the actual code, but about asking whether my solution is good and/or whether it can be better. I would also like to make it easy to refactor for other render engines and game-loop engines. So let's get started!
      The base game architecture looks like this:
      [architecture diagram]
      The Core holds just the information about the game itself, so whenever I decide to write a new game I only have to edit this module.
      The render engine holds just the information about rendering the models; it only gets the material and mesh data from a model.
      The Model module holds 4 basic pieces of information about a model:
      - Models - the basic Model holds only information about its ModelView, position, rotation and scale. Other model types inherit from it and add unique params (AnimatedModel adds animation mesh data). A ModelView is built of ModelParts, which are built from TexturedMeshes (explained later).
      - Loaders - classes that load a specific model type (i.e. an Assimp loader for *.obj files) and processing classes that create the data necessary to render a model (i.e. create a Mesh, which holds the vboID, the vertices/textures/normals arrays, etc.).
      - Components - every model can have components, i.e. moveable, which allows the object to move around the world.
      - Materials - used together with a Mesh to create a TexturedMesh. A Material holds information about the diffuse, ambient, etc. colors and the diffuse and normal textures.
      The PhysicsEngine module has the core (initiation of the physics world), collision detection, CollidableComponent (which inherits from BaseComponent) and Shapes (i.e. AABB, Sphere, Cylinder, MeshCollider). This is the part I would like to discuss with you guys (however, if you have something to say about the other parts, please go for it!).
      - Core: PhysicState - initiation of the physics world, update methods; holds default data (i.e. the default narrow collision shape)
      - Collision: Broad Phase Collision Detection (BPCD) and Narrow Phase Collision Detection (NPCD)
      - CollidableComponent - a component that can be added to a model to make it collidable (in the future I was planning to add other components, such as a WindComponent for grass models, adding reaction to wind). Only models with a CollidableComponent are checked in BPCD and NPCD; the rest are ignored. CollidableComponent also has a boolean isMoveable - i.e. a rock is collidable but is never, ever going to move, so it doesn't have to be checked against other non-moveable components in BPCD and NPCD.
      - Shapes - basic shapes and info about them (AABB - min/max points, Sphere - center and radius, etc.)
      More info is shown in the diagram below:
      [diagram]
      Right now it works like this:
      I create a model and add a CollidableComponent to it like this:
      public CollidableComponent(Model model, TypeOfShape typeOfShape, boolean isMoveable)
      TypeOfShape declares the basic broad-phase collision shape (AABB, Sphere, Cylinder, Mesh). The shape is created from the raw data of the model and transformed with its actual data (position, rotation*, scale). If I want to, I can add the narrow-phase collision shape map, which declares the CollisionShape for each ModelPart inside the ModelView. In most cases for me it's going to be MeshCollider (since the game I'm planning to create is in a low-poly style).
      IDEA 1: When a CollidableComponent is created, it is automatically added to the BPCD map so its collisions are checked. Of course this is just temporary; later on I would have to set a limit on the map size (i.e. 500), or split the world into smaller parts and add only the entities in a given part to that part's BPCD. So this is one spot where you guys could give me some advice.
      IDEA 2: Collision Detection update:
      Right now the update works like this:
      public void update() {
          narrowPhase.narrowPhaseCollisionMap.clear();
          if (!broadPhaseCollisionMap.isEmpty()) {
              for (Model model : broadPhaseCollisionMap.keySet()) {
                  if ((model.getComponent(CollidableComponent.class)).isMoveable()) {
                      for (Model model2 : broadPhaseCollisionMap.keySet()) {
                          if (!model.equals(model2)) {
                              CollisionShape cs1 = getCollisionShape(model);
                              CollisionShape cs2 = getCollisionShape(model2);
                              if (checkCollision(cs1, cs2)) {
                                  narrowPhase.narrowPhaseCollisionMap.add(model);
                                  narrowPhase.narrowPhaseCollisionMap.add(model2);
                              }
                          }
                      }
                  }
              }
          }
          if (!narrowPhase.narrowPhaseCollisionMap.isEmpty()) {
              narrowPhase.update();
          }
      }

      so:
      1. It checks whether the BPC map is non-empty; if it is empty, nothing happens.
      2. It loops through all the models in the map and checks isMoveable - as I said, I ignore collision detection for objects that don't move.
      3. A second loop goes through the models and checks that the model from the first loop isn't the model from the second loop. If they are the same, the pair is ignored.
      4. If they are two different models, it retrieves the BPC shapes from the models, and if a model is moveable it updates its CollisionShape data (with the current position, rotation*, scale*).
      5. It checks the intersection between these two shapes, and if they intersect, both models are added to the NPC list.
      6. After the BPCD loops, if the NPC list is not empty, its update runs.
      The NPCD update is pretty similar to BPCD, with 2 exceptions:
      1. It uses a List<Models> instead of a Map<Model, CollidableComponents> (from a model I can retrieve the info about its CollidableComponent, so I might use a List in BPCD as well instead of a Map**).
      2. It checks collision intersections the same way as BPCD, but for each ModelPart of Model_1 against each ModelPart of Model_2; it returns true and/or the exact collision point, etc., and then breaks out of this model-model loop (so it doesn't check whether the other parts of the two models also collide with each other).
      With my calculations for 50 objects - 27 static and 23 movable, at random positions (but some colliding) - the results were:
      Took: 0.0 ms for 1224 collision checks and 24 positive collisions in BPCD
      Took: 10.0 ms for 55776 collision checks and 576 positive collisions in NPCD
      Took: 11.0 ms in total for BPCD and NPCD
      I can see huge room to improve the collision detection update methods, but I can't find it myself, so I hope you guys can help me out (see the sketch at the end of this post). Maybe not all models have to be checked in NPCD - i.e. check how far from the camera they are, and past some distance just skip NP, since it won't be visible anyway?
      Well, that's all! Sorry for the somewhat long post, but I hope you at least enjoyed reading it.
       
      *Actually just forgot about adding it to calculation  
      **Just came to my head when I was writing this topic 
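      Two small, concrete things stand out in the posted update() (illustrative suggestions, not a drop-in fix; sketched in C++ for brevity, but the structure maps directly onto the Java collections above). First, the double loop visits ordered pairs, so two moveable models get tested against each other twice; starting the inner loop after the outer element visits each unordered pair exactly once. Second, getCollisionShape is fetched per pair rather than per model. A sketch:

      #include <cstddef>
      #include <utility>
      #include <vector>

      // Minimal broad-phase stand-ins for the engine's real types.
      struct AABB { float minX, minY, minZ, maxX, maxY, maxZ; };
      struct Body { AABB broadShape; bool moveable; };

      static bool Intersects(const AABB& a, const AABB& b)
      {
          return a.minX <= b.maxX && b.minX <= a.maxX
              && a.minY <= b.maxY && b.minY <= a.maxY
              && a.minZ <= b.maxZ && b.minZ <= a.maxZ;
      }

      // Each unordered pair is visited once (j starts at i + 1), and
      // static-static pairs are skipped outright, as in the posted version.
      void BroadPhase(std::vector<Body>& bodies,
                      std::vector<std::pair<Body*, Body*>>& outPairs)
      {
          for (std::size_t i = 0; i < bodies.size(); ++i)
          {
              for (std::size_t j = i + 1; j < bodies.size(); ++j)
              {
                  if (!bodies[i].moveable && !bodies[j].moveable)
                      continue; // two static bodies can never start colliding
                  if (Intersects(bodies[i].broadShape, bodies[j].broadShape))
                      outPairs.emplace_back(&bodies[i], &bodies[j]);
              }
          }
      }

      For IDEA 1, the usual step beyond capping the map size is exactly the world-splitting you describe: a uniform grid (or octree/BVH), so each moveable body is only tested against bodies in its own and neighboring cells. And since NPCD dominates your numbers (55776 checks), an AABB early-out per ModelPart pair before the mesh-level test would likely remove most of them.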