# Optimization Towards an Optimal VEX-SSE 3*3*float Matrix Transpose

## Recommended Posts

Hi all,

More than a decade ago, a problem came up on this forum for computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix (so, one row stores 4 elements, aligned in a vector). A version, which I believe to be the fastest currently known, was presented:

On 6/27/2005 at 9:20 PM, ajas95 said:

// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps	xmm0,	xmm7		   // xmm0 : ?? z1 y1 x1
movaps	xmm1,	xmm5		   // xmm1 : ?? z0 y0 x0
unpcklps xmm0,	xmm5		   // xmm0 : y1 y0 x1 x0
unpckhps xmm7,	xmm5		   // xmm7 : ?? ?? z1 z0
movhlps	xmm1,	xmm0		   // xmm1 : ?? z1 y1 y0
shufps	xmm7,	xmm6,	11100100b  // xmm7 : ?? z2 z1 z0
movlhps	xmm0,	xmm6		   // xmm0 : ?? x2 x1 x0
shufps	xmm1,	xmm6,	01010100b  // xmm1 : ?? y2 y1 y0



(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).

I am pleased to report that I have been able to come up with a version which should be faster:

inline void transpose(__m128& A, __m128& B, __m128& C) {
//Input rows in __m128& A, B, and C.  Output in same.
__m128 T0 = _mm_unpacklo_ps(A,B);
__m128 T1 = _mm_unpackhi_ps(A,B);
A = _mm_movelh_ps(T0,C);
B = _mm_shuffle_ps( T0,C, _MM_SHUFFLE(3,1,3,2) );
C = _mm_shuffle_ps( T1,C, _MM_SHUFFLE(3,2,1,0) );
}

This should be 5 instructions instead of ajas95's 8 instructions. Of course, to get that level of performance with either version, you need to inline everything, or else you spend tons of time on moving floating point arguments to/from input registers.

The other thing that is crucial is that the instruction set be VEX encoded. This allows generating instructions that take three arguments, like vunpcklps, instead of instructions like unpcklps that take only two. VEX is only available in AVX and higher (usually passing e.g. -mavx is sufficient to get the compiler to generate VEX instructions).

-G

## Create an account

Register a new account

• ### Forum Statistics

• Total Topics
628345
• Total Posts
2982194
• ### Similar Content

• I am currently attempting to make a navigation mesh for our 2D top down game, which is a multiplayer game using Node.js as the server communication. At the moment, I have implemented A* over an obstacle hardnessmap, which is awfully slow and laggy at times when we test our game on Heroku. I have been trying to find an algorithm to automatically generate the navmesh after map creation, instead of me having to do this manually. I am currently attempting to use Delaunay's Triangulation Divide and Conquer algorithm, but I am running into some issues. I have already asked a question on StackOverflow and am not getting many suggestions and help from it, so I figured I would come here. Is there another algorithm that might be better to use for the navmesh generation in comparison to Deluanay's Triangulation? My current implementation seems extremely buggy during the merge step and I cannot find the error. I have checked over the code countless times, comparing it to the description of the algorithm from http://www.geom.uiuc.edu/~samuelp/del_project.html.
My current code is this:
class MapNode { constructor(x, y) { this.position = new Vector(x, y); this.neighbors = []; } distance(n) { return this.position.distance(n.position); } inNeighbor(n) { for (let i = 0; i < this.neighbors.length; i++) { if (this.neighbors[i] === n) return true; } return false; } addNeighbor(n) { this.neighbors = this.neighbors.filter((node) => node != n); this.neighbors.push(n); } addNeighbors(arr) { let self = this; arr.forEach((n) => self.neighbors.push(n)); } removeNeighbor(n) { this.neighbors = this.neighbors.filter((neighbor) => neighbor != n); } } class Triangle { constructor(p1, p2, p3) { this.p1 = p1; this.p2 = p2; this.p3 = p3; this.neighbors = []; } addNeighbors(n) { this.neighbors.push(n); } } function genSubMat(matrix, ignoreCol) { let r = []; for (let i = 0; i < matrix.length - 1; i++) { r.push([]); for (let j = 0; j < matrix[0].length; j++) { if (j != ignoreCol) r[i].push(matrix[i + 1][j]); } } return r; } function determinantSqMat(matrix) { if (matrix.length != matrix[0].length) return false; if (matrix.length === 2) return matrix[0][0] * matrix[1][1] - matrix[1][0] * matrix[0][1]; let det = 0; for (let i = 0; i < matrix.length; i++) { let r = genSubMat(matrix, i); let tmp = matrix[0][i] * determinantSqMat(r); if (i % 2 == 0) det += tmp; else det -= tmp; } return -det; } // if d is in the circle formed by points a, b, and c, return > 0 // d is on circle, return 0 // d is outside of circle, return < 0 function inCircle(a, b, c, d) { let arr = [a, b, c, d]; let mat = [ [], [], [], [] ]; for (let i = 0; i < arr.length; i++) { mat[i][0] = 1; mat[i][1] = arr[i].position.x; mat[i][2] = arr[i].position.y; mat[i][3] = arr[i].position.x * arr[i].position.x + arr[i].position.y * arr[i].position.y; } return determinantSqMat(mat); } function walkable(from, to, hardnessMap) { let diff = new Vector(to.x - from.x, to.y - from.y); if (Math.abs(diff.x) > Math.abs(diff.y)) diff.scale(Math.abs(1 / diff.x)); else diff.scale(Math.abs(1 / diff.y)); let current = new Vector(from.x + diff.x, from.y + diff.y); while (Math.round(current.x) != to.x || Math.round(current.y) != to.y) { if (hardnessMap[Math.floor(current.y)][Math.floor(current.x)] === 1) return false; current.x += diff.x; current.y += diff.y; } return true; } function getLowest(nodes) { let lowest = nodes[0]; for (let i = 1; i < nodes.length; i++) { if (nodes[i].position.y < lowest.position.y) lowest = nodes[i]; } return lowest; } // returns the angle between 2 vectors, if cw is true, then return clockwise angle between, // else return the ccw angle between. b is the "hinge" point function angleBetween(a, b, c, cw) { let ba = new Vector(a.position.x - b.position.x, a.position.y - b.position.y); let bc = new Vector(c.position.x - b.position.x, c.position.y - b.position.y); let v0 = new Vector(0, 1); let angleBA = v0.angleBetween(ba) * 180 / Math.PI; if (angleBA < 0) angleBA += 360; let angleBC = v0.angleBetween(bc) * 180 / Math.PI; if (angleBC < 0) angleBC += 360; let smallest = Math.min(angleBA, angleBC); let largest = Math.max(angleBA, angleBC); let angle = largest - smallest; return (cw) ? angle : 360 - angle; } function sortSmallestAngle(a, b, list, cw) { list.sort((m, n) => { let vab = new Vector(a.position.x - b.position.x, a.position.y - b.position.y); let vmb = new Vector(m.position.x - b.position.x, m.position.y - b.position.y); let vnb = new Vector(n.position.x - b.position.x, n.position.y - b.position.y); if (cw) return vab.angleBetween(vmb, cw) - vab.angleBetween(vnb, cw); else return vab.angleBetween(vnb, cw) - vab.angleBetween(vmb, cw); }); } // a is in list, b is in the other list function getPotential(a, b, list, cw) { sortSmallestAngle(b, a, list, cw); for (let i = 0; i < list.length - 1; i++) { let angle = angleBetween(b, a, list[i], cw); if (angle > 180) return false; else if (inCircle(a, b, list[i], list[i + 1]) <= 0) return list[i]; else { a.removeNeighbor(list[i]); list[i].removeNeighbor(a); } } let potential = list[list.length - 1]; if (potential) { let angle = angleBetween(a, b, potential, cw); if (angle > 180) return false; return potential; } return false; } function merge(leftNodes, rightNodes, leftBase, rightBase, hardnessMap) { leftBase.addNeighbor(rightBase); rightBase.addNeighbor(leftBase); let newLeft = leftNodes.filter((n) => n != leftBase); let newRight = rightNodes.filter((n) => n != rightBase); let potentialLeft = getPotential(leftBase, rightBase, newLeft, false); let potentialRight = getPotential(rightBase, leftBase, newRight, true); if (!potentialLeft && !potentialRight) return; else if (potentialLeft && !potentialRight) merge(newLeft, newRight, potentialLeft, rightBase, hardnessMap); else if (potentialRight && !potentialLeft) merge(newLeft, newRight, leftBase, potentialRight, hardnessMap); else { if (inCircle(leftBase, rightBase, potentialLeft, potentialRight) <= 0) merge(newLeft, newRight, potentialLeft, rightBase, hardnessMap); if (inCircle(leftBase, rightBase, potentialRight, potentialLeft) <= 0) merge(newLeft, newRight, leftBase, potentialRight, hardnessMap); } } // divide and conquer algorithm function delaunay(nodes, hardnessMap) { if (nodes.length <= 3) { for (let i = 0; i < nodes.length; i++) for (let j = 0; j < nodes.length; j++) if (i != j) nodes[i].addNeighbor(nodes[j]); return nodes; } else { nodes.sort((a, b) => { let tmp = a.position.x - b.position.x; if (tmp === 0) return b.position.y - a.position.y; return tmp; }); let l = nodes.length; let leftNodes; let rightNodes; if (l === 4) { leftNodes = delaunay(nodes.slice(0, 3), hardnessMap); rightNodes = delaunay(nodes.slice(3, 4), hardnessMap); } else { leftNodes = delaunay(nodes.slice(0, Math.floor(nodes.length / 2)), hardnessMap); rightNodes = delaunay(nodes.slice(Math.floor(nodes.length / 2), nodes.length), hardnessMap); } let leftBase = getLowest(leftNodes); let rightBase = getLowest(rightNodes); merge(leftNodes, rightNodes, leftBase, rightBase, hardnessMap); console.log("=============================MergeComplete================================"); return nodes; } }

• I am working on a game in which we control a rectangular box at the bottom of the screen. Three sphere which has alphabets in it fall down. When the game starts, a word is generated from the predefined list of words(which I'll give) and we are supposed to touch the correct sphere having the alphabet based on that word. The question is how to detect if I have touched the correct sphere.
secondly, if I have touched a correct sphere before and there is no recurrence of that alphabet in that word then during the second wave the game should not proceed if I touch the same alphabet again.
Thanks

• Hi there everyone! I'm trying to implement SPH using CPU single core. I'm having troubles in making it stable. I'd like some help in order to understand what is wrong and how could I fix it. Please, take a look at the following videos:
Water inside sphere using Kelager's parameters
Water inside big box
Water inside thinner box
I've already tried using XSPH, the hash method to find the neighbors (now I'm using the regular grid, because the hash method didn't work for me) and two different ways of calculating the pressure force.
I'm using mostly the following articles:
Particle-Based Fluid Simulation for Interactive Applications, Matthias Müller, David Charypar and Markus Gross
Lagrangian Fluid Dynamics Using Smoothed Particle Hydrodynamics, Micky Kelager
Smoothed Particle Hydrodynamics Real-Time Fluid Simulation Approach, David Staubach
Fluid Simulation using Smoothed Particle Hydrodynamics, Burak Ertekin
3D Langrangian Fluid Solver using SPH approximations, Chris Priscott
Any ideas? Thanks!
• By Tanzan
Hello all,
My question is a bit hard to describe but hopefully it will do...
I just wonder what you guys think is the 'best' way of getting info about the model in your view(s)..
To clearify (i hope ;-) )
If the model is updating itself every game-cycle and the  (deep) nested objects all do there jobs how do you get info where only the view is interested in?
So my question is not how to do it but more what do you people think is the best way to do it ?

Regards,

Alex

• Hi all,
I have written a nice(ish) looking game of solitaire in Windows Forms, using VB.NET and Visual Studio 2015. It allows the player to play the game, and if their score is in the top 10, it saves the username and score o a leaderboard (saved in a text file so it is non-volatile). My next goal is to let the user know if the hand they are dealt is playable, but I'm not sure on the most efficient way of achieving this. I'm working on time complexity over space complexity! It's a simple game to code, but not an easy one to find the solution.
Do you know of any algorithms that may achieve this goal that I would be able to look at?

• 9
• 24
• 10
• 9
• 13