Optimization Towards an Optimal VEX-SSE 3*3*float Matrix Transpose


Hi all,

More than a decade ago, a problem came up on this forum: computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix, so that each row is padded to 4 floats and sits aligned in a single SSE vector. A version, which I believe to be the fastest previously known, was presented:

On 6/27/2005 at 9:20 PM, ajas95 said:

// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps	xmm0,	xmm7		   // xmm0 : ?? z1 y1 x1
movaps	xmm1,	xmm5		   // xmm1 : ?? z0 y0 x0
unpcklps xmm0,	xmm5		   // xmm0 : y1 y0 x1 x0
unpckhps xmm7,	xmm5		   // xmm7 : ?? ?? z1 z0
movhlps	xmm1,	xmm0		   // xmm1 : ?? z1 y1 y0
shufps	xmm7,	xmm6,	11100100b  // xmm7 : ?? z2 z1 z0
movlhps	xmm0,	xmm6		   // xmm0 : ?? x2 x1 x0
shufps	xmm1,	xmm6,	01010100b  // xmm1 : ?? y2 y1 y0



(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).

I am pleased to report that I have been able to come up with a version which should be faster:

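#include <xmmintrin.h>

// Input rows in A, B, and C. Output (the columns) in the same registers.
// Lane comments list elements from high to low; a,b,c are the rows of A,B,C.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);                      // T0 : b1 a1 b0 a0
    __m128 T1 = _mm_unpackhi_ps(A, B);                      // T1 : b3 a3 b2 a2
    A = _mm_movelh_ps(T0, C);                               // A  : ?? c0 b0 a0
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));     // B  : ?? c1 b1 a1
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));     // C  : ?? c2 b2 a2
}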

This should compile to 5 instructions instead of ajas95's 8. Of course, to get that level of performance with either version, everything needs to be inlined; otherwise a lot of time goes into moving the floating-point arguments into and out of registers.
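For anyone who wants to sanity-check the routine before timing it, here is a minimal sketch of a test wrapper. It assumes the matrix is stored as a row-major float[3][4] with one padding float per row. The helper name transpose3x4_inplace is made up for this illustration, and the unaligned loads and stores are only there so the sketch accepts any array; real code would presumably keep the rows in registers.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_unpacklo_ps, _mm_shuffle_ps, ...

// The 5-shuffle transpose from the post, repeated so this sketch is self-contained.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);
    __m128 T1 = _mm_unpackhi_ps(A, B);
    A = _mm_movelh_ps(T0, C);
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));
}

// Hypothetical helper (not from the post): transposes the 3x3 part of a
// row-major 3x4 array in place. The 4th float of each row is padding and
// holds unspecified values afterwards.
inline void transpose3x4_inplace(float m[3][4]) {
    __m128 A = _mm_loadu_ps(m[0]);
    __m128 B = _mm_loadu_ps(m[1]);
    __m128 C = _mm_loadu_ps(m[2]);
    transpose(A, B, C);
    _mm_storeu_ps(m[0], A);
    _mm_storeu_ps(m[1], B);
    _mm_storeu_ps(m[2], C);
}
```

Calling transpose3x4_inplace on the rows {1,2,3}, {4,5,6}, {7,8,9} should leave {1,4,7}, {2,5,8}, {3,6,9} in the first three floats of each row.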

The other crucial requirement is that the code be VEX-encoded. VEX allows three-operand instructions such as vunpcklps, which write to a separate destination register, instead of two-operand instructions such as unpcklps, which overwrite one of their sources and therefore force extra register copies (like the movaps instructions in ajas95's version). VEX encoding arrived with AVX, so it requires AVX or later (passing e.g. -mavx is usually sufficient to get the compiler to generate VEX instructions).
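If it is unclear whether a given build actually has AVX enabled, a compile-time check can help. The sketch below relies on the __AVX__ predefined macro, which GCC and Clang define under -mavx (or any -march setting that implies it) and MSVC defines under /arch:AVX. The function name built_with_avx is made up for this illustration.

```cpp
// Reports whether this translation unit was compiled with AVX enabled,
// i.e. whether the compiler is allowed to emit VEX-encoded instructions
// such as vunpcklps for the SSE intrinsics.
constexpr bool built_with_avx() {
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}

// Optional: uncomment to make a non-AVX build fail loudly.
// static_assert(built_with_avx(), "compile with -mavx (or /arch:AVX)");
```

This only tells you what the compiler was allowed to generate; inspecting the disassembly is still the reliable way to confirm the 5-instruction sequence.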

-G

• Similar Content

• I'm making a renderer just for fun (C++, OpenGL).
I want to add decal support. Here is what I found:
A couple of slides from doom
http://martindevans.me/game-development/2015/02/27/Drawing-Stuff-… space-Decals/
No implementation details here
https://turanszkij.wordpress.com/2017/10/12/forward-decal-rendering/
As I see it, there should be a list of decals for each tile, the same as for light sources. But what to do next?
Let's assume that all decals are packed into a spritesheet. A decal will substitute the diffuse and normal.
- What data should be stored for each decal on the GPU?
- The articles above describe decals as OBBs. Why an OBB if decals seem to be flat?
- How is a decal actually rendered during the object render pass (since it's forward)? Is it projected somehow? I don't understand this part completely.
Are there any papers on this topic?

• Here is the original blog post.
Edit: Sorry, I can't get embedded LaTeX to display properly.
The pinned tutorial post says I have to do it in plain HTML without embedded images?
I actually tried embedding pre-rendered equations and they seemed fine when editing,
but once I submit the post it just turned into a huge mess.
So...until I can find a proper way to fix this, please refer to the original blog post for formatted formulas.
I've replaced the original LaTeX mess in this post with something at least more readable.
Any advice on fixing this is appreciated.
This post is part of my Game Math Series.
Source files are on GitHub.
Shortcut to sterp implementation.
Shortcut to code used to generate animations in this post.
An Alternative to Slerp
Slerp, spherical linear interpolation, is an operation that interpolates from one orientation to another, using a rotational axis paired with the smallest angle possible.
Quick note: Jonathan Blow explains here how you should avoid using slerp, if normalized quaternion linear interpolation (nlerp) suffices. Long story short, nlerp is faster but does not maintain constant angular velocity, while slerp is slower but maintains constant angular velocity; use nlerp if you’re interpolating across small angles or you don’t care about constant angular velocity; use slerp if you’re interpolating across large angles and you care about constant angular velocity. But for the sake of using a more commonly known and used building block, the rest of this post will only mention slerp. Replacing all following occurrences of slerp with nlerp would not change the validity of this post.
In general, slerp is considered superior over interpolating individual components of Euler angles, as the latter method usually yields orientational sways.
But, sometimes slerp might not be ideal. Look at the image below showing two different orientations of a rod. On the left is one orientation, and on the right is the resulting orientation of rotating around the axis shown as a cyan arrow, where the pivot is at one end of the rod.

If we slerp between the two orientations, this is what we get:

Mathematically, slerp takes the “shortest rotational path”. The quaternion representing the rod’s orientation travels along the shortest arc on a 4D hyper sphere. But, given the rod’s elongated appearance, the rod’s moving end seems to be deviating from the shortest arc on a 3D sphere.
My intended effect here is for the rod’s moving end to travel along the shortest arc in 3D, like this:

The difference is more obvious if we compare them side-by-side:

This is where swing-twist decomposition comes in.

Swing-Twist Decomposition
Swing-Twist decomposition is an operation that splits a rotation into two concatenated rotations, swing and twist. Given a twist axis, we would like to separate out the portion of a rotation that contributes to the twist around this axis, and what’s left behind is the remaining swing portion.
There are multiple ways to derive the formulas, but this particular one by Michaele Norel seems to be the most elegant and efficient, and it’s the only one I’ve come across that does not involve any trigonometric functions. I will first show the formulas and then paraphrase his proof later:
Given a rotation represented by a quaternion R = [W_R, vec{V_R}] and a twist axis vec{V_T}, combine the scalar part of R and the projection of vec{V_R} onto vec{V_T} to form a new quaternion:

T = [W_R, proj_{vec{V_T}}(vec{V_R})]

We want to decompose R into a swing component and a twist component. Let S denote the swing component, so we can write R = ST. The swing component is then calculated by multiplying R with the inverse (conjugate) of T:

S = R T^{-1}

Beware that S and T are not yet normalized at this point. It's a good idea to normalize them before use, as unit quaternions are just cuter.

Below is my code implementation of swing-twist decomposition. Note that it also takes care of the singularity that occurs when the rotation to be decomposed represents a 180-degree rotation.

public static void DecomposeSwingTwist
(
  Quaternion q,
  Vector3 twistAxis,
  out Quaternion swing,
  out Quaternion twist
)
{
  Vector3 r = new Vector3(q.x, q.y, q.z);

  // singularity: rotation by 180 degree
  if (r.sqrMagnitude < MathUtil.Epsilon)
  {
    Vector3 rotatedTwistAxis = q * twistAxis;
    Vector3 swingAxis = Vector3.Cross(twistAxis, rotatedTwistAxis);

    if (swingAxis.sqrMagnitude > MathUtil.Epsilon)
    {
      float swingAngle = Vector3.Angle(twistAxis, rotatedTwistAxis);
      swing = Quaternion.AngleAxis(swingAngle, swingAxis);
    }
    else
    {
      // more singularity:
      // rotation axis parallel to twist axis
      swing = Quaternion.identity; // no swing
    }

    // always twist 180 degree on singularity
    twist = Quaternion.AngleAxis(180.0f, twistAxis);
    return;
  }

  // meat of swing-twist decomposition
  Vector3 p = Vector3.Project(r, twistAxis);
  twist = new Quaternion(p.x, p.y, p.z, q.w);
  twist = Normalize(twist);
  swing = q * Quaternion.Inverse(twist);
}

Now that we have the means to decompose a rotation into swing and twist components, we need a way to use them to interpolate the rod’s orientation, replacing slerp.
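For readers without Unity at hand, the core formulas can also be tried out in a few lines of standalone C++. This is only a sketch under stated assumptions: the Quat, Vec3 and SwingTwist types and the function names are invented for the illustration, the input quaternion and the twist axis are assumed to be unit length, and the singular case handled in the code above is deliberately left out (it just trips an assert).

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Quat { double w, x, y, z; };  // scalar part first: [w, (x, y, z)]
struct SwingTwist { Quat swing, twist; };

// Hamilton product a * b.
inline Quat mul(const Quat& a, const Quat& b) {
    return {
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
    };
}

// Conjugate; equal to the inverse for a unit quaternion.
inline Quat conj(const Quat& q) { return {q.w, -q.x, -q.y, -q.z}; }

// Splits q into swing * twist about the given axis.
// Assumes q and axis are unit length and the input is not singular.
inline SwingTwist decompose_swing_twist(const Quat& q, const Vec3& axis) {
    // Project the vector part of q onto the twist axis.
    const double d = q.x * axis.x + q.y * axis.y + q.z * axis.z;
    Quat t{q.w, d * axis.x, d * axis.y, d * axis.z};

    // Normalize T = [W_R, proj(V_R)].
    const double len = std::sqrt(t.w * t.w + t.x * t.x + t.y * t.y + t.z * t.z);
    assert(len > 1e-9 && "singular input is not handled in this sketch");
    t = {t.w / len, t.x / len, t.y / len, t.z / len};

    // S = R T^{-1}
    return {mul(q, conj(t)), t};
}
```

With q = [0.5, (0.5, 0.5, 0.5)] and the z axis as the twist axis, the twist comes out as a 90-degree rotation about z and the swing as a 90-degree rotation about y, so the swing axis is orthogonal to the twist axis and swing * twist reproduces q.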
Swing-Twist Interpolation
Replacing slerp with the swing and twist components is actually pretty straightforward. Let Q_0 and Q_1 denote the quaternions representing the rod's two orientations we are interpolating between. Given the interpolation parameter t, we use it to find "fractions" of the swing and twist components and combine them together. Such fractions can be obtained by performing slerp from the identity quaternion, Q_I, to the individual components. So we replace:

Slerp(Q_0, Q_1, t)

with:

Slerp(Q_I, S, t) Slerp(Q_I, T, t)

From the rod example, we choose the twist axis to align with the rod's longest side. Let's look at the effect of the individual components Slerp(Q_I, S, t) and Slerp(Q_I, T, t) as t varies over time below, swing on left and twist on right:
And as we concatenate these two components together, we get a swing-twist interpolation that rotates the rod such that its moving end travels in the shortest arc in 3D. Again, here is a side-by-side comparison of slerp (left) and swing-twist interpolation (right):

I decided to name my swing-twist interpolation function sterp. I think it’s cool because it sounds like it belongs to the function family of lerp and slerp. Here’s to hoping that this name catches on.
And here’s my code implementation:
public static Quaternion Sterp
(
  Quaternion a,
  Quaternion b,
  Vector3 twistAxis,
  float t
)
{
  Quaternion deltaRotation = b * Quaternion.Inverse(a);

  Quaternion swingFull;
  Quaternion twistFull;
  QuaternionUtil.DecomposeSwingTwist
  (
    deltaRotation,
    twistAxis,
    out swingFull,
    out twistFull
  );

  Quaternion swing = Quaternion.Slerp(Quaternion.identity, swingFull, t);
  Quaternion twist = Quaternion.Slerp(Quaternion.identity, twistFull, t);

  return twist * swing;
}

Proof
Lastly, let’s look at the proof for the swing-twist decomposition formulas. All that needs to be proven is that the swing component S does not contribute to any rotation around the twist axis, i.e. the rotational axis of S is orthogonal to the twist axis.

Let vec{V_{R_para}} denote the parallel component of vec{V_R} to vec{V_T}, which can be obtained by projecting vec{V_R} onto vec{V_T}:

vec{V_{R_para}} = proj_{vec{V_T}}(vec{V_R})

Let vec{V_{R_perp}} denote the orthogonal component of vec{V_R} to vec{V_T}:

vec{V_{R_perp}} = vec{V_R} - vec{V_{R_para}}

So the scalar-vector form of T becomes:

T = [W_R, proj_{vec{V_T}}(vec{V_R})] = [W_R, vec{V_{R_para}}]

Using the quaternion multiplication formula, here is the scalar-vector form of the swing quaternion:

S = R T^{-1}
  = [W_R, vec{V_R}] [W_R, -vec{V_{R_para}}]
  = [W_R^2 - vec{V_R} · (-vec{V_{R_para}}), vec{V_R} × (-vec{V_{R_para}}) + W_R vec{V_R} + W_R (-vec{V_{R_para}})]
  = [W_R^2 - vec{V_R} · (-vec{V_{R_para}}), vec{V_R} × (-vec{V_{R_para}}) + W_R (vec{V_R} - vec{V_{R_para}})]
  = [W_R^2 - vec{V_R} · (-vec{V_{R_para}}), vec{V_R} × (-vec{V_{R_para}}) + W_R vec{V_{R_perp}}]

Take notice of the vector part of the result:

vec{V_R} × (-vec{V_{R_para}}) + W_R vec{V_{R_perp}}

This is a vector parallel to the rotational axis of S. Both vec{V_R} × (-vec{V_{R_para}}) and vec{V_{R_perp}} are orthogonal to the twist axis vec{V_T}, so we have shown that the rotational axis of S is orthogonal to the twist axis. Hence, we have proven that the formulas for S and T are valid for swing-twist decomposition.

Conclusion
That’s all.
Given a twist axis, I have shown how to decompose a rotation into a swing component and a twist component.
Such decomposition can be used for swing-twist interpolation, an alternative to slerp that interpolates between two orientations, which can be useful if you’d like some point on a rotating object to travel along the shortest arc.
I like to call such interpolation sterp.
Sterp is merely an alternative to slerp, not a replacement. Also, slerp is definitely more efficient than sterp. Most of the time slerp should work just fine, but if you find unwanted orientational sway on an object’s moving end, you might want to give sterp a try.
• By cgaish
Hi,
I am trying to implement a custom texture atlas creator tool in C++ and need suggestions for a fast open-source API or library for image import and export.
The tool will also compress the final atlas image into formats such as DXT5, PVRTC and ETC, based on user input. What would be the best way to implement this?
Thanks

• Thanks to @Randy Gaul, I successfully implemented cube/cube collision detection and response.
1- Subtract the centers of the two AABBs to get a 3D vector a.
2- If |x| is the largest component of a, this represents a face on each AABB.
3- If x points in the same (or exactly opposite) direction as the normal of a face, the two AABBs are colliding on those faces.
But these steps only work if the two colliders are cubes, because the half-lengths differ from axis to axis in a right square prism.
Thank you!

• I've been digging around online and can't seem to find any formulas for 3D mesh simplification. I'm not sure where to start but I generally want to know how I could make a function that takes in an array of vertices, indices, and a float/double for the decimation rate. And could I preserve the general shape of the object too?
Thanks for the help!
P.S. I was hoping to do something with Quadric Error / Quadric Edge Collapse if that's possible.