SSE 4x4 Matrix transpose and invert

Hello,

I was surprised by how difficult it is to find clear SSE implementations of matrix transposition and inversion, so I worked all night and wanted to share my results with you.

I bet it can be further optimized so comments are welcome.

Matrix transposition

Nothing new here. I am posting this because most of the sites I found with this algorithm only showed the integer version of it.

Based on this site:
http://www.randombit...se_in_sse2.html


Matrix const Matrix::transpose() const
{
    // Interleave pairs of columns: t0/t2 take columns 0-1, t1/t3 take columns 2-3
    Vec4 const t0 = _mm_unpacklo_ps(c0, c1);
    Vec4 const t1 = _mm_unpacklo_ps(c2, c3);
    Vec4 const t2 = _mm_unpackhi_ps(c0, c1);
    Vec4 const t3 = _mm_unpackhi_ps(c2, c3);
    return Matrix(
        _mm_movelh_ps(t0, t1),
        _mm_movehl_ps(t1, t0),
        _mm_movelh_ps(t2, t3),
        _mm_movehl_ps(t3, t2));
}
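To see what the shuffles do, here is the element flow, assuming c0..c3 hold the matrix columns (my annotation, not part of the original post):

// c0 = [m00 m10 m20 m30], c1 = [m01 m11 m21 m31], etc.
// t0 = unpacklo(c0, c1) = [m00 m01 m10 m11]
// t1 = unpacklo(c2, c3) = [m02 m03 m12 m13]
// t2 = unpackhi(c0, c1) = [m20 m21 m30 m31]
// t3 = unpackhi(c2, c3) = [m22 m23 m32 m33]
// movelh(t0, t1) = [m00 m01 m02 m03]   -> old row 0, new column 0
// movehl(t1, t0) = [m10 m11 m12 m13]   -> old row 1, new column 1
// movelh(t2, t3) = [m20 m21 m22 m23]   -> old row 2, new column 2
// movehl(t3, t2) = [m30 m31 m32 m33]   -> old row 3, new column 3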


Matrix inverse

I decided to go with the matrix partitioning algorithm to compute the matrix inverse, subdividing the 4x4 matrix into four 2x2 sub-matrices.

Based on this paper:
http://econ.la.psu.e...ONED_MATRIX.PDF
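
For reference, writing the matrix in terms of its four 2x2 blocks, the partitioned inverse computed below is (assuming the diagonal blocks and their Schur complements are invertible; this is my summary of the formulas, not from the original post):

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad M^{-1} = \begin{pmatrix} (A - B D^{-1} C)^{-1} & -A^{-1} B (D - C A^{-1} B)^{-1} \\ -D^{-1} C (A - B D^{-1} C)^{-1} & (D - C A^{-1} B)^{-1} \end{pmatrix}$$

Here A, B, C and D correspond to m11, m12, m21 and m22 in the code.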

The core matrix inversion code:

Matrix const Matrix::inverse() const
{
    // 2x2 sub-matrices, each packed column-major in one Vec4:
    // m11 = upper-left, m21 = lower-left, m12 = upper-right, m22 = lower-right
    Vec4 const m11 = _mm_movelh_ps(c0, c1);
    Vec4 const m21 = _mm_movehl_ps(c1, c0);
    Vec4 const m12 = _mm_movelh_ps(c2, c3);
    Vec4 const m22 = _mm_movehl_ps(c3, c2);
    // Inverses of the diagonal sub-matrices (note: both must be invertible)
    Vec4 const inv11 = inv2x2(m11);
    Vec4 const inv22 = inv2x2(m22);
    // Partitions: the blocks of the inverse via the Schur complements
    Vec4 const _m11 = inv2x2(m11 - mul2x2(mul2x2(m12, inv22), m21));
    Vec4 const _m21 = mul2x2(mul2x2(-inv22, m21), _m11);
    Vec4 const _m22 = inv2x2(m22 - mul2x2(mul2x2(m21, inv11), m12));
    Vec4 const _m12 = mul2x2(mul2x2(-inv11, m12), _m22);
    // Reassemble the inverse from its four blocks
    return Matrix(
        _mm_movelh_ps(_m11, _m21),
        _mm_movehl_ps(_m21, _m11),
        _mm_movelh_ps(_m12, _m22),
        _mm_movehl_ps(_m22, _m12));
}


As you can see, I am using "inv2x2" and "mul2x2" here, so here is the code for them:

Based on this site:
http://www.dr-lex.be...matrix_inv.html


// Out-of-class definitions (declared "static" inside Matrix).
// Each Vec4 packs a 2x2 matrix column-major: [m00, m10, m01, m11].

// 2x2 matrix product a * b
Vec4 const Matrix::mul2x2(Vec4 const & a, Vec4 const & b)
{
    return b.swizzle<0,0,2,2>() * _mm_movelh_ps(a, a) + b.swizzle<1,1,3,3>() * _mm_movehl_ps(a, a);
}

// Determinant m00*m11 - m01*m10, broadcast to all four lanes
Vec4 const Matrix::det2x2(Vec4 const & m)
{
    Vec4 const temp = m.swizzle<0,0,1,1>() * m.swizzle<3,3,2,2>();
    return _mm_sub_ps(_mm_unpacklo_ps(temp, temp), _mm_unpackhi_ps(temp, temp));
}

// Adjugate: swap the diagonal entries, negate the off-diagonal ones (sign flip via XOR)
Vec4 const Matrix::adj2x2(Vec4 const & m)
{
    Vec4 const znzn(0.0f, -0.0f, -0.0f, 0.0f);
    return _mm_xor_ps(m.swizzle<3,1,2,0>(), znzn);
}

// Inverse: adj(m) / det(m)
Vec4 const Matrix::inv2x2(Vec4 const & m)
{
    return adj2x2(m) * det2x2(m).rcp();
}
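A minimal sanity check for the 2x2 helpers could look like this (my sketch, assuming the statics are accessible from the call site; the four-float Vec4 constructor is the one used for "znzn" above):

#include <cassert>
#include <cmath>

void testInv2x2()
{
    Vec4 const a(3.0f, 1.0f, 2.0f, 4.0f);                 // A = [3 2; 1 4], packed column-major
    Vec4 const id = Matrix::mul2x2(a, Matrix::inv2x2(a)); // expect ~[1 0; 0 1]
    alignas(16) float f[4];
    _mm_store_ps(f, id);
    assert(std::fabs(f[0] - 1.0f) < 1.0e-4f && std::fabs(f[3] - 1.0f) < 1.0e-4f);
    assert(std::fabs(f[1]) < 1.0e-4f && std::fabs(f[2]) < 1.0e-4f);
}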


Just for completeness, this is the code for the "+", "-" and "*" operators of my Vec4 class as well as the implementation of the "swizzle" and "rcp" functions.

Add, sub & mul operators:

// Free functions (declared as friends inside the Vec4 class)
Vec4 const operator + (Vec4 const & a, Vec4 const & b)
{
    return _mm_add_ps(a, b);
}

Vec4 const operator - (Vec4 const & a, Vec4 const & b)
{
    return _mm_sub_ps(a, b);
}

Vec4 const operator * (Vec4 const & a, Vec4 const & b)
{
    return _mm_mul_ps(a, b);
}


Vec4 swizzle (basically a disguised shuffle operation)...

template < uint const _X, uint const _Y, uint const _Z, uint const _W >
Vec4 const Vec4::swizzle() const
{
    // _MM_SHUFFLE takes its arguments in reverse lane order, hence (_W, _Z, _Y, _X)
    return _mm_shuffle_ps(*this, *this, _MM_SHUFFLE(_W, _Z, _Y, _X));
}
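For example (my illustration, not from the original post):

Vec4 const v(1.0f, 2.0f, 3.0f, 4.0f);
Vec4 const r = v.swizzle<3,2,1,0>();   // r = [4, 3, 2, 1]
Vec4 const b = v.swizzle<0,0,0,0>();   // b = [1, 1, 1, 1] (broadcast lane 0)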


Vec4 reciprocal (1 / v) using Newton-Raphson correction.

Vec4 const Vec4::rcp() const
{
    // One Newton-Raphson step on the rcpps estimate: x1 = 2*x0 - v*x0*x0
    Vec4 const x0 = _mm_rcp_ps(*this);
    return (x0 + x0) - (*this * x0 * x0);
}
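For reference, this is one Newton-Raphson iteration for $f(x) = 1/x - v$, which roughly doubles the ~12 bits of precision that rcpps delivers:

$$x_{n+1} = x_n (2 - v x_n) = 2 x_n - v x_n^2$$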


Hopefully this will be useful for someone else.

Cheers,

Shnoutz.
Did you measure it vs an optimized scalar version?

From counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation. Very good indeed. Although I haven't had the chance to profile and compare both algorithms yet.

P.S.: for transposing a Matrix, there is an intrinsics macro in "xmmintrin.h" that produces slightly different code:


#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
            __m128 tmp3, tmp2, tmp1, tmp0;                          \
                                                                    \
            tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);          \
            tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);          \
            tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);          \
            tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);          \
                                                                    \
            (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);              \
            (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);              \
            (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);              \
            (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);              \
        }
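
For illustration, the macro transposes its four rows in place (this usage example is mine):

__m128 r0 = _mm_setr_ps( 0.0f,  1.0f,  2.0f,  3.0f);
__m128 r1 = _mm_setr_ps( 4.0f,  5.0f,  6.0f,  7.0f);
__m128 r2 = _mm_setr_ps( 8.0f,  9.0f, 10.0f, 11.0f);
__m128 r3 = _mm_setr_ps(12.0f, 13.0f, 14.0f, 15.0f);
_MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // now r0 = [0, 4, 8, 12], r1 = [1, 5, 9, 13], ...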

... just another thing to profile. But my guess is: SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS, UNPCKHPS all use the same execution unit (port 5) and have the same latencies (1) and throughputs (1). So this may be the same in terms of speed.

Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)


Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)

To make you partly happy: I did some simple __rdtsc() profiling. The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average. Still, the Cramer implementation is better, as it accumulates less error on average.
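A measurement loop along these lines would do (my sketch, not st0ff's actual harness; the iteration count and the dependency trick are arbitrary choices):

#include <x86intrin.h>   // __rdtsc on GCC/Clang; MSVC has it in <intrin.h>

unsigned long long averageTicks(Matrix m, int iterations = 1000)
{
    unsigned long long total = 0;
    for (int i = 0; i < iterations; ++i)
    {
        unsigned long long const start = __rdtsc();
        m = m.inverse();   // feed the result back in so the call cannot be optimized away
        total += __rdtsc() - start;
    }
    return total / iterations;
}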

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when really issuing a divps instead of using a corrected rcpps.

*speculation mode off*

Does GLM do SSE?

Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix. The Cramer's rule implementation works perfectly all the time, while the partitioned approach frequently produces bad matrices.

I don't really know if it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach. Those 20 fewer cycles do not matter if the result is not trustworthy. Maybe someday I will find the time to either optimize the Cramer implementation further, or to find and remove the bug in the partitioned implementation.
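
A plausible failure mode (my speculation, consistent with the symptoms): the partitioned code divides by the determinants of both diagonal 2x2 blocks, and a perfectly valid transform can make one of them singular. For example, a 90° rotation about the y axis,

$$R_y(90^\circ) = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \quad \det A = 0,$$

makes inv2x2(m11) divide by zero, so the result fills with infinities and NaNs even though the matrix itself is trivially invertible.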

