SSE 4x4 Matrix transpose and invert

Hello,

I was surprised by how difficult it is to find clear SSE implementations of matrix transposition and inversion, so I worked all night and wanted to share my results with you.

I bet it can be further optimized so comments are welcome.

Matrix transposition

Nothing new here. I am posting this because most of the sites I found with this algorithm only showed the integer version of it.

Based on this site:
http://www.randombit...se_in_sse2.html


Matrix const Matrix::transpose() const
{
    // Interleave pairs of columns: t0/t2 take columns 0-1, t1/t3 take columns 2-3
    Vec4 const t0 = _mm_unpacklo_ps(c0, c1);
    Vec4 const t1 = _mm_unpacklo_ps(c2, c3);
    Vec4 const t2 = _mm_unpackhi_ps(c0, c1);
    Vec4 const t3 = _mm_unpackhi_ps(c2, c3);
    return Matrix(
        _mm_movelh_ps(t0, t1),
        _mm_movehl_ps(t1, t0),
        _mm_movelh_ps(t2, t3),
        _mm_movehl_ps(t3, t2));
}
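To see what the shuffles do, here is the element flow, assuming c0..c3 hold the matrix columns (my annotation, not part of the original post):

// c0 = [m00 m10 m20 m30], c1 = [m01 m11 m21 m31], etc.
// t0 = unpacklo(c0, c1) = [m00 m01 m10 m11]
// t1 = unpacklo(c2, c3) = [m02 m03 m12 m13]
// t2 = unpackhi(c0, c1) = [m20 m21 m30 m31]
// t3 = unpackhi(c2, c3) = [m22 m23 m32 m33]
// movelh(t0, t1) = [m00 m01 m02 m03]   -> old row 0, new column 0
// movehl(t1, t0) = [m10 m11 m12 m13]   -> old row 1, new column 1
// movelh(t2, t3) = [m20 m21 m22 m23]   -> old row 2, new column 2
// movehl(t3, t2) = [m30 m31 m32 m33]   -> old row 3, new column 3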


Matrix inverse

I decided to go with the matrix partitioning algorithm to compute the matrix inverse, subdividing the 4x4 matrix into four 2x2 sub-matrices.

Based on this paper:
http://econ.la.psu.e...ONED_MATRIX.PDF
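
For reference, writing the matrix in terms of its four 2x2 blocks, the partitioned inverse computed below is (assuming the diagonal blocks and their Schur complements are invertible; this is my summary of the formulas, not from the original post):

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad M^{-1} = \begin{pmatrix} (A - B D^{-1} C)^{-1} & -A^{-1} B (D - C A^{-1} B)^{-1} \\ -D^{-1} C (A - B D^{-1} C)^{-1} & (D - C A^{-1} B)^{-1} \end{pmatrix}$$

Here A, B, C and D correspond to m11, m12, m21 and m22 in the code.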

The core matrix inversion code:

Matrix const Matrix::inverse() const
{
    // 2x2 sub-matrices, each packed column-major in one Vec4:
    // m11 = upper-left, m21 = lower-left, m12 = upper-right, m22 = lower-right
    Vec4 const m11 = _mm_movelh_ps(c0, c1);
    Vec4 const m21 = _mm_movehl_ps(c1, c0);
    Vec4 const m12 = _mm_movelh_ps(c2, c3);
    Vec4 const m22 = _mm_movehl_ps(c3, c2);
    // Inverses of the diagonal sub-matrices (note: both must be invertible)
    Vec4 const inv11 = inv2x2(m11);
    Vec4 const inv22 = inv2x2(m22);
    // Partitions: the blocks of the inverse via the Schur complements
    Vec4 const _m11 = inv2x2(m11 - mul2x2(mul2x2(m12, inv22), m21));
    Vec4 const _m21 = mul2x2(mul2x2(-inv22, m21), _m11);
    Vec4 const _m22 = inv2x2(m22 - mul2x2(mul2x2(m21, inv11), m12));
    Vec4 const _m12 = mul2x2(mul2x2(-inv11, m12), _m22);
    // Reassemble the inverse from its four blocks
    return Matrix(
        _mm_movelh_ps(_m11, _m21),
        _mm_movehl_ps(_m21, _m11),
        _mm_movelh_ps(_m12, _m22),
        _mm_movehl_ps(_m22, _m12));
}


As you can see, I am using "inv2x2" and "mul2x2" here, so here is the code for them:

Based on this site:
http://www.dr-lex.be...matrix_inv.html


// Out-of-class definitions (declared "static" inside Matrix).
// Each Vec4 packs a 2x2 matrix column-major: [m00, m10, m01, m11].

// 2x2 matrix product a * b
Vec4 const Matrix::mul2x2(Vec4 const & a, Vec4 const & b)
{
    return b.swizzle<0,0,2,2>() * _mm_movelh_ps(a, a) + b.swizzle<1,1,3,3>() * _mm_movehl_ps(a, a);
}

// Determinant m00*m11 - m01*m10, broadcast to all four lanes
Vec4 const Matrix::det2x2(Vec4 const & m)
{
    Vec4 const temp = m.swizzle<0,0,1,1>() * m.swizzle<3,3,2,2>();
    return _mm_sub_ps(_mm_unpacklo_ps(temp, temp), _mm_unpackhi_ps(temp, temp));
}

// Adjugate: swap the diagonal entries, negate the off-diagonal ones (sign flip via XOR)
Vec4 const Matrix::adj2x2(Vec4 const & m)
{
    Vec4 const znzn(0.0f, -0.0f, -0.0f, 0.0f);
    return _mm_xor_ps(m.swizzle<3,1,2,0>(), znzn);
}

// Inverse: adj(m) / det(m)
Vec4 const Matrix::inv2x2(Vec4 const & m)
{
    return adj2x2(m) * det2x2(m).rcp();
}
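A minimal sanity check for the 2x2 helpers could look like this (my sketch, assuming the statics are accessible from the call site; the four-float Vec4 constructor is the one used for "znzn" above):

#include <cassert>
#include <cmath>

void testInv2x2()
{
    Vec4 const a(3.0f, 1.0f, 2.0f, 4.0f);                 // A = [3 2; 1 4], packed column-major
    Vec4 const id = Matrix::mul2x2(a, Matrix::inv2x2(a)); // expect ~[1 0; 0 1]
    alignas(16) float f[4];
    _mm_store_ps(f, id);
    assert(std::fabs(f[0] - 1.0f) < 1.0e-4f && std::fabs(f[3] - 1.0f) < 1.0e-4f);
    assert(std::fabs(f[1]) < 1.0e-4f && std::fabs(f[2]) < 1.0e-4f);
}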


Just for completeness, this is the code for the "+", "-" and "*" operators of my Vec4 class as well as the implementation of the "swizzle" and "rcp" functions.

Add, sub & mul operators:

// Free functions (declared as friends inside the Vec4 class)
Vec4 const operator + (Vec4 const & a, Vec4 const & b)
{
    return _mm_add_ps(a, b);
}

Vec4 const operator - (Vec4 const & a, Vec4 const & b)
{
    return _mm_sub_ps(a, b);
}

Vec4 const operator * (Vec4 const & a, Vec4 const & b)
{
    return _mm_mul_ps(a, b);
}


Vec4 swizzle (basically a disguised shuffle operation)...

template < uint const _X, uint const _Y, uint const _Z, uint const _W >
Vec4 const Vec4::swizzle() const
{
    // _MM_SHUFFLE takes its arguments in reverse lane order, hence (_W, _Z, _Y, _X)
    return _mm_shuffle_ps(*this, *this, _MM_SHUFFLE(_W, _Z, _Y, _X));
}
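For example (my illustration, not from the original post):

Vec4 const v(1.0f, 2.0f, 3.0f, 4.0f);
Vec4 const r = v.swizzle<3,2,1,0>();   // r = [4, 3, 2, 1]
Vec4 const b = v.swizzle<0,0,0,0>();   // b = [1, 1, 1, 1] (broadcast lane 0)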


Vec4 reciprocal (1 / v) using Newton-Raphson correction.

Vec4 const Vec4::rcp() const
{
    // One Newton-Raphson step on the rcpps estimate: x1 = 2*x0 - v*x0*x0
    Vec4 const x0 = _mm_rcp_ps(*this);
    return (x0 + x0) - (*this * x0 * x0);
}
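For reference, this is one Newton-Raphson iteration for $f(x) = 1/x - v$, which roughly doubles the ~12 bits of precision that rcpps delivers:

$$x_{n+1} = x_n (2 - v x_n) = 2 x_n - v x_n^2$$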


Hopefully this will be useful for someone else.

Cheers,

Shnoutz.
Did you measure it vs an optimized scalar version?

From counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation. Very good indeed. Although I haven't had the chance to profile and compare both algorithms yet.

P.S.: for transposing a Matrix, there is an intrinsics macro in "xmmintrin.h" that produces slightly different code:


#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
            __m128 tmp3, tmp2, tmp1, tmp0;                          \
                                                                    \
            tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);          \
            tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);          \
            tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);          \
            tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);          \
                                                                    \
            (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);              \
            (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);              \
            (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);              \
            (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);              \
        }
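
For illustration, the macro transposes its four rows in place (this usage example is mine):

__m128 r0 = _mm_setr_ps( 0.0f,  1.0f,  2.0f,  3.0f);
__m128 r1 = _mm_setr_ps( 4.0f,  5.0f,  6.0f,  7.0f);
__m128 r2 = _mm_setr_ps( 8.0f,  9.0f, 10.0f, 11.0f);
__m128 r3 = _mm_setr_ps(12.0f, 13.0f, 14.0f, 15.0f);
_MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // now r0 = [0, 4, 8, 12], r1 = [1, 5, 9, 13], ...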

... just another thing to profile. But my guess is: SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS, UNPCKHPS all use the same execution unit (port 5) and have the same latencies (1) and throughputs (1). So this may be the same in terms of speed.

Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)


Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)

To make you partly happy: I did some simple __rdtsc() profiling. The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average. Still, the Cramer implementation is better, as it accumulates less error on average.
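A measurement loop along these lines would do (my sketch, not st0ff's actual harness; the iteration count and the dependency trick are arbitrary choices):

#include <x86intrin.h>   // __rdtsc on GCC/Clang; MSVC has it in <intrin.h>

unsigned long long averageTicks(Matrix m, int iterations = 1000)
{
    unsigned long long total = 0;
    for (int i = 0; i < iterations; ++i)
    {
        unsigned long long const start = __rdtsc();
        m = m.inverse();   // feed the result back in so the call cannot be optimized away
        total += __rdtsc() - start;
    }
    return total / iterations;
}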

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when really issuing a divps instead of using a corrected rcpps.

*speculation mode off*

Does GLM do SSE?

Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix. The Cramer's rule implementation works perfectly all the time, while the partitioned approach frequently produces bad matrices.

I don't really know if it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach. Those 20 fewer cycles do not matter if the result is not trustworthy. Maybe someday I will find the time to either optimize the Cramer implementation further, or to find and remove the bug in the partitioned implementation.
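
A plausible failure mode (my speculation, consistent with the symptoms): the partitioned code divides by the determinants of both diagonal 2x2 blocks, and a perfectly valid transform can make one of them singular. For example, a 90° rotation about the y axis,

$$R_y(90^\circ) = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \quad \det A = 0,$$

makes inv2x2(m11) divide by zero, so the result fills with infinities and NaNs even though the matrix itself is trivially invertible.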

