I was surprised to see how difficult it is to find clear implementations of matrix transposition and inversion using SSE so I worked all night and I wanted to share my results with you.
I bet it can be further optimized so comments are welcome.
Matrix transposition
Nothing new here. I post this because most of the site I found with this algorithm was only showing the integer version of it.
Based on this site:
http://www.randombit...se_in_sse2.html
Matrix const Matrix::transpose() const
{
Vec4 const & t0 = _mm_unpacklo_ps(c0, c1);
Vec4 const & t1 = _mm_unpacklo_ps(c2, c3);
Vec4 const & t2 = _mm_unpackhi_ps(c0, c1);
Vec4 const & t3 = _mm_unpackhi_ps(c2, c3);
return Matrix(
_mm_movelh_ps(t0, t1),
_mm_movehl_ps(t1, t0),
_mm_movelh_ps(t2, t3),
_mm_movehl_ps(t3, t2));
}
Matrix inverse
I decided to go with the matrix partitioning algorithm to compute the matrix inverse. Subdividing the 4x4 matrix into four 2x2 matrices.
Based on this paper:
http://econ.la.psu.e...ONED_MATRIX.PDF
The core matrix inversion code :
Matrix const Matrix::inverse() const
{
// Sub matrices
Vec4 const & m11 = _mm_movelh_ps(c0, c1);
Vec4 const & m21 = _mm_movehl_ps(c1, c0);
Vec4 const & m12 = _mm_movelh_ps(c2, c3);
Vec4 const & m22 = _mm_movehl_ps(c3, c2);
// Inverse sub matrices
Vec4 const & inv11 = inv2x2(m11);
Vec4 const & inv22 = inv2x2(m22);
// Partitions
Vec4 const & _m11 = inv2x2(m11 - mul2x2(mul2x2(m12, inv22), m21));
Vec4 const & _m21 = mul2x2(mul2x2(-inv22, m21), _m11);
Vec4 const & _m22 = inv2x2(m22 - mul2x2(mul2x2(m21, inv11), m12));
Vec4 const & _m12 = mul2x2(mul2x2(-inv11, m12), _m22);
// Return inverse
return Matrix(
_mm_movelh_ps(_m11, _m21),
_mm_movehl_ps(_m21, _m11),
_mm_movelh_ps(_m12, _m22),
_mm_movehl_ps(_m22, _m12));
}
As you can see I am using "inv2x2" and "mul2x2" here so here is the code for them:
Based on this site:
http://www.dr-lex.be...matrix_inv.html
static Vec4 const Matrix::mul2x2(Vec4 const & a, Vec4 const & b)
{
return b.swizzle<0,0,2,2>() * _mm_movelh_ps(a, a) + b.swizzle<1,1,3,3>() * _mm_movehl_ps(a, a);
}
static Vec4 const Matrix::det2x2(Vec4 const & m)
{
Vec4 const & temp = m.swizzle<0,0,1,1>() * m.swizzle<3,3,2,2>();
return _mm_sub_ps(_mm_unpacklo_ps(temp, temp), _mm_unpackhi_ps(temp, temp));
}
static Vec4 const Matrix::adj2x2(Vec4 const & m)
{
Vec4 const znzn(0.0f, -0.0f, -0.0f, 0.0f);
return _mm_xor_ps(m.swizzle<3,1,2,0>(), znzn);
}
static Vec4 const Matrix::inv2x2(Vec4 const & m)
{
return adj2x2(m) * det2x2(m).rcp();
}
Just for completeness, this is the code for the "+", "-" and "*" operators of my Vec4 class as well as the implementation of the "swizzle" and "rcp" functions.
Add, sub & mul operators:
friend Vec4 const Vec4::operator + (Vec4 const & a, Vec4 const & b)
{
return _mm_add_ps(a, b);
}
friend Vec4 const Vec4::operator - (Vec4 const & a, Vec4 const & b)
{
return _mm_sub_ps(a, b);
}
friend Vec4 const Vec4::operator * (Vec4 const & a, Vec4 const & b)
{
return _mm_mul_ps(a, b);
}
Vec4 swizzle (basically a disguised shuffle operation)...
template < uint const _X, uint const _Y, uint const _Z, uint const _W >
Vec4 const Vec4::swizzle() const
{
return _mm_shuffle_ps(*this, *this, _MM_SHUFFLE(_W, _Z, _Y, _X));
}
Vec4 reciprocal (1 / v) using newton-raphson correction.
Vec4 const Vec4::rcp() const
{
Vec4 const & x0 = _mm_rcp_ps(*this);
return (x0 + x0) - (*this * x0 * x0);
}
Hopefully this will be useful for someone else.
Cheers,
Shnoutz.