# SSE 4x4 Matrix transpose and invert


Did you measure it vs an optimized scalar version?


From counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation. Very good indeed, although I haven't had the chance to profile and compare both algorithms yet.

P.S.: For transposing a matrix, there is an intrinsics macro in "xmmintrin.h" that produces slightly different code:

    #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {        \
        __m128 tmp3, tmp2, tmp1, tmp0;                         \
                                                               \
        tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);         \
        tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);         \
        tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);         \
        tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);         \
                                                               \
        (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);             \
        (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);             \
        (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);             \
        (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);             \
    }



... just another thing to profile. My guess, though: SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS and UNPCKHPS all use the same execution unit (port 5) and have the same latency (1) and throughput (1), so both variants may well run at the same speed.
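
To make that comparison concrete, here is a minimal sketch of the two transpose variants side by side: one using the `_MM_TRANSPOSE4_PS` macro, one built from unpack and move instructions. The function names `transpose_macro` and `transpose_unpack` and the row-major `float[16]` layout are assumptions made for this illustration. Note that some compilers already implement the macro with unpacks internally, so the generated code may end up identical.

```c
#include <xmmintrin.h>  /* SSE intrinsics and _MM_TRANSPOSE4_PS */

/* Transpose a row-major 4x4 matrix in place using the header macro. */
void transpose_macro(float m[16])
{
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(m + 0,  r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}

/* Same result via UNPCKLPS/UNPCKHPS followed by MOVLHPS/MOVHLPS.
 * Rows are named a, b, c, d in the comments. */
void transpose_unpack(float m[16])
{
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* a0 b0 a1 b1 */
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  /* c0 d0 c1 d1 */
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  /* a2 b2 a3 b3 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /* c2 d2 c3 d3 */

    _mm_storeu_ps(m + 0,  _mm_movelh_ps(t0, t1));  /* a0 b0 c0 d0 */
    _mm_storeu_ps(m + 4,  _mm_movehl_ps(t1, t0));  /* a1 b1 c1 d1 */
    _mm_storeu_ps(m + 8,  _mm_movelh_ps(t2, t3));  /* a2 b2 c2 d2 */
    _mm_storeu_ps(m + 12, _mm_movehl_ps(t3, t2));  /* a3 b3 c3 d3 */
}
```

Both functions should give the same matrix for the same input; which one is faster on a given CPU is exactly the thing that would need to be measured.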

Edited by st0ff

Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)


> Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)

To make you partly happy: I did some simple `__rdtsc()` profiling. The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average. Still, the Cramer implementation is better, as it accumulates less error on average.
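
For readers who want to repeat this kind of measurement, a simple `__rdtsc()` harness might look roughly like the sketch below. It is not the code used for the numbers above. The names `kernel_fn`, `average_ticks` and `add_one_kernel` are made up for this example, and the include assumes GCC or Clang (MSVC provides `__rdtsc` through `<intrin.h>`). The timestamp counter is not serializing, so results for very short routines are noisy and should be averaged over many runs.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc on GCC/Clang (assumption: x86 target) */

/* Hypothetical signature for the routine under test: works on 16 floats. */
typedef void (*kernel_fn)(float *data);

/* Placeholder kernel so the sketch is self-contained; swap in the real
 * inversion routine when measuring. */
void add_one_kernel(float *data)
{
    for (int i = 0; i < 16; ++i)
        data[i] += 1.0f;
}

/* Returns the average number of TSC ticks per call over `runs` calls. */
uint64_t average_ticks(kernel_fn fn, float *data, unsigned runs)
{
    if (runs == 0)
        return 0;

    fn(data);  /* one warm-up call to pull code and data into the caches */

    uint64_t start = __rdtsc();
    for (unsigned i = 0; i < runs; ++i)
        fn(data);
    uint64_t stop = __rdtsc();

    return (stop - start) / runs;
}
```

Calling `average_ticks(add_one_kernel, buffer, 100000)` returns a tick count that includes loop and call overhead, so it is mainly useful for comparing two routines measured the same way.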

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when actually issuing a `divps` instead of using a corrected `rcpps`.

*speculation mode off*
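
As an illustration of the two options mentioned in the speculation above, the sketch below assumes that "corrected `rcpps`" means the approximate reciprocal refined by one Newton-Raphson step, which is the usual trick; the thread does not show the actual code. The function names `rcp_refined` and `rcp_exact` are invented for this example.

```c
#include <xmmintrin.h>

/* Approximate 1/d with RCPPS, then refine with one Newton-Raphson step:
 *   x1 = x0 * (2 - d * x0)
 * (assumption: this is what "corrected rcpps" refers to). */
__m128 rcp_refined(__m128 d)
{
    __m128 x0  = _mm_rcp_ps(d);         /* roughly 12 bits of precision */
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));
}

/* Full-precision alternative using DIVPS. */
__m128 rcp_exact(__m128 d)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), d);
}
```

The refined version is close to, but not bit-identical with, the `divps` result, so the small remaining difference may still be one source of accumulated error.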

Does GLM do SSE?


Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix. The Cramer's rule implementation works perfectly all the time, while the partitioned approach frequently produces bad matrices.
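
One way to put a number on "bad matrices" is to multiply the input by the computed inverse and look at how far the product is from the identity. The sketch below is a plain scalar check, assuming row-major `float[16]` storage; the name `inverse_residual` is made up for this example.

```c
#include <float.h>

/* Returns the largest absolute deviation of (m * inv) from the identity.
 * Both matrices are row-major 4x4. Returns FLT_MAX if any entry of the
 * product is NaN or infinite. */
float inverse_residual(const float m[16], const float inv[16])
{
    float worst = 0.0f;
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += m[r * 4 + k] * inv[k * 4 + c];

            float err = sum - (r == c ? 1.0f : 0.0f);
            if (err < 0.0f)
                err = -err;

            if (!(err <= FLT_MAX))  /* true for NaN and infinity */
                return FLT_MAX;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

Running both inversion routines on the same camera matrices and comparing the residuals would show whether the partitioned version is merely less precise or outright wrong for certain inputs.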

I don't really know whether it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach. Saving those 20 cycles does not matter if the result is not trustworthy. Maybe some day I will find the time to either optimize the Cramer implementation further, or to find and remove the bug in the partitioned implementation.
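
A possible explanation, offered as a guess rather than a diagnosis: textbook block-wise inversion first inverts one 2x2 sub-block, so that sub-block must itself be invertible even when the full matrix is. Assuming the partitioned routine starts from the upper-left block (the implementation is not shown in this thread), an ordinary camera rotation can already violate that. The sketch below uses a 90 degree rotation about the Y axis; `ROT_Y_90` and `upper_left_block_det` are names invented for this example.

```c
/* A 90 degree rotation about Y as a 4x4 matrix. The full matrix is a
 * rotation, so it is invertible (determinant 1). */
const float ROT_Y_90[16] = {
     0.0f, 0.0f, 1.0f, 0.0f,
     0.0f, 1.0f, 0.0f, 0.0f,
    -1.0f, 0.0f, 0.0f, 0.0f,
     0.0f, 0.0f, 0.0f, 1.0f
};

/* Determinant of the upper-left 2x2 block of a 4x4 matrix stored as 16
 * floats. If this is zero (or tiny), a block-wise inversion that needs
 * the inverse of that block breaks down (assumption: the routine
 * partitions this way). */
float upper_left_block_det(const float m[16])
{
    return m[0] * m[5] - m[1] * m[4];
}
```

Here `upper_left_block_det(ROT_Y_90)` is zero, so a routine of that shape would divide by zero for this perfectly valid camera orientation. Checking this value for the matrices that fail would confirm or rule out the guess.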

Edited by st0ff
