SSE 4x4 Matrix transpose and invert



Did you measure it vs an optimized scalar version?


Counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation.  Very good indeed.  I haven't had the chance to profile and compare both algorithms yet, though.

P.S.: for transposing a matrix there is an intrinsics macro in "xmmintrin.h" that produces slightly different code:

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
    __m128 tmp3, tmp2, tmp1, tmp0;                                  \
                                                                    \
    tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);                  \
    tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);                  \
    tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);                  \
    tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);                  \
                                                                    \
    (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);                      \
    (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);                      \
    (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);                      \
    (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);                      \
}



... just another thing to profile.  My guess, though, is that SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS and UNPCKHPS all use the same execution unit (5) and have the same latency (1) and throughput (1), so the two variants may well run at the same speed.
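
For comparison, a transpose built from the unpack and move instructions named above might look like the sketch below. This is not code from the thread: the names `transpose4_unpack`, `transpose_with_unpack` and `transpose_with_macro` are made up for illustration, and the row-major `float[16]` layout is an assumption.

```c
#include <xmmintrin.h>  /* SSE intrinsics and _MM_TRANSPOSE4_PS */

/* Transpose four rows in place with UNPCKLPS/UNPCKHPS and
   MOVLHPS/MOVHLPS. Hypothetical helper, not taken from the thread.
   Rows are a, b, c, d; comments show the lanes from low to high. */
static void transpose4_unpack(__m128 *r0, __m128 *r1, __m128 *r2, __m128 *r3)
{
    __m128 t0 = _mm_unpacklo_ps(*r0, *r1);  /* a0 b0 a1 b1 */
    __m128 t1 = _mm_unpacklo_ps(*r2, *r3);  /* c0 d0 c1 d1 */
    __m128 t2 = _mm_unpackhi_ps(*r0, *r1);  /* a2 b2 a3 b3 */
    __m128 t3 = _mm_unpackhi_ps(*r2, *r3);  /* c2 d2 c3 d3 */

    *r0 = _mm_movelh_ps(t0, t1);            /* a0 b0 c0 d0 */
    *r1 = _mm_movehl_ps(t1, t0);            /* a1 b1 c1 d1 */
    *r2 = _mm_movelh_ps(t2, t3);            /* a2 b2 c2 d2 */
    *r3 = _mm_movehl_ps(t3, t2);            /* a3 b3 c3 d3 */
}

/* Transpose a float[16] in place using the unpack-based helper. */
static void transpose_with_unpack(float m[16])
{
    __m128 r0 = _mm_loadu_ps(m);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    transpose4_unpack(&r0, &r1, &r2, &r3);
    _mm_storeu_ps(m, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}

/* Same thing with the header macro, for a side-by-side comparison. */
static void transpose_with_macro(float m[16])
{
    __m128 r0 = _mm_loadu_ps(m);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

Both wrappers should give identical results on the same input, so any difference between them would be in speed only, and that still has to be measured.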

Edited by st0ff

Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)



To make you partly happy: I did some simple __rdtsc() profiling.  The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average.  Still, I consider the Cramer implementation the better one, because on average it accumulates less error.
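
For anyone who wants to repeat this kind of measurement, a minimal __rdtsc() harness could look like the sketch below. It is not the code behind the numbers above: `average_ticks` and `count_call` are made-up names, and `count_call` is only a stand-in for the real inversion routine.

```c
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC and Clang */
#endif

/* Stand-in workload (hypothetical): counts how often it was called.
   Replace it with a wrapper around the routine to be measured. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}

/* Average time-stamp-counter ticks per call of fn(arg).
   One extra warm-up call is made before timing starts, so fn runs
   iterations + 1 times in total. Returns 0.0 for zero iterations. */
static double average_ticks(void (*fn)(void *), void *arg, unsigned iterations)
{
    unsigned long long start, stop;
    unsigned i;

    if (iterations == 0)
        return 0.0;

    fn(arg);                          /* warm up caches and branch predictors */

    start = __rdtsc();
    for (i = 0; i < iterations; ++i)
        fn(arg);
    stop = __rdtsc();

    return (double)(stop - start) / (double)iterations;
}
```

Keep in mind that RDTSC is not a serializing instruction and that the result includes the loop and call overhead, so small differences between two routines should be read with some caution.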

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when actually issuing a divps instead of using a corrected rcpps.

*speculation mode off*
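
To make the rcpps-versus-divps point concrete, here is a small sketch of both ways to get a reciprocal. It assumes that "corrected" means one Newton-Raphson refinement step after rcpps, which is the usual trick but is not confirmed in this thread. The function names are made up.

```c
#include <xmmintrin.h>

/* Approximate 1/a: RCPPS estimate x0 refined by one Newton-Raphson
   step, x1 = x0 * (2 - a * x0). Assumed form of the "corrected rcpps". */
static __m128 recip_refined(__m128 a)
{
    __m128 x0  = _mm_rcp_ps(a);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(a, x0)));
}

/* Reference: full-precision reciprocal using DIVPS. */
static __m128 recip_div(__m128 a)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), a);
}

/* Largest relative difference between the two versions over four
   nonzero inputs. Hypothetical helper for comparing accuracy. */
static float recip_max_rel_diff(const float in[4])
{
    float fast[4], exact[4], worst = 0.0f;
    __m128 a = _mm_loadu_ps(in);
    int i;

    _mm_storeu_ps(fast, recip_refined(a));
    _mm_storeu_ps(exact, recip_div(a));

    for (i = 0; i < 4; ++i) {
        float d = (fast[i] - exact[i]) / exact[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

The refined version is close to divps but not bit-identical, and for an input of exactly zero the refinement step computes 0 * infinity and returns NaN where divps would return infinity.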

Does GLM do SSE?


Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix.  The Cramer's rule implementation works perfectly all the time, while the partitioned approach frequently produces bad matrices.
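
One way to put a number on "bad matrices" is to multiply the input by the computed inverse and see how far the product is from the identity. The sketch below does that in plain scalar C. The name `inverse_residual` is made up, and it treats both matrices as `float[16]` arrays with the same layout, which is an assumption about how the matrices are stored.

```c
/* Largest absolute deviation of (m * inv) from the 4x4 identity.
   Both arguments are float[16] in the same storage order. Because a
   matrix commutes with its inverse, the result is meaningful for
   row-major and column-major storage alike. Hypothetical helper. */
static float inverse_residual(const float m[16], const float inv[16])
{
    float worst = 0.0f;

    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += m[r * 4 + k] * inv[k * 4 + c];

            float err = sum - (r == c ? 1.0f : 0.0f);
            if (err < 0.0f)
                err = -err;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

Running this on the output of both inversion routines for the same set of camera matrices would show whether the partitioned version is merely less accurate or fails outright on particular inputs.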

I don't really know whether it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach.  Those 20 cycles saved do not matter if the result is not trustworthy.  Maybe someday I will find the time to either optimize the Cramer implementation further, or to find and remove the bug in the partitioned implementation.
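
One such condition is easy to test for. In its common form, block-wise inversion splits the 4x4 matrix into 2x2 blocks and inverts the upper-left block first, so it breaks down when that block is singular or nearly so, even if the full matrix is perfectly invertible. Whether this is what goes wrong in the implementation discussed here is not known; the sketch below only shows the check, with made-up function names and an arbitrary threshold parameter.

```c
/* Determinant of the upper-left 2x2 block of a float[16] matrix.
   The value is the same for row-major and column-major storage. */
static float upper_left_det(const float m[16])
{
    return m[0] * m[5] - m[1] * m[4];
}

/* Nonzero when the upper-left block is too close to singular to be
   inverted safely. eps is an illustrative threshold chosen by the
   caller, not a recommended value. Hypothetical helper. */
static int upper_left_block_is_risky(const float m[16], float eps)
{
    float d = upper_left_det(m);
    if (d < 0.0f)
        d = -d;
    return d < eps;
}
```

A camera rotated by 90 degrees about the y axis is an example: its rotation part is (0 0 1 / 0 1 0 / -1 0 0), the upper-left block is (0 0 / 0 1) with determinant 0, yet the full matrix is an ordinary invertible rotation. Such poses are common for cameras, which would fit the observation that the failures are frequent.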

Edited by st0ff
