SSE 4x4 Matrix transpose and invert

Counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation.  Very good indeed, although I haven't had the chance to profile and compare both algorithms yet.

 

P.S.: For transposing a matrix, "xmmintrin.h" provides an intrinsics macro that produces slightly different code:

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
            __m128 tmp3, tmp2, tmp1, tmp0;                          \
                                                                    \
            tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);          \
            tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);          \
            tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);          \
            tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);          \
                                                                    \
            (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);              \
            (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);              \
            (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);              \
            (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);              \
        }

... just another thing to profile.  But my guess is that SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS and UNPCKHPS all use the same execution unit (port 5) and have the same latency (1) and throughput (1).  So the two variants may well be equally fast.
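For anyone who wants to try the macro, here is a minimal sketch of how it might be used. The function name `transpose4x4` and the row-major `float[16]` layout are assumptions for illustration, not part of the original routine; unaligned loads and stores are used so that no particular alignment is required.

```c
#include <xmmintrin.h>  /* SSE intrinsics and _MM_TRANSPOSE4_PS */

/* Transpose a row-major 4x4 float matrix in place.
   (Hypothetical helper; layout and name are assumptions.) */
void transpose4x4(float m[16])
{
    /* Load the four rows; loadu works for any alignment. */
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    /* The macro overwrites its arguments: rows become columns. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(m + 0,  r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}
```

If the matrix is known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.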

Edited by st0ff


Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)

To make you partly happy: I did some simple __rdtsc() profiling.  The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average.  Still, the Cramer implementation is better, as on average it accumulates less error.
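A rough sketch of what such simple __rdtsc() profiling could look like is given here. The helper `average_ticks` and the stand-in workload `count_call` are hypothetical names, not the code that produced the numbers above. The result includes the call overhead and is not serialized against out-of-order execution, so it is only good for coarse comparisons.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical helper: average time-stamp-counter ticks per call of
   fn(arg), measured over `runs` calls after one warm-up call. */
uint64_t average_ticks(void (*fn)(void *), void *arg, unsigned runs)
{
    if (runs == 0)
        return 0;

    fn(arg);                       /* warm-up: touch code and data once */

    uint64_t start = __rdtsc();
    for (unsigned i = 0; i < runs; ++i)
        fn(arg);
    uint64_t end = __rdtsc();

    return (end - start) / runs;
}

/* Stand-in workload that only counts its calls; replace it with a
   wrapper around the routine to be measured. */
void count_call(void *arg)
{
    ++*(unsigned *)arg;
}
```

To compare two inversion routines, each one would be wrapped in a `void (*)(void *)` function and passed to `average_ticks` with the same number of runs.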

 

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when actually issuing a divps instead of using a corrected rcpps.
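To illustrate the two options, here is a small sketch, assuming that "corrected rcpps" means the usual single Newton-Raphson refinement step applied to the rcpps estimate. The function names are made up for this example.

```c
#include <xmmintrin.h>

/* rcpps gives an estimate y0 of 1/x with roughly 12 bits of precision.
   One Newton-Raphson step, y1 = y0 * (2 - x * y0), roughly doubles
   the number of correct bits. (Assumed meaning of "corrected rcpps".) */
__m128 rcp_refined_ps(__m128 x)
{
    __m128 y0  = _mm_rcp_ps(x);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(y0, _mm_sub_ps(two, _mm_mul_ps(x, y0)));
}

/* Reciprocal via divps: slower, but correctly rounded. */
__m128 rcp_exact_ps(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), x);
}
```

The refined version is close to, but not bit-identical with, the divps result, which may be one source of the difference in accumulated error.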

*speculation mode off*


Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix.  The Cramer's rule implementation works perfectly all the time, while the partitioned approach frequently produces bad matrices.

I don't really know if it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach.  Those 20 cycles saved do not matter if the result is not trustworthy.  Maybe some day I will find the time to either optimize the Cramer implementation further, or to find and remove the bug in the partitioned implementation.
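One possible explanation, offered as a guess and not as a diagnosis of the code discussed here: a common form of partitioned inversion splits the 4x4 matrix into 2x2 blocks and starts by inverting the upper-left block. That step fails when the block is singular, even if the full matrix is perfectly invertible. A camera rotated 90 degrees about the Y axis is such a case. The sketch below (hypothetical helper names, row-major layout assumed) checks for this condition.

```c
/* Determinant of the upper-left 2x2 block of a row-major 4x4 matrix. */
float upper_left_det2(const float m[16])
{
    return m[0] * m[5] - m[1] * m[4];
}

/* Hypothetical guard: returns nonzero if a block-wise inversion that
   begins by inverting the upper-left 2x2 block would be unsafe,
   i.e. if that block's determinant is within eps of zero. */
int upper_left_block_is_singular(const float m[16], float eps)
{
    float d = upper_left_det2(m);
    if (d < 0.0f)
        d = -d;
    return d <= eps;
}
```

For the rotation matrix with rows (0, 0, 1, 0), (0, 1, 0, 0), (-1, 0, 0, 0), (0, 0, 0, 1), the upper-left block is ((0, 0), (0, 1)), whose determinant is 0, so the guard fires although the matrix itself has determinant 1. A routine could fall back to another method in that case.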

Edited by st0ff

