Shnoutz

SSE 4x4 Matrix transpose and invert


Counting instructions, this solution needs even fewer FLOPs than an optimized Cramer's rule implementation.  Very good indeed, although I haven't yet had the chance to profile and compare both algorithms.

 

P.S.: for transposing a matrix, there is an intrinsics macro in "xmmintrin.h" that produces slightly different code:

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {                 \
            __m128 tmp3, tmp2, tmp1, tmp0;                          \
                                                                    \
            tmp0   = _mm_shuffle_ps((row0), (row1), 0x44);          \
            tmp2   = _mm_shuffle_ps((row0), (row1), 0xEE);          \
            tmp1   = _mm_shuffle_ps((row2), (row3), 0x44);          \
            tmp3   = _mm_shuffle_ps((row2), (row3), 0xEE);          \
                                                                    \
            (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88);              \
            (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD);              \
            (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88);              \
            (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD);              \
        }

... just another thing to profile.  But my guess is that SHUFPS, MOVLHPS, MOVHLPS, UNPCKLPS and UNPCKHPS all use the same execution unit (5) and have the same latency (1) and throughput (1), so the two versions may well run at the same speed.
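
For anyone who wants to try the macro quickly, here is a minimal sketch of calling it on a row-major float[16].  The helper name transpose4x4 and the use of unaligned loads and stores are my assumptions, not something from the posts above.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics; defines _MM_TRANSPOSE4_PS */

/* Hypothetical helper (not from the thread): transposes a row-major
   4x4 float matrix in place. */
static void transpose4x4(float m[16])
{
    assert(m != NULL);

    /* Unaligned loads, so the caller need not guarantee 16-byte alignment. */
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);

    /* The macro overwrites its four arguments with the transposed rows. */
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    _mm_storeu_ps(m + 0, row0);
    _mm_storeu_ps(m + 4, row1);
    _mm_storeu_ps(m + 8, row2);
    _mm_storeu_ps(m + 12, row3);
}
```

If the matrix is known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants.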

Edited by st0ff


Speculation is nigh-on worthless with SIMD code. Profile that sucker :-)

To make you partly happy: I did some simple __rdtsc() profiling.  The partitioned approach takes 80 ticks on average, while my Cramer's rule approach takes 100 ticks on average.  Still, the Cramer's rule implementation is better, as on average it accumulates less error.
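
A rough sketch of what such simple __rdtsc() profiling might look like.  The names average_ticks and transpose_workload are hypothetical, and the workload is only a stand-in (a 4x4 transpose), not either of the inversion routines measured above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>
#ifdef _MSC_VER
#include <intrin.h>     /* __rdtsc on MSVC */
#else
#include <x86intrin.h>  /* __rdtsc on GCC and Clang */
#endif

/* Hypothetical harness: average time-stamp-counter ticks per call of fn(arg). */
static uint64_t average_ticks(void (*fn)(float *), float *arg, unsigned runs)
{
    assert(fn != NULL);
    if (runs == 0)
        return 0;

    uint64_t total = 0;
    for (unsigned i = 0; i < runs; ++i) {
        uint64_t start = __rdtsc();
        fn(arg);
        total += __rdtsc() - start;
    }
    return total / runs;
}

/* Stand-in workload: in-place transpose of a row-major 4x4 matrix. */
static void transpose_workload(float *m)
{
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

Keep in mind that RDTSC is not a serializing instruction and the reading includes its own overhead, so numbers this small are only good for rough comparisons over many runs.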

 

*speculation mode on*

I guess this would make less of a difference when using AVX and doubles, or when actually issuing a divps instead of using a corrected rcpps.

*speculation mode off*
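
To illustrate what "corrected rcpps" usually means, here is a sketch assuming the common single Newton-Raphson step x1 = x0 * (2 - a * x0), next to a plain divps for reference.  The function names are hypothetical, and I am not claiming this is exactly what either inversion routine does.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: 1/a for four floats via rcpps plus one
   Newton-Raphson step, x1 = x0 * (2 - a * x0). */
static void reciprocal4_rcp_nr(const float in[4], float out[4])
{
    assert(in != NULL && out != NULL);

    __m128 a  = _mm_loadu_ps(in);
    __m128 x0 = _mm_rcp_ps(a);  /* rcpps: roughly 12 bits of precision */
    __m128 x1 = _mm_mul_ps(x0, _mm_sub_ps(_mm_set1_ps(2.0f),
                                          _mm_mul_ps(a, x0)));
    _mm_storeu_ps(out, x1);
}

/* Reference: the same four reciprocals from a real divps. */
static void reciprocal4_div(const float in[4], float out[4])
{
    assert(in != NULL && out != NULL);

    _mm_storeu_ps(out, _mm_div_ps(_mm_set1_ps(1.0f), _mm_loadu_ps(in)));
}
```

The refined result is close to the divps result but not guaranteed to be bit-identical, and an input of zero yields NaN here (0 times infinity) where divps would give infinity.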


Just a follow-up: I use my matrix inversion routine to obtain a camera's view matrix from its camera transformation matrix.  The Cramer's rule implementation works perfectly every time, while the partitioned approach frequently produces bad matrices.

I don't really know whether it is my implementation or the algorithm itself (although I found a few sites on the net stating that under certain conditions a slightly different computation is necessary), but I will not use the partitioned approach.  Saving those 20 cycles does not matter if the result is not trustworthy.  Maybe some day I will find the time either to optimize the Cramer's rule implementation further, or to find and remove the bug in the partitioned implementation.
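
One way to catch such bad matrices while debugging is to multiply the input by the computed inverse and see how far the product is from the identity.  This is only a sketch with a hypothetical function name, assuming row-major float[16] storage; what counts as an acceptable residual is up to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: largest absolute deviation of m * inv from the
   identity, for row-major 4x4 matrices. Returns NaN if any entry is NaN. */
static float inverse_residual(const float m[16], const float inv[16])
{
    assert(m != NULL && inv != NULL);

    float worst = 0.0f;
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += m[r * 4 + k] * inv[k * 4 + c];

            float err = sum - (r == c ? 1.0f : 0.0f);
            if (err != err)
                return err;      /* NaN: report it immediately */
            if (err < 0.0f)
                err = -err;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

Running both inversion routines on the same camera matrices and comparing their residuals would show whether the partitioned version only loses precision or actually breaks on certain inputs.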

Edited by st0ff
