Optimization Towards an Optimal VEX-SSE 3*3*float Matrix Transpose

Hi all,

More than a decade ago, the problem of computing a fast transpose of a 3x3 float matrix using SSE came up on this forum. The most sensible implementation stores the matrix internally as a 3x4 matrix, so that each row occupies one aligned 4-float vector and the fourth element is padding. The following version, which I believe was the fastest known until now, was presented:

On 6/27/2005 at 9:20 PM, ajas95 said:

// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps	xmm0,	xmm7		   // xmm0 : ?? z1 y1 x1
movaps	xmm1,	xmm5		   // xmm1 : ?? z0 y0 x0
unpcklps xmm0,	xmm5		   // xmm0 : y1 y0 x1 x0
unpckhps xmm7,	xmm5		   // xmm7 : ?? ?? z1 z0
movhlps	xmm1,	xmm0		   // xmm1 : ?? z1 y1 y0
shufps	xmm7,	xmm6,	11100100b  // xmm7 : ?? z2 z1 z0
movlhps	xmm0,	xmm6		   // xmm0 : ?? x2 x1 x0
shufps	xmm1,	xmm6,	01010100b  // xmm1 : ?? y2 y1 y0

(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).

I am pleased to report that I have been able to come up with a version which should be faster:

#include <xmmintrin.h>  // SSE intrinsics

inline void transpose(__m128& A, __m128& B, __m128& C) {
    // Input rows in A, B, and C. Output in same. Lanes are listed lowest first.
    __m128 T0 = _mm_unpacklo_ps(A,B);                  // a0 b0 a1 b1
    __m128 T1 = _mm_unpackhi_ps(A,B);                  // a2 b2 a3 b3
    A = _mm_movelh_ps(T0,C);                           // a0 b0 c0 c1
    B = _mm_shuffle_ps( T0,C, _MM_SHUFFLE(3,1,3,2) );  // a1 b1 c1 c3
    C = _mm_shuffle_ps( T1,C, _MM_SHUFFLE(3,2,1,0) );  // a2 b2 c2 c3
}

This should compile to 5 instructions instead of ajas95's 8. Of course, to get that level of performance with either version, the function needs to be inlined; otherwise you spend a lot of time moving the vector arguments into and out of registers.
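For anyone who wants to convince themselves that the five shuffles really produce the transpose, here is a minimal sketch of a check. The helper `transpose_matches_reference` and the test values are my own additions for illustration, not part of the original code; the fourth lane of each row is treated as padding and is not compared.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_store_ps, shuffles

// The five-shuffle transpose from the post. Lane comments list the lowest lane first.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);                   // a0 b0 a1 b1
    __m128 T1 = _mm_unpackhi_ps(A, B);                   // a2 b2 a3 b3
    A = _mm_movelh_ps(T0, C);                            // a0 b0 c0 c1
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));  // a1 b1 c1 c3
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));  // a2 b2 c2 c3
}

// Hypothetical helper (an assumption for this sketch): transposes a known
// matrix and compares the 3x3 part against a plain scalar transpose.
inline bool transpose_matches_reference() {
    // Three rows of four floats; the last element of each row is padding.
    alignas(16) float in[3][4] = {
        { 0.0f,  1.0f,  2.0f, -1.0f},
        {10.0f, 11.0f, 12.0f, -1.0f},
        {20.0f, 21.0f, 22.0f, -1.0f},
    };

    __m128 A = _mm_load_ps(in[0]);
    __m128 B = _mm_load_ps(in[1]);
    __m128 C = _mm_load_ps(in[2]);

    transpose(A, B, C);

    alignas(16) float out[3][4];
    _mm_store_ps(out[0], A);
    _mm_store_ps(out[1], B);
    _mm_store_ps(out[2], C);

    // Only the 3x3 block is meaningful; the values are copied, never
    // computed, so exact floating-point comparison is safe here.
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (out[r][c] != in[c][r])
                return false;
    return true;
}
```

Calling `transpose_matches_reference()` from a test or from `main` should return `true` on any x86 target with SSE. It checks correctness only; it says nothing about how many instructions the compiler actually emits.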

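The other crucial point is that the instructions be VEX-encoded. VEX allows three-operand instructions such as `vunpcklps`, which write to a separate destination register, instead of two-operand instructions such as `unpcklps`, which overwrite one of their sources. Without VEX, the compiler has to insert extra `movaps` copies to preserve inputs that are still needed (here, `A` is read by both unpacks), so the count rises above 5. VEX encoding is only available with AVX and later (passing e.g. `-mavx` is usually sufficient to get the compiler to generate VEX instructions).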
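A quick way to see which case you are in is to look at the `__AVX__` macro, which GCC and Clang define when `-mavx` (or a higher setting) is active and MSVC defines under `/arch:AVX`. The sketch below is only a hint: the helper name `sse_encoding` is made up for illustration, and the macro tells you what the compiler is allowed to emit, so inspecting the generated assembly remains the real test.

```cpp
#include <string>

// Hypothetical helper (an assumption, not from the post): reports which
// encoding the compiler is expected to use for 128-bit SSE intrinsics
// in this translation unit, based on the __AVX__ predefined macro.
inline std::string sse_encoding() {
#if defined(__AVX__)
    return "VEX";         // three-operand forms such as vunpcklps are available
#else
    return "legacy SSE";  // two-operand forms such as unpcklps; extra copies likely
#endif
}
```

If this reports "legacy SSE", expect to see additional `movaps` instructions around the shuffles in the disassembly.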

