# Optimization Towards an Optimal VEX-SSE 3*3*float Matrix Transpose

## Recommended Posts

Hi all,

More than a decade ago, a problem came up on this forum for computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix (so, one row stores 4 elements, aligned in a vector). A version, which I believe to be the fastest currently known, was presented:

On 6/27/2005 at 9:20 PM, ajas95 said:

// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps	xmm0,	xmm7		   // xmm0 : ?? z1 y1 x1
movaps	xmm1,	xmm5		   // xmm1 : ?? z0 y0 x0
unpcklps xmm0,	xmm5		   // xmm0 : y1 y0 x1 x0
unpckhps xmm7,	xmm5		   // xmm7 : ?? ?? z1 z0
movhlps	xmm1,	xmm0		   // xmm1 : ?? z1 y1 y0
shufps	xmm7,	xmm6,	11100100b  // xmm7 : ?? z2 z1 z0
movlhps	xmm0,	xmm6		   // xmm0 : ?? x2 x1 x0
shufps	xmm1,	xmm6,	01010100b  // xmm1 : ?? y2 y1 y0



(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).

I am pleased to report that I have been able to come up with a version which should be faster:

inline void transpose(__m128& A, __m128& B, __m128& C) {
//Input rows in __m128& A, B, and C.  Output in same.
__m128 T0 = _mm_unpacklo_ps(A,B);
__m128 T1 = _mm_unpackhi_ps(A,B);
A = _mm_movelh_ps(T0,C);
B = _mm_shuffle_ps( T0,C, _MM_SHUFFLE(3,1,3,2) );
C = _mm_shuffle_ps( T1,C, _MM_SHUFFLE(3,2,1,0) );
}

This should be 5 instructions instead of ajas95's 8 instructions. Of course, to get that level of performance with either version, you need to inline everything, or else you spend tons of time on moving floating point arguments to/from input registers.

The other thing that is crucial is that the instruction set be VEX encoded. This allows generating instructions that take three arguments, like vunpcklps, instead of instructions like unpcklps that take only two. VEX is only available in AVX and higher (usually passing e.g. -mavx is sufficient to get the compiler to generate VEX instructions).

-G

## Create an account or sign in to comment

You need to be a member in order to leave a comment

## Create an account

Sign up for a new account in our community. It's easy!

Register a new account

• 12
• 10
• 11
• 18
• 13
• ### Similar Content

• By Descent
Wow what a wild game by GalaXa Games Entertainment Interactive. Play now... it's really fun but IF you have epilepsy then don't play. It does not feature flashing pictures, but there is lots of animated stuff that might get ya. Anyway, 4 levels, 2 endings, insane action, BY INFERNAL. Please play it, right nao! Also , nice midi music composed by me is in the game.

demons.rar
• By Stewie.G
Hi,

I've been trying to implement a basic gaussian blur using the gaussian formula, and here is what it looks like so far:
float gaussian(float x, float sigma)
{
float pi = 3.14159;
float sigma_square = sigma * sigma;
float a = 1 / sqrt(2 * pi*sigma_square);
float b = exp(-((x*x) / (2 * sigma_square)));
return a * b;
}
My problem is that I don't quite know what sigma should be.
It seems that if I provide a random value for sigma, weights in my kernel won't add up to 1.
So I ended up calling my gaussian function with sigma == 1, which gives me weights adding up to 1, but also a very subtle blur.
Here is what my kernel looks like with sigma == 1
[0]    0.0033238872995488885
[1]    0.023804742479357766
[2]    0.09713820127276819
[3]    0.22585307043511713
[4]    0.29920669915475656
[5]    0.22585307043511713
[6]    0.09713820127276819
[7]    0.023804742479357766
[8]    0.0033238872995488885

I would have liked it to be more "rounded" at the top, or a better spread instead of wasting [0], [1], [2] with values bellow 0.1.
Based on my experiments, the key to this is to provide a different sigma, but if I do, my kernel values no longer adds up to 1, which results to a darker blur.
I've found this post
... which helped me a bit, but I am really confused with this the part where he divide sigma by 3.
Can someone please explain how sigma works? How is it related to my kernel size, how can I balance my weights with different sigmas, ect...

Thanks :-)

• Once I needed a program for packing an atlas with 3d models. I could not find one, so I made it.
Now it has only basic functionality. Should I improve it further? Does it need someone else?

• Hi all. I have been looking for a real-time global illumination algorithm to use in my game. I've found voxel cone tracing and I'm debating whether or not it's an algorithm worth investing my time researching and implementing. I have this doubt due to the following reasons:
. I see a lot of people say it's really hard to implement.
. Apparently this algorithm requires some Nvidia extension to work efficiently according to the original paper (I highly doubt it though)
. Barely real-time performance, meaning it's too slow to be implemented in a game

So in order to determine if I should invest time in voxel cone tracing, I want to ask the following questions:
. Is the algorithm itself flexible enough so that I can increase the performance by tweaking it (probably lowering the GI quality at the same time, but I don't care)
. Can I implement it without any driver requirement or special extensions, like the paper claims?

• I understand what stationary OBB intersection test is get min/max scalar in 15 axises, when all overlap, then minimum interval overlapping is the axis that used to push away 2 OBBs to "just touch". I have complete this.
So in sweep test, how to choose the right axis projecting and how every projection transform? I only need a ratio of current velocity when start intersect from disjoint.
I have read Ron Levine's post:
http://www.realtimecollisiondetection.net/files/levine_swept_sat.txt

And oliii's:
I get the code but cannot get it works as expect when port to my project, even I don't clearly know what params refer to.

For further optimization, I wish somebody can teach me how velocity part works conceptually instead of just code.
My case is DOBB-OBB only(dynamic to stationary).