# Critiquicize my SSE

## Recommended Posts

ajas95    767
I wrote a simple bit of SSE that does 4x4 * 1x4 multipies, 3 vectors at a time. CodeAnalyst tells me that each iteration through the loop takes 67 cycles, so it's about 22-23 cycles for each vector multiply. A slightly modified version that assumes the bottom row as 0's runs in 54 cycles / 3 vectors. So 18 cycles. I feel that that's an okay job, but it's the first SSE I've ever written (indeed, pretty much first assembly), so I wonder if anyone has any criticisms about it. My real interest isn't to make this the 733T35T fast possible code, but rather to make me a better person :) Basically, since it's uncommented, a brief explanation would do a world of good, I think. Given we want to multiply like:
A B C D   X
E F G H * Y
I J K L   Z
M N P Q   W

with the matrix in memory OpenGL style in 4 Qws like:
r0 = [A E I M]
r1 = [B F J N]
r2 = [C G K P]
r3 = [D H L Q]

and the 3 vectors :
[X0 Y0 Z0 W0]
[X1 Y1 Z1 W1]
[X2 Y2 Z2 W2]

it uses xmm0 and xmm4 to hold r0 and r1 while it shufps X0, X1, and X2 int xmm1, xmm2, xmm3, and also Y0,1,2 into xmm5,6,7. then it multiplies xmm0 (A E I M) into xmm1,2 and 3, and xmm4 (B F J N) into xmm5, 6, and 7. then adds xmm5, 6, and 7 back into xmm1,2,3. So, for instance, xmm1 now looks like: AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0 Then it loads r2 into xmm4, and the z-components into 5, 6, 7 and does the same thing. Then the same thing with the W. pretty simple, assuming you have the matrix transposed in memory already. I should also point out that this is a member func of mat4, so that's where &r0 comes from. Alright, enough talk, here's the listing:
	inline void multiply_vec4_group_asm(const vec4* v, vec4* dest, int count)
{
vec4* pr0 = &r0;

_asm
{
mov		esi, v
mov		edx, dest
mov		eax, pr0
mov		ecx, count

start_loop:
movaps	xmm0, [eax]
movaps	xmm4, [eax+10h]
movss	xmm1, [esi]
movss	xmm2, [esi+10h]
movss	xmm3, [esi+20h]
movss   xmm5, [esi+04h]
movss	xmm6, [esi+14h]
movss	xmm7, [esi+24h]

shufps	xmm1, xmm1, 0
shufps  xmm2, xmm2, 0
shufps	xmm3, xmm3, 0
shufps	xmm5, xmm5, 0
shufps  xmm6, xmm6, 0
shufps  xmm7, xmm7, 0

mulps	xmm1, xmm0
mulps	xmm2, xmm0
mulps	xmm3, xmm0
mulps   xmm5, xmm4
mulps   xmm6, xmm4
mulps	xmm7, xmm4

// xmm1 = AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0
// now madd z and w components into xmm1..3.

movaps	xmm4, [eax+20h]
movss	xmm5, [esi+08h]
movss	xmm6, [esi+18h]
movss	xmm7, [esi+28h]

shufps	xmm5, xmm5, 0
shufps	xmm6, xmm6, 0
shufps	xmm7, xmm7, 0

mulps	xmm5, xmm4
mulps	xmm6, xmm4
mulps   xmm7, xmm4

movaps	xmm4, [eax+30h]
movss	xmm5, [esi+0Bh]
movss	xmm6, [esi+1Bh]
movss	xmm7, [esi+2Bh]

shufps	xmm5, xmm5, 0
shufps	xmm6, xmm6, 0
shufps	xmm7, xmm7, 0

mulps	xmm5, xmm4
mulps	xmm6, xmm4
mulps   xmm7, xmm4

movaps  [edx],		xmm1
movaps	[edx+10h],	xmm2
movaps	[edx+20h],	xmm3

sub		ecx, 3
jnz		start_loop
}

}



##### Share on other sites
Skizz    794
Surely:
sub	ecx, 3add	edx, 30hjnz	start_loop

should be:
add	edx, 30hsub	ecx, 3jnz	start_loop

The only other things is to understand the latency and throughput for the SSE instructions. Latency is the time taken to complete an instruction and throughput is the time between sucessive dispatches of the instruction. Generally, latency > throughput. So, taking this section of code:
-3 addps xmm1, xmm5-2 addps xmm2, xmm6-1 addps xmm3, xmm70 movaps xmm4, [eax+20h]1 movss xmm5, [esi+08h]2 movss xmm6, [esi+18h]3 movss xmm7, [esi+28h]4 shufps xmm5, xmm5, 05 shufps xmm6, xmm6, 06 shufps xmm7, xmm7, 07 mulps xmm5, xmm48 mulps xmm6, xmm49 mulps xmm7, xmm4

Starting at 1, the movss had a latency of 4 and a throughput of 2 (P4 timings) and uses the MMX_SHIFT unit. Additionally, the memory access increases the latency by 6 (in the best case scenario). So, you've got a one cycle stall between 1 and 2 and 2 and 3. This dispatch stall can be alleviated by placing instrucions at 1.5 and 2.5 that don't use the MMX_SHIFT unit. Looking back up the code, you could place instruction 1 at -2.5, instruction 2 at -1.5 and 3 at -0.5 so that you no longer have the dispatch stall.
This has a knock on effect in that the addps instruction also has a thoughput of 2 so you've eliminated that stall as well:
0 movaps xmm4, [eax+20h]1 addps xmm1, xmm52 movss xmm5, [esi+08h]3 addps xmm2, xmm64 movss xmm6, [esi+18h]5 addps xmm3, xmm76 movss xmm7, [esi+28h]7 shufps xmm5, xmm5, 08 shufps xmm6, xmm6, 09 shufps xmm7, xmm7, 010 mulps xmm5, xmm411 mulps xmm6, xmm412 mulps xmm7, xmm4

As I've already stated, movss has a latency of 4 with an additional 6 for the memory access so instruction 2 won't complete for 10 cycles! So, instruction 7 which is dependant on instruction 2 will have a 5 cycle delay whilst the result of instruction 2 is generated. This is the harder stall to eliminate.
But, the most important thing to note is that your algorithm isn't optimal! You're doing far too many shuffles and you're loading the matrix for each vector.
It'll be better to do the following:
3) shuffle v.x into all four packed positions
4) multiply by row of matrix
6) shuffle v.y into all four packed positions
7) multiply by row of matrix
8) add to result of (4)
10) shuffle v.z into all four packed positions
11) multiply by row of matrix
12) add to result of (8)
13) save result of (8)
14) step to next v
You'll need 4 float vectors though, but you'd want that anyway to prevent memory alignment stalls.
With a bit of fiddling you can reduce latency stalls by loading (v+1).x between 4 and 5, so your code would be:
load matrixload first vectorprocess first vector and load second vectorloop:store previous resultprocess current vectorload next vectorif there are vectors to process then goto loopstore last vectorprocess current vectorstore current vector

You'll also get a huge speed increase if you can interleave some integer operations - effectively getting the integer stuff for free.

Skizz

##### Share on other sites
ajas95    767
Quote:
 load matrixload first vectorprocess first vector and load second vectorloop:store previous resultprocess current vectorload next vectorif there are vectors to process then goto loopstore last vectorprocess current vectorstore current vector

Thanks a lot for the help. I'd found a chart for the latency and throughput, but they hadn't bothered to define what those terms meant. But now it makes sense. In my profiler, it just says that the SSE scheduler is stalled on an instruction for so many cycles, but not why... so I guess that's why.

I haven't learned really how to balance RAW WAW hazards with latency. I guess there's probably little WAW stall. Also I didn't realize that shufs uses a mul. I thought it was just a glorified mov.

As for the algorithm, yeah I reload the whole matrix once for every batch of 3 vec multiplies, so it's "in the loop". But if you keep the whole matrix loaded, then that's 4 registers, and you use 4 registers to store the shufpsed vector, one of which is an accumulator... so there's nowhere really to preload the next vector. That's why I load in the matrix by rows to multiply the vector components in batches. Also, each component of each vector needs to be shufpsed once (using my method) at a minimum, so that 3 vecs * 4 components = 12 shufps... which is how many my code does. So I don't see how to reduce the "far too many shuffles".

I guess what I'm saying is that your algorithm looks like a really good idea, but I don't see how to do it in just 8 regs.

##### Share on other sites
ajas95    767
Okay, I tried using your algorithm by sneaking the movss and shufps where I could find unused registers, and keeping the matrix in registers the whole time, and it cut it down to 20 cycles per vector multiply, a 10%+ speed improvement and no silly "3 vector batches" rule, just as long as you're multiplying >= 2. Of course, the first one stalls like crazy, but who cares?

Here's the new code
	inline void multiply_vec4_2group_asm(const vec4* v, vec4* dest, int count)	{		vec4* pr0 = &r0;		_asm		{			mov		esi, v			mov		edx, dest			mov		eax, pr0			mov		ecx, count			dec		ecx                        // load in the matrix			movaps	xmm0, [eax]			movaps	xmm1, [eax+10h]			movaps	xmm2, [eax+20h]					movaps	xmm3, [eax+30h]                        // load in x,y,z,w into xmm4,5,6,7 respectively			movss	xmm4, [esi]				movss	xmm5, [esi+04h]				movss	xmm6, [esi+08h]					movss	xmm7, [esi+0Bh]			shufps	xmm4, xmm4, 0			shufps	xmm5, xmm5, 0			shufps  xmm6, xmm6, 0			shufps  xmm7, xmm7, 0start_loop:                        // at this point next vector is moved and shufpsed.			mulps	xmm4,	xmm0			mulps	xmm5,	xmm1			mulps   xmm6,	xmm2			mulps	xmm7,	xmm3			addps	xmm6,	xmm4			movss	xmm4,	[esi+10h]			addps	xmm7,	xmm5			shufps	xmm4,	xmm4, 0			movss	xmm5,	[esi+14h]			addps	xmm7,	xmm6			shufps	xmm5,	xmm5, 0			movss	xmm6,	[esi+18h]			movaps  [edx],	xmm7			shufps	xmm6,	xmm6, 0			movss	xmm7,	[esi+1Bh]			add		edx,	10h			add		esi,	10h			dec		ecx			shufps	xmm7,	xmm7, 0			jnz		start_loop                        // finish up the last vector.			mulps	xmm4,	xmm0			mulps	xmm5,	xmm1			mulps   xmm6,	xmm2			mulps	xmm7,	xmm3			addps	xmm6,	xmm4			addps	xmm7,	xmm5			addps	xmm7,	xmm6			movaps  [edx],	xmm7		}	}

##### Share on other sites
Skizz    794
Good to see my ideas working. I've been looking at what you've got and I think there's a small room for improvement:
start_loop:	shufps	xmm7,	xmm7, 0 ; this used to be before start_loop!	mulps	xmm4,	xmm0	mulps	xmm5,	xmm1	mulps   xmm6,	xmm2	mulps	xmm7,	xmm3	addps	xmm5,	xmm4	movss	xmm4,	[esi+10h]	addps	xmm7,	xmm6	movss	xmm6,	[esi+14h]	shufps	xmm4,	xmm4, 0	addps	xmm7,	xmm5	movss	xmm5,	[esi+18h]	shufps	xmm6,	xmm6, 0	movaps  [edx],	xmm7	movss	xmm7,	[esi+1Bh]	add		edx,	10h	add		esi,	10h	sub		ecx, 1; inc/dec can causes a dependancy on the preceding instruction; because it only updates some of the flags whereas add/sub updates; all the flags	shufps	xmm5,	xmm5, 0	jnz		start_loop

You can also decrease the set up stalls as follows (I know, it's minor):
	movss	xmm4, [esi]		movss	xmm5, [esi+04h]		movss	xmm6, [esi+08h]			movss	xmm7, [esi+0Bh]	movaps	xmm0, [eax]	movaps	xmm1, [eax+10h]	movaps	xmm2, [eax+20h]			movaps	xmm3, [eax+30h]

which descreases the stall between the first vector load and the shufps.

Also, from what docs I've got, shufps doesn't use any multiplies - it uses the MMX_SHIFT unit.

You may also want to add a strategic prefetch instruction to load the next + 1 vector, possibly:
prefetchnta byte ptr [ esi + 20h ]

Skizz

##### Share on other sites
Skizz    794
Oh, and if your vectors always have w == 0 then you can cut out a multiple/load/shuffle. And if they're always 1 then you can do an add from xmm3.

Skizz

##### Share on other sites
ajas95    767
Okay, I tried your code with some strange results. It was sometimes as few as 17 cycles, and sometimes as much as 24. The shufps xmm6 and movaps xmm7 lines get stalled and take forever to retire, which propogates through the entire next cycle through the loop. Remember, though, I'm on an AMD XP so there's probably a difference in how the pipeline stalls from a p4. The rearrangement of the loop preamble cut down big-time on the stalls though.

So it gives about a 95-cycle 4x4 * 4x4 matrix multiply, which I feel is a good result. I say 'about' because there's a big fat 24 cycle stall on the branch-mispredict after the last multiply that could be eliminated by a loop-unroll. Including the mispredict it runs in 119 cycles.

Also, there's a bug in the code, that slipped through by pure happen-stance of my test cases:
the movss xmm7, [esi+0Bh] (and 1Bh) should be 0Ch (and 1Ch)!!!
Once I moved past more complicated numbers than 1's and 0's in my tests, it became obvious. I guess my hex addition is not so good :)

##### Share on other sites
Skizz    794
The variation in execution times is probably due to the cache misses. Unrolling the loop to do a full 4x4 by 4x4 matrix multiply would certainly speed it up further. IF you wanted to keep the arbitary count parameter, you'd want to do something like:
setup loop - the preamble, loading the matrix and setting up loop counters and so onjmp vector_table [count & 3]vector_table dd case0, case3, case2, case1case0:    first multiply    sub count, 1case3:    second multiply    sub count, 1case2:    third multiply    sub count, 1case1:    fourth multiply    sub count, 1    jne case0

One side effect is that the code will read beyond the end of the vector array, so you'll need to make sure that it's valid to do so, i.e. there won't be a memory exception generated.

Skizz

P.S. Can't believe I didn't spot the 0bh / 0ch bug. Doh!