Critiquicize my SSE


ajas95    767
I wrote a simple bit of SSE that does 4x4 * 1x4 multiplies, 3 vectors at a time. CodeAnalyst tells me that each iteration through the loop takes 67 cycles, so it's about 22-23 cycles for each vector multiply. A slightly modified version that assumes the bottom row is all 0's runs in 54 cycles / 3 vectors, so 18 cycles. I feel that's an okay job, but it's the first SSE I've ever written (indeed, pretty much my first assembly), so I wonder if anyone has any criticisms about it. My real interest isn't to make this the 733T35T fastest code possible, but rather to make me a better person :)

Since it's uncommented, a brief explanation would do a world of good, I think. Given we want to multiply like:
A B C D   X
E F G H * Y
I J K L   Z
M N P Q   W

with the matrix in memory OpenGL style (column-major), as four 16-byte rows:
r0 = [A E I M]
r1 = [B F J N]
r2 = [C G K P]
r3 = [D H L Q]

and the 3 vectors :
[X0 Y0 Z0 W0]
[X1 Y1 Z1 W1]
[X2 Y2 Z2 W2]
It uses xmm0 and xmm4 to hold r0 and r1, while it shufps X0, X1, and X2 into xmm1, xmm2, xmm3, and also Y0, Y1, Y2 into xmm5, xmm6, xmm7. Then it multiplies xmm0 (A E I M) into xmm1, 2, and 3, and xmm4 (B F J N) into xmm5, 6, and 7, then adds xmm5, 6, and 7 back into xmm1, 2, 3. So, for instance, xmm1 now looks like:

AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0

Then it loads r2 into xmm4 and the z components into 5, 6, 7 and does the same thing, then the same thing with the w. Pretty simple, assuming you have the matrix transposed in memory already. I should also point out that this is a member function of mat4, so that's where &r0 comes from. Alright, enough talk, here's the listing:
	inline void multiply_vec4_group_asm(const vec4* v, vec4* dest, int count)
	{
		vec4* pr0 = &r0;

		_asm
		{
			mov		esi, v
			mov		edx, dest
			mov		eax, pr0
			mov		ecx, count

start_loop:
			movaps	xmm0, [eax]			
			movaps	xmm4, [eax+10h]		
			movss	xmm1, [esi]			
			movss	xmm2, [esi+10h]		
			movss	xmm3, [esi+20h]		
			movss   xmm5, [esi+04h]		
			movss	xmm6, [esi+14h]		
			movss	xmm7, [esi+24h]		

			shufps	xmm1, xmm1, 0		
			shufps  xmm2, xmm2, 0
			shufps	xmm3, xmm3, 0
			shufps	xmm5, xmm5, 0
			shufps  xmm6, xmm6, 0
			shufps  xmm7, xmm7, 0

			mulps	xmm1, xmm0
			mulps	xmm2, xmm0
			mulps	xmm3, xmm0
			mulps   xmm5, xmm4
			mulps   xmm6, xmm4
			mulps	xmm7, xmm4

			addps   xmm1, xmm5
			addps	xmm2, xmm6
			addps	xmm3, xmm7

			// xmm1 = AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0
			// now madd z and w components into xmm1..3.
	

			movaps	xmm4, [eax+20h]
			movss	xmm5, [esi+08h]
			movss	xmm6, [esi+18h]
			movss	xmm7, [esi+28h]

			shufps	xmm5, xmm5, 0
			shufps	xmm6, xmm6, 0
			shufps	xmm7, xmm7, 0

			mulps	xmm5, xmm4
			mulps	xmm6, xmm4
			mulps   xmm7, xmm4

			addps	xmm1, xmm5
			addps	xmm2, xmm6
			addps	xmm3, xmm7

			movaps	xmm4, [eax+30h]
			movss	xmm5, [esi+0Bh]
			movss	xmm6, [esi+1Bh]
			movss	xmm7, [esi+2Bh]

			shufps	xmm5, xmm5, 0
			shufps	xmm6, xmm6, 0
			shufps	xmm7, xmm7, 0

			mulps	xmm5, xmm4
			mulps	xmm6, xmm4
			mulps   xmm7, xmm4

			addps	xmm1, xmm5
			addps	xmm2, xmm6
			addps	xmm3, xmm7

			movaps  [edx],		xmm1
			movaps	[edx+10h],	xmm2
			movaps	[edx+20h],	xmm3

			sub		ecx, 3
			add		edx, 30h
			jnz		start_loop
		}

	}
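For reference, here's a rough SSE intrinsics equivalent of what the asm does for a single vector (a sketch using a hypothetical free function instead of the mat4 member; not part of the original post):

```cpp
#include <xmmintrin.h>

// Hypothetical sketch of the same column-major (OpenGL-style) mat4 * vec4.
// m points at 16 floats, stored as the columns [A E I M][B F J N][C G K P][D H L Q],
// 16-byte aligned (matching the movaps loads in the asm).
static void mul_mat4_vec4(const float* m, const float* v, float* out)
{
    __m128 c0 = _mm_load_ps(m +  0);   // [A E I M]
    __m128 c1 = _mm_load_ps(m +  4);   // [B F J N]
    __m128 c2 = _mm_load_ps(m +  8);   // [C G K P]
    __m128 c3 = _mm_load_ps(m + 12);   // [D H L Q]

    // Broadcast each vector component (the movss + shufps ..., 0 pairs).
    __m128 x = _mm_set1_ps(v[0]);
    __m128 y = _mm_set1_ps(v[1]);
    __m128 z = _mm_set1_ps(v[2]);
    __m128 w = _mm_set1_ps(v[3]);

    // result = x*c0 + y*c1 + z*c2 + w*c3
    __m128 r = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(x, c0), _mm_mul_ps(y, c1)),
        _mm_add_ps(_mm_mul_ps(z, c2), _mm_mul_ps(w, c3)));
    _mm_storeu_ps(out, r);
}
```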



Skizz    794
Surely:

sub ecx, 3
add edx, 30h
jnz start_loop

should be:

add edx, 30h
sub ecx, 3
jnz start_loop

The only other thing is to understand the latency and throughput of the SSE instructions. Latency is the time taken to complete an instruction, and throughput is the time between successive dispatches of the instruction. Generally, latency > throughput. So, taking this section of code:

-3 addps xmm1, xmm5
-2 addps xmm2, xmm6
-1 addps xmm3, xmm7

0 movaps xmm4, [eax+20h]

1 movss xmm5, [esi+08h]
2 movss xmm6, [esi+18h]
3 movss xmm7, [esi+28h]

4 shufps xmm5, xmm5, 0
5 shufps xmm6, xmm6, 0
6 shufps xmm7, xmm7, 0

7 mulps xmm5, xmm4
8 mulps xmm6, xmm4
9 mulps xmm7, xmm4

Starting at 1, the movss has a latency of 4 and a throughput of 2 (P4 timings) and uses the MMX_SHIFT unit. Additionally, the memory access increases the latency by 6 (in the best-case scenario). So you've got a one-cycle stall between 1 and 2, and between 2 and 3. This dispatch stall can be alleviated by placing instructions at 1.5 and 2.5 that don't use the MMX_SHIFT unit. Looking back up the code, you could place instruction 1 at -2.5, instruction 2 at -1.5, and 3 at -0.5 so that you no longer have the dispatch stall.
This has a knock-on effect in that the addps instruction also has a throughput of 2, so you've eliminated that stall as well:

0 movaps xmm4, [eax+20h]
1 addps xmm1, xmm5
2 movss xmm5, [esi+08h]
3 addps xmm2, xmm6
4 movss xmm6, [esi+18h]
5 addps xmm3, xmm7
6 movss xmm7, [esi+28h]

7 shufps xmm5, xmm5, 0
8 shufps xmm6, xmm6, 0
9 shufps xmm7, xmm7, 0

10 mulps xmm5, xmm4
11 mulps xmm6, xmm4
12 mulps xmm7, xmm4

As I've already stated, movss has a latency of 4 with an additional 6 for the memory access, so instruction 2 won't complete for 10 cycles! So instruction 7, which is dependent on instruction 2, will have a 5-cycle delay whilst the result of instruction 2 is generated. This is the harder stall to eliminate.
But, the most important thing to note is that your algorithm isn't optimal! You're doing far too many shuffles and you're loading the matrix for each vector.
It'll be better to do the following:
1) load matrix
2) load v.x
3) shuffle v.x into all four packed positions
4) multiply by row of matrix
5) load v.y
6) shuffle v.y into all four packed positions
7) multiply by row of matrix
8) add to result of (4)
9) load v.z
10) shuffle v.z into all four packed positions
11) multiply by row of matrix
12) add to result of (8)
13) save result of (8)
14) step to next v
15) jump to 2
You'll need 4-float vectors, but you'd want that anyway to prevent memory alignment stalls.
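The loop above can be sketched with SSE intrinsics (a hypothetical function; the matrix columns are hoisted out of the loop, and the software pipelining of the next vector's loads is left to the compiler here):

```cpp
#include <xmmintrin.h>

// Hypothetical sketch: matrix loaded once, one vec4 transformed per
// iteration. m is column-major, 16-byte aligned; v and out hold
// count packed 4-float vectors.
static void mul_mat4_vec4_batch(const float* m, const float* v,
                                float* out, int count)
{
    __m128 c0 = _mm_load_ps(m + 0), c1 = _mm_load_ps(m + 4);   // load matrix (steps 1)
    __m128 c2 = _mm_load_ps(m + 8), c3 = _mm_load_ps(m + 12);

    for (int i = 0; i < count; ++i, v += 4, out += 4) {
        // steps 2-12: broadcast each component, multiply by its column, accumulate
        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[0]), c0),
                       _mm_mul_ps(_mm_set1_ps(v[1]), c1)),
            _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[2]), c2),
                       _mm_mul_ps(_mm_set1_ps(v[3]), c3)));
        _mm_storeu_ps(out, r);   // step 13: save result, then step to next v
    }
}
```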
With a bit of fiddling you can reduce latency stalls by loading (v+1).x between 4 and 5, so your code would be:

load matrix
load first vector
process first vector and load second vector
loop:
store previous result
process current vector
load next vector
if there are vectors to process then goto loop
store last vector
process current vector
store current vector


You'll also get a huge speed increase if you can interleave some integer operations, effectively getting the integer stuff for free.

Skizz

ajas95    767
Quote:

load matrix
load first vector
process first vector and load second vector
loop:
store previous result
process current vector
load next vector
if there are vectors to process then goto loop
store last vector
process current vector
store current vector

Thanks a lot for the help. I'd found a chart for the latency and throughput, but they hadn't bothered to define what those terms meant. But now it makes sense. In my profiler, it just says that the SSE scheduler is stalled on an instruction for so many cycles, but not why... so I guess that's why.

I haven't really learned how to balance RAW/WAW hazards against latency. I guess there's probably little WAW stall. Also, I didn't realize that shufps uses a mul. I thought it was just a glorified mov.

As for the algorithm, yeah, I reload the whole matrix once for every batch of 3 vector multiplies, so it's "in the loop". But if you keep the whole matrix loaded, that's 4 registers, and you use another 4 registers to hold the shufpsed vector, one of which is an accumulator... so there's nowhere really to preload the next vector. That's why I load the matrix in by rows, to multiply the vector components in batches. Also, each component of each vector needs to be shufpsed at least once (using my method), so 3 vecs * 4 components = 12 shufps, which is how many my code does. So I don't see how to reduce the "far too many shuffles".

I guess what I'm saying is that your algorithm looks like a really good idea, but I don't see how to do it in just 8 regs.

ajas95    767
Okay, I tried using your algorithm, sneaking the movss and shufps in wherever I could find unused registers and keeping the matrix in registers the whole time, and it cut it down to 20 cycles per vector multiply: a 10%+ speed improvement, and no silly "3 vector batches" rule, just as long as you're multiplying >= 2 vectors. Of course, the first one stalls like crazy, but who cares?

Here's the new code

	inline void multiply_vec4_2group_asm(const vec4* v, vec4* dest, int count)
	{
		vec4* pr0 = &r0;

		_asm
		{
			mov		esi, v
			mov		edx, dest
			mov		eax, pr0
			mov		ecx, count
			dec		ecx

			// load in the matrix
			movaps	xmm0, [eax]
			movaps	xmm1, [eax+10h]
			movaps	xmm2, [eax+20h]
			movaps	xmm3, [eax+30h]

			// load in x,y,z,w into xmm4,5,6,7 respectively
			movss	xmm4, [esi]
			movss	xmm5, [esi+04h]
			movss	xmm6, [esi+08h]
			movss	xmm7, [esi+0Bh]

			shufps	xmm4, xmm4, 0
			shufps	xmm5, xmm5, 0
			shufps	xmm6, xmm6, 0
			shufps	xmm7, xmm7, 0

start_loop:
			// at this point the next vector is moved and shufpsed.
			mulps	xmm4, xmm0
			mulps	xmm5, xmm1
			mulps	xmm6, xmm2
			mulps	xmm7, xmm3

			addps	xmm6, xmm4
			movss	xmm4, [esi+10h]
			addps	xmm7, xmm5
			shufps	xmm4, xmm4, 0
			movss	xmm5, [esi+14h]
			addps	xmm7, xmm6
			shufps	xmm5, xmm5, 0
			movss	xmm6, [esi+18h]
			movaps	[edx], xmm7
			shufps	xmm6, xmm6, 0
			movss	xmm7, [esi+1Bh]
			add		edx, 10h
			add		esi, 10h
			dec		ecx
			shufps	xmm7, xmm7, 0
			jnz		start_loop

			// finish up the last vector.
			mulps	xmm4, xmm0
			mulps	xmm5, xmm1
			mulps	xmm6, xmm2
			mulps	xmm7, xmm3

			addps	xmm6, xmm4
			addps	xmm7, xmm5
			addps	xmm7, xmm6
			movaps	[edx], xmm7
		}
	}

Skizz    794
Good to see my ideas working. I've been looking at what you've got, and I think there's a little room for improvement:


start_loop:
shufps xmm7, xmm7, 0 ; this used to be before start_loop!
mulps xmm4, xmm0
mulps xmm5, xmm1
mulps xmm6, xmm2
mulps xmm7, xmm3
addps xmm5, xmm4
movss xmm4, [esi+10h]
addps xmm7, xmm6
movss xmm6, [esi+14h]
shufps xmm4, xmm4, 0
addps xmm7, xmm5
movss xmm5, [esi+18h]
shufps xmm6, xmm6, 0
movaps [edx], xmm7
movss xmm7, [esi+1Bh]
add edx, 10h
add esi, 10h
sub ecx, 1
; inc/dec can cause a dependency on the preceding instruction
; because dec only updates some of the flags, whereas add/sub update
; all the flags
shufps xmm5, xmm5, 0
jnz start_loop

You can also decrease the setup stalls as follows (I know, it's minor):

movss xmm4, [esi]
movss xmm5, [esi+04h]
movss xmm6, [esi+08h]
movss xmm7, [esi+0Bh]

movaps xmm0, [eax]
movaps xmm1, [eax+10h]
movaps xmm2, [eax+20h]
movaps xmm3, [eax+30h]

which decreases the stall between the first vector load and the shufps.

Also, from what docs I've got, shufps doesn't use any multiplies - it uses the MMX_SHIFT unit.

You may also want to add a strategic prefetch instruction to load the next + 1 vector, possibly:
prefetchnta byte ptr [ esi + 20h ]


Skizz

Skizz    794
Oh, and if your vectors always have w == 0 then you can cut out a multiply/load/shuffle. And if w is always 1 then you can do an add from xmm3 instead.
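A rough intrinsics sketch of the w == 1 case (hypothetical helper, assuming the same column-major layout as above): the fourth broadcast/multiply collapses to a plain add of the last column.

```cpp
#include <xmmintrin.h>

// Hypothetical sketch: transform a point whose w is known to be 1,
// so the final multiply by the fourth column becomes a plain add.
// m is a 16-byte-aligned column-major 4x4 matrix; only v[0..2] are read.
static void mul_mat4_point(const float* m, const float* v, float* out)
{
    __m128 c0 = _mm_load_ps(m + 0), c1 = _mm_load_ps(m + 4);
    __m128 c2 = _mm_load_ps(m + 8), c3 = _mm_load_ps(m + 12);

    __m128 r = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[0]), c0),
                   _mm_mul_ps(_mm_set1_ps(v[1]), c1)),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[2]), c2),
                   c3));   // w == 1: just add the last column
    _mm_storeu_ps(out, r);
}
```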

Skizz

Share this post


Link to post
Share on other sites
ajas95    767
Okay, I tried your code, with some strange results. It was sometimes as few as 17 cycles, and sometimes as many as 24. The shufps xmm6 and movaps xmm7 lines get stalled and take forever to retire, which propagates through the entire next pass through the loop. Remember, though, I'm on an AMD Athlon XP, so there's probably a difference in how the pipeline stalls compared to a P4. The rearrangement of the loop preamble cut down big-time on the stalls, though.

So it gives about a 95-cycle 4x4 * 4x4 matrix multiply, which I feel is a good result. I say 'about' because there's a big fat 24-cycle stall on the branch mispredict after the last multiply that could be eliminated by a loop unroll. Including the mispredict, it runs in 119 cycles.

Also, there's a bug in the code that slipped through by pure happenstance of my test cases:
the movss xmm7, [esi+0Bh] (and 1Bh) should be 0Ch (and 1Ch)!!!
Once I moved past tests with more complicated numbers than 1's and 0's, it became obvious. I guess my hex addition is not so good :)
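For what it's worth, offsets like these are easy to sanity-check at compile time (assuming vec4 is four packed floats, as the movaps stride of 10h implies):

```cpp
#include <cstddef>

// Hypothetical vec4 matching the layout the asm assumes: four packed
// floats, so x, y, z, w sit at offsets 00h, 04h, 08h, 0Ch from the base.
struct vec4 { float x, y, z, w; };

static_assert(offsetof(vec4, y) == 0x04, "y at esi+04h");
static_assert(offsetof(vec4, z) == 0x08, "z at esi+08h");
static_assert(offsetof(vec4, w) == 0x0C, "w at esi+0Ch, not 0Bh");
static_assert(sizeof(vec4)      == 0x10, "vectors stride by 10h");
```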

Skizz    794
The variation in execution times is probably due to cache misses. Unrolling the loop to do a full 4x4 by 4x4 matrix multiply would certainly speed it up further. If you wanted to keep the arbitrary count parameter, you'd want to do something like:

setup loop - the preamble, loading the matrix and setting up loop counters and so on

jmp vector_table [count & 3]

vector_table dd case0, case3, case2, case1

case0:
first multiply
sub count, 1
case3:
second multiply
sub count, 1
case2:
third multiply
sub count, 1
case1:
fourth multiply
sub count, 1
jne case0

One side effect is that the code will read beyond the end of the vector array, so you'll need to make sure that it's valid to do so, i.e. there won't be a memory exception generated.
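The jump-table-plus-fallthrough structure above is essentially Duff's device. A C++ sketch, with a hypothetical do_one callback standing in for one vector transform (count must be > 0 in this form):

```cpp
// Classic Duff's-device remainder handling: jump into the middle of a
// 4x-unrolled loop according to count % 4, then loop over full groups.
// do_one(i) stands in for transforming vector i; requires count > 0.
static void process_all(int count, void (*do_one)(int))
{
    int i = 0;
    int passes = (count + 3) / 4;        // trips through the unrolled body
    switch (count & 3) {
    case 0: do { do_one(i++);            // count % 4 == 0: start at the top
    case 3:      do_one(i++);            // deliberate fallthrough handles
    case 2:      do_one(i++);            // the remainder on the first pass
    case 1:      do_one(i++);
            } while (--passes > 0);
    }
}
```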

Skizz

P.S. Can't believe I didn't spot the 0bh / 0ch bug. Doh!
