    [A B C D]   [X]
    [E F G H] * [Y]
    [I J K L]   [Z]
    [M N P Q]   [W]
with the matrix in memory OpenGL style (column-major), as 4 rows of 4 floats like:
r0 = [A E I M]
r1 = [B F J N]
r2 = [C G K P]
r3 = [D H L Q]
and the 3 vectors:
[X0 Y0 Z0 W0]
[X1 Y1 Z1 W1]
[X2 Y2 Z2 W2]
It uses xmm0 and xmm4 to hold r0 and r1 while it shufps X0, X1, and X2 into xmm1, xmm2, xmm3, and also Y0, Y1, Y2 into xmm5, xmm6, xmm7.
Then it multiplies xmm0 (A E I M) into xmm1, 2, and 3, and xmm4 (B F J N) into xmm5, 6, and 7, then adds xmm5, 6, and 7 back into xmm1, 2, 3. So, for instance, xmm1 now looks like:
AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0
Then it loads r2 into xmm4, and the z-components into 5, 6, 7 and does the same thing. Then the same thing with the W.
Pretty simple, assuming you have the matrix transposed in memory already. I should also point out that this is a member function of mat4, so that's where &r0 comes from. Alright, enough talk, here's the listing:
    inline void multiply_vec4_group_asm(const vec4* v, vec4* dest, int count)
    {
        vec4* pr0 = &r0;
        _asm
        {
            mov esi, v
            mov edx, dest
            mov eax, pr0
            mov ecx, count
    start_loop:
            movaps xmm0, [eax]          ; r0 = A E I M
            movaps xmm4, [eax+10h]      ; r1 = B F J N
            movss xmm1, [esi]           ; X0
            movss xmm2, [esi+10h]       ; X1
            movss xmm3, [esi+20h]       ; X2
            movss xmm5, [esi+04h]       ; Y0
            movss xmm6, [esi+14h]       ; Y1
            movss xmm7, [esi+24h]       ; Y2
            shufps xmm1, xmm1, 0        ; splat each component across its register
            shufps xmm2, xmm2, 0
            shufps xmm3, xmm3, 0
            shufps xmm5, xmm5, 0
            shufps xmm6, xmm6, 0
            shufps xmm7, xmm7, 0
            mulps xmm1, xmm0            ; x * r0 for each of the 3 vectors
            mulps xmm2, xmm0
            mulps xmm3, xmm0
            mulps xmm5, xmm4            ; y * r1 for each of the 3 vectors
            mulps xmm6, xmm4
            mulps xmm7, xmm4
            addps xmm1, xmm5
            addps xmm2, xmm6
            addps xmm3, xmm7
            ; xmm1 = AX0 + BY0 | EX0 + FY0 | IX0 + JY0 | MX0 + NY0
            ; now madd z and w components into xmm1..3.
            movaps xmm4, [eax+20h]      ; r2 = C G K P
            movss xmm5, [esi+08h]       ; Z0
            movss xmm6, [esi+18h]       ; Z1
            movss xmm7, [esi+28h]       ; Z2
            shufps xmm5, xmm5, 0
            shufps xmm6, xmm6, 0
            shufps xmm7, xmm7, 0
            mulps xmm5, xmm4
            mulps xmm6, xmm4
            mulps xmm7, xmm4
            addps xmm1, xmm5
            addps xmm2, xmm6
            addps xmm3, xmm7
            movaps xmm4, [eax+30h]      ; r3 = D H L Q
            movss xmm5, [esi+0Ch]      ; W0
            movss xmm6, [esi+1Ch]      ; W1
            movss xmm7, [esi+2Ch]      ; W2
            shufps xmm5, xmm5, 0
            shufps xmm6, xmm6, 0
            shufps xmm7, xmm7, 0
            mulps xmm5, xmm4
            mulps xmm6, xmm4
            mulps xmm7, xmm4
            addps xmm1, xmm5
            addps xmm2, xmm6
            addps xmm3, xmm7
            movaps [edx], xmm1
            movaps [edx+10h], xmm2
            movaps [edx+20h], xmm3
            add esi, 30h                ; advance to the next 3 source vectors
            sub ecx, 3
            add edx, 30h
            jnz start_loop
        }
    }
Critiquicize my SSE
I wrote a simple bit of SSE that does 4x4 * 1x4 multiplies, 3 vectors at a time. CodeAnalyst tells me that each iteration through the loop takes 67 cycles, so it's about 22-23 cycles for each vector multiply. A slightly modified version that assumes the bottom row is all 0's runs in 54 cycles / 3 vectors, so 18 cycles.
I feel that's an okay job, but it's the first SSE I've ever written (indeed, pretty much the first assembly), so I wonder if anyone has any criticisms of it. My real interest isn't to make this the 733T35T fastest code possible, but rather to make me a better person :)
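For reference, here is the operation being vectorized, as a plain scalar sketch. The vec4/mat4 types here are hypothetical stand-ins for the poster's types; the matrix is stored transposed ("OpenGL style"), as described above, so each result component accumulates one stored row per vector component.

```cpp
#include <cassert>

// Hypothetical stand-ins for the thread's vec4/mat4 types.
struct vec4 { float x, y, z, w; };

// r0..r3 are the transposed rows: r0 = [A E I M], r1 = [B F J N], etc.
struct mat4 { vec4 r0, r1, r2, r3; };

// Scalar reference for one 4x4 * 1x4 multiply, dest = M * v.
// With the transposed storage, the result is v.x*r0 + v.y*r1 + v.z*r2 + v.w*r3,
// computed component-wise -- exactly what the SSE code does with splats.
vec4 multiply_vec4_scalar(const mat4& m, const vec4& v)
{
    vec4 d;
    d.x = m.r0.x * v.x + m.r1.x * v.y + m.r2.x * v.z + m.r3.x * v.w;
    d.y = m.r0.y * v.x + m.r1.y * v.y + m.r2.y * v.z + m.r3.y * v.w;
    d.z = m.r0.z * v.x + m.r1.z * v.y + m.r2.z * v.z + m.r3.z * v.w;
    d.w = m.r0.w * v.x + m.r1.w * v.y + m.r2.w * v.z + m.r3.w * v.w;
    return d;
}
```

A reference like this is handy for asserting that the assembly version produces the same results in tests.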
Basically, since it's uncommented, I figured the brief explanation up top would do a world of good.
Surely:

    sub ecx, 3
    add edx, 30h
    jnz start_loop

should be:

    add edx, 30h
    sub ecx, 3
    jnz start_loop

since jnz needs to test the zero flag set by the sub, and the add to edx overwrites it.
The only other thing is to understand the latency and throughput of the SSE instructions. Latency is the time taken to complete an instruction and throughput is the time between successive dispatches of the instruction. Generally, latency > throughput. So, taking this section of code:
    -3 addps xmm1, xmm5
    -2 addps xmm2, xmm6
    -1 addps xmm3, xmm7
     0 movaps xmm4, [eax+20h]
     1 movss xmm5, [esi+08h]
     2 movss xmm6, [esi+18h]
     3 movss xmm7, [esi+28h]
     4 shufps xmm5, xmm5, 0
     5 shufps xmm6, xmm6, 0
     6 shufps xmm7, xmm7, 0
     7 mulps xmm5, xmm4
     8 mulps xmm6, xmm4
     9 mulps xmm7, xmm4
Starting at 1, the movss has a latency of 4 and a throughput of 2 (P4 timings) and uses the MMX_SHIFT unit. Additionally, the memory access increases the latency by 6 (in the best case). So, you've got a one-cycle stall between 1 and 2, and between 2 and 3. This dispatch stall can be alleviated by placing instructions at 1.5 and 2.5 that don't use the MMX_SHIFT unit. Looking back up the code, you could place instruction 1 at -2.5, instruction 2 at -1.5 and 3 at -0.5 so that you no longer have the dispatch stall.
This has a knock-on effect in that the addps instruction also has a throughput of 2, so you've eliminated that stall as well:
     0 movaps xmm4, [eax+20h]
     1 addps xmm1, xmm5
     2 movss xmm5, [esi+08h]
     3 addps xmm2, xmm6
     4 movss xmm6, [esi+18h]
     5 addps xmm3, xmm7
     6 movss xmm7, [esi+28h]
     7 shufps xmm5, xmm5, 0
     8 shufps xmm6, xmm6, 0
     9 shufps xmm7, xmm7, 0
    10 mulps xmm5, xmm4
    11 mulps xmm6, xmm4
    12 mulps xmm7, xmm4
As I've already stated, movss has a latency of 4 with an additional 6 for the memory access, so instruction 2 won't complete for 10 cycles! So instruction 7, which is dependent on instruction 2, will have a 5-cycle delay whilst the result of instruction 2 is generated. This is the harder stall to eliminate.
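The latency-vs-throughput distinction is easiest to see even in scalar code: a single accumulator forms one long dependency chain limited by add latency, while several independent accumulators let the adds overlap and run at throughput. A sketch (illustrative only; the same idea is what the reordered SSE code exploits):

```cpp
#include <cassert>
#include <cstddef>

// One long dependency chain: each add must wait for the previous one,
// so the loop runs at the *latency* of a floating-point add.
float sum_serial(const float* a, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent chains: while one add is still in flight the CPU can
// dispatch the others, so the loop runs at the add *throughput* instead.
// n is assumed to be a multiple of 4 for brevity.
float sum_interleaved(const float* a, std::size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```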
But, the most important thing to note is that your algorithm isn't optimal! You're doing far too many shuffles and you're loading the matrix for each vector.
It'll be better to do the following:
1) load matrix
2) load v.x
3) shuffle v.x into all four packed positions
4) multiply by row of matrix
5) load v.y
6) shuffle v.y into all four packed positions
7) multiply by row of matrix
8) add to result of (4)
9) load v.z
10) shuffle v.z into all four packed positions
11) multiply by row of matrix
12) add to result of (8)
13) save result of (8)
14) step to next v
15) jump to 2
You'll need 4-float vectors though, but you'd want that anyway to prevent memory-alignment stalls.
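The steps above can be sketched with SSE compiler intrinsics rather than inline asm. This is a sketch, not the thread's actual code: function and parameter names are mine, pointers are assumed 16-byte aligned, and it includes the w term that the numbered steps leave implicit.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cassert>

// mat points at the 16 floats of the transposed matrix (r0..r3),
// v at count packed vec4s, dest at count output vec4s.
void transform_vectors(const float* mat, const float* v, float* dest, int count)
{
    // 1) load the matrix once, outside the loop
    __m128 r0 = _mm_load_ps(mat);
    __m128 r1 = _mm_load_ps(mat + 4);
    __m128 r2 = _mm_load_ps(mat + 8);
    __m128 r3 = _mm_load_ps(mat + 12);
    for (int i = 0; i < count; ++i, v += 4, dest += 4) {
        // 2-3, 5-6, 9-10) load and splat each component,
        // 4, 7, 11) multiply by the matching row
        __m128 x = _mm_mul_ps(_mm_set1_ps(v[0]), r0);
        __m128 y = _mm_mul_ps(_mm_set1_ps(v[1]), r1);
        __m128 z = _mm_mul_ps(_mm_set1_ps(v[2]), r2);
        __m128 w = _mm_mul_ps(_mm_set1_ps(v[3]), r3);
        // 8, 12) accumulate, 13) store
        _mm_store_ps(dest, _mm_add_ps(_mm_add_ps(x, y), _mm_add_ps(z, w)));
    }
}
```

The compiler then does the register allocation and scheduling that the thread is doing by hand, which is usually the easier way to experiment with orderings.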
With a bit of fiddling you can reduce latency stalls by loading (v+1).x between 4 and 5, so your code would be:
    load matrix
    load first vector
    process first vector and load second vector
    loop:
        store previous result
        process current vector
        load next vector
        if there are vectors to process then goto loop
    store last vector
    process current vector
    store current vector
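That load/process/store pipeline can be sketched in scalar C++. This is a stand-in: process() here is a hypothetical placeholder for the real SSE multiply, and count is assumed >= 2.

```cpp
#include <cassert>

// Hypothetical per-vector work; stands in for the SSE transform.
int process(int v) { return v * 2 + 1; }

// Software-pipelined loop: while the "current" element is being
// processed, the next one has already been loaded, hiding load latency.
void pipelined(const int* src, int* dst, int count)  // count >= 2
{
    int current = src[0];            // load first vector
    int result  = process(current);  // process first vector...
    current = src[1];                // ...and load second vector
    int i = 1;
    for (; i < count - 1; ++i) {     // loop:
        dst[i - 1] = result;         //   store previous result
        result = process(current);   //   process current vector
        current = src[i + 1];        //   load next vector
    }                                // if vectors remain, goto loop
    dst[i - 1] = result;             // store last result
    dst[i] = process(current);       // process and store current vector
}
```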
You'll also get a huge speed increase if you can interleave some integer operations - effectively getting the integer stuff for free.
Skizz
Quote:
load matrix
load first vector
process first vector and load second vector
loop:
store previous result
process current vector
load next vector
if there are vectors to process then goto loop
store last vector
process current vector
store current vector
Thanks a lot for the help. I'd found a chart for the latency and throughput, but they hadn't bothered to define what those terms meant. But now it makes sense. In my profiler, it just says that the SSE scheduler is stalled on an instruction for so many cycles, but not why... so I guess that's why.
I haven't really learned how to balance RAW/WAW hazards against latency. I guess there's probably little WAW stall. Also, I didn't realize that shufps uses a mul; I thought it was just a glorified mov.
As for the algorithm: yeah, I reload the whole matrix once for every batch of 3 vector multiplies, so it's "in the loop". But if you keep the whole matrix loaded, that's 4 registers, and you use 4 more to hold the shufpsed vector components, one of which is an accumulator... so there's nowhere left to preload the next vector. That's why I load the matrix by rows and multiply the vector components in batches. Also, each component of each vector needs to be shufpsed at least once (using my method), so 3 vecs * 4 components = 12 shufps... which is how many my code does. So I don't see how to reduce the "far too many shuffles".
I guess what I'm saying is that your algorithm looks like a really good idea, but I don't see how to do it in just 8 regs.
Okay, I tried using your algorithm by sneaking the movss and shufps where I could find unused registers, and keeping the matrix in registers the whole time, and it cut it down to 20 cycles per vector multiply, a 10%+ speed improvement and no silly "3 vector batches" rule, just as long as you're multiplying >= 2. Of course, the first one stalls like crazy, but who cares?
Here's the new code
    inline void multiply_vec4_2group_asm(const vec4* v, vec4* dest, int count)
    {
        vec4* pr0 = &r0;
        _asm
        {
            mov esi, v
            mov edx, dest
            mov eax, pr0
            mov ecx, count
            dec ecx
            // load in the matrix
            movaps xmm0, [eax]
            movaps xmm1, [eax+10h]
            movaps xmm2, [eax+20h]
            movaps xmm3, [eax+30h]
            // load in x,y,z,w into xmm4,5,6,7 respectively
            movss xmm4, [esi]
            movss xmm5, [esi+04h]
            movss xmm6, [esi+08h]
            movss xmm7, [esi+0Bh]
            shufps xmm4, xmm4, 0
            shufps xmm5, xmm5, 0
            shufps xmm6, xmm6, 0
            shufps xmm7, xmm7, 0
    start_loop:
            // at this point the next vector is moved and shufpsed.
            mulps xmm4, xmm0
            mulps xmm5, xmm1
            mulps xmm6, xmm2
            mulps xmm7, xmm3
            addps xmm6, xmm4
            movss xmm4, [esi+10h]
            addps xmm7, xmm5
            shufps xmm4, xmm4, 0
            movss xmm5, [esi+14h]
            addps xmm7, xmm6
            shufps xmm5, xmm5, 0
            movss xmm6, [esi+18h]
            movaps [edx], xmm7
            shufps xmm6, xmm6, 0
            movss xmm7, [esi+1Bh]
            add edx, 10h
            add esi, 10h
            dec ecx
            shufps xmm7, xmm7, 0
            jnz start_loop
            // finish up the last vector.
            mulps xmm4, xmm0
            mulps xmm5, xmm1
            mulps xmm6, xmm2
            mulps xmm7, xmm3
            addps xmm6, xmm4
            addps xmm7, xmm5
            addps xmm7, xmm6
            movaps [edx], xmm7
        }
    }
Good to see my ideas working. I've been looking at what you've got and I think there's a little room for improvement:
    start_loop:
        shufps xmm7, xmm7, 0 ; this used to be before start_loop!
        mulps xmm4, xmm0
        mulps xmm5, xmm1
        mulps xmm6, xmm2
        mulps xmm7, xmm3
        addps xmm5, xmm4
        movss xmm4, [esi+10h]
        addps xmm7, xmm6
        movss xmm6, [esi+14h]
        shufps xmm4, xmm4, 0
        addps xmm7, xmm5
        movss xmm5, [esi+18h]
        shufps xmm6, xmm6, 0
        movaps [edx], xmm7
        movss xmm7, [esi+1Bh]
        add edx, 10h
        add esi, 10h
        sub ecx, 1
        ; inc/dec can cause a dependency on the preceding instruction
        ; because it only updates some of the flags, whereas add/sub
        ; updates all the flags
        shufps xmm5, xmm5, 0
        jnz start_loop
You can also decrease the set-up stalls as follows (I know, it's minor):
    movss xmm4, [esi]
    movss xmm5, [esi+04h]
    movss xmm6, [esi+08h]
    movss xmm7, [esi+0Bh]
    movaps xmm0, [eax]
    movaps xmm1, [eax+10h]
    movaps xmm2, [eax+20h]
    movaps xmm3, [eax+30h]
which decreases the stall between the first vector load and the shufps.
Also, from what docs I've got, shufps doesn't use any multiplies - it uses the MMX_SHIFT unit.
You may also want to add a strategic prefetch instruction to load the next + 1 vector, possibly:
prefetchnta byte ptr [ esi + 20h ]
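The same hint can be written with the _mm_prefetch intrinsic. A minimal sketch (the per-vector work is a stand-in, and the 20h-byte lookahead matches the asm above); note that prefetch is advisory and never faults, so issuing it past the end of the array is safe, unlike a real movaps load:

```cpp
#include <xmmintrin.h>
#include <cassert>

// Walks an array of packed vec4s (4 floats each), prefetching the
// cache line two vec4s (0x20 bytes) ahead with the non-temporal hint.
float sum_x_with_prefetch(const float* v, int count)
{
    float total = 0.0f;
    for (int i = 0; i < count; ++i, v += 4) {
        // prefetchnta byte ptr [esi + 20h], intrinsic form
        _mm_prefetch(reinterpret_cast<const char*>(v) + 0x20, _MM_HINT_NTA);
        total += v[0];  // stand-in for the real per-vector work
    }
    return total;
}
```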
Skizz
Oh, and if your vectors always have w == 0 then you can cut out a multiply/load/shuffle. And if they're always 1 then you can do an add from xmm3.
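The w == 1 case looks like this with intrinsics. A sketch under the same assumptions as before: illustrative names, 16-byte-aligned pointers, and the matrix stored transposed so r3 holds the translation column.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Transforms points whose w is known to be 1: instead of loading,
// shuffling and multiplying v.w, just add r3 directly.
void transform_points_w1(const float* mat, const float* v, float* dest, int count)
{
    __m128 r0 = _mm_load_ps(mat),     r1 = _mm_load_ps(mat + 4);
    __m128 r2 = _mm_load_ps(mat + 8), r3 = _mm_load_ps(mat + 12);
    for (int i = 0; i < count; ++i, v += 4, dest += 4) {
        // x*r0 + r3 replaces x*r0 + w*r3, since w == 1
        __m128 acc = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[0]), r0), r3);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[1]), r1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[2]), r2));
        _mm_store_ps(dest, acc);
    }
}
```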
Skizz
Okay, I tried your code with some strange results. It was sometimes as few as 17 cycles, and sometimes as many as 24. The shufps xmm6 and movaps xmm7 lines get stalled and take forever to retire, which propagates through the entire next trip through the loop. Remember, though, I'm on an AMD Athlon XP, so there's probably a difference in how the pipeline stalls compared to a P4. The rearrangement of the loop preamble cut down big-time on the stalls, though.
So it gives about a 95-cycle 4x4 * 4x4 matrix multiply, which I feel is a good result. I say 'about' because there's a big fat 24 cycle stall on the branch-mispredict after the last multiply that could be eliminated by a loop-unroll. Including the mispredict it runs in 119 cycles.
Also, there's a bug in the code that slipped through by pure happenstance of my test cases:
the movss xmm7, [esi+0Bh] (and 1Bh) should be 0Ch (and 1Ch)!!!
Once I moved past more complicated numbers than 1's and 0's in my tests, it became obvious. I guess my hex addition is not so good :)
The variation in execution times is probably due to cache misses. Unrolling the loop to do a full 4x4 by 4x4 matrix multiply would certainly speed it up further. If you wanted to keep the arbitrary count parameter, you'd want to do something like:
    setup loop - the preamble, loading the matrix and setting up loop counters and so on

        jmp vector_table[count & 3]

        vector_table dd case0, case3, case2, case1

    case0:
        first multiply
        sub count, 1
    case3:
        second multiply
        sub count, 1
    case2:
        third multiply
        sub count, 1
    case1:
        fourth multiply
        sub count, 1
        jne case0
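In C or C++ the same jump-table trick is usually written as a Duff's-device style switch that jumps into the middle of a 4x-unrolled loop, so any count works without a separate remainder loop. A sketch with a stand-in increment instead of the real multiply:

```cpp
#include <cassert>

// Applies a stand-in operation (+1) to count elements, unrolled 4x.
// The switch dispatches into the unrolled body based on count & 3,
// mirroring the jmp vector_table[count & 3] above. count must be > 0.
void process_all(int* v, int count)
{
    int n = (count + 3) / 4;          // passes through the unrolled block
    switch (count & 3) {              // jmp vector_table[count & 3]
    case 0: do { v[0] += 1; ++v;      // case0: first "multiply"
    case 3:      v[0] += 1; ++v;      // case3: second "multiply"
    case 2:      v[0] += 1; ++v;      // case2: third "multiply"
    case 1:      v[0] += 1; ++v;      // case1: fourth "multiply"
            } while (--n != 0);       // jne case0
    }
}
```

The fallthrough is deliberate: entering at, say, case2 runs the last two operations of the first pass, then every later pass runs all four.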
One side effect is that the code will read beyond the end of the vector array, so you'll need to make sure that it's valid to do so, i.e. there won't be a memory exception generated.
Skizz
P.S. Can't believe I didn't spot the 0bh / 0ch bug. Doh!