4 x 4 ASM Matrix Multiplication

Published November 06, 2005
I spent a fair amount of time yesterday and this morning trying to avoid what looked like a lot of data shuffling when multiplying two 4x4 matrices with SIMD instructions. I didn't want to accept it, because to me shuffling data around is a waste of time unless it absolutely HAS to be done. So I looked things up on Intel's web site and, unfortunately, I was right: I have to do a ridiculous amount of data shuffling. OMG it's horrible, so ugly it makes me want to [bawling]. Oh well, it HAS to be done, so I'll have to live with it.

Here is the unedited Intel code:

multiply_4x4_matrix: ; multiplies two 4 x 4 matrices

mov edx, dword ptr [esp+4] ; src1
mov eax, dword ptr [esp+0Ch] ; dst
mov ecx, dword ptr [esp+8] ; src2
movss xmm0, dword ptr [edx]
movaps xmm1, xmmword ptr [ecx]
shufps xmm0, xmm0, 0
movss xmm2, dword ptr [edx+4]
mulps xmm0, xmm1
shufps xmm2, xmm2, 0
movaps xmm3, xmmword ptr [ecx+10h]
movss xmm7, dword ptr [edx+8]
mulps xmm2, xmm3
shufps xmm7, xmm7, 0
addps xmm0, xmm2
movaps xmm4, xmmword ptr [ecx+20h]
movss xmm2, dword ptr [edx+0Ch]
mulps xmm7, xmm4
shufps xmm2, xmm2, 0
addps xmm0, xmm7
movaps xmm5, xmmword ptr [ecx+30h]
movss xmm6, dword ptr [edx+10h]
mulps xmm2, xmm5
movss xmm7, dword ptr [edx+14h]
shufps xmm6, xmm6, 0
addps xmm0, xmm2
shufps xmm7, xmm7, 0
movlps qword ptr [eax], xmm0
movhps qword ptr [eax+8], xmm0
mulps xmm7, xmm3
movss xmm0, dword ptr [edx+18h]
mulps xmm6, xmm1
shufps xmm0, xmm0, 0
addps xmm6, xmm7
mulps xmm0, xmm4
movss xmm2, dword ptr [edx+24h]
addps xmm6, xmm0
movss xmm0, dword ptr [edx+1Ch]
movss xmm7, dword ptr [edx+20h]
shufps xmm0, xmm0, 0
shufps xmm7, xmm7, 0
mulps xmm0, xmm5
mulps xmm7, xmm1
addps xmm6, xmm0
shufps xmm2, xmm2, 0
movlps qword ptr [eax+10h], xmm6
movhps qword ptr [eax+18h], xmm6
mulps xmm2, xmm3
movss xmm6, dword ptr [edx+28h]
addps xmm7, xmm2
shufps xmm6, xmm6, 0
movss xmm2, dword ptr [edx+2Ch]
mulps xmm6, xmm4
shufps xmm2, xmm2, 0
addps xmm7, xmm6
mulps xmm2, xmm5
movss xmm0, dword ptr [edx+34h]
addps xmm7, xmm2
shufps xmm0, xmm0, 0
movlps qword ptr [eax+20h], xmm7
movss xmm2, dword ptr [edx+30h]
movhps qword ptr [eax+28h], xmm7
mulps xmm0, xmm3
shufps xmm2, xmm2, 0
movss xmm6, dword ptr [edx+38h]
mulps xmm2, xmm1
shufps xmm6, xmm6, 0
addps xmm2, xmm0
mulps xmm6, xmm4
movss xmm7, dword ptr [edx+3Ch]
shufps xmm7, xmm7, 0
addps xmm2, xmm6
mulps xmm7, xmm5
addps xmm2, xmm7
movaps xmmword ptr [eax+30h], xmm2
ret ; return to the caller
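
For anyone trying to follow the listing, here is a rough scalar C sketch of what it appears to compute, assuming both matrices are stored as 16 row-major floats. The function name multiply_4x4_reference and its signature are made up for illustration and are not from the Intel document. Each movss/shufps pair copies one element of src1 into all four lanes of a register, and each mulps/addps pair scales one row of src2 by that element and adds it to the running result row.

```c
#include <stddef.h>

/* Hypothetical scalar reference (name and signature are illustrative only).
   Assumes row-major 4x4 matrices of 32-bit floats, dst = src1 * src2,
   and that dst does not overlap src1 or src2. */
void multiply_4x4_reference(const float *src1, const float *src2, float *dst)
{
    for (size_t row = 0; row < 4; ++row) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

        for (size_t k = 0; k < 4; ++k) {
            /* movss + shufps reg, reg, 0: one src1 element in every lane */
            float s = src1[row * 4 + k];

            /* mulps + addps: scale row k of src2 and accumulate */
            for (size_t col = 0; col < 4; ++col)
                acc[col] += s * src2[k * 4 + col];
        }

        /* movlps/movhps (or movaps): write the finished result row */
        for (size_t col = 0; col < 4; ++col)
            dst[row * 4 + col] = acc[col];
    }
}
```

Seen this way, the shuffles are the price of turning one scalar into a four-wide multiplier. The assembly keeps all four rows of src2 in registers (xmm1, xmm3, xmm4, xmm5) so they are loaded only once.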

Comments

Roboguy
Do you really need to use SIMD instructions? Might look better if you didn't have to use them.
November 06, 2005 04:37 PM
Caitlin
SIMD makes things quicker. Using C with Gaussian elimination doing the multiplication takes 1074 cycles, C with Cramer's rule takes 846 cycles, and C with Cramer's rule using SIMD only takes 210 cycles. (Performance numbers taken from the Intel document where I got the code).
November 06, 2005 04:54 PM
Roboguy
Maybe, but none of those numbers sound particularly large. Remember:
Premature optimization is the root of all evil [smile].
November 06, 2005 05:01 PM
Caitlin
They're 32-bit single-precision floats.
November 06, 2005 05:11 PM
Roboguy
Still, you should probably profile before optimizing. That way, you don't waste time optimizing something that didn't really need it. You can always optimize it later if its speed is a problem.
November 06, 2005 11:44 PM