# Efficient matrix multiplication


## Recommended Posts

I know that for large matrices (hundreds by hundreds of elements) there exist algorithms asymptotically better than O(n^3) for multiplication, such as Strassen's. However, for the sizes useful in games (3x3 and 4x4 matrices), are there also ways to use slightly fewer operations than the straightforward code below?
```cpp
Matrix3 operator*(const Matrix3& A, const Matrix3& B)
{
    Matrix3 C;
    C.a[0][0] = A.a[0][0]*B.a[0][0] + A.a[1][0]*B.a[0][1] + A.a[2][0]*B.a[0][2];
    C.a[0][1] = A.a[0][1]*B.a[0][0] + A.a[1][1]*B.a[0][1] + A.a[2][1]*B.a[0][2];
    C.a[0][2] = A.a[0][2]*B.a[0][0] + A.a[1][2]*B.a[0][1] + A.a[2][2]*B.a[0][2];
    C.a[1][0] = A.a[0][0]*B.a[1][0] + A.a[1][0]*B.a[1][1] + A.a[2][0]*B.a[1][2];
    C.a[1][1] = A.a[0][1]*B.a[1][0] + A.a[1][1]*B.a[1][1] + A.a[2][1]*B.a[1][2];
    C.a[1][2] = A.a[0][2]*B.a[1][0] + A.a[1][2]*B.a[1][1] + A.a[2][2]*B.a[1][2];
    C.a[2][0] = A.a[0][0]*B.a[2][0] + A.a[1][0]*B.a[2][1] + A.a[2][0]*B.a[2][2];
    C.a[2][1] = A.a[0][1]*B.a[2][0] + A.a[1][1]*B.a[2][1] + A.a[2][1]*B.a[2][2];
    C.a[2][2] = A.a[0][2]*B.a[2][0] + A.a[1][2]*B.a[2][1] + A.a[2][2]*B.a[2][2];
    return C;
}
```

```cpp
Matrix4 operator*(const Matrix4& A, const Matrix4& B)
{
    Matrix4 C;
    C.a[0][0] = A.a[0][0]*B.a[0][0] + A.a[1][0]*B.a[0][1] + A.a[2][0]*B.a[0][2] + A.a[3][0]*B.a[0][3];
    C.a[0][1] = A.a[0][1]*B.a[0][0] + A.a[1][1]*B.a[0][1] + A.a[2][1]*B.a[0][2] + A.a[3][1]*B.a[0][3];
    C.a[0][2] = A.a[0][2]*B.a[0][0] + A.a[1][2]*B.a[0][1] + A.a[2][2]*B.a[0][2] + A.a[3][2]*B.a[0][3];
    C.a[0][3] = A.a[0][3]*B.a[0][0] + A.a[1][3]*B.a[0][1] + A.a[2][3]*B.a[0][2] + A.a[3][3]*B.a[0][3];
    C.a[1][0] = A.a[0][0]*B.a[1][0] + A.a[1][0]*B.a[1][1] + A.a[2][0]*B.a[1][2] + A.a[3][0]*B.a[1][3];
    C.a[1][1] = A.a[0][1]*B.a[1][0] + A.a[1][1]*B.a[1][1] + A.a[2][1]*B.a[1][2] + A.a[3][1]*B.a[1][3];
    C.a[1][2] = A.a[0][2]*B.a[1][0] + A.a[1][2]*B.a[1][1] + A.a[2][2]*B.a[1][2] + A.a[3][2]*B.a[1][3];
    C.a[1][3] = A.a[0][3]*B.a[1][0] + A.a[1][3]*B.a[1][1] + A.a[2][3]*B.a[1][2] + A.a[3][3]*B.a[1][3];
    C.a[2][0] = A.a[0][0]*B.a[2][0] + A.a[1][0]*B.a[2][1] + A.a[2][0]*B.a[2][2] + A.a[3][0]*B.a[2][3];
    C.a[2][1] = A.a[0][1]*B.a[2][0] + A.a[1][1]*B.a[2][1] + A.a[2][1]*B.a[2][2] + A.a[3][1]*B.a[2][3];
    C.a[2][2] = A.a[0][2]*B.a[2][0] + A.a[1][2]*B.a[2][1] + A.a[2][2]*B.a[2][2] + A.a[3][2]*B.a[2][3];
    C.a[2][3] = A.a[0][3]*B.a[2][0] + A.a[1][3]*B.a[2][1] + A.a[2][3]*B.a[2][2] + A.a[3][3]*B.a[2][3];
    C.a[3][0] = A.a[0][0]*B.a[3][0] + A.a[1][0]*B.a[3][1] + A.a[2][0]*B.a[3][2] + A.a[3][0]*B.a[3][3];
    C.a[3][1] = A.a[0][1]*B.a[3][0] + A.a[1][1]*B.a[3][1] + A.a[2][1]*B.a[3][2] + A.a[3][1]*B.a[3][3];
    C.a[3][2] = A.a[0][2]*B.a[3][0] + A.a[1][2]*B.a[3][1] + A.a[2][2]*B.a[3][2] + A.a[3][2]*B.a[3][3];
    C.a[3][3] = A.a[0][3]*B.a[3][0] + A.a[1][3]*B.a[3][1] + A.a[2][3]*B.a[3][2] + A.a[3][3]*B.a[3][3];
    return C;
}
```



##### Share on other sites
At the risk of pointing out something obvious to you, here are my thoughts.

I see two (possibly complementary) directions to take this in. One is hardware optimization using, say, MMX/SSE. I know fast multiply-accumulate instructions exist -- and what is matrix multiplication but a bunch of multiplies and accumulates? -- and you could conceivably write some inline assembly to do this for you. Chances are Intel has released a library to do this for you.

The second direction is more mathematical:

Most often I assume you will be multiplying matrices of the form,

```
[ R  d ]
[ 0  1 ]
```

where R is a rotation matrix, d is a displacement vector, "1" is a scalar and "0" is a row vector of zeros. The point is that you multiply a vector in homogeneous coordinates by this matrix on the left to translate and/or rotate it. These 3x3 and 4x4 transformation matrices are elements of SE(2) or SE(3), respectively.

The presence of the "0" and "1" leads to simplifications, i.e.,

```
[ R1 d1 ] [ R2 d2 ]   [ R1*R2  R1*d2 + d1 ]
[ 0   1 ] [ 0   1 ] = [ 0      1          ]
```

By itself, this will speed things up (a little).
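For instance, a block-structured multiply exploiting this form can be sketched as below (a hypothetical `Affine3` type, not the `Matrix3`/`Matrix4` classes above; stored row-major here for readability). It needs 36 multiplies and 27 adds instead of the 64 multiplies and 48 adds of a full 4x4 product:

```cpp
#include <cassert>

// Sketch: an affine transform stored as a 3x3 rotation R plus a translation
// d, composed with the block formula
//   [ R1 d1 ] [ R2 d2 ]   [ R1*R2  R1*d2 + d1 ]
//   [ 0   1 ] [ 0   1 ] = [ 0      1          ]
struct Affine3 {
    float R[3][3]; // rotation block (row-major)
    float d[3];    // translation column
};

Affine3 compose(const Affine3& T1, const Affine3& T2)
{
    Affine3 out{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            out.R[i][j] = 0.0f;
            for (int k = 0; k < 3; ++k)
                out.R[i][j] += T1.R[i][k] * T2.R[k][j]; // R1*R2
        }
        out.d[i] = T1.d[i];                             // d1 ...
        for (int k = 0; k < 3; ++k)
            out.d[i] += T1.R[i][k] * T2.d[k];           // ... + R1*d2
    }
    return out;
}
```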

Moreover, if we represent R1 and R2 as quaternions instead of as matrices, then we can multiply them with 12 add/subs and 16 muls, instead of 27 muls and 18 adds. Of course, this might be too much of a pain (it probably would be for me).
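A minimal sketch of that quaternion product (Hamilton convention, q = w + xi + yj + zk; the `Quat` type is hypothetical), which you can count out to 16 multiplies and 12 adds/subtracts:

```cpp
#include <cassert>

// Quaternion product: 16 muls and 12 add/subs, versus 27 muls and 18 adds
// for a 3x3 rotation-matrix product.
struct Quat { float w, x, y, z; };

Quat mul(const Quat& a, const Quat& b)
{
    return {
        a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z, // real part
        a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y, // i
        a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x, // j
        a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w  // k
    };
}
```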

Analogously, if your 3x3 matrices have 2x2 rotation matrices embedded in the upper left corner, then you can represent that 2x2 submatrix as a single number, the angle by which it rotates, in which case you can "multiply the rotation matrices" just by adding the angles. Eventually, however, this will require some trig to convert back to Cartesian form, something I generally try to avoid. But if you don't need to do the conversions often, it might be a net win.
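Sketched out (hypothetical helper names), the trade-off looks like this: composition is a single add, and only the final conversion back to matrix form pays for a sin and a cos:

```cpp
#include <cassert>
#include <cmath>

// Composing two 2D rotations by their angles: one add, versus 8 muls and
// 4 adds for a 2x2 matrix product.
float composeAngles(float theta1, float theta2) { return theta1 + theta2; }

// The deferred trig cost: one sin and one cos to recover the 2x2 matrix.
void angleToMatrix(float theta, float R[2][2])
{
    float c = std::cos(theta), s = std::sin(theta);
    R[0][0] = c; R[0][1] = -s;
    R[1][0] = s; R[1][1] =  c;
}
```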

Is there a general method to speed up multiplication of orthogonal matrices (rotation matrices are orthogonal)? I'm not sure. Someone who knows more linear algebra will have to reply.

The point is that I don't think you'll do better than O(n^3) in general, but if your matrices have structure, then you can do better by exploiting that structure.

I know the above doesn't really solve your problem outright, but I hope you find some good ideas in it. Happy coding!

##### Share on other sites
The best way to optimize matrix math is to write the code in SSE assembly.

The code below is COMPLETELY UNTESTED, but it should give you an idea of what you need to be doing. One optimization could be to prefetch both matrices into the cache at the beginning of the function.

Edit: I've also noticed that you return the matrix by creating a copy instead of by reference. This could be sped up by passing a reference to a destination matrix. Of course, with this strategy you won't be able to overload the multiply operator.

The most important machine instructions here are:

mulps - multiply aligned packed singles. This instruction performs four 32-bit float multiplies at once.

shufps - shuffle packed singles. Used here to broadcast one element of a row across all four lanes.

addps - add packed singles. Used to accumulate the four partial products.

```nasm
; dest = m2 * m1 (row-major): row i of dest = sum_k m2[i][k] * (row k of m1)
global mulmatbymat
mulmatbymat:
    mov eax, [esp+4]        ; dest matrix
    mov ecx, [esp+8]        ; m1, matrix 1
    mov edx, [esp+12]       ; m2, matrix 2

    ;; load all four rows of matrix 1 into xmm0-xmm3
    movaps xmm0, [ecx]      ; m11,m12,m13,m14
    movaps xmm1, [ecx+16*1] ; m21,m22,m23,m24
    movaps xmm2, [ecx+16*2] ; m31,m32,m33,m34
    movaps xmm3, [ecx+16*3] ; m41,m42,m43,m44

    ;; dest row 0: broadcast each element of m2 row 0, multiply by the
    ;; corresponding row of m1, and accumulate
    movaps xmm5, [edx]      ; m2 row 0: m11,m12,m13,m14
    movaps xmm4, xmm5
    shufps xmm4, xmm4, 00h  ; m11,m11,m11,m11
    mulps xmm4, xmm0        ; m11 * (m1 row 0)
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 55h  ; m12,m12,m12,m12
    mulps xmm6, xmm1        ; m12 * (m1 row 1)
    addps xmm4, xmm6
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 0AAh ; m13,m13,m13,m13
    mulps xmm6, xmm2        ; m13 * (m1 row 2)
    addps xmm4, xmm6
    shufps xmm5, xmm5, 0FFh ; m14,m14,m14,m14
    mulps xmm5, xmm3        ; m14 * (m1 row 3)
    addps xmm4, xmm5
    movaps [eax], xmm4

    ;; dest row 1, same pattern with m2 row 1
    movaps xmm5, [edx+16*1]
    movaps xmm4, xmm5
    shufps xmm4, xmm4, 00h
    mulps xmm4, xmm0
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 55h
    mulps xmm6, xmm1
    addps xmm4, xmm6
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 0AAh
    mulps xmm6, xmm2
    addps xmm4, xmm6
    shufps xmm5, xmm5, 0FFh
    mulps xmm5, xmm3
    addps xmm4, xmm5
    movaps [eax+16*1], xmm4

    ;; dest row 2
    movaps xmm5, [edx+16*2]
    movaps xmm4, xmm5
    shufps xmm4, xmm4, 00h
    mulps xmm4, xmm0
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 55h
    mulps xmm6, xmm1
    addps xmm4, xmm6
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 0AAh
    mulps xmm6, xmm2
    addps xmm4, xmm6
    shufps xmm5, xmm5, 0FFh
    mulps xmm5, xmm3
    addps xmm4, xmm5
    movaps [eax+16*2], xmm4

    ;; dest row 3
    movaps xmm5, [edx+16*3]
    movaps xmm4, xmm5
    shufps xmm4, xmm4, 00h
    mulps xmm4, xmm0
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 55h
    mulps xmm6, xmm1
    addps xmm4, xmm6
    movaps xmm6, xmm5
    shufps xmm6, xmm6, 0AAh
    mulps xmm6, xmm2
    addps xmm4, xmm6
    shufps xmm5, xmm5, 0FFh
    mulps xmm5, xmm3
    addps xmm4, xmm5
    movaps [eax+16*3], xmm4

    ret
```

[Edited by - AnthonyS on March 23, 2008 12:38:21 AM]

##### Share on other sites
Can the usage of SSE here be done in a portable way?

Or can compilers like g++ maybe use them automatically where possible if you specify a target architecture?

##### Share on other sites
Yes, by including emmintrin.h or xmmintrin.h/pmmintrin.h (depending on what SSE version you need) and then using the intrinsic functions. Of course, it will still only be SSE, so if "portable" means you want to target something different from Pentium/Athlon too, you will need to use a library like simd-cph (found on Sourceforge).

Apart from better portability and readability, intrinsic functions have the additional benefit that compilers which are unable to properly optimize code near inline assembly, such as all current releases from Microsoft, will produce much better code.
For recent versions of GCC (since you mentioned it), that's not an issue; it optimizes just fine in the presence of inline assembly. I've been surprised more than once to see it reschedule across inline-assembly boundaries just fine, too.
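For example, the 4x4 multiply from earlier in the thread can be written with intrinsics instead of raw assembly. This is a sketch (the `mulMat4` name is made up), assuming 16-byte-aligned, row-major `float[16]` matrices:

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// dest = m1 * m2 in the row-vector convention:
// row i of dest = sum_k m1[i][k] * (row k of m2).
// All three pointers must be 16-byte aligned.
void mulMat4(float* dest, const float* m1, const float* m2)
{
    __m128 row0 = _mm_load_ps(m2);      // m2 row 0
    __m128 row1 = _mm_load_ps(m2 + 4);  // m2 row 1
    __m128 row2 = _mm_load_ps(m2 + 8);  // m2 row 2
    __m128 row3 = _mm_load_ps(m2 + 12); // m2 row 3
    for (int i = 0; i < 4; ++i) {
        // broadcast each element of m1 row i across a register, multiply
        // by the matching row of m2, and accumulate
        __m128 r = _mm_mul_ps(_mm_set1_ps(m1[4*i + 0]), row0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m1[4*i + 1]), row1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m1[4*i + 2]), row2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m1[4*i + 3]), row3));
        _mm_store_ps(dest + 4*i, r);
    }
}
```

With GCC this compiles to mulps/shufps/addps sequences much like the hand-written assembly, but the compiler handles register allocation and scheduling.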
