# 4x4 matrix multiplication using SSE


## Recommended Posts

Hello everyone

I have made an implementation of 4x4 matrix multiplication using SSE instructions, written as inline asm with Visual Studio 2010. Here it is:

```cpp
/////////////////////////////////////////////////////////////////////////////
// Multiply two 4x4 row-major matrices using SSE instructions.
__forceinline void SSEMultAlligned(const f32 *pA, const f32 *pB, f32 *pC)
{
    __asm
    {
        // Get pointers to matrices into registers.
        mov eax, pA
        mov ecx, pB
        mov edx, pC

        // Row 0.
        movss xmm0, dword ptr [eax]          // Move a[0] into xmm0 first component.
        movaps xmm1, xmmword ptr [ecx]       // Move row 0 of b into xmm1.
        shufps xmm0, xmm0, 0h                // Broadcast a[0] to every xmm0 component.
        movss xmm2, dword ptr [eax + 4h]     // Move a[1] into xmm2 first component.
        mulps xmm0, xmm1                     // Multiply a[0] with row 0 of b.
        shufps xmm2, xmm2, 0h                // Broadcast a[1] to every xmm2 component.
        movaps xmm3, xmmword ptr [ecx + 10h] // Move row 1 of b into xmm3.
        movss xmm4, dword ptr [eax + 8h]     // Move a[2] into xmm4 first component.
        mulps xmm2, xmm3                     // Multiply a[1] by row 1 of b.
        shufps xmm4, xmm4, 0h                // Broadcast a[2] to every xmm4 component.
        addps xmm0, xmm2                     // Accumulate result into xmm0.
        movaps xmm2, xmmword ptr [ecx + 20h] // Load row 2 of b into xmm2.
        mulps xmm4, xmm2                     // Multiply a[2] with row 2 of b.
        movaps xmm5, xmmword ptr [ecx + 30h] // Move row 3 of b into xmm5.
        movss xmm2, dword ptr [eax + 0Ch]    // Move a[3] into xmm2 first component.
        addps xmm0, xmm4                     // Accumulate result into xmm0.
        shufps xmm2, xmm2, 0h                // Broadcast a[3] to every xmm2 component.
        mulps xmm2, xmm5                     // Multiply a[3] with row 3 of b.
        addps xmm0, xmm2                     // Accumulate result into xmm0.
        movaps xmmword ptr [edx], xmm0       // Store first row of result into c.
        // Row 0 of the resulting matrix is done.

        // Row 1.
        movss xmm0, dword ptr [eax + 1Ch]    // Load a[7] into xmm0 first component.
        movss xmm3, dword ptr [eax + 18h]    // Load a[6] into xmm3 first component.
        shufps xmm0, xmm0, 0h                // Broadcast a[7] to every xmm0 component.
        shufps xmm3, xmm3, 0h                // Broadcast a[6] to every xmm3 component.
        movaps xmm2, xmmword ptr [ecx + 20h] // Load row 2 into xmm2.
        mulps xmm0, xmm5                     // Multiply a[7] with row 3 of b.
        mulps xmm3, xmm2                     // Multiply a[6] with row 2 of b.
        movss xmm4, dword ptr [eax + 14h]    // Load a[5] into xmm4 first component.
        addps xmm0, xmm3                     // Accumulate result into xmm0.
        movaps xmm1, xmmword ptr [ecx + 10h] // Load row 1 into xmm1.
        shufps xmm4, xmm4, 0h                // Broadcast a[5] to every xmm4 component.
        movss xmm6, dword ptr [eax + 10h]    // Load a[4] into xmm6 first component.
        mulps xmm1, xmm4                     // Multiply a[5] with row 1 of b.
        shufps xmm6, xmm6, 0h                // Broadcast a[4] to every xmm6 component.
        movaps xmm2, xmmword ptr [ecx]       // Load row 0 of b into xmm2.
        addps xmm0, xmm1                     // Accumulate result into xmm0.
        mulps xmm6, xmm2                     // Multiply a[4] with row 0 of b.
        movss xmm3, dword ptr [eax + 20h]    // Load a[8] into xmm3 first component.
        addps xmm0, xmm6                     // Accumulate result into xmm0.
        shufps xmm3, xmm3, 0h                // Broadcast a[8] to every xmm3 component.
        movaps xmmword ptr [edx + 10h], xmm0 // Store second row of result into c.
        // Row 1 of the resulting matrix is done.

        // Row 2.
        mulps xmm2, xmm3                     // Multiply a[8] with row 0 of b.
        movss xmm0, dword ptr [eax + 24h]    // Load a[9] into xmm0 first component.
        movaps xmm4, xmmword ptr [ecx + 10h] // Load row 1 of b into xmm4.
        shufps xmm0, xmm0, 0h                // Broadcast a[9] to every xmm0 component.
        movss xmm5, dword ptr [eax + 28h]    // Load a[10] into xmm5 first component.
        mulps xmm0, xmm4                     // Multiply a[9] with row 1 of b.
        shufps xmm5, xmm5, 0h                // Broadcast a[10] to every xmm5 component.
        addps xmm2, xmm0                     // Accumulate result in xmm2.
        movaps xmm3, xmmword ptr [ecx + 20h] // Load row 2 of b into xmm3.
        movss xmm1, dword ptr [eax + 2Ch]    // Load a[11] into xmm1 first component.
        mulps xmm5, xmm3                     // Multiply a[10] with row 2 of b.
        shufps xmm1, xmm1, 0h                // Broadcast a[11] to every xmm1 component.
        addps xmm2, xmm5                     // Accumulate result into xmm2.
        movaps xmm4, xmmword ptr [ecx + 30h] // Load row 3 of b into xmm4.
        movss xmm6, dword ptr [eax + 3Ch]    // Load a[15] into xmm6 first component.
        mulps xmm1, xmm4                     // Multiply a[11] with row 3 of b.
        shufps xmm6, xmm6, 0h                // Broadcast a[15] to every xmm6 component.
        addps xmm2, xmm1                     // Accumulate result into xmm2.
        movaps xmmword ptr [edx + 20h], xmm2 // Store third row of result into c.
        // Row 2 of the resulting matrix is done.

        // Row 3.
        mulps xmm4, xmm6                     // Multiply a[15] with row 3 of b.
        movss xmm0, dword ptr [eax + 38h]    // Load a[14] into xmm0 first component.
        movaps xmm1, xmmword ptr [ecx + 20h] // Load row 2 of b into xmm1.
        shufps xmm0, xmm0, 0h                // Broadcast a[14] to every xmm0 component.
        movss xmm2, dword ptr [eax + 34h]    // Load a[13] into xmm2 first component.
        mulps xmm0, xmm1                     // Multiply a[14] with row 2 of b.
        shufps xmm2, xmm2, 0h                // Broadcast a[13] to every xmm2 component.
        addps xmm4, xmm0                     // Accumulate result into xmm4.
        movaps xmm5, xmmword ptr [ecx + 10h] // Load row 1 of b into xmm5.
        movss xmm1, dword ptr [eax + 30h]    // Load a[12] into xmm1 first component.
        mulps xmm2, xmm5                     // Multiply a[13] with row 1 of b.
        shufps xmm1, xmm1, 0h                // Broadcast a[12] to every xmm1 component.
        movaps xmm6, xmmword ptr [ecx]       // Load row 0 of b into xmm6.
        addps xmm4, xmm2                     // Accumulate result into xmm4.
        mulps xmm1, xmm6                     // Multiply a[12] with row 0 of b.
        addps xmm4, xmm1                     // Accumulate result into xmm4.
        movaps xmmword ptr [edx + 30h], xmm4 // Store fourth row of result into c.
    }
}
```

And here is the naive C implementation I had done a while ago:

```cpp
/////////////////////////////////////////////////////////////////////////////
// Multiply two 4x4 row-major matrices.
__forceinline void MultAlligned(const f32 *pA, const f32 *pB, f32 *pC)
{
    // Row 0.
    pC[0]  = pA[0]*pB[0]  + pA[1]*pB[4]  + pA[2]*pB[8]   + pA[3]*pB[12];
    pC[1]  = pA[0]*pB[1]  + pA[1]*pB[5]  + pA[2]*pB[9]   + pA[3]*pB[13];
    pC[2]  = pA[0]*pB[2]  + pA[1]*pB[6]  + pA[2]*pB[10]  + pA[3]*pB[14];
    pC[3]  = pA[0]*pB[3]  + pA[1]*pB[7]  + pA[2]*pB[11]  + pA[3]*pB[15];
    // Row 1.
    pC[4]  = pA[4]*pB[0]  + pA[5]*pB[4]  + pA[6]*pB[8]   + pA[7]*pB[12];
    pC[5]  = pA[4]*pB[1]  + pA[5]*pB[5]  + pA[6]*pB[9]   + pA[7]*pB[13];
    pC[6]  = pA[4]*pB[2]  + pA[5]*pB[6]  + pA[6]*pB[10]  + pA[7]*pB[14];
    pC[7]  = pA[4]*pB[3]  + pA[5]*pB[7]  + pA[6]*pB[11]  + pA[7]*pB[15];
    // Row 2.
    pC[8]  = pA[8]*pB[0]  + pA[9]*pB[4]  + pA[10]*pB[8]  + pA[11]*pB[12];
    pC[9]  = pA[8]*pB[1]  + pA[9]*pB[5]  + pA[10]*pB[9]  + pA[11]*pB[13];
    pC[10] = pA[8]*pB[2]  + pA[9]*pB[6]  + pA[10]*pB[10] + pA[11]*pB[14];
    pC[11] = pA[8]*pB[3]  + pA[9]*pB[7]  + pA[10]*pB[11] + pA[11]*pB[15];
    // Row 3.
    pC[12] = pA[12]*pB[0] + pA[13]*pB[4] + pA[14]*pB[8]  + pA[15]*pB[12];
    pC[13] = pA[12]*pB[1] + pA[13]*pB[5] + pA[14]*pB[9]  + pA[15]*pB[13];
    pC[14] = pA[12]*pB[2] + pA[13]*pB[6] + pA[14]*pB[10] + pA[15]*pB[14];
    pC[15] = pA[12]*pB[3] + pA[13]*pB[7] + pA[14]*pB[11] + pA[15]*pB[15];
}
```

The SSE implementation works: it gives the same results as the C implementation as well as the function from D3D10math.h. In the SSE implementation, I tried to keep the dependency chains as short as I could by interleaving operations on different registers. When I finish a row, I start with the column I last treated to avoid a movaps.

I have not benchmarked it yet though. I will try to do that tomorrow when I get back from work and compare it to the C implementation.
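For whoever ends up doing a similar comparison, here is a minimal sketch of the kind of timing harness I mean (the name `TimeMatMul` and the use of `std::chrono` are my own choices, not from this thread; on VS2010 you would use `QueryPerformanceCounter` instead):

```cpp
#include <chrono>

// Hypothetical micro-benchmark helper: time many back-to-back 4x4 multiplies
// and report the average nanoseconds per call. Writing the result into pC
// each iteration keeps the call observable so the compiler cannot drop it.
template <typename MulFn>
double TimeMatMul(MulFn mul, const float *pA, const float *pB, float *pC, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        mul(pA, pB, pC);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

For a meaningful number, run enough iterations to swamp timer resolution and compare both implementations on the same inputs.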

I know some people here are quite proficient in assembly, and this is my first serious try at SSE. I would be very interested in any comments, criticism, or hints that could improve this implementation.

One question I have is this: besides portability, would there be a gain in using an external assembler like YASM?

##### Share on other sites
I'd prefer an implementation written using the compiler's SSE intrinsics, rather than directly in assembly. When using intrinsics, the optimiser is able to understand your code fully, instead of disabling itself like it does when encountering an `asm` block.
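As a rough illustration of what that would look like, here is a minimal intrinsics sketch (the name `SSEMultIntrin` and the `_mm_set1_ps` broadcast are my own choices, not from the original post; it assumes row-major, 16-byte-aligned matrices like the asm version):

```cpp
#include <xmmintrin.h> // SSE intrinsics: _mm_load_ps, _mm_set1_ps, etc.

// Hypothetical intrinsics equivalent of SSEMultAlligned: C = A * B for
// row-major 4x4 matrices. All pointers must be 16-byte aligned.
inline void SSEMultIntrin(const float *pA, const float *pB, float *pC)
{
    __m128 b0 = _mm_load_ps(pB);      // Row 0 of b.
    __m128 b1 = _mm_load_ps(pB + 4);  // Row 1 of b.
    __m128 b2 = _mm_load_ps(pB + 8);  // Row 2 of b.
    __m128 b3 = _mm_load_ps(pB + 12); // Row 3 of b.
    for (int i = 0; i < 4; ++i)
    {
        // Broadcast each element of row i of a and accumulate, the same
        // movss/shufps/mulps/addps pattern as the asm version.
        __m128 r = _mm_mul_ps(_mm_set1_ps(pA[4*i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(pA[4*i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(pA[4*i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(pA[4*i + 3]), b3));
        _mm_store_ps(pC + 4*i, r);
    }
}
```

The compiler is free to unroll the loop and schedule the register interleaving itself, which is exactly the work the hand-written asm does manually.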

With your C implementation, you should either use restrict pointers, or read the input arrays into local variables first, e.g.:

```cpp
float a0  = pA[0],  b0  = pB[0],  a1  = pA[1],  b1  = pB[1],
      a2  = pA[2],  b2  = pB[2],  a3  = pA[3],  b3  = pB[3],
      a4  = pA[4],  b4  = pB[4],  a5  = pA[5],  b5  = pB[5],
      a6  = pA[6],  b6  = pB[6],  a7  = pA[7],  b7  = pB[7],
      a8  = pA[8],  b8  = pB[8],  a9  = pA[9],  b9  = pB[9],
      a10 = pA[10], b10 = pB[10], a11 = pA[11], b11 = pB[11],
      a12 = pA[12], b12 = pB[12], a13 = pA[13], b13 = pB[13],
      a14 = pA[14], b14 = pB[14], a15 = pA[15], b15 = pB[15];
pC[0] = a0*b0 + a1*b4 + a2*b8 + a3*b12;
...
```
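The restrict variant could be sketched as follows (the name `MultRestrict` is hypothetical; MSVC spells the keyword `__restrict`, C99 spells it `restrict`). Promising the compiler that `pC` never aliases `pA` or `pB` lets it keep loaded values in registers across the stores:

```cpp
// Hypothetical restrict-qualified version of the naive multiply: the
// __restrict qualifiers assert the three arrays do not overlap.
inline void MultRestrict(const float * __restrict pA,
                         const float * __restrict pB,
                         float * __restrict pC)
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            pC[4*i + j] = pA[4*i + 0] * pB[0*4 + j]
                        + pA[4*i + 1] * pB[1*4 + j]
                        + pA[4*i + 2] * pB[2*4 + j]
                        + pA[4*i + 3] * pB[3*4 + j];
}
```

Without the qualifiers, every write through `pC` forces the compiler to assume `pA` and `pB` may have changed and reload them.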

##### Share on other sites

> I'd prefer an implementation written using the compiler's SSE intrinsics, rather than directly in assembly. When using intrinsics, the optimiser is able to understand your code fully, instead of disabling itself like it does when encountering an `asm` block.
>
> With your C implementation, you should either use restrict pointers, or read the input arrays into local variables first.

Hmm, restrict sounds like a good idea, I never thought about it that way. I didn't know that intrinsics would benefit from the compiler's automatic optimization. I will try those two suggestions and benchmark the results to see which is faster (at least on my machine). Prefetching or restrict might also help with the SSE version.

Thank you very much for your comment and references, very useful as usual.

##### Share on other sites
> Prefetching or restrict might also help with the SSE version.

Pre-fetching will not help.

> __forceinline

Arrrrggggghhhhh!

##### Share on other sites

> > Prefetching or restrict might also help with the SSE version.
>
> Pre-fetching will not help.
>
> > __forceinline
>
> Arrrrggggghhhhh!

A lot of very interesting information in that thread, I will read it and try to make the best out of it. Thank you very much, it's appreciated.
