## The problem

**Goal**Multiply a batch of Vector3f's with the same 4x4 matrix.

**Restrictions**

- 'src' and 'dst' arrays shouldn't point to the same memory location
- All pointers should be 16-byte aligned (see below for details on array sizes)
- Treat Vector3f's as positions (w = 1.0)
- Matrix is column major

## Structs and helper functions

```
struct _Vector3f
{
float x, y, z;
};
struct _Vector4f
{
float x, y, z, w;
};
_Vector3f* AllocateSrcArray(unsigned int numVertices)
{
// We are loading 3 xmm regs per loop. So we need numLoops * 12 floats, where
// numLoops is (numVertices / 4) + 1 for the SSE version
unsigned int memSize = sizeof(float) * ((numVertices >> 2) + 1) * 12;
return (_Vector3f*)_aligned_malloc(memSize, 16);
}
_Vector4f* AllocateDstArray(unsigned int numVertices)
{
unsigned int memSize = sizeof(_Vector4f) * (((numVertices >> 2) + 1) << 2);
return (_Vector4f*)_aligned_malloc(memSize, 16);
}
```

As you can see, we are wasting a bit or memory in order to avoid dealing with special cases in the SSE version. The C version isn't affected by the extra padding, so there is no overhead. The worst case scenario (numVertices = 4 * n + 1) is to allocate an extra 48 bytes for the 'dst' array and extra 4 bytes for the 'src' array (total = 52 bytes). Nothing extraordinary when dealing with large batches of vertices. Code-side, the worst case is to perform 3 extra stores to the 'dst' array and 3 extra loads from the 'src' array.
Also note that we don't impose any restrictions on the matrix values, so the result of each transformation should be a 4-element vector.
## C code

```
void TransformVec3Pos_C(_Vector3f* __restrict src, _Vector4f* __restrict dst, unsigned int numVertices, float* __restrict matrix4x4)
{
for(unsigned int iVertex = 0;iVertex < numVertices;++iVertex)
{
_Vector3f* s = &src[iVertex];
_Vector4f* d = &dst[iVertex];
d->x = matrix4x4[0] * s->x + matrix4x4[4] * s->y + matrix4x4[ 8] * s->z + matrix4x4[12];
d->y = matrix4x4[1] * s->x + matrix4x4[5] * s->y + matrix4x4[ 9] * s->z + matrix4x4[13];
d->z = matrix4x4[2] * s->x + matrix4x4[6] * s->y + matrix4x4[10] * s->z + matrix4x4[14];
d->w = matrix4x4[3] * s->x + matrix4x4[7] * s->y + matrix4x4[11] * s->z + matrix4x4[15];
}
}
```

Performance (8k vertices): ~19 cycles/vertex (?: 1.4)
There is nothing special with this code. Straight Vec3/Matrix4x4 multiply. Plain-old FPU code generated by the compiler.
Note: Turning on /arch:SSE compiler option doesn't seem to produce any SSE code for the above function. The compiler insisted on using the FPU for all the calculations. Using /arch:SSE2 compiler option ended up producing a lot of SSE2 double-to-float and float-to-double conversions, which in turn made things worse performance-wise.
## SSE Optimized

The code below process 4 vertices at a time. 4 vertices * 3 floats/vertex = 12 floats = 3 loads of 4 floats per loop First load = (x0, y0, z0, x1) Second load = (y1, z1, x2, y2) Third load = (z2, x3, y3, z3) We don't keep the matrix in xmm regs in order to avoid intermediate stores (profiling says this is good). What we do is try to keep the matrix loads to a minimum by reusing each column as much as possible (see comment in code). Hopefully it should be in cache most of the time. Written in asm because I couldn't find a way to force the compiler to generate the code below using intrisics. The compiler insisted on using the stack for intermediate stores (matrix columns).```
void TrasnformVec3Pos_SSE_Asm(_Vector3f* __restrict src, _Vector4f* __restrict dst, unsigned int numVertices, float* __restrict matrix)
{
__asm
{
// Calculate the total number of iterations...
mov ecx, [numVertices];
shr ecx, 0x02;
inc ecx; // ecx = totalIterations = (numVertices >> 2) + 1;
xor esi, esi; // esi = iIter = 0;
mov eax, [src]; // eax = &src[0];
mov edx, [dst]; // edx = &dst[0];
mov edi, [matrix]; // edi = &matrix[0];
_LoopStart:
// Matrix column 0 is being loaded 2 times.
// Matrix column 1 is being loaded 2 times.
// Matrix column 2 is being loaded 1 time.
// Matrix column 3 is being loaded 1 time.
// Other that the above extra memory loads, each vector is loaded once,
// and there aren't any intermediate stores.
prefetcht0 [eax + 0x100];
movaps xmm0, [eax]; // xmm0 = (x0, y0, z0, x1)
movaps xmm1, [edi]; // xmm1 = (m0, m1, m2, m3) << Matrix.Column[0]
movaps xmm2, xmm0; // xmm2 = (x0, y0, z0, x1)
movaps xmm3, xmm0; // xmm3 = (x0, y0, z0, x1)
shufps xmm2, xmm2, 0x00; // xmm2 = (x0, x0, x0, x0)
prefetcht0 [edx + 0x80];
shufps xmm3, xmm3, 0xFF; // xmm3 = (x1, x1, x1, x1)
mulps xmm2, xmm1; // xmm2 = (x0 * m0, x0 * m1, x0 * m2, x0 * m3)
mulps xmm3, xmm1; // xmm3 = (x1 * m0, x1 * m1, x1 * m2, x1 * m3)
movaps xmm4, xmm0; // xmm4 = (x0, y0, z0, x1)
movaps xmm1, [edi + 0x10]; // xmm1 = (m4, m5, m6, m7) << Matrix.Column[1]
shufps xmm4, xmm4, 0x55; // xmm4 = (y0, y0, y0, y0)
shufps xmm0, xmm0, 0xAA; // xmm0 = (z0, z0, z0, z0)
mulps xmm4, xmm1; // xmm4 = (y0 * m4, y0 * m5, y0 * m6, y0 * m7)
movaps xmm5, [eax + 0x10]; // xmm5 = (y1, z1, x2, y2)
addps xmm2, xmm4; // xmm2 = (x0 * m0 + y0 * m4, x0 * m1 + y0 * m5, x0 * m2 + y0 * m6, x0 * m3 + y0 * m7)
movaps xmm6, xmm5; // xmm6 = (y1, z1, x2, y2)
movaps xmm4, [edi + 0x20]; // xmm4 = (m8, m9, m10, m11) << Matrix.Column[2]
shufps xmm6, xmm6, 0x00; // xmm6 = (y1, y1, y1, y1)
mulps xmm0, xmm4; // xmm0 = (z0 * m8, z0 * m9, z0 * m10, z0 * m11)
mulps xmm6, xmm1; // xmm6 = (y1 * m4, y1 * m5, y1 * m6, y1 * m7)
addps xmm0, xmm2; // xmm0 = (x0 * m0 + y0 * m4 + z0 * m8, x0 * m1 + y0 * m5 + z0 * m9, x0 * m2 + y0 * m6 + z0 * m10, x0 * m3 + y0 * m7 + z0 * m11)
addps xmm3, xmm6; // xmm3 = (x1 * m0 + y1 * m4, x1 * m1 + y1 * m5, x1 * m2 + y1 * m6, x1 * m3 + y1 * m7)
movaps xmm2, xmm5; // xmm2 = (y1, z1, x2, y2)
movaps xmm6, xmm5; // xmm6 = (y1, z1, x2, y2)
shufps xmm2, xmm2, 0x55; // xmm2 = (z1, z1, z1, z1)
shufps xmm6, xmm6, 0xFF; // xmm6 = (y2, y2, y2, y2)
mulps xmm2, xmm4; // xmm2 = (z1 * m8, z1 * m9, z1 * m10, z1 * m11)
mulps xmm6, xmm1; // xmm6 = (y2 * m4, y2 * m5, y2 * m6, y2 * m7)
addps xmm2, xmm3; // xmm2 = (x1 * m0 + y1 * m4 + z1 * m8, x1 * m1 + y1 * m5 + z1 * m9, x1 * m2 + y1 * m6 + z1 * m10, x1 * m3 + y1 * m7 + z1 * m11)
movaps xmm1, [edi]; // xmm1 = (m0, m1, m2, m3) << Matrix.Column[0]
shufps xmm5, xmm5, 0xAA; // xmm5 = (x2, x2, x2, x2)
movaps xmm3, [eax + 0x20]; // xmm3 = (z2, x3, y3, z3)
mulps xmm5, xmm1; // xmm5 = (x2 * m0, x2 * m1, x2 * m2, x2 * m3)
movaps xmm7, xmm3; // xmm7 = (z2, x3, y3, z3)
addps xmm6, xmm5; // xmm6 = (x2 * m0 + y2 * m4, x2 * m1 + y2 * m5, x2 * m2 + y2 * m6, x2 * m3 + y2 * m7)
shufps xmm7, xmm7, 0x00; // xmm7 = (z2, z2, z2, z2)
movaps xmm5, xmm3; // xmm5 = (z2, x3, y3, z3)
mulps xmm7, xmm4; // xmm7 = (z2 * m8, z2 * m9, z2 * m10, z2 * m11)
shufps xmm5, xmm5, 0x55; // xmm5 = (x3, x3, x3, x3)
addps xmm6, xmm7; // xmm6 = (x2 * m0 + y2 * m4 + z2 * m8, x2 * m1 + y2 * m5 + z1 * m9, x2 * m2 + y2 * m6 + z2 * m10, x2 * m3 + y2 * m7 + z2 * m11)
mulps xmm5, xmm1; // xmm5 = (x3 * m0, x3 * m1, x3 * m2, x3 * m3)
movaps xmm7, [edi + 0x30]; // xmm7 = (m12, m13, m14, m15) << Matrix.Column[3]
movaps xmm1, [edi + 0x10]; // xmm1 = (m4, m5, m6, m7) << Matrix.Column[1]
add eax, 0x30;
addps xmm0, xmm7; // xmm0 = d0
addps xmm2, xmm7; // xmm2 = d1
addps xmm6, xmm7; // xmm6 = d2
addps xmm5, xmm7; // xmm5 = (x3 * m0 + m12, x3 * m1 + m13, x3 * m2 + m14, x3 * m3 + m15)
movaps [edx], xmm0; // dst = xmm0
movaps xmm7, xmm3; // xmm7 = (z2, x3, y3, z3)
movaps [edx + 0x10], xmm2; // dst[i + 1] = xmm2
shufps xmm7, xmm7, 0xAA; // xmm7 = (y3, y3, y3, y3)
shufps xmm3, xmm3, 0xFF; // xmm3 = (z3, z3, z3, z3)
mulps xmm7, xmm1; // xmm7 = (y3 * m4, y3 * m5, y3 * m6, y3 * m7)
mulps xmm3, xmm4; // xmm3 = (z3 * m8, z3 * m9, z3 * m10, z3 * m11)
addps xmm5, xmm7; // xmm5 = (x3 * m0 + y3 * m4 + m12, x3 * m1 + y3 * m5 + m13, x3 * m2 + y3 * m6 + m14, x3 * m3 + y3 * m7 + m15)
movaps [edx + 0x20], xmm6; // dst[i + 2] = xmm6
addps xmm5, xmm3; // xmm5 = d3
add edx, 0x40;
inc esi;
cmp esi, ecx;
movaps [edx - 0x10], xmm5; // dst[i + 3] = xmm5
jb _LoopStart;
}
}
```

Performance (8k vertices): ~7 cycles/vertex (?: 0.15)
## Comparison

In order to compare the functions above, we execute 100,000 iterations for each batch size and calculate the clock cycles taken for each one of them. The results are then sorted and the middle 50,000 iterations are used to calculate the average and standard deviation. All values are in cycles/vertexBatchsize(vertices) | C | SSE | Speedup |

128 | 24.3(s=0.98) | 13.8(s=0.48) | 1.76x |

256 | 21.5(s=1.86) | 12.8(s=0.36) | 1.67x |

512 | 19.5(s=1.71) | 8.8(s=0.2) | 2.21x |

1024 | 18.6(s=1.23) | 8.3(s=0.26) | 2.24x |

4096 | 18.9(s=1.29) | 7.8(s=0.13) | 2.42x |

8192 | 18.8(s=1.43) | 7.1(s=0.15) | 2.64x |

65536 | 20.4(s=1.53) | 8.2(s=0.34) | 2.48x |