This is a rewrite of an old blog post I wrote a couple of years back. I finally found some time to redo the performance tests, based on an observation Sean Barrett made on the original post. The code below remains the same; the difference lies in how performance is measured. Here I use RDTSC instead of Intel's Performance Counter Monitor library, which turned out to add high overhead relative to the actual time taken by the measured functions. As an added bonus, I'm uploading a small VS project for anyone interested in trying it out.
Goal

Multiply a batch of Vector3f’s with the same 4×4 matrix.
- ‘src’ and ‘dst’ arrays shouldn’t point to the same memory location
- All pointers should be 16-byte aligned (see below for details on array sizes)
- Treat Vector3f’s as positions (w = 1.0)
- Matrix is column-major
Structs and helper functions
struct _Vector3f
{
    float x, y, z;
};

struct _Vector4f
{
    float x, y, z, w;
};
_Vector3f* AllocateSrcArray(unsigned int numVertices)
{
    // We are loading 3 xmm regs per loop. So we need numLoops * 12 floats, where
    // numLoops is (numVertices / 4) + 1 for the SSE version
    unsigned int memSize = sizeof(float) * ((numVertices >> 2) + 1) * 12;
    return (_Vector3f*)_aligned_malloc(memSize, 16);
}
_Vector4f* AllocateDstArray(unsigned int numVertices)
{
    unsigned int memSize = sizeof(_Vector4f) * (((numVertices >> 2) + 1) << 2);
    return (_Vector4f*)_aligned_malloc(memSize, 16);
}
As you can see, we are wasting a bit of memory in order to avoid dealing with special cases in the SSE version. The C version isn’t affected by the extra padding, so there is no overhead. The worst-case scenario (numVertices = 4 * n + 1) is to allocate an extra 48 bytes for the ‘dst’ array and an extra 4 bytes for the ‘src’ array (total = 52 bytes). Nothing extraordinary when dealing with large batches of vertices. Code-side, the worst case is to perform 3 extra stores to the ‘dst’ array and 3 extra loads from the ‘src’ array.
Also note that we don't impose any restrictions on the matrix values, so the result of each transformation should be a 4-element vector.
There is nothing special about this code. Straight Vec3/Matrix4x4 multiply. Plain-old FPU code generated by the compiler.
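For reference, a minimal sketch of such a scalar version (the struct layout matches the helpers above, but the function name and exact loop shape are my assumptions, not the post's original listing):

```c
typedef struct { float x, y, z; } _Vector3f;
typedef struct { float x, y, z, w; } _Vector4f;

/* Sketch of a plain C version: column-major 4x4 matrix, w treated as 1.0,
 * so column 3 (mat[12..15]) is simply added in as the translation part. */
void TransformVec3Batch_C(_Vector4f* dst, const _Vector3f* src,
                          const float* mat, unsigned int numVertices)
{
    for (unsigned int i = 0; i < numVertices; ++i) {
        const float x = src[i].x;
        const float y = src[i].y;
        const float z = src[i].z;
        /* Column j of the matrix starts at mat[j * 4]. */
        dst[i].x = mat[0] * x + mat[4] * y + mat[8]  * z + mat[12];
        dst[i].y = mat[1] * x + mat[5] * y + mat[9]  * z + mat[13];
        dst[i].z = mat[2] * x + mat[6] * y + mat[10] * z + mat[14];
        dst[i].w = mat[3] * x + mat[7] * y + mat[11] * z + mat[15];
    }
}
```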
Note: Turning on the /arch:SSE compiler option doesn't seem to produce any SSE code for the above function; the compiler insisted on using the FPU for all the calculations. Using the /arch:SSE2 compiler option ended up producing a lot of SSE2 double-to-float and float-to-double conversions, which in turn made things worse performance-wise.
The code below processes 4 vertices at a time.
4 vertices * 3 floats/vertex = 12 floats = 3 loads of 4 floats per loop:
First load = (x0, y0, z0, x1)
Second load = (y1, z1, x2, y2)
Third load = (z2, x3, y3, z3)
We don’t keep the matrix in xmm regs in order to avoid intermediate stores (profiling says this is good). What we do is try to keep the matrix loads to a minimum by reusing each column as much as possible (see comment in code). Hopefully it should be in cache most of the time.
Written in asm because I couldn’t find a way to force the compiler to generate the code below using intrinsics. The compiler insisted on using the stack for intermediate stores (matrix columns).
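As a rough illustration of the shuffle scheme, here is an intrinsics sketch of the same 4-vertices-per-loop idea. It is not the asm from the post: the function name is mine, and, unlike the asm described above, this sketch keeps all four matrix columns in xmm registers for clarity. It assumes the padded array layout produced by the allocation helpers.

```c
#include <xmmintrin.h>

/* Illustrative sketch only. Processes 4 vertices per loop from the packed
 * (x, y, z) stream, using _mm_shuffle_ps to broadcast each component and
 * multiply it against the column-major matrix columns. 'src' must hold
 * numLoops * 12 floats and 'dst' numLoops * 16 floats; all pointers must
 * be 16-byte aligned. */
void TransformVec3Batch_SSE(float* dst, const float* src,
                            const float* mat, unsigned int numVertices)
{
    const __m128 c0 = _mm_load_ps(mat);      /* column 0 */
    const __m128 c1 = _mm_load_ps(mat + 4);  /* column 1 */
    const __m128 c2 = _mm_load_ps(mat + 8);  /* column 2 */
    const __m128 c3 = _mm_load_ps(mat + 12); /* column 3 (translation) */
    const unsigned int numLoops = (numVertices >> 2) + 1;

    for (unsigned int i = 0; i < numLoops; ++i) {
        __m128 r0 = _mm_load_ps(src);     /* x0 y0 z0 x1 */
        __m128 r1 = _mm_load_ps(src + 4); /* y1 z1 x2 y2 */
        __m128 r2 = _mm_load_ps(src + 8); /* z2 x3 y3 z3 */
        __m128 v;

        /* vertex 0: dst = x0*c0 + y0*c1 + z0*c2 + c3 (w = 1.0) */
        v = _mm_mul_ps(_mm_shuffle_ps(r0, r0, _MM_SHUFFLE(0, 0, 0, 0)), c0);
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r0, r0, _MM_SHUFFLE(1, 1, 1, 1)), c1));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r0, r0, _MM_SHUFFLE(2, 2, 2, 2)), c2));
        _mm_store_ps(dst, _mm_add_ps(v, c3));

        /* vertex 1: x1 comes from r0, y1/z1 from r1 */
        v = _mm_mul_ps(_mm_shuffle_ps(r0, r0, _MM_SHUFFLE(3, 3, 3, 3)), c0);
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r1, r1, _MM_SHUFFLE(0, 0, 0, 0)), c1));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r1, r1, _MM_SHUFFLE(1, 1, 1, 1)), c2));
        _mm_store_ps(dst + 4, _mm_add_ps(v, c3));

        /* vertex 2: x2/y2 come from r1, z2 from r2 */
        v = _mm_mul_ps(_mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 2, 2, 2)), c0);
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r1, r1, _MM_SHUFFLE(3, 3, 3, 3)), c1));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r2, r2, _MM_SHUFFLE(0, 0, 0, 0)), c2));
        _mm_store_ps(dst + 8, _mm_add_ps(v, c3));

        /* vertex 3: all components come from r2 */
        v = _mm_mul_ps(_mm_shuffle_ps(r2, r2, _MM_SHUFFLE(1, 1, 1, 1)), c0);
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r2, r2, _MM_SHUFFLE(2, 2, 2, 2)), c1));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_shuffle_ps(r2, r2, _MM_SHUFFLE(3, 3, 3, 3)), c2));
        _mm_store_ps(dst + 12, _mm_add_ps(v, c3));

        src += 12;
        dst += 16;
    }
}
```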
In order to compare the functions above, we execute 100,000 iterations for each batch size and calculate the clock cycles taken for each one of them. The results are then sorted and the middle 50,000 iterations are used to calculate the average and standard deviation. All values are in cycles/vertex
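The trimming step can be sketched like this (the helper names are mine, not from the post; on MSVC the raw per-iteration counts would come from the __rdtsc() intrinsic in <intrin.h>):

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over 64-bit cycle counts. */
static int CompareU64(const void* a, const void* b)
{
    const uint64_t x = *(const uint64_t*)a;
    const uint64_t y = *(const uint64_t*)b;
    return (x > y) - (x < y);
}

/* Sort the per-iteration cycle counts and average only the middle half,
 * discarding the fastest and slowest quarters (outliers caused by
 * interrupts, context switches, cache warm-up, etc.). Each sample would
 * be a __rdtsc() delta taken around one call of the measured function. */
double TrimmedMeanCycles(uint64_t* samples, size_t n)
{
    qsort(samples, n, sizeof(uint64_t), CompareU64);
    const size_t lo = n / 4;
    const size_t hi = n - n / 4;
    double sum = 0.0;
    for (size_t i = lo; i < hi; ++i)
        sum += (double)samples[i];
    return sum / (double)(hi - lo);
}
```

Dividing the result by the batch's vertex count gives a cycles/vertex figure.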
Table: Comparison between the two methods and speedup (values are averages between several independent runs)
All timings have been measured using RDTSC. All tests have been executed on a Core i7 740QM using the Microsoft Visual C++ 2008 compiler. The process’ and thread’s affinity has been set to 0x01 (the thread runs on the 1st core only) and the thread’s priority has been set to highest.
If you happen to test the code above, please share your findings. Corrections are always welcome.
I'm a freelance programmer, currently working on Android, web and desktop apps. I find low-level programming and optimizations in general very interesting subjects, so I try to spend most of my free time on them.