Floating Point Unit

It is truly amazing to me how differently the x87 floating point unit operates compared to the normal integer-based x86 side of the chip. The last couple of days I have been exploring how it works, playing around with different instructions and so on, and it turns out it has stack-based registers instead of random-access registers! I wonder what led to that design decision at Intel - there must have been a good reason for it at the time, but it sure seems odd now.
SSE Optimization

Anyway, I have started running some experiments using the SSE registers to squeeze more performance out of my matrix class. For those who aren't familiar, SSE provides eight 128-bit registers, each operating on four 32-bit floats at once (as opposed to the eight floating point registers in the FPU). That opens the door to parallelizing your data processing - a perfect fit for vector/matrix operations.
The first operation is a member function to transpose a matrix. It isn't immediately obvious how to optimize this with these additional registers, but after playing around with it for a while I started to make some progress. To test the performance difference, I create two matrices in a test program - one using the SSE version of the function and one using the original. I then perform the operation 10,000,000 times on one of them, taking a cycle count before and after the loop, and repeat the process for the other matrix and compare. That should be enough iterations to average out any OS activity or thread switching that occurs during the test. The processor's memory caching should also be effectively factored out of the test, since all of the data is created on the stack - both matrices reside in the L1 cache. (This won't be the case in real use of the matrices, but since both versions pay the same memory access penalty, it's fine to ignore it for now.)
On average, using std::swap to swap the required matrix entries (there are six swap operations) takes approximately 24 cycles, while my SSE version takes 20 cycles. That's a pretty good start - about 17% faster without much work. Of course, transposing isn't the most common matrix operation, so we should see much better performance gains with other functions. I'll post about those next time...