One small improvement so far: a TnL (transform and lighting) cache that holds the last few transformed vertices. It gave a small speedup in my test cases, but nothing really big. That's more or less what I expected, since I'm using flat-shaded models, where every face carries its own normal, so vertices are hardly ever shared and the cache rarely gets a hit. It did increase the speed a bit when I used models with non-flat-shaded normals (but these looked like shit).
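To make the idea concrete, here is a minimal sketch of such a cache, not my actual renderer code. It assumes indexed triangles and a tiny FIFO keyed by vertex index; `Vec3`, `Transformer`, `TnLCache` and `countTransforms` are made-up names for illustration, and the "transform" is just a stand-in that counts how often it gets called.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical vertex type; the real vertex layout will differ.
struct Vec3 { float x, y, z; };

// Stand-in for the expensive transform-and-light step.
// It counts its calls so cache hits and misses can be observed.
struct Transformer {
    int calls = 0;
    Vec3 operator()(const Vec3& v) {
        ++calls;
        return Vec3{v.x * 2.0f, v.y * 2.0f, v.z * 2.0f};
    }
};

// Tag value marking an unused cache slot (assumes no mesh uses this index).
constexpr std::uint32_t kEmptyTag = 0xFFFFFFFFu;

// FIFO cache of the last N transformed vertices, keyed by vertex index.
template <std::size_t N>
class TnLCache {
public:
    TnLCache() : next_(0) { tags_.fill(kEmptyTag); }

    // Returns the transformed vertex, running the transform only on a miss.
    template <typename F>
    Vec3 fetch(std::uint32_t index, const Vec3* vertices, F& transform) {
        for (std::size_t i = 0; i < N; ++i) {
            if (tags_[i] == index) return values_[i];  // hit: reuse result
        }
        const std::size_t slot = next_;                // miss: evict oldest
        next_ = (next_ + 1) % N;
        tags_[slot] = index;
        values_[slot] = transform(vertices[index]);
        return values_[slot];
    }

private:
    std::array<std::uint32_t, N> tags_;
    std::array<Vec3, N> values_;
    std::size_t next_;
};

// Runs an index list through the cache and reports how many real
// transforms were needed.
template <std::size_t N>
int countTransforms(const std::uint32_t* indices, std::size_t count,
                    const Vec3* vertices) {
    TnLCache<N> cache;
    Transformer transform;
    for (std::size_t i = 0; i < count; ++i) {
        cache.fetch(indices[i], vertices, transform);
    }
    return transform.calls;
}
```

With a quad drawn as two triangles sharing an edge (indices 0,1,2, 0,2,3) only four of the six fetches do real work. With flat-shaded data, where every corner is its own vertex (0,1,2, 3,4,5), all six miss, which matches the small gain described above.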
I'll be adding code to calculate normals on the fly, to see whether that gains me anything over the original version.
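For reference, computing a face normal on the fly boils down to a cross product of two triangle edges. The sketch below is only an illustration under a few assumptions: it uses floats rather than fixed point, assumes counter-clockwise winding, and `Vec3` and `faceNormal` are hypothetical names.

```cpp
#include <cmath>

// Hypothetical vertex type, using float for readability.
struct Vec3 { float x, y, z; };

// Unit normal of triangle (a, b, c), assuming counter-clockwise winding.
// Returns a zero vector for degenerate (zero-area) triangles.
inline Vec3 faceNormal(const Vec3& a, const Vec3& b, const Vec3& c) {
    // Two edges sharing vertex a.
    const Vec3 e1{b.x - a.x, b.y - a.y, b.z - a.z};
    const Vec3 e2{c.x - a.x, c.y - a.y, c.z - a.z};

    // Cross product e1 x e2 is perpendicular to the face.
    Vec3 n{e1.y * e2.z - e1.z * e2.y,
           e1.z * e2.x - e1.x * e2.z,
           e1.x * e2.y - e1.y * e2.x};

    const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    if (len == 0.0f) return Vec3{0.0f, 0.0f, 0.0f};

    const float inv = 1.0f / len;
    n.x *= inv;
    n.y *= inv;
    n.z *= inv;
    return n;
}
```

The square root and divide per face are the costly part, so whether this beats storing the normals is exactly the thing that needs measuring.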
Still need to look at the matrix stuff, so I'm slowly learning some ASM.
While doing that, I ran a quick test to see whether the generated ASM differs between functions written with my fixed point class (with operator overloads and whatnot) and functions written with my fixed point typedef + helper functions. It doesn't! So I'll now happily convert everything to use the class, since it makes the code much clearer.
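To show what the two styles look like side by side, here is a small sketch that assumes a 16.16 format. The names (`fixed_t`, `fx_mul`, `Fixed`, and so on) are invented for the example and are not my real code. Whether a compiler emits identical code for both depends on optimization settings and inlining, so treat the comparison as something to check in your own disassembly.

```cpp
#include <cstdint>

// Style 1: typedef + helper functions (assumed 16.16 fixed point).
typedef std::int32_t fixed_t;

inline fixed_t fx_from_int(int v) { return static_cast<fixed_t>(v) * 65536; }
inline fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }
inline fixed_t fx_mul(fixed_t a, fixed_t b) {
    // Widen to 64 bits so the intermediate product does not overflow.
    // Right-shifting a negative value is arithmetic on common compilers.
    return static_cast<fixed_t>((static_cast<std::int64_t>(a) * b) >> 16);
}

// Style 2: a thin class wrapping the same 32-bit value.
class Fixed {
public:
    Fixed() : raw_(0) {}
    static Fixed fromInt(int v) { return fromRaw(static_cast<std::int32_t>(v) * 65536); }
    static Fixed fromRaw(std::int32_t r) { Fixed f; f.raw_ = r; return f; }
    std::int32_t raw() const { return raw_; }

    friend Fixed operator+(Fixed a, Fixed b) { return fromRaw(a.raw_ + b.raw_); }
    friend Fixed operator*(Fixed a, Fixed b) {
        return fromRaw(static_cast<std::int32_t>(
            (static_cast<std::int64_t>(a.raw_) * b.raw_) >> 16));
    }

private:
    std::int32_t raw_;
};

// The same computation, a * b + c, written both ways.
inline fixed_t madd_helpers(fixed_t a, fixed_t b, fixed_t c) {
    return fx_add(fx_mul(a, b), c);
}
inline Fixed madd_class(Fixed a, Fixed b, Fixed c) {
    return a * b + c;
}
```

The class version reads like ordinary arithmetic, which is the whole appeal. Since the class holds nothing but one 32-bit integer and the operators are defined inline, there is no obvious reason for an optimizing compiler to treat it differently from the raw typedef.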
I'm starting to switch to a new test case which draws 4 cubes instead of one. The FPS dropped to 1/4 of what it was, which is mostly to be expected, but it was still a little disappointing, as I had hoped it would drop slightly less.