The overall design of my pipeline was to have each major function have its own 'processor' object, with abstracted memory buffers sitting between them. This works well, and is in theory a good fit for parallelizing the work on a multiprocessor system.
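To make that layout concrete, here is a minimal sketch of one processor reading from an input buffer and writing to an output buffer. All of the names (`StageBuffer`, `Vertex`, `ScaleStage`) are hypothetical and chosen for illustration; they are not the renderer's actual classes, and the uniform scale just stands in for whatever real work a stage does.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical buffer abstraction that sits between two processors.
template <typename T>
class StageBuffer {
public:
    void push(const T& item) { items_.push_back(item); }
    std::size_t size() const { return items_.size(); }
    // Range-checked access: std::vector::at throws std::out_of_range on a bad index.
    const T& at(std::size_t i) const { return items_.at(i); }

private:
    std::vector<T> items_;
};

struct Vertex {
    float x, y, z;
};

// Hypothetical processor: one object per major pipeline function.
// It only ever touches its input and output buffers.
class ScaleStage {
public:
    ScaleStage(const StageBuffer<Vertex>& in, StageBuffer<Vertex>& out, float scale)
        : in_(in), out_(out), scale_(scale) {}

    void run() {
        for (std::size_t i = 0; i < in_.size(); ++i) {
            Vertex v = in_.at(i);  // every read goes through the abstraction
            v.x *= scale_;
            v.y *= scale_;
            v.z *= scale_;
            out_.push(v);          // every write does too
        }
    }

private:
    const StageBuffer<Vertex>& in_;
    StageBuffer<Vertex>& out_;
    float scale_;
};
```

Because each stage only sees its two buffers, stages could in principle run on separate cores, but every element pays for an accessor call on the way in and on the way out.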
However, all of the abstracted memory access calls between processors really add up. My first step was to make the access calls themselves a bit speedier by removing redundant range checking, which gave a significant speedup of about 10%! The next step was to combine the rasterizer, depth tester, and pixel shader into one processor. This removed a lot of the unnecessary memory access calls and gave a large increase in overall speed as well. I'll make a more complete post on this later on, but I also plan on unifying all of the geometry-side processors (vertex shader, back face culling, and polygon clipping). I don't expect as much of an increase right now, but when I start rendering more geometry it should keep these memory issues out of the way.
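The two changes can be sketched roughly as follows. This is an illustration under assumptions, not the renderer's real code: `PixelBuffer`, `Fragment`, and `depthTestAndShade` are hypothetical names, the "shader" just writes a precomputed color, and the depth and color buffers are assumed to have the same dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical pixel buffer exposing a checked and an unchecked access path.
template <typename T>
class PixelBuffer {
public:
    PixelBuffer(std::size_t width, std::size_t height, T fill)
        : width_(width), height_(height), data_(width * height, fill) {}

    // Checked access: validates the coordinates on every single call.
    T& at(std::size_t x, std::size_t y) {
        if (x >= width_ || y >= height_) throw std::out_of_range("pixel out of range");
        return data_[y * width_ + x];
    }

    // Unchecked access: the caller guarantees the coordinates are valid.
    // The assert still catches mistakes in debug builds and is compiled
    // out when NDEBUG is defined.
    T& unchecked(std::size_t x, std::size_t y) {
        assert(x < width_ && y < height_);
        return data_[y * width_ + x];
    }

    std::size_t width() const { return width_; }
    std::size_t height() const { return height_; }

private:
    std::size_t width_, height_;
    std::vector<T> data_;
};

struct Fragment {
    std::size_t x, y;
    float depth;
    std::uint32_t color;
};

// Fused depth test and color write: one pass over the fragments with no
// intermediate buffer between the stages. Returns how many fragments passed.
// Assumes `depth` and `color` have identical dimensions.
inline std::size_t depthTestAndShade(const std::vector<Fragment>& fragments,
                                     PixelBuffer<float>& depth,
                                     PixelBuffer<std::uint32_t>& color) {
    std::size_t written = 0;
    for (const Fragment& f : fragments) {
        // Validate once here, so the accesses below can skip the check.
        if (f.x >= depth.width() || f.y >= depth.height()) continue;
        float& stored = depth.unchecked(f.x, f.y);
        if (f.depth < stored) {  // smaller depth is closer to the viewer
            stored = f.depth;
            color.unchecked(f.x, f.y) = f.color;
            ++written;
        }
    }
    return written;
}
```

The range check happens once per fragment instead of once per buffer access, and the fragment never has to be written out and read back between separate depth-test and shading processors.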
I still have several other things to try out, and have found a few tricks that I'll likely write about in future entries. Some of them are specific to my renderer, but others should be of use in general. Memory caches are your friends!
Promit's Journal
If you haven't been reading it yet, Promit's journal has been covering his time working at nVidia on the new graphics drivers for Vista. It is definitely worth checking out and keeping tabs on. Congratulations to him for getting inside the golden gates!