The bulk of the changes are going to be setting things up to be multi-threaded. Previously I had designed it as a 'one stage at a time' pipeline for running on a single-core machine (I don't own any dual-core machines at the moment). The output of one stage in the pipeline would be completely generated, then consumed by the next stage, and so on until everything was rendered.
This design doesn't really work with multithreaded stages, at least not the way that I envision it. The new setup is going to have each pipeline stage monitoring its input buffer to see if there is any new data to process. If so, it will mark the data as consumed and compute its own output for the subsequent stage. This should allow the entire pipeline to use smaller buffers between stages (depending on the functionality of each stage, of course) and also allow the system to scale with additional cores relatively easily.
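To make that concrete, here's a minimal sketch of what one of those inter-stage buffers could look like in C++. None of this is actual code from the project; the class and demo names (`StageQueue`, `run_pipeline_demo`) are just my illustration of a bounded queue where the downstream stage blocks until new data arrives and removes ("consumes") each item as it processes it:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A small bounded buffer sitting between two pipeline stages.
// The producer stage blocks when the buffer is full; the consumer
// stage blocks when it is empty, then removes the item it processes.
template <typename T>
class StageQueue {
public:
    explicit StageQueue(size_t cap) : cap_(cap) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < cap_ || closed_; });
        if (closed_) return;
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Returns std::nullopt once the upstream stage closes and the
    // buffer drains, so the consumer knows the stream is finished.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();  // mark the data as consumed
        not_full_.notify_one();
        return item;
    }

    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
        not_full_.notify_all();
    }

private:
    std::queue<T> q_;
    size_t cap_;
    bool closed_ = false;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

// Tiny two-stage demo: one thread stands in for a vertex shader
// producing values 0..9; another stands in for a rasterizer that
// consumes and sums them. The buffer between them holds only 4 items.
int run_pipeline_demo() {
    StageQueue<int> buffer(4);
    int sum = 0;
    std::thread consumer([&] {
        while (auto item = buffer.pop()) sum += *item;
    });
    std::thread producer([&] {
        for (int i = 0; i < 10; ++i) buffer.push(i);
        buffer.close();  // signal end of stream
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The small fixed capacity is the point: because each stage consumes as soon as data appears, the buffer never needs to hold a whole frame's worth of output the way the old stage-at-a-time design did.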
In fact, there is no reason why a particular stage couldn't spawn multiple threads to operate on the same buffer, as long as access to that buffer is properly synchronized. Then a brand new octa-core machine can still be fully utilized even if the pipeline is only 4 stages deep.
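A hedged sketch of that idea, with several worker threads draining one shared buffer (again, the names here are hypothetical, not from the project): an atomic index hands out work items, so no two threads ever claim the same item and the "proper resource sharing" needs no lock at all.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// One stage, many workers: each thread claims the next unprocessed
// index in the shared buffer with fetch_add. Summing the items stands
// in for whatever real work the stage would do (e.g. rasterizing tiles).
long process_buffer(const std::vector<int>& work, unsigned num_threads) {
    std::atomic<std::size_t> next{0};
    std::atomic<long> total{0};  // stand-in for "output produced"
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < work.size();
                 i = next.fetch_add(1)) {
                total.fetch_add(work[i]);  // "process" one item
            }
        });
    }
    for (auto& th : pool) th.join();
    return total.load();
}
```

Because the workers only coordinate through the atomic counter, adding more threads to a single stage is cheap, which is exactly what would let a deep-core-count machine stay busy on a shallow pipeline.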
It is kind of interesting that the ideas I come up with in designing a software renderer turn out to be similar to the ones the major GPU OEMs implement in hardware. Of course, they designed their systems about two years before I did, but I don't get paid to do this either ;)
As of right now, I have rudimentary vertex shader and rasterizer stages built. The rasterizer still needs some work, but it is good enough to get started. Next up will be a clipping stage to go between the VS and the rasterizer. After that, I think the pixel shader would be the next logical place to look, but we'll see how things stand then. More to come soon!