As I've mentioned before, I've been working on a highly threaded particle system (not of late, but you know, it's still in the pipeline, as you'll see in a moment). This has got me thinking about threading in general and about making optimal use of the CPU.
Originally my particle system was going to use Intel's Threading Building Blocks; however, as I want to release the code, most likely under zlib, the 'GPL with runtime exception' license TBB is under finally freaked me out enough that I've decided to drop it in favour of MS's new Concurrency Runtime, which is currently shipping with the VS2010 beta.
One thing the CR lets you do is set up a scheduler which controls how many threads are working on things at any given time. Whether the thread count matches the hardware threads, thread priority, oversubscription and so on are all options you can set, which grants you much more control over how the threads are used compared to TBB.
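To make the scheduler options a bit more concrete, here is a minimal, portable sketch of the kind of decision such a policy drives. It is not the CR API (the real thing exposes these knobs through its SchedulerPolicy keys such as MinConcurrency, MaxConcurrency and TargetOversubscriptionFactor); the struct, its fields and resolveThreadCount are hypothetical names invented for illustration, and the clamping rules are an assumption rather than how the CR actually resolves them.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical policy struct; illustrative only, not the Concurrency Runtime API.
struct SchedulerPolicySketch {
    unsigned minThreads = 1;             // never run fewer workers than this
    unsigned maxThreads = 0;             // upper cap on workers; 0 means "no cap"
    unsigned oversubscriptionFactor = 1; // software threads per hardware thread
};

// Work out how many worker threads the policy asks for on a machine
// with the given number of hardware threads.
unsigned resolveThreadCount(const SchedulerPolicySketch& policy, unsigned hardwareThreads) {
    assert(policy.oversubscriptionFactor >= 1);
    // hardware_concurrency() is allowed to return 0 when it cannot tell; treat that as 1.
    const unsigned hw = std::max(1u, hardwareThreads);
    unsigned wanted = hw * policy.oversubscriptionFactor;
    if (policy.maxThreads != 0) {
        wanted = std::min(wanted, policy.maxThreads);
    }
    return std::max(wanted, policy.minThreads);
}

// Convenience overload that asks the standard library for the hardware thread count.
unsigned resolveThreadCount(const SchedulerPolicySketch& policy) {
    return resolveThreadCount(policy, std::thread::hardware_concurrency());
}
```

With the defaults, a 4-thread machine gets 4 workers; setting oversubscriptionFactor to 2 asks for 8, and a maxThreads of 6 would then cap that at 6.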
Looking at this I got thinking about how to use threads in a game and more importantly how tasks can be applied to them.
If we consider the average single-threaded, single-player game then the loop looks somewhat like this:
update world -> render
There might be variations on how/when the update happens but it's basically a linear process.
When you enter the threaded world you can do something like this:
update \    sync \
update ---> sync ---> render
update /    sync /
Again, when and where the update/sync happens is a side point; the fact is that rendering again pulls us back to a single thread. You could run the update/sync threads totally apart from the render thread, however that brings with it problems of scalability and synchronisation.
If you have 4 cores and you spawn 4 threads, three for updates and one for rendering, and run them all at once then you need to sync between them, which will involve a lock of some sort on the world. Scalability also becomes a concern, more so if you assign each thread one fixed task to carry out, because when you throw more cores at it the extra cores will go unused.
You could still use a task-based system; however, a key thing is that you might not be rendering all the time. So you could use those 3 threads to update/sync based on tasks, but for some of the time the rendering thread will sit idle, which is time you might be able to use.
For example, assuming your game can render/update at 60fps, your rendering might only take 4ms, which means that for ~12ms of each ~16.7ms frame a core could very well be idle and not doing useful work.
This is where oversubscription comes into play: creating more threads than we have hardware threads to run them on.
In a way, if you build a task-based system which uses all the cores and you use something like FMOD then you'll already be doing this, as it will create at least one thread in the background, and other audio APIs do the same.
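The arithmetic behind that figure is simple enough to write down. A minimal sketch, where frameBudgetMs and renderIdleMs are names made up for this example and the 4ms render cost is just the assumed figure from above:

```cpp
#include <cassert>

// Milliseconds available per frame at a given target frame rate.
double frameBudgetMs(double targetFps) {
    assert(targetFps > 0.0);
    return 1000.0 / targetFps;
}

// Time per frame that the rendering thread's core has nothing to do,
// assuming rendering is the only work scheduled on it.
double renderIdleMs(double targetFps, double renderMs) {
    const double idle = frameBudgetMs(targetFps) - renderMs;
    return idle > 0.0 ? idle : 0.0; // an over-budget frame leaves no idle time
}
```

At 60fps the budget is about 16.7ms, so renderIdleMs(60.0, 4.0) comes out at roughly 12.7ms, which is where the ~12ms of idle time comes from.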
The key thought behind this is that a device in D3D (and OGL) terms is only ever owned by one thread, so unless you can force rendering tasks onto the same thread every time, issues start to come up. You might be able to grab the device on one thread and release it again for another; however, if this is even possible, it would probably cause bad voodoo. For this reason you are pretty much stuck with whichever thread you render from.
As you are stuck with a thread anyway then why not create one specifically for the task of rendering? You could feed it work in the form of per-frame rendering data and let it do its thing while you get on and update the next frame of the game.
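As a rough sketch of what feeding a dedicated render thread might look like, here is a small producer/consumer queue built on standard C++ threads rather than the CR. FrameData, RenderThread and everything inside them are hypothetical names for illustration; the "rendering" is just recording the frame number where a real renderer would issue its draw calls.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-frame payload handed from the update side to the renderer.
struct FrameData {
    int frameNumber = 0;
    std::vector<float> positions;
};

class RenderThread {
public:
    RenderThread() : worker_([this] { run(); }) {}
    ~RenderThread() { stop(); }

    // Called from the update side: queue a finished frame and wake the renderer.
    void submit(FrameData frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_); // submitting after stop() is a programming error
            pending_.push(std::move(frame));
        }
        wake_.notify_one();
    }

    // Lets the renderer drain whatever is still queued, then joins the thread.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

    // Only safe to read once stop() has returned.
    const std::vector<int>& renderedFrames() const { return rendered_; }

private:
    void run() {
        for (;;) {
            FrameData frame;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
                if (pending_.empty()) return; // stopping and nothing left to draw
                frame = std::move(pending_.front());
                pending_.pop();
            }
            // A real renderer would talk to the device here, always on this thread.
            rendered_.push_back(frame.frameNumber);
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<FrameData> pending_;
    bool stopping_ = false;
    std::vector<int> rendered_;
    std::thread worker_; // declared last so the members above exist before it starts
};
```

The update side calls submit() once per frame and carries straight on with the next update; the only lock is held for the length of a queue push or pop, not for the whole world.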
However, this would impact your performance as you'd have more threads looking for somewhere to run than you'd have hardware threads to run them on. So the question becomes: would it be better to lose Xms of idle core time each frame, or would the fighting over cores cost you less in the long run?
The matter of cache also comes up; however, the guys who worked on the CR bring up an important point: during your thread's life you are more than likely to be preempted anyway, at which point, if you have affinity masks set, you'll stall until the CPU has freed that core, and if you don't, you bounce cores and lose your cache. Chances are, however, that even if you stick around and cost yourself time, your cache is going to be messed with anyway, so it might not be worth the hassle. (The CR will bounce threads between cores as needed to keep things busy for this reason.)
The advent of D3D11 also makes this more practical as you can set things up as follows:
update \    sync \    pre-render \
update ---> sync ---> pre-render ---> next frame
update /    sync /    pre-render /
----- render ---------------------------------->
In this case the pre-render stage can use tasks and deferred contexts to create the data the render thread will ultimately punt down to the GPU. This could also improve framerate, as it allows more of the object setup to be done in parallel and possibly more optimal data to be passed to the GPU.
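The shape of that pre-render stage can be sketched without touching D3D at all. In D3D11 proper the workers would record into deferred contexts (ID3D11Device::CreateDeferredContext), finish them with FinishCommandList, and the render thread would play them back with ExecuteCommandList on the immediate context. The sketch below is only a stand-in for that pattern: CommandList is a plain vector of strings, and recordCommands, preRender and submitToGpu are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a recorded command list; a real one would hold GPU commands.
using CommandList = std::vector<std::string>;

// Record the commands for objects [first, last) into this worker's own list.
// No shared state is touched, so no locking is needed while recording.
void recordCommands(int first, int last, CommandList& out) {
    assert(first <= last);
    for (int object = first; object < last; ++object) {
        out.push_back("draw object " + std::to_string(object));
    }
}

// Pre-render stage: split the objects across workers, one command list each.
std::vector<CommandList> preRender(int objectCount, int workerCount) {
    assert(objectCount >= 0 && workerCount > 0);
    std::vector<CommandList> lists(static_cast<std::size_t>(workerCount));
    std::vector<std::thread> workers;
    const int perWorker = (objectCount + workerCount - 1) / workerCount;
    for (int w = 0; w < workerCount; ++w) {
        const int first = std::min(objectCount, w * perWorker);
        const int last = std::min(objectCount, first + perWorker);
        workers.emplace_back(recordCommands, first, last,
                             std::ref(lists[static_cast<std::size_t>(w)]));
    }
    for (std::thread& t : workers) t.join();
    return lists;
}

// Render stage: play the lists back, in order, on the one thread that owns the device.
std::vector<std::string> submitToGpu(const std::vector<CommandList>& lists) {
    std::vector<std::string> submitted;
    for (const CommandList& list : lists) {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

Spawning fresh threads each frame is only done here to keep the example short; in practice the recording would be handed to whatever task scheduler is already running the update tasks.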
There remain the matters of syncing the data to be rendered and of what happens if you throw a fixed time step into the mix (although the latter is most likely solved by having the pre-render step run every loop regardless of update status and having it deal with interpolation); however, the idea seems workable to me.
If anyone can see any serious flaws in this idea feel free to comment on them. I probably won't get around to it for a few months as it stands, as I've a few things to do (not least of all the particle system [wink]), but it's certainly an idea I'd like to try out.
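For the fixed time step case, a minimal sketch of the usual accumulator approach might look like the following: the update runs zero or more fixed steps per loop, while the pre-render step runs every loop and blends between the last two simulation states. State, FixedStepLoop and the one-dimensional "position" are hypothetical and only there to keep the example small.

```cpp
#include <cassert>

// Hypothetical simulation state: one object moving along a line.
struct State {
    double position = 0.0;
    double velocity = 1.0; // units per second
};

struct FixedStepLoop {
    double step;              // fixed update interval in seconds
    double accumulator = 0.0; // frame time not yet consumed by updates
    State previous;           // state before the most recent update
    State current;            // state after the most recent update

    explicit FixedStepLoop(double stepSeconds) : step(stepSeconds) {
        assert(step > 0.0);
    }

    // Consume the elapsed frame time in fixed steps; returns how many updates ran.
    int advance(double frameSeconds) {
        accumulator += frameSeconds;
        int updates = 0;
        while (accumulator >= step) {
            previous = current;
            current.position += current.velocity * step;
            accumulator -= step;
            ++updates;
        }
        return updates;
    }

    // Pre-render step: runs every loop, even when no update happened,
    // and interpolates between the last two states.
    double interpolatedPosition() const {
        const double alpha = accumulator / step; // 0 = previous state, 1 = current state
        return previous.position + (current.position - previous.position) * alpha;
    }
};
```

Note that this renders up to one step behind the simulation, which is the usual trade-off for getting smooth motion out of a fixed update rate.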