As I've mentioned before, I've been working on a highly threaded particle system (not of late, but, you know, it's still in the pipeline, as you'll see in a moment). This has got me thinking about threading in general and how to make optimal use of the CPU.

Originally my particle system was going to use Intel's Threading Building Blocks. However, as I want to release the code, most likely under the zlib license, the 'GPL with runtime exception' license TBB is under finally freaked me out enough that I've decided to drop it in favour of Microsoft's new Concurrency Runtime, which currently ships with the VS2010 beta.

One thing the Concurrency Runtime lets you do is set up a scheduler which controls how many threads are working at any given time: whether the count matches the hardware threads, thread priority, over-subscription and so on are all options which can be set, which grants you much more control over how the threads are used when compared to TBB.
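I haven't shipped anything on the CR yet, so take this as a portable sketch of the idea rather than real ConcRT code (the real thing goes through Concurrency::SchedulerPolicy): the decision being made — "how many threads, and do we over-subscribe?" — looks something like this. The names `worker_count` and `run_workers` are mine, for illustration only.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Pick a worker count matched to the hardware, the kind of decision a
// scheduler policy lets you make declaratively. 'oversubscription' > 1
// deliberately creates more threads than hardware threads.
unsigned worker_count(unsigned oversubscription = 1)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;  // hardware_concurrency may legitimately return 0
    return hw * oversubscription;
}

// Run a trivial job on that many threads; returns how many actually ran.
unsigned run_workers(unsigned count)
{
    std::atomic<unsigned> ran{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < count; ++i)
        pool.emplace_back([&ran] { ++ran; });
    for (auto& t : pool) t.join();
    return ran.load();
}
```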

Looking at this got me thinking about how to use threads in a game and, more importantly, how tasks can be mapped to them.

If we consider the average single-threaded single-player game, then the loop looks somewhat like this:

update world -> render

There might be variations on how/when the update happens, but it's basically a linear process.
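As a toy sketch (the `World`/`update`/`render` names are stand-ins, not anything from a real engine), the linear version is just:

```cpp
// A minimal single-threaded sketch: everything happens in sequence,
// so render always sees a fully updated world.
struct World { int frame = 0; };

void update(World& w) { ++w.frame; }             // stand-in for game logic
int  render(const World& w) { return w.frame; }  // stand-in for drawing

int run_frames(int n)
{
    World w;
    int last = 0;
    for (int i = 0; i < n; ++i) { update(w); last = render(w); }
    return last;
}
```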

When you enter the threaded world you can do something like this:

update \ sync
update ---> sync ---> render
update / sync /

Again, when and where the update/sync happens is a side point; the fact is that rendering pulls us back to a single thread. You could run the update/sync threads totally apart from the render thread, however that brings with it problems of scalability and synchronisation.

If you have 4 cores and you spawn 4 threads, one for each update plus a render thread, and run them all at once, then you need to sync between them, which will involve a lock of some sort on the world. Scalability also becomes a concern, more so if you assign each thread a fixed task to carry out, as when you throw more cores at it they will go unused.
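A sketch of the update -> sync -> render shape above, under the assumption that the world can be split into independent slices: the join is the sync point, and it guarantees no update thread is still writing when the (single-threaded) render reads.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// N threads each update a slice of the world; joining them is the 'sync'
// step, after which a single thread is free to render.
std::vector<int> parallel_update(std::vector<int> world, unsigned threads)
{
    std::vector<std::thread> pool;
    std::size_t chunk = (world.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(world.size(), begin + chunk);
        pool.emplace_back([&world, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                ++world[i];  // each thread touches only its own slice
        });
    }
    for (auto& t : pool) t.join();  // the sync point before rendering
    return world;
}

int render_sum(const std::vector<int>& world)  // single-threaded 'render'
{
    return std::accumulate(world.begin(), world.end(), 0);
}
```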

You could still use a task-based system; however, a key thing is that you might not be rendering all the time. You could use those 3 threads to update/sync based on tasks, but for some of the time the rendering thread will go idle, which is time you might be able to use.

For example, assuming your game can render/update at 60fps (roughly 16.7ms a frame), your rendering might only take 4ms of that, which means that for ~12ms a frame a core could very well be idle and not doing useful work.
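Spelling the arithmetic out, since it's the whole motivation for what follows:

```cpp
// Frame-budget arithmetic: at 60fps a frame is 1000/60 ~= 16.7ms;
// whatever rendering doesn't use is potential idle time on that core.
constexpr double frame_ms(double fps) { return 1000.0 / fps; }

constexpr double idle_ms(double fps, double render_ms)
{
    return frame_ms(fps) - render_ms;
}
```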

This is where over-subscription comes into play: creating more threads than we have hardware threads to run them on.

In a way, if you have a task-based system which uses all the cores and you use something like FMOD, then you'll already be doing this, as FMOD will create at least one thread in the background, and other audio APIs do the same.

The key thought behind this is that a device, in D3D (and OpenGL) terms, is only ever owned by one thread, so unless you can force rendering tasks onto the same thread every time, issues start to come up. You might be able to grab the device on a thread and release it again; however, even if this is possible it would probably cause bad voodoo. For this reason you are pretty much stuck with one thread to render from.

As you are stuck with a thread anyway, why not create one specifically for the task of rendering? You could feed it work in the form of per-frame rendering data and let it do its thing while you get on and update the next frame of the game.
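One way to "feed it work" is a plain blocking queue between the game thread and the dedicated render thread. This is a minimal sketch of that handoff; the names (`FrameData`, `RenderQueue`) are made up for illustration, and a real version would carry actual per-frame rendering data.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct FrameData { int frame_number; };  // stand-in for per-frame data

// The game thread pushes snapshots; the render thread blocks on pop().
class RenderQueue {
public:
    void push(FrameData f) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(f); }
        cv_.notify_one();
    }
    FrameData pop() {  // blocks until a frame arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        FrameData f = q_.front(); q_.pop();
        return f;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<FrameData> q_;
};

// Drive n frames through a dedicated render thread; returns frames 'drawn'.
int run(int n)
{
    RenderQueue q;
    int drawn = 0;
    std::thread render([&] {
        for (int i = 0; i < n; ++i) { q.pop(); ++drawn; }
    });
    for (int i = 0; i < n; ++i)
        q.push(FrameData{i});  // game thread feeds it while updating ahead
    render.join();
    return drawn;
}
```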

However, this could impact your performance, as you'd have more threads looking for resources to run on than you have hardware to run them on. So the question becomes: would it be better to lose Xms of idle time, or would the fighting over cores cost you less in the long run?

The matter of cache also comes up; however, the guys who worked on the Concurrency Runtime bring up an important point: during your thread's life you are more than likely to be pre-empted anyway, at which point, if you have affinity masks set, you'll stall until the OS has freed that core, or you bounce cores and lose your cache. Chances are, even if you stick around and cost yourself time, your cache is going to be messed with anyway, so it might not be worth the hassle. (The Concurrency Runtime will bounce threads between cores as needed to keep things busy for this reason.)

The advent of D3D11 also makes this more practical, as you can set things up as follows:

update \ sync \ pre-render
update ---> sync ---> pre-render ---> next frame
update / sync / pre-render /

----- render ------------------------>

In this case the pre-render stage can use tasks and deferred contexts to create the data the render thread will ultimately punt down to the GPU. This could also improve frame rate, as it allows more object setup and maybe more optimal data to be passed to the GPU.
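I can't show real deferred contexts in a portable snippet, but the shape of the idea translates: worker threads record their own 'command lists' in parallel, then the render thread replays them in a fixed order on the single immediate context. Here a command list is just a vector of ints standing in for recorded draw calls.

```cpp
#include <thread>
#include <vector>

// Portable stand-in for D3D11 deferred contexts: each worker records into
// its own list with no sharing, so no locks are needed during recording.
using CommandList = std::vector<int>;

std::vector<CommandList> record_parallel(int workers, int cmds_per_list)
{
    std::vector<CommandList> lists(workers);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&lists, w, cmds_per_list] {
            for (int c = 0; c < cmds_per_list; ++c)
                lists[w].push_back(w * 100 + c);  // record a fake command
        });
    for (auto& t : pool) t.join();
    return lists;
}

// The render thread replays every list in deterministic submission order,
// like ExecuteCommandList on the immediate context; returns commands run.
int execute_in_order(const std::vector<CommandList>& lists)
{
    int executed = 0;
    for (const auto& l : lists)
        executed += static_cast<int>(l.size());
    return executed;
}
```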

There remain the matters of syncing the data to be rendered and of what happens if you throw a fixed time step into the mix (although this is most likely solved by having the pre-render step run every loop regardless of update status and having it deal with interpolation); however, the idea seems workable to me.
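The interpolation trick hinted at there is the classic fixed-timestep one: the pre-render step runs every loop and blends the last two update states by how far we are into the current fixed step. A minimal sketch:

```cpp
// Fixed-timestep interpolation: update runs at a fixed step, pre-render
// blends the previous and current states so rendering stays smooth even
// when the render rate and update rate don't line up.
struct State { double x; };

State interpolate(State prev, State curr, double alpha)
{
    return State{ prev.x + (curr.x - prev.x) * alpha };
}

// alpha = time accumulated since the last update / fixed step length,
// so it sits in [0, 1) between updates.
double alpha_for(double accumulator_ms, double step_ms)
{
    return accumulator_ms / step_ms;
}
```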

If anyone can see any serious flaws in this idea, feel free to comment on them. I probably won't get around to it for a few months as it stands, as I've a few things to do (not least of all the particle system [wink]), but it's certainly an idea I'd like to try out.
1 Comment

Recommended Comments

I'm not a multithreading expert... I just developed a small "engine" with two threads (one updating the scene and one for rendering).

Your reasoning looks fine to me, but my problem is always about the sync part.

The meaning of sync is often obscure to me. While it is reasonable to think there is a thread (or N threads) updating data and another thread (or N threads) taking advantage of that data to perform operations, IMHO the literal concept of synchronization is the root of all evil.

By adding sync points we are introducing bottlenecks.

I'm going further into radical thinking: a sync point is an underperforming fix for something which was broken from the beginning!

What I consider broken is the idea of falling back on the classic concurrency scenarios to feel "safer". We learned those (dining philosophers, sleeping barbers, bridges, bears & bees, etc.) and we like to think they work in real life. Actually, they don't.

They don't work because by introducing synchronization, we imply there is a reason to do it.

But is there a reason to have sync points? IMHO, no.

There are two cases where sync points are mandatory:
- concurrent access to resources
- strict execution order

As for the first, I can't see why the renderer should write to the same data structures handled by the update thread. On the other hand, I can't see why the update process should care about reading data back, except for commands like "move 1m forward" (probably it's going to be "apply a force with direction (0,0,1)").

My point is that it's true an object/mesh is a resource, but we need something more to properly handle/render it: we need attributes/properties/parameters. What changes is the attribute set, not the object itself.

The idea is to work on attributes (position, orientation, color, shader params, etc.), not on resources.

What would happen if we copied such attributes?

- one shared object/mesh
- one copy for the update process (we can also support "move forward")
- one copy for the rendering process (we can do tricks like interpolation, as you said in your post)

The problem is only about copying attributes from the update process to the rendering process. We can't change attributes while we are reading them for rendering.

I refuse to believe we need a sync point to copy an array. There are surely many solutions around for independently updating an array (in my system I send a set of "commands"). We can use multiple buffers, or send commands tagged with a scene-update frame number. We can do a lot of things to avoid concurrent access to the same dataset.
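One concrete way to do the commenter's "send commands without a lock" is a single-producer/single-consumer ring buffer: the update thread pushes, the render thread pops, and the only shared state is two atomic counters. This is a minimal sketch of that well-known pattern, not the commenter's actual code; the capacity is assumed to be a power of two.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer for passing 'commands' from the update thread
// to the render thread without locks. Exactly one producer thread may call
// push() and one consumer thread may call pop().
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {  // update thread only
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;  // full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {       // render thread only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;      // empty
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

The counters increase monotonically and are masked into the array, which is why the power-of-two capacity matters.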

As for strict execution order, in our case it is not actually mandatory. We might need a task graph inside the scene-update thread, but there is no reason to strictly execute "update -> render", as long as they don't work on the same data.

I will not have time to improve my "engine" in the next few months. I expect extending it with multiple renderers (and multiple update threads) should be trivial, since I don't have a single lock or sync point. I'll probably need to add a couple of classes "gathering" updates, but such a lock-free pattern is easy to reproduce.

My suggestion (if you didn't know about it) is to read the following:
and also check Mike Acton's wonderful blog:
with links to crazy stuff like lock-free barbers:

I think those resources are great, because they show that another approach to multithreaded programming is possible.

Again, I'm not an expert and I'm sorry if all my reasoning is wrong...
