Multithreading for Games

Started by
17 comments, last by InvalidPointer 12 years, 2 months ago

[quote name='Hodgman' timestamp='1328684518' post='4910803']
It doesn't matter how you transition a thread into a sleep state - once in that state, the kernel's scheduler won't come around and poke it back into a ready state until the next scheduling cycle, which defaults to 15ms on Windows IIRC.

Hmmm... I use Sleep(1) in my multicore support code. After reading more about this topic, it seems that you can use timeBeginPeriod/timeEndPeriod to increase the timer resolution. As far as I understand, this affects the whole OS (the OS will use the highest requested resolution). Therefore I would try to encapsulate the sleep like this:

timeBeginPeriod(1);
Sleep(1);
timeEndPeriod(1);

Is this a bad idea (too much overhead, slowing down the OS)?
[/quote]

Even if you change the timer resolution, you still have no guarantee that a thread will wake up on time (if you also boost the thread priority so it is higher than everything else on the system, then it most likely will). Also, only call timeBeginPeriod once, early in the application:

[quote](from msdn.microsoft.com)
Use caution when calling timeBeginPeriod, as frequent calls can significantly affect the system clock, system power usage, and the scheduler. If you call timeBeginPeriod, call it one time early in the application and be sure to call the timeEndPeriod function at the very end of the application.[/quote]
[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
[quote]timeBeginPeriod(1);
Sleep(1);
timeEndPeriod(1);
Is this a bad idea (too much overhead, slowing down the OS)?[/quote]
In my experience, it seems to be fairly common practice to call [font=courier new,courier,monospace]timeBeginPeriod[/font]/[font=courier new,courier,monospace]timeEndPeriod[/font] in your main function (i.e. applying the increased scheduling precision for the entire life of your program). N.B. you're also supposed to use [font=courier new,courier,monospace]timeGetDevCaps[/font] to find the system's minimum value, instead of using a hard-coded [font=courier new,courier,monospace]1[/font].
Also, despite the weak guarantees for how long a [font=courier new,courier,monospace]Sleep(1)[/font] will last on Windows, there aren't many other good options if you have to put a thread to sleep -- e.g. if you've got two threads sharing one core, you'll have to sleep them at some point (especially if their priorities are different, in which case one could be starved until it gets its priority boost after 3-5 seconds of starvation).
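As an alternative to timed sleeps, a thread can block on an event and be woken only when there is actually work to do. Below is a minimal portable sketch using std::condition_variable; the names WorkSignal and runOnce are invented for illustration (on Win32 you'd reach for the equivalent CreateEvent/WaitForSingleObject):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch: instead of polling with Sleep(1), the worker blocks on
// a condition variable and is woken exactly when work arrives. Wakeup latency
// is still up to the scheduler, but no timer-resolution hack is needed.
struct WorkSignal {
    std::mutex m;
    std::condition_variable cv;
    bool workReady = false;

    void post() {   // producer: publish work, then wake the worker
        { std::lock_guard<std::mutex> lock(m); workReady = true; }
        cv.notify_one();
    }
    void wait() {   // consumer: sleep until work is posted (predicate guards
        std::unique_lock<std::mutex> lock(m);   // against spurious wakeups)
        cv.wait(lock, [this] { return workReady; });
        workReady = false;
    }
};

// Tiny driver: the worker sleeps until signaled, then does one computation.
int runOnce(WorkSignal& sig, int value) {
    int result = 0;
    std::thread worker([&] { sig.wait(); result = value * 2; });
    sig.post();
    worker.join();
    return result;
}
```

The predicate inside cv.wait also handles the case where post() runs before the worker reaches wait(), so no signal is lost.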
There is a simple two-threaded approach that gives a large performance gain for very little work.

Use 1 thread for Drawing and 1 thread for Updating.

Because Update() can never communicate with the DirectX device,
you need two intermediate buffers per component (I call it DrawState):

var state = new DrawState[2];

Then you only need to synchronize the switch between buffers ... I use an array index:

lock { updateIndex = 0; drawIndex = 1; }
or
lock { updateIndex = 1; drawIndex = 0; }


and each Component accesses its DrawState like so:

Draw() { drawState = state[drawIndex]; }
Update() { drawState = state[updateIndex]; }


Once you have implemented this, it is trivial to add more threads on the Update() side.

To summarise:
Update() only writes to one DrawState object
Draw() only reads from the other DrawState object
Every frame you switch / flip / toggle between the DrawState objects
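The flip described above can be sketched in C++ (the snippets in this post look like C#, so this translation and the field names are my own assumptions):

```cpp
#include <cassert>
#include <utility>

// Hypothetical sketch of the double-buffered DrawState flip described above.
// Update() writes only state[updateIndex]; Draw() reads only state[drawIndex];
// flip() swaps the roles once per frame at the sync point.
struct DrawState { float x = 0, y = 0; };

struct DoubleBuffer {
    DrawState state[2];
    int updateIndex = 0;    // written by Update()
    int drawIndex  = 1;     // read by Draw()

    // Called once per frame while both threads are at the sync barrier,
    // so no further locking is needed inside the swap itself.
    void flip() { std::swap(updateIndex, drawIndex); }

    DrawState&       updateSide()       { return state[updateIndex]; }
    const DrawState& drawSide()   const { return state[drawIndex]; }
};
```

The key invariant is that the two threads never touch the same DrawState between flips, which is what makes the frame itself lock-free.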

I don't see the point in writing some games as single threaded and some as multithreaded when you can use this technique for all games.
Here's a really good article that answered a lot of questions for me. I also really suggest you use Intel Threading Building Blocks, since it implements a lot of the stuff you need, like a task manager and lock-free data structures.

Designing the framework of a parallel game engine

I'm still working on my engine but my idea so far is to have separate threads for the game logic and the rendering and audio. Then each thread spawns tasks each update. The rendering happens completely asynchronously from the game. I've never really been strong on the whole Model View Controller idea especially in games, but I ended up going with that kind of model naturally without even trying. The game thread has a controller that updates the entities, which are like the model. Then the renderer is like the view and has scene nodes that get updated by the controller.

Also PhysX 3 uses this kind of task based system which fits naturally into the multithreaded task processing architecture I'm going for.
I've been thinking about this a lot recently after watching the GDC presentation by the guys who wrote Civ V. As many have pointed out above, task-based systems are best because they are more scalable. I can kind of understand how you'd do it for a game like Civ, which seems to me to be very well tailored, from a processing point of view, to multi-threading. But when it comes to processing more complex structures, like the entities in an Octree, I keep finding blocks where the code will end up becoming serialised regardless of threading. That is, you break the tree up into little processing blocks and dispatch them to various threads, but at some point you have to complete the circle and update the structure to a newer version. As soon as you start locking various data structures in order to apply updates, you're losing the benefits of threading. Worse, you can find yourself losing cache coherence and make performance worse rather than better.

As far as Civ was concerned, the moral of the story is to profile the application and various ways of doing things. I'm not sure you can write scalability like that in a generic way. Rather, it has to be done on a case-by-case basis, perhaps with constructs like 'parallel for' and so on.
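A 'parallel for' over a flat range can be sketched with plain std::thread. This hypothetical parallelFor just splits the index range into contiguous chunks, one per thread, which is roughly what tbb::parallel_for or an OpenMP loop does underneath, minus work stealing and grain-size tuning:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Minimal 'parallel for' sketch: split [0, n) into one contiguous chunk per
// thread and run the body on each chunk concurrently. The body must only
// touch data owned by index i for this to be safe without locks.
void parallelFor(std::size_t n, std::size_t numThreads,
                 const std::function<void(std::size_t)>& body) {
    std::vector<std::thread> pool;
    std::size_t chunk = (n + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, n);
        if (begin >= end) break;
        pool.emplace_back([=, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& th : pool) th.join();
}
```

Spawning threads per call is wasteful in a real game loop; a task system keeps a persistent pool and feeds it chunks instead.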
You basically have all entities in the game update as separate tasks each game loop cycle. There should be no locking at all since they are all independent of each other. They tell a central object about their updates, which has a lockless queue on it to prevent too much slowdown on a shared resource.

When all entities have updated at the end of the game loop, the central object notifies all entities that care about those entities about the changes. It also updates the visual side representation of the objects saying this object is now at this position and so on...
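One way to sketch that central object, simplified to per-thread buffers rather than a true lockless queue (all names here are invented for illustration; a real engine would likely use an MPSC queue as the post describes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the 'central object' pattern above. Each worker appends to its
// own private buffer during the update phase, so there is no contention; the
// central object merges the buffers once all entity tasks have joined.
struct EntityUpdate { int entityId; float newX, newY; };

struct CentralObject {
    std::vector<std::vector<EntityUpdate>> perThread;

    explicit CentralObject(std::size_t numThreads) : perThread(numThreads) {}

    // Worker thread t calls this freely; buffers are per-thread, no locks.
    void record(std::size_t t, const EntityUpdate& u) {
        perThread[t].push_back(u);
    }

    // Called once after the game-loop join point; from here you notify
    // interested entities and push positions to the visual representation.
    std::vector<EntityUpdate> collect() {
        std::vector<EntityUpdate> all;
        for (auto& buf : perThread) {
            all.insert(all.end(), buf.begin(), buf.end());
            buf.clear();
        }
        return all;
    }
};
```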

[quote]But when it comes to processing more complex structures, like the entities in an Octree, I keep finding blocks where the code will end up becoming serialised regardless of threading[/quote]


You're Doing It Wrong™

Instead of shoehorning parallelism into an already explicitly serial design (trees and linked lists in general are inherently serial data structures; there's no effective way to traverse them in parallel without the threads stepping on each other's toes, and before you ask, skip lists are cheating/hacky), ask yourself whether there are alternative approaches that would instead scale well in a more threaded environment. As DICE found out when building their visibility system for Battlefield 3, the 'dumb' approach can certainly end up a lot faster than clinging to a conventional 'smart' solution simply because that's the way it's been done previously. Data is, ultimately, king.
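For a concrete flavor of the 'dumb' approach, here is a hypothetical brute-force cull over a flat array (a 1D bounds test stands in for a real frustum check, and all names are invented):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// 'Dumb' flat-array visibility: instead of walking a tree, test every object
// against the view in a tight loop. Each thread owns a fixed slice of the
// input and writes only its own region of the output, so no locking is needed.
std::vector<char> cullBruteForce(const std::vector<float>& xs,
                                 float viewMin, float viewMax,
                                 std::size_t numThreads) {
    std::vector<char> visible(xs.size(), 0);
    std::vector<std::thread> pool;
    std::size_t chunk = (xs.size() + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, xs.size());
        if (begin >= end) break;
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                visible[i] = (xs[i] >= viewMin && xs[i] <= viewMax);
        });
    }
    for (auto& th : pool) th.join();
    return visible;
}
```

The per-object test looks wasteful next to a tree's early-out, but the loop is branch-light, cache-friendly, and embarrassingly parallel, which is the DICE observation in miniature.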
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
I don't see a problem with traversing a tree in parallel. Updating it with new state is the issue of course. With "lock step" threading I can see how it would work. Jobs run, store their results, all results are applied to the tree at the end. I'd be sceptical about the resulting performance improvements of course. Anyway will check out the BF presentation later.

[quote]I don't see a problem with traversing a tree in parallel. Updating it with new state is the issue of course. With "lock step" threading I can see how it would work. Jobs run, store their results, all results are applied to the tree at the end. I'd be sceptical about the resulting performance improvements of course. Anyway will check out the BF presentation later.[/quote]


On the surface, yes, you're bang on the money. You can certainly have multiple threads traversing the 'physical' tree in memory without incident. Actually doing work on it, however, is a bit murkier. Consider the case of arrays for a moment. You can do a blanket segregation at the beginning, maybe partitioning the whole thing into n chunks (where n is at or around the number of physical processor cores on the target machine, for simplicity's sake) and then have all your threads go in guns blazing. If your computations are otherwise embarrassingly parallel, or you're doing some super scheduling cleverness, you can get away with no additional synchronization; threads can just do computation without ever needing to interact with one another.

You can handle array modifications in a similar way by providing some upper bound on the work each thread will do, segregating output slots ahead of time, then having all threads write their (potentially variable-width) output within those bounds. You can then coalesce everything in later (possibly also parallel) passes over the data. This is a fairly common approach in graphics (though not often directly seen) and it actually has a name: stream compaction. There's quite a bit of research dedicated to doing this effectively, and I'd suggest some Google-fu if it sounds interesting, as it's a very broad topic mostly outside the scope of discussion.
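A minimal stream-compaction sketch along those lines (filtering even numbers stands in for the real predicate, and the names are invented):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stream compaction in the shape described above: each thread filters its own
// chunk into a private buffer (the variable-width output), then a serial
// coalescing pass concatenates the buffers in chunk order. On a GPU the
// coalescing step would itself be a parallel prefix-sum/scatter.
std::vector<int> compactEvens(const std::vector<int>& in, std::size_t numThreads) {
    std::vector<std::vector<int>> partial(numThreads);
    std::vector<std::thread> pool;
    std::size_t chunk = (in.size() + numThreads - 1) / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, in.size());
        if (begin >= end) break;
        pool.emplace_back([&, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                if (in[i] % 2 == 0) partial[t].push_back(in[i]);  // predicate
        });
    }
    for (auto& th : pool) th.join();
    // Coalescing pass: chunk order preserves the original element order.
    std::vector<int> out;
    for (auto& p : partial) out.insert(out.end(), p.begin(), p.end());
    return out;
}
```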

So, bringing that massive (bi)tangent back to heel, we're back to the idea of work distribution for arrays vs. trees. This is somewhat more subtle, but the speed advantage of trees for searching comes directly from their hierarchical skipping of regions based on previous sorting. Since we're in a hierarchy, we don't know what region(s) we need to search next until we finish the current set of computations; this really, really kills the parallelism factor, since it introduces hard synchronization points, and frequently. You can task multiple queries through the structure and see some speedup there, but it's still the same sort of 'bolted-on' concurrency dead end as things like a 'physics/render thread.' It helps in edge cases, but it's very likely those edge cases won't be a limiting factor, or even present at all, for most of your processing time. As an example, parallel view queries won't help you if you only have one camera, but parallel queries in and of themselves will benefit one camera or 10,000 equally.
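'Tasking multiple queries through the structure' can be sketched like this. A binary search over a sorted array stands in for the tree (the names are invented); the point is that the structure is read-only during the query phase, so each query runs on its own thread with no locks:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One hierarchical-style search: like a tree descent, each step's comparison
// decides which region to visit next, which is why a single query is serial.
bool contains(const std::vector<int>& sorted, int key) {
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) return true;
        if (sorted[mid] < key) lo = mid + 1; else hi = mid;
    }
    return false;
}

// Concurrency comes from running many independent queries, not from
// parallelizing one traversal. One thread per query, for illustration only.
std::vector<char> queryAll(const std::vector<int>& sorted,
                           const std::vector<int>& keys) {
    std::vector<char> found(keys.size(), 0);
    std::vector<std::thread> pool;
    for (std::size_t q = 0; q < keys.size(); ++q)
        pool.emplace_back([&, q] { found[q] = contains(sorted, keys[q]); });
    for (auto& th : pool) th.join();
    return found;
}
```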

This topic is closed to new replies.
