Multi-threaded game

Started by
11 comments, last by I_Smell_Tuna 18 years, 5 months ago
Ok, I'm making a 3D video game and I want to put my various sub-systems on their own threads - for obvious performance reasons - while grouping dependent sub-systems together on a thread. I plan on doing this by double buffering the game data. To do this, each sub-system will have its own version of the game data that it updates during each loop iteration. When a sub-system's loop finishes, it will update the back buffer and swap the buffers. Does anybody see a problem with this method for multi-threading a video game? The only drawback I can think of is that it will consume more memory.
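A minimal sketch of the double-buffering idea described above (the class and member names are hypothetical, not from the post): readers see the front copy of a sub-system's data while the sub-system builds the next frame in the back copy, then swaps.

```cpp
#include <algorithm>

// Hypothetical sketch: each sub-system owns two copies of its slice of the
// game data. It mutates the back copy during a loop iteration, then swaps
// so other systems see a consistent snapshot.
template <typename State>
class DoubleBuffered {
public:
    const State& front() const { return *front_; } // readers see last frame
    State&       back()        { return *back_; }  // writer builds next frame
    void swap() { std::swap(front_, back_); }      // publish the new frame
private:
    State bufA_{}, bufB_{};
    State* front_ = &bufA_;
    State* back_  = &bufB_;
};
```

In a real engine the swap itself would still need a synchronization point so no reader is mid-traversal when the pointers flip.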
The performance reasons are not obvious - unless you are targeting a multiprocessor system. Are you?

The major problem is that the various sub-systems are usually very interdependent, but that doesn't mean it can't be done.
John Bolton, Locomotive Games (THQ). Current Project: Destroy All Humans (Wii). IN STORES NOW!
I'm not targeting multi-processor systems, I am targeting multi-core systems. Dual-core processors are already on the market, which means that an application needs to be multi-threaded to utilize both cores. The interdependencies are usually signaling of events, which can be worked around easily. And any good program is designed in modules anyway.
Quote:Original post by I_Smell_Tuna
I'm not targeting multi-processor systems, I am targeting multi-core systems.
Same thing, different name.

Quote:The interdependencies usually are signaling of events which you can get around easily.
Yeah, but will it provide you a speed boost? If things need to be synchronized, it stalls the game while the synchronization is worked out.

I think the double-buffer approach has some merit, if you're very selective about what is doubled and what isn't (memory is the big constraint here). The approach is very similar to processor pipelining, with many of the associated issues related to data hazards. Games do have the interesting property that frame N+1 does not really depend on frame N - if something occurs during the processing of frame N, the response can probably be delayed until frame N+2 without too much impact.

It's probably not surprising that many of the same techniques for increasing parallelism show up in different domains - the process of ordering instructions in a superscalar processor is similar to ordering transactions in a database, for example. Research in other areas has already been done, and determining what, if anything, can be applied to games is a good place to start.

However, expecting it to give you an immediate performance increase is naïve - such a system will have to be very carefully designed to minimize data hazards and synchronization stalls. It's not impossible, and it is something I'm planning on researching further when I get some free time, but I'm expecting it to be very difficult to make both correct and efficient.
Multi-processor and multi-core systems are not the same by far. Anyway, back to the topic. I mentioned that sub-systems that are dependent on each other would be grouped together and put on the same thread, thus eliminating the need to synchronize them. Each sub-system could also have a public event queue that it checks before each loop for sub-system syncing.
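A sketch of what such a "public event queue" could look like (all names are hypothetical): other threads post events at any time; the owning sub-system drains the queue once at the top of each loop iteration, so the lock is only held briefly.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical per-sub-system event queue. Posting is cheap; the owning
// thread grabs the whole pending batch in one locked operation before
// each update loop.
template <typename Event>
class EventQueue {
public:
    void post(Event e) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(e));
    }
    // Called by the owning thread before each update iteration;
    // returns all pending events and leaves the queue empty.
    std::deque<Event> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::exchange(pending_, {});
    }
private:
    std::mutex mutex_;
    std::deque<Event> pending_;
};
```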

Thanks for the feedback.
The current situation on the PC side (with the Athlon X2 and Pentium D) is basically the same as a multiprocessor (at least from a software perspective). The AMD has more RAM bandwidth with NUMA on a multi-proc system, and quicker cache coherency on a dual core. The Pentium D doesn't have either of those advantages, so it's dang near the same (well, Windows XP Home will take advantage of dual core, but not dual proc). CELL is the only processor I'm aware of that currently has asymmetric cores. You could have asymmetric processors too, but I haven't seen anything designed for that (unless you want to count a GPU as another proc).

But as for your design, it depends on how dependent your sub-systems are on each other. For example, if they're so dependent that you only have 2 threads where one does 90% of the work, you only gain 11% more speed of your possible 100% (on a dual core). So breaking them up more and maintaining sync with critical sections (or whatever else works) will probably be faster (but harder to design). Or if you break it up a lot (like every dynamic object having its own thread and dealing with its environment only through event queues, for an extreme example), then context switching and managing all those queues will really hurt performance. I think double buffering is certainly a good way of handling multiple threads (I think Novodex uses a form of that), but it's not the best solution for everything. As a general design focus your system looks pretty decent (I'm considering a similar system), but it's not an end-all-be-all solution, so you'll probably want to mix in some other ways of syncing multiple threads (breaking dependent things into threads, or removing queues when they're excessive), when it'll be worth it. How well it'll work is very application-specific; it may be perfect for what you're doing, or it may be fairly inefficient.

On a side note, when I was looking into some multithreading stuff, I found out that normal allocations (malloc, non-overloaded new) require locking. So consider that when you implement your queues (e.g. use a custom allocator with std::list).
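One common way around the allocator lock the post mentions is a fixed-capacity ring buffer that never allocates after construction. A minimal single-threaded sketch (names and capacity are hypothetical; for cross-thread use the indices would need to be atomics or the queue guarded by a lock):

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-capacity queue: no malloc/new after construction,
// so pushing and popping never touch the (locking) global allocator.
// Holds at most Capacity - 1 items; one slot is sacrificed to tell
// "full" apart from "empty".
template <typename T, std::size_t Capacity>
class RingQueue {
public:
    bool push(const T& v) {
        std::size_t next = (head_ + 1) % Capacity;
        if (next == tail_) return false; // full
        slots_[head_] = v;
        head_ = next;
        return true;
    }
    bool pop(T& out) {
        if (tail_ == head_) return false; // empty
        out = slots_[tail_];
        tail_ = (tail_ + 1) % Capacity;
        return true;
    }
private:
    std::array<T, Capacity> slots_{};
    std::size_t head_ = 0; // next write position
    std::size_t tail_ = 0; // next read position
};
```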
A quick aside: Test all multi-threaded code on a dual-core or dual CPU system - this will expose more issues than on just a single CPU.

As for the idea:
Intriguing - I presume you're limiting yourself to two buffers which swap? If not, how would they swap between the various subsystems? I can't see how this could scale to multiple buffers for multiple subsystems without interaction problems.

From what you say I'd envisage something like the following:

One thread outputting current frame to hardware working on one buffer.
One thread running everything else to compute next frame working on other buffer.
Both threads' current iteration of the loop finished; swap buffers.
Repeat.

Assuming I have the idea right, the single-threaded equivalent would do exactly the same; this approach should be faster at the expense of memory. How much faster is harder to answer. The critical part is that both thread loops must have finished their current task before swapping occurs. It strikes me that one loop as I'm proposing is all output, the other is all calculation, so a graphics thread and a sound thread could be operating on the output buffer, whilst the rest calculates. In addition, don't be afraid to have 'proper' multi-threading with synchronisation within those two loops as needed - as long as you control context switching, individual loops with OpenMP for example might scale well.
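The two-loop scheme above can be sketched per frame like this (the WorldState type and the placeholder "simulation" are assumptions for illustration): one thread does a read-only output pass over the current buffer while the other computes the next frame into the other buffer; joining both threads is the swap barrier.

```cpp
#include <thread>
#include <utility>
#include <vector>

// Hypothetical world state; a real engine would hold far more than this.
struct WorldState { std::vector<float> positions; };

void run_frame(WorldState& current, WorldState& next) {
    std::thread render([&] {
        // Read-only pass over `current` (stand-in for submitting to the
        // GPU, mixing audio, etc.).
        volatile float sink = 0;
        for (float p : current.positions) sink += p;
    });
    std::thread update([&] {
        // Compute the next frame into `next` from `current`.
        next.positions = current.positions;
        for (float& p : next.positions) p += 1.0f; // placeholder simulation
    });
    render.join(); // both loops must finish before the swap -
    update.join(); // this is the critical synchronization point
    std::swap(current, next); // publish: next frame becomes current
}
```

Spawning threads per frame is only for clarity; a real implementation would keep two persistent threads and rendezvous at the swap with a barrier or condition variable.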
There seems to be a bit of confusion between multiprocessor and multicore. The intel site explains it pretty well.

My question is: What's the point if all of the processor-heavy work is done by one thread? You haven't identified CPU bottlenecks, e.g. AI, ROAM, collision. In the case of a ROAM system, updating the index buffer etc. means that it is synced to the rendering, so it'd be part of that. AI and collision may or may not be tightly dependent, etc.
As best I can tell, every subsystem is dependent on the previous subsystem. The data flow is something like:
Input -\
        > Move -> Collide -> React -> Render
AI ----/

Basically, each stage needs the result of the previous stage for that frame (except the AI and Input stages). Assigning a CPU to each stage (obviously combining stages on current CPUs), with each stage having its own version of the world state, greatly reduces the amount of synchronization by not sharing mutable data between threads.

Each stage "finishes" at the same time - like a CPU, there's a master "clock", the slowest stage, and the other stages have to stall until the slow stage completes. Once it does, the buffers get shuffled to the next stage and the next cycle starts.

The problem is that maintaining the world state (N copies, where N is the number of stages) is expensive. There are certainly ways to reduce the memory usage - most of it doesn't change from frame to frame and doesn't need to be held by each stage - but they involve spending more time. Ah, the eternal CS tradeoff...
It is in fact possible to decouple the renderer from the rest of the logic with few problems.

The trick is to have three 'frames' of data hanging around. These would be:
- previous
- last computed
- next

Previous was the state before the current drawing took place, last computed is the final position for the next time step, and 'next' is your "work in progress" frame.

This system works by having fixed-time-step world updates while the renderer throws out frames as fast as it can. While it's rendering, the positions/rotations etc. are interpolated between 'previous' and 'last computed'. Once you get to the same time that 'last computed' was produced, you move everything up the line, so that it looks like:
- last computed
- next
- previous

and then around you go again.
While you are doing the drawing based on the first 2 frames you update the third frame.
Swapping is just a simple matter of moving pointers around, which is an atomic operation, so at each 'update' point you have 3 pointer swaps per object to perform.
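The three-frame rotation above can be sketched like this for a single object (the struct and field names are hypothetical): the renderer blends between 'previous' and 'last computed' while the update thread writes into 'next', and rotating is just pointer moves.

```cpp
// Hypothetical per-object triple buffer for interpolated rendering.
struct Frame { float x = 0; }; // stand-in for position/rotation data

struct TripleBuffer {
    Frame frames[3];
    Frame* previous     = &frames[0]; // state before current drawing
    Frame* lastComputed = &frames[1]; // final state for the next time step
    Frame* next         = &frames[2]; // work-in-progress frame

    // alpha in [0,1]: how far render time has advanced past `previous`.
    float interpolate(float alpha) const {
        return previous->x + alpha * (lastComputed->x - previous->x);
    }

    // Called once render time catches up to lastComputed's time step:
    // everything moves up the line, and the old `previous` frame is
    // recycled as the new work-in-progress buffer.
    void rotate() {
        Frame* oldPrevious = previous;
        previous     = lastComputed;
        lastComputed = next;
        next         = oldPrevious;
    }
};
```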

A key point to note with this system is the need for two clocks, one running one time step AHEAD of the other; this one is for the update thread, so things swap correctly.

I've got a small 'proof of concept' program which shows the above working (50 updates per second and over 1700 fps) for a simple test scene. I think Superpig has done some work on this as well. My hope is to expand upon it all for my final-year project at college for my degree [wink]
