Studies of Threading Successes in Popular PC Games and Engines
Posted March 8 2:05 PM by Mike Lewis
This tutorial presented an in-depth look at the multithreading architecture behind Supreme Commander, as well as the designs used in the Gamebryo engine and Allegorithmic's Substance procedural texture system.
Major Philosophies of Multithreading
Thread for performance - split work across multiple CPU cores in order to maximize the speed at which the game runs
Thread for features/eye candy - use extra available cores to support optional bonus features or graphical effects
Thread for middleware - leave the multi-core scaling to middleware libraries for physics, AI, and so on; this approach has the distinct disadvantage that the developer cannot know how well the various libraries will play together
Common Strategies for Multithreading
The simplest and most common method for multithreading is the render split. In this model, rendering and game simulation are split into distinct threads, and the renderer periodically synchronizes with the simulation thread to update the display. This typically introduces latency of a frame or two between the simulation and the renderer thread, since world state must be effectively double-buffered to avoid excessive locking. In addition to the primary render and simulation threads, a few extra threads may be used for things like audio and loading IO.
Render split architectures are relatively easy to implement, but have the serious disadvantage of not scaling past a few cores. Generally, a render split will not gain benefits beyond 4 cores, and may in fact experience decreased performance. While this is not especially problematic for the current generation of games, as the number of CPU cores in average hardware continues to increase, scaling past 4 cores will become increasingly critical.
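As a rough illustration of the render split described above, the following C++ sketch runs a simulation thread and a render thread that communicate through a double-buffered world snapshot. All names here (`WorldState`, `WorldBuffer`, `runRenderSplit`) are hypothetical and chosen only for this example; it is a minimal sketch under those assumptions, not the design of any particular engine.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical world snapshot; a real game would hold far more state.
struct WorldState {
    int tick = 0;
    std::vector<float> unitPositions;
};

// Double buffer: the simulation works on its own private copy and publishes
// it under a short lock; the renderer copies the latest published snapshot.
class WorldBuffer {
public:
    void publish(const WorldState& state) {   // simulation thread
        std::lock_guard<std::mutex> lock(mutex_);
        front_ = state;
    }
    WorldState snapshot() const {             // render thread
        std::lock_guard<std::mutex> lock(mutex_);
        return front_;
    }
private:
    mutable std::mutex mutex_;
    WorldState front_;
};

// Runs `ticks` simulation steps while a render thread reads snapshots.
// Returns the last tick the renderer saw, or -1 if it ever observed a
// half-updated world (which the double buffer is meant to prevent).
int runRenderSplit(int ticks) {
    WorldBuffer buffer;
    std::atomic<bool> done{false};
    int lastRenderedTick = 0;
    bool torn = false;

    std::thread sim([&] {
        WorldState working;                    // private to the simulation
        working.unitPositions.assign(4, 0.0f);
        for (int t = 1; t <= ticks; ++t) {
            working.tick = t;
            for (float& p : working.unitPositions) p += 1.0f;
            buffer.publish(working);
        }
        done.store(true);
    });

    std::thread render([&] {
        for (;;) {
            // Read the flag first so the final snapshot is never missed.
            const bool finished = done.load();
            const WorldState frame = buffer.snapshot();
            for (float p : frame.unitPositions)
                if (p != static_cast<float>(frame.tick)) torn = true;
            lastRenderedTick = frame.tick;     // stand-in for drawing
            if (finished) break;
            std::this_thread::yield();
        }
    });

    sim.join();
    render.join();
    return torn ? -1 : lastRenderedTick;
}
```

The renderer only ever works on its own copy of the world, which is where the frame or two of latency comes from: what is drawn is always the most recently published tick, not the one currently being computed.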
A Slightly Better Approach: The Work Crew
The work crew architecture extends the render split concept slightly. Rather than having a single thread for game simulation, the work of simulation is split into multiple "chunk" threads. Each thread performs a subset of the simulation work required per simulation tick, and then the results are merged before synchronizing and updating the render thread's world data copy in the double buffer.
This method can be very effective for certain types of games. However, it runs the risk of introducing severe locking contention during the synchronized merge phase. There is also a high demand for memory bandwidth during the synchronization phase, which can cause cache thrashing problems as well as bus saturation issues.
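One way to picture a work crew is the sketch below, which splits a single simulation tick across several chunk threads. The function name `simulateTick` and the position/velocity data layout are assumptions made for this example. To sidestep the merge contention described above, each thread writes only to its own disjoint slice of the output, so the "merge" reduces to joining the threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Advances every position by velocity * dt using up to `crewSize` chunk
// threads. Assumes positions and velocities have the same length.
std::vector<float> simulateTick(const std::vector<float>& positions,
                                const std::vector<float>& velocities,
                                float dt, std::size_t crewSize) {
    if (crewSize == 0) crewSize = 1;
    const std::size_t n = positions.size();
    std::vector<float> next(n);
    std::vector<std::thread> crew;

    // Each crew member gets one contiguous slice of the work.
    const std::size_t chunk = (n + crewSize - 1) / crewSize;
    for (std::size_t c = 0; c < crewSize; ++c) {
        const std::size_t begin = c * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        crew.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                next[i] = positions[i] + velocities[i] * dt;
        });
    }

    // Merge point: wait for every chunk before the result is published.
    for (std::thread& member : crew) member.join();
    return next;
}
```

Work that cannot be partitioned this cleanly (for example, units interacting across chunk boundaries) is where the locking and bandwidth costs of the merge phase start to appear.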
Worker Thread Pooling
Another standard architecture involves creating a pool of generalized worker threads. Rather than assigning a specific task to each thread, a centralized queue stores a list of work requests. As threads in the pool become available, they service the requests and remove them from the queue, optionally placing the results of the work in another centralized location.
This approach is very powerful, but again has its limitations. In particular, interdependent requests are very difficult to deal with. If one work request relies on the results of another, locking and contention issues can become almost impossible to manage effectively. For this reason, the usage of this technique must be managed very carefully and planned out as thoroughly as possible before code is written.
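The pooling idea above can be sketched with a mutex, a condition variable, and a queue of callable requests. The `WorkerPool` class and `sumWithPool` helper are hypothetical names for this example, and the sketch deliberately handles only independent requests; it makes no attempt to solve the dependency problem just described.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(std::size_t threadCount) {
        for (std::size_t i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    // Drains any remaining requests, then joins all workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) worker.join();
    }
    // Adds a work request to the centralized queue.
    void submit(std::function<void()> request) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            requests_.push(std::move(request));
        }
        wake_.notify_one();
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> request;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !requests_.empty(); });
                if (requests_.empty()) return;  // stopping and fully drained
                request = std::move(requests_.front());
                requests_.pop();
            }
            request();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> requests_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: started after the rest exists
};

// Sums 1..count by submitting one independent request per value.
int sumWithPool(int count, std::size_t threadCount) {
    std::atomic<int> total{0};
    {
        WorkerPool pool(threadCount);
        for (int i = 1; i <= count; ++i)
            pool.submit([&total, i] { total.fetch_add(i); });
    }  // leaving the scope waits for all requests to finish
    return total.load();
}
```

Because every request here is independent, the only shared state is the queue itself and the atomic result. Requests that depend on each other's results would need ordering on top of this, which is exactly where the difficulty lies.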
Case Study: Supreme Commander
The tutorial's first case study provided a glimpse into the technical decisions that went into the multithreading architecture in Supreme Commander.
The game was single-threaded until approximately a year into development. Retrofitting multithreading at that point was difficult but entirely possible.
Supreme Commander makes use of a render split technique. Because of the late introduction of threading in the development cycle, alternatives were limited.
The code makes use of Boost's threading library.
The rendering thread runs at full speed, at up to 10 times the simulation tick rate.
Rendering and game simulation are synchronized via a minimal-locking queue mechanism; these locks are minor and introduce almost no performance hit.
The render/sim synchronization layer is also used to synchronize multiplayer games. This dependency is the main cause of performance issues when playing multiplayer with someone whose hardware is significantly slower.
This architecture tends to scale very well with varying loads: if the simulation is busy, the renderer can keep moving at full speed; if the renderer is the bottleneck, the simulation retains its full precision by running faster in the background and dropping display frames.
Supreme Commander runs best on a 4 core system, due to the extra background threads used for additional work (besides the main render and simulation threads).
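The session did not show the game's actual synchronization code, so the following is only a guess at what a minimal-locking queue between simulation and renderer could look like. `UnitMoved` and `SyncQueue` are hypothetical names. The idea sketched here is that the renderer takes the whole batch of pending updates with a single swap, so the lock is held only briefly on either side.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical update message sent from the simulation to the renderer.
struct UnitMoved {
    int unitId;
    float x;
    float y;
};

class SyncQueue {
public:
    // Simulation side: append one update under a short lock.
    void push(const UnitMoved& update) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(update);
    }
    // Render side: take every pending update in one swap. The renderer then
    // processes the batch without holding the lock.
    std::vector<UnitMoved> drain() {
        std::vector<UnitMoved> batch;
        std::lock_guard<std::mutex> lock(mutex_);
        batch.swap(pending_);
        return batch;
    }
private:
    std::mutex mutex_;
    std::vector<UnitMoved> pending_;
};
```

With this shape, a renderer running several frames per simulation tick would simply find an empty batch on most frames and redraw from the state it already has.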
Case Study: Gamebryo Game Engine
The second case study of the day covered the Gamebryo engine, and addressed the design of the engine's parallel programming features.
Ultimate goal: write code once, run it everywhere
Target multiple platforms, and if possible, automatically take advantage of each platform's strengths
Stream processing was selected as the ideal model to suit many types of hardware architectures
Define a "kernel" (operation on a group of data) and then feed it a data stream
Data streams can be partitioned into various sized chunks for simplicity and efficiency
This model maps well onto real hardware such as the PS3's SPUs
High-level control over when and how kernels are executed is provided via a Workflow manager class
The workflow manager internally builds a dependency graph to ensure that tasks execute in the fastest possible order
Each workflow involves a synchronization step at the end, where all kernels' results are accumulated
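The kernel-and-stream model listed above can be illustrated in plain C++ as follows. This is not Gamebryo's API: `Kernel` and `runKernel` are names invented for this sketch, `std::async` stands in for whatever scheduling the engine actually uses, and the dependency graph is omitted entirely. The sketch shows only the partitioning of a stream into chunks and the synchronization step at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// A kernel processes one contiguous chunk of the data stream in place.
using Kernel = std::function<void(float* data, std::size_t count)>;

// Partitions `stream` into chunks of at most `chunkSize` elements, runs the
// kernel on each chunk asynchronously, then waits for all chunks to finish.
void runKernel(const Kernel& kernel, std::vector<float>& stream,
               std::size_t chunkSize) {
    if (chunkSize == 0) chunkSize = 1;
    std::vector<std::future<void>> inFlight;

    for (std::size_t offset = 0; offset < stream.size(); offset += chunkSize) {
        const std::size_t count = std::min(chunkSize, stream.size() - offset);
        float* chunk = stream.data() + offset;
        inFlight.push_back(std::async(std::launch::async,
            [&kernel, chunk, count] { kernel(chunk, count); }));
    }

    // Synchronization step: every chunk must complete before results are used.
    for (std::future<void>& task : inFlight) task.get();
}
```

Because the kernel sees only a pointer and a count, the same kernel could in principle be handed chunks sized for a CPU core's cache or for a small local store, which is the portability argument behind the stream model.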
Case Study: Substance by Allegorithmic
The third and final case study of the session was focused on the Substance procedural texture generation system. A brief overview of the tool and its capabilities was provided, and the presenter described how Substance's on-GPU texture generation is capable of generating high-resolution textures without the traditional seek/load latencies of hard drives or optical media.
One major issue facing Substance is that many game developers do not wish to cede GPU cycles to the texture library, even though ample CPU cycles are often available (i.e. the game is GPU bound). To handle this situation, Substance offers a mode where spare CPU cycles are consumed to generate the procedural textures, at a minimal or non-existent framerate cost. This diminishes the amount of "stalling" when a game must suddenly load new textures, and avoids ugly artifacts where low-resolution textures must be used while waiting for higher resolution assets to be loaded.
Final Takeaway
A few additional useful tips and general bits of advice were handed out during the tutorial session.
Start designing for threading as soon as possible. Threading is already mandatory technology for new game titles.
Don't be afraid to multithread late in a project if necessary - but be aware that it may not be easy, and the options may be limited.
Decouple threads from each other as much as possible; locking and extra synchronization steps are serious problems.
Build in a debug tool that allows users to see thread timings live while playing the game; Supreme Commander used such a tool to great benefit.
Use tools like Intel's Thread Profiler and VTune to discover thread bottlenecks, memory cache thrashing problems, and more. These tools are worth their price tag many times over for serious projects.
A single shared heap between threads is a bad mistake; set up your code to allocate a heap per thread.
Beware of doing many small allocations and frees; heap fragmentation and cache thrashing can be a major problem here.
To help with memory management issues, design code from the very beginning to allow easy replacement of your memory manager. If necessary, deploy multiple memory managers within the same project to handle different memory usage patterns.
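The memory advice in the last three tips can be combined into one small sketch: an allocator interface that makes the memory manager replaceable, a bump-style arena that serves many small allocations from one block, and one arena per thread via `thread_local`. Every name here (`Allocator`, `ArenaAllocator`, `threadArena`, `allocateOnWorker`) and the 64 KiB arena size are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical interface so the memory manager can be swapped out later.
class Allocator {
public:
    virtual ~Allocator() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void reset() = 0;
};

// Bump allocator: small allocations are carved from one block and released
// all at once, which avoids per-allocation frees and fragmentation.
class ArenaAllocator : public Allocator {
public:
    explicit ArenaAllocator(std::size_t capacity) : storage_(capacity), used_(0) {}

    void* allocate(std::size_t bytes) override {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start > storage_.size() || bytes > storage_.size() - start)
            return nullptr;  // arena exhausted
        used_ = start + bytes;
        return storage_.data() + start;
    }
    void reset() override { used_ = 0; }
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};

// One arena per thread: no locking and no sharing between threads.
ArenaAllocator& threadArena() {
    thread_local ArenaAllocator arena(64 * 1024);
    return arena;
}

// Performs `count` small allocations on a worker thread and reports how many
// bytes that worker's own arena handed out.
std::size_t allocateOnWorker(int count, std::size_t bytesEach) {
    std::size_t used = 0;
    std::thread worker([&] {
        ArenaAllocator& arena = threadArena();
        for (int i = 0; i < count; ++i) arena.allocate(bytesEach);
        used = arena.used();
    });
    worker.join();
    return used;
}
```

Code that allocates only through the `Allocator` interface can later be pointed at a different manager (a pool for fixed-size objects, for example) without changing its call sites.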