General and Gameplay Programming

In this article we are going to talk randomly about pieces of information that are always better to keep in mind when programming with concurrency, in order to avoid misconceptions, prejudice, common places or plain design errors. I do not have the pretention to know what global design model is better suited for games, but wish to share my experience regarding details of multithreading implementations. My most fluent language is C++, though I'm knowledgeable in C#, java, python and others more esoteric like lisp or prolog; this article aims the low level, thus C++(03) is mostly assumed throughout the article.

## Some terminology

• Scheduling: Deciding when to run and interrupt threads and processes. Usually done by the OS.
Scheduling on wikipedia
• Thread: (a.k.a LWP, lightweight process) One logical flow of instructions in an imperative program.
• Fiber: A user-space emulation of threads, using manual cooperation for scheduling. Considered lighter than threads, thus the name.
Fibers on wikipedia
C.f cooperative scheduling
• Preemption: Sort of the inverse of cooperation. Execution gets stopped by a CPU timer interrupt, the handler is routing to the OS scheduler in which it updates various global logic regarding threads scheduling.
Preemption on wikipedia
• Memory Barrier: Operation that flushes pending load and stores in the load/store buffer of the CPU. Used in general in order to make an operation become visible by other CPUs.
mechanical-sympathy.../memory-barriers
Memory barrier on wikipedia
• Cache coherency: System that allows values in CPU caches to synchronize between separated CPUs.
MESI on wikipedia
• Semaphore: Synchronization primitive that allows to manage access to a number of resource.
Semaphore on wikipedia
• Condition Variable: Is a sync primitive that allows efficient sleep waiting and waking up upon signaling. It should be noted that C# Monitors are Condition Variables (almost).
Cond Var in boost
Cond Var on MSDN
• Mutex: Is a sync primitive that allows to create critical sections. It ensures one and only one thread can win the honor to hold the mutex at once. Some mutex can be process transient (system wide).
mutexes in boost
• Reset Event: A sync primitive a bit higher level than the condition variable, in the sense that the condition is a bool flag, and is contained within the primitive. There are Auto and Manual reset events.
Auto Reset Event on MSDN
• Atomicity: Fact that an operation (or set of operations) is seen as uncut from concurrent activities. The code within a critical section is atomic. The best article I read on that, wikipedia again:
Atomicity on wikipedia
• Atomic operation: Designates the operation(s) that is(are) taken in an atomic section. But very often abused/confused with hardware atomic operations, like test and fetch, lock increment, or lock compare and swap.
Please refer to previous Atomicity wikipedia article.
• Race condition: Fact that depending on execution order of two (or more) concurrent threads, different things happens. It is not always bad, e.g. when every race order possibility has been accounted for, but mostly the term is used pejoratively.
Race condition on wikipedia
Secure Program HOWTO (note the "Anomalous" in their definition.)
• Barrier: sync primitive that allows to wait for a group of thread, be sure that everybody is there, and then let loose all threads.
Barriers in boost doc
• Reduce: Parallel communication algorithm executing an operation (called a collective operation) on a set of values scattered on nodes (or thread), and concentrating the result to one node (or thread). For example to find the Max or the Min of each individual values held by each node (or thread).
man 3 MPI_Reduce
• Migration: Movement of a thread from one physical CPU to another.
Performance implications of single thread migration on a chip multi-core
[color=red]WARNING[/color]: Be critical, don't let yourself be fooled by idiotic resources that look legit on the internet.
An example of an idiotic resource that looks legit: informit.com/guides..cplusplus And why it is not: stackoverflow/../how-come-inc-instruction-of-x86-is-not-atomic Mindjob recursive note: you may also doubt the article you are reading now.

# Use cases in games

Games are poor candidates for parallelism, historical concurrent problems are hash cracking, fluid simulation, matrix decomposition, multi user sessions, multi user services, molecular chemistry, nuclear simulations and other scientific applications.

## Multimedia

Sound However, because of the multimedia involved with games, some degree of parallelism is intrinsic since the beginning of games. Sound playing for instance, cannot be done while the CPU is busy on something, because the CPU must generate the wave. That was true for old PC-speaker based sounds, that locked everything when played. To remedy that issue and bring true multimedia experience to gamers, platforms adopted a super cool thing known as the sound card. The sound card has onboard memory, a DSP, a DAC and small (pre)amplifiers. The CPU needs only very little time to copy a buffer to play into sound card's memory, and then the sound card reads that sound by itself while the CPU can be busy doing something else (or it uses DMA like written here). And a good game must loop often enough on the sound update procedure to always ensure that the CPU never lets the sound card finish playing before its onboard buffer becomes empty, otherwise cracking occurs. Virtually every sound library out there uses a thread to ensure that this is done in time and doesn't depend on cooperation from the client. However this is insufficient in case this thread doesn't get scheduled in time and some OS kernels modules offers a special scheduling mode to give special priority to threads that are flagged as "sound uploaders". c.f Jack daemon and its kernel module : JACK on wikipedia.
"What the heck is this guy suddenly talking about?" "Is this an article about sound or a threading article?" Well if you hadn't noticed the past paragraph is full of the words we covered in the Terminology section. Tadaaa, we have seen a real use case of parallelism in games.
Display The next level is the display, starting with devices equivalent to the Amiga, some display processors could be used for their specific design and unleash great power while unloading the main CPU. This still very much exists today in the form of what we call graphic cards. Graphic display is a form of complex parallelism in the fact that the whole process runs in parallel and that it is massively parallel itself internally as well. A high-end graphics card is sometime said to have 1500 scalar cores for example. This means 1500 silicon units capable of handling a 1x32bit float mathematical operation. This kind of very widely symmetric parallelism is possible thanks to simplified hardware. Notably the scheduler that shares code states for a whole bunch of cores. (refer to Nvidia Kepler GK110 Architecture Whitepaper (pdf))

## Heavy computations

Those are the most futuristic and interesting things to do for rich games. It includes physics computation mostly, notably interactive indirect lighting solution computations, fluid simulations, soft body deformations and collisions, other world simulations like plant growing. Or IA for wildlife, or massive environments like cities populated with tens of thousands of actors... Rendering Realistic rendering still takes a lot of power these days, for example the most realistic renderers we know to date are path tracers. Path tracing still takes a few seconds to converge to a stable solution even in the fastest products. Like Octane Render when running on multiple graphics cards. But it is realistic to envision a game that makes use of a hybrid between rasterization and raytracing. I actually presented (and developed) such an engine at Siggraph 2010 in the booth of Caustic Graphics, as a demo of their OpenRL library.

# Support

## Hardware

Multi cores Multi-core systems, or multi-processor systems, are the reason why using multithreading is interesting. Otherwise threads have very very little benefit over a plain big-loop-updates-all pattern, and the price of the danger that comes with them is just not worth it. On uniprocessor systems, threads can be good for asynchronous waits, IO stuff, network stuffs and sound stuffs. But frankly other than that just using them because we think it will create a neat isolation of features is a risky bet. For that we're better off with decoupling. Thankfully, AMD and Intel are there (ARM also now) and served us with awesome truly buyable-by-all multi-CPU systems, and gives multithreading a true raison d'?tre for performance. AMD AMD has an edge for multi-core performance on their Opteron line, because they actually feature NUMA nodes. This is awesome because it means that not only can you perform calculations (the core of the CPU) in parallel, but you can also access the main memory in parallel thanks to a switched fabric bus. This is partly thanks to the fact that AMD adopted on chip memory controllers from a long time ago.

## Store buffer and compilation

The Load buffer and the Store buffer are little pipelines between the registers and the caches, they hold the values "pending" to write or to read before/after a computation is over. Because of compilation, the result of a computation may never be granted a STORE, because the compiler sees no dependence on that variable. You need to use the volatile keyword to force compilation to end any computation with a Store instruction. Also bear in mind that when pending variables are in the write buffer, they are invisible to other cores. You also may not rely on Store orders because of compilation re-ordering and processor instructions re-ordering (which are two different things but both make problems). If you have seen a state from reading a variable set by another thread, you cannot suppose that "this other thing is necessarily over" because you wrote it before!  // thread 1 myThing; bOver = true;   // thread 2 while (!bOver); // yeah myThing is done now !  That is a beginner's mistake. First there is the obvious, bOver is useless (by static analysis) so the compiler doesn't store it. In thread2 it doesn't Load it so the loop is just an infinite false, that the compiler will warn you about. You need volatile to solve both problems. But, by static analysis, bOrder has no dependence with myThing, therefore it is free to do the Store before even doing myThing. To ensure order, you need to force your compiler to break its dependency static analysis by calling a non-inlined function between myThing and bOver = true. (xBox's compiler has a specific "compiler barrier" intrinsic : _ReadWriteBarrier)
Note that this might not be enough, LLVM for instance has internal optimizer-related flags to 'colorate' functions; some of which are related to detection of side effects.
That is one reason more why you shouldn't try to do lockless programming, if you are not aware of that. At least using locks, the potentiality of falling into a 'trap' (kernel call) will break the optimizer certainty about side effects, and ensure correct ordering. And finally, to ensure order on the CPU level, you need to call a Store Memory Barrier before bOrder = true to finish whatever myThing did. To call a memory barrier, either your compiler can propose an intrinsic, either you can call a dummy locked inc or dec using an assembly directive. And that is why all of that is cumbersome, that most people simply use mutexes+locks which happen to have all of what I described above! Microsoft-specific Be very careful, Microsoft has a special implementation of the volatile keyword in its compilers, it is more than standard in the sense that volatile reads and writes are also paired with memory barriers. And we can expect that the compiler is making no assumption on dependencies when a variable is marked volatile. Don't let yourself be fooled by that MSDN example that is completely not standard and not expected to work on another compiler. To be really working everywhere it would need the both additions that I mentioned above. Or simply to use a lock.

## OS

The scheduler has a variety of entry points : it kicks in at process startup and process end, when a thread enters or exit sleep state or wait state, or when the periodic kernel tick interrupts.
The tick is the only system through which pre-emption is possible. On windows, a normal task starts with 6 points and looses 3 every tick (15.6ms) thus is supposed to relinquish its slice after 2 ticks. I/O disturbs that because tasks gets interrupted, execute some code that is not their own, and all this time is counted as "their fault". The Vista scheduler has some fixs about that, and the linux has an actual "Completely Fair Scheduler".
Note that altering the periodic tick timer rate (with beginTimePeriod) will not change the frequency of when the scheduler executes to determine slice expiration. This is because the routine that is serviced at tick interruption checks that a full 15.6ms time slice passed before executing the pre-emption system. So by actually reducing tick period you may rather introduce an aliasing (by using periods that are not multiple of 15.6) and slow down your system because of slower task switching. Locks The implication of that, is that when trying to acquire a congested lock, the thread will enter a waiting status, and therefore the OS scheduler will switch context to another thread, at this very moment. Basically your calls to the mutexes/critical sections functions (Enter/Exit or scoped lock system or whatever) are literally calls to kernel scheduler procedures. This actually is a chance because it makes things really reactive. If a lock is acquired, we might as well run the thread that owns it the quickest as we can. However, at release time, the thread in wait state will be rescheduled faster if using condition variables.

# Other Mistakes

I'll throw some random things.

## double checked locking pattern

A good link is much better than a re-heated speech. I'll let you read this one peacefuly: http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html It says basically, don't do:  public Helper getHelper() { if (helper == null) { synchronized(this); if (helper == null) helper = new Helper(); } return helper; }  You all recognized the singleton pattern, well when we want to make it thread-safe we need to lock its access, though we often think (wrongly) that locking is too slow and there comes the double-checked-locking. The idea comes from the fact that most of the time we will have the first test false so we avoid the lock. Please check the link to see why this assumption fails.

## thinking that volatile solves everything

Well we saw that one before right? To rub it in even more, a little general knowledge link about it: http://www.bluebytesoftware.com/blog/PermaLink,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx
And it actually gets worse than just that: Volatiles Are Miscompiled.

## thinking that atomic operations are fast

If you knew that a memory barrier was a risk of CPU stall and a potential slowdown, then you need to know that atomic operations are worse. Compare and Swap is one of the slowest. Most implementations of memory barriers are actually implemented in terms of dummy atomic operations (lock-prefixed instructions like xchg/xadd/xinc).

## thinking that boost shared_ptr is thread safe

It is not; because it has to manage 2 reference counters (weak and strong) which are not updated atomically and thus coherency is not guaranteed.

## Lock-free

There has been a buzz about lock-free algorithms in the past 5 years or so, and somehow since then everybody is all about trying to do lock-free stuff. Man, you should be glad to have multiple cores in the first place, rather than thinking about grabbing the last 1% of performance there is to grab, by investing 500% of the time it would have taken doing an algo lock-full. If your algorithm would benefit more than that by being lock-free, it is a clear sign that it is congested. OS and kernel systems may not have the choice, and everyone depends on them, thus taking the time to develop lock-free stuff makes sense. Now, the reference paper to read before pursuing the conversation: Andrei Alexandrescu?s "Lock-Free Data Structures" Surely you noted:
In fact, there must be around half a dozen of lock-free programming experts around the world, and yours truly is not among them.
Is he being condescendent assuming any reader is just not intelligent enough, or is he talking about himself as "our" since he is serving us with his text ? I?m not fluent enough in English to decide that. But both ways convey the same message; because even were he talking about himself, this guy is still a clever lad, he authors the C++ language after all (yeah just that). Well not alone, but still. Another quote for the road:
livelock--each tries to dodge the other's locking behavior, just like two dudes in the hallway trying to go past one another but end up doing that social dance of swinging left and right in synchronicity. We humans are pretty good at ending that with a laugh; processors, however, often enjoy doing it til rebooting sets them apart
I love that way of putting it, so that everybody can understand. So we have a definition of livelock which is pretty good for my article. Also we have one of the reasons why lock-free algorithms are good. They have a lot of merits like being deadlock-less and livelock-less. They also constitute more robust system by avoiding freezing systems that would wait for somebody who is dead (Thread killing immunity). And a bunch of other cool stuffs that you can read about in the paper. Conclusion Don?t try to go lock-free, if you end up with explicit memory barriers every two lines, what did you gain? You may even one day realize that you re-invented the mutex !
Also, it is not because you are lockless that you are never preempted. After that we can add the problems of spin-waitings (battery drain, slice hoging on schedulers that are not completely fair), and deterministic memory freeing headaches...

## Being afraid of locking

There is a legend out there that freezes the blood of every performance programer saying that locking is expensive. I'm not sure where it comes from, maybe when a non contended lock was making a trip in the kernel... Anyway, today locks remains user space when non contended (e.g. primitives like Microsoft CRITICAL_SECTION). I have timed on MacOS with a colleague boost::scoped_lock on a mutex to execute in the time equivalent to perform 4 integer operations. Of course that is for congestion 0. If your thread has to wait, it will make a trip to the kernel to let the scheduler do some decisions. So you should be afraid of congestion rather than the lock itself.
From this msdn article : "Lockless algorithms are not guaranteed to be faster than algorithms that use locks". Always profile your programs, don't early optimize, especially with multithreading, where correctness is delicate.

# Synchronizations

## Sleeping

Coalescence To further save battery, (from) Windows 7, timers may be Coalesced. I?ll let you read about that if you are interested. Basically it means that you can setup an error margin on your sleep times, so that the OS may group a lot of timers on the same kernel tick, and disable the periodic tick for the rest of the time. Conclusion Do not rely on Sleep even on tickless kernels, it will rarely have the needed precision in games, and you risk yielding your thread for a moment as long as your frame ! On windows there is no microSleep or nanoSleep function like in linux, but I have no idea how these functions are implemented. I timed (using performance counters) the usleep function on an idle system, it respected the requested sleep periods very closely. Even for sleeps of tens of microseconds only. But I believe from some old documentation that these short period sleep functions may use spin waiting. Another way to manage true uSleep would be by programming the HPET for an interruption on the fly. But with the many ACPI issues that chipsets has, it seems hazardous.

## Wait on condition

Condition variables are the smallest concurrent synchronization primitives that allows very efficient wait and wake up. They are accessible through the excellent boost::thread library. Or manually since Windows Vista. CS work only in pair with a mutex (a Win32 critical section object), so that it can release and reacquire it on the Wait call. Typical mistakes with CS Not checking condition before sleeping If you go straight to wait before even checking the condition, then you risk the chance of missing the signal. Waiting and Signaling happens at the end of the race of the respective threads calling them, therefore you must always suppose both outcomes of the race-condition. Missing the signal in this case can be very severe because it could mean freezing the thread forever. Not looping There is a pattern to use a CS, you need to respect:  boost::unique_lock lock(mut); while(!data_ready) cond.wait(lock); // pursue  If you do not loop, then you have no guarantees about the state of your condition. You need to check it again of you want to do something that relies on it. If you just wanted to be wake up then that's fine without a loop though. You have no guarantees because in the wait() call after having been notified by a signal, at the moment the lock is reacquired, some other thread may have acquired the lock already and modified the condition in between. Another source of mistake is believing that your thread cannot escape from the wait() call without a pair notify. It is false because of spurious wake ups. They happen because CS are made to be specially fast and for that, sacrifice consistency to performance. Signaling in the scope of the lock Your data provider thread will emit a signal to wake up the waiter. Do that after you are finished with the data and you have exited the scope of the lock (released the mutex). If you signal your waiter thread while holding the lock, how is the poor waiter supposed to acquire the lock to pursue its job? Fortunately, even in this case your waiter will not be stuck in limbo until the next signal, because it has already moved on to the lock-wait part. Therefore the problem here is one of performance only, you will avoid some context switches by signaling from outside locks. This is where C# is disturbing, don?t let yourself be fooled by the Monitor. Pulse function, that does nothing, stuff is actually executed on Monitor.Exit. Misusing timeouts Timeouts are very important to avoid deadlocks and bring robustness facing error scenarios. The problem is that because CS has to be checked in loops, you need to work with absolute times in order to pass to the wait function the remaining amount of time until the timeout. If you just were to pass a simple constant "10 ms from now" every time you re-enter the loop you grant the wait a new budget of time for free! You may even never go out of your waiting loop using that. Boost documents that perfectly, and recommend the system_time::now + period method. CS Alternative To avoid lost time due to execution yielding, sometime spin locks may be an option. In the case of true multi core systems, and for very short periods, if the produced resources bursts in quickly, then the consumption may as well be fast. A spin lock is implemented using a try_lock and looping over it until acquired.

## Interrupting

We all long for a radical method like KillThread; which actually may even be available sometimes. However with delicate threads, radical manners are rarely suited. The typical pattern to stop a thread is by using interruption points. This system is exactly cooperative scheduling, some agent is sending an interrupt request by setting a flag like exit_when_you_can to true. And the running thread must regularly check for that flag and do something then (e.g like exiting, that?d be great heh). Yielding Even though preemptive schedulers are the norm, we do encounter code that uses manual yield points. Like seen in boost::yield, or Sleep(0). Note that calling yield will be a NoOP if there is no thread in the waiting queue for your CPU. If you wish to save battery by doing that, you're doing it wrong. In this case use real sleeping states using a Wait on condition, or a timed sleep with at least one kernel slice of sleep time. Usually non real time applications uses "Wait on event" pattern (Win32 famous GetMessages function), which happens to be a pattern that is perfect for saving battery life.
In games, the only moment when the render thread execution can breathe, is when waiting for the VSync signal. This will be handled by force when using the buffer swap function. Note that Android mostly doesn't support VSync, nor does it support any setup of how many frames the CPU is allowed to have in advance (swap interval).

# Conclusion

C++11 I'd like to be able to cover promises and futures which are a concept taken straight from lisp (in a paper from 1992) that has been introduced into C++11, but I have no experience at all and thus let the reader look for these if interested. Apparently newer versions of boost start to integrate this concept into C++03 as well. thrust The nVidia C++ library thrust (developer.nvidia/thrust) has the potential to cut costs drastically in parallel compute-heavy jobs developments, I'd like to be able to cover it but I know naught about it.

# Article Update Log

15 June 2013: various enrichments + little correction + OS paragraph 03 May 2013: wrote forgotten paragraph (last) in other mistakes. 03 May 2013: Initial submission 17 Apr 2013: Start Draft

Report Article

## User Feedback

You need to be a member in order to leave a review

## Create an account

Register a new account

• ### Latest Published Articles

• #### Game Engine Containers - handle_map

Jeff Kiah This article explores the creation of a data container for game programming. This container is meant to take the place of C++ standard library containers such as std::map and std::unordered_map, with an alternative that stores data contiguously in memory.
• 34698 views
• #### Casual Connect 2018 Coverage

Beth Feldman GameDev.net's coverage of Casual Connect 2018 from Anaheim, CA.
• 253 views
• #### Postmortem: I Am Overburdened, Recaps and Numbers

Spidi provides a fully detailed breakdown of the development and business results of the release of "I Am Overburdened".