Jump to content

  • Log In with Google      Sign In   
  • Create Account

Like
38Likes
Dislike

Multithreading

By Vivien Oddou | Published May 11 2013 06:29 AM in General Programming
Peer Reviewed by (Josh Vega, GuardianX, jbadams)

c++ mem-fence thread low-level condition-variable mutex c#

In this article we are going to talk randomly about pieces of information that are always better to keep in mind when programming with concurrency, in order to avoid misconceptions, prejudice, common places or plain design errors.

I do not have the pretention to know what global design model is better suited for games, but wish to share my experience regarding details of multithreading implementations.

My most fluent language is C++, though I'm knowledgeable in C#, java, python and others more esoteric like lisp or prolog; this article aims the low level, thus C++(03) is mostly assumed throughout the article.

Some terminology

  • Scheduling: Deciding when to run and interrupt threads and processes. Usually done by the OS.
    Scheduling on wikipedia
  • Thread: (a.k.a LWP, lightweight process) One logical flow of instructions in an imperative program.
    Thread on wikimedia
  • Fiber: A user-space emulation of threads, using manual cooperation for scheduling. Considered lighter than threads, thus the name.
    Fibers on wikipedia
  • Cooperation: Hand made multi-tasking. Each thread specifies points (in code) where it can yield.
    C.f cooperative scheduling
  • Preemption: Sort of the inverse of cooperation. Execution gets stopped by a CPU timer interrupt, the handler is routing to the OS scheduler in which it updates various global logic regarding threads scheduling.
    Preemption on wikipedia
  • Memory Barrier: Operation that flushes pending load and stores in the load/store buffer of the CPU. Used in general in order to make an operation become visible by other CPUs.
    mechanical-sympathy.../memory-barriers
    Memory barrier on wikipedia
  • Cache coherency: System that allows values in CPU caches to synchronize between separated CPUs.
    MESI on wikipedia
  • Semaphore: Synchronization primitive that allows to manage access to a number of resource.
    Semaphore on wikipedia
  • Condition Variable: Is a sync primitive that allows efficient sleep waiting and waking up upon signaling. It should be noted that C# Monitors are Condition Variables (almost).
    Cond Var in boost
    Cond Var on MSDN
  • Mutex: Is a sync primitive that allows to create critical sections. It ensures one and only one thread can win the honor to hold the mutex at once. Some mutex can be process transient (system wide).
    mutexes in boost
  • Reset Event: A sync primitive a bit higher level than the condition variable, in the sense that the condition is a bool flag, and is contained within the primitive. There are Auto and Manual reset events.
    Auto Reset Event on MSDN
  • Atomicity: Fact that an operation (or set of operations) is seen as uncut from concurrent activities. The code within a critical section is atomic. The best article I read on that, wikipedia again:
    Atomicity on wikipedia
  • Atomic operation: Designates the operation(s) that is(are) taken in an atomic section. But very often abused/confused with hardware atomic operations, like test and fetch, lock increment, or lock compare and swap.
    Please refer to previous Atomicity wikipedia article.
  • Race condition: Fact that depending on execution order of two (or more) concurrent threads, different things happens. It is not always bad, e.g. when every race order possibility has been accounted for, but mostly the term is used pejoratively.
    Race condition on wikipedia
    Secure Program HOWTO (note the "Anomalous" in their definition.)
  • Barrier: sync primitive that allows to wait for a group of thread, be sure that everybody is there, and then let loose all threads.
    Barriers in boost doc
  • Reduce: Parallel communication algorithm executing an operation (called a collective operation) on a set of values scattered on nodes (or thread), and concentrating the result to one node (or thread). For example to find the Max or the Min of each individual values held by each node (or thread).
    man 3 MPI_Reduce
  • Migration: Movement of a thread from one physical CPU to another.
    Performance implications of single thread migration on a chip multi-core
WARNING: Be critical, don't let yourself be fooled by idiotic resources that look legit on the internet.
An example of an idiotic resource that looks legit: informit.com/guides..cplusplus
And why it is not: stackoverflow/../how-come-inc-instruction-of-x86-is-not-atomic
Mindjob recursive note: you may also doubt the article you are reading now.

Use cases in games


Games are poor candidates for parallelism, historical concurrent problems are hash cracking, fluid simulation, matrix decomposition, multi user sessions, multi user services, molecular chemistry, nuclear simulations and other scientific applications.

Multimedia


Sound

However, because of the multimedia involved with games, some degree of parallelism is intrinsic since the beginning of games. Sound playing for instance, cannot be done while the CPU is busy on something, because the CPU must generate the wave. That was true for old PC-speaker based sounds, that locked everything when played. To remedy that issue and bring true multimedia experience to gamers, platforms adopted a super cool thing known as the sound card. The sound card has onboard memory, a DSP, a DAC and small (pre)amplifiers.

The CPU needs only very little time to copy a buffer to play into sound card's memory, and then the sound card reads that sound by itself while the CPU can be busy doing something else (or it uses DMA like written here). And a good game must loop often enough on the sound update procedure to always ensure that the CPU never lets the sound card finish playing before its onboard buffer becomes empty, otherwise cracking occurs. Virtually every sound library out there uses a thread to ensure that this is done in time and doesn't depend on cooperation from the client. However this is insufficient in case this thread doesn't get scheduled in time and some OS kernels modules offers a special scheduling mode to give special priority to threads that are flagged as "sound uploaders". c.f Jack daemon and its kernel module : JACK on wikipedia.

"What the heck is this guy suddenly talking about?" "Is this an article about sound or a threading article?"

Well if you hadn't noticed the past paragraph is full of the words we covered in the Terminology section. Tadaaa, we have seen a real use case of parallelism in games.

Display

The next level is the display, starting with devices equivalent to the Amiga, some display processors could be used for their specific design and unleash great power while unloading the main CPU. This still very much exists today in the form of what we call graphic cards. Graphic display is a form of complex parallelism in the fact that the whole process runs in parallel and that it is massively parallel itself internally as well. A high-end graphics card is sometime said to have 1500 scalar cores for example. This means 1500 silicon units capable of handling a 1x32bit float mathematical operation. This kind of very widely symmetric parallelism is possible thanks to simplified hardware. Notably the scheduler that shares code states for a whole bunch of cores. (refer to Nvidia Kepler GK110 Architecture Whitepaper (pdf))

Task system


A task system, while I'm not here to cover how it should be done, is a serious example of usage of parallelism in games.

A task system is a central component of an Engine that have made the design decision to be task-based. Once you have a task model, you can decide to put absolutely all of your code executed as tasks. Tasks may be independents or inter-dependent, the later implies ordering. Ordering can be ensured through various methods, like mentioning a list of tasks IDs that must be finished in the parameters of a task so that your task system may ensure a given order.

Or use task groups in which tasks are executed orderless, but groups are executed sequentially... I am not recommending a particular strategy.

Tasks are usually ran once per frame, they can be persistent (re-execute) across the game's life until they suicide (decision made from a game state change), or be ephemeral and live one frame only...

You can imagine having various enemies on the terrain controlled by IAs, and each enemy has its update-IA task, and ordering is more or less irrelevant depending on the severity of the update side effects. One side effect could be that if NPC monsters A and B are enemies to each other, and have very low lives level, the first that shoots its rail gun wins. Why should randomness decide the fate of that poor monster?

It could even be pushed farther, both monsters may actually only trigger their guns which would in turn be fired by the gun manager task when it updates later on. And the winning order now depends on gunner's update... But this would be true in a sequential update system as well. Except if that case is explicitely handled: for example using a separate collection of life points that is swapped after the end of the Update.

There we go, we just covered another common paradigm of concurrent programming: data separation, in this case double buffering.

Double buffering is used for example in mesh animation to allow a new set of vertice positions to be passed to the graphics card, as the GC is actually using the old set right now, so we don't have to wait until the GC is finished.

Footnotes

Careful about what some people say. For example: Concurrent Programming in Game Engine mentions "any task should be a thread". It is clearly false, what will you do when you reach 100 tasks or 1000 even? Some OS may kill your process directly, some antivirus as well, or you may stall the whole machine. Not even mentioning that 1000 threads means 1GB of stack space. That is why Fibers exist, you should try to avoid context switching and going back into kernel mode (e.g. by yielding), as much as possible. Try to keep one thread<=>1 CPU, and if possible, disable migration. Linux has a virtual file to fiddle with migration cost : /proc/sys/kernel/sched_migration_cost.
Then virtualize threading using your task system and Fibers (cooperative). Of course, if your threads are not busy 100% of the time of one frame, you may still spawn more threads than physical CPUs, up to the load ratio. If you estimate your load ratio to be 33% (of a frame) per task, you may still spawn 24 threads if you have 8 CPUs. But I would advise you stop there. Because concurrent thread activity also wakes up radically different parts of used memory and therefore every context switch also costs a part of your warm cache to be flushed. (The term "warm cache" is actually used in the literature)

Also, this website mentions that you should have an input thread, but makes no mention of the problems that will arise since you cannot create your GUI in another thread than the input thread. I recommend you read this article about keyboard input, it is enlightning.

Another shortcoming of this article is the mention of crash. "Crashes may be intercepted more easily if isolated in a thread". Maybe he talks of SEH? In my experience it doesn't recover everything, and it may even less recover from driver problems. Just let your game crash if it must. It is not a nuclear missile guidance program. Also check the ship with asserts pattern.

Rendering command lists


Parallel construction of command lists are a thing introduced by DirectX11 which is forseen to be a chance for acceleration of a game engine. DX10 already introduced multiple improvements in the API so that it can perform cheap redundancy checks and create constant buffers (uniform buffers) automatically, which helped performance quite a bit already. DX11 pushes further by allowing multiple threads to prepare their graphic commands. You must not rely on the order on which they will be merged in the end, but you can alleviate that in case you feel your engine is heavy on API calls.

Note that since the beginning, OpenGL was supposed to be multi-thread as well, in the specification it mentions that a context is thread specific, unless mentioned otherwise through state sharing with a "shared context".

Nobody implemented that correctly, Apple had the closest thing, they even proposed to let you build the thread-safe or unsafe version of the contexts.

Actually it is even so bad on some drivers (*cough* ATI *cough*), that just creating and working on a context from another thread than the mainthread can cause issues, ranging from un-understable spin waits (causing sloppy framerates while eating 100% of a core), to random crashes when using old functions like picking or pixel reading...

Heavy computations


Those are the most futuristic and interesting things to do for rich games. It includes physics computation mostly, notably interactive indirect lighting solution computations, fluid simulations, soft body deformations and collisions, other world simulations like plant growing. Or IA for wildlife, or massive environments like cities populated with tens of thousands of actors...

Rendering

Realistic rendering still takes a lot of power these days, for example the most realistic renderers we know to date are path tracers. Path tracing still takes a few seconds to converge to a stable solution even in the fastest products. Like Octane Render when running on multiple graphics cards. But it is realistic to envision a game that makes use of a hybrid between rasterization and raytracing. I actually presented (and developed) such an engine at Siggraph 2010 in the booth of Caustic Graphics, as a demo of their OpenRL library.
OpenRL is an API that aims to allow real time raytracing, and the dedicated hardware that Caustic develops is actually allowing that.

And there are other examples.

Raytracing is typically an algorithm that parallelize at 100%, so the most classic implementation uses a pool of as many threads as the machine has CPUs, and then processes the rays sequentially with the most spatial coherency as possible to improve cache usage (that is the big idea of grid marching, the ancestor of cone tracing).

Ray intersections are used in games for logic like bullet impacts or other IA and collisions, obstruction detection, visibility etc. If even really many of those intersections had to be ran (say to determine particles visibility), it could be parallelized without much concern (appart from ensuring that the scene data doesn't change while the intersections are made).

Updating complex partitioning structures is also possible in parallel, this paper does it for BVH.

Thread pools

The most classical implementation of multiple worker tasks are out of thread pools. More or less manually managed, some libraries help in using them: thread pools.

Typical usage intersects strongly with cases where OpenMP can be used, let's take the example of treating an image. In parallel, image treatment is cut by regular slice of lines, if the machine has 2 CPUs, and the image is 128 pixels high, then we will have a thread for the 64 first lines and another for the 64 last.

c.f. thread pool pattern

The spawner thread can either participate (so take care of the first group) or wait until both workers have finished. Finish notification must either use a facility of the pool library or a shared counter accessed in pair with a mutex for protection and coherency. Each time this counter is updated, a notification should be sent through a condition variable to wake up the master thread.

Or simply, finish the work using a barrier, you just have to specify that 2 threads are working so you are waiting for 2 threads.

Look at this example of the pool library, the destructor hides the joining operation to make the usage super neat.

Compute shaders

Compute shaders have been used before in computations like optimization of shadow map cameras according to a pixel perfect discovery (reduce operation) of min and max of the viewable depth. They are useful because they can share data between the rendering API (DirectX) without the need for copying buffers up to CPU and back to the graphic card.

This is a facility also offered by OpenCL in cooperation with OpenGL. OpenCL is difficult (compared to say CUDA) in practice because of a limited spec, and driver bugs. Not only that but it is also difficult to balance the allocations in the working groups to match the machine that is running your kernel the best as possible. Since your algorithm is not always that flexible and the possible hardwares are very diverse.

Libraries like Thrust are here to help working with kernels though, and we may surely expect more of those in a near future.

OpenMP

OpenMP is a super great tool, that allows for execution of parts of loops across the available processors. A little program that demonstrates the easiness. And just another link because I'm very gentle to do the google search for you :)

Now, to the issue : be careful using OpenMP because of some shady parts in the specification, Microsoft has chosen to never kill the pool of thread they use to run OpenMP workers. It is not a problem if you run OpenMP from the main thread all the time, but it becomes a serious problem if you start to run OpenMP from another thread, especially if that other thread tends to be a task!! Because in task systems your threads may be created per frame basis and thus your program will die after having leaked so many threads that the system cannot bear it anymore.

Solution: don't use OpenMP (in that case), or use a static thread that executes all of your OpenMP jobs and make an "Invoke" pattern.

Invoke


Since we are talking about it, let's pursue. The invoke pattern, used in C# for example on Controls, and gives a function (BeginInvoke) that takes a closure to execute later from the UI thread. This comes in handy a lot of times. In games you may want to differ work for later, or you can have threads that posts log messages but that you cannot display on your console window right now because that would access it from another thread than the GUI thread. Or really a lot of other reasons. Like your task system may have finished something, and wants to register another task for later, it can submit a closure to an invoke system.

Some such systems would execute pending closures on the first Idle event (message from the OS), games may never rely on that event. So we are better to decide when is "Idle" by other heuristics. In real time systems we have the chance to be updated very often so some Manager could take care of this decision using framerate information, or some internal profiling, or a quantity of CPU idling waiting for VSynch...

In any case, an invoke system should provide a function to uniquely name call site, so that when you want, you can specify that a call is to be made only once, even if some older call of the same call site is still pending, because you don't care anymore about that old call.

When you execute closures, you must also be very careful of recursive invokes. This must be allowed, therefore you are better to use a dual buffered pending closures container (that you swap before invocation).

In C++03 you want to use boost::function and boost::bind, together in an boost::unordered_map for instance makes a great Invoke system.

Support


Hardware


Multi cores

Multi-core systems, or multi-processor systems, are the reason why using multithreading is interesting. Otherwise threads have very very little benefit over a plain big-loop-updates-all pattern, and the price of the danger that comes with them is just not worth it. On uniprocessor systems, threads can be good for asynchronous waits, IO stuff, network stuffs and sound stuffs. But frankly other than that just using them because we think it will create a neat isolation of features is a risky bet. For that we're better off with decoupling.

Thankfully, AMD and Intel are there (ARM also now) and served us with awesome truly buyable-by-all multi-CPU systems, and gives multithreading a true raison d'être for performance.

AMD

AMD has an edge for multi-core performance on their Opteron line, because they actually feature NUMA nodes. This is awesome because it means that not only can you perform calculations (the core of the CPU) in parallel, but you can also access the main memory in parallel thanks to a switched fabric bus. This is partly thanks to the fact that AMD adopted on chip memory controllers from a long time ago.
Potential downside is higher latency when accessing to the memory allocated by nodes attached to other cores.

Intel

Intel only has come to imitate Opterons-like buses awsomeness from the Core i7 line thanks to the QPI link.

Now that you know that, you can understand that while you make your concurrent algorithms, you need to make sure that the working set of memory that the algorithm will access is within the limit of the L2 size otherwise you'll get terrible performance because of the bus serializing all your access.

Unless you have an Opteron or an i7 in which case your working set can be much larger, thanks to NUMA model.

Hyperthreading

Intel has Symmetric Multithreading, which means that if you run 2 threads on a single core with HT, when one thread is stalled because it is waiting for a value in memory, the other thread runs; effectively filling the waiting gaps caused by memory latencies.

You need to be aware that in Core i7 architectures, one core has 2 sets of TLB caches; which is grand because it means that one thread does not distrub branch prediction and does not "cold cache" the other thread.

If you tend to do tight compute heavy algorithms with little involvment of the L2 cache, then you may not benefit from HT at all because HT works only if one thread stalls from a LOAD or a misprediction.

MESI protocol

The MESI (Mutual Exchange Shared Invalid) protocol is super-duper important to know, because it is how cores communicate the values of variables between each other. On x86 levels of memory are neither inclusive nor exclusive. Which means that a variable may exist in L1, L2 but not RAM, or only in L1, or everywhere in L1, L2, L3 and RAM, or just in RAM. For coherency in time, MESI exists. When a variable value is updated, if ever a cache-line representing the same memory address exists on another core, then this cache line is marked invalid. When the core hosts of an invalid line wants to access that data it must first fetch the latest data from the core where the value is. This obviously cost some latency, and therefore the classic mistake not to make: Thinking that different variables are freely accessible by separated threads. If you make an array of 8 int64_t and launch 8 threads and make each one work on one of the value in the respective index of that array, you will get horrible performance, much worse than the equivalent one-thread loop actually. Because your execution is effectively serialized by MESI. One cache line is 64 bytes, maybe 128 sometimes. Thus you need to pad your array with at least 15 dummy int64_t between each variable to be sure to keep a smooth parallel access.

Store buffer and compilation


The Load buffer and the Store buffer are little pipelines between the registers and the caches, they hold the values "pending" to write or to read before/after a computation is over. Because of compilation, the result of a computation may never be granted a STORE, because the compiler sees no dependence on that variable. You need to use the volatile keyword to force compilation to end any computation with a Store instruction.

Also bear in mind that when pending variables are in the write buffer, they are invisible to other cores. You also may not rely on Store orders because of compilation re-ordering and processor instructions re-ordering (which are two different things but both make problems). If you have seen a state from reading a variable set by another thread, you cannot suppose that "this other thing is necessarily over" because you wrote it before!

// thread 1
myThing;
bOver = true;

// thread 2
while (!bOver);
// yeah myThing is done now !

That is a beginner's mistake.

First there is the obvious, bOver is useless (by static analysis) so the compiler doesn't store it. In thread2 it doesn't Load it so the loop is just an infinite false, that the compiler will warn you about.

You need volatile to solve both problems.

But, by static analysis, bOrder has no dependence with myThing, therefore it is free to do the Store before even doing myThing.

To ensure order, you need to force your compiler to break its dependency static analysis by calling a non-inlined function between myThing and bOver = true. (xBox's compiler has a specific "compiler barrier" intrinsic : _ReadWriteBarrier)
Note that this might not be enough, LLVM for instance has internal optimizer-related flags to 'colorate' functions; some of which are related to detection of side effects.
That is one reason more why you shouldn't try to do lockless programming, if you are not aware of that. At least using locks, the potentiality of falling into a 'trap' (kernel call) will break the optimizer certainty about side effects, and ensure correct ordering.

And finally, to ensure order on the CPU level, you need to call a Store Memory Barrier before bOrder = true to finish whatever myThing did. To call a memory barrier, either your compiler can propose an intrinsic, either you can call a dummy locked inc or dec using an assembly directive.

And that is why all of that is cumbersome, that most people simply use mutexes+locks which happen to have all of what I described above!

Microsoft-specific

Be very careful, Microsoft has a special implementation of the volatile keyword in its compilers, it is more than standard in the sense that volatile reads and writes are also paired with memory barriers. And we can expect that the compiler is making no assumption on dependencies when a variable is marked volatile.

Don't let yourself be fooled by that MSDN example that is completely not standard and not expected to work on another compiler. To be really working everywhere it would need the both additions that I mentioned above. Or simply to use a lock.

OS


The OS gives support for threading through its task scheduling system.
The scheduler has a variety of entry points : it kicks in at process startup and process end, when a thread enters or exit sleep state or wait state, or when the periodic kernel tick interrupts.
The tick is the only system through which pre-emption is possible. On windows, a normal task starts with 6 points and looses 3 every tick (15.6ms) thus is supposed to relinquish its slice after 2 ticks. I/O disturbs that because tasks gets interrupted, execute some code that is not their own, and all this time is counted as "their fault". The Vista scheduler has some fixs about that, and the linux has an actual "Completely Fair Scheduler".
Note that altering the periodic tick timer rate (with beginTimePeriod) will not change the frequency of when the scheduler executes to determine slice expiration. This is because the routine that is serviced at tick interruption checks that a full 15.6ms time slice passed before executing the pre-emption system. So by actually reducing tick period you may rather introduce an aliasing (by using periods that are not multiple of 15.6) and slow down your system because of slower task switching.

Locks

The implication of that, is that when trying to acquire a congested lock, the thread will enter a waiting status, and therefore the OS scheduler will switch context to another thread, at this very moment. Basically your calls to the mutexes/critical sections functions (Enter/Exit or scoped lock system or whatever) are literally calls to kernel scheduler procedures. This actually is a chance because it makes things really reactive. If a lock is acquired, we might as well run the thread that owns it the quickest as we can. However, at release time, the thread in wait state will be rescheduled faster if using condition variables.

Other Mistakes


I'll throw some random things.

double checked locking pattern


A good link is much better than a re-heated speech. I'll let you read this one peacefuly: http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

It says basically, don't do:

public Helper getHelper()
{
    if (helper == null)
    {
      synchronized(this);
      if (helper == null) 
         helper = new Helper();
    }    
    return helper;
}

You all recognized the singleton pattern, well when we want to make it thread-safe we need to lock its access, though we often think (wrongly) that locking is too slow and there comes the double-checked-locking. The idea comes from the fact that most of the time we will have the first test false so we avoid the lock. Please check the link to see why this assumption fails.

thinking that volatile solves everything


Well we saw that one before right? To rub it in even more, a little general knowledge link about it: http://www.bluebytesoftware.com/blog/PermaLink,guid,dd3aff8a-7f8d-4de6-a2e7-d199662b68f4.aspx

And it actually gets worse than just that: Volatiles Are Miscompiled.

thinking that atomic operations are fast


If you knew that a memory barrier was a risk of CPU stall and a potential slowdown, then you need to know that atomic operations are worse. Compare and Swap is one of the slowest. Most implementations of memory barriers are actually implemented in terms of dummy atomic operations (lock-prefixed instructions like xchg/xadd/xinc).

thinking that boost shared_ptr is thread safe


It is not; because it has to manage 2 reference counters (weak and strong) which are not updated atomically and thus coherency is not guaranteed.

Lock-free


There has been a buzz about lock-free algorithms in the past 5 years or so, and somehow since then everybody is all about trying to do lock-free stuff. Man, you should be glad to have multiple cores in the first place, rather than thinking about grabbing the last 1% of performance there is to grab, by investing 500% of the time it would have taken doing an algo lock-full. If your algorithm would benefit more than that by being lock-free, it is a clear sign that it is congested.

OS and kernel systems may not have the choice, and everyone depends on them, thus taking the time to develop lock-free stuff makes sense.

Now, the reference paper to read before pursuing the conversation: Andrei Alexandrescu´s "Lock-Free Data Structures"

Surely you noted:

In fact, there must be around half a dozen of lock-free programming experts around the world, and yours truly is not among them.

Is he being condescendent assuming any reader is just not intelligent enough, or is he talking about himself as "our" since he is serving us with his text ? I´m not fluent enough in English to decide that. But both ways convey the same message; because even were he talking about himself, this guy is still a clever lad, he authors the C++ language after all (yeah just that). Well not alone, but still.

Another quote for the road:

livelock—each tries to dodge the other’s locking behavior, just like two dudes in the hallway trying to go past one another but end up doing that social dance of swinging left and right in synchronicity. We humans are pretty good at ending that with a laugh; processors, however, often enjoy doing it til rebooting sets them apart

I love that way of putting it, so that everybody can understand. So we have a definition of livelock which is pretty good for my article. Also we have one of the reasons why lock-free algorithms are good. They have a lot of merits like being deadlock-less and livelock-less. They also constitute more robust system by avoiding freezing systems that would wait for somebody who is dead (Thread killing immunity). And a bunch of other cool stuffs that you can read about in the paper.

Conclusion

Don´t try to go lock-free, if you end up with explicit memory barriers every two lines, what did you gain? You may even one day realize that you re-invented the mutex !
Also, it is not because you are lockless that you are never preempted. After that we can add the problems of spin-waitings (battery drain, slice hoging on schedulers that are not completely fair), and deterministic memory freeing headaches...

Being afraid of locking


There is a legend out there that freezes the blood of every performance programer saying that locking is expensive. I'm not sure where it comes from, maybe when a non contended lock was making a trip in the kernel...

Anyway, today locks remains user space when non contended (e.g. primitives like Microsoft CRITICAL_SECTION). I have timed on MacOS with a colleague boost::scoped_lock on a mutex to execute in the time equivalent to perform 4 integer operations. Of course that is for congestion 0. If your thread has to wait, it will make a trip to the kernel to let the scheduler do some decisions. So you should be afraid of congestion rather than the lock itself.
From this msdn article : "Lockless algorithms are not guaranteed to be faster than algorithms that use locks".

Always profile your programs, don't early optimize, especially with multithreading, where correctness is delicate.

Synchronizations


Sleeping


Sleeping is a system call that blocks your thread for a specified amount of time.

Posix documentation of sleep, and Microsoft documentation of Win32´s Sleep.

You may use sleep on threads that are not dedicated to listen to events served by the OS. If you block for more than a few seconds between two calls to a command that reads OS messages (like GetMessage or PeekMessage), the OS will consider your application as frozen and propose to the user to kill it. So prefer to sleep only secondary threads, and be careful about priority inversion.

Now when sleeps are involved, people often think the time actually elapsed will be precise. Big mistake. The Win32 documentation super-clearly mentions that the sleep time will be rounded to current OS Tick period. And that is even supposing that your thread will be rescheduled at all. In fact, it will be moved from the waiting queue to the ready queue. This barely means that after the Sleep period the thread is runnable. If no other thread with stronger priorities have to run now, then you may run now.

Ticks

A tick is the base time period at which the OS will run its scheduler code. That code is installed in the timer interrupt handler of the CPU. There are various timing hardware in most computers, one of which is the RTC which should be implemented with a quartz to be a little precise on the long term (month). This is the ticking device used to maintain the clock (system global time). Unfortunately it is ticking quite rarely, which will cause our problems.

Ticks are a super important concept to be aware of as game developers. Because the period with which we´d like to refresh the graphics is usually 16.6ms or 32.3ms. And this just happens to be in the order of magnitude of the classic kernel Tick : 15.6 ms.

Timers

The function GetTickCount of Win32 now makes all its sense. It is imprecise because like the name says, it is based on Ticks. That is why everybody recommends to use rather a performance counter. The thing is that these sources are good for different things, keep track of small elapses or long elapses. I will refer to this excellent article for future reading.

With that problem in mind, architectures has evolved and RTC has been superseded by HPET. And the kernel tick has been made configurable between 1ms and 15ms. Through the multimedia libray (timeBeginPeriod). Note that 1ms was a default build choice for linux kernels for a long time (option CONFIG_HZ set to 1000), a value that Ubuntu prefered to set at 250. Little note for the lulz: flash is known to set the tick at 1ms, of course it is system wide, so your battery may drain when you have a browser opened.

Tickless

Now kernels are tickless, since windows 8 (2012) and linux 2.6.18 (2006) more precisely. This means that the OS wakes up only when it is needed. In linux 2.6 the tickless kernel is implemented as dynticks, meaning that the periodic tick still exist most of the time, but is disabled when there is no activity. Or scheduled to restart after the end of the shortest timer of all the sleeping applications. Kernel 3.1 will use a new full dyntick system where the tick is also disabled in case of only one task working to allow longer batches without interruption.

The tickless system was introduced (mostly by Intel) in order to save power consumption, and make the most out of advanced speed step CPU sleep modes.

Coalescence

To further save battery, (from) Windows 7, timers may be Coalesced. I´ll let you read about that if you are interested. Basically it means that you can setup an error margin on your sleep times, so that the OS may group a lot of timers on the same kernel tick, and disable the periodic tick for the rest of the time.

Conclusion

Do not rely on Sleep even on tickless kernels, it will rarely have the needed precision in games, and you risk yielding your thread for a moment as long as your frame !

On windows there is no microSleep or nanoSleep function like in linux, but I have no idea how these functions are implemented. I timed (using performance counters) the usleep function on an idle system, it respected the requested sleep periods very closely. Even for sleeps of tens of microseconds only. But I believe from some old documentation that these short period sleep functions may use spin waiting. Another way to manage true µSleep would be by programming the HPET for an interruption on the fly. But with the many ACPI issues that chipsets has, it seems hazardous.

Wait on condition


Condition variables are the smallest concurrent synchronization primitives that allows very efficient wait and wake up.

They are accessible through the excellent boost::thread library. Or manually since Windows Vista.

CS work only in pair with a mutex (a Win32 critical section object), so that it can release and reacquire it on the Wait call.

Typical mistakes with CS

Not checking condition before sleeping

If you go straight to wait before even checking the condition, then you risk the chance of missing the signal. Waiting and Signaling happens at the end of the race of the respective threads calling them, therefore you must always suppose both outcomes of the race-condition. Missing the signal in this case can be very severe because it could mean freezing the thread forever.

Not looping

There is a pattern to use a CS, you need to respect:

boost::unique_lock<boost::mutex> lock(mut);
while(!data_ready)
    cond.wait(lock);
// pursue

If you do not loop, then you have no guarantees about the state of your condition. You need to check it again of you want to do something that relies on it. If you just wanted to be wake up then that's fine without a loop though.

You have no guarantees because in the wait() call after having been notified by a signal, at the moment the lock is reacquired, some other thread may have acquired the lock already and modified the condition in between.

Another source of mistake is believing that your thread cannot escape from the wait() call without a pair notify. It is false because of spurious wake ups. They happen because CS are made to be specially fast and for that, sacrifice consistency to performance.

Signaling in the scope of the lock

Your data provider thread will emit a signal to wake up the waiter. Do that after you are finished with the data and you have exited the scope of the lock (released the mutex). If you signal your waiter thread while holding the lock, how is the poor waiter supposed to acquire the lock to pursue its job? Fortunately, even in this case your waiter will not be stuck in limbo until the next signal, because it has already moved on to the lock-wait part. Therefore the problem here is one of performance only, you will avoid some context switches by signaling from outside locks.

This is where C# is disturbing, don´t let yourself be fooled by the Monitor. Pulse function, that does nothing, stuff is actually executed on Monitor.Exit.

Misusing timeouts

Timeouts are very important to avoid deadlocks and bring robustness facing error scenarios. The problem is that because CS has to be checked in loops, you need to work with absolute times in order to pass to the wait function the remaining amount of time until the timeout. If you just were to pass a simple constant "10 ms from now" every time you re-enter the loop you grant the wait a new budget of time for free! You may even never go out of your waiting loop using that. Boost documents that perfectly, and recommend the system_time::now + period method.

CS Alternative

To avoid lost time due to execution yielding, sometime spin locks may be an option. In the case of true multi core systems, and for very short periods, if the produced resources bursts in quickly, then the consumption may as well be fast. A spin lock is implemented using a try_lock and looping over it until acquired.

Interrupting


We all long for a radical method like KillThread; which actually may even be available sometimes. However with delicate threads, radical manners are rarely suited.

The typical pattern to stop a thread is by using interruption points. This system is exactly cooperative scheduling, some agent is sending an interrupt request by setting a flag like exit_when_you_can to true. And the running thread must regularly check for that flag and do something then (e.g like exiting, that´d be great heh).

Yielding

Even though preemptive schedulers are the norm, we do encounter code that uses manual yield points. Like seen in boost::yield, or Sleep(0). Note that calling yield will be a NoOP if there is no thread in the waiting queue for your CPU. If you wish to save battery by doing that, you're doing it wrong. In this case use real sleeping states using a Wait on condition, or a timed sleep with at least one kernel slice of sleep time. Usually non real time applications uses "Wait on event" pattern (Win32 famous GetMessages function), which happens to be a pattern that is perfect for saving battery life.
In games, the only moment when the render thread execution can breathe, is when waiting for the VSync signal. This will be handled by force when using the buffer swap function. Note that Android mostly doesn't support VSync, nor does it support any setup of how many frames the CPU is allowed to have in advance (swap interval).

Conclusion


For further reading


C++11

I'd like to be able to cover promises and futures which are a concept taken straight from lisp (in a paper from 1992) that has been introduced into C++11, but I have no experience at all and thus let the reader look for these if interested.

Apparently newer versions of boost start to integrate this concept into C++03 as well.

thrust

The nVidia C++ library thrust (developer.nvidia/thrust) has the potential to cut costs drastically in parallel compute-heavy jobs developments, I'd like to be able to cover it but I know naught about it.


Article Update Log


15 June 2013: various enrichments + little correction + OS paragraph
03 May 2013: wrote forgotten paragraph (last) in other mistakes.
03 May 2013: Initial submission
17 Apr 2013: Start Draft



About the Author(s)


Vivien Oddou (Lightness1024!)
Graduated from ENSEIRB (CS engineering school). Author of Extreme Carnage and Carnage Engine, or Nuclear Age (http://projets.6mablog.com/)
Ex employee of Etranges Libellules (Lyon).
Done research at ISIT (Fukuoka) for HPC.
Was 4 years in e-onsoftware (Paris) working on Vue Infinite and LumenRT.
Now employed at tri-Ace (Tokyo) as R&D.


License


GDOL (Gamedev.net Open License)




Comments

Talk about a gold mine of useful links. Pre-organized. And pre-filtered. This must have been a lot of work. Seriously, thanks!

I'm not finished reading it all, but regarding the volatile statement & Microsoft specific page:

The documentation in MSDN is poorly written and I can't remember where this was stated more clearly (at least, people are writting warnings in the comment section).

 

Post-2003 (VS2005 onwards) the volatile in Visual C++ platforms don't issue a Hardware Memory Barrier, just a compiler Mem. Barrier. That means the compiler won't reorder instructions so that any read/write that should happen before an operation to volatile objects doesn't come after it, and any read/write instruction that comes late isn't generated earlier.

This affects the order of instructions in assembly code. That's all.

 

However the CPU's Out of Order Execution unit (OoOE) may still reorder those instructions; therefore an explicit Hardware Memory Barrier is needed.

 

 

GCC on the other hand, doesn't do MS' guarantees, and only volatile objects are ordered relative to other volatile objects. But non-volatile operations that were placed in code after, may generate the respective instruction before.

In GCC's case, you'll need a compiler memory barrier AND a hardware memory barrier.

AFAIK GCC's compiler mem. barrier is:

asm volatile("" : : : "memory");

@Matias Goldberg: Were you refering to this article by Bruce Dawson? It's very thorough and well-written, and has a small chapter about volatile variables and reordering.

Interesting article.

 

For the synchronizations, something that is notably missing is the interlocked variable.  It requires hardware support, and the x86 family includes it.  Interlocked variables provide the proper barriers to enable sharing integers safely.

 

 

Secondly, what you list as multimedia as a use of multitasking I would classify as simple I/O.  Use asynchronous calls to the disk, asynchronous calls for audio, asynchronous calls for video, asynchronous calls for networking, and so on.  Any time you need either input our output, use asynchronous calls if possible.

One note regarding display and non-CPU computation (e.g., graphics accelerators) -- while Amiga is historically important, it was by no means the starting point:

http://www.techspot.com/article/650-history-of-the-gpu/

Something from the article which is accurate yet misleading and could deal with some further information.  In the section talking about mutex performance it discusses user space mutex/critical section and a 0 congestion level.  That is fine of course but the cost of mutex which scares most people is the cost of thread context switching, not the mutex itself.  Linux/BSD have always had pretty fast context switching but even there it adds up quickly in any case of contention.  Window's, until Vista, was horribly slow in this case and still is not as fast as *nix variants though it is getting better.

 

Again, I don't think anything is wrong with the discussion point, just the discussion stops before getting to the real reasons for the justified concern.  People should remain afraid of most synchronization primitives, not for their individual cost but for the aggregated cost when kernel transitions, thread context changes, wake from idle latency and related are considered.

The bit on lock-free only addresses the difficulty of writing lock-free algorithms, but not the pros/cons of using existing ones. Most threading libraries will come at least with a lock-free queue structure these days, which takes no advanced knowledge to use.

It would also be nice to see a mention of the theory behind the phrase 'lock-free' -- i.e. if the OS puts any of your threads to sleep at any time, this cannot cause any other thread to stall. For example, with a mutex-based queue, if the OS puts a thread to sleep while it is holding the insertion lock, then no other thread can insert items until the OS wakes up the original thread so it can release the lock.

 

The other thing I'd like to see is a mention of wait-free algorithms, as these are usually simpler, easier and faster than both lock-free and mutex-based ones, where they apply (such as in embarrassingly parallel algorithms, which also aren't mentioned).

Thank you guys, all of your comments are incredibly constructive and definitely helps completing the article. I considered editing to integrate some of the info added, but I often found that my original phrasing was enough after going back on commented sections.

Also I totally love the link posted by tivolo, I recommend ! (for the people who read the comments :) )

@Hodgman: there aren't mentioned because I have embarrassingly never heard of that term ! But I looked it up and feel its only a terminology detail, but it would make a lot of sense regarding a talk about wait-free.

Though, again, this is not a case where I feel to Edit for that much, because the link I posted about lock-free (from Alexandrescu) talks clearly about the distinction with wait-free and gives good definition. So the information is there.

@frob: your comment made me wonder the most, I thought I talked about "interlocked variable access", which is most commonly refered to as "atomic operations" by language abuse. But when looking back, its indeed only briefly mentioned, that would need a little edit but the link posted by trivolo is so good that it fixes the shortcoming.


Note: Please offer only positive, constructive comments - we are looking to promote a positive atmosphere where collaboration is valued above all else.




PARTNERS