Multithreading vs variable time per frame


About using volatile: I thought that was necessary. I thought it was there to make sure any writes to a variable get written to memory immediately, rather than allowing the optimiser to cache intermediate values in a register.


In some cases, it literally does nothing. In other cases, it doesn't do nearly enough. Most of the time, volatile just causes needless de-optimization without solving any real problems.

You need more complex memory barriers to ensure proper ordering of reads and writes (especially since either the compiler or the CPU itself can do instruction reordering or memory store/fetch reordering). For some types of values or some CPU architectures, you also need explicit atomic instructions just to ensure that other threads don't see half of a variable change (not a problem most of the time on x86, but it can be on other architectures).
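As a rough sketch (not from the original post) of the "half a variable" problem: on a 32-bit target, a plain 64-bit value may be written as two separate stores, so another thread can observe a torn value, whereas a C++11 std::atomic is always read and written as a whole.

#include <atomic>
#include <cstdint>

// Shared between threads. On a 32-bit CPU a plain uint64_t may be written as
// two separate 32-bit stores, so a concurrent reader can see a "torn" value
// (new low half, old high half).
uint64_t plainValue = 0;                      // unsafe to share without a lock
std::atomic<uint64_t> atomicValue{0};         // always read/written as a whole

void writerThread()
{
    plainValue = 0x1122334455667788ull;       // may be two stores on 32-bit
    atomicValue.store(0x1122334455667788ull); // never observed half-written
}

uint64_t readerThread()
{
    return atomicValue.load();                // always a complete value
}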

Atomics need C++11, don't they?


If you want to write purely ISO-conforming C++ with absolutely no extensions or libraries, sure. GCC and Clang provide intrinsics and other extensions that support atomic values portably across different OSes and architectures. Many game libraries provide their own platform-neutral APIs for threading and atomics (and some even offer higher-level abstractions that are very useful). You can very easily use threads and atomics portably across Linux, OSX, Android, Windows+MingW, Windows+VC++, iOS, etc. using pure-C APIs like SDL2.

https://wiki.libsdl.org/APIByCategory#Threads - SDL threading/atomics support, and I'd guess that you're probably already using SDL (or something equivalent) anyway
https://www.threadingbuildingblocks.org/ - Intel Threaded Building Blocks, which is a high-level concurrency library that supports Windows and Linux
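For anyone wanting a starting point, here's a minimal sketch of SDL2's C-level thread/atomic calls (SDL_AtomicSet, SDL_AtomicGet, SDL_CreateThread and SDL_WaitThread are real SDL2 functions; the variable and function names here are made up for illustration):

#include <SDL.h>

// Shared flag, written by the worker and polled by the main thread.
// SDL_atomic_t wraps an int and is safe to access from multiple threads.
static SDL_atomic_t workDone;

static int workerMain(void* /*userData*/)
{
    // ... produce results ...
    SDL_AtomicSet(&workDone, 1);              // atomic store of the flag
    return 0;
}

int main(int argc, char* argv[])
{
    (void)argc; (void)argv;
    SDL_Init(0);
    SDL_AtomicSet(&workDone, 0);

    SDL_Thread* worker = SDL_CreateThread(workerMain, "worker", nullptr);

    while (SDL_AtomicGet(&workDone) == 0) {   // atomic load of the flag
        SDL_Delay(1);                         // do other work / wait politely
    }

    SDL_WaitThread(worker, nullptr);          // join the worker thread
    SDL_Quit();
    return 0;
}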

Sean Middleditch – Game Systems Engineer – Join my team!



In some cases, [volatile] literally does nothing. In other cases, it doesn't do nearly enough. Most of the time, volatile just causes needless de-optimization without solving any real problems.

You need more complex memory barriers to ensure proper ordering of reads and writes (especially since either the compiler or the CPU itself can do instruction reordering or memory store/fetch reordering). For some types of values or some CPU architectures, you also need explicit atomic instructions just to ensure that other threads don't see half of a variable change (not a problem most of the time on x86, but it can be on other architectures).

The important point for me was not why I shouldn't use volatile, but why I didn't need to. I didn't understand that function calls act as memory barriers, ensuring that any variables cached in registers are made consistent with memory at that point.

I am using SDL, at least for the PC versions, but I've had some problems with the Android version and will probably use Android's API directly for that. It advises strict caution when using its atomic support.

Function calls do not necessarily act as memory barriers. They might seem to, but there's nothing in the spec that says that they have to (and your compiler sure won't emit expensive fence instructions for each function call)...
Locking or unlocking a mutex does act as a memory barrier though ;)

What I read implies that POSIX demands that most function calls act as memory barriers. If the called function's definition is in a separate source file, the compiler can't know its contents, or whether it in turn locks or unlocks a mutex, so I think the only safe option is to assume that it might and act in a thread-safe way.

What I read implies that POSIX demands that most function calls act as memory barriers. If the called function's definition is in a separate source file, the compiler can't know its contents, or whether it in turn locks or unlocks a mutex, so I think the only safe option is to assume that it might and act in a thread-safe way.

Yeah, it's very likely that the compiler will act that way... but the problem is that your CPU hasn't been told about the fence!
Modern CPUs -- in the never-ending quest to find more performance by adding more transistors/complexity -- have the capability to actually re-order the stream of instructions they execute, and to re-order their reads/writes of memory. They're allowed to do as they like, as long as the results that they produce are the same in the end. This is only an issue with multi-core software.
In a single threaded program, whether you write "data=5; readyToReadData=true;" or "readyToReadData=true; data=5;" doesn't matter.
But in a multi-threaded program, this ordering really can matter -- the latter bit of code might mean that another thread sees the Boolean as true, and tries to read the "data" variable before the number "5" has actually been written to it.
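A hedged sketch of that exact situation in C++11, reusing the data/readyToReadData names from the post -- the plain bool is the broken version, and a release/acquire atomic flag is one way to enforce the ordering:

#include <atomic>

int data = 0;

// Broken: a plain bool gives neither atomicity nor ordering; the compiler or
// the CPU may make readyToReadData visible before data.
bool readyToReadData = false;

// Fixed: release/acquire on an atomic flag orders the write to data before
// the flag becomes visible, and the read of data after the flag is seen.
std::atomic<bool> readyFlag{false};

void producer()
{
    data = 5;
    readyFlag.store(true, std::memory_order_release); // "data" is published first
}

void consumer()
{
    if (readyFlag.load(std::memory_order_acquire))    // if we see the flag...
    {
        int value = data;                             // ...we also see data == 5
        (void)value;
    }
}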

Even if your compiler is nice enough to not reorganize your code, you still need to tell the CPU that it's not allowed to reorganize these crucial two steps -- the data must reach RAM before the Boolean does.
To ensure the CPU doesn't mess this ordering up, you need to use (expensive) memory barrier instructions at the right places. The compiler won't do this automatically for every function call because it would make your code 10-100x slower! It only emits these instructions where you ask it to - either automatically when you use mutexes/etc, or manually.

Doing this manually is often called "lock free" programming, but it's extremely dangerous and error-prone - you really need to be very familiar with the hardware architecture. Alternatively, proper use of standard synchronization primitives, such as mutexes, also ensures that CPU memory-barrier instructions will be placed in all the key places.

If you're doing shared-memory concurrency (where a mutable variable is used by more than one thread), either you use the standard synchronization primitives, or you use your expert knowledge of the CPU and memory architecture at an assembly level to hand-code the CPU-level synchronization instructions (using compiler-specific intrinsics, raw ASM, or the C++11 std lib or similar) and write some tests to be sure -- anything else surely has subtle bugs.
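For the "manual" route described above, a minimal C++11 sketch using standalone fences (illustrative only -- exactly the kind of code the post warns is easy to get wrong):

#include <atomic>

int payload = 0;
std::atomic<bool> published{false};

void writer()
{
    payload = 42;
    // Release fence: the write to payload may not be reordered past the
    // following store, as seen by a thread that performs the matching
    // acquire fence below.
    std::atomic_thread_fence(std::memory_order_release);
    published.store(true, std::memory_order_relaxed);
}

void reader()
{
    if (published.load(std::memory_order_relaxed))
    {
        // Acquire fence: pairs with the release fence above, so the read
        // of payload below is guaranteed to see 42.
        std::atomic_thread_fence(std::memory_order_acquire);
        int value = payload;
        (void)value;
    }
}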


In a single threaded program, whether you write "data=5; readyToReadData=true;" or "readyToReadData=true; data=5;" doesn't matter.
But in a multi-threaded program, this ordering really can matter -- the latter bit of code might mean that another thread sees the Boolean as true, and tries to read the "data" variable before the number "5" has actually been written to it.

Even if your compiler is nice enough to not reorganize your code, you still need to tell the CPU that it's not allowed to reorganize these crucial two steps -- the data must reach RAM before the Boolean does.
To ensure the CPU doesn't mess this ordering up, you need to use (expensive) memory barrier instructions at the right places. The compiler won't do this automatically for every function call because it would make your code 10-100x slower! It only emits these instructions where you ask it to - either automatically when you use mutexes/etc, or manually.

The ordering doesn't actually matter if you use a mutex properly. What I'm unsure of now is how to tell the compiler to insert a memory barrier. A mutex lock/unlock should do that, but what if you're not using the compiler's idea of a native threading API? E.g. SDL. I don't see any special attributes in SDL_mutex.h etc. Do programmers have to manually insert a compiler directive, which for some reason I haven't heard of in any threading tutorials?

Is inserting a memory barrier for every function call really such a terrible performance hit? The compiler can work out which variables don't need to be synchronised with RAM (local variables which haven't had a reference taken). Those that do would probably have to be saved on the stack instead anyway.

If you use any properly written mutex (SDL included), it will be performing the required barriers for you. Nothing for you to worry about (except ensuring you're always using the mutexes correctly!)

If you use any properly written mutex (SDL included), it will be performing the required barriers for you. Nothing for you to worry about (except ensuring you're always using the mutexes correctly!)

I should think SDL uses the Windows API on Windows and pthreads everywhere else. What concerns me is say I have a block of code like:

SDL_LockMutex(...);

// Alter some variables to be read by another thread

SDL_CondSignal(...); // Tell other thread to use those variables

SDL_UnlockMutex(...);

The compiler doesn't necessarily know in this scope that the SDL functions have to be memory barriers, because they're not part of its "native" threading API, so it can't apply a memory barrier here unless it assumes that all functions make that necessary. Conversely, when it compiles SDL and encounters what it does consider its native threading API, it can't work its way back up the stack to work out what needs protecting, unless it has some clever run-time trickery.

There are two types of reordering -- 1) the compiler can choose to reorder your code at compile time, and 2) the CPU can choose to dynamically reorder your memory accesses at runtime.

The function call out to the SDL library ensures that the compiler won't reorder your code (either because it has no idea what's inside that function at the time, or because it does know what's inside that function and sees a barrier hint).

Inside the function, there's an expensive lock/fence instruction, which ensures that the CPU won't do any reordering across that boundary at runtime.
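To illustrate those two layers, a small sketch using GCC/Clang syntax (what actually gets emitted depends on the compiler and target):

#include <atomic>

void compilerBarrierOnly()
{
    // No instruction is emitted; the compiler just may not move memory
    // accesses across this point (an opaque external function call has a
    // similar effect on the compiler, but neither constrains the CPU).
    asm volatile("" ::: "memory");
}

void fullHardwareFence()
{
    // Also constrains the CPU: on x86 this typically compiles to an mfence
    // (or a locked instruction), on ARM to a dmb.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}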

Also, in your example, the "other thread" would usually be using SDL_LockMutex(sameMutex) to ensure it doesn't start using those variables until the first thread has unlocked the mutex...
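To make the earlier snippet concrete, here's a hedged sketch of both sides using SDL's documented mutex/condition-variable functions (the shared variable names are invented for the example):

#include <SDL.h>

static SDL_mutex* lock;
static SDL_cond*  cond;
static int        sharedValue = 0;
static SDL_bool   valueReady  = SDL_FALSE;

static int producerMain(void* /*userData*/)
{
    SDL_LockMutex(lock);
    sharedValue = 42;              // alter the shared variables...
    valueReady  = SDL_TRUE;
    SDL_CondSignal(cond);          // ...then tell the other thread
    SDL_UnlockMutex(lock);
    return 0;
}

static void consumer()
{
    SDL_LockMutex(lock);           // same mutex as the producer
    while (valueReady == SDL_FALSE) {
        SDL_CondWait(cond, lock);  // atomically unlocks, waits, relocks
    }
    int value = sharedValue;       // safe: we hold the mutex and the flag is set
    SDL_UnlockMutex(lock);
    (void)value;
}

int main(int argc, char* argv[])
{
    (void)argc; (void)argv;
    SDL_Init(0);
    lock = SDL_CreateMutex();
    cond = SDL_CreateCond();

    SDL_Thread* producer = SDL_CreateThread(producerMain, "producer", nullptr);
    consumer();
    SDL_WaitThread(producer, nullptr);

    SDL_DestroyCond(cond);
    SDL_DestroyMutex(lock);
    SDL_Quit();
    return 0;
}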

Side-track question:

Considering the data decomposition approach, is it of any benefit to base the task scheduler on a work-stealing design in order to minimize mutex contention on the job queue? And if so, is it currently used by any game engine?

cheers
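The thread closed without an answer, but as a rough sketch of the work-stealing idea (hypothetical code, not from any particular engine -- real schedulers typically use lock-free deques rather than a mutex per queue): each worker pushes and pops jobs on its own deque, and only touches another worker's deque to steal when its own runs dry, so the common case doesn't contend on one global job queue.

#include <deque>
#include <functional>
#include <mutex>

using Job = std::function<void()>;

// One of these per worker thread. Illustrative class name, not a real API.
class WorkStealingQueue
{
public:
    void push(Job job)                    // owner thread: push to the back
    {
        std::lock_guard<std::mutex> guard(mutex_);
        jobs_.push_back(std::move(job));
    }

    bool pop(Job& out)                    // owner thread: pop newest from the back
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (jobs_.empty()) return false;
        out = std::move(jobs_.back());
        jobs_.pop_back();
        return true;
    }

    bool steal(Job& out)                  // other threads: steal oldest from the front
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (jobs_.empty()) return false;
        out = std::move(jobs_.front());
        jobs_.pop_front();
        return true;
    }

private:
    std::mutex      mutex_;               // a real scheduler would use a lock-free deque
    std::deque<Job> jobs_;
};

A worker loop would try pop() on its own queue first and only call steal() on the other queues when that fails, which is where the reduced contention comes from.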

