• Create Account

## Multithreading vs variable time per frame

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

29 replies to this topic

### #1realh  Members

338
Like
2Likes
Like

Posted 29 April 2014 - 06:52 AM

Most existing game engines seem to be based around a monolithic single-threaded main loop which serially handles input, AI etc and rendering. I think separating rendering from game logic with threads has major advantages:

Rendering has the biggest influence on performance. With a single thread the game logic has to be able to cope with being polled from anywhere <20 times per second to hundreds, and that can be difficult to manage in some games. With a separate game logic loop you can set it to "tick" at a certain rate independently of the achieved rendering frame rate, and in most cases assume the system can guarantee that frequency unless the whole thing is so bogged down that slowing down the gameplay is acceptable or even advantageous.

The cons of multithreading are the notorious difficulty of thread-safe programming and tracing bugs when you get it wrong. You have to use mutexes etc to make sure the rendering thread doesn't try to render an object while the logic thread is halfway through updating its coordinates or something, and there is a certain trade-off between the amount of blocking and getting every frame to render precisely.

Have skilled programmers over the years examined these pros and cons carefully and decided that single-threading is still the way to go? Or are things changing now that all the major platforms have multicore CPUs and decent threading implementations? And, for example, in Android it's important not to block the input thread and it uses a dedicated rendering thread by default; although native_app_glue apparently favours forwarding input events to the rendering thread.

### #2Hodgman  Moderators

49386
Like
21Likes
Like

Posted 29 April 2014 - 07:25 AM

POPULAR

If you end up with something like a mutex per object, which is there to allow multiple threads to randomly share it.... Then you're completely on the wrong track ;-)

The standard model in use by engines these days is not to have a fixed number threads that each run their own system. Instead you have a pool of threads (sized depending on the number of cores in the system) which are all pretty much equal. The game is then decomposed into a directed-acyclic-graph of small "jobs". A job is a function that reads some inputs and writes to some outputs (no global shared state allowed!). If the output of one job is the input of another, that's a dependency that affects scheduling (this job cannot start until the input actually exist). From there you can construct a schedule, or dependency graph can be constructed, so each job can tell whether it's allowed to start yet or not. Every thread then runs the same loop, trying to pop jobs from a job-queue and executing them (if allowed).

There's no error-prone mutex locking required, there's no shared state to create potential race-conditions, and it takes full advantage of 1, 2, 4, 8, or more core CPUs.

That's the ideal version. Often engines will still use one of those threads as a "main thread", still in a typical single-threaded style, but anything computationally expensive will be diced up into jobs and pushed into the job queue.
Some platforms have restrictions like you mention, so you may have to restrict all "rendering API jobs" to a particular thread too.
But overall, the "job queue" or "job graph" (or flow-based) approach has become the defacto standard in game engines, rather than mutexes (shared state) or message passing styles of concurrency.

Also, message-passing should always be a default choice over shared-state too ;-)

### #3realh  Members

338
Like
2Likes
Like

Posted 29 April 2014 - 11:09 AM

I think you're talking about a different situation/problem. You're describing a sort of load balancing but I want to decouple rendering from game logic so that the latter isn't affected by the speed of rendering. In the game I'm currently writing the logic is simple with a relatively low CPU demand, and doesn't really justify being split up into threads, and I think even a low-end phone should be able to maintain a consistent 60 ticks per second. But the rendering engine might not be able to keep up with that, and I'd like to prevent the game logic having to deal with being called at varying intervals.

I think there's another possible approach, which is to use a single thread, but design the game loop to use ticks of fixed duration for AI/logic. Each time round the loop you check the time, and call the tick handler as many times as necessary to keep the average tick rate constant. Or if the frame rate has dropped too much you can allow the gameplay to slow down (like when Quake shows the tortoise icon). The disadvantages of this are that the game logic still has to be separated from the rendering in case you need to tick more than once per frame, and it's more likely to suffer from laggy response to input than separate threads.

### #4cgrant  Members

1646
Like
2Likes
Like

Posted 29 April 2014 - 11:22 AM

Like Hodgman alluded to...the main thread of the application is also a thread, and depend on usage, there really is no need to create other thread to use rendering/game logic unless a particular task is computational expensive/heavyweight. These task can be farmed out to separate threads so that the main thread doesn't stall. Whether you use the main thread to render or update game logic is a matter of preference, but its usually easier to have this thread do the rendering.

### #5mark ds  Members

1733
Like
10Likes
Like

Posted 29 April 2014 - 12:26 PM

POPULAR

Game Loop

Then

Fixed Timestep Implementation

Combine this information with the post above by Hodgman

Edited by mark ds, 29 April 2014 - 12:27 PM.

### #6SeanMiddleditch  Members

16709
Like
6Likes
Like

Posted 29 April 2014 - 12:40 PM

POPULAR

With a single thread the game logic has to be able to cope with being polled from anywhere <20 times per second to hundreds, and that can be difficult to manage in some games. With a separate game logic loop you can set it to "tick" at a certain rate independently of the achieved rendering frame rate, and in most cases assume the system can guarantee that frequency unless the whole thing is so bogged down that slowing down the gameplay is acceptable or even advantageous.

There's no need for threads to achieve that. Just use a fixed timestep and apply interpolation to your rendering code. One thread, smooth graphics, frame-rate-independent game logic.

The cons of multithreading are the notorious difficulty of thread-safe programming and tracing bugs when you get it wrong.

Only if you do multithreading wrong. Back in the bad ol' days people didn't understand threads all that well, and this was a problem. Look at modern job architectures, isolated threads, communication channels, etc. Writing a massively threaded and completely safe application in languages like Eiffel or Go or Rust or even C# is pretty easy. C/C++ doesn't provide those niceties out of the box, but with a little work (or application of PPL, TBB, etc.) you can get the same kind of environment running in C++ without much headache.

You have to use mutexes etc to make sure the rendering thread doesn't try to render an object while the logic thread is halfway through updating its coordinates or something

You absolutely do not, and really really really should not.

You can make use of many threads several different ways. Use a fork-join model (without _actual_ forks and joins, mind) letting you keep a serial game loop. Or you can use a pipeline approach, which tends not to scale as well and duplicates a lot of data but can make better use of a smaller number of cores due to cache behavior and scheduling concerns. Or you can implement a completely task-based system. You never, ever put a mutex around each object; you really barely even use mutexes at all (they're expensive and error-prone).

The last lock I've even used was a small spinlock in a few small corners of a multi-thread profiling subsystem (though in most of it, the threads run completely independently with no synchronization required at all). Everything else makes use of concurrent queues (channels) and rarely some atomics.

Have skilled programmers over the years examined these pros and cons carefully

Of course. All modern game engines are very parallel.

Game Developer, C++ Geek, Dragon Slayer - http://seanmiddleditch.com

C++ SG14 "Games & Low Latency" - Co-chair - public forums

Wargaming Seattle - Lead Server Engineer - We're hiring!

### #7AllEightUp  Moderators

5610
Like
6Likes
Like

Posted 29 April 2014 - 10:03 PM

POPULAR

Of course. All modern game engines are very parallel.

I hope you are only counting engines in the last couple years in this. The big culprits out there right now: Unity, UnReal and Source are very uneven in how well they thread no matter how pretty the output is. Basically with those engines you see 2 or 3 cores getting pounded pretty hard, those are the hold overs from having been functionally threaded when they started out. Then you generally see the remaining cores getting some utilization from the thread pools but it is generally pretty low and you see a very large amount of contention burning up a notable portion of the cores. Recent engines, notably those using things like the component entity model, thread out data parallel and those *are* very parallel.

But, again, threading in general is still evolving in games. Data parallel works in a lot of cases very well where thread pools and task dags lose horribly due to Amdahl's law. On the other hand, there are a couple places where data parallel falls down. The point of all this is that this is an area of ongoing research, it's not a cut and dried "x" is the thing to do, though thread pools is really not a good solution as mentioned, its a crutch to get some threading, not good threading.

### #8Hodgman  Moderators

49386
Like
8Likes
Like

Posted 29 April 2014 - 10:54 PM

POPULAR

^^That
You can have different update and rendering frequencies without the need for any threads at all. Games are (almost) real-time systems, with a period measured in milliseconds. So as long as the time periods of your periodic systems are also in a similar range (e.g. 5-50ms), then there's no need to complicate things by using extra threads for that.

If you want the update code to run at some fixed frequency, while the rendering just runs as often as it can, then you'd use a main loop something like this:

lastTime = getTime()
while(1)
now = getTime()
deltaTime = now - lastTime
lastTime = now

accumulator += deltaTime
update();
render();


The real use of threads within a game engine is just to unlock the extra computational power of multi-core CPUs. Inside the update/render functions above, the main thread would cooperate with all the other threads in your pool to complete those tasks in parallel.

Data parallel works in a lot of cases very well where thread pools and task dags lose horribly due to Amdahl's law.

All the DAG/Job systems that I've had experience with have been data-parallel.
The two solutions I've seen are, either:

• data-parallel systems will put many jobs into the queue, representing chunks of their data that can be executed in parallel,
• Usually there's another tier in here - e.g. A system has 100000 items - it might submit a "job list", which might contain 100 sub-jobs that each update 100 of the system's items. An atomic counter in the job-list is used to allow every thread in the pool to continually pop different sub-jobs. Once all the sub-jobs have been popped, the job-list itself is removed from the pool's queue. The DAG is actually one of job-lists, not jobs.
• or, every job is assumed to be data parallel, with a numItems constant associate with them, and begin/end input variables in the job entry point. Many threads can execute the same job, specifying different, non-overlapping begin/end ranges within begin=0 to end=numItems. Once the job has been run such that every item in that full range has been executed exactly once, then it's marked as being complete. Non-scalable / non-data-parallel jobs have numItems=1.

Are you saying that a data-parallel system wouldn't have a fixed pool of threads, or maybe you're using connotations of the phrase "thread pool" that I'm missing?

Edited by Hodgman, 29 April 2014 - 10:59 PM.

### #9AllEightUp  Moderators

5610
Like
5Likes
Like

Posted 30 April 2014 - 07:02 AM

POPULAR

Data parallel works in a lot of cases very well where thread pools and task dags lose horribly due to Amdahl's law.

All the DAG/Job systems that I've had experience with have been data-parallel.
The two solutions I've seen are, either:

• data-parallel systems will put many jobs into the queue, representing chunks of their data that can be executed in parallel,
• Usually there's another tier in here - e.g. A system has 100000 items - it might submit a "job list", which might contain 100 sub-jobs that each update 100 of the system's items. An atomic counter in the job-list is used to allow every thread in the pool to continually pop different sub-jobs. Once all the sub-jobs have been popped, the job-list itself is removed from the pool's queue. The DAG is actually one of job-lists, not jobs.
• or, every job is assumed to be data parallel, with a numItems constant associate with them, and begin/end input variables in the job entry point. Many threads can execute the same job, specifying different, non-overlapping begin/end ranges within begin=0 to end=numItems. Once the job has been run such that every item in that full range has been executed exactly once, then it's marked as being complete. Non-scalable / non-data-parallel jobs have numItems=1.

Are you saying that a data-parallel system wouldn't have a fixed pool of threads, or maybe you're using connotations of the phrase "thread pool" that I'm missing?

The difference which I've seen is really the usage patterns and implementation details of the underlying thread system.  Unfortunately I was trying to generalize without getting into the details too much, but that of course leaves a lot of questions for those familiar with the problems involved.  Additionally, I probably mixed up how I described the difference as they tend to overlap a bit.

I'll try and clean this up using an example of a game which shipped a few years ago, the engine was modeled somewhat on concepts of UnReal in certain areas and initially had very little threading.  On the 360 and PS3, running the game as basically 3 threads was not getting the performance required.  I.e. the divisions were basically: main game systems, rendering and GPU computations.  (NOTE: The GPU computations in this case are normally associated with rendering but I keep them separate here since the GPU was over utilized and some work needed to shift back to the CPU's.)  Other than various minor helper threads for file io, general input, and other such things, this was pretty much the limit of the threading and the game was suffering due to this.  As far as shipped (A+) games go, this division is still very common today and the common addition of job queues is not really a "fix" as the changes need to happen architecturally from day one to really utilize current and future core counts.

This is still painting with a very broad brush and not detailing the reasons.  Without getting into the nit picky details I don't think I can go much further without starting up discussions of cache, latencies, contention avoidance and all that.  Hopefully I at least cleared up where I try and draw the line of the differences?

### #10realh  Members

338
Like
2Likes
Like

Posted 30 April 2014 - 10:05 AM

Thanks for all your input, I've learnt a lot from this thread and by reading the articles mark ds linked to. I think there are still some reasons I could stick to having a separate rendering thread though:

1. I've already invested some time writing it that way.
2. It mirrors the way Android seems to favour.
3. I've thought of an easy way to eliminate all those mutexes. I can give every object and the camera etc a pair of matrices and whatever else needs to be accessed in both threads and use them in much the same way as double buffering. Then I only need one global flag to make the rendering thread use one copy while the game thread is using the other, then flip them over at each tick, and the only thread-safe constraint it needs is to be volatile. How does that sound?

 That's no good after all, the global flag wouldn't protect against one thread flipping the flag and accessing a matrix while the other thread is still in the middle of using it. But I think it could work with a triple copy setup and/or an extra flag to indicate whether one thread has completely processed one bank and it's safe for the other to flip and reuse it. Such a flag would need a cond though.

Edited by realh, 30 April 2014 - 10:27 AM.

### #11SeanMiddleditch  Members

16709
Like
4Likes
Like

Posted 30 April 2014 - 02:27 PM

I hope you are only counting engines in the last couple years in this. The big culprits out there right now: Unity, UnReal and Source are very uneven in how well they thread no matter how pretty the output is.

"Modern" implies last few years, yes. Most of the big-name commercial engines have pedigrees going back to the mid-90's. Nothing modern about them aside from a shinier set of graphics features hacked in as the years rolled on.

Game Developer, C++ Geek, Dragon Slayer - http://seanmiddleditch.com

C++ SG14 "Games & Low Latency" - Co-chair - public forums

Wargaming Seattle - Lead Server Engineer - We're hiring!

### #12SeanMiddleditch  Members

16709
Like
0Likes
Like

Posted 30 April 2014 - 02:39 PM

I've thought of an easy way to eliminate all those mutexes. I can give every object and the camera etc a pair of matrices and whatever else needs to be accessed in both threads and use them in much the same way as double buffering. Then I only need one global flag to make the rendering thread use one copy while the game thread is using the other, then flip them over at each tick, and the only thread-safe constraint it needs is to be volatile. How does that sound?

(assuming C++ and not C#) There's no need to be volatile. Use atomics to track the pointers, or just a regular mutex, as volatile doesn't really do anything. You should spend some time reading up on atomics, memory barriers, and locking primitives. You'll also want to be careful with having only a flag, as that leaves room for bad race conditions if you aren't very sure about your memory operation ordering (and what the compiler or CPU might do behind your back).

Game Developer, C++ Geek, Dragon Slayer - http://seanmiddleditch.com

C++ SG14 "Games & Low Latency" - Co-chair - public forums

Wargaming Seattle - Lead Server Engineer - We're hiring!

### #13AllEightUp  Moderators

5610
Like
2Likes
Like

Posted 30 April 2014 - 03:41 PM

I've thought of an easy way to eliminate all those mutexes. I can give every object and the camera etc a pair of matrices and whatever else needs to be accessed in both threads and use them in much the same way as double buffering. Then I only need one global flag to make the rendering thread use one copy while the game thread is using the other, then flip them over at each tick, and the only thread-safe constraint it needs is to be volatile. How does that sound?

As Sean points out, it doesn't need to be volatile and actually volatile is a bad thing to use anyway.  You don't even need this to be atomic either.  Give the sim and renderer a working index and use three threads: control, sim and renderer.  Control starts off by issuing a sim tick to get the initial set of matrices and when sim completes it tells the renderer to render using index 0.  Control increases the index for the sim to 1 and tells it to compute the next frame as the renderer is going.  At this point you can completely decouple rendering and simulation.  If the renderer finishes before the next simulation finishes, you can re-render the same data or wait.  If the sim has finished another frame and is working on a third, renderer can extrapolate from last frame through current frame.  If the renderer is slow, sim keeps alternating between two matrices that it fills with the most recent data.  The key point is that the sim and renderer notify the control thread and wait until told what to do, a single mutex/blocking state per game frame is completely viable and won't pose any significant performance issue.  If the logic controlling the indexes in use is in the control thread, you don't need any thread safety on the indexes since the other threads are unable to contend for the data at that point.

Using shadow data of this form is fairly common.  Deciding what to do when one or the other items is slow, that's really up to you and how you want to deal with it.  In general though, I believe all the mixtures of possible fast/slow require 3 matrices stored and only the control thread is allowed to change the index which the other threads will use.

### #14realh  Members

338
Like
2Likes
Like

Posted 01 May 2014 - 06:53 AM

Thanks for the advice. TBH I was expecting to be told this is a silly idea, just rewrite with a single thread .

About using volatile, I thought that was necessary. I thought it was there to make sure any writes to a variable get written to memory immediately rather than allowing the optimiser to cache intermediate values in a register. The caching can prevent changes propagating to other threads. The sim and rendering threads only need to read the index (of which set of matrices etc to work with) once per tick, so they can copy the volatile master index to a non-volatile thread-local variable to avoid the overhead of volatile. Alternatively the control thread can pass the indexes as arguments to the other threads' tick functions.

Atomic needs C++11 doesn't it? I've been conservative and used older-style C++. The main reason is that I'm a Linux geek and want to support MinGW in case I open source parts of my engine/framework, and I haven't been won over by MSVC either. Unfortunately MinGW doesn't support the C++11 threading API for some reason (as of a few months ago anyway), and I suspect that might imply lack of support for atomic too. I've used a wrapper API around pthreads or WinAPI as appropriate.

### #15Hodgman  Moderators

49386
Like
5Likes
Like

Posted 01 May 2014 - 05:20 PM

POPULAR

Just use mutexes to control the handing over of one "frame" of data from one thread to another. Volatile is only needed if you're trying to reinvent mutexes yourself -- in most projects, use of the volatile keyword is simply banned: if you try to use it, the team will just assume that your code is buggy, because it most likely is (unless you're the kind of guy who understands all the OS, compiler and hardware specific details required to properly implement your own synchronization primitives, and has written multiple unit-tests that have been running for 6 months across different hardware configurations just to be sure ).

You might want to have storage for 3 frames's worth. Say the renderer is reading one of those buffers, and the updater is writing one. If the updater finishes it's work, it can't start a new frame because there's no free buffers (it must wait for the renderer to finish reading one). A 3rd buffer lets the updater start a new frame immediately. YMMV

Make an array of structures that describe which tread owns each buffer, and on which frame it was generated. Make a mutex that protects these structures.
When the render thread starts a frame, lock the mutex and loop through the structure to find the most recent frame that isn't owned by any thread at the moment. Mark it as owned by the renderer, unlock the mutex and render from it. When you're done, lock the mutex, mark it as non-owned and unlock the mutex.
When the update thread starts a frame, lock the mutex, find the oldest non-owned buffer, mark it as owned by the updater, unlock, write a new frame to it, lock, mark as non-owned, unlock.

Edited by Hodgman, 02 May 2014 - 12:48 AM.

### #16L. Spiro  Members

24826
Like
5Likes
Like

Posted 01 May 2014 - 06:15 PM

POPULAR

You have to use mutexes etc to make sure the rendering thread doesn't try to render an object while the logic thread is halfway through updating its coordinates or something, and there is a certain trade-off between the amount of blocking and getting every frame to render precisely.

Multithreaded rendering doesn’t work this way.
The thread that handles rendering uses a command list compiled by the game thread in which commands can include matrix and other uniform values that not only keep a copy of an object’s “coordinates” (transform) but anything else, such as material color (some objects flash), etc.

This is a separate concept from separating logic and rendering, even if rendering and game logic happen on a separate thread from the main (input) thread.

Have skilled programmers over the years examined these pros and cons carefully and decided that single-threading is still the way to go?

On the contrary, our company gave a talk at the last CEDEC detailing the pros of multithreaded rendering even on smart-phone devices.
http://research.tri-ace.com/
Again, not to be confused with decoupling logic and rendering, which has been the common practice since the dawn of time.

And, for example, in Android it's important not to block the input thread

Not blocking input is one of the primary reasons to move rendering and game logic to a separate thread (or to 2 separate threads in the case of multithreaded rendering).

L. Spiro

### #17realh  Members

338
Like
0Likes
Like

Posted 02 May 2014 - 05:38 AM

Volatile is only needed if you're trying to reinvent mutexes yourself -- in most projects, use of the volatile keyword is simply banned: if you try to use it, the team will just assume that your code is buggy, because it most likely is

I need to find a good "volatile considered harmful" article, don't I.

### #18agleed  Members

919
Like
0Likes
Like

Posted 02 May 2014 - 03:20 PM

If you end up with something like a mutex per object, which is there to allow multiple threads to randomly share it.... Then you're completely on the wrong track ;-)

The standard model in use by engines these days is not to have a fixed number threads that each run their own system. Instead you have a pool of threads (sized depending on the number of cores in the system) which are all pretty much equal. The game is then decomposed into a directed-acyclic-graph of small "jobs". A job is a function that reads some inputs and writes to some outputs (no global shared state allowed!). If the output of one job is the input of another, that's a dependency that affects scheduling (this job cannot start until the input actually exist). From there you can construct a schedule, or dependency graph can be constructed, so each job can tell whether it's allowed to start yet or not. Every thread then runs the same loop, trying to pop jobs from a job-queue and executing them (if allowed).

There's no error-prone mutex locking required, there's no shared state to create potential race-conditions, and it takes full advantage of 1, 2, 4, 8, or more core CPUs.

That's the ideal version. Often engines will still use one of those threads as a "main thread", still in a typical single-threaded style, but anything computationally expensive will be diced up into jobs and pushed into the job queue.
Some platforms have restrictions like you mention, so you may have to restrict all "rendering API jobs" to a particular thread too.
But overall, the "job queue" or "job graph" (or flow-based) approach has become the defacto standard in game engines, rather than mutexes (shared state) or message passing styles of concurrency.

Also, message-passing should always be a default choice over shared-state too ;-)

Do you by any chance have any pointers to papers or presentations that go into concrete implementations of these systems? I read tidbits about task based parallelism here and there sometimes but I've never been able to find a comprehensive overview of such a system and some of the gritty details concerning a practical application (e.g. what to do about context switching, how do you actually schedule [implicitly/manually or use some kind of scheduling algorithm at runtime], etc.).

Also I could swear I just recently saw a presentation for a PS4 game where someone said that overall 20-30% or something of the used processors are used at any time, i.e. 70-80% of the potential performance isn't even being used. But that's probably to be expected, parallelism is a bitch and I guess if you're at 30% with 6 cores it's better than being at 17% (i.e. single core performance).

Edited by agleed, 02 May 2014 - 03:21 PM.

### #19Hodgman  Moderators

49386
Like
4Likes
Like

Posted 02 May 2014 - 08:48 PM

Do you by any chance have any pointers to papers or presentations that go into concrete implementations of these systems? I read tidbits about task based parallelism here and there sometimes but I've never been able to find a comprehensive overview of such a system and some of the gritty details concerning a practical application (e.g. what to do about context switching, how do you actually schedule [implicitly/manually or use some kind of scheduling algorithm at runtime], etc.).

This review of the Doom 3 code base is a good read:
http://fabiensanglard.net/doom3/

Actually it's this one that describes the job system:
http://fabiensanglard.net/doom3_bfg/
[/edit]

Intel wrote a series a while ago and code to go with it, under the name "SMOKE":
http://www.gamasutra.com/view/feature/132254/performance_scaling_with_cores_.php

Also I could swear I just recently saw a presentation for a PS4 game where someone said that overall 20-30% or something of the used processors are used at any time, i.e. 70-80% of the potential performance isn't even being used. But that's probably to be expected, parallelism is a bitch and I guess if you're at 30% with 6 cores it's better than being at 17% (i.e. single core performance).

That could also just be that their game ran at 30Hz (which is 33ms per frame), but they simply had less than 33ms of CPU work in their game to do. The PS4/Xbone CPUs are complete beasts compared to 360/PS3! Newer games will continue to add more features until they get closer and closer to their frame budget (usually either 16.6ms or 33.3ms per frame).

### #20SerialKicked  Members

583
Like
2Likes
Like

Posted 03 May 2014 - 07:29 AM

I don't know if it's considered good form to post for this, but that is the most interesting read I had in a long while, it made me reconsider how I will approach multi-threading in my future projects (as it's a bit too late for the current one). So, thank you all

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.