Multithreading vs variable time per frame

Started by
28 comments, last by MegaPixel 9 years, 11 months ago

Most existing game engines seem to be based around a monolithic single-threaded main loop which serially handles input, AI, rendering and so on. I think separating rendering from game logic with threads has major advantages:

Rendering has the biggest influence on performance. With a single thread the game logic has to be able to cope with being called anywhere from under 20 times per second to hundreds of times per second, and that can be difficult to manage in some games. With a separate game logic loop you can set it to "tick" at a certain rate independently of the achieved rendering frame rate, and in most cases assume the system can sustain that frequency unless the whole thing is so bogged down that slowing the gameplay down is acceptable or even advantageous.

The cons of multithreading are the notorious difficulty of thread-safe programming and tracing bugs when you get it wrong. You have to use mutexes etc to make sure the rendering thread doesn't try to render an object while the logic thread is halfway through updating its coordinates or something, and there is a certain trade-off between the amount of blocking and getting every frame to render precisely.

Have skilled programmers over the years examined these pros and cons carefully and decided that single-threading is still the way to go? Or are things changing now that all the major platforms have multicore CPUs and decent threading implementations? For example, on Android it's important not to block the input thread, and the platform uses a dedicated rendering thread by default, although native_app_glue apparently favours forwarding input events to the rendering thread.

If you end up with something like a mutex per object, which is there to allow multiple threads to randomly share it.... Then you're completely on the wrong track ;-)

The standard model in use by engines these days is not to have a fixed number of threads that each run their own system. Instead you have a pool of threads (sized depending on the number of cores in the system) which are all pretty much equal. The game is then decomposed into a directed-acyclic-graph of small "jobs". A job is a function that reads some inputs and writes to some outputs (no global shared state allowed!). If the output of one job is the input of another, that's a dependency that affects scheduling (that job cannot start until its input actually exists). From there a dependency graph can be constructed, so each job can tell whether it's allowed to start yet or not. Every thread then runs the same loop, trying to pop jobs from a job-queue and executing them (if allowed).
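A minimal sketch of that idea (the `Job` struct and `runGraph` function here are illustrative inventions, not code from any shipping engine): each job carries a count of unfinished dependencies, worker threads pop ready jobs from a shared queue, and finishing a job releases its dependents.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A job reads its inputs and writes its outputs; the only bookkeeping is a
// count of unfinished dependencies and a list of jobs waiting on this one.
struct Job {
    std::function<void()> run;
    int pending = 0;              // unfinished dependencies
    std::vector<Job*> dependents;
};

// Declare that `b` may not start until `a` has finished.
void after(Job& a, Job& b) {
    ++b.pending;
    a.dependents.push_back(&b);
}

// Run every job in the graph on `workers` threads; the graph must be acyclic.
void runGraph(const std::vector<Job*>& jobs, unsigned workers) {
    std::mutex m;
    std::condition_variable cv;
    std::queue<Job*> ready;
    std::size_t remaining = jobs.size();
    for (Job* j : jobs)
        if (j->pending == 0) ready.push(j);   // roots can start immediately

    auto worker = [&] {
        std::unique_lock<std::mutex> lk(m);
        while (remaining > 0) {
            if (ready.empty()) { cv.wait(lk); continue; }
            Job* j = ready.front();
            ready.pop();
            lk.unlock();
            j->run();                         // run without holding the lock
            lk.lock();
            for (Job* d : j->dependents)      // release jobs we were blocking
                if (--d->pending == 0) ready.push(d);
            --remaining;
            cv.notify_all();
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 1; i < workers; ++i) pool.emplace_back(worker);
    worker();                                 // the calling thread works too
    for (auto& t : pool) t.join();
}
```

Real engines typically use lock-free queues and work stealing rather than one mutex-protected queue, but the scheduling idea is the same.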

There's no error-prone mutex locking required, there's no shared state to create potential race-conditions, and it takes full advantage of 1, 2, 4, 8, or more core CPUs.

That's the ideal version. Often engines will still use one of those threads as a "main thread", still in a typical single-threaded style, but anything computationally expensive will be diced up into jobs and pushed into the job queue.
Some platforms have restrictions like you mention, so you may have to restrict all "rendering API jobs" to a particular thread too.
But overall, the "job queue" or "job graph" (or flow-based) approach has become the de facto standard in game engines, rather than mutexes (shared state) or message passing styles of concurrency.

Also, message-passing should always be the default choice over shared-state too ;-)

I think you're talking about a different situation/problem. You're describing a sort of load balancing but I want to decouple rendering from game logic so that the latter isn't affected by the speed of rendering. In the game I'm currently writing the logic is simple with a relatively low CPU demand, and doesn't really justify being split up into threads, and I think even a low-end phone should be able to maintain a consistent 60 ticks per second. But the rendering engine might not be able to keep up with that, and I'd like to prevent the game logic having to deal with being called at varying intervals.

As you say, some platforms are restricted to performing all OpenGL calls on a particular thread, or at least strongly advise against spreading your OpenGL calls across threads, so I thought a rendering thread and a logic thread would be a good way to do it. Is this a bad idea full stop because of the need for mutexes? I don't think they can be safely eliminated whenever more than one thread has access to the same object.

I think there's another possible approach, which is to use a single thread, but design the game loop to use ticks of fixed duration for AI/logic. Each time round the loop you check the time, and call the tick handler as many times as necessary to keep the average tick rate constant. Or if the frame rate has dropped too much you can allow the gameplay to slow down (like when Quake shows the tortoise icon). The disadvantages of this are that the game logic still has to be separated from the rendering in case you need to tick more than once per frame, and it's more likely to suffer from laggy response to input than separate threads.

Like Hodgman alluded to... the main thread of the application is also a thread, and depending on usage, there really is no need to create other threads for rendering/game logic unless a particular task is computationally expensive/heavyweight. Those tasks can be farmed out to separate threads so that the main thread doesn't stall. Whether you use the main thread to render or update game logic is a matter of preference, but it's usually easier to have this thread do the rendering.

Read these:

Game Loop

Fix Your Timestep

Then

Fixed Timestep Implementation

Combine this information with the post above by Hodgman.

With a single thread the game logic has to be able to cope with being called anywhere from under 20 times per second to hundreds of times per second, and that can be difficult to manage in some games. With a separate game logic loop you can set it to "tick" at a certain rate independently of the achieved rendering frame rate, and in most cases assume the system can sustain that frequency unless the whole thing is so bogged down that slowing the gameplay down is acceptable or even advantageous.


There's no need for threads to achieve that. Just use a fixed timestep and apply interpolation to your rendering code. One thread, smooth graphics, frame-rate-independent game logic.

The cons of multithreading are the notorious difficulty of thread-safe programming and tracing bugs when you get it wrong.


Only if you do multithreading wrong. Back in the bad ol' days people didn't understand threads all that well, and this was a problem. Look at modern job architectures, isolated threads, communication channels, etc. Writing a massively threaded and completely safe application in languages like Eiffel or Go or Rust or even C# is pretty easy. C/C++ doesn't provide those niceties out of the box, but with a little work (or application of PPL, TBB, etc.) you can get the same kind of environment running in C++ without much headache.

You have to use mutexes etc to make sure the rendering thread doesn't try to render an object while the logic thread is halfway through updating its coordinates or something


You absolutely do not, and really really really should not.

You can make use of many threads several different ways. Use a fork-join model (without _actual_ forks and joins, mind) letting you keep a serial game loop. Or you can use a pipeline approach, which tends not to scale as well and duplicates a lot of data but can make better use of a smaller number of cores due to cache behavior and scheduling concerns. Or you can implement a completely task-based system. You never, ever put a mutex around each object; you really barely even use mutexes at all (they're expensive and error-prone).

The last lock I've ever used was a small spinlock in a few small corners of a multi-threaded profiling subsystem (though in most of it, the threads run completely independently with no synchronization required at all). Everything else makes use of concurrent queues (channels) and, rarely, some atomics.

Have skilled programmers over the years examined these pros and cons carefully


Of course. All modern game engines are very parallel.

Sean Middleditch – Game Systems Engineer – Join my team!

Keep in mind that when talking about these things, threading in engines continues to evolve. Thread pools and DAG-based task issuance are not really the direction most engines are moving, due to the fact that they don't scale very well. Thread pools are mostly a bolt-on system to allow functionally threaded engines to stay somewhat relevant; it's a bolt-on that helps, not a great solution. Newer engines have moved (are moving) towards data-parallel threading due to its well-proven scaling abilities. This is a threading model similar to hardware items such as GPUs and telecom switches, and of course languages such as Erlang, OpenCL, etc. The first major usage of this threading style was Renderman's Shading Language; though it wasn't threaded initially, it simulated a giant SIMD engine, which is basically where the best threaded engines have been going. Another example was Havok's hydra threading system: once past broadphase processing, it was data parallel in computing intersections/penetrations and such.

Of course. All modern game engines are very parallel.


I hope you are only counting engines from the last couple of years in this. The big culprits out there right now, Unity, UnReal and Source, are very uneven in how well they thread no matter how pretty the output is. Basically with those engines you see 2 or 3 cores getting pounded pretty hard; those are the holdovers from having been functionally threaded when they started out. Then you generally see the remaining cores getting some utilization from the thread pools, but it is generally pretty low, and you see a very large amount of contention burning up a notable portion of the cores. Recent engines, notably those using things like the component entity model, thread out data-parallel, and those *are* very parallel.

But, again, threading in general is still evolving in games. Data parallel works very well in a lot of cases where thread pools and task DAGs lose horribly due to Amdahl's law. On the other hand, there are a couple of places where data parallel falls down. The point of all this is that this is an area of ongoing research; it's not cut and dried that "x" is the thing to do, though thread pools really are not a good solution, as mentioned. They're a crutch to get some threading, not good threading.

Read these

^^ That.
You can have different update and rendering frequencies without the need for any threads at all. Games are (almost) real-time systems, with a period measured in milliseconds. So as long as the time periods of your periodic systems are also in a similar range (e.g. 5-50ms), then there's no need to complicate things by using extra threads for that.

If you want the update code to run at some fixed frequency, while the rendering just runs as often as it can, then you'd use a main loop something like this:


lastTime = getTime()
while(1)
  now = getTime()
  deltaTime = now - lastTime
  lastTime = now

  accumulator += deltaTime
  while( accumulator > updateStepSize )
    update();
    accumulator -= updateStepSize;
  render();

The real use of threads within a game engine is just to unlock the extra computational power of multi-core CPUs. Inside the update/render functions above, the main thread would cooperate with all the other threads in your pool to complete those tasks in parallel.

Data parallel works in a lot of cases very well where thread pools and task dags lose horribly due to Amdahl's law.

All the DAG/Job systems that I've had experience with have been data-parallel.
The two solutions I've seen are, either:

  • data-parallel systems will put many jobs into the queue, representing chunks of their data that can be executed in parallel,
    • Usually there's another tier in here - e.g. A system has 100000 items - it might submit a "job list", which might contain 100 sub-jobs that each update 100 of the system's items. An atomic counter in the job-list is used to allow every thread in the pool to continually pop different sub-jobs. Once all the sub-jobs have been popped, the job-list itself is removed from the pool's queue. The DAG is actually one of job-lists, not jobs.
  • or, every job is assumed to be data parallel, with a numItems constant associated with it, and begin/end input variables in the job entry point. Many threads can execute the same job, specifying different, non-overlapping begin/end ranges within begin=0 to end=numItems. Once the job has been run such that every item in that full range has been executed exactly once, it's marked as being complete. Non-scalable / non-data-parallel jobs have numItems=1.

Are you saying that a data-parallel system wouldn't have a fixed pool of threads, or maybe you're using connotations of the phrase "thread pool" that I'm missing?
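The second scheme above can be sketched with nothing but an atomic counter handing out ranges (this `parallelFor` is a hypothetical illustration, not any engine's actual API):

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Threads repeatedly grab the next chunk of [0, numItems) via an atomic
// counter, so every item is processed exactly once with no locking.
void parallelFor(int numItems, int chunkSize,
                 const std::function<void(int, int)>& job,
                 unsigned workers) {
    std::atomic<int> next{0};
    auto run = [&] {
        for (;;) {
            int begin = next.fetch_add(chunkSize);
            if (begin >= numItems) return;            // all chunks handed out
            job(begin, std::min(begin + chunkSize, numItems));
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 1; i < workers; ++i) pool.emplace_back(run);
    run();                                            // the caller joins the work
    for (auto& t : pool) t.join();
}
```

Since the ranges never overlap, the job body needs no synchronization as long as each item's data is independent.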

Data parallel works in a lot of cases very well where thread pools and task dags lose horribly due to Amdahl's law.

All the DAG/Job systems that I've had experience with have been data-parallel.
The two solutions I've seen are, either:

  • data-parallel systems will put many jobs into the queue, representing chunks of their data that can be executed in parallel,
    • Usually there's another tier in here - e.g. A system has 100000 items - it might submit a "job list", which might contain 100 sub-jobs that each update 100 of the system's items. An atomic counter in the job-list is used to allow every thread in the pool to continually pop different sub-jobs. Once all the sub-jobs have been popped, the job-list itself is removed from the pool's queue. The DAG is actually one of job-lists, not jobs.
  • or, every job is assumed to be data parallel, with a numItems constant associated with it, and begin/end input variables in the job entry point. Many threads can execute the same job, specifying different, non-overlapping begin/end ranges within begin=0 to end=numItems. Once the job has been run such that every item in that full range has been executed exactly once, it's marked as being complete. Non-scalable / non-data-parallel jobs have numItems=1.

Are you saying that a data-parallel system wouldn't have a fixed pool of threads, or maybe you're using connotations of the phrase "thread pool" that I'm missing?

The difference which I've seen is really the usage patterns and implementation details of the underlying thread system. Unfortunately I was trying to generalize without getting into the details too much, but that of course leaves a lot of questions for those familiar with the problems involved. Additionally, I probably mixed up how I described the difference as they tend to overlap a bit.

I'll try and clean this up using an example of a game which shipped a few years ago. The engine was modeled somewhat on concepts of UnReal in certain areas and initially had very little threading. On the 360 and PS3, running the game as basically 3 threads was not getting the performance required. I.e. the divisions were: main game systems, rendering and GPU computations. (NOTE: The GPU computations in this case are normally associated with rendering, but I keep them separate here since the GPU was over-utilized and some work needed to shift back to the CPUs.) Other than various minor helper threads for file I/O, general input, and other such things, this was pretty much the limit of the threading, and the game was suffering due to this. As far as shipped (A+) games go, this division is still very common today, and the common addition of job queues is not really a "fix", as the changes need to happen architecturally from day one to really utilize current and future core counts.

Anyway, back to the example engine. By adding a relatively fast job system I moved animation, skinning, some various bits of AI update logic and a couple other systems over to using the thread system and we managed to get the primary threads under 100% and also reduce load on the GPU such that more pretty stuff could be pushed through. This is as far as things were taken, "just enough" to ship. This leaves two reasons I draw the line I do: the thread system itself was limited by the need to integrate with a functional parallel engine and only bits and pieces of the engine were flattened out to actually distribute. Newer engines start with threading from ground zero and the way they are coded is driven by the necessity of threading day one such that the thread system itself does not have to tiptoe around functional concepts. This might seem subtle but it is actually quite a large difference and causes a pretty major change to the overall architecture and design of systems and engine organization. Obviously the ultimate 100% Amdahl's law is still a long way off, but I have seen new engines which can push upwards of 90-95% (i.e. 5-10% is still single threaded) over 24 cores which gives them the ability to do living world simulations orders of magnitude larger than pretty much anything else. Also note that rendering becomes more tightly integrated with threading and allows better throughput and fewer GPU stalls thus helping to push even prettier pictures than before.

This is still painting with a very broad brush and not detailing the reasons. Without getting into the nit picky details I don't think I can go much further without starting up discussions of cache, latencies, contention avoidance and all that. Hopefully I at least cleared up where I try and draw the line of the differences?

Thanks for all your input, I've learnt a lot from this thread and by reading the articles mark ds linked to. I think there are still some reasons I could stick to having a separate rendering thread though:

  1. I've already invested some time writing it that way.
  2. It mirrors the way Android seems to favour.
  3. I've thought of an easy way to eliminate all those mutexes. I can give every object and the camera etc a pair of matrices and whatever else needs to be accessed in both threads and use them in much the same way as double buffering. Then I only need one global flag to make the rendering thread use one copy while the game thread is using the other, then flip them over at each tick, and the only thread-safe constraint it needs is to be volatile. How does that sound?

[Edit] That's no good after all: the global flag wouldn't protect against one thread flipping the flag and accessing a matrix while the other thread is still in the middle of using it. But I think it could work with a triple copy setup and/or an extra flag to indicate whether one thread has completely processed one bank and it's safe for the other to flip and reuse it. Such a flag would need a condition variable though.
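For what it's worth, the triple-copy idea is a known pattern (usually called a triple buffer), and with C++11 atomics the swap itself needs no condition variable: pack a "fresh" bit and the middle slot's index into one atomic integer and exchange it. A sketch, assuming exactly one logic (writer) thread and one render (reader) thread:

```cpp
#include <array>
#include <atomic>

// Triple buffer: the producer always owns one slot to write into, the
// consumer always owns one slot to read from, and a third "middle" slot is
// exchanged atomically between them, so neither side ever blocks and the
// reader never sees a half-written state. Single producer, single consumer.
template <typename T>
class TripleBuffer {
public:
    // Producer: fill the write slot, then swap it into the middle.
    void publish(const T& value) {
        slots_[writeIdx_] = value;
        // kFresh set = middle holds unread data; low bits = middle's index
        writeIdx_ = middle_.exchange(writeIdx_ | kFresh) & kIndexMask;
    }
    // Consumer: if new data was published, swap it into the read slot.
    const T& read() {
        if (middle_.load() & kFresh)
            readIdx_ = middle_.exchange(readIdx_) & kIndexMask;
        return slots_[readIdx_];
    }
private:
    static constexpr int kFresh = 4, kIndexMask = 3;
    std::array<T, 3> slots_{};
    int writeIdx_ = 0, readIdx_ = 1;
    std::atomic<int> middle_{2};
};
```

The trade-off is a third copy of the state, and unread updates are simply overwritten, which is what you want when rendering the latest state. You'd still want a condition variable if the renderer should sleep until new data arrives rather than re-render the old state.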

