Multithreaded engines

Hi guys,

I have been studying C++ lately, especially the standard library facilities for concurrency and threads. I came to know about atomics and lock-free programming in general. Given that game engines are among the most complex pieces of software requiring high performance, are there any game engines on the market that use multithreading, or lock-free algorithms? (Lock-free algorithms are pretty low-level even compared to traditional locks, mutexes, and semaphores.)

Also, we have Vulkan, which forces us to deal with GPU memory ourselves. It seems we are going deeper and deeper into the hardware.

Pretty much every engine is multithreaded to some degree these days - especially if it's been used to create console games.

-- The XB360 has an under-powered PowerPC CPU with 3 cores ("hyper-threaded", so 6 hardware threads).

-- The PS3 has an under-powered PowerPC CPU with 1 core ("hyper-threaded", so 2 hardware threads), plus a bunch of amazing Cell SPE cores (effectively a NUMA architecture, each with its own local memory).

-- The PS4/Xbone have two under-powered x86-64 CPU clusters with 4 cores each (8 cores total).

To get any decent amount of CPU processing done on those machines, you've got to split the workloads over many threads.

Lock-free data structures are standard practice for inter-thread communication -- e.g. message queues. Note that lock-free algorithms aren't necessarily cheaper than locking algorithms (some are waaaay more expensive!!); the lock-free guarantee just promises that no single stalled thread can ever hold up global progress. Having high contention -- many threads fighting to share one resource at the same time -- is a performance killer regardless of whether you're using lock-free or locking algorithms.

There are also wait-free algorithms that work in a lot of places, and are often simpler than both of the above :wink:
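To make the message-queue idea above concrete, here is a minimal sketch of a bounded single-producer/single-consumer queue built on C++11 atomics -- the kind of lock-free structure commonly used for inter-thread messaging. The names (SpscQueue, try_push, try_pop) are purely illustrative, not taken from any particular engine.

[code]
#include <atomic>
#include <cstddef>

// Minimal bounded single-producer/single-consumer queue.
// Lock-free: neither thread can ever block the other; a full or empty
// queue simply makes try_push / try_pop return false instead of waiting.
template <typename T, size_t Capacity>
class SpscQueue {
public:
    bool try_push(const T& value) {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                                   // queue is full
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);       // publish to consumer
        return true;
    }

    bool try_pop(T& out) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                                   // queue is empty
        out = buffer_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};   // written only by the producer thread
    std::atomic<size_t> tail_{0};   // written only by the consumer thread
};
[/code]

One thread (say, game update) pushes messages and another (say, the renderer) drains them each frame; because each index is only ever written by one thread, no locks are needed and contention stays low.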

Going beyond your question -- the shared memory concurrency paradigm, that is, explicit creation of threads, and explicit sharing of resources via mutexes/etc, is not a very friendly choice. It's usually reserved for low-level engine programming, because it's hard to get right, so you don't want a whole team of gameplay programmers dabbling with it.

Engines usually create other concurrency abstractions and paradigms for game code to use, such as message passing or flow-based stream processing, which can easily make use of multi-core hardware while being very hard to use incorrectly. Even within the low-level engine code, you want to use these other paradigms as much as possible.

The most common low-level mechanism is a thread pool and a job-queue. The gist of this is that you make one thread per physical CPU core, and tell each of them to continually try to pop "jobs" from a central queue. Jobs are basically just a lambda / std::function / function pointer and argument. When you've got a large amount of work to do, you break it up into a collection of jobs and push them into the queue.
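As a rough illustration of the thread-pool-plus-job-queue idea, here is a minimal sketch using std::thread, std::function, and a mutex/condition-variable protected queue. All names here are hypothetical, and a production job system (like the one in the articles linked below) would typically use lock-free queues, work stealing, and fixed-size job structs instead of std::function.

[code]
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread pool: one worker per hardware thread, all popping jobs
// from a shared queue. A mutex is used here for clarity; see the linked
// articles for how to make the queue itself lock-free.
class JobQueue {
public:
    JobQueue() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~JobQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void push(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return quit_ || !jobs_.empty(); });
                if (quit_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // run the job outside the lock
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool quit_ = false;
};
[/code]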

The next layer to build on top of that is a system to manage dependencies -- the output/results from one set of jobs will be the inputs of another set of jobs, so instead of managing mutexes/etc, the tricky part now is making sure that this second group of jobs doesn't begin until after the first group have finished (and making a system that makes it easy to identify and declare these kinds of data-dependencies).
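One common way to express that "group B waits on group A" dependency -- sketched here on top of the hypothetical JobQueue above -- is an atomic counter that the last finishing job in the batch decrements to zero before pushing the follow-up work. This is just an assumption-laden sketch of the idea, not the mechanism any specific engine uses.

[code]
#include <atomic>
#include <functional>
#include <memory>
#include <vector>

// Run a batch of jobs, then push a follow-up job once the whole batch has
// finished. Assumes the batch is non-empty; JobQueue is the hypothetical
// thread pool sketched above.
void runBatchThen(JobQueue& queue,
                  std::vector<std::function<void()>> batch,
                  std::function<void()> then)
{
    auto remaining = std::make_shared<std::atomic<int>>(static_cast<int>(batch.size()));
    for (auto& job : batch) {
        queue.push([&queue, job = std::move(job), remaining, then] {
            job();
            // The last job to finish schedules the dependent work.
            if (remaining->fetch_sub(1, std::memory_order_acq_rel) == 1)
                queue.push(then);
        });
    }
}
[/code]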

You can read about a real-world version of one of these systems here:

http://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/

http://blog.molecular-matters.com/2015/09/08/job-system-2-0-lock-free-work-stealing-part-2-a-specialized-allocator/

http://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/

http://blog.molecular-matters.com/2015/11/09/job-system-2-0-lock-free-work-stealing-part-4-parallel_for/

Hodges' input is pretty much golden. And here is something from personal experience with threading.

There are always ways to thread any system in your engine, but that does not necessarily mean you should. There are occasions when threading is less performant than staying single-threaded, because of the synchronization it requires. That could very well just be a matter of not doing it right... but expecting to do everything right is hardly ever realistic.


You can also find ways to mask the nature of threading from the gameplay programmers. I tried doing something similar to what Killzone did, which was to turn the update of each Actor into a job and schedule all dependent jobs in that fashion. The problem was... this got complicated real fast, so I quickly threw the sucker out.

In the end... there is one piece of advice that is always solid: plan for multithreading ahead of time, rather than as an afterthought when you need it. Implementing it late in the game, when your systems are already developed, can easily be a pain, because it means you need to reduce coupling, which means Frankensteining a good portion of your code and suffering hours of edits.

I would expand on Hodges' and Tangletail's replies; they are both valid, but I think they miss an important point. The problem I have seen with most game threading is that instead of changing the game code itself to work correctly when threaded, folks end up writing these monstrous threading systems with task dependencies, work stealing, etc., when none of that is really needed. I don't say those solutions are wrong, but every time I see them in production code I generally end up avoiding the dependencies and such by simply refactoring the code with proper mutable/immutable divisions, so that I can perform all the same work by simply issuing a block of tasks, a fence or barrier, and another block of tasks.

Here is an example since I was doing some work last weekend on exactly this problem.

13ms per frame with dependency-driven linear code:

[attachment=31119:LinearWork.jpg]

4ms per frame with the dependencies broken into two passes: the first pass does all the inter-relational computation, the second pass writes all the results back to the objects.

[attachment=31120:FixedTasking.jpg]

The changes were simple and quick to implement. I just broke up the 'look at all the other objects' portion of the code from the 'update my data' portion, so that each part could be run in parallel. And this is not a contrived test; it is real game code running a very expensive simulation over 25k objects. The funny thing is that without the renderer it runs at about ~500 FPS with 100% CPU load and <1% spent in the kernel, with almost completely linear scaling from 2 to 12 cores; the gains then start falling off as you get closer to 24 cores, mostly because 25k objects just isn't enough to feed that many cores.
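For what it's worth, a toy sketch of that two-pass shape might look like the following. The math here is made-up flavor (the actual simulation isn't shown in the post); the point is the structure: a read-everything/write-only-your-own-scratch pass, a join acting as the fence, then a write-back pass that touches only each task's own objects.

[code]
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Object { float x = 0, y = 0, vx = 0, vy = 0; };

// Tiny illustrative parallel-for: split [0, count) across hardware threads.
// Joining the threads at the end is what acts as the fence/barrier.
template <typename Fn>
void parallelFor(size_t count, Fn&& fn) {
    const size_t threads = std::max<size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (size_t t = 0; t < threads; ++t)
        pool.emplace_back([=, &fn] {
            for (size_t i = t; i < count; i += threads) fn(i);
        });
    for (auto& th : pool) th.join();
}

void updateObjects(std::vector<Object>& objects, float dt) {
    // Pass 1 -- "look at all the other objects": reads shared state freely,
    // but writes only object i's own scratch results (vx, vy).
    parallelFor(objects.size(), [&](size_t i) {
        float ax = 0, ay = 0;
        for (const Object& other : objects) {
            ax += other.x - objects[i].x;
            ay += other.y - objects[i].y;
        }
        objects[i].vx = ax * 0.001f;
        objects[i].vy = ay * 0.001f;
    });

    // parallelFor joined its threads above -- that is the fence.

    // Pass 2 -- "update my data": writes object i only. No locks anywhere.
    parallelFor(objects.size(), [&](size_t i) {
        objects[i].x += objects[i].vx * dt;
        objects[i].y += objects[i].vy * dt;
    });
}
[/code]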

Perhaps I'm missing some reason for these complicated task scheduling systems, but I don't think so. I've shipped a number of games using this same style of threading without problems, and it usually makes the code considerably easier to maintain, if for no other reason than that side effects have to be removed. But you have to be explicit, and as such you need to break your logic into a series of pieces, much more like a set of GPU shaders than linear code. It is very similar to the other forum thread about data-oriented versus object-oriented design.

Anyway, for someone new, I'd always suggest keeping it simple and learning the (usually simple) set of rules, instead of trying to write complicated threading solutions so you don't have to follow the rules correctly.

"...it usually makes the code considerably easier to maintain, if for no other reason than that side effects have to be removed. But you have to be explicit, and as such you need to break your logic into a series of pieces, much more like a set of GPU shaders than linear code. It is very similar to the other forum thread about data-oriented versus object-oriented design."

Yeah this is pretty much the key.
In "typical OOP" systems you often have no idea what the flow of control is (e.g. loop over a collection of interfaces, call a virtual, which calls a virtual, which calls a virtual), and if you don't know what the flow of control is, then you can't know what the data access patterns are, which means you can't safely schedule anything. Graduate programmers try to fix this mess by introducing locks to the code to make it "thread safe", which is not seeing the forest for the trees. An alternative solution is to rearchitect the flow of control so that things don't have to be "thread safe" -- if you know that only one thread is writing to an object at a time, no locks are required.

Instead of typical OOP bullshit, you should try to write systems that are fairly contained when it comes to both flow and data access, and which are very explicit in what all of their inputs and outputs are (and thus explicit in what their side effects are). It then becomes extremely simple to figure out how you can run these systems on a parallel machine.
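A trivial sketch of what "explicit inputs and outputs" can look like in practice (hypothetical types and names): the function signature itself documents exactly what gets read and what gets written, so a scheduler -- or a human -- can see at a glance which systems may run in parallel.

[code]
#include <cstddef>
#include <vector>

struct Velocity  { float x, y, z; };
struct Transform { float x, y, z; };

// Reads velocities (const), writes transforms, touches nothing else.
// Any system that doesn't write these two arrays can safely run alongside it.
void integrateMovement(const std::vector<Velocity>& velocities,   // input, read-only
                       std::vector<Transform>& transforms,        // output, written
                       float dt)
{
    for (size_t i = 0; i < transforms.size(); ++i) {
        transforms[i].x += velocities[i].x * dt;
        transforms[i].y += velocities[i].y * dt;
        transforms[i].z += velocities[i].z * dt;
    }
}
[/code]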


To illustrate an extreme - on the PS3's SPU cores, you can't directly access any RAM. Instead, you have to basically memcpy a block of RAM to the SPU, do some calculations, and then memcpy the results back to RAM.
We supported this architecture by forcing our jobs to pre-declare a list of pointer+size pairs (e.g. struct Range { void* where; size_t size; bool writable; }; vector<Range> inputsAndOutputs;). The job system would then copy those regions of RAM to the SPU before the job ran, and after the job had finished it copied the writable ranges back to RAM. If you tried to access a RAM range that you hadn't declared first, your code wouldn't work.
At first, this seemed like a massive pain in the ass. Why would anyone want this!?
But... it actually forces a great deal of discipline onto the team, where people start thinking about what the minimal data requirements for their update code is, how the data can be arranged to support parallelism, and how data-dependency chains can be simplified.

It also gives you some neat debugging tools on the PC -- you can emulate a SPU on the PC by doing those unnecessary memcpy's into a temporary working region before running a job, and memcpying the results back into place. If the job is written properly, this shouldn't affect the results (if it does, someone's got an accidental/undeclared data dependency).
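A rough sketch of that pre-declared-ranges idea, including the PC-side emulation trick, might look like this. The Range struct comes straight from the description above; everything else (Job, runJobEmulated) is a hypothetical stand-in for the real system.

[code]
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Each job pre-declares every region of RAM it will touch.
struct Range { void* where; size_t size; bool writable; };

struct Job {
    std::vector<Range> inputsAndOutputs;    // declared up front
    std::function<void(void** local)> run;  // sees only local copies, indexed
                                            // in declaration order
};

// PC-side emulation of the SPU model: copy the declared ranges into scratch
// memory, run the job against the copies, then copy the writable ranges back.
// If results change under this mode, the job has an undeclared data dependency.
void runJobEmulated(Job& job)
{
    std::vector<std::vector<unsigned char>> scratch;
    std::vector<void*> local;
    for (const Range& r : job.inputsAndOutputs) {
        scratch.emplace_back(r.size);
        std::memcpy(scratch.back().data(), r.where, r.size);   // "DMA in"
        local.push_back(scratch.back().data());
    }

    job.run(local.data());                                     // job runs on the copies

    for (size_t i = 0; i < job.inputsAndOutputs.size(); ++i) {
        const Range& r = job.inputsAndOutputs[i];
        if (r.writable)
            std::memcpy(r.where, local[i], r.size);            // "DMA out"
    }
}
[/code]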
I think that game programmers the world over should thank the PS3 for forcing us to become better software engineers :)

And as All8 said, this should actually be very familiar to any graphics programmer. On the GPU, we've usually been restricted from accessing RAM whenever we want to -- instead, we've been forced to explicitly declare all of the input/output resources that are required. Even though we usually feed the GPU a serial stream of draw/compute commands, it manages to be one of the most parallel/multi-threaded parts of any game :wink:

1) Make sure that the task is complex enough to be worth the effort.
2) Make sure that data sharing in the task is low enough that synchronization won't cancel out any performance gains.
3) Atomics are very useful sometimes. Use them to avoid having a syscall on a mutex when you know the protected code will execute faster than the syscall. Use them to implement complex locking structures where system locks aren't suitable. Don't use them to avoid acquiring a mutex to protect data for a complex task. And question why you're using a mutex for a complex task anyway (see the sketch after this list).
4) Threads are awesome, locks are not. Take full advantage of the former, avoid having to use the latter.
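As a small illustration of point 3 (hypothetical names throughout): when the "critical section" is a single counter bump, an atomic does the job without ever touching a kernel lock, whereas a genuinely complex task is usually better served by just taking the mutex -- or by restructuring so the shared mutable state goes away entirely.

[code]
#include <atomic>
#include <mutex>

// Tiny critical section: a plain atomic increment, no mutex needed.
std::atomic<int> jobsCompleted{0};

void onJobFinished() {
    jobsCompleted.fetch_add(1, std::memory_order_relaxed);
}

// Complex task: don't try to be clever with atomics here -- take the lock.
struct World {
    void rebuildSpatialIndex() { /* hypothetical heavyweight, multi-step work */ }
};

std::mutex worldMutex;

void complexWorldEdit(World& world) {
    std::lock_guard<std::mutex> lock(worldMutex);
    world.rebuildSpatialIndex();
}
[/code]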

Nice thread. I'll write a small multicore tech demo named Sussex over the following weeks, so this post arrived at the perfect time for me :).

@spinningcubes | Blog: Spinningcubes.com | Gamedev notes: GameDev Pensieve | Spinningcubes on Youtube

