A New Threading System for Game Engines...Your Thoughts?

18 comments, last by Kylotan 14 years, 3 months ago
I have come up with a new threading system for a Game Engine. Basically, it allows rendering and updates to occur simultaneously, at the cost of more memory usage. It is quite a complex system, but I want to implement it in my next game engine. Here's the informal spec: Part 1 Part 2 I'd love to hear your comments, either here or on the posts.
Multi-threading is one of those things that is just about to jump into the world of game engine design. For the most part, previous engines were limited to a single thread only because the rendering API was not able to handle input from multiple threads (DX11 handles it best, as far as I know). I'm curious about your engine though. I'll have to re-read it later on in detail...

gotta go..

EDIT: I read your idea and you pretty much discovered how modern engines work.

Tying everything together is the biggest problem with using threads. You want to have a smooth, well-designed system so that it does not deadlock on you or wait on critical sections too much. I'm actually playing around with this concept. I ended up developing a thread pool / job system whereby you can run any task asynchronously.

As an example, I took my Scene Graph system and split it into three parts: the transformation graph, the scene partitioning tree, and the render graph. The transformation graph is rather simple and creates a world-view matrix for each object in my list. This is rather trivial and does not cost too much. The scene partitioning is a bit more computationally intensive, whereby I have to calculate many collisions with the frustum. Lastly is the Render Graph. This interfaces with the API and must run on the main thread. This cannot be asynchronous because it must wait until the transform graph and scene partitioning are finished.

Using my Thread Pool I can run two other threads in conjunction with my main process thread. I add jobs for both the transform graph and the scene partitioning system and wait until both are finished. The threads are not destroyed, rather, they are flagged as inactive and put back into the thread pool. Creating and destroying threads is rather expensive. Once those two threads are done, I continue with my main thread and run the render graph.
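To make the fork/join step concrete, here is a rough sketch of that frame flow, using std::async as a stand-in for my actual thread pool. All the names (BuildWorldMatrices, CullAgainstFrustum, RenderScene) are made up for illustration, not my real classes; both jobs only read the object list and write to their own output, so they can safely run side by side.

```cpp
#include <future>
#include <vector>

struct Matrix4 { float m[16]; };
struct Object  { /* local transform, bounds, ... */ };

void BuildWorldMatrices(const std::vector<Object>& objects, std::vector<Matrix4>& out)    { /* walk the hierarchy, one world matrix per object */ }
void CullAgainstFrustum(const std::vector<Object>& objects, std::vector<size_t>& visible) { /* test AABBs against the camera frustum */ }
void RenderScene(const std::vector<Matrix4>& matrices, const std::vector<size_t>& visible){ /* graphics API calls, main thread only */ }

void Frame(const std::vector<Object>& objects)
{
    std::vector<Matrix4> matrices(objects.size());
    std::vector<size_t>  visible;

    // Fork: kick both jobs off on worker threads.
    auto transformJob = std::async(std::launch::async, BuildWorldMatrices,
                                   std::cref(objects), std::ref(matrices));
    auto cullJob      = std::async(std::launch::async, CullAgainstFrustum,
                                   std::cref(objects), std::ref(visible));

    // ... the main thread is free to do other per-frame work here ...

    // Join: the render graph depends on both results and stays on the main thread.
    transformJob.get();
    cullJob.get();
    RenderScene(matrices, visible);
}
```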

From this I learned that there is no way I can effectively run everything I want without hanging one of my threads (the main thread). This is the biggest challenge in multi-threading. Things like Networking, Sound, etc. are very easy to handle on separate threads because they are rather independent of the main system. You can send events to those subsystems fairly easily and with low risk.

However, from what I gathered from your blog, you are intending to put every subsystem on its own thread. This might actually be a detriment to your engine rather than a benefit. Take for example a modern CPU with 4 cores. It can run 4 separate tasks without time slicing. But that is only the case if you have ONLY 4 threads running. You have to take into account that you are also contending with system processes and services. Once your 4-core CPU is bogged down, Windows will start to time slice all operations so that the processes get an equal amount of time on the hardware.

So, when you design your engine, think about how many threads you are creating and how many of those threads will be chugging all the time. Sound might be something that is low on the priority list, as is Input. But you must also think about the amount of processing it will take to send messages back and forth between all your threads.

Anyway, what I'm trying to say is that there is no 'one' way to design a multi-threaded engine. You will need to experiment with multi-threading. Take a look at my blog entry regarding Thread Pools and their results.

Cheers.

[Edited by - RealMarkP on December 22, 2009 11:14:21 PM]
------------Anything prior to 9am should be illegal.
Yours is a very interesting design for a game engine, but it leaves me with the question of how you manage the various threads writing their output to the scene graph. Is the scene graph implemented with locks, or does your main thread bring all the data together?

I haven't looked at the source code yet, but I will later today.

I have also been planning to create a Thread Manager, which would allow smart placement of threads based on hints the library user gives. For example, it would try to place Sound, Input, and (perhaps, depending on the game) Networking on the same processor in a 4-core system. This leaves a core for rendering, a core for "collidables", and a core for effects.

On a 2-core system, the threads for each subsection would be a larger problem. Sound, Input, Networking, and Rendering could go on the same core (which may impact rendering performance), and effects and "collidables" on the other core. Not the greatest solution.

Of course, the OS can deny the requests, but there is little that can be done about that.
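On Windows, the placement part of that Thread Manager would probably boil down to something like this, just a sketch; the helper name and the choice of which thread goes on which core are arbitrary examples, not anything final.

```cpp
#include <windows.h>

// Pin the calling thread to a single core (0-based index).
// The OS treats the mask as a restriction, not a guarantee of exclusive use,
// so other processes can still be scheduled on the same core.
bool PinCurrentThreadToCore(unsigned coreIndex)
{
    DWORD_PTR mask = DWORD_PTR(1) << coreIndex;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}

// Example hint: call PinCurrentThreadToCore(0) from the Sound/Input/Networking
// threads, PinCurrentThreadToCore(1) from the rendering thread, and so on.
```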

Because the number of cores on the system won't change at runtime, it mostly matters how the threads are configured at the start. I do see your point, though: depending on the number of threads sharing a single core, this will slow that core's threads down considerably, which matters in a real-time system.
My Thread Pool uses critical sections to lock data manipulation and access. Of the various Win32 locking mechanisms, they are among the fastest because they stay in user mode when there is no contention. I would not worry too much about matching the number of threads you have to the number of cores available to you. Basically, the system will time slice your threads so that they get equal time on the CPU.
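For reference, the locking in the pool is nothing fancy, roughly along these lines. This is a simplified sketch; JobQueue and Job are illustrative names, not my actual classes.

```cpp
#include <windows.h>
#include <deque>

struct Job { void (*run)(void*); void* data; };

class JobQueue
{
public:
    JobQueue()  { InitializeCriticalSection(&cs_); }
    ~JobQueue() { DeleteCriticalSection(&cs_); }

    void Push(const Job& job)
    {
        EnterCriticalSection(&cs_);   // cheap user-mode lock when uncontended
        jobs_.push_back(job);
        LeaveCriticalSection(&cs_);
    }

    bool TryPop(Job& out)
    {
        EnterCriticalSection(&cs_);
        bool any = !jobs_.empty();
        if (any) { out = jobs_.front(); jobs_.pop_front(); }
        LeaveCriticalSection(&cs_);
        return any;
    }

private:
    CRITICAL_SECTION cs_;
    std::deque<Job> jobs_;
};
```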

Some of your subsystems do not necessarily need to be on a separate thread. If you are getting your input from the Windows message pump, then there is no point in putting that on a separate thread because it is a polling system. You might as well put the polling mechanism in your rendering loop. I would suggest that your engine have a rather small footprint in terms of threads, because you never know if the client application wants to use multi-threading.

For example, the game I'm designing at the moment has three threads:
1.) My renderer thread, which handles user input, scene/animation updates, and rendering.
2.) My AI component, which calculates the state of the map and, using a genetic algorithm and some path finding, posts the result into the animation manager on the first thread. This is blocking and only runs when triggered.
3.) Lastly, my networking component, which handles communication between many players. This is blocking, so it does not keep its thread busy all the time.

Everything is tied together using an event system that is thread-safe; it delivers events between the various threads. In essence, my engine only runs on one thread, and the game runs on two more.
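One possible shape for that kind of thread-safe event system is a mutex-protected queue with a condition variable, so consumer threads can block until an event arrives. Event and EventQueue are illustrative names here, not my actual classes.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

struct Event { int type; /* payload ... */ };

class EventQueue
{
public:
    void Post(const Event& e)          // called by any producer thread
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            events_.push(e);
        }
        cv_.notify_one();              // wake a blocked consumer (e.g. the AI or network thread)
    }

    Event Wait()                       // blocks until an event is available
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !events_.empty(); });
        Event e = events_.front();
        events_.pop();
        return e;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Event> events_;
};
```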

Say, for example, your engine already used 4 threads and I wanted to build my application on top of it. I would be adding 2 more threads for a total of 6. On a 2- or 4-core system, there would be a ton of time slicing involved, and the locking between the threads would add a performance hit to the game itself.

What I would recommend (and I will be implementing this in my engine as well) is to have a Thread Pool running at all times. Allocate a number of threads equal to the number of cores available. The threads will be idle until called upon. Instead of having threads constantly running, create small batches of work (I call them 'Jobs') that you can throw into the Thread Pool on demand.

For example, before you render objects, split up your scene computations into groups. One job can be a transform graph that loops through a tree of objects and applies transformations. Another job might be frustum culling, which also loops through all the objects, checking their AABBs against the camera's frustum (insert optimized spatial partition here). While these two threads are chugging along, you can do something on your main thread, like read new keyboard and mouse input from the message pump or update a particle system. All this is done before rendering.
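The pool itself does not have to be complicated. Something along these lines is enough to get started; this is a simplified sketch using std::thread and std::function, not my exact implementation.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(unsigned count = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void AddJob(std::function<void()> job)   // e.g. the transform graph or frustum culling
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });  // idle until called upon
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```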

You will still need a dedicated thread for networking or sound, though. However, they can be event-based so that they block until events are triggered.



Reference Material:
White Paper On Thread Pools.
Intel's Article on Multi Core Engines. (PDF).
------------Anything prior to 9am should be illegal.
Maybe I'm just old school, but what benefit do many games see from multi-threading anyway? You already have a GPU for graphics, an MPU for sound, and a CPU for everything else. I imagine some might say for the purposes of extreme AI, but that would be extreme indeed! I can honestly say that I have never developed a program that would benefit from multi-threading. I'm sure some of you will disagree.
Quote:Original post by lakmir0
Maybe I'm just old school, but what benefit do many games see from multi-threading anyway?


I think the provided links already gave one answer to that. Separating different tasks onto separate threads will (if properly implemented) use the given resources better, which will result in increased performance if multiple cores are available.

It is possible that in the future the speed of individual cores won't increase much (although other optimizations will increase performance). However, it is likely that more and more cores will be available. In order to take advantage of them, one must implement an efficient method of multithreading.

Quote:
I can honestly say that I have never developed a program that would benefit from multi-threading. I'm sure some of you will disagree.


You could look into image processing as an example. It is quite easy to understand in that field why multi-core CPUs bring a real performance boost.

Cheers!

Quote:Original post by lakmir0
Maybe I'm just old school, but what benefit do many games see from multi-threading anyway? You already have a GPU for graphics, an MPU for sound, and a CPU for everything else.


Well, to be honest 'many' games won't see much of an advantage simply because they aren't doing enough to stress the system on a single core + gpu, never mind consume multiple cores.

However, as the complexity of the game increases, so the benefits of multiple threads increase. Before I get to that, however, there is one thing which needs mentioning: sound is software-driven these days.

With Vista and Win7, unless you are using OpenAL on Creative hardware, you are going to be using a software sound path, which right away is probably going to spawn at least one extra thread (FMOD, for example, spawns at least one, maybe more if you are doing streaming).

At this point you get into the area of 'complex games'; as the number of things you want to throw around increases, so your processing and draw-command submission times increase.

The "simplest" way to get some more headroom is to split into two threads; one for update and one for rendering (this is ignoring minor threads for sound or networking as they rarely take too much time). This lets you update the scene on one core while drawing on the other. This means that you can limit your world updates to say 50 updates per second (fixed world timestep is pretty much a must for physics and network based games, it also makes the maths easier for other games) while your rendering thread can render as often as it likes, interpolating between world steps.

The problem with this is that it isn't scalable; while it works fine for something like an AMD X2, Athlon II or Core 2 Duo CPU, as soon as you hit the tri/quad-core CPUs you've stopped using half the power again, and with 4-core/8-thread chips about, the problem gets worse (and will get worse again soon, as 6-core/12-thread and 6-core CPUs are due next year).

Skimming the thread, it does seem that RealMarkP is talking about the 'best' way to do things: tasks.

The idea being that you decide up front how many threads you want and then, as you progress through your logic, you hand off tasks to these threads to carry out. The main advantages here are scalability and the fact that you never kill off threads (starting a thread is a heavyweight operation).

Scalability is the big one, as it lets you expand to take advantage of the power that is there (although you'd clearly have to scale things back a bit for fewer cores, that is more a numbers thing than a logic thing; your game logic would stay the same). It just means you need to ensure that parallel tasks and their dependencies are dealt with in the correct manner.

Given the correct API this system is very easy to get up and running with. A few months back I was playing with Intel's Threading Building Blocks (TBB) library and had a (very basic, logic only) particle system set up which used all 8 threads on my i7 to process 1,000,000 particles in <4ms. That works out to an effective 32ms of work in total, far far too much to fit in a frame on a single core at 30fps, never mind a 60fps target rate (granted, I was using some SSE and had thrown OOP partly out the window, but TBB made it very easy to just throw each processing segment at the threads as tasks).
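Not my actual code, but a minimal example of the TBB approach I'm describing: hand each slice of a particle update to the task scheduler with tbb::parallel_for and let it spread the work across the available hardware threads. The Particle fields are illustrative.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

void UpdateParticles(std::vector<Particle>& particles, float dt)
{
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, particles.size()),
        [&](const tbb::blocked_range<size_t>& range)
        {
            // Each task processes an independent slice, so no locking is needed.
            for (size_t i = range.begin(); i != range.end(); ++i)
            {
                Particle& p = particles[i];
                p.vy -= 9.81f * dt;   // gravity
                p.x += p.vx * dt;
                p.y += p.vy * dt;
                p.z += p.vz * dt;
            }
        });
}
```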

So, as I started with, for many games there is no need, or maybe at most they might want to separate update and rendering for a simple game. Once you get complex, however, being able to split work across threads is pretty much a must if you want that power.

[Edited by - phantom on December 28, 2009 8:56:38 AM]
Quote:
Maybe I'm just old school, but what benefit do many games see from multi-threading anyway? You already have a GPU for graphics, an MPU for sound, and a CPU for everything else.

Xbox 360 - 3 cores, 6 hardware threads.
PS3 - 1 core + 6 SPUs, 8 hardware threads.
My PC - 4 cores, 4 hardware threads.

So, anyone trying to take full advantage of any of those machines needs to be running 4+ threads to use all that processing power.

Take a look at something like Supreme Commander. It has SOOOO much going on that I'm getting used to seeing the old-school "cable only!!" requirement replaced with "quad-core only!!".

Quote:
I'd love to hear your comments, either here or on the posts.

I'd second the thread pool idea. You can even marshal your task-based threading in as just large jobs to the thread pool. I was just reading (from another multi-threading post here) about Intel's Threading Building Blocks tools that help you deal with this in a nice fine-grained way.

Also note that depending on how compute-bound your code is, and how well organized your data is, you could be wasting a tonne of cycles just waiting for memory. If you have a lot of tasks that consume lots of data, you can also run into the bottleneck of memory bandwidth and cache thrashing. You should be considering your memory layout as well as the code-to-thread layout.
Check out Antheus' post here.
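As an illustration only (not from any of the linked posts), here is the same particle data laid out two ways. The structure-of-arrays form keeps the fields a job actually touches contiguous, which wastes far less bandwidth and cache when several threads stream through the data.

```cpp
#include <vector>

struct ParticleAoS            // array-of-structures: hot and cold data interleaved
{
    float position[3];
    float velocity[3];
    float color[4];           // untouched by the physics job, but pulled into cache anyway
    float lifetime;
};

struct ParticlesSoA           // structure-of-arrays: each job streams only what it needs
{
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> colorRGBA;      // read only by the renderer
    std::vector<float> lifetime;
};
```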

Also, consider more dependencies than just the graphics-entity relationship. You might need a lot more buffers than you think, considering how serial the relationship between many subsystems is. I.e., the AI needs to pick an animation, the animation has to be done for the graphics, and the graphics then skins the model. Animation can take a LOT of CPU and is a trivially parallelizable task, considering independent animations for different characters can all be done in parallel. You therefore might want to double-buffer the input queue for the anim system as well. Same for many other subsystems.
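A sketch of that double-buffered input queue, under the assumption that producers (AI, gameplay) push requests for the next frame while the animation system consumes everything queued during the previous frame. AnimRequest and DoubleBufferedQueue are illustrative names.

```cpp
#include <mutex>
#include <vector>

struct AnimRequest { int entityId; int animationId; };

class DoubleBufferedQueue
{
public:
    void Push(const AnimRequest& r)        // called by AI/gameplay threads
    {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[writeIndex_].push_back(r);
    }

    // Called once per frame by the animation system: swap buffers and get
    // exclusive access to everything queued last frame, no further locking needed.
    std::vector<AnimRequest>& SwapAndRead()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        int readIndex = writeIndex_;
        writeIndex_ = 1 - writeIndex_;
        buffers_[writeIndex_].clear();
        return buffers_[readIndex];
    }

private:
    std::mutex mutex_;
    std::vector<AnimRequest> buffers_[2];
    int writeIndex_ = 0;
};
```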

[Edited by - KulSeran on December 28, 2009 11:58:50 AM]
Another aspect you might want to consider is that synchronization requires locks. These locks can hold an object for an extended period of time, which causes other threads to wait for the object to be free. While those threads wait, you essentially stall the system. Mind you, when I say an extended period of time, I mean milliseconds and not minutes. But over time, these stalls can add up to something significant, which can lower performance instead of increasing it.
------------Anything prior to 9am should be illegal.
Locks can be avoided if you set up your architecture in certain ways.

I don't claim this is the best method, and I've not done huge tests on it, but I'm a fan of doing two passes over the objects.

The first pass updates an internal state, which can read from other objects' external states. The second is a 'sync' step where objects copy their internal state to an externally visible state for others to use on the next pass.

The downside of this is that it requires two passes over the components and thus costs you memory bandwidth; however, it does remove stalling of threads due to locks and other potential memory hazards along the way.
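A bare-bones version of the two-pass scheme, with illustrative names and stand-in logic: every component first computes a new internal state while reading only other components' published external state, then a separate sync pass publishes it. Because nothing reads and writes the same data in the same pass, no locks are needed, and each pass can be split into parallel jobs.

```cpp
#include <vector>

struct State { float x = 0.0f; /* ... */ };

struct Component
{
    State external;   // visible to everyone, only changes during Sync
    State internal;   // scratch result of the current update pass

    void Update(const std::vector<Component>& all)
    {
        // Read other components' *external* state only; write our *internal* state.
        internal.x = external.x;
        for (const Component& other : all)
            internal.x += 0.001f * other.external.x;   // stand-in for real logic
    }

    void Sync() { external = internal; }   // publish for the next pass
};

void Step(std::vector<Component>& components)
{
    // Pass 1: updates are independent of each other, so they can be farmed out as jobs.
    for (Component& c : components) c.Update(components);
    // Pass 2: publish the results; also trivially parallel.
    for (Component& c : components) c.Sync();
}
```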

This topic is closed to new replies.
