Sign in to follow this  
Patrick Niedzielski

A New Threading System for Game Engines...Your Thoughts?

Recommended Posts

I have come up with a new threading system for a Game Engine. Basically, it allows rendering and updates to occur simultaneously, at the cost of more memory usage. It is quite a complex system, but I want to implement it in my next game engine. Here's the informal spec: Part 1 Part 2 I'd love to here your comments, either here on on the posts.

Share this post


Link to post
Share on other sites
Multi threadding is one of those things that is just about to jump into the world of game engine design. For the most part, previous engines were limited to a single thread only because the rendering API was not able to handle input from multiple threads (DX11 handles it best, as far as I know). I'm curious about your engine though. I'll have to re-read it later on in detail...

gotta go..

EDIT: I read your idea and you pretty much discovered how modern engines work.

Tieing everything together is the biggest problem with using threads. You want ot have a smooth, well designed system so that it does not deadlock on you or wait on critical sections too much. I'm actually playing around with this concept. I ended up developing a thread pool / job system where by you can run any task asynchronously.

As an example, I took my Scene Graph system and split it into three parts - the transformation graph, the scene partitioning tree, and the render graph. The transformation graph is rather simple and creates a world view matrix for each object in my list. This is rather trivial and does not cost too much. The scene partition is a bit more intensive computationally where by I have to calculate many collisions with the frustum. Lastly is a Render Graph. This interfaces with the API and must run on the main thread. This cannot be asynchronous because it must wait until the Transform graph and scene partitioning is finished.

Using my Thread Pool I can run two other threads in conjunction with my main process thread. I add jobs for both the transform graph and the scene partitioning system and wait until both are finished. The threads are not destroyed, rather, they are flagged as inactive and put back into the thread pool. Creating and destroying threads is rather expensive. Once those two threads are done, I continue with my main thread and run the render graph.

From this I learned that there is no way I can effectively run everything I want without hanging one of my threads (main thread). This is the biggest challenge in multi-threading. Things like Networking, Sound, etc are very easy to handle on separate threads because they are rather independent of the main system. You can send events to those subsystems fairly easily and with low risk.

However, from what I gathered from your blog is that you are intending to put every subsystem on its own thread. This might actually be a detriment to your engine then a benefit. Take for example a modern CPU with 4 cores. It can run 4 separate tasks without time slicing. But is only the case if you have ONLY 4 threads running. You have to take into account that you are also contending with system processes and services. Once your 4-core CPU is bogged down, Windows will start to time slice all operations so that the processes get equal amount of time on the hardware.

So, when you design your engine, think about how many threads you are creating and how many of those threads will be chugging all the time. Sound might be something that is low on the priority list, as in Input. But you must also think about the amount of processing it will take to send messages back and forth from all your threads.

Anyway, what I'm trying to say is that there is no 'one' way to design a multi-threaded engine. You will need to experiment with multi-threading. Take a look at my blog entry regarding Thread Pools and their results.

Czesc.

[Edited by - RealMarkP on December 22, 2009 11:14:21 PM]

Share this post


Link to post
Share on other sites
Yours is a very interesting design for a game engine, but leaves me with the question of how do you manage the various threads writing their output to the scene graph? Is the scene graph implemented with locks, or does your main thread bring all the data together?

I haven't looked at the source code, which I will later today.

I have also been planning to create a Thread Manager, which would allow smart placement of threads based on hints the library user gives. For example, it would try to place Sound, Input, and (perhaps, depending on the game) Networking on the same processor in a 4-core system. This leaves a core for rendering, a core for "collidables", and a core for effects.

On a 2-core system, the threads for each subsection would be a larger problem. Sound, Input, Networking, and Rendering could go on the same core (which may impact rendering performance), and effects and "collidables" on the core. Not the greatest solution.

Of course, the OS can deny the requests, but there is little that can be done about that.

Because the number of cores on the system won't change at runtime, it mostly matters how the threads are configured at the start. I do see your point, though, and depending on the number of threads on a single core, this will slow the core's threads down considerably when in a real-time system.

Share this post


Link to post
Share on other sites
My Thread Pool uses Critical sections to lock data manipulation and access. Out of the various locking mechanisms, it is the fastest way to lock threads. I would not worry too much about matching the amount of threads you have to the amount of Cores available to you. Basically the system will time slice your threads so that they get equal time on the CPU.

Some of your subsystems do not necessarily need to be on a separate thread. If you are getting your input from the windows message pump, then there is no point of putting that on a separate thread because it is a polling system. You might as well put the polling mechanism in your rendering loop. I would suggest that your engine have a rather small footprint in terms of threads because you never know if the client application wants to use mutli-threading.

For example, The game I'm designing at the moment has three threads:
1.) My renderer thread which does user input, updating of the scene/animation and rendering.
2.) My AI component which calculates the state of the map and, using a genetic algorithm and some path finding, posts the result into the animation manager on the first thread. This is blocking and only runs when triggered.
3.) Lastly, my networking component which handles communication between many players. This is blocking, so it does not use a thread all the time.

Everything is tied together using an event system that is thread safe. It delivers events from the various threads to each other. In essence my engine only runs on one thread. The Game runs on two more.

Say for example, your engine already used 4 threads and I wanted to build my application on top of your engine. I would be adding on 2 more threads for a total of 6 threads. On a 2 or 4 core system, there would be a ton of time slicing involved. The locking between the threads would add a performance hit to the game itself.

What I would recommend (and I will be implementing this in my engine as well) is to have a Thread Pool running at all times. Allocate X number of threads that is equal to the amount of cores that are available. The threads will be idle until called upon. Instead of having threads constantly running, create small batches (I call them 'Jobs') that you can throw into the Thread Pool on demand. For example, before you render objects, split up your scene computations into groups. One job can be a transform graph that will loop through a tree of objects and apply transformations. Another job might be frustum culling which also loops through all the objects, checking their AABB against the camera's Frustum (insert optimized spatial partition here). While these two threads are chugging along, you can do something on your main thread, like read out new keyboard and mouse input from the message pump or update a particle system. All this is done before rendering.

You will still need a dedicated thread for networking or sound though. However, they can be event based so that they block until events are triggered.



Reference Material:
White Paper On Thread Pools.
Intel's Article on Multi Core Engines. (PDF).

Share this post


Link to post
Share on other sites
Maybe I'm just old school but what benefit do many games see from multi threading anyway? You already have a GPU for graphics, an MPU for sound, and a CPU for everything else? I imagine some might say for the purposes of extreme AI but that would be extreme indeed! I can honestly say that I have never developed a program that would benefit from multi-threading. I'm sure some of you will disagree.

Share this post


Link to post
Share on other sites
Quote:
Original post by lakmir0
Maybe I'm just old school but what benefit do many games see from multi threading anyway?


I think that the provided links gave already one answer for that. Separating different tasks to separate threads will (if properly implemented) use the given resources better which will result as increased performance if multiple cores are available.

It is possible that in the future the speed of the cores won't increase too much (although other optimizations will increase the performance). However, it is likely that more and more cores will be available. In order to take the advantage of them, one must implement an efficient method of multithreading.

Quote:

I can honestly say that I have never developed a program that would benefit from multi-threading. I'm sure some of you will disagree.


You could look into image processing as an example. It is quite easy to understand in that field why multi-core CPUs bring real performance boost.

Cheers!

Share this post


Link to post
Share on other sites
Quote:
Original post by lakmir0
Maybe I'm just old school but what benefit do many games see from multi threading anyway? You already have a GPU for graphics, an MPU for sound, and a
CPU for everything else?


Well, to be honest 'many' games won't see much of an advantage simply because they aren't doing enough to stress the system on a single core + gpu, never mind consume multiple cores.

However, as the complexity of the game increases so the benifits of multiple threads increases. Before I get to that however there is one thing which needs mentioning; sound is software driven these days.

With Vista and Win7 unless you are using OpenAL on Creative hardware then you are going to be using a software sound path, which right away is probably going to spawn you at least one extra thread (FMOD for example spawns at least one, maybe more if you are doing streaming).

At this point you get into the area of 'complex games'; as the number of things you want to throw around increase so your processing and draw command submission times increase.

The "simplest" way to get some more headroom is to split into two threads; one for update and one for rendering (this is ignoring minor threads for sound or networking as they rarely take too much time). This lets you update the scene on one core while drawing on the other. This means that you can limit your world updates to say 50 updates per second (fixed world timestep is pretty much a must for physics and network based games, it also makes the maths easier for other games) while your rendering thread can render as often as it likes, interpolating between world steps.

The problem with this is that it isn't scaleable; while it works fine for something like an AMD X2, Athlon 2 or Core 2 Duo cpu as soon as you hit the tri/quad core cpus you've stopped using half the power again and with 4 core/8 thread chips about the problem gets worse (and will get worse again soon as 6 core/12 thread and 6 core cpus are due next year).

Skimming the thread it does seem that RealMarkP is talking about the 'best' way to do things; tasks.

The idea being that you decide up front how many threads you want then, as you progress through your logic, you hand off tasks to these threads to carry out. The main advantages here are scalibility and the fact that you never kill off threads (starting a thread is a heavy weight task).

Scalibility is the big one as it lets you expand to take advantage of the power there (although you'd clearly have to scale things back a bit for less cores, but this is more a numbers rather than logic thing, your game logic would stay the same) just means you need to ensure that parrallel tasks and there dependancies are dealt with in the correct manner.

Given the correct API this system is very easy to get up and running with, a few months back I was playing with Intel's threaded building blocks (TBB) library and had a (very basic, logic only) particle system setup which used all 8 threads on my i7 to process 1,000,000 particles in <4ms which gave me an effective work time of 32ms taken, far far too much for 30fps, never mind a 60fps target rate (granted, I was using some SSE and had thrown OOP partly out the window but TBB made it very easy to just throw each processing segment at the threads as tasks).

So, as I started with, for many games there is no need, or maybe at most they might want to seperate update and rendering for a simple game. Once you get complex however being able to split work across threads is pretty much a must if you want that power.

[Edited by - phantom on December 28, 2009 8:56:38 AM]

Share this post


Link to post
Share on other sites
Quote:

Maybe I'm just old school but what benefit do many games see from multi threading anyway? You already have a GPU for graphics, an MPU for sound, and a
CPU for everything else?

XBox360 - 3 cores, 6 hardware threads.
PS3 - 1 core + 6 spus, 8 hardware threads.
My PC - 4 core, 4 hardware threads.

So, anyone trying to take full advantage of any of those machines needs to be running 4+ threads to use all that processing power.

Take a look at something like Supreme Commander. It has SOOOO much going on I'm used to seeing the old-school "cable only!!" games replaced with "quad-core only!!".

Quote:

I'd love to here your comments, either here on on the posts.

I'd second the thread pool idea. You can even marshal your task based threading in as just large jobs to the thread pool. I was just reading (from another multi-threading post here) about Intel's Thread Building Blocks tools that help you deal with this in a nice fine-grained way.

Also note that depending on how compute bound your code is, and how well organized your data is, you could be wasting a tonne of cycles just waiting for memory. If you have a lot of tasks that consume lots of data, you can also run into the bottleneck of memory bandwidth and cache thrashing. You should be considering your memory layout as well as the code-to-thread layout.
Check out Antheus' post here.

Also, consider more dependencies than just the graphics-entity relationship. You might need a lot more buffers than you think considering how serial the relationship of many subsystems is. Ie, the AI needs to pick an animation, the animation has to be done for the graphics, the graphics then skins the model. Animation can take a LOT of cpu and is a trivially parallelizable task considering independant animations for different characters can all be done in parallel. You therefore might want to double buffer the input queue for the anim system as well. Same for many other subsystems.

[Edited by - KulSeran on December 28, 2009 11:58:50 AM]

Share this post


Link to post
Share on other sites
Another aspect you might want to consider is that synchronization requires locks. These locks can lock an object for an extended period of time which causes other threads to wait for the object to be free. While those threads wait, you essentially stall the system. Mind you, when I say extended period of time, I mean milliseconds and not minutes. But over time, these stalls can add up to something significant which can lower performance instead of increasing it.

Share this post


Link to post
Share on other sites
Locks can be avoided if you setup your architecture in certain ways.

I don't claim this is the best method, I've not done huge tests on it, however I'm a fan of doing two passes over the objects.

The first pass updates an internal state which can read from other objects external states. The second is a 'sync' step where objects copy their internal state to an externally visable state for others to use on the next pass.

The downside of this is that it requires two passes over the components thus costs you memory bandwidth; however it does remove stalling of threads due to locks and other potential memory hazards along the way.

Share this post


Link to post
Share on other sites
Also, what would happen when a job asks for a thread and none is available? Or if a job with a higher priority is constantly using the thread making the jobs with lower priority not having always a chance of getting their own?

Just curious. Merry Christmas!

Share this post


Link to post
Share on other sites
Quote:

These locks can lock an object for an extended period of time which causes other threads to wait for the object to be free.

It also means that you have to be careful to keep locks for as short a time as possible. It is possible to starve a thread because one holds the lock for too long, while the other holds it only for a moment.


while(1) while(1)
{ {
..stuff.. lock()
lock() ..stuff..
..single thing.. ..more stuff..
unlock() unlock()
..stuff.. Sleep(1)
} }

If thread A doesn't get time during the short moment when thread B is sleeping, then thread A can starve for a long time.

Quote:

Also, what would happen when a job asks for a thread and none is available?

Actually, you'd go the other way around. All the threads look at the queue of jobs that are available, and through whatever priority system you have, eat up the next job. As long as you do your sync points right, you should never end up with jobs that never get time. Since everything will eventually require something else to finish before it submits more jobs.

This isn't a random assortment of tasks, but a game. Design the system to take into account the usage. A background task like streaming in level data, or a networking task, or whatever asynchronous task might just need a thread of its own, outside the pool. The thread pool is good for scheduling lots of little assorted jobs that happen to lots of items in parallel. Like dealing out 200 entity updates over 4 threads, or 200 animation blends, or 2000 culling tests. You'd use it for anything where you have some trivially parallelizable data that you need split across multiple cores. You can make it work on larger jobs as well, but the overall idea is that the sources and sinks for jobs are well known in advance, just the number of jobs and the time each job takes can vary. But, each job has a defined start and end point, so it can't 'constantly' use up one thread.

Share this post


Link to post
Share on other sites
My, I'm sorry it took me so long to reply. Thanks for the input.

Putting the input routines in with my window procedure would not work, in my case, because I am not guaranteed to have one. I tried to stay as cross-platform as possible here. On a Windows only solution, this would be optimal.

However, there is are two problems with the design of thread pools that seems to have been overlooked. With this design, if I understand it correctly, you have one thread per processor, so there is no thread switching on a core. However, there are other programs running during the game session. Some players play music, have antivirus, even use an Operating System while running the game (Imagine that! ;). All these programs run on (at least) one thread, so there is still thread switching going on. Granted, there are fewer threads competing on the same core, but this cannot eliminate the problem entirely.

Perhaps the more pressing problem is that processor affinity is not guaranteed to be acknowledged. Similarly to how a compiler does not have to make C++ function declared with the "inline" keyword actually inline, the OS does not have to (and may very well not) place your threads on the processor you ask. If it finds that one processor is being used heavily, while another is not used at all, it may place two of your threads in the thread pool on the same core. Depending on what jobs are on each of these, this could severely slow down the frame.

Share this post


Link to post
Share on other sites
Quote:

However, there is are two problems with the design of thread pools that seems to have been overlooked. With this design, if I understand it correctly, you have one thread per processor, so there is no thread switching on a core. However, there are other programs running during the game session. Some players play music, have antivirus, even use an Operating System while running the game (Imagine that! ;). All these programs run on (at least) one thread, so there is still thread switching going on. Granted, there are fewer threads competing on the same core, but this cannot eliminate the problem entirely.

That is a problem no matter how you look at it. The idea of thread pools isn't to eliminate that issue. It is to provide the OS with enough CPU bound threads to fill up all the compute time on all the cores. If a core supports 2 or 4 hardware threads, then you'd put 2 or 4 threads on that core. (hyperthreading for instance increases hardware threads per core). You just have to hope that you have enough variation in your thread's processes that it can find instructions to interleave between your threads. AND that you aren't memory bandwidth bound.

In the end, you end up with enough threads doing enough things to keep all the cores busy. As opposed to a non-scalable setup where you only have threads per task(AI, Physics, Networking), and you game requires Exactly that many cores to run optimally, any less and you waste time swapping threads, and any additional cores are not used. Now, this might not be a problem on fixed hardware. If you know you are on the XBox with 3 xeons, 6hw threads. Then you can build specifically for that. But if you want something that runs good on a single core, dual core, quad core, And the soon to be 8 or 16 core machines. Then you need to start with a scalable design.



Quote:

Perhaps the more pressing problem is that processor affinity is not guaranteed to be acknowledged. Similarly to how a compiler does not have to make C++ function declared with the "inline" keyword actually inline, the OS does not have to (and may very well not) place your threads on the processor you ask. If it finds that one processor is being used heavily, while another is not used at all, it may place two of your threads in the thread pool on the same core. Depending on what jobs are on each of these, this could severely slow down the frame.

That is good and bad. Hopefully the OS is smart about the scheduling. Sometimes it isn't. For instance, there is a "core maximizer" tool for Supreme Commander, all it does is sit in the background and shuffle the threads around a little, because sometimes after alt-tabbing, the OS will decided to pair the wrong set of threads together.
But letting the OS swap the threads around at will is often better than tieing down the threads to one particular core. Expecially since the OS abastracts the cores away from you. You don't know what cores exist where, but the OS does, and hopefully makes decisions based on that.
For instance, the origional quad core Intel chips were 2 dual cores stuck together. While the quad core from AMD was 4 cores stuck together. A tiny difference, but affects alot due to how the cache for each setup is different. Same thing happens when you have multiple processors. Hopefully the OS can arrange the threads based on the memory segments that the threads will access, to minimize the processors having to talk to one another. It can't always, and you'd have to help it out. But then you have to know the arrangement of cores in the machine.

Share this post


Link to post
Share on other sites
Quote:
Original post by phantom
Locks can be avoided if you setup your architecture in certain ways.

I don't claim this is the best method, I've not done huge tests on it, however I'm a fan of doing two passes over the objects.

The first pass updates an internal state which can read from other objects external states. The second is a 'sync' step where objects copy their internal state to an externally visable state for others to use on the next pass.

The downside of this is that it requires two passes over the components thus costs you memory bandwidth; however it does remove stalling of threads due to locks and other potential memory hazards along the way.

This cunning plan has been used by graphics renderers since time immemorial. Of course, the 'sync' step is flipping the buffers, so what used to be the 'internal' state is now the external state, and you regenerate the new internal state entirely each time based on buffered data elsewhere. But the principle is the same - you can't afford to have the monitor and the program competing over the same lump of video memory so you give the monitor an old copy to play with while you tinker with the new copy.

It's interesting that what has become common knowledge for graphics doesn't immediately occur to game programmers for the more general case!

Share this post


Link to post
Share on other sites
Quake 3 did something similar to what's posted above. They would dump everything into a "front buffer" which ended up being a triangle soup, shaders and a very basic set of commands while the backend thread was processing the "front buffer" of commands and data. When that frame signalled completion the command buffers and associated structures were swapped and the cycle began all over again. Doom 3 apparently took this one step further but we need to wait for the public source to review that :).

I need to find an article back on how one company broke up their engine into tasks, jobs, etc. since it was an interesting read, but I can't remember the name of the game right now to save my life.

EDIT:

It was Lost Planet, I'll go hunting for the presentation. I think it was at the GDC.

Share this post


Link to post
Share on other sites
"..It's interesting that what has become common knowledge for graphics doesn't immediately occur to game programmers for the more general case!"


While it is true, I believe you have to consider the purpose of a GPU versus a CPU. Programming on a GPU in parallel is thought of 'natively' because a GPU is a very specialized processor that only performs a limited set of specific functions on each 'bit' or pixel of data. Since each function is almost guaranteed to be independent, parallel programming on a GPU becomes much easier.

In contrast, programming on a CPU is much more 'complex', meaning that it is not dedicated to a select set of functions. Since a CPU is far more 'generalized', the coder needs to explicitly establish thread locking / unlocking / scheduling of resources within the CPU. This inherently makes parallel programming on a CPU much more difficult.

Share this post


Link to post
Share on other sites
With regards to the question raised earlier about locking the scene graph - you only need to lock the nodes that are being modified, not the whole thing. So if 4 enemies are updating their visual information in the scene graph, they only need to lock their own nodes, not any other info.

Anytime the rendering thread encounters a locked node when traversing the scene graph, rather than waiting for that node to be released, it can keep an internal list of all nodes it enountered that were locked and revisit them after traversing the rest of the tree (your mutex class would need to support asynchronous/try locks i.e. try to lock it, and return a boolean flag on whether or not it was successful).

Design your locking policy (and the placement of the mutexes themselves) to reduce as much as possible the possibibility of two threads trying to aquire the same mutex (and for those instances where they will, see if there's anyway of having the blocked thread do other work until the mutex is freed up).

Share this post


Link to post
Share on other sites
Quote:
Original post by astralvoid
"..It's interesting that what has become common knowledge for graphics doesn't immediately occur to game programmers for the more general case!"

While it is true, I believe you have to consider the purpose of a GPU versus a CPU. Programming on a GPU in parallel is thought of 'natively' because a GPU is a very specialized processor that only performs a limited set of specific functions on each 'bit' or pixel of data.

But I am talking about a trivial operation (double buffering) that we were using for games long before they did anything for us beyond just showing a frame buffer to the screen. It's nothing to do with GPU operations as such. It's just to do with how to efficiently implement mutual exclusion between the monitor and the game, ie. by having 2 sets of data.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this