Designing an efficient multithreading architecture

14 comments, last by Jason Z 10 years, 3 months ago

I personally use my "main thread" as a worker thread too. Whenever any of my threads has nothing to do (e.g. it has to wait for the results of another thread before it can continue), it makes itself useful by popping jobs from the job queue and doing some work. I basically have a "WaitFor..." busy loop that continually checks whether the exit condition has been met, otherwise tries to run a job, and after enough tries with no jobs in the queue, yields or sleeps.
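That "help while waiting" loop can be sketched roughly as follows. This is an illustrative sketch only — `JobQueue` and `wait_for` are made-up names, and the backoff thresholds are arbitrary:

```cpp
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct JobQueue {
    std::mutex m;
    std::deque<std::function<void()>> jobs;

    void push(std::function<void()> j) {
        std::lock_guard<std::mutex> lock(m);
        jobs.push_back(std::move(j));
    }

    // Pops and runs one job; returns false if the queue was empty.
    bool try_run_one() {
        std::function<void()> j;
        {
            std::lock_guard<std::mutex> lock(m);
            if (jobs.empty()) return false;
            j = std::move(jobs.front());
            jobs.pop_front();
        }
        j();  // run outside the lock so other threads can pop too
        return true;
    }
};

// Instead of blocking outright, a thread waiting on `done` keeps
// draining the job queue, then yields, then sleeps as a last resort.
void wait_for(std::atomic<bool>& done, JobQueue& q) {
    int idle_spins = 0;
    while (!done.load(std::memory_order_acquire)) {
        if (q.try_run_one()) {
            idle_spins = 0;              // made progress, reset backoff
        } else if (++idle_spins < 64) {
            std::this_thread::yield();   // brief backoff
        } else {
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    }
}
```

The important property is that a "waiting" thread never burns a core doing nothing while runnable jobs sit in the queue.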

Ah, that's a good point. I could allow the I/O thread to do a work job as well if there are no I/O jobs pending.


Regarding embarrassingly parallel jobs -- in order to move these off to the GPU, you also need the consumers of those jobs to be ok with extremely long latencies. It's not possible to get short CPU->GPU->CPU latencies on PC without destroying overall performance.

There aren't going to be many (if any at all) circumstances where a CPU step depends on a single GPU step, no. I want to keep what I calculate on the GPU on the GPU, so to speak. The only instance I can think of is using the CPU to do narrowphase CCD after the GPU does broadphase pruning for collision detection (ideally, I'd do the narrowphase on the GPU as well, but I can't think of any good GPU-friendly CCD algorithm).


Also, don’t “spawn” threads, awaken them. They should already exist and just be idling in a waiting state, waiting for an event to set them in motion.

And a “wait” state is not a “sleep” state.

Whoops, mixing up my terminology here. Yes, I plan to have these worker threads always running and waiting for jobs to be queued, not spawned with the jobs themselves.

"So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user." - Casey Muratori

boreal.aggydaggy.com


fread/fwrite are blocking wrappers around the OS's internal non-blocking file-system API.
Instead of using a non-blocking API, wrapped in a blocking API, wrapped in a thread to make it non-blocking, you should just use the native OS APIs.
You still might want an "IO" thread for running decompression, or alternatively just treat decompression as a regular job for your worker threads.

Even without decompression you still want to leave the IO calls on a dedicated thread for most APIs. The reasoning is not CPU performance; 99% of the time these threads should be sitting in a sleep state waiting for the next event. The problem with calling these APIs from the main thread is the inconsistent latency you introduce, which can cause extreme IO read/write performance losses. Two cases specifically come to mind. One is Windows-specific: unless you go all-in on IOCP, you are likely going to be using the callback variants of the API, which means the thread needs to enter an alertable wait regularly or the callbacks will never fire. Doing that on the main game thread would be less than desirable, and given the variable timing of most game loops, it would introduce random latencies. The latencies are the bad part for all of the APIs, though, as they tend to add up and feed back on each other in the case of file IO.

A simple high-level example of IO latency causing issues, glossing over details, so just the gist. Say you are reading a file piece by piece in 1 KB chunks. (This is not how you want to do it; for the example, just assume it isn't hopelessly dumb.) The OS likely reads an entire track in a single revolution, and say 20 KB of the file is on that track. If you don't service the event for the last 1 KB fast enough, the OS may get another request, start moving the head to service it, and flush the 19 KB of data you *could* have gotten if you had serviced the IO faster. So your little bit of latency just cost you 19 KB of potentially immediately available data, delayed when the OS will get back to servicing your requests, and you may now have to wait for the drive head to come back from very far away. It all adds up quickly.

Overall, the number of threads you have is not something you need to minimize to an extreme degree. For file and audio work, a thread per system is completely viable and has no measurable impact on the remaining systems: they sleep most of the time, wake up to do a tiny amount of work in between your primary processing, and go back to sleep without impacting performance at all. This portion of your architecture should really be a non-issue; just use the threads in this case as the OS is designed to use them. Your work queue and distributing the real work of the game — that's the tricky bit, with a truckload of problems you'll have to worry about.

I'm designing an engine for my big 3D game project, and I want to make sure everything is very scalable to processors with many cores, without having more threads than necessary spawned at a time.

This is the architecture I'm considering right now in terms of different threads:

  • 1 scheduling thread
    • Runs the main loop
    • Spawns work and I/O jobs
    • Sends draw/compute calls to the GPU
  • 1 I/O thread
    • Blocks on calls to fread and fwrite
    • Spawns work jobs for decoding
  • 1 sound thread
    • Runs from within OpenAL or satisfies SDL audio callbacks
  • n worker threads, where n = ncores - 3
    • Run serial work jobs (embarrassingly parallel jobs will be run on the GPU)
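One small detail worth pinning down in that scheme is how n is computed, since hardware_concurrency() can legitimately return 0. A sketch of the sizing logic described above (the fallback value of 4 is my own assumption):

```cpp
#include <thread>

// Reserve cores for the scheduling, I/O, and sound threads,
// but never drop below one worker thread.
unsigned worker_thread_count() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;   // value is unspecified on some platforms; assume quad-core
    const unsigned reserved = 3; // scheduler + I/O + sound
    return cores > reserved ? cores - reserved : 1;
}
```

On an 8-core machine this yields 5 workers; on a 4-core i5 it yields 1, which is exactly the degenerate case discussed below.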

While this makes sense to me for processors like Intel i7s or AMD FXs, which generally have more than 4 (logical) cores, for a 4-core processor like an i5 there is only one worker thread.

Should the scheduling thread also be able to run work jobs? If so, is it safe enough to have any thread be able to send draw/compute calls to the GPU (using OpenGL 4)?

I don't have much experience in multithreaded game development. Is there any reason behind restricting the kind of work the threads can carry out?

How often can these workloads run in parallel? E.g., the scenes before large battles are usually very quiet, which leaves the audio thread idling when it could have done other work. Of course, it always depends on your use case, but I would say that restricting the threads doesn't provide enough flexibility.

I'm working on an engine as well, but I have a thread pool which I use when updating the game logic, which is contained in my entity-component design.

Lastly, there are a couple of research papers out there on this topic; I've lost them, but I'm sure they are easy to find on Google Scholar :-).

You're assigning tasks to sequential processes. This can make the design of the modules simpler by allowing you to use blocking functions instead of requiring asynchronous ones, but it will do nothing to increase the performance of your game.

If you can identify specific areas that benefit from parallelism, then it might be easier to have a dedicated thread pool for that function. As an exercise, multi-thread your geometry culling — it's a recursive process, so it's an easy first step.

From there you'll have to do some research about what is optimal. My guess would be to split the screen twice, into 4 sections, and assign each section to a thread.
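The chunked-culling exercise can be sketched like this: each thread tests a slice of the objects and writes into its own output list, which are merged afterwards. The visibility test here is a deliberate stand-in (sphere against a single plane), not a real frustum test:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Sphere { float x, y, z, r; };

// Stand-in visibility test: sphere vs. the plane z = 0.
static bool visible(const Sphere& s) { return s.z + s.r > 0.0f; }

std::vector<std::size_t> cull_parallel(const std::vector<Sphere>& objs,
                                       unsigned num_threads) {
    std::vector<std::vector<std::size_t>> partial(num_threads);
    std::vector<std::thread> threads;
    const std::size_t chunk = (objs.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(objs.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                if (visible(objs[i]))
                    partial[t].push_back(i);  // no sharing: one list per thread
        });
    }
    for (auto& th : threads) th.join();

    std::vector<std::size_t> out;  // merge in thread order, preserving object order
    for (auto& p : partial)
        out.insert(out.end(), p.begin(), p.end());
    return out;
}
```

Because each thread only ever touches its own slot in `partial`, no locks are needed during the culling pass; the only synchronization point is the join before the merge.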

It's been a long time since I worked on multi-threaded rendering but at the time (circa 2000) OGL was painful to work with and D3D was more accessible to multi-threading. OGL context switches are expensive and you can't share them between threads.

- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara


It's been a long time since I worked on multi-threaded rendering but at the time (circa 2000) OGL was painful to work with and D3D was more accessible to multi-threading. OGL context switches are expensive and you can't share them between threads.
D3D11 does have the ability to write multi-core code with it in a fairly sane manner, but D3D9 is completely lacking, and yeah, GL is a pain in the butt.

Personally, I still take the approach of having only one thread actually submit work to the GPU; however, I use many threads to prepare that work. Things like scene traversal, culling, sorting, redundant state-change removal, etc. are all done in your own code (i.e. it doesn't use D3D/GL functions at all), so you can thread it however you like.
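The shape of that approach can be sketched as a two-phase frame: worker threads record into per-thread command buffers (pure CPU work, no GL/D3D calls), then a single thread walks the finished buffers and issues the real API calls. Everything here (`DrawCmd`, `submit_to_gpu`, the material rule) is a hypothetical placeholder:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct DrawCmd { int mesh_id; int material_id; };

// Placeholder for the real glDraw*/Draw* call. Only this function would
// ever touch the graphics API, and only one thread calls it.
static void submit_to_gpu(const DrawCmd& cmd, std::vector<DrawCmd>& gpu_log) {
    gpu_log.push_back(cmd);
}

std::vector<DrawCmd> render_frame(
        const std::vector<std::vector<int>>& visible_per_view) {
    // Phase 1: each thread records its own command buffer.
    // Thread-safe by construction, since nothing is shared.
    std::vector<std::vector<DrawCmd>> buffers(visible_per_view.size());
    std::vector<std::thread> recorders;
    for (std::size_t v = 0; v < visible_per_view.size(); ++v) {
        recorders.emplace_back([&, v] {
            for (int mesh : visible_per_view[v])
                buffers[v].push_back({mesh, /*material*/ mesh % 4});
        });
    }
    for (auto& t : recorders) t.join();

    // Phase 2: one thread submits everything, in a deterministic order.
    std::vector<DrawCmd> gpu_log;
    for (const auto& buf : buffers)
        for (const auto& cmd : buf)
            submit_to_gpu(cmd, gpu_log);
    return gpu_log;
}
```

The attraction of this split is that the submission order is deterministic regardless of how the recording threads were scheduled, which sidesteps the context-sharing problems mentioned above entirely.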


D3D11 does have the ability to write multi-core code with it in a fairly sane manner, but D3D9 is completely lacking, and yeah, GL is a pain in the butt.
Personally, I still take the approach of only having one thread actually submit work to the GPU, however, I use many threads to prepare that work -- things like scene traversal, culling, sorting, redundant state-change removal, etc, etc, is all done in your own code (i.e. it doesn't use D3D/GL functions at all), so you can thread it however you like

This is the key message here - the multithreaded API in D3D11 is mostly relevant if you don't have a strong separation (like Hodgman's) between doing some CPU work and submitting your state and draw calls. If you perform some moderate work in between your various API calls, then it is potentially beneficial to execute some chunks of that work on multiple threads. That's primarily due to parallelizing the CPU work, rather than any kind of speed-up at the driver level.

What is important (and my usual advice in this type of design discussion) is to design your code so that you can modularly move back and forth from multi-threaded to single-threaded execution. This is actually not too difficult with D3D11, since they implemented the deferred contexts with more or less the exact same interface as the immediate context - your rendering code doesn't actually know the difference unless it specifically asks via an API call. Once your code is modularized, you can profile and it should be much easier to tweak your specific situation for a target hardware configuration.
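The modularity being described — rendering code that doesn't know whether it's talking to an immediate or a deferred context — can be sketched with a tiny interface. This is loosely analogous to D3D11's immediate/deferred contexts, but all the types here are invented for illustration:

```cpp
#include <vector>

// The rendering code targets this interface and never knows which
// concrete context it was handed.
struct Context {
    virtual ~Context() = default;
    virtual void draw(int mesh_id) = 0;
};

// Immediate variant: executes on the spot (stand-in for the real device).
struct ImmediateContext : Context {
    std::vector<int>& gpu;
    explicit ImmediateContext(std::vector<int>& g) : gpu(g) {}
    void draw(int mesh_id) override { gpu.push_back(mesh_id); }
};

// Deferred variant: records commands for later playback on the submit
// thread (roughly like FinishCommandList/ExecuteCommandList in D3D11).
struct DeferredContext : Context {
    std::vector<int> recorded;
    void draw(int mesh_id) override { recorded.push_back(mesh_id); }
    void execute_on(Context& immediate) {
        for (int m : recorded) immediate.draw(m);
        recorded.clear();
    }
};

// Identical render code runs against either context unchanged.
void draw_scene(Context& ctx) {
    for (int mesh = 0; mesh < 3; ++mesh)
        ctx.draw(mesh);
}
```

With this shape, switching between single-threaded and multi-threaded execution is a matter of which context you pass in, which is exactly what makes per-configuration profiling cheap.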

