

Designing an efficient multithreading architecture


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

15 replies to this topic

#1 Boreal Games   Members   -  Reputation: 854


Posted 10 December 2013 - 03:04 PM

I'm designing an engine for my big 3D game project, and I want to make sure everything scales well to processors with many cores, without spawning more threads than necessary at a time.

 

This is the architecture I'm considering right now in terms of different threads:

  • 1 scheduling thread
    • Runs the main loop
    • Spawns work and I/O jobs
    • Sends draw/compute calls to the GPU
  • 1 I/O thread
    • Blocks on calls to fread and fwrite
    • Spawns work jobs for decoding
  • 1 sound thread
    • Runs from within OpenAL or satisfies SDL audio callbacks
  • n worker threads, where n = ncores - 3
    • Run serial work jobs (embarrassingly parallel jobs will be run on the GPU)

While this makes sense to me for processors like Intel i7s or AMD FXs, which generally have more than 4 logical cores, on a 4-core processor like an i5 there would be only one worker thread.

Should the scheduling thread also be able to run work jobs?  If so, is it safe enough to have any thread be able to send draw/compute calls to the GPU (using OpenGL 4)?




#2 Godmil   Members   -  Reputation: 744


Posted 10 December 2013 - 03:27 PM

You probably know a lot more than me, but it may be worth pointing out that every time I see someone ask about multithreading, the overwhelming reply is "Don't!". Unless you really know about all the pitfalls it can bring, it can be more trouble than it's worth.

However, maybe you do know your stuff, in which case good luck. :)



#3 Boreal Games   Members   -  Reputation: 854


Posted 10 December 2013 - 03:37 PM

Godmil, on 10 December 2013 - 03:27 PM, said:
You probably know a lot more than me, but it may be worth pointing out that every time I see someone ask about multithreading, the overwhelming reply is "Don't!". Unless you really know about all the pitfalls it can bring, it can be more trouble than it's worth.
However, maybe you do know your stuff, in which case good luck. :)

Oh, I know what I'm getting myself into.  I know about the data races and other issues that come with sharing memory across threads, so I'm going to minimize the number of times threads need to synchronize, and use locks properly when I need to.



#4 AgentC   Members   -  Reputation: 1417


Posted 10 December 2013 - 03:41 PM

If you have multiple OpenGL contexts (one for each thread, with sharing of objects set up between them) you will be able to make OpenGL calls from several threads simultaneously, but you must yourself ensure you're not e.g. updating the same OpenGL object in one thread while rendering with it in another. I also wouldn't be surprised if you discover more driver bugs that way, compared to single-threaded use, but I don't have personal experience of that.

 

If your jobs are of the kind "update culling" or "animate n entities", definitely run them in the main thread as well, especially if your frame processing cannot proceed further without completing them first.


Every time you add a boolean member variable, God kills a kitten. Every time you create a Manager class, God kills a kitten. Every time you create a Singleton...

Urho3D (engine)  Hessian (C64 game project)


#5 Six222   Members   -  Reputation: 439


Posted 10 December 2013 - 04:21 PM

If you haven't already, take a look at how Doom 3 BFG handles threads: http://fabiensanglard.net/doom3_bfg/index.php


Edited by Six222, 10 December 2013 - 04:21 PM.


#6 Boreal Games   Members   -  Reputation: 854


Posted 10 December 2013 - 04:38 PM

AgentC, on 10 December 2013 - 03:41 PM, said:
If your jobs are of the kind "update culling" or "animate n entities", definitely run them in the main thread as well, especially if your frame processing cannot proceed further without completing them first.

Do you think it's better to run jobs on the main thread, or instead to spawn an extra worker thread and let the main thread sleep while there are jobs still pending?



#7 AgentC   Members   -  Reputation: 1417


Posted 10 December 2013 - 05:31 PM

I believe that's best answered by profiling, but each time a thread goes to sleep, it may not wake up as promptly as you'd want due to OS scheduling. That goes for both the main thread and the workers. Therefore my gut feeling is against the extra worker thread.




#8 King Mir   Members   -  Reputation: 2050


Posted 10 December 2013 - 06:15 PM

AgentC, on 10 December 2013 - 03:41 PM, said:
If your jobs are of the kind "update culling" or "animate n entities", definitely run them in the main thread as well, especially if your frame processing cannot proceed further without completing them first.

Boreal Games, on 10 December 2013 - 04:38 PM, said:
Do you think it's better to run jobs on the main thread, or instead to spawn an extra worker thread and let the main thread sleep while there are jobs still pending?

Creating threads has a cost. Why pay it when you don't have to?

Boreal Games, on 10 December 2013 - 03:04 PM, said:
[quotes the original post in full]

I'd suggest having the number of worker threads be equal to the number of cores, or maybe even double that. (Forum member Frob has said that in his experience twice the number of virtual cores is usually a good way to go.)

The reason you want as many worker threads as cores, even though you have other threads, is that the worker threads probably have a different workload than the task-specific threads. Those threads may occasionally sleep or block on an I/O signal, underusing a core. You want your worker threads to saturate the remaining resources, so it's better to have too many worker threads some of the time than too few at any time. You can, however, give the task-specific threads higher priority.

Edited by King Mir, 10 December 2013 - 06:31 PM.


#9 Hodgman   Moderators   -  Reputation: 31920


Posted 10 December 2013 - 06:31 PM

fread/fwrite are blocking wrappers around the OS's internal non-blocking file-system API.

Instead of using a non-blocking API, wrapped in a blocking API, wrapped in a thread to make it non-blocking again, you should just use the native OS APIs. ;)

You still might want an "IO" thread for running decompression, or alternatively just treat decompression as a regular job for your worker threads.

 

The threads spawned by your audio middleware probably spend most of their time sleeping, so I wouldn't allocate them an entire core. I'd probably ignore the sound threads when figuring out how many worker threads to spawn.

 

I personally use my "main thread" as a worker thread too. Whenever any of my threads has nothing to do (e.g. it has to wait for the results of another thread before it can continue), it makes itself useful by popping jobs from the job queue and doing some work. I basically have a "WaitFor..." busy loop that continually checks whether the exit condition has been met, else tries to run a job, and after enough tries with no jobs in the queue it yields or sleeps.

 

Regarding the GPU, on PC it's probably still fastest to have one thread as the dedicated GPU thread. Other threads can perform rendering work -- e.g. doing frustum culling, building render queues, performing state sorting or redundant state-change removal, etc. -- but only one thread actually draws things.

Multi-threaded drawing is possible, but AFAIK, the drivers do not perform very well at the moment.

 

Regarding embarrassingly parallel jobs -- in order to move these off to the GPU, you also need the consumers of those jobs to be ok with extremely long latencies. It's not possible to get short CPU->GPU->CPU latencies on PC without destroying overall performance.


Edited by Hodgman, 10 December 2013 - 06:33 PM.


#10 L. Spiro   Crossbones+   -  Reputation: 14374


Posted 10 December 2013 - 07:06 PM

I would distribute them as follows:

 

  • Core 1—Several resource-light threads.
    • Sound—Ticks only a few times per second to keep sound buffers filled, requiring very few resources.
    • Input (keyboard/mouse/etc.)—High-priority but mostly sleeping until a button is pressed, requiring very few resources.
    • Network thread—Medium priority but still mostly just waiting for events, which generally amount to at most 20 per second in heavy times.
    • 1 low-priority worker thread for background loading etc.—With sound, input, and networking all mostly in a sleep or wait state (and only waking to do very quick tasks before going back to waiting), there is still enough core left for a low-priority worker for any kind of task that is not very time-sensitive.
  • Core 2—Logic.
    • Game thread—Reads queued inputs, performs game logic, performs frustum culling, sorting, and submits render commands.
    • 1 worker thread—The game thread has its heaviest load when it needs to update game logic, which is typically made as infrequent as possible (and in racing games, where it can run more than 100 times per second, the load is balanced so that not much logic actually takes place each tick); in the down-time the game thread simply interpolates object positions for re-submission to the render thread.  There is typically enough CPU left over for a worker thread.  It can either run constantly at a medium-low priority, or run at the same priority and be forced awake when the game thread is waiting and forced to wait when the game thread is awake.
  • Core 3—Rendering and worker scheduling.
    • Render thread—Sends render commands to the GPU.
    • Worker scheduler—Runs when the render thread is waiting, waits when the render thread is awakened.  It takes very little time to read over a list of requested tasks and awaken worker threads.
  • Cores 4 to Total-3—Anything else that needs to be done.
    • 1 worker thread—Extra cores used for extra work.  File loading, decoding, decompressing, whatever.  Can change depending on the game.  Each thread is high-priority, but waits until a job is there for it to do.

With this plan you still have multiple workers on 4-core systems, and the main 2 components (rendering and game logic) each basically have their own cores—they share it with a worker thread but that thread leaves them alone while they are active (though this should be handled with care on the game thread, since it doesn’t necessarily have to have down-time).

 

Also, don’t “spawn” threads, awaken them.  They should already exist and just be idling in a waiting state, waiting for an event to set them in motion.

And a “wait” state is not a “sleep” state.

 

 

L. Spiro


Edited by L. Spiro, 10 December 2013 - 07:17 PM.

It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#11 Boreal Games   Members   -  Reputation: 854


Posted 10 December 2013 - 07:08 PM


Hodgman, on 10 December 2013 - 06:31 PM, said:
I personally use my "main thread" as a worker thread too. Whenever any of my threads has nothing to do (e.g. it has to wait for the results of another thread before it can continue), it makes itself useful by popping jobs from the job queue and doing some work. I basically have a "WaitFor..." busy loop that continually checks whether the exit condition has been met, else tries to run a job, and after enough tries with no jobs in the queue it yields or sleeps.

Ah, that's a good point.  I could allow the I/O thread to do a work job as well if there are no I/O jobs pending.

 


Hodgman, on 10 December 2013 - 06:31 PM, said:
Regarding embarrassingly parallel jobs -- in order to move these off to the GPU, you also need the consumers of those jobs to be OK with extremely long latencies. It's not possible to get short CPU->GPU->CPU latencies on PC without destroying overall performance.

There aren't going to be many (if any at all) circumstances where a CPU step depends on a single GPU step, no.  I want to keep what I calculate on the GPU on the GPU, so to speak.  The only instance I can think of is using the CPU to do narrow-phase CCD after the GPU does broad-phase pruning for collision detection (ideally, I'd do the narrow phase on the GPU as well, but I can't think of any good GPU-friendly CCD algorithm).

 


L. Spiro, on 10 December 2013 - 07:06 PM, said:
Also, don’t “spawn” threads, awaken them. They should already exist and just be idling in a waiting state, waiting for an event to set them in motion.
And a “wait” state is not a “sleep” state.

Whoops, mixing up my terminology here.  Yes, I plan to have these worker threads always running and waiting for jobs to be queued, not spawned with the jobs themselves.


Edited by Boreal Games, 10 December 2013 - 07:15 PM.


#12 AllEightUp   Moderators   -  Reputation: 4270


Posted 11 December 2013 - 11:41 AM


Hodgman, on 10 December 2013 - 06:31 PM, said:
fread/fwrite are blocking wrappers around the OS's internal non-blocking file-system API. Instead of using a non-blocking API, wrapped in a blocking API, wrapped in a thread to make it non-blocking again, you should just use the native OS APIs.
You still might want an "IO" thread for running decompression, or alternatively just treat decompression as a regular job for your worker threads.

 

Even without decompression you still want to leave the I/O calls in a worker thread for most APIs.  The reasoning is not CPU-performance related, as 99% of the time these threads should be sitting in a sleep state waiting for the next event.  The problem with calling these APIs from the main thread is the inconsistent latency you introduce, which can cause extreme I/O read/write performance losses.  Two cases specifically come to mind.  One is Windows-specific: unless you go full-on IOCP, you are likely going to be using the callback variations of the APIs, which means the thread needs to go into a wakeable sleep regularly or it will never fire the events.  Doing that on the main game thread would be less than desirable, and given the randomness of most game loops it would also introduce random latencies.  The latencies are the bad part for all the APIs, though, as they tend to add up and feed back on each other in the case of file I/O.

 

A simple high-level example of the I/O latency causing issues, obviously avoiding details, so just the gist here.  Say you read in 1 KB chunks and you are reading a file piece by piece.  (This is not how you want to do it; again, just for the example, assume it is not horribly dumb. :) )  The OS likely reads an entire track in a single revolution, and say 20 KB of the file is on that track.  If you don't service the event for the last 1 KB fast enough, the OS may get another request, start moving to service that, and flush the 19 KB worth of data you *could* have gotten if you had serviced the I/O faster.  So your little bit of latency just cost you 19 KB of potentially immediately available data, delayed when the OS will get back to servicing your requests, and of course you may now have to wait for the drive head to come back from very far away, so it all adds up quickly.

 

Overall, the number of threads you have is not necessarily something you have to minimize to an extreme degree.  For file and audio work, a thread per system is completely viable and has no measurable impact on the remaining systems.  They sleep most of the time, wake up to do a tiny amount of work in between your primary processing, and go back to sleep without impacting performance at all.  This portion of your architecture should really be a non-issue; just use threads here, as that is how the OS is designed to use them.  Your work queue and the distribution of the real work of the game, that's the tricky bit with a truckload of problems you'll have to worry about.


Edited by AllEightUp, 11 December 2013 - 11:51 AM.


#13 RandomDev   Members   -  Reputation: 127


Posted 16 December 2013 - 12:41 AM

Boreal Games, on 10 December 2013 - 03:04 PM, said:
[quotes the original post in full]

 

I don't have much experience in multithreaded game development. Is there a reason for restricting the kind of work each thread can carry out?

How often can this work run in parallel? E.g. the scenes before large battles are usually very quiet, which would leave the audio thread idling when it could have done other work. Of course it always depends on your use case, but I would say that restricting the threads doesn't provide enough flexibility.

I'm working on an engine as well, but I have a thread pool which I use when updating the game logic contained in my entity-component design.

Lastly, there are a couple of research papers out there on this topic; I've lost track of them, but I'm sure they're easy to find on Google Scholar. :-)



#14 Shannon Barber   Moderators   -  Reputation: 1390


Posted 29 December 2013 - 11:56 PM

You're assigning tasks to sequential processes. This could make the design of the modules simpler by allowing you to use blocking functions instead of requiring asynchronous ones, but it will do nothing to increase the performance of your game.

 

If you can identify specific areas that benefit from parallelism, then it might be easier to have a dedicated thread pool for that function -- as an exercise, multithread your geometry culling. It's a recursive process, so it's an easy first step.

 

From there you'll have to do some research into what is optimal. My guess would be to split the screen twice, into 4 sections, and assign each section to a thread.

 

It's been a long time since I worked on multi-threaded rendering but at the time (circa 2000) OGL was painful to work with and D3D was more accessible to multi-threading. OGL context switches are expensive and you can't share them between threads.


- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara

#15 Hodgman   Moderators   -  Reputation: 31920


Posted 30 December 2013 - 04:06 AM


Shannon Barber, on 29 December 2013 - 11:56 PM, said:
It's been a long time since I worked on multi-threaded rendering, but at the time (circa 2000) OGL was painful to work with and D3D was more accessible to multi-threading. OGL context switches are expensive and you can't share them between threads.
D3D11 does let you write multi-core code in a fairly sane manner, but D3D9 is completely lacking, and yeah, GL is a pain in the butt.

Personally, I still take the approach of having only one thread actually submit work to the GPU; however, I use many threads to prepare that work -- things like scene traversal, culling, sorting, redundant state-change removal, etc. are all done in your own code (i.e. it doesn't use D3D/GL functions at all), so you can thread it however you like. ;)



#16 Jason Z   Crossbones+   -  Reputation: 5382


Posted 30 December 2013 - 11:45 AM


Hodgman, on 30 December 2013 - 04:06 AM, said:
D3D11 does let you write multi-core code in a fairly sane manner, but D3D9 is completely lacking, and yeah, GL is a pain in the butt.
Personally, I still take the approach of having only one thread actually submit work to the GPU; however, I use many threads to prepare that work -- things like scene traversal, culling, sorting, redundant state-change removal, etc. are all done in your own code (i.e. it doesn't use D3D/GL functions at all), so you can thread it however you like.

This is the key message here - the multithreaded API in D3D11 is mostly relevant if you don't have a strong separation (like Hodgman's) between doing CPU work and submitting your state and draw calls.  If you perform some moderate work in between your various API calls, then it is potentially beneficial to execute some chunks of that work on multiple threads.  That's primarily due to parallelizing the CPU work, rather than any kind of speed-up at the driver level...

 

What is important (and my usual advice in this type of design discussion) is to design your code so that you can modularly move back and forth between multi-threaded and single-threaded execution.  This is actually not too difficult with D3D11, since deferred contexts were implemented with more or less the same interface as the immediate context - your rendering code doesn't actually know the difference unless it specifically asks via an API call.  Once your code is modularized, you can profile, and it should be much easier to tune your specific situation for a target hardware configuration.





