karl88

Threadpool C++ 11 game engine


Hey!
 
It's time to implement a threadpool in my game engine with std::mutex, std::condition_variable, std::unique_lock and std::lock_guard.
I have read a lot about lock-free programming, but that seems really hard to get right. The number of jobs I need to execute each frame is approx. 40-200.
So, I have a couple of questions that need to be answered before I make any attempt to implement this stuff.
 
How many threads should the threadpool contain? 
As many as the number of cores on the CPU? And if so, should I create nrThreadsThreadPool = nrCpuCores - 1 (reserving one core for the main thread)?
 
Example:
  • Current computer: 4 cores, 4 threads.
  • Threadpool contains 3 threads.
Let's say that I have 17500 objects in an array and I need to perform frustum culling on each object.
Should I send three jobs of 5000 objects each to the threadpool and perform the remaining 2500 in the main thread?
This solution seems non-optimal, because each frustum culling test will take a different amount of time to complete, and this can result in cores that have nothing to do.
Or should I create a work-size variable of 512 or something similar, and calculate the number of jobs I need to send to the threadpool?
 
threadpool.push(job_0, arrayStart_0, arrayEnd_0) //512 frustum culling.
threadpool.push(job_1, arrayStart_1, arrayEnd_1) //512 frustum culling.
threadpool.push(job_2, arrayStart_2, arrayEnd_2) //512 frustum culling.
threadpool.push(job_3, arrayStart_3, arrayEnd_3) //512 frustum culling.
...
threadpool.push(job_n-1, arrayStart_n-1, arrayEnd_n-1) //512 frustum culling.
mainThreadFrustumCulling() //The main thread should execute the remaining frustum culling tests.
threadpool.wait() //Wait for frustum culling to finish in the thread pool.
doSomethingWithTheResults()
 
The job size should be enough to outweigh the overhead of locking/unlocking with std::mutex.


I would check to see if the total job size is less than some threshold (e.g. 512 objects), if so then just use a sequential algorithm on the main thread. Otherwise, if you have N_t threads and N_w work items, then I would split it up into pieces of ceiling(N_w/N_t) size. This creates jobs of roughly equal size (aside from the last), maximizes parallelism and minimizes the number of synchronizations.
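As a sketch, that splitting rule (sequential below a threshold, otherwise ceiling(N_w/N_t) chunks) could look like this; `splitWork` and the default threshold of 512 are illustrative choices, not a fixed recipe:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split numItems work items into at most numThreads contiguous
// [begin, end) ranges of ceiling(numItems / numThreads) items each;
// only the last range may be smaller. Below a small-job threshold,
// return a single range to be run sequentially on the main thread.
std::vector<std::pair<std::size_t, std::size_t>>
splitWork(std::size_t numItems, std::size_t numThreads,
          std::size_t sequentialThreshold = 512) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (numItems == 0) return ranges;
    if (numItems < sequentialThreshold || numThreads <= 1) {
        ranges.emplace_back(0, numItems);   // run sequentially
        return ranges;
    }
    std::size_t chunk = (numItems + numThreads - 1) / numThreads; // ceiling
    for (std::size_t begin = 0; begin < numItems; begin += chunk)
        ranges.emplace_back(begin, std::min(begin + chunk, numItems));
    return ranges;
}
```

For the 17500-object example with 4 threads, this yields four ranges of 4375 items each.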

The number of threads to use depends on what else is going on in the background (e.g. audio, physics, asset loading), and how much CPU those tasks use. If you completely saturate the CPU it might cause other tasks to be preempted. I would use one thread for rendering, one for physics, one for audio, and the rest (out of the total available) for the pool, but you should really look at the usage and make an informed decision from there.

Edited by Aressera


I add my main thread into the thread pool -- so on a 4-core machine, I create 3 threads and end up with a pool of 4 threads.
By default, a task on 17500 items would be split into 4 jobs that each operate on 4375 items -- but yeah, having coarse grained jobs reduces opportunities for the system to self-balance.

Have a look at doom 3's design here: http://fabiensanglard.net/doom3_bfg/threading.php
In the case where you split into a large number of smaller jobs (512 items each), pushing each of those jobs individually puts a lot more pressure on the job queue than necessary. Doom 3 solved this by having a "job list" queue. You create a "job list" containing all of these small frustum culling jobs (512 items each), and then push that list into the queue. The list contains an atomic integer indicating which task is the next one to be processed, which makes it act like a small, simpler queue (zero-producer, multiple-consumer).

Edited by Hodgman


@Infinisearch Usually not. A thread pool is something that allocates and runs threads; a job system is what keeps those threads busy.

However, the two are usually implemented together as one thing.

Edited by imoogiBG



Ideally the threads in your thread pool won't ever wait for anything (except more jobs to enter the job queue). I actually create a second thread-pool of low-priority threads to handle any long-running operations that could contain waiting (e.g. interacting with the file system -- even the OS's native async APIs sometimes block for several ms). 

However, on hyper-threaded CPUs each core can actually host two threads, and whenever one is waiting on data to arrive from RAM, the core can process some instructions from the other thread. Some benchmarks show this can give a ~15% improvement (e.g. using 8 threads instead of 4), but in my own testing, my game runs better with just 4 threads rather than 8 on a hyper-threaded quad core :o

Very interesting that in your test 4 threads run better than 8. Obviously this tradeoff threshold must exist somewhere, but I never thought it could be at a number as small as 8 threads.

In my game engine, the gain wasn't much, but it did prevent performance from dropping in a test that progressively added objects. However, I have a case (not game-related, and an early test) where multithreading gives almost 2x the performance with 8 threads (on a 4-core computer). It's a painting application that I'm making, where a 'dab' of a brush is drawn by the 8 threads (each thread is responsible for drawing a clipped region of the same pixel array).

The only explanation I can think of for this is that the RAM has become a bottleneck, since mine runs at 1.6 GHz while the CPU runs at 2.5 GHz.

I imagine the same could be used for A* algorithms, as they use arrays.


It's usually a bad idea to heavily thread operations that write to the same area of memory. Even if you guarantee they're writing to different parts, you can still get 'false sharing'.

If you're doing very similar operations across an array, I'd say it's usually best to keep it on a single thread but to vectorise the operation if possible. If you do have to thread it, keep the false sharing problem in mind, and try to mitigate it by doing fewer writes, postponing writes until the end, working on larger chunks of contiguous memory, etc.
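A sketch of two of those mitigations together (keeping per-thread data a cache line apart, and accumulating locally so each thread writes shared memory only once at the end); `PaddedCounter`, `parallelSum`, and the 64-byte line size are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// One accumulator per thread, padded out to a typical 64-byte cache line
// so two threads never write to the same line ("false sharing").
struct PaddedCounter {
    std::size_t value = 0;
    char pad[64 - sizeof(std::size_t)]; // spacing only, never read
};

// Each thread sums its own contiguous slice into a local variable and
// writes its padded slot once; the final reduction runs on the caller.
std::size_t parallelSum(const std::vector<int>& data, unsigned numThreads) {
    std::vector<PaddedCounter> partial(numThreads);
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            std::size_t sum = 0;            // accumulate in a local...
            for (std::size_t i = begin; i < end; ++i)
                sum += static_cast<std::size_t>(data[i]);
            partial[t].value = sum;         // ...write shared memory once
        });
    }
    for (auto& th : threads) th.join();
    std::size_t total = 0;
    for (auto& p : partial) total += p.value;
    return total;
}
```

The same padding idea applies to any per-thread scratch data; C++17 later exposed a portable line size as `std::hardware_destructive_interference_size`.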

