Thread pool for a C++11 game engine

Started by
13 comments, last by penguinbyebye 6 years, 11 months ago
Hey!
It's time to implement a thread pool in my game engine using std::mutex, std::condition_variable, std::unique_lock and std::lock_guard.
I have read a lot about lock-free programming, but that seems really hard to get right. The number of jobs I need to execute each frame is approximately 40-200.
So, I have a couple of questions that need to be answered before I make any attempt to implement this.
How many threads should the threadpool contain?
As many as the number of cores on the CPU? And if so, should I create nrThreadsThreadPool = nrCpuCores - 1 (one core reserved for the main thread)?
Example:
  • Current computer: 4 cores, 4 threads.
  • Threadpool contains 3 threads.
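For reference, C++11 can report the hardware thread count directly. A minimal sketch (the fallback value is my own assumption, since hardware_concurrency() is allowed to return 0 when the count is unknown):

```cpp
#include <algorithm>
#include <thread>

// Pick a pool size: one worker per hardware thread, minus one reserved
// for the main thread. std::thread::hardware_concurrency() may return 0
// if the count cannot be determined, so fall back to a guess of 2.
unsigned pool_size()
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 2;            // unknown: assume at least 2
    return std::max(1u, hw - 1);    // never return 0 workers
}
```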
Let's say that I have 17500 objects in an array and I need to perform frustum culling on each object.
Should I send three jobs of 5000 objects each to the thread pool and process the remaining 2500 on the main thread?
This solution seems non-optimal, because each frustum-culling test takes a different amount of time to complete, which can leave some cores with nothing to do.
Or, should I create a work size variable with a size of 512 or something similar, and calculate the number of jobs that I need to send to the threadpool?
threadpool.push(job_0, arrayStart_0, arrayEnd_0) // 512 frustum-culling tests.
threadpool.push(job_1, arrayStart_1, arrayEnd_1) // 512 frustum-culling tests.
threadpool.push(job_2, arrayStart_2, arrayEnd_2) // 512 frustum-culling tests.
threadpool.push(job_3, arrayStart_3, arrayEnd_3) // 512 frustum-culling tests.
...
threadpool.push(job_n-1, arrayStart_n-1, arrayEnd_n-1) // 512 frustum-culling tests.
mainThreadFrustumCulling() //The main thread should execute the remaining frustum culling tests.
threadpool.wait() //Wait for frustum culling to finish in the thread pool.
doSomethingWithTheResults()
The job size should be enough to outweigh the overhead of locking/unlocking with std::mutex.

I would check to see if the total job size is less than some threshold (e.g. 512 objects), if so then just use a sequential algorithm on the main thread. Otherwise, if you have N_t threads and N_w work items, then I would split it up into pieces of ceiling(N_w/N_t) size. This creates jobs of roughly equal size (aside from the last), maximizes parallelism and minimizes the number of synchronizations.
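That splitting rule can be sketched as follows (the helper name is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split n_work items across n_threads jobs of ceiling(n_work/n_threads)
// size. Returns one [begin, end) range per job; all jobs are equal-sized
// except possibly the last.
std::vector<std::pair<std::size_t, std::size_t>>
split_work(std::size_t n_work, std::size_t n_threads)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (n_work == 0 || n_threads == 0) return ranges;
    std::size_t chunk = (n_work + n_threads - 1) / n_threads;  // ceiling
    for (std::size_t begin = 0; begin < n_work; begin += chunk)
        ranges.emplace_back(begin, std::min(begin + chunk, n_work));
    return ranges;
}
```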

The number of threads to use depends on what else is going on in the background (e.g. audio, physics, asset loading), and how much CPU those tasks use. If you completely saturate the CPU it might cause other tasks to be preempted. I would use one thread for rendering, one for physics, one for audio, and the rest (out of the total available) for the pool, but you should really look at the usage and make an informed decision from there.

I add my main thread into the thread pool -- so on a 4-core machine, I create 3 threads and end up with a pool of 4 threads.
By default, a task on 17500 items would be split into 4 jobs that each operate on 4375 items -- but yeah, having coarse grained jobs reduces opportunities for the system to self-balance.

Have a look at doom 3's design here: http://fabiensanglard.net/doom3_bfg/threading.php
In the case where you split the work into a large number of smaller jobs (512 items each), pushing each of those jobs individually puts a lot more pressure on the job queue than necessary. Doom 3 solved this by having a "job list" queue. You create a "job list" containing all of these small frustum-culling jobs (512 items each), and then push that single list into the queue. The list contains an atomic integer indicating which task is the next one to be processed, which makes it act like a small, simpler queue (zero producers at run time, multiple consumers).

Check this out: https://github.com/progschj/ThreadPool/blob/master/ThreadPool.h

I made some tweaks for my version, but it works very well.


@Infinisearch Usually not: a thread pool is something that allocates and runs threads, while the job system is what keeps those threads busy.

However, the two are usually implemented together as one thing.

Awesome (!) lock-free queues which come in handy:

single-producer-single-consumer

https://github.com/cameron314/readerwriterqueue

multi-producer-multi-consumer

https://github.com/cameron314/concurrentqueue


Ideally the threads in your thread pool won't ever wait for anything (except more jobs to enter the job queue). I actually create a second thread-pool of low-priority threads to handle any long-running operations that could contain waiting (e.g. interacting with the file system -- even the OS's native async APIs sometimes block for several ms).

However, on hyper-threaded CPUs each core can host two threads, and whenever one is waiting on data to arrive from RAM, the core can execute instructions from the other thread. Some benchmarks show this can give a ~15% improvement (e.g. using 8 threads instead of 4), but in my own testing, my game runs better with just 4 threads rather than 8 on a hyper-threaded quad core :o

Very interesting that in your test 4 threads ran better than 8. Obviously this tradeoff threshold has to exist somewhere, but I never thought it could occur at a number as small as 8 threads.

In my game engine the gain wasn't much, but it did prevent performance from degrading in a test that progressively added objects. However, I have a case (not game-related, and still an early test) where multithreading gives almost 2x the performance with 8 threads on a 4-core computer. It's a painting application I'm making, where a 'dab' of a brush is drawn by the 8 threads (each thread is responsible for drawing a clipped region of the same pixel array).

The only explanation I can think of is that RAM has become a bottleneck, since mine runs at 1.6 GHz while the CPU runs at 2.5 GHz.

I imagine the same approach could be used for A* algorithms, as they work on arrays.

It's usually a bad idea to heavily thread operations that write to the same area of memory. Even if you guarantee they're writing to different parts, you can still get 'false sharing'.

If you're doing very similar operations across an array, I'd say it's usually best to keep it on a single thread but to vectorise the operation if possible. If you do have to thread it, keep the false sharing problem in mind, and try to mitigate it by doing fewer writes, postponing writes until the end, working on larger chunks of contiguous memory, etc.
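One common mitigation for false sharing is to pad each thread's write target onto its own cache line. A sketch assuming a 64-byte line size (a common value, but target-dependent; C++17 adds std::hardware_destructive_interference_size as a portable constant, and also guarantees that over-aligned types are allocated correctly by standard containers):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread's accumulator lives on its own 64-byte cache line, so the
// threads never invalidate each other's lines while writing.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Each thread sums its own slice into its own padded counter; only the
// final single-threaded reduction touches all of them.
long parallel_sum(const std::vector<int>& data, unsigned n_threads)
{
    std::vector<PaddedCounter> partial(n_threads);
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                partial[t].value += data[i];
        });
    }
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& p : partial) total += p.value;
    return total;
}
```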

Luckily, in my case the arrays are big enough (when the brush is big) to create a considerable offset between the positions where each thread starts writing. I should just add a few safety measures to make one thread wait if another thread is slower for some reason.
