Threadpool C++ 11 game engine
- Current computer: 4 cores, 4 threads.
- Threadpool contains 3 threads.
I would check to see if the total job size is less than some threshold (e.g. 512 objects), if so then just use a sequential algorithm on the main thread. Otherwise, if you have N_t threads and N_w work items, then I would split it up into pieces of ceiling(N_w/N_t) size. This creates jobs of roughly equal size (aside from the last), maximizes parallelism and minimizes the number of synchronizations.
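The splitting described above can be sketched like this. The threshold value, the names, and the raw `std::thread` dispatch are my own choices for illustration; in a real engine these chunks would be submitted to the pool rather than spawning threads per call:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the real per-object work on items [begin, end).
void process_range(std::vector<int>& items, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) items[i] *= 2;
}

void parallel_for(std::vector<int>& items, std::size_t num_threads) {
    const std::size_t kThreshold = 512;  // below this, threading isn't worth it
    const std::size_t n = items.size();
    if (n < kThreshold || num_threads <= 1) {
        process_range(items, 0, n);  // sequential on the calling thread
        return;
    }
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling(N_w / N_t)
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, n);
        if (begin >= end) break;  // fewer chunks than threads
        workers.emplace_back(process_range, std::ref(items), begin, end);
    }
    for (auto& w : workers) w.join();
}
```

All chunks are the same size except possibly the last, so each thread finishes at roughly the same time and you only synchronize once, at the join.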
The number of threads to use depends on what else is going on in the background (e.g. audio, physics, asset loading), and how much CPU those tasks use. If you completely saturate the CPU it might cause other tasks to be preempted. I would use one thread for rendering, one for physics, one for audio, and the rest (out of the total available) for the pool, but you should really look at the usage and make an informed decision from there.
I add my main thread into the thread pool -- so on a 4-core machine, I create 3 threads and end up with a pool of 4 threads.
By default, a task on 17500 items would be split into 4 jobs that each operate on 4375 items -- but yeah, having coarse grained jobs reduces opportunities for the system to self-balance.
Have a look at doom 3's design here: http://fabiensanglard.net/doom3_bfg/threading.php
In the case where you split the work into a large number of smaller jobs (512 items each), pushing each of those jobs individually puts a lot more pressure on the job queue than necessary. Doom 3 solved this by having a "job list" queue: you create a "job list" containing all of these small frustum-culling jobs (512 items each), and then push that single list into the queue. The list contains an atomic integer indicating which task is the next one to be processed, which makes it act like a small, simpler queue (zero-producer, multiple-consumer, since all the jobs exist before execution starts).
Check this out: https://github.com/progschj/ThreadPool/blob/master/ThreadPool.h
I made some tweaks for my version, but it works very well.
@Infinisearch Usually not: a thread pool is something that allocates and runs threads, while the job system is what keeps those threads busy.
However, the two are usually implemented together as one thing.
Awesome (!) lock-free queues which come in handy:
- single-producer, single-consumer: https://github.com/cameron314/readerwriterqueue
- multi-producer, single-consumer:
Ideally the threads in your thread pool won't ever wait for anything (except more jobs to enter the job queue). I actually create a second thread-pool of low-priority threads to handle any long-running operations that could contain waiting (e.g. interacting with the file system -- even the OS's native async APIs sometimes block for several ms).
However, on hyper-threaded CPUs each core can actually host two threads, and whenever one is waiting on data to arrive from RAM, the core can process some instructions from the other thread. Some benchmarks show this can give a ~15% improvement (e.g. using 8 threads instead of 4), but in my own testing, my game runs better with just 4 threads rather than 8 on a hyper-threaded quad-core :o
Very interesting that in your test 4 threads ran better than 8. Obviously this tradeoff threshold exists somewhere, but I never thought it could kick in at a number as small as 8 threads.
In my game engine the gain wasn't much, but it did prevent performance from degrading in a test that progressively added objects. However, I have a case (not game-related, and still an early test) where multithreading gives almost 2x the performance with 8 threads (on a 4-core computer). It's a painting application that I'm making, where a 'dab' of a brush is drawn by the 8 threads (each thread is responsible for drawing a clipped region of the same pixel array).
The only explanation I can think of for this is that the RAM has become the bottleneck, since mine runs at 1.6 GHz while the CPU runs at 2.5 GHz.
I imagine the same approach could be used for A* algorithms, as they also operate on arrays.
It's usually a bad idea to heavily thread operations that write to the same area of memory. Even if you guarantee they're writing to different parts, you can still get 'false sharing'.
If you're doing very similar operations across an array, I'd say it's usually best to keep it on a single thread but to vectorise the operation if possible. If you do have to thread it, keep the false sharing problem in mind, and try to mitigate it by doing fewer writes, postponing writes until the end, working on larger chunks of contiguous memory, etc.
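One common mitigation for the false-sharing problem is padding each thread's output slot to a full cache line, and postponing the write until the end of the work, as suggested above. A sketch under my own naming (64 bytes is an assumed cache-line size):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Pad each per-thread slot so no two threads' results share a cache line.
struct PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];  // assumes a 64-byte cache line
};

long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    std::vector<PaddedCounter> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, n);
            long local = 0;  // accumulate in a local, write shared memory once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

Each thread reads a contiguous chunk and performs exactly one write to shared memory, so the padded slots never ping-pong between cores during the hot loop.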
Luckily, in my case the arrays are big enough (when the brush is big) to create a considerable offset between the positions where each thread starts modifying the array; I should just add a few safety measures to make one thread wait if another thread is slower for some reason.