TBB seems to be a very common choice.
[Provided you've had practice with memory re-ordering issues and esoteric C language rules, then] Writing your own isn't that hard.
I like to use a SPMD style of programming quite a bit these days. What I do is:
- Create a bunch of threads -- typically 1 per core.
- Assign each thread a number from 0 to numThreads-1. Store this number using TLS so it can be retrieved like a global variable at any time.
- Write a pure function that can distrubute a range of a problem between threads, e.g.
inline void DistributeTask( uint workerIndex, uint numWorkers, uint items, uint &begin, uint& end )
{
uint perWorker = items / numWorkers;
begin = perWorker * workerIndex;
end = (workerIndex==numWorkers-1)
? items //ensure last thread covers whole range
: begin + perWorker;
begin = perWorker ? begin : min(workerIndex, (u32)items); //special case,
end = perWorker ? end : min(workerIndex+1, (u32)items); //less items than workers
}
- Run the same main loop on all threads. Upon reaching a processing loop, use this function to determine the range of the loop that should be processed.
- If the results of one loop become the inputs to another loop, then make sure you synchronize the workers appropriately. e.g. increment an atomic counter as you leave the first loop, and before the second loop, wait until counter == numThreads.
- In my implementation, I combine this SPMD model with a job-based model, where while any thread is busy-waiting on a condition such as the above, they continually try to pop 'job packets' from a shared queue for processing, to make good use of this waiting time.