C++ Lockless Queue for a Threadpool

Started by
11 comments, last by Hodgman 13 years, 7 months ago
I would generally advise against lockfree programming. While there are some special applications (device driver, network stack) in which lockfree algorithms can significantly improve performance, for most "normal" scenarios they are a totally stupid thing to do.

Lockfree algorithms are a big win (2-10 times faster) in scenarios where many threads fight over the same lock all the time without cease. In a "normal" scenario where a task actually involves doing some real work, this is not the case.

Unless you have millions of tasks which involve not much more than adding together 2 numbers, you will normally have one thread, or in the exceptional case two, accessing the lock at the same time. In this low-congestion case, my 4 year old computer can schedule 10 million tasks per second no problem, using a plain normal critical section and no complicated, fragile code.
If you need to schedule more than that, then you should consider that something is fundamentally wrong with your approach (i.e. tasks are much too small).

Besides being complicated, unnecessary, and error-prone, lockfree algorithms have yet another problem, which is that they burn CPU cycles. If you have a pool of worker threads and feed the pool tasks at random times, then the worker threads must either block on a semaphore or a similar object, or they will spin and burn cycles all the time (and fight over time slices with your main application code, making it slower), even when no tasks are on the queue.
So, unless you need to process tasks all the time without cease, you must implement some mechanism to put the worker threads to sleep (or suspend, whatever terminology you want to use), which means that you pay the overhead of an OS synchronisation object in addition to the lockfree mechanism. Which means you could just use a semaphore in the first place.
It's beginning to sound like "Locklessness" is almost taboo. I see your point on how they could be helpful with network stacks and drivers, and that's a great example of where locklessness could be helpful.

Thanks for all the help, everyone. Still, I do appreciate any comments or suggestions from anyone.

And although I'm still optimizing it (which is why this thread exists), I was wondering if you guys would have any feedback on a small tutorial I did for creating a C++ Win32 Threadpool.

http://keithmaggio.wordpress.com/code/c-win32-thread-pool-manager/

Thanks again, guys! I really appreciate the help.
It's not so much that they are "taboo", but they are not a magic cure, and implementing them is not trivial. Locks have their problems and lockfree algorithms solve these, but they have different problems that locks do not have.

If you have a "sane" task granularity, you need to schedule maybe 500 or 1000 of them per second at most. At this order of magnitude, it does not matter at all whether it takes 20 cycles or 2000 cycles to dispatch one. It usually doesn't matter if tasks are executed in a "fair" manner, as long as they are executed. It usually doesn't matter if one task might be stalled for a couple of milliseconds, as long as it is executed at some point. On the other hand, keeping the CPU spinning when there is no work to be done could matter a lot. It always depends.
Well, like I said before this isn't me trying to solve anything in particular, this is me just wanting to try new things and have a bit of fun :-)
Quote:Original post by UtMan
OK Good so there is still a challenge to be had.

So any tips/references on implementing LIGHTLY-locking queues? And any other comments or tips on implementing a good threadpool?
The best solution I've come up with is to simply use a non-locking (non-thread-safe) queue:
typedef std::vector<std::vector<Job*>> ThreadLocalJobBatch;

ThreadLocalJobBatch queue;
queue.resize(numThreads);
for( int i=0; i!=numThreads; ++i )
  queue[i].reserve(guessNumJobsPerFrame);
Take a list of jobs one frame/sub-frame in advance (adding latency is the key to fast concurrency) and distribute the jobs into one of these job batches. Double-buffer the pool's job-batch, so you can be scheduling one batch of jobs while another is executing. Then all you need is an atomic swap to flip your double-buffer (one atomic op per batch, instead of one/many per job).
Algorithmically, this is not only 'lock free' but also 'wait free'.

Don't get hung up on trying to make inter-thread communication fast, instead try to find ways to eliminate/reduce that communication.
Don't try to solve threading problems with locking if they can be solved with scheduling (if you know two bits of code won't be scheduled to run at the same time, you don't need locks - locks are one way to control scheduling, but not the only way). IMO lock-free structures are often a big red herring when looking to improve the performance of concurrent code.

[Edited by - Hodgman on September 5, 2010 7:11:27 AM]

This topic is closed to new replies.
