Task parallelism


Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking there is still room to push one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure, the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.
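
For illustration, here is a minimal sketch of the kind of bounded SPSC ring buffer under discussion (my own made-up names, not OP's code), where a full queue makes the push fail instead of overwriting the oldest element:

    #include <atomic>
    #include <cstddef>

    // Illustrative bounded single-producer/single-consumer ring buffer.
    // Capacity must be a power of two so index wrapping is a cheap mask.
    template <typename T, std::size_t Capacity>
    class SpscQueue {
        static_assert((Capacity & (Capacity - 1)) == 0, "power of two");
        T buffer_[Capacity];
        std::atomic<std::size_t> head_{0}; // advanced by the consumer
        std::atomic<std::size_t> tail_{0}; // advanced by the producer

    public:
        // Producer side: refuses the push when full; the caller must
        // handle that robustly (block, retry, grow, report...).
        bool push(const T& item) {
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            const std::size_t head = head_.load(std::memory_order_acquire);
            if (tail - head == Capacity)
                return false;                    // full
            buffer_[tail & (Capacity - 1)] = item;
            tail_.store(tail + 1, std::memory_order_release); // publish last
            return true;
        }

        // Consumer side: a stale tail can only under-report the size, so
        // the worst case is a spurious "empty", never a torn read.
        bool pop(T& out) {
            const std::size_t head = head_.load(std::memory_order_relaxed);
            const std::size_t tail = tail_.load(std::memory_order_acquire);
            if (head == tail)
                return false;                    // (possibly spuriously) empty
            out = buffer_[head & (Capacity - 1)];
            head_.store(head + 1, std::memory_order_release);
            return true;
        }
    };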

The scheduler is responsible for managing resources; that's its job. If the scheduler overflows the queue and you don't want that to happen, that's a fault of how you implemented it. (It is also perfectly valid to accept indeterminate results when that happens; I worked on software that accepted exactly that, because reaching that state already meant a memory overflow, and it forced a reboot.) OP didn't give the scheduler code, so why assume it is wrong and causing an error? And if the code were given, and it were wrong, why not just help?


Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do.


What are you talking about? The two sides aren't pointing at the same element when the size is one. If the scheduler adds a second entry, it doesn't write it into the same slot the first element occupies.

Honestly, this is a legit lockless queue for single producer, single consumer. Google it. Most results are going to look nearly identical. I was personally on a team that shipped this implementation into millions of homes. You're trying to find errors that just aren't there unless someone implemented it wrong.

The thing is, any kind of lock-free queue, and work stealing as described on the Molecular Musings blog in particular, is highly complex (that's an understatement!), and anything like "correct" or "perfectly safe" should never be said lightly (we know far too few details, having seen only a small fragment of code).


I was pretty explicit about not talking about the work stealing. I was pretty explicit that I doubt it is even useful. (It also seems worth noting that the linked article said "The implementation of the work-stealing queue uses locks.") OP hadn't implemented work stealing yet, just the normal lockless queue. And you asserted the existing code was making assumptions it was not making.

I want people to understand how these things work, so I felt it necessary to correct the misperception that a lockless queue doing separate reads of top and bottom is somehow not allowed. It is absolutely allowed, if done properly. There was nothing in the code provided to suggest it was not done properly, although we did not see all of the code.

Either way, what I'm trying to say is that the advantage of such a complex system may very well be debated, and the practical gains may quite possibly be marginal, if they exist at all. The risk of getting a total fuckup with unexplainable (and unreproducible!) failures three weeks after shipping is, however, very real.


Then say that. Say what you mean. There is risk in every piece of code, and developers must choose what risks are worth taking. And it is completely okay if your project does not feel a specific risk is worth taking, just as it is completely okay if a different project decides the same risk is worth taking for their goals.

Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each task/job is not parallelizable, so why move it from one thread to another?

In this case, each thread has its own job queue. Work stealing seems to be a proposal to gain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.

It seems you'd have to have a very intimate knowledge of the OS scheduler to gain any performance. I'm all for esoteric solutions, and trying things simply for the sake of learning, but this seems a little extreme even for my tastes.

Lock-free algorithms aren't necessarily fast. They're quite often slower than lock-based algorithms :wink: Going lock-free doesn't magically solve your data-sharing performance issues.

A single, shared queue containing all jobs will be a highly contended resource, as all threads constantly push and pop work through it. The real performance cost in any multi-threaded code is contention. Two threads that both try to use the queue at close intervals in time will cause all sorts of havoc at the CPU cache level...
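
To make the cache-level point concrete, here's a sketch (illustrative, not from any code in this thread): even a correct lock-free queue gets hammered if the producer's and consumer's indices share a cache line, which is why implementations typically pad them apart.

    #include <atomic>
    #include <cstddef>
    #include <new> // std::hardware_destructive_interference_size (C++17)

    struct QueueIndices {
        // If head and tail sat on the same cache line, every push would
        // invalidate the consumer's cached copy and vice versa, even though
        // the two threads never touch each other's index (false sharing).
        // If your standard library doesn't define the constant, 64 is a
        // common fallback.
        alignas(std::hardware_destructive_interference_size)
        std::atomic<std::size_t> head{0}; // consumer-owned
        alignas(std::hardware_destructive_interference_size)
        std::atomic<std::size_t> tail{0}; // producer-owned
    };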

Having one queue per thread is a very simple method of reducing the amount of resource contention. Check your own queue first, and if it's empty, then check someone else's queue. Assuming that every thread is generating a lot of jobs, this means that most of the time, each thread will only work with its own queue, and there will be very little contention occurring.
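
A rough sketch of that worker loop (Job and JobQueue are illustrative stand-ins; JobQueue is assumed to offer a thread-safe steal() for the contended path, e.g. a Chase-Lev style deque):

    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Job { void (*fn)(void*); void* data; void run() { fn(data); } };

    // Assumed interface: pop() for the owning thread, steal() for others.
    struct JobQueue { bool pop(Job&); bool steal(Job&); void push(const Job&); };

    void WorkerLoop(std::size_t self, std::vector<JobQueue>& queues) {
        for (;;) {
            Job job;
            if (queues[self].pop(job)) {    // own queue: the common case
                job.run();
                continue;
            }
            bool stole = false;
            for (std::size_t i = 0; i < queues.size(); ++i) {
                if (i == self)
                    continue;
                if (queues[i].steal(job)) { // contended path: rare if every
                    job.run();              // thread generates its own work
                    stole = true;
                    break;
                }
            }
            if (!stole)
                std::this_thread::yield();  // nothing anywhere: back off
        }
    }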

Honestly, this is a legit lockless queue for single producer, single consumer. Google it.

He's hidden the code now, but push was not safe (it published work to the consumer before it was complete), and pop was very suspicious. We called it out with good reason :)
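
To make that concrete, here is a reconstruction of the kind of ordering bug we called out (my own minimal version, not the hidden code): the element has to be fully written before the index that publishes it is advanced, and that store needs release semantics. Full-queue checks are omitted to isolate the ordering point.

    #include <atomic>
    #include <cstddef>

    struct Job { void (*fn)(void*); void* data; }; // illustrative job type

    // Broken: advances the index first, so the consumer can pop the slot
    // before the job data is actually there.
    void push_broken(Job* jobs, std::atomic<std::size_t>& bottom, const Job& j) {
        const std::size_t b = bottom.load(std::memory_order_relaxed);
        bottom.store(b + 1, std::memory_order_relaxed); // published too early!
        jobs[b] = j;                                    // ...data lands later
    }

    // Fixed: complete the element, then publish it with a release store.
    void push_fixed(Job* jobs, std::atomic<std::size_t>& bottom, const Job& j) {
        const std::size_t b = bottom.load(std::memory_order_relaxed);
        jobs[b] = j;                                    // write the element
        bottom.store(b + 1, std::memory_order_release); // then make it visible
    }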

The thing is, and I apologize for repeating myself over and over again, the problem that work stealing is trying to solve is non-existent. Yes, it may be a very real problem in other parallel computing applications where tasks go over several physical CPUs on different physical machines (maybe with network in between), but almost certainly not here, in this case.

I mean, if this is for academic interest, fair enough. But if it is for something practical...

Also, the mentioned approach of work stealing while only ever posting to one's own queue is somewhat nonsensical, because it reintroduces the very problem (the non-existing problem) that work stealing tries to solve: there will be (non-existing) contention on the "steal end" of the single queue that receives all the tasks. Because, you know, nothing is ever posted to the other per-thread queues. Makes you wonder why they're there in the first place.

Compare that to other approaches like MoodyCamel's per-worker queue-of-queues, which indeed avoids contention, for what it's worth. But this requires a considerably more involved scheduler, and although it's an awesome piece of code, I've measured his approach against other approaches (including a simple mutex queue) with exactly zero difference in a real application (it sure beats the devil out of other implementations in a synthetic benchmark, but we're not programming synthetic benchmarks, are we).
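
For reference, the "simple mutex-queue" baseline I mean is nothing fancier than this sketch:

    #include <mutex>
    #include <optional>
    #include <queue>

    template <typename T>
    class MutexQueue {
        std::mutex mutex_;
        std::queue<T> queue_;

    public:
        void push(T item) {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }

        std::optional<T> pop() {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty())
                return std::nullopt;
            T item = std::move(queue_.front());
            queue_.pop();
            return item;
        }
    };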

The reason for doing parallel computations is to get more bang for the buck, insofar as less wall time passes until the computation is finished (in CPU time, more time actually passes).

This works only if (at least) the following assumptions are valid:

  1. Fork-join plus parallel execution takes less wall time than executing everything sequentially (I'll assume that "tasks can execute in parallel at all" is too obvious a precondition to be worth mentioning).
  2. Tasks do not, on average, appear faster than the pool of workers can process them. It may occasionally happen (that's why the queue has a size greater than 1), but not on average, or persistently. If tasks persistently appeared faster than they could be processed, a task queue of any possible size would inevitably overflow, with nothing you could do to prevent it.
  3. Yet, tasks usually take longer (much longer) to process than to be pushed to and pulled from the queue. If this were not the case, why would you go through the whole ordeal in the first place? Unless you have an infinite number of CPU cores and infinite memory bandwidth, it wouldn't gain you anything.

Combining the above kind of implies that the producer does not queue new tasks all the time without cease, but either sleeps/blocks in between at times, or does a different computation, or something. Blocking when the queue is full may be a valid option, or having a queue big enough for all tasks until the producer incidentally blocks on vertical sync anyway, or whatever.
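
If you go the blocking route, a bounded queue that blocks the producer when full is a few lines with a mutex and two condition variables (a sketch, not tuned for anything):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    template <typename T>
    class BoundedBlockingQueue {
        std::mutex mutex_;
        std::condition_variable not_full_;
        std::condition_variable not_empty_;
        std::queue<T> queue_;
        const std::size_t capacity_;

    public:
        explicit BoundedBlockingQueue(std::size_t capacity)
            : capacity_(capacity) {}

        void push(T item) { // producer blocks instead of overwriting
            std::unique_lock<std::mutex> lock(mutex_);
            not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
            queue_.push(std::move(item));
            not_empty_.notify_one();
        }

        T pop() { // worker blocks while the queue is empty
            std::unique_lock<std::mutex> lock(mutex_);
            not_empty_.wait(lock, [&] { return !queue_.empty(); });
            T item = std::move(queue_.front());
            queue_.pop();
            not_full_.notify_one();
            return item;
        }
    };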

It also implies that, sampling randomly at any point in time, workers are much more likely to be found in "work" mode than in "pop" mode (because "work" takes so much longer than "pop").

While doing a couple of atomic operations (or acquiring a lock) is not exactly a free operation, nor is pulling an invalidated cache line from RAM, it is still relatively fast compared to executing a 5-6 digit number of instructions in a task (things may be very different when e.g. making an IPC or network request, think e.g. the ZMQ task ventilator, where we talk about milliseconds, not dozens of nanoseconds).

That's why even locking is not as big a problem as people often expect it to be. Unless there is nothing to do, the workers spend >99.9% of their time processing tasks (well, of course! that is what you're interested in...), which in turn means the likelihood of any thread trying to acquire the lock at any given time is very low. Which in turn implies that contention is not really an issue. Certainly, every now and then two threads will try to acquire the lock at the same time, or do a concurrent atomic operation; if that could never happen, you wouldn't need the synchronization at all and could just scrap the whole thing. But once in a while is OK. As long as it doesn't happen all the time, every time, you're good to go, with no observable difference. If tasks are reasonably sized, the odds are on your side.
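A back-of-envelope check with invented but plausible numbers: say a task runs for 50 µs and a pop holds the lock for 50 ns, so each worker spends about 0.1% of its time in the critical section. With 8 workers, the chance that a given pop collides with any of the other 7 threads is about 1 - (1 - 0.001)^7 ≈ 0.7%, i.e. more than 99% of acquisitions are uncontended.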

If contention is a problem, you have, in my opinion, almost certainly done something wrong.

The thing is, and I apologize for repeating myself over and over again, the problem that work stealing is trying to solve is non-existent.

No need to apologize, I think you hit the nail on the head, and it deserves repeating.

If queue contention is an issue, then instead of trying to fix the contention directly, first try to fix what causes the contention. Give your workers beefier tasks, so they aren't spending a disproportionate amount of time getting work when they should be spending that time actually working. Then you can afford the overhead of a simple locking queue for the 0.1% of the time threads spend getting work, and the even smaller fraction of that time they might actually spend contending with each other.
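
"Beefier" can be as simple as chunking. A hypothetical sketch (Scheduler, Item and Process are made-up names): one job per contiguous range of items instead of one job per item, so queue traffic drops by the chunk factor.

    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t kChunkSize = 1024; // tune to your workload

    void SubmitChunked(Scheduler& scheduler, Item* items, std::size_t count) {
        for (std::size_t begin = 0; begin < count; begin += kChunkSize) {
            const std::size_t end = std::min(begin + kChunkSize, count);
            scheduler.submit([items, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    Process(items[i]); // the actual per-item work
            });
        }
    }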

If for whatever reason (in this case, for the sake of argument) you're stuck with lots of piddly tasks and can't do much about it, then I also agree that the approach as presented in the article makes little sense. Per-worker queues help with consumer contention for sure, but mitigating producer contention is the job of the task scheduler. The purpose of work stealing is to then deal with the sub-optimal load balancing that can result from the scheduler focusing on contention.

However, the article seems to miss the mark here. It ignores the fact that each worker can actually create contention on its own queue by pushing tasks that other threads constantly have to steal (and there's no mention of handling overflow in these cases). Even a naive round-robin strategy, where each worker queue gets every i-th task, could spread out that pressure some. It then focuses on work stealing as a solution to said contention (which it isn't). I have no doubt that the author's system works, because there isn't anything technically wrong with it. But it feels like a misunderstanding of the problem combined with a misapplication of solutions, leading to an approach that isn't much better (if at all) than the original it was trying to improve upon.
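
That round-robin strategy is tiny; a hypothetical sketch (Job and JobQueue are made-up names, and this assumes only the scheduler thread submits, so each queue keeps a single producer):

    #include <atomic>
    #include <cstddef>
    #include <vector>

    // Spread submissions across all worker queues so that no single
    // queue becomes the hot "steal end".
    void SubmitRoundRobin(std::vector<JobQueue>& queues, const Job& job) {
        static std::atomic<std::size_t> next{0};
        const std::size_t i =
            next.fetch_add(1, std::memory_order_relaxed) % queues.size();
        queues[i].push(job);
    }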

I just want to highlight also that lock-free does not have anything to do with contention-free, and indeed several lock-free algorithms will not perform well if heavy contention is a problem. What you want for that are wait-free algorithms/data structures, and those are generally only helpful if you have other work you can do before checking back to see if the contention has been resolved.
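
The textbook illustration of that distinction (not tied to any code in this thread): a CAS retry loop is lock-free, since some thread always makes progress, but an individual thread can retry indefinitely under contention; a single fetch_add is wait-free, since every thread finishes in a bounded number of steps.

    #include <atomic>

    std::atomic<int> counter{0};

    // Lock-free: can spin arbitrarily long if other threads keep winning
    // the CAS, even though *somebody* always makes progress.
    void increment_lock_free() {
        int expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_relaxed)) {
            // expected was refreshed by the failed CAS; just retry
        }
    }

    // Wait-free: completes in a bounded number of steps no matter what
    // the other threads are doing.
    void increment_wait_free() {
        counter.fetch_add(1, std::memory_order_relaxed);
    }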


