Shaarigan

Task parallelism


Hey

so after a while I now need some input for my game engine too. I want to implement a job system using task parallelism and work stealing, to spread an arbitrary share of the work over all system cores. My system uses a lock-free queue to store each task at a specific worker (which may or may not be pinned to a specific CPU core). When a task is pushed to the system, it is added to the calling thread's queue, and the job system also tries to wake up the owner or an additional thread if possible. A worker dequeues a task from its own queue and executes it if that succeeds; otherwise it looks at a random worker's queue and tries to steal work from there. If all of the above fails, the worker goes to sleep, to adjust the engine's workload to the running system.
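Roughly, the worker loop I have in mind looks like this (a sketch only; Worker, Task, JobSystem, tryDequeue, trySteal and the other names are illustrative, not my actual engine code):

```cpp
// Sketch of the worker loop described above; all interfaces are assumed.
void workerLoop(Worker& self, JobSystem& system)
{
    for (;;)
    {
        Task task;
        if (self.queue.tryDequeue(task))            // 1. own queue first
            task.run();
        else if (system.randomOtherWorker(self)
                       .queue.trySteal(task))       // 2. try a random victim
            task.run();
        else
            self.sleepUntilWoken();                 // 3. nothing found: park
    }
}
```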

So now I have the atomic queue, which works well for now, but I want to add some kind of job stealing to it, and I'm not quite sure how to access the queue from another thread. Using dequeue works as expected: inspecting my 12 workers (I have a hexa-core CPU), at least one has the test task as an element of its queue, though N of them may have had it already, so this seems OK to me.

My general question is how job stealing is implemented on the queue side: should the queue expose another function that lets a foreign thread steal a task, or is a simple dequeue fine for that? Or does the real work happen on the managing side, which handles how the workers share work with each other and, of course, keeps track of the state?

The background is that I saw an implementation where the queue was more of a stack for the owning thread, so push and pop were handled on the owner's side and dequeue on the thieves' side. I saw a few problems in that implementation: continuous tasks like the messaging system would always sit on top of the list and always run on a certain worker, and if there were enough other continuous tasks, the whole job system could get stuck with a continuous task running on every worker, so other tasks would never get any runtime even though they could be stolen. So I decided to go for a traditional queue, which guarantees that each task, continuous or not, gets its runtime.

And yes, I'm not using std::atomic or anything else from the C++11 toolbox, but rather a platform wrapper around the Interlocked/__sync_XX functions, so please don't suggest any C++11-related gimmicks here.

Thanks in advance

Edited by Shaarigan


Ugh. Did someone say "Don't write your own lockfree stuff"? You know, it often doesn't appear to be helpful when people toss out statements like that, but... seriously. Don't write your own lockfree stuff, unless you know what you're doing.

A few points to think about:

  1. What happens if the queue is full? This is something that can very conceivably happen with a bounded queue such as yours. You're not checking for that condition, but even if you wanted to, how would an unconditional exchange_add even allow for it? It will have incremented the index whether you are happy about it or not! (See the sketch after this list.)
  2. What is the rationale behind compare-exchanging the read and write pointers (which is a super-expensive operation)?
  3. How do you know that if(bottom - top > 0) is correct or meaningful at all? Two presumably atomic but distinct operations, one reading the write pointer and one reading the read pointer, are not atomic together. Which means that whatever values you have, in a multithreaded environment with threads adjusting one or the other, they are almost guaranteed to be stale at times.
  4. Memory ordering, happens-before guarantees, anyone?
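To make point 1 concrete, here is a sketch of a push that can actually observe "full" and back off, using a CAS loop instead of an unconditional exchange_add. It only demonstrates the capacity check; top, bottom, buffer and capacity are illustrative names, and safely publishing the slot contents is deliberately left out:

```cpp
// Illustrative only: shows how a bounded push can refuse when full.
bool push(const Task& t)
{
    for (;;)
    {
        long b  = bottom.load(std::memory_order_relaxed);
        long tp = top.load(std::memory_order_acquire);
        if (b - tp >= capacity)
            return false;                    // full: refuse, don't overwrite
        // advance 'bottom' only if nobody else moved it in the meantime
        if (bottom.compare_exchange_weak(b, b + 1, std::memory_order_acq_rel))
        {
            buffer[b % capacity] = t;        // NOTE: real code must publish
            return true;                     // the slot before readers see it
        }
    }
}
```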


Ugh. Did someone say "Don't write your own lockfree stuff"?

Sorry, but in every post I read here, sooner or later someone says "why not use C++11's this-that-whatever instead of making your own", so ignore it :unsure:

As written in the original post, I did some research and read some blog posts, for example https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/ where the author implemented a combined LIFO/FIFO structure that, as I wrote above, may conflict with continuous tasks like the message system. I liked the idea of pinning one thread to each CPU core (if the OS actually does so, but that's not the point of this topic), but I thought a dequeue operation should happen instead of a pop.

Edited by Shaarigan

The steal would do the same as dequeue, just from the other side (if you dequeue from the top, you steal from the bottom). And it would require two tasks to be present to permit one to be stolen, unless you want to solve a three-way race in a lockless system (I'll explain below).
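A sketch of what I mean, with illustrative names and std::atomic used only for brevity (a production design such as Chase-Lev's deque needs considerably more care):

```cpp
// Steal from the opposite end, and only when at least two tasks are present,
// so the last task is always left to the owner.
bool steal(Task& out)
{
    long t = top.load(std::memory_order_acquire);
    long b = bottom.load(std::memory_order_acquire);
    if (b - t < 2)
        return false;              // 0 or 1 tasks: don't race the owner for it
    Task candidate = buffer[t % capacity];
    // claim the slot; another thief may have beaten us to it
    if (!top.compare_exchange_strong(t, t + 1, std::memory_order_acq_rel))
        return false;
    out = candidate;
    return true;
}
```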

I am all for reinventing the wheel, so understand that I strongly disagree with people saying you shouldn't try to implement a lockless queue. It's good experience, much like rebuilding a motor as a mechanic is entirely unnecessary but really good experience to have. But I wonder if you have actually thought this through. It feels like you're trying to use a CPU-scheduler concept in your design of a task scheduler, and I doubt this specific approach works how you want. If you cannot guarantee that the time from the queue-size check to the adjustment of the queue pointer is less than the time the thread you're stealing from takes to obtain and complete a task, this is not a safe implementation. And you cannot guarantee that on a system where you don't control the CPU scheduler. The OS can bench your stealing thread for as long as it wants, right in the middle of that work.

If this approach did work, you would also have to ensure things work properly if you try to steal at the same time the dispatcher dispatches to the same thread. (It is easily solved, but a solution must be chosen.) That didn't seem to be mentioned in the article you linked to (I only skimmed it, so apologies if it was addressed), and you might never even notice the need exists during your testing, because you probably wouldn't run a bunch of other programs in the background while running a game that used your game engine. But users will! (Most programmers massively undertest their multithreaded code for race conditions, and often don't even know how to test in a way that makes race conditions more likely to be encountered.)

If I am correct that you hadn't considered this, and don't have a solution for it, my personal recommendation is that you should focus on gaining more experience with traditional multithreading first. Try using locks, but minimizing how many you need, and how long you hold them. Learn to use signals to supplement locks, and massively improve what is possible. Learn how to test intentionally faulty code to make it more likely to hit the race condition. Basically, learn the ins and outs of any topic before you try a revolutionary idea in that topic area. And if you get really comfortable with locks, you will always be highly sought after by pretty much any company you would even consider for employment. Especially if you start talking about how to design tests for locks. But don't give up pushing the boundaries, and reinventing wheels. Just make sure you understand how a particular wheel works before you try to reinvent it.

To avoid this problem, do a single atomic fetch with each index occupying only half of the atomically-fetched word, then sort them out by shifting and masking.

This is an interesting and useful approach that I may test later.
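If I understand the suggestion correctly, it would look roughly like this (a sketch; std::atomic is used only for brevity here, a 64-bit interlocked read would do the same):

```cpp
#include <atomic>
#include <cstdint>

// Both indices live in one 64-bit word, so a single atomic load yields a
// consistent snapshot of the pair.
std::atomic<uint64_t> indices{0};  // high half: top, low half: bottom

void snapshot(uint32_t& top, uint32_t& bottom)
{
    const uint64_t v = indices.load(std::memory_order_acquire); // one fetch
    top    = static_cast<uint32_t>(v >> 32);           // shift out the high half
    bottom = static_cast<uint32_t>(v & 0xFFFFFFFFu);   // mask the low half
}
```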

EDIT: It seems I need some coffee or holidays <_<

Edited by Shaarigan

Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking that the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking that there is still room for pushing one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure... the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.

And it's perfectly safe because you're the only one who can modify your side.

Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do. Surely it's perfectly safe to take that object from my side, because nobody else can take it. Surely it's no problem that the two hungry workers neither agree with each other nor with me on whether there's a job left in the queue.

Thing is, any kind of lock-free queue, and work stealing as described on the Molecular Musings blog in particular, is highly complex (that's an understatement!), and anything like "correct" or "perfectly safe" should never be said lightly (we know way too few details, having seen only a small fragment of code). Either way, what I'm trying to say is that the advantage of such a complex system may very well be debated, and the practical gains may quite possibly be marginal, if existent. The risk of a total fuckup with unexplainable (and unreproducible!) failures three weeks after shipping is, however, very real.

I am not doubting that Stefan Reinalter got his implementation to work in ME (his engine has been shown to work, so apparently the job system somehow works, too!), but I'm ready to doubt that the gains warrant the complexity. Possibly (I have no data to back this up, but still, possibly) there are not even any gains compared to a non-stealing lock-free solution, or even a simple queue with a fast userspace mutex. Those don't perform as abysmally as urban myth tells us, you know. They actually come quite close while being totally trivial to write.
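For scale, the "totally trivial" variant really is this small (a minimal sketch using the standard library, nothing tuned):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// A minimal blocking task queue guarded by a mutex. That's all of it.
template <typename Task>
class LockedQueue
{
    std::queue<Task>        tasks;
    std::mutex              mtx;
    std::condition_variable cv;
public:
    void push(Task t)
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(t));
        }
        cv.notify_one();                    // wake one sleeping worker
    }

    Task pop()                              // blocks until work is available
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !tasks.empty(); });
        Task t = std::move(tasks.front());
        tasks.pop();
        return t;
    }
};
```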

There are a lot of assumptions that may or may not be true, such as the assumption that using one queue causes a lot of contention (if that is really a problem, then your jobs are too trivial), or that padding a 20-byte job object to fill an entire cache line is an ingenious idea (it may not be, because the assumption of false sharing is wrong if the data is read-only or write-once, and burning three times as much L1 as necessary may very well put you at a disadvantage). To most of these assumptions there is no clear-cut "right" or "wrong"; it depends a lot on what you're doing with it, too.

Then there is the assumption that it's an ingenious idea to only post jobs to one's own queue and steal from another if there are none on yours. The article indeed says verbatim that the producer only ever posts to its own queue. This means, by consequence, that there are never any normal dequeues by workers: anything a worker does must ultimately happen by stealing (because nothing is ever pushed onto any worker's queue, except if a worker pushes a task itself, but then it could just as well process the dependent job right away and avoid the overhead!).

With tongue-in-cheek, you could say that this design has the "best" of two worlds: Congestion at the single queue that actually contains jobs, and overhead for looking at the worker's own queue plus some other workers' queues first. You know, everybody pulling jobs from a single queue was what we wanted to avoid, to start with...

Edited by samoth


Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking that the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking that there is still room for pushing one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure... the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.

The scheduler is responsible for managing resources; it's kind of its job. If the scheduler is overflowing the queue and you don't want that to happen, that's your fault for implementing it that way. Although it is also perfectly valid to accept undetermined results when that happens (I worked on software that accepted that result, because it forced a reboot, and there was already a memory overflow to get there). The OP didn't give the scheduler code, so why are you just assuming it is wrong and causing an error? And if the code were given, and it were wrong, why not just help?

Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do.


What are you talking about? The two sides aren't pointing at the same element if the size is one. If the scheduler is adding a second entry, it's not doing it in the same exact position it put the first element into.

Honestly, this is a legit lockless queue for single producer, single consumer. Google it. Most results are going to look nearly identical. I personally participated on a team that put this implementation into millions of homes. You're trying to find errors that just aren't there unless someone implemented it wrong.
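For anyone following along, the textbook single-producer/single-consumer ring that most of those search results boil down to looks like this (a sketch, not the OP's code):

```cpp
#include <atomic>
#include <cstddef>

// SPSC ring: the producer alone writes 'head', the consumer alone writes
// 'tail', so each side owns its own index and no CAS is needed.
template <typename T, std::size_t N>
class SpscQueue
{
    T                        buf[N];
    std::atomic<std::size_t> head{0};   // written only by the producer
    std::atomic<std::size_t> tail{0};   // written only by the consumer
public:
    bool push(const T& v)               // producer thread only
    {
        const std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N)
            return false;               // full
        buf[h % N] = v;
        head.store(h + 1, std::memory_order_release);  // publish after writing
        return true;
    }

    bool pop(T& out)                    // consumer thread only
    {
        const std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t)
            return false;               // empty
        out = buf[t % N];
        tail.store(t + 1, std::memory_order_release);  // free the slot
        return true;
    }
};
```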

Thing is, any kind of lock-free queue, and work stealing as described on the Molecular Musings blog in particular, is highly complex (that's an understatement!), and anything like "correct" or "perfectly safe" should never be said lightly (we know way too few details, having seen only a small fragment of code).


I was pretty explicit about not talking about the work stealing. I was pretty explicit that I doubt it is even useful. (It also seems worth noting that the linked article said "The implementation of the work-stealing queue uses locks.") The OP hadn't implemented work stealing yet, just the normal lockless queue. And you asserted the existing code was making assumptions it was not making.

I want people to understand how these things work, so I felt it necessary to correct the misperception that a lockless queue doing separate reads of top and bottom is somehow not allowed. It is absolutely allowed, if done properly. There was nothing in the code provided to suggest it was not done properly, although we did not get all code.

Either way, what I'm trying to say is that the advantage of such a complex system may very well be debated, and the practical gains may quite possibly be marginal, if existent. The risk of a total fuckup with unexplainable (and unreproducible!) failures three weeks after shipping is, however, very real.


Then say that. Say what you mean. There is risk in every piece of code, and developers must choose what risks are worth taking. And it is completely okay if your project does not feel a specific risk is worth taking, just as it is completely okay if a different project decides the same risk is worth taking for their goals.


Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each task/job is not parallelizable, so why move it from one thread to another?


Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each task/job is not parallelizable, so why move it from one thread to another?


In this case, each thread has its own job queue. Work stealing seems to be a proposal to regain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.


 

Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each task/job is not parallelizable, so why move it from one thread to another?


In this case, each thread has its own job queue. Work stealing seems to be a proposal to regain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.

 

It seems you'd have to have a very intimate knowledge of the OS scheduler to gain any performance.  I'm all for esoteric solutions, and trying things simply for the sake of learning, but this seems a little extreme even for my tastes.


Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each task/job is not parallelizable, so why move it from one thread to another?


In this case, each thread has its own job queue. Work stealing seems to be a proposal to regain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.
It seems you'd have to have a very intimate knowledge of the OS scheduler to gain any performance.  I'm all for esoteric solutions, and trying things simply for the sake of learning, but this seems a little extreme even for my tastes.

Lock-free algorithms aren't necessarily fast. They're quite often slower than locking-based algorithms :wink: Going lock-free doesn't magically solve your data-sharing performance issues.

A single, shared queue containing all jobs will be a highly contended resource, as all threads constantly push and pop work through it. The real performance cost in any multi-threaded code is contention. Two threads that both try to use the queue at close intervals in time will cause all sorts of havoc at the CPU cache level...

Having one queue per thread is a very simple method of reducing the amount of resource contention. Check your own queue first, and if it's empty, then check someone else's queue. Assuming that every thread is generating a lot of jobs, this means that most of the time, each thread will only work with its own queue, and there will be very little contention occurring.

Honestly, this is a legit lockless queue for single producer, single consumer. Google it.

He's hidden the code now, but push was not safe (it published work to the consumer before it was complete), and pop was very suspicious. We called it out with good reason :)

Edited by Hodgman


The thing is, and I apologize for repeating myself over and over again, the problem that work stealing is trying to solve is non-existent. Yes, it may be a very real problem in other parallel computing applications where tasks go over several physical CPUs on different physical machines (maybe with network in between), but almost certainly not here, in this case.

I mean, if this is for academic interest, fair enough. But if it is for something practical...

Also, the mentioned approach of work stealing while only ever posting to one's own queue is somewhat nonsensical, because it has the exact (non-existent) problem that work stealing tries to solve: there will be (non-existent) contention on the "steal end" of the single queue that receives all the tasks. Because, you know, nothing is ever posted to the per-thread queues. Makes you wonder why they're there in the first place.

Compare that to other approaches like MoodyCamel's per-worker queue-of-queues, which indeed avoids contention, for what it's worth. But this requires a considerably more involved scheduler, and although it's an awesome piece of code, I've measured his approach against other approaches (including a simple mutex queue) at exactly zero difference in a real application (it sure beats the devil out of other implementations in a synthetic benchmark, but we're not programming synthetic benchmarks, are we).

The reason for doing parallel computations is to get more bang for the buck insofar as less wall time passes until the computation is finished (in CPU time, more actually passes).

This works only if (at least) the following assumptions are valid:

  1. Fork-join plus parallel execution takes less wall time than executing everything sequentially (I'll assume that "tasks can execute in parallel at all" is too obvious a precondition to be worth mentioning).
  2. Tasks do not, on the average, appear faster than the pool of workers can process them. It may occasionally happen (that's why the queue has a size greater than 1), but not on the average, or persistently. If tasks persistently appeared faster than they can be processed, every task queue of any possible size would inevitably overflow, with nothing you can do to prevent that from happening.
  3. Yet, tasks usually take longer (much longer) to process than to be pushed to and pulled from the queue. If this were not the case, why would you go through the whole ordeal in the first place? (Unless you have an infinite number of CPU cores and infinite memory bandwidth, it doesn't gain you anything.)

Combining the above rather implies that the producer does not queue new tasks without cease, but either sleeps/blocks in between at times, or does a different computation, or something. Blocking when the queue is full may be a valid option, or having a queue big enough for all tasks until the producer incidentally blocks on vertical sync anyway, or whatever.

It also implies that sampling randomly at any desired point in time, workers are much more likely to be found in "work" mode than in "pull" mode (because "work" takes so much longer than "pop"):

While doing a couple of atomic operations (or acquiring a lock) is not precisely a free operation, nor is pulling an invalid cache line from RAM, it is still relatively fast compared to executing a 5-6 digit number of instructions in a task (things may be very different when e.g. making an IPC or network request, think e.g. a ZMQ task ventilator, where we talk about milliseconds, not dozens of nanoseconds).

That's why even locking is not such a big problem as people often expect it to be. Unless there is nothing to do, the workers spend >99.9% of their time processing tasks (well, of course! that is what you are interested in...), which in turn means that the likelihood for any thread to be trying to acquire the lock at any given time is very low. Which in turn implies "contention is not really an issue". Certainly, every now and then two threads will try to acquire the lock at the same time, or do a concurrent atomic operation. If that couldn't happen, then you wouldn't need to worry and could just scrap that whole thing. But once in a while, that's OK -- as long as it doesn't happen all the time, every time, you're good to go. No observable difference. If tasks are reasonably-sized, the odds are on your side.
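As an illustrative back-of-the-envelope calculation (assumed numbers, not measurements): if a task runs for 100 µs and taking the lock costs 50 ns, each worker spends about 0.05% of its time at the lock. With 12 workers, the probability that any two of them are at the lock in the same instant is roughly 66 × 0.0005² ≈ 0.002%, i.e. contention is a rounding error.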

If contention is a problem, you have, in my opinion, almost certainly done something wrong.

Edited by samoth

The thing is, and I apologize for repeating myself over and over again, the problem that work stealing is trying to solve is non-existent.

No need to apologize, I think you hit the nail on the head, and it deserves repeating.

If queue contention is an issue, then instead of trying to fix the contention directly, first try to fix what causes the contention. Give your workers beefier tasks, so they aren't spending a disproportionate amount of time getting work when they should be spending that time actually working. Then you can afford the overhead of a simple locking queue for the 0.1% of the time threads spend getting work, and for the even smaller fraction of that time they might spend actually contending with each other.

If for whatever reason (in this case, the sake of argument) you're stuck with lots of piddly tasks and can't do much about it, then I also agree that the approach as presented in the article makes little sense. Per-worker queues help with consumer contention for sure, but mitigating producer contention is the job of the task scheduler. The purpose of work stealing is then to deal with the sub-optimal load balancing that can result from the scheduler focusing on contention.

However, the article seems to miss the mark here. It ignores the fact that each worker can create contention on its own queue by pushing tasks that other threads constantly have to steal (and there is no mention of handling overflow in those cases). Even a naive round-robin strategy, where each worker queue gets every i-th task, could spread out that pressure some (see the sketch below). It then focuses on work stealing as a solution to said contention (which it isn't). I have no doubt that the author's system works, because there isn't anything technically wrong with it. But it feels like a misunderstanding of the problem, combined with a misapplication of solutions, leading to an approach that isn't much better (if at all) than the original it was trying to improve upon.
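The round-robin idea is as simple as it sounds; a sketch with illustrative names (Task and WorkerQueue are assumed types):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Naive round-robin dispatch: spread pushes across all worker queues
// instead of always pushing to the producer's own queue.
void dispatch(Task task, std::vector<WorkerQueue>& queues)
{
    static std::atomic<std::size_t> next{0};
    // each call targets the next queue in turn, spreading producer pressure
    const std::size_t i = next.fetch_add(1, std::memory_order_relaxed)
                          % queues.size();
    queues[i].push(std::move(task));
}
```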

Edited by Zipster


I just want to highlight that lock-free does not mean contention-free; indeed, several lock-free algorithms will not perform well if heavy contention is a problem. What you want for that are wait-free algorithms/data structures, and those are generally only helpful if you have other work you can do before checking back to see whether the contention has been resolved.

