Task parallelism


Hey

so after a while I need some input for my game engine, too. I want to implement a job system using task parallelism and work stealing to achieve an arbitrary distribution of work across all system cores. My system uses a lock-free queue to store tasks at a specific worker (which may or may not be pinned to a specific CPU core). When a task is pushed to the system, it is added to the calling thread's queue, and the job system also tries to wake up the owner or an additional thread if possible. A task is dequeued by the queue's owner and executed if that succeeds; otherwise the worker picks another worker at random and tries to get work from its queue. If all of the above fails, the worker goes to sleep, so the engine's workload adjusts to the running system.

So now I have the atomic queue, which works well for now, but I want to add some kind of job stealing to it, and I'm not quite sure how to access the queue from another thread. Using dequeue works as expected: investigating my 12 workers (I have a hexa-core CPU), at least one has the test task as an element of its queue, though N of them may have had it already, so this seems OK to me.

My general question is how job stealing is implemented on the queue side: should there be a separate function in the queue that lets another thread steal a task, or is a plain dequeue fine for it? Or is the real work done on the managing side that handles how the workers share with each other, and what is the state of the art here?

The background is that I saw an implementation where the queue was more like a stack for the owning thread, so push and pop were handled on the owner's side and dequeue on the thieves' side. I saw a few problems with that: continuous tasks like the messaging system would always be on top of the list and always run on a certain worker, and with enough other continuous tasks the whole job system could get stuck, with a continuous task running on each worker and other tasks never getting to run even though they could be stolen. So I decided to go for a traditional queue that would guarantee that each task, continuous or not, gets its runtime.

And yes, I'm not using std::atomic or anything else from the C++11 toolbox, just a platform wrapper around some interlocked/__sync_XX functions, so please don't advise any C++11-related gimmicks here.

Thanks in advance


Ugh. Did someone say "Don't write your own lockfree stuff"? You know, it often doesn't appear to be helpful when people toss out statements like that, but... seriously. Don't write your own lockfree stuff, unless you know what you're doing.

A few points to think about:

  1. What happens if the queue is full? This is something that can very conceivably happen with a bounded queue such as yours. You're not checking for that condition, but even if you wanted to do it, how would an unconditional exchange_add even allow for it? It will have incremented the index whether you are happy about it or not!
  2. What is the rationale behind compare-exchanging the read and write pointers (which is a super-expensive operation)?
  3. How do you know that if(bottom - top > 0) is correct or meaningful at all? Two presumably atomic but distinct operations, one reading the write pointer and one reading the read pointer, are not atomic as a pair. Which means that whatever values you have, in a multithreaded environment with threads adjusting one or the other, they are almost guaranteed to be wrong at times.
  4. Memory ordering, happens-before guarantees, anyone?

uint32 bottom = __exchange_add(&write_ptr, 1); tasks[bottom % FIXED_SIZE] = task;

How do you ensure thread-safety here? You increment the write cursor and then write the data to that location afterwards... But as soon as the write cursor has been incremented, other threads will assume that it's safe to read this item, potentially before it has been written.
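For illustration only, one common way to close that gap is to write the payload first and publish a per-slot flag afterwards, with release/acquire ordering. This is just a sketch of the ordering idea, not a complete queue, and __store_release/__load_acquire are placeholder names for whatever your platform wrapper provides on top of the interlocked/__sync_XX primitives:

// Sketch of the ordering idea only, not a complete queue: the payload is written
// before the slot is published, so a reader that sees ready == 1 also sees the task.
// __exchange_add, __store_release and __load_acquire are placeholder names.
struct Slot
{
    Task         task;
    volatile int ready;   // 0 = empty, 1 = payload written and visible
};

Slot tasks[FIXED_SIZE];

void enqueue(const Task& task)
{
    uint32 bottom = __exchange_add(&write_ptr, 1);   // reserve a slot
    Slot& s = tasks[bottom % FIXED_SIZE];
    s.task = task;                                   // write the payload first...
    __store_release(&s.ready, 1);                    // ...then publish it
}

bool try_read(uint32 index, Task& out)
{
    Slot& s = tasks[index % FIXED_SIZE];
    if (__load_acquire(&s.ready) == 0)
        return false;                                // not published yet
    out = s.task;
    return true;
}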

uint32 top = __compare_exchange(&read_ptr, 0, 0), bottom = __compare_exchange(&write_ptr, 0, 0);

What's the logic of this step?
top = read_ptr; if(read_ptr == 0) read_ptr = 0;
bottom = write_ptr; if(write_ptr == 0) write_ptr = 0;

I'm not quite sure how to access the queue from another thread

There are many kinds of "lock free queues". You need an SPMC or MPMC variant if you want every thread to be able to safely pull work out of any queue.
I would not recommend writing one, as it's really easy to get these kinds of low-level thread-safe containers slightly wrong, in a way where they work fine for 6 months before experiencing a race condition :(
However, I don't follow my own advice, and mine is here if you want to inspect my logic: fifo_mpmc.h fifo_mpmc.hpp It's probably sub-optimal, but should hopefully be safe, as it's shipped on a few game DVDs :lol:
Searching for "multi-producer multi-consumer lock-free queue" will bring up some other implementations.

One simple way to use a job system is to only use a single, central, shared MPMC queue for all jobs, which all threads push to and all threads pop from. I would use that until you notice that thread contention is an issue.
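Something like this, as a minimal sketch of that first option. It isn't lock-free at all, which is fine as a starting point, and it uses std::mutex / std::condition_variable purely for brevity (substitute your own platform wrapper if you're avoiding the C++11 toolbox):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Minimal sketch of the "one central queue" design: every worker pushes to and
// pops from the same queue. Not lock-free; std::mutex/std::condition_variable
// are used only for brevity here.
class JobQueue
{
public:
    void push(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
        }
        m_wake.notify_one();                 // wake one sleeping worker
    }

    // Blocks until a job is available; returns false if the queue was shut down.
    bool pop(std::function<void()>& job)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_wake.wait(lock, [this] { return !m_jobs.empty() || m_shutdown; });
        if (m_jobs.empty())
            return false;                    // woken up for shutdown
        job = std::move(m_jobs.front());
        m_jobs.pop();
        return true;
    }

    void shutdown()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_shutdown = true;
        }
        m_wake.notify_all();
    }

private:
    std::mutex                        m_mutex;
    std::condition_variable           m_wake;
    std::queue<std::function<void()>> m_jobs;
    bool                              m_shutdown = false;
};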
Another way is to use a SPMC queue per thread; every thread pushes to its own queue and typically pops from its own queue, but can also pop from any other queue if its own is empty.
Or use an MPMC queue per thread, and in addition to the above, threads can push jobs into other threads' queues too, if required.
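A rough sketch of the worker loop for the per-thread-queue variants looks something like this; WorkerQueue, pop(), steal(), pick_random_victim() and wait_for_work() are assumed names here, not a reference to any particular implementation:

// Rough sketch of a worker's main loop for the per-thread-queue designs.
// All of the names below are placeholders, not a specific implementation.
void worker_loop(int my_index, WorkerQueue* queues, int worker_count)
{
    for (;;)
    {
        Task task;

        if (queues[my_index].pop(task))      // prefer our own queue
        {
            task.run();
            continue;
        }

        int victim = pick_random_victim(my_index, worker_count);
        if (queues[victim].steal(task))      // otherwise try to steal from another worker
        {
            task.run();
            continue;
        }

        wait_for_work(my_index);             // nothing found anywhere: go to sleep
    }
}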

Ugh. Did someone say "Don't write your own lockfree stuff"?

Sorry, but in every post I read here, sooner or later somebody says "why not use C++11's this-that-whatever instead of making your own", so I ignore it :unsure:

As written in the post, I did some research and read some blog posts, for example https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/ where the author implemented a combined LIFO/FIFO structure that, as I wrote above, might conflict with continuous tasks like the message system. I liked the idea of assigning one thread to each CPU core (whether the OS actually does so is not the point of this topic), but thought that a dequeue operation should happen instead of a pop.

The steal would do the same as dequeue, just from the other side (if you dequeue from top, you steal from bottom). And it would require two tasks be present to permit one to be stolen unless you want to solve a three-way race in a lockless system (I'll explain below).
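To make that convention concrete, here's a deliberately lock-based sketch (a real work-stealing deque is lockless and far more subtle); the only thing it illustrates is that the owner takes from one end, thieves take from the other, and a steal is refused unless at least two tasks are present:

#include <deque>
#include <mutex>

// Deliberately lock-based illustration of the ownership convention only; a real
// work-stealing deque would be lockless and considerably more careful.
template <typename Task>
class OwnerDeque
{
public:
    void push(Task t)                        // owner only
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tasks.push_back(std::move(t));
    }

    bool pop(Task& out)                      // owner only, takes from its own end
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.empty())
            return false;
        out = std::move(m_tasks.back());
        m_tasks.pop_back();
        return true;
    }

    bool steal(Task& out)                    // thieves take from the opposite end
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.size() < 2)              // leave the last task for the owner
            return false;
        out = std::move(m_tasks.front());
        m_tasks.pop_front();
        return true;
    }

private:
    std::mutex       m_mutex;
    std::deque<Task> m_tasks;
};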

I am all for reinventing the wheel, so understand I strongly disagree with people saying you shouldn't try to implement a lockless queue. It's good experience, much like rebuilding a motor as a mechanic is entirely unnecessary but really good experience to have. But I wonder if you have actually thought this through. It feels like you're trying to use a CPU scheduler concept for your design of a task scheduler, and I doubt this specific approach works how you want. If you cannot guarantee that the time from the queue size check to the adjustment of the queue pointer is less than the time the thread you're stealing from takes to obtain and complete a job, this is not a safe implementation. And you cannot guarantee that on a system where you don't control the CPU scheduler. The OS can bench your stealing thread for as long as it wants right in the middle of that work.

If this approach did work, you would also have to ensure things work properly if you try to steal at the same time the dispatcher dispatches to the same thread. (It is easily solved, but a solution must be chosen.) That didn't seem to be mentioned in the article you linked (I only skimmed it, so apologies if it was addressed), and you might never even notice the need exists during your testing, because you probably wouldn't run a bunch of other programs in the background while running a game that used your game engine. But users will! (Most programmers massively undertest their multithreading code for race conditions, and often don't even know how to test in a way that makes race conditions more likely to be encountered.)

If I am correct that you hadn't considered this, and don't have a solution for it, my personal recommendation is that you should focus on gaining more experience with traditional multithreading first. Try using locks, but minimizing how many you need, and how long you hold them. Learn to use signals to supplement locks, and massively improve what is possible. Learn how to test intentionally faulty code to make it more likely to hit the race condition. Basically, learn the ins and outs of any topic before you try a revolutionary idea in that topic area. And if you get really comfortable with locks, you will always be highly sought after by pretty much any company you would even consider for employment. Especially if you start talking about how to design tests for locks. But don't give up pushing the boundaries, and reinventing wheels. Just make sure you understand how a particular wheel works before you try to reinvent it.

Just something I found that's worth mentioning: https://www.gamedev.net/topic/580310-c-lockless-queue-for-a-threadpool/

Sorry, but in every post I read here, sooner or later...

Yes, and I dislike doing it myself. Yet, there is a good reason for that.

Did you read the "couple of points to think about" part in my post?

did some research

Fair enough, but you have several obvious and very serious errors in the few lines of code you posted. I'm not talking about not-so-obvious ones, just the ones that are immediately visible.

You wrote that your code "works fine", but that is yet another serious mistake. Lockfree code must be either provably correct, or it must at least be validated very, very carefully with a specialized tool such as Relacy. It is a most devastating mistake to run some tests on your development machine and, if it doesn't crash, step forward and say "yay, it works". You know, truth is, it does not work. But what's worse, maybe it does work sometimes, so you do not even realize that it's broken, nor can you foresee when it will fail. Absence of failure is not proof of correctness.

If nothing else, consider my above point with the two atomic reads (which are really compare-exchanges, unnecessarily). Your code translates to:

  1. do atomic operation 1
  2. do atomic operation 2
  3. assume(atomic operation 1 + atomic operation 2 == consistent state)

This cannot work. Never, not ever. Except... by pure chance (because there happens to be no concurrent modification). It is formally wrong.

Two distinct atomic operations are individually atomic, of course, but they are not atomic in their entirety. It is perfectly possible (and "possible" means it not only could happen in principle, but it will eventually happen, given enough time) that right after you read the first value, another thread modifies the second value before you get a chance to read it. Or something different: it might update both values to a different, consistent state, but you have already read half of the old state, so your state is inconsistent anyway. Whatever it is, it can, and will eventually, fail, and when it happens you will not know what went wrong.

(To avoid this problem, do a single atomic fetch with each index occupying only half of the atomically-fetched word, then sort them out by shifting and masking. This is correct, i.e. it actually works.)
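A rough sketch of that idea, with both indices packed into a single 64-bit word so that one atomic read yields a consistent pair; __load64 is a placeholder name for whatever 64-bit atomic read your platform wrapper exposes:

// Sketch only: both cursors live in one word, so a single atomic fetch gives a
// consistent (top, bottom) pair. __load64 is a placeholder for a 64-bit atomic read.
volatile uint64 cursors = 0;   // low 32 bits = read index, high 32 bits = write index

void read_cursors(uint32& top, uint32& bottom)
{
    uint64 snapshot = __load64(&cursors);        // one atomic fetch
    top    = (uint32)(snapshot & 0xFFFFFFFFu);   // read index
    bottom = (uint32)(snapshot >> 32);           // write index
}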

Really, do not, do not write your own lockfree stuff unless you are 100% confident that you understand exactly every single detail and every implication. Even then, even if you are 100% certain that you know your way around, you may still do it wrong (Herb Sutter did, after all).

To avoid this problem, do a single atomic fetch with each index occupying only half of the atomically-fetched word, then sort them out by shifting and masking

This is an interesting and useful approach that I may test later.

EDIT: It seems I need some coffee or holidays <_<

The algorithm Shaarigan is discussing does not assume atomic operations 1 and 2 were done in a consistent state, at least not in the article linked. (I have no idea if Shaarigan's code does, as the scheduler side was not included.) This algorithm is not some amazing new idea, but a rather common lockless solution. (The stealing part is less common; I honestly haven't seen it before and seriously doubt its usefulness in any realistic situation, but I suppose it might work.)

On these two atomic reads. Let's assume the queue length is wrong as a result of an extra element being added in between the two reads. Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

And it's perfectly safe because you're the only one who can modify your side. The scheduler is allowed to read it to determine if adding another element is allowed (which is why the queue can't overflow). And it doesn't matter if the scheduler gets stale data, because the modification only goes in one direction, so the scheduler can never wrongly think it's safe to add an element; at worst it thinks it's not safe when it actually is, and we don't care if that happens.

I am far more sceptical of the "stealing" modification to a well-known, common lockless approach, but Shaarigan seems to recognize something fishy is going on with that stealing concept too, and questions are being asked. That is a good thing! If anyone proposed how to code the steal, perhaps Shaarigan would have said "hey, that's not safe! Here's why..." although it is impossible to provide a safe implementation without the scheduler side of the code.

Is Shaarigan in over their head? Probably. Most people are when asking questions on a public forum. But challenging ourselves is how we learn. I hope we all push ourselves beyond our comfort zones. While I have pretty good experience with locks and lockless alternatives, I have asked questions relating to Unity and C# in the past six months that most developers would take as proof I had no idea what I was doing. And they would be right: I had no idea what I was doing.

Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking that the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking that there is still room for pushing one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure... the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.
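Sticking with the packed-cursor sketch from earlier, a push that deals with the full case could look roughly like this: it claims a slot with a compare-exchange over the whole cursor word and refuses to claim one when the queue is full, instead of silently overwriting. __load64 and __compare_exchange64 are placeholder names, and the payload-publication caveat from above still applies:

// Sketch only: refuse to enqueue when the queue is full instead of overwriting.
// __load64 and __compare_exchange64 are placeholder names for 64-bit atomics.
bool try_push(const Task& task)
{
    for (;;)
    {
        uint64 snap   = __load64(&cursors);
        uint32 top    = (uint32)(snap & 0xFFFFFFFFu);   // read index
        uint32 bottom = (uint32)(snap >> 32);           // write index

        if (bottom - top >= FIXED_SIZE)
            return false;                               // full: the caller must handle this

        uint64 desired = snap + ((uint64)1 << 32);      // bump only the write index
        if (__compare_exchange64(&cursors, desired, snap) == snap)
        {
            tasks[bottom % FIXED_SIZE] = task;          // publication caveat still applies
            return true;
        }
        // lost the race against another producer, retry
    }
}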

And it's perfectly safe because you're the only one who can modify your side.

Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do. Surely it's perfectly safe to take that object from my side, because nobody else can take it. Surely it's no problem that the two hungry workers neither agree with each other nor with me whether there's a job left in the queue.

The thing is, any kind of lockfree queue, but work stealing as described on the Molecular Musings blog in particular, is highly complex (that's an understatement!), and anything like "correct" or "perfectly safe" shouldn't ever be said lightly (we know far too few details, having seen only a small fragment of code). Either way, what I'm trying to say is that the advantage of such a complex system can very well be debated, and the practical gains may quite possibly be marginal, if they exist at all. The risk of a total fuckup with unexplainable (and unreproducible!) failures three weeks after shipping is, however, very real.

I am not doubting that Stefan Reinalter got his implementation to work in ME (his engine has been shown to work, so apparently the job system somehow works, too!), but I'm ready to doubt that the gains warrant the complexity. Possibly (I have no data to back this up, but still, possibly) there are not even any gains compared to a non-stealing lockfree solution, or even a simple queue with a fast userspace mutex. Those don't perform as abysmally as urban myth tells us, you know. They actually come quite close while being totally trivial to do.

There are a lot of assumptions that may or may not be true, such as, for example, that using one queue gives a lot of congestion (if that is really a problem, then your jobs are too trivial), or that padding a 20-byte job object to fill an entire cache line is an ingenious idea (it may not be, because the assumption of false sharing is wrong if the data is read-only or write-once, and burning three times as much L1 as necessary may very well put you at a disadvantage). For most of these assumptions, there is no clear-cut "right" or "wrong"; it depends a lot on what you're doing with it, too.

There is the assumption that it's an ingenious idea to only post jobs to one's own queue and steal from another if no jobs are on yours. The article indeed says verbatim that the producer only ever posts to its own queue. This means, by consequence, that there are never any normal dequeues from workers. Anything a worker does must ultimately happen by stealing (because nothing is ever pushed onto any worker's queue, except if it pushes a task itself, but then it could just as well process the dependent job right away and avoid the overhead!).

With tongue-in-cheek, you could say that this design has the "best" of two worlds: Congestion at the single queue that actually contains jobs, and overhead for looking at the worker's own queue plus some other workers' queues first. You know, everybody pulling jobs from a single queue was what we wanted to avoid, to start with...

This topic is closed to new replies.
