Shaarigan

Task parallelism

16 posts in this topic

Posted (edited)

Hey,

so after a while I need some input for my game engine. I want to implement a job system using task parallelism and work stealing to achieve an arbitrary distribution of work across all system cores. My system uses a lock-free queue to store tasks at a specific worker (which may or may not be pinned to a specific CPU core). When a task is pushed to the system, it is added to the calling thread's queue, and the job system also tries to wake the owner or an additional thread if possible. A task is dequeued and executed by the queue's owner if possible; otherwise the worker picks a random worker and tries to steal work from it. If all of the above fails, the worker goes to sleep, so the engine can adjust its workload at runtime.

So now I have the atomic queue, which works well so far, but I want to add some kind of job stealing to it, and I'm not quite sure how another thread should access the queue. Using dequeue works as expected: inspecting my 12 workers (I have a hexa-core CPU), at least one has the test task in its queue, though N of them may already have processed it, so this seems fine to me.

My general question is: how is job stealing implemented on the queue side? Should the queue expose a separate function that lets another thread steal a task, or is a plain dequeue good enough? Or is the real work done on the management side, which coordinates how the workers share tasks with each other, as in state-of-the-art implementations?

The background is that I saw an implementation where the queue was more like a stack for the owning thread: push and pop were handled on the owner's side, and dequeue on the thieves' side. I saw a problem with that approach: long-running continuous tasks, like the messaging system, would always sit on top of the list and always run on a certain worker. With enough continuous tasks, one per worker, the whole job system could get stuck, and other tasks would never get to run even though they could be stolen. So I decided to go for a traditional queue that guarantees every task, continuous or not, gets its runtime.

And yes, I'm not using std::atomic or anything else from the C++11 toolbox, just a platform wrapper around the Interlocked/__sync_XX intrinsics, so please don't suggest any C++11 gimmicks here.
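For readers wondering what such a platform wrapper might look like: below is a minimal sketch of the idea. The names `atomic_cas32` and `atomic_add32` are mine, not from the OP's actual code, and the MSVC branch is shown only for symmetry.

```cpp
#include <cassert>

// Minimal portable atomics wrapper, as the OP describes: no
// std::atomic, just the platform intrinsics behind common names.
// (atomic_cas32 / atomic_add32 are hypothetical names for illustration.)
#if defined(_MSC_VER)
#include <intrin.h>
// Returns true if *addr held 'expected' and has been set to 'desired'.
inline bool atomic_cas32(volatile long* addr, long expected, long desired) {
    return _InterlockedCompareExchange(addr, desired, expected) == expected;
}
// Returns the new value after adding 'value' to *addr.
inline long atomic_add32(volatile long* addr, long value) {
    return _InterlockedExchangeAdd(addr, value) + value;  // old value + delta
}
#else
// GCC/Clang legacy __sync builtins (pre-C++11, full-barrier semantics).
inline bool atomic_cas32(volatile long* addr, long expected, long desired) {
    return __sync_bool_compare_and_swap(addr, expected, desired);
}
inline long atomic_add32(volatile long* addr, long value) {
    return __sync_add_and_fetch(addr, value);
}
#endif
```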

Thanks in advance

Edited by Shaarigan

Posted (edited)

Ugh. Did someone say "Don't write your own lockfree stuff"?

Sorry, but in every post I read here, sooner or later somebody says "why not use C++11's this or that instead of making your own", so I wanted to preempt it :unsure:

 

As written in the original post, I did some research and read some blog posts, for example https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/ where the author implements a combined LIFO/FIFO structure that, as I wrote, may conflict with continuous tasks like the message system. I liked the idea of binding one thread to each CPU core (whether the OS honors that is a different topic), but I thought a dequeue operation should happen instead of a pop.

Edited by Shaarigan
The steal would do the same as dequeue, just from the other side (if you dequeue from the top, you steal from the bottom). And it would require two tasks to be present before one may be stolen, unless you want to solve a three-way race in a lockless system (I'll explain below).
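To make that shape concrete, here is a deliberately simplified sketch: the owner takes from one end, thieves from the other, and a steal is refused unless at least two tasks are present. A plain mutex stands in for the lockless machinery purely so the two-ended access pattern is visible; the class and method names (`WorkQueue`, `pop_own`, `steal`) are mine, not the OP's.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>

// Simplified work-stealing queue: the owner dequeues from the front,
// thieves dequeue from the back, and a steal only succeeds while two
// or more tasks are queued, so owner and thief never race for the
// last element. A mutex replaces the lock-free details for clarity.
class WorkQueue {
public:
    void push(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }
    // Owner side: take the oldest task.
    bool pop_own(std::function<void()>& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
    // Thief side: take the newest task, but only if two or more exist.
    bool steal(std::function<void()>& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.size() < 2) return false;
        out = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<std::function<void()>> tasks_;
};
```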

I am all for reinventing the wheel, so understand I strongly disagree with people saying you shouldn't try to implement a lockless queue. It's good experience, much like rebuilding a motor as a mechanic is entirely unnecessary, but really good experience to have. But I wonder if you have actually thought this through. It feels like you're trying to use a CPU-scheduler concept for your design of a task scheduler, and I doubt this specific approach works how you want. If you cannot guarantee that the time from the queue-size check to the adjustment of the queue pointer is less than the time the thread you're stealing from takes to obtain and complete a dequeue, this is not a safe implementation. And you cannot guarantee that on a system where you don't control the CPU scheduler. The OS can bench your stealing thread for as long as it wants, right in the middle of that work.

If this approach did work, you would also have to ensure things work properly if you try to steal at the same time the dispatcher dispatches to the same thread. (It is easily solved, but a solution must be chosen.) That didn't seem to be mentioned in the article you linked to (I only skimmed it, so apologies if it was addressed), and you might never even notice the need exists during your testing, because you probably wouldn't run a bunch of other programs in the background while running a game that used your game engine. But users will! (Most programmers massively undertest their multithreaded code for race conditions, and often don't even know how to test in a way that encourages race conditions to be encountered more often.)

If I am correct that you hadn't considered this, and don't have a solution for it, my personal recommendation is that you should focus on gaining more experience with traditional multithreading first. Try using locks, but minimizing how many you need, and how long you hold them. Learn to use signals to supplement locks, and massively improve what is possible. Learn how to test intentionally faulty code to make it more likely to hit the race condition. Basically, learn the ins and outs of any topic before you try a revolutionary idea in that topic area. And if you get really comfortable with locks, you will always be highly sought after by pretty much any company you would even consider for employment. Especially if you start talking about how to design tests for locks. But don't give up pushing the boundaries, and reinventing wheels. Just make sure you understand how a particular wheel works before you try to reinvent it.

Posted (edited)

To avoid this problem, do a single atomic fetch with each index occupying only half of the atomically-fetched word, then sort them out by shifting and masking

This is an interesting and useful approach I may test later.
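A small sketch of the packed-index trick from the quote, assuming two 32-bit indices packed into one 64-bit word (function names are mine): one atomic load of the word then yields a mutually consistent head/tail pair, recovered by shifting and masking.

```cpp
#include <cassert>
#include <cstdint>

// Pack the head and tail indices of a bounded queue into a single
// 64-bit word, so one atomic load observes both at a consistent
// point in time; unpack by shifting and masking.
inline uint64_t pack_indices(uint32_t head, uint32_t tail) {
    return (static_cast<uint64_t>(head) << 32) | tail;
}
inline uint32_t unpack_head(uint64_t word) {
    return static_cast<uint32_t>(word >> 32);
}
inline uint32_t unpack_tail(uint64_t word) {
    return static_cast<uint32_t>(word & 0xFFFFFFFFu);
}
// Size computed on the consistent snapshot. With monotonically
// increasing indices, unsigned wraparound keeps the subtraction valid.
inline uint32_t queue_size(uint64_t word) {
    return unpack_head(word) - unpack_tail(word);
}
```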

EDIT: It seems I need some coffee or holidays <_<

Edited by Shaarigan

Posted (edited)

Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking there is still room to push one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure, the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.

 

 

And it's perfectly safe because you're the only one who can modify your side.

Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do. Surely it's perfectly safe to take that object from my side, because nobody else can take it. Surely it's no problem that the two hungry workers neither agree with each other nor with me whether there's a job left in the queue.

Thing is, any kind of lockfree queue, and work stealing as described on the Molecular Musings blog in particular, is highly complex (that's an understatement!), and anything like "correct" or "perfectly safe" shouldn't ever be said lightly (we know way too few details, having seen only a small fragment of code). Either way, what I'm trying to say is that the advantage of such a complex system may very well be debated, and the practical gains may quite possibly be marginal, if they exist at all. The risk of a total fuckup with unexplainable (and unreproducible!) failures three weeks after shipping is, however, very real.

I am not doubting that Stefan Reinalter got his implementation to work in ME (his engine has been shown to work, so apparently the job system somehow works, too!), but I'm ready to doubt that the gains warrant the complexity. Possibly (I have no data to back this up, but still, possibly) there are not even any gains compared to a not-stealing lockfree solution, or even a simple queue with a fast userspace mutex. Those don't perform as abysmally as urban myth tells us, you know. They're actually coming quite close while being totally trivial to do.

There are a lot of assumptions that may or may not be true, such as for example using one queue gives a lot of congestion (if that is really a problem, then your jobs are too trivial), or padding a 20-byte sized job object to fill an entire cache line is an ingenious idea (it may not be because the assumption of false sharing is wrong if the data is readonly or write-once, but burning three times as much L1 as necessary may very well put you at a disadvantage). To most of these assumptions, there is no clear-cut "right" or "wrong", it depends a lot on what you're doing with it, too.

There is the assumption that it's an ingenious idea to only post jobs to one's own queue and steal from another if no jobs are on yours. The article indeed says verbatim that the producer only ever posts to its own queue. This means, by consequence, that there are never any normal dequeues by workers. Anything a worker does must ultimately happen by stealing (because nothing is ever pushed onto any worker's queue, except if it pushes a task itself, but then it could just as well process the dependent job right away and avoid the overhead!).

With tongue-in-cheek, you could say that this design has the "best" of two worlds: Congestion at the single queue that actually contains jobs, and overhead for looking at the worker's own queue plus some other workers' queues first. You know, everybody pulling jobs from a single queue was what we wanted to avoid, to start with...

Edited by samoth

Who cares? So you think the queue is shorter than it is. Either you bail thinking it's empty, or you take an element from the side you own and the changes to the other side have zero impact. This is the whole flipping reason we use this approach in lockless designs!

Maybe, or maybe not. Thinking the queue is empty when it is not is sub-optimal, but it is no disaster. Now what about thinking there is still room to push one more element when there isn't? Mind you, this is a bounded queue, so you'll be overwriting the oldest element. Sure, the queue filling up ideally shouldn't happen, but it can happen, and it is something an implementation must be able to deal with in a robust manner.

The scheduler is responsible for managing resources. It's kind of its job. If the scheduler is overflowing the queue, and you don't want that to happen, that's your fault for implementing it that way. Although it is also perfectly valid to accept undetermined results when that happens (I worked on software that accepted that result, because it forced a reboot, and there was already a memory overflow to get there). OP didn't give the scheduler code, so why are you just assuming it is wrong and causing an error? And if code were given, and it were wrong, why not just help?
 
 

Yes, maybe. Or maybe it's perfectly safe unless there is exactly one element in the queue (suddenly "your" side is "theirs" as well, both indices point at the same object) and at that precise moment a worker which is not the main thread (or two of them) has nothing to do.


What are you talking about? The two sides aren't pointing at the same element if the size is one. If the scheduler is adding a second entry, it's not doing it in the same exact position it put the first element into.

Honestly, this is a legit lockless queue for a single producer and a single consumer. Google it. Most results are going to look nearly identical. I personally participated on a team that put this implementation into millions of homes. You're trying to find errors that just aren't there, unless someone implemented it wrong.
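For reference, the classic single-producer/single-consumer ring buffer being described might be sketched like this. It is shown with std::atomic for portability (the thread's OP avoids C++11, but the same shape works with interlocked/__sync wrappers); the key property is that the producer only ever writes `head_` and the consumer only ever writes `tail_`, which is why separate reads of the two indices are safe.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Classic SPSC ring buffer: one slot is kept empty so that
// head == tail unambiguously means "empty". The producer thread
// owns head_, the consumer thread owns tail_.
template <typename T, size_t Capacity>
class SpscQueue {
public:
    bool push(const T& value) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full (one slot deliberately unused)
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);  // publish
        return true;
    }
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = buffer_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    T buffer_[Capacity];
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```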

Thing is, any kind of lockfree queue, but work stealing as described on the Molecular Musings blog in particular is highly complex (that's an understatement!), and any such thing as "correct" or "perfectly safe" shouldn't ever be said lightly (we know way too few details, having seen only a small fragment of code).


I was pretty explicit about not talking about the work stealing. I was pretty explicit that I doubt it is even useful. (It also seems worth noting that the linked article said "The implementation of the work-stealing queue uses locks.") OP hadn't implemented work stealing yet, just the normal lockless queue. And you asserted the existing code was making assumptions it was not making.

I want people to understand how these things work, so I felt it necessary to correct the misperception that a lockless queue doing separate reads of top and bottom is somehow not allowed. It is absolutely allowed, if done properly. There was nothing in the code provided to suggest it was not done properly, although we did not get all code.

Either way, what I'm trying to say is that the advantage of such a complex system may very well be debated, and the practical gains may quite possibly be marginal, if existent. The risk of getting a total fuckup with unexplainable (and unreproducable!) failures three weeks after shipping is, however, very real.


Then say that. Say what you mean. There is risk in every piece of code, and developers must choose what risks are worth taking. And it is completely okay if your project does not feel a specific risk is worth taking, just as it is completely okay if a different project decides the same risk is worth taking for their goals.

Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each individual task/job isn't parallelizable, so why move it from one thread to another?


Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each individual task/job isn't parallelizable, so why move it from one thread to another?


In this case, each thread has its own job queue. Work stealing seems to be a proposal to regain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.

 

Can someone explain to me what the point of 'work stealing' is? Doesn't each thread in a thread pool just pick the next task/job in order from a shared queue? Each individual task/job isn't parallelizable, so why move it from one thread to another?


In this case, each thread has its own job queue. Work stealing seems to be a proposal to regain some of the benefits of a shared queue. I have no idea what proposed scenarios benefit from this approach over more traditional ones.

 

It seems you'd need very intimate knowledge of the OS scheduler to gain any performance. I'm all for esoteric solutions, and for trying things simply for the sake of learning, but this seems a little extreme even for my tastes.


Posted (edited)

The thing is, and I apologize for repeating myself over and over again, the problem that work stealing is trying to solve is non-existent.

No need to apologize, I think you hit the nail on the head, and it deserves repeating.

If queue contention is an issue, then instead of trying to fix the contention directly, first try to fix what causes the contention. Give your workers beefier tasks, so they aren't spending a disproportionate amount of time getting work when they should be spending that time actually working. Then you can afford the overhead of a simple locking queue for the 0.1% of the time threads spend getting work, and for the even smaller amount of time they might spend actually contending with each other.

If for whatever reason (in this case, for the sake of argument) you're stuck with lots of piddly tasks and can't do much about it, then I also agree that the approach presented in the article makes little sense. Per-worker queues certainly help with consumer contention, but mitigating producer contention is the job of the task scheduler. The purpose of work stealing is then to deal with the sub-optimal load balancing that can result from the scheduler focusing on contention.

However, the article seems to miss the mark here. It ignores the fact that each worker can create contention on its own queue by pushing tasks that other threads constantly have to steal (and there's no mention of handling overflow in these cases). Even a naive round-robin strategy, where each worker queue gets every i-th task, could spread out that pressure some. It then focuses on work stealing as a solution to said contention (which it isn't). I have no doubt that the author's system works, because there isn't anything technically wrong with it. But it feels like a misunderstanding of the problem, combined with a misapplication of solutions, leading to an approach that isn't much better (if at all) than the original it was trying to improve upon.
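The "simple locking queue" baseline mentioned above might look like the sketch below (a minimal illustration with names of my own choosing, not a production design): one shared queue, a mutex held only for the few instructions of push/pop, and a condition variable so idle workers sleep instead of spinning.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Baseline shared task queue: if tasks are beefy, the tiny critical
// sections here are negligible next to the work itself.
class LockingTaskQueue {
public:
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();  // wake one sleeping worker
    }
    // Blocks until a task is available; returns false once stopped
    // and drained, so workers know to exit.
    bool pop(std::function<void()>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop();
        return true;
    }
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        cv_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopped_ = false;
};
```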

Edited by Zipster

I just want to highlight that being lock-free has nothing to do with being contention-free; indeed, several lock-free algorithms will not perform well when heavy contention is a problem. What you want for that are wait-free algorithms/data structures, and those are generally only helpful if you have other work you can do before checking back to see whether the contention has been resolved.
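To illustrate the distinction with a minimal example of my own (not from the thread): a compare-and-swap retry loop is lock-free (some thread always makes progress, but a given thread can retry indefinitely under contention), while a single fetch-and-add is wait-free (every thread completes in a bounded number of steps).

```cpp
#include <atomic>
#include <cassert>

// Lock-free increment via a CAS loop: system-wide progress is
// guaranteed, but this particular thread may retry repeatedly if
// other threads keep winning the race.
inline int increment_lockfree(std::atomic<int>& counter) {
    int old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1))
        ;  // on failure, 'old' is refreshed with the current value
    return old + 1;
}

// Wait-free increment: a single fetch_add completes in a bounded
// number of steps regardless of what other threads are doing.
inline int increment_waitfree(std::atomic<int>& counter) {
    return counter.fetch_add(1) + 1;
}
```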

