Job Manager Lock-Free


In mine at the moment, if I fail to pop a job, I try to decrement a semaphore (which will block if it hasn't been incremented). Every time I push a job, I increment the semaphore.
When the queue becomes empty, yes, it does spin for a while until the semaphore reaches zero, then sleeps until the next push.

There's probably a better way to implement that using events/condition-variables...
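A minimal sketch of that scheme, assuming C++20's std::counting_semaphore and a hypothetical thread-safe JobQueue with a non-blocking try_pop:

#include <semaphore>

std::counting_semaphore<> g_jobsAvailable{0}; // counts pushed-but-unclaimed jobs
JobQueue g_queue;                             // hypothetical thread-safe queue

void pushJob(Job j)
{
    g_queue.push(std::move(j));
    g_jobsAvailable.release();     // one increment per push
}

Job popJob()
{
    g_jobsAvailable.acquire();     // sleeps in the OS while the count is zero
    Job j;
    while(!g_queue.try_pop(j)) {}  // a successful acquire means one job is ours
    return j;
}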

This. You can even do it with your own "semaphore like" thingie (I wouldn't know what to call it... event count maybe?) which is basically two integers, each for produce and consume events. You can take the counter modulo an array's size as the index. The advantage being that it's easier to implement than a "typical" lockfree queue and it can work without compare-exchange (on the producer end, in any case).
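A single-producer/single-consumer sketch of that two-counter idea (all names mine; the capacity must be a power of two so the counters can wrap safely):

#include <atomic>
#include <cstddef>

template<typename T, size_t N>   // N must be a power of two
struct EventCountQueue
{
    T buffer[N];
    std::atomic<size_t> produced{0};   // total produce events
    std::atomic<size_t> consumed{0};   // total consume events

    bool try_push(T const& v)          // producer side: no compare-exchange
    {
        size_t p = produced.load(std::memory_order_relaxed);
        if(p - consumed.load(std::memory_order_acquire) == N) return false; // full
        buffer[p % N] = v;
        produced.store(p + 1, std::memory_order_release);  // publish
        return true;
    }

    bool try_pop(T& out)
    {
        size_t c = consumed.load(std::memory_order_relaxed);
        if(produced.load(std::memory_order_acquire) == c) return false;     // empty
        out = buffer[c % N];
        consumed.store(c + 1, std::memory_order_release);  // free the slot
        return true;
    }
};

With multiple consumers you would need compare-exchange on the consumer end, which matches the caveat above.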

On the other hand, I've seen zero difference in practice between all that awesome lockfree stuff and a simple queue with a lock or a semaphore (even abusing a Windows I/O completion port as a queue yields surprisingly good results with practically zero implementation work, done that). In retrospect, I would personally never again waste the time it takes to implement a lockfree task queue (it's much harder to get error-free than it looks at first glance, and it's doubtful whether you really gain anything).

Yes, lockfree stuff matters if you push 250,000 tasks per frame. But well, you don't do that. You do want to partition work, of course, but you also want cache coherency and you want to avoid bus contention or context switches, so you won't push too small work items. Which means you won't push too many.

If you push 5,000 or so tasks per frame, the measurable gains are exactly zero (at least that's what it looks like to me... please correct me if you have measured something different), and the overhead of a typical fast userspace lock (with kernel backoff) is very much acceptable. The implementation, on the other hand, is soooooooooo much easier and less error prone. Yes, it's not cool, but it works.

I make a queue for each thread except the main one, and have the process that breaks up the job send the parts in rotation through the queues, then perform the last part itself. This way there is no lock and no wait, and effectively speaking there is no waiting at the end of the job either. As an example, using this method I parallelized Quicksort in such a way that its expected runtime is O(n), and it runs much faster in benchmarking than any code I can get hold of. It's actually twice as fast as most O(n) sorts because there is no preprocessing phase.

For multithreading in general, you only get out what you put into it, though. The real work is in breaking up the algorithm and arranging the data in such a way that it can be run correctly in separate threads. I have not really used it in my game yet, but it's very useful for other applications.
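In rough sketch form (all names assumed), the distribution step looks something like:

#include <vector>
#include <utility>

void scatter(std::vector<Job>& parts, std::vector<JobQueue>& workerQueues)
{
    if(parts.empty()) return;

    // Only this thread pushes, and each worker pops only its own queue, so
    // every queue has a single producer and a single consumer: no lock needed.
    size_t n = workerQueues.size();
    for(size_t i = 0; i + 1 < parts.size(); ++i)
        workerQueues[i % n].push(std::move(parts[i]));

    parts.back().run();   // perform the last part on this thread; no waiting
}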

This is my thread. There are many threads like it, but this one is mine.

You can only safely go lock-free on structures that hold plain primitive integers. What are you going to do with that? You need to push function objects, so unless you work out a complex scheme where jobs live at permanent, pooled memory positions and your lock-free structures manipulate only indices into that pool, you have no way out.
Read this document:
http://erdani.com/publications/cuj-2004-10.pdf

Once you build a lock-free queue out of 32-bit primitives (or smarter: use someone else's), you can push any kind of object through it.
The document that you linked to uses this fact to push an entire std::map from one thread to all others!
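The usual trick, sketched here with assumed names (any third-party lock-free queue of plain integers will do; slot allocation/recycling is omitted): keep the payload in a pre-allocated pool and pass only 32-bit indices through the queue.

#include <cstdint>
#include <functional>

constexpr uint32_t kPoolSize = 1024;
std::function<void()> g_jobPool[kPoolSize];   // payload lives here, never moves
LockFreeQueue<uint32_t> g_ready;              // hypothetical queue of plain ints

void pushJob(uint32_t slot, std::function<void()> fn)
{
    g_jobPool[slot] = std::move(fn);   // write the payload first...
    g_ready.push(slot);                // ...then publish the index; the queue's
                                       // internal release ordering makes it visible
}

void workerStep()
{
    uint32_t slot;
    if(g_ready.try_pop(slot))
        g_jobPool[slot]();             // safe: the pop's acquire pairs with the push
}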


Also, without a condition variable, lock-free also means wait-free, and thus you're going to need to embark on some crazy scheme where you recycle your job-executor thread into something that can do useful work during the moments there are no jobs; otherwise you need to spin and generate heat for nothing.

That's not what wait-free means. A wait-free queue would be a really good thing!

Why try going lock-free when nothing says you would gain any performance over using a mutex+CV?

The point of lock-free isn't necessarily to go faster (although micro-optimizing your thread synchronization by hand-writing atomic code usually is an attempt to go faster).
The point of lock-free algorithms is that they give you the lock-free guarantee:
¤ no guarantee (lock-based code) means that if thread A calls Push, but the OS steals its timeslice while it's inside the function, then the queue's mutex will remain locked indefinitely (until the OS arbitrarily decides to resume thread A). Meanwhile, threads B, C and D all block inside the Pop function, waiting for the mutex to be released.
¤ the lock-free guarantee promises that this can't happen. No matter how horrible the OS is, there's always *at least one* thread that can make progress. If lots of threads are hammering a CAS loop, then exactly one will make it out of the loop each time (while the rest fail the CAS and retry). Now if the OS freezes A during a Push, B, C and D can still continue to Push/Pop just fine.
¤ the wait-free guarantee promises that no thread can ever delay another thread - all threads always make progress through the algorithm. This is obviously the best category to be in, but often requires a lot of planning, scheduling control, and wasted RAM.
e.g. if you were OK with only being able to Push jobs to be executed next frame (and there was a mechanism in place ensuring that every thread is processing the same frame -- i.e. at least one real sync point per frame), you could allocate one queue per producer/consumer thread pair per frame, and then the completely unsynchronized (lockless) wait-free Push could look something like:
queues[frame+1][thisThreadIdx][pushToThreadIdx].push_back(item);

Embarrassingly parallel algorithms tend to be wait-free - e.g. splitting the screen into a grid when ray-tracing.

Even though wait-free is a stronger guarantee than lock-free, usually the code actually does end up being much faster than lock-based code (e.g. imagine per-pixel mutexes in a ray-tracer!).

On the other hand, lock-free code usually isn't that much faster than lock-based (at least on modern OSes), but it does give a much better worst-case behaviour.
...However, yep, in this particular case, we want threads to block if the queue is empty, so locks are actually helpful :D

It's really not rocket science to build a lock-free job pool though - if you use an existing MPMC FIFO (aka lock-free queue) library at the core. Doing so gives you more stable behaviour in the cases where the queue is non-empty, and then you can fall back onto locks/CVs/etc as soon as Pop fails, to be a good citizen to the OS and not waste cycles on busy-waits.
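In sketch form, assuming a hypothetical lock-free MPMC g_queue with try_pop, plus a mutex/CV pair that is only touched on the empty path:

#include <mutex>
#include <condition_variable>

std::mutex g_mutex;
std::condition_variable g_cv;

void pushJob(Job j)
{
    g_queue.push(std::move(j));                   // lock-free, the common case
    { std::lock_guard<std::mutex> lk(g_mutex); }  // order the push against sleepers
    g_cv.notify_one();
}

void workerLoop()
{
    for(;;)
    {
        Job job;
        if(g_queue.try_pop(job)) { job.run(); continue; }    // lock-free fast path

        std::unique_lock<std::mutex> lk(g_mutex);            // queue looked empty
        g_cv.wait(lk, [&]{ return g_queue.try_pop(job); });  // re-check, then sleep
        lk.unlock();
        job.run();
    }
}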
The point of lock-free is [avoid nasty scheduler stuff]

All of the above is correct, but luckily the evil case doesn't happen very often.

The thing with locks is, if you are not doing things horribly wrong (like, take a lock and update 15,000 database rows, then sync to disk, and finally release the lock), your threads spend 99.95% of their time doing work and 0.05% of their time looking at the lock. Which means that the lock is almost always uncontended, and acquiring/releasing the lock stays in userland and is a matter of 30-35 cycles. Not on the average case, but in practically every case. Yes, it can happen that a thread's timeslice ends while it holds the lock (and it will happen), but this is unlikely, and thus a rare thing.

The reason why the evil conditions don't happen often follows from the fact that you don't want to make work items too small. Passing data between threads takes time, context switches take time, hammering the bus with CAS is a bottleneck, linear access patterns are more cache-friendly, memory bandwidth is finite, blah blah, blah.

Therefore, in summary, one never lets a thread do one single computation, but at least some ten thousand (better, a hundred thousand) cycles' worth of computation, going over at least a kilobyte or so of data (better, some dozen kilobytes).

The consequence is, of course, that the thread automatically spends 99.95% of its time inside calculations and thus outside the lock ( / semaphore / critical section).

Yes, there is a chance of doing it wrong, and people are doing it wrong, and they have in particular done so in the past (when there was less awareness of the issues). But in the example case of holding a lock while waiting on file I/O or such, lockfree structures won't help either; they will just burn more CPU with spinning. They really only help substantially for the (luckily!) rarely occurring steal-timeslice-while-lock-held event.

Well, you don't really know how the OS will implement things. So say you make some library that runs on a couple OSes and many versions of it. What is your 'lock'?

The point is, the answer can turn out to be something you don't like and the behavior unexpected, like your OS running out of locks and going crazy in spinlock. And you won't really be able to figure out where your time is going, it won't really be measurable in any obvious way. Yet there's no deadlock either.

There's also the issue of locks that are held for a long time - that's the standard case that is a big problem.

For the engine I was using, there was apparently a lock for the scene graph (which is pretty screwy to me, but it is what it is). If I performed collision detection outside the main thread, it would crash unless I took the lock. But if I locked it, it would hang there during the whole frame-draw stage and bring everything to a grinding halt.

This is my thread. There are many threads like it, but this one is mine.

Well, you don't really know how the OS will implement things. So say you make some library that runs on a couple OSes and many versions of it. What is your 'lock'?

Typically, a lock looks somewhat similar to this:


std::atomic_flag flag = ATOMIC_FLAG_INIT;

bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }

void lock()
{
    for(;;)
    {
        // Spin in userspace first; the uncontended case never enters the kernel.
        for(int i = 0; i < SOME_NUMBER1; ++i)
        {
            for(int j = 0; j < SOME_NUMBER2; ++j) { if(try_lock()) return; _mm_pause(); }
            std::this_thread::yield();
        }

        // Spinning failed, so block in the kernel until woken.
        int result;
        #if defined(__linux__)
            result = futex_wait(...);
        #elif defined(_WIN32)
            result = NtWaitForKeyedEvent(...);
        #else
            result = pthread_cond_wait(...);   // generic POSIX fallback
        #endif

        if(result == SUCCESS) return;
    }
}

(This is of course only the simplest possible way, not recursive and no owner, doesn't care about fairness and PI -- much more complicated implementations exist.)

You know quite well what will happen, and when. In particular, you know that this:

like your OS running out of locks and going crazy in spinlock

is something that cannot possibly happen. Why? Well, because you don't allocate kernel objects (such as "locks") at the time you're trying to lock them.

This is a resource that you allocate at, or soon after, program start.

Yes, it is possible that this fails, anything can -- in principle -- fail. It is possible for the operating system to be unable to create a thread or event/semaphore/mutex/whatever kernel object for whatever reason (out of memory, out of handles, ...). In this case your entire system is seriously fucked up and your program failed to initialize properly and cannot run. However, it will not mysteriously fail while you're trying to acquire the lock later.


Once you build a lock-free queue out of 32-bit primitives (or smarter: use someone else's), you can push any kind of object through it.
The document that you linked to uses this fact to push an entire std::map from one thread to all others!

Then I surely need to read it again. In the meantime, would you care to elaborate on this point? Because it seems it would help the OP in his situation.


That's not what wait-free means. A wait-free queue would be a really good thing!

Ah yes, I conflated two concepts a bit hastily. However, I don't see how a wait-free queue would be a good thing here. I really think an "executor" thread that takes jobs from a queue, and is designed to possibly be idle for long periods of time, SHOULD definitely wait, in the sense of being put fully to sleep by OS de-scheduling. Otherwise, what are we talking about? A wait-free queue means that you can do some operation looking like "tryQuery" or "tryPop" or something, and you're sure it will return without ever waiting. Am I right?

Then what do you do?

You go do other stuff, or you loop on the tryPop?

The latter would be very bad in my book if, by design, this thread is supposed to possibly wait for extended periods.

I think it is really nice to have a "pop_or_wait_until_something_comes" method on the queue, which wakes the sleeping consumer with a condition variable. This keeps power consumption low for the app/game. Nowadays heat sinks are often badly designed, so using the CPU less means you are likely to have more power available when it matters (if you keep burning CPU on polling, the chip gets hot, then throttles down, and when you really need the power you get a slug instead).
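For reference, such a pop_or_wait is only a handful of lines with a mutex and condition variable (names assumed):

#include <condition_variable>
#include <deque>
#include <mutex>

template<typename T>
class BlockingQueue
{
    std::deque<T> items;
    std::mutex m;
    std::condition_variable cv;

public:
    void push(T v)
    {
        { std::lock_guard<std::mutex> lk(m); items.push_back(std::move(v)); }
        cv.notify_one();                // wake one sleeping consumer, if any
    }

    T pop_or_wait()
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !items.empty(); });  // ~0% CPU while idle
        T v = std::move(items.front());
        items.pop_front();
        return v;
    }
};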

Alright, those were my concerns. Based on this, I would like to hear, constructively, how the OP could meet these criteria while still using a wait-free queue.


[all the rest]

Well, actually you already answered it somewhat in the last part. If I understand correctly, you're basically saying that when the tryPop fails, it is OK to lock and wait on the condition variable. Sounds like two-step locking. Hm...

Well, you don't really know how the OS will implement things. So say you make some library that runs on a couple OSes and many versions of it. What is your 'lock'?

Typically, a lock looks somewhat similar to this: [same code and explanation as above] ... You know quite well what will happen, and when. ... It will not mysteriously fail while you're trying to acquire the lock later.

Calling some API is not knowing exactly what happens; that is the opposite. You could do it yourself without any API calls, but then you are tied to writing additional solutions for specific processors.

The problem I mentioned happens all the time if you do things exactly as you specify, especially when you are targeting many platforms. That is one of the main reasons why there are attempts to make lock-free code, books on the subject, etc.

This is my thread. There are many threads like it, but this one is mine.

Calling some API is not knowing exactly what happens, that is the opposite.

Well, no, calling an API means making a contract. So, you are in some way right that you don't know exactly what happens. But you do know exactly what the contract is, and I am not aware of any operating system that doesn't fulfill its contracts.

Of course, you have to exactly know your contracts. For example, you cannot use NtReleaseKeyedEvent in the same way as you use sys_futex(FUTEX_WAIT,...).

Why?

Well, because sys_futex(FUTEX_WAIT) checks the value prior to blocking, so no wakeups can be missed in case the condition changes concurrently, whereas NtWaitForKeyedEvent does not check the value. So you might think it would be possible to miss a wakeup event and be locked out forever. In order to prevent that from happening, NtReleaseKeyedEvent blocks until another thread waits on the same key (or, if you specified a timeout, until it times out).
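For illustration, the raw Linux side looks like this (glibc provides no futex wrapper, so it goes through syscall):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// FUTEX_WAIT only blocks if *addr still equals 'expected' at the time of the
// call; that in-kernel check is what makes missed wakeups impossible.
int futex_wait(int* addr, int expected)
{
    return (int)syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

int futex_wake_one(int* addr)
{
    return (int)syscall(SYS_futex, addr, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}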

But if you respect these little implementation-specific details (and with a bit of discipline you automatically do -- you don't call a wake function when you know that there are no waiters; that's a useless syscall, you just don't do that), I wouldn't know how any weird surprises could happen. The point is, you need some notion of "block me, and let me wake up later" in order to avoid burning CPU forever. This is not achievable without a syscall of sorts.

Blocking isn't necessarily evil, by the way, it can very well be a performance gain. Imagine two or three threads contending over a spinlock on a single-core machine (the example works for any N+1 threads on any N-core machine). What happens when the first thread grabs the lock and is scheduled out? The other two will start spinning and hope for the value to change. Which, of course, can never possibly happen because the first thread isn't running! Now, if they just decide to give up and block after spinning a dozen times, the first thread will be scheduled again and will run, and thus the system will make progress.

The whole thing about blocking (or yielding) is to make the situation less bad when you're actually already doing things wrong. If you do things right (that is, if your lock is not contended), acquiring the lock is a simple atomic test-and-set operation, and releasing it is writing out a zero value (some compilers, e.g. GCC, and some CPUs have dedicated release intrinsics and instructions for that).

Once you've tried to test-and-set some memory location for some time (the same is true for CAS) and it didn't work, what is the best thing you can do? Keep on trying? Or try again some time later? Since you have not been successful the last hundred or so times you tried, it seems likely that a lot of other threads are trying the exact same thing right now. Which is not good, it's not right. But it's what is happening, and you trying again contributes to making the problem worse.

So... yielding might not be the worst plan!

Do you have a guarantee that you will be re-scheduled within some milliseconds? No, of course not. But it's still the best thing you can do in the situation.

In mine at the moment, if I fail to pop a job, I try to decrement a semaphore (which will block if it hasn't been incremented). Every time I push a job, I increment the semaphore.
When the queue becomes empty, yes, it does spin for a while until the semaphore reaches zero, then sleeps until the next push.

There's probably a better way to implement that using events/condition-variables...

That is exactly what I do as well. I've found that you can tune the semaphore's max count to a certain extent to mitigate the "wind down" period where the threads spin on an empty queue, but depending on how and when jobs are pushed to the queue, you can run into situations where threads aren't awakened when there's work for them to do. So for now, I just set the max count equal to the number of worker threads.

There is actually a much easier way to look at this than has been suggested so far, though it is along the lines of Hodgman's solution. Basically, just never let the queue fail to return something; instead, when it is going to fail, you return a default task which contains the 'wait' code. In that task you can perform the same solution Hodgman suggests, or be a little more clever and use a condition variable to prevent the excess spins.

Using this form has a number of benefits, since you are effectively setting your system up somewhat like a VM. For instance, I adjust the number of threads in my system dynamically, so if a single core can keep everything happy, I have a couple of the threads stuck in wait tasks. For mobile, this allows the CPU to power down cores not in use and extend battery life considerably.

Having them as simple tasks means that the complexity of control is removed from the tight execution loop, and of course waking them back up is as simple as releasing a semaphore or condition variable. I have found this approach to have a great number of benefits once you get your thinking around to it, mostly in that it remains lockless but with the ability to play nice with the OS as desired.
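A sketch of that "never fail" pop, with all names assumed (the pushing side must notify g_idleCv after pushing, and g_queue.empty() stands in for whatever emptiness hint your queue provides):

Job popAlways()
{
    Job job;
    if(g_queue.try_pop(job))
        return job;

    return Job{[]{   // default task: its entire job is to wait politely
        std::unique_lock<std::mutex> lk(g_idleMutex);
        g_idleCv.wait(lk, []{ return !g_queue.empty() || g_quit; });
    }};
}

void workerLoop()
{
    while(!g_quit)
        popAlways().run();   // the loop itself never has to branch on 'empty'
}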

This is an interesting way to approach the problem, but since returning a default task when a queue is empty isn't typical queue behavior, you'd have to roll your own solution instead of being able to drop in a ready-made container like tbb::concurrent_queue. At that point it's a toss-up between writing wait code in the thread loop à la Hodgman's solution, or writing default-task handling code when popping the queue.

