How waiting on a condition variable works fundamentally


However, if I understand your reply correctly, the primitives for a condition variable do need kernel-mode code, right? There is no way to avoid it?

The CPU instruction that puts a core to sleep is privileged; therefore, only the kernel can put you to sleep. Note, though, that the kernel may decide not to put the core to sleep and do something else instead (like switch to a different thread or process). Only the kernel can do that, too.

Otherwise, malware could wreak havoc on your machine. A virus could freeze your system forever (until you hit the reset button), or prevent a high-priority process from ever being scheduled. In short, lots of bad things.

Implementations do their best to stay out of the kernel as much as possible, but if putting the thread to sleep is unavoidable, the kernel must get involved. (See the Lightweight Mutex example, which shows how atomics can be used around a mutex to avoid entering kernel code as long as threads don't contend for the mutex at the same time.)
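A minimal sketch of that idea: a "benaphore"-style lock where an atomic counter keeps the uncontended path entirely in user mode, and a (kernel-backed) semaphore is touched only when threads actually collide. This assumes C++20 for std::binary_semaphore and is an illustration of the pattern, not the code from that article:

```cpp
#include <atomic>
#include <semaphore>  // C++20

class LightweightMutex {
    std::atomic<int> contention{0};
    std::binary_semaphore sem{0};  // used only when threads collide

public:
    void lock() {
        // fetch_add returns the previous value: 0 means we took the
        // lock without ever leaving user mode.
        if (contention.fetch_add(1, std::memory_order_acquire) > 0)
            sem.acquire();  // contended: block, possibly in the kernel
    }

    void unlock() {
        // If anyone queued up behind us, wake exactly one of them.
        if (contention.fetch_sub(1, std::memory_order_release) > 1)
            sem.release();
    }
};
```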

Even if it is kernel mode, how does the OS achieve it? Does the scheduler maintain something like a map<condition_variable, list<threadContext*>>, and once a condition variable gets modified, the OS finds its corresponding thread-context list and makes context switches accordingly?

You're not too far off, and it probably won't make the context switch right away, but rather schedule the woken-up thread for execution. It may well happen that the thread often gets executed immediately afterwards, though.
The details vary a lot per OS, and the code is quite rudimentary, because kernels are often written in C and operate under a lot of restrictions (e.g. the Linux kernel does not like FPU code, allocations are tricky, and kernel code size must be kept to a minimum, since otherwise the chance of a bug gets higher, and a bug in kernel code can be disastrous).
If you are interested in more detail, go study the Linux kernel's and FreeBSD's pthread_cond_init implementations. They're open source, after all.
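As a purely illustrative, user-space toy (hypothetical types; real kernels use intrusive lists, per-CPU run queues, and locks), the bookkeeping could look roughly like this:

```cpp
#include <list>
#include <unordered_map>

struct ThreadContext;  // saved registers, stack, state, ...

struct ToyScheduler {
    // Each condition variable (identified here by its address) keeps a
    // queue of the threads currently blocked on it.
    std::unordered_map<const void*, std::list<ThreadContext*>> waiters;
    std::list<ThreadContext*> runQueue;

    void wait(const void* cv, ThreadContext* t) {
        waiters[cv].push_back(t);  // park the thread; it stops running
    }

    void notifyOne(const void* cv) {
        auto& q = waiters[cv];
        if (!q.empty()) {
            // No immediate context switch: just mark the thread
            // runnable and let the scheduler pick it up later.
            runQueue.push_back(q.front());
            q.pop_front();
        }
    }
};
```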

In my mind, there must be a "bool flag" this thread is periodically checking; otherwise, how does this "sleeping" thread know it's time to wake up? So if it is "periodically" evaluating a memory address (or CPU register), does that mean using a condition variable is just another "sleep(time)" call under the hood, with a much finer-grained delta time?

The popular OSes we know are interrupt-driven. That means the OS schedules itself on the CPU: "wake me up in 16 ms". The interrupt causes the CPU to execute kernel code at the scheduled point in time. When woken up, the OS goes through the list of threads it needs to run and executes them. If there's nothing to run, it puts the core back to sleep (not without rescheduling itself before sleeping).
On Windows, how frequently the OS reschedules itself is controlled (more like hinted) by timeBeginPeriod (it's a dreaded system-wide value).
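For example, a latency-sensitive section might temporarily request a finer period. This is a sketch using the real winmm API (timeBeginPeriod/timeEndPeriod); latencySensitiveWork() is a hypothetical placeholder:

```cpp
// Windows-only; link against winmm.lib.
#include <windows.h>
#include <timeapi.h>

void latencySensitiveWork();  // hypothetical placeholder

void run() {
    // Hint the system-wide timer interrupt period down to ~1 ms. This
    // affects every process (hence "dreaded") and costs battery life.
    timeBeginPeriod(1);
    latencySensitiveWork();   // Sleep(1) now sleeps close to 1 ms
    timeEndPeriod(1);         // always pair with the same value
}
```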

On Linux, it's defined at compile time.

Both Linux and Windows now support a tickless kernel (which means the OS won't reschedule itself). But beware: that means all but one core are tickless. There must be at least one core rescheduling itself periodically.

Technically, if this were DOS, it could be fully tickless if the OS waited for an external interrupt, like keyboard input. But modern systems are hardly this simple (e.g. just setting an alarm to ring at 8 pm means the OS can no longer only wait on external interrupts: it has to reschedule itself at periodic intervals to check whether the current time has passed the alarm time).

How do interrupts work? Well, there's a crystal oscillator vibrating at a specific frequency that sends signals to the CPU at regular intervals... OK, enough. If I keep talking we'll end up discussing atoms and open questions in science. You have more than enough now to keep digging on your own.


This is a great example of why a Computer Science education is a good idea, even for a game programmer. It's been a while for me, but we had two very relevant classes.

One was something like "Concurrent Programming", where we learned all about synchronization primitives and their theoretical underpinnings. Perhaps most importantly, all the things that can go wrong: conflicts, deadlocks, race conditions, and the like.

The second was "Operating Systems", where we learned all about process and thread management and scheduling. In particular, as I believe is common, we implemented parts of a toy OS (I think it was NachOS, in my case). This totally removed the mystery of how these "magic" synchronization primitives work. Threads and processes are just "things" that can be manipulated, same as any other part of a machine.

So, get your degree, kids! It will be worth it in the end.

The one reason I always try to avoid using a mutex is that it causes a user/kernel mode switch, which is expensive. I think that's why, in some cases, spinlock acquire/release can be much, much faster than using a mutex.

A context switch between processes is expensive (mainly because each process has its own memory space that must be saved/restored). A context switch between threads within the same process should be fairly efficient.

If your producer and consumer threads are all in the same process, then you should be perfectly fine using condition variables and letting the OS handle it.
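For instance, a minimal single-process producer/consumer sketch with std::condition_variable (error handling and shutdown omitted):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<int> jobs;

void produce(int job) {
    {
        std::lock_guard<std::mutex> lock(m);
        jobs.push(job);
    }
    cv.notify_one();  // wake one sleeping consumer, if any
}

int consume() {
    std::unique_lock<std::mutex> lock(m);
    // wait() atomically releases the lock and blocks; the predicate
    // guards against spurious wakeups. No busy-waiting while empty.
    cv.wait(lock, [] { return !jobs.empty(); });
    int job = jobs.front();
    jobs.pop();
    return job;
}
```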

So when I wanted to implement a thread pool myself, I was thinking of using a spinlock to protect the job queue. But I soon realized that if my job queue is empty, I will saturate the CPU. Then I found that a condition variable avoids this busy-waiting pretty well; however, how the CV achieves it is a little disappointing to me: first, it requires a lock (I guess I can use a spinlock there too, right?)... and now it seems its thread-wakeup mechanism involves another kernel call... I was thinking of using a hybrid spinlock to get both fast lock acquire/release and no busy-waiting, by calling sleep after around 4000 spins on the spinlock; however, my threads may not be super responsive (since an incoming job does not wake up a sleeping thread; instead, the job waits for the thread to wake up on its own...).
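Something like this spin-then-sleep lock is what I had in mind (just a sketch; the 4000-spin threshold and the 100 µs back-off are arbitrary):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic_flag lockFlag = ATOMIC_FLAG_INIT;

void hybridLock() {
    int spins = 0;
    while (lockFlag.test_and_set(std::memory_order_acquire)) {
        if (++spins >= 4000) {  // give up spinning...
            std::this_thread::sleep_for(std::chrono::microseconds(100));
            spins = 0;          // ...and back off before retrying
        }
    }
}

void hybridUnlock() {
    lockFlag.clear(std::memory_order_release);
}
```
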
I think you're over-engineering. If you spend most of the time in a mutex, you are either using them too much (very fine-grained access to shared data is a bad idea), or your work-jobs are not big enough.

The other thing you may want to watch out for is premature optimization: making design decisions to avoid problems that do not yet exist. That is a highly dangerous game, where expensive mistakes are often made. The simple reason is that we humans cannot predict where the processing bottleneck is going to end up. I play the game of guessing where it is just before I profile, and in the past 30 years I have been consistently wrong. In other words, there is a big chance your assumptions are incorrect.

Build the thing first, then profile and fix if necessary.

The popular OSes we know are interrupt-driven. That means the OS schedules itself on the CPU: "wake me up in 16 ms". The interrupt causes the CPU to execute kernel code at the scheduled point in time. When woken up, the OS goes through the list of threads it needs to run and executes them. If there's nothing to run, it puts the core back to sleep (not without rescheduling itself before sleeping). On Windows, how frequently the OS reschedules itself is controlled (more like hinted) by timeBeginPeriod (it's a dreaded system-wide value). On Linux, it's defined at compile time. Both Linux and Windows now support a tickless kernel (which means the OS won't reschedule itself). But beware: that means all but one core are tickless. There must be at least one core rescheduling itself periodically.

Thanks Matias for such a detailed reply :)

So you mean there isn't a truly "immediate wakeup" even if you use a condition variable to wake up a thread, rather than waiting on a timer by calling "Sleep(ms)". Using a condition variable, then, seems like just asking the OS scheduler to do this "periodic checking", while calling "Sleep()" is doing the checking yourself? If that is the case, I would assume waiting on a condition variable won't be any better than calling Sleep with the right parameter in terms of responsiveness. (Please correct me if I am wrong.)

Thanks


This is a great example of why a Computer Science education is a good idea, even for a game programmer. It's been a while for me, but we had two very relevant classes.

One was something like "Concurrent Programming", where we learned all about synchronization primitives and their theoretical underpinnings. Perhaps most importantly, all the things that can go wrong: conflicts, deadlocks, race conditions, and the like.

The second was "Operating Systems", where we learned all about process and thread management and scheduling. In particular, as I believe is common, we implemented parts of a toy OS (I think it was NachOS, in my case). This totally removed the mystery of how these "magic" synchronization primitives work. Threads and processes are just "things" that can be manipulated, same as any other part of a machine.

So, get your degree, kids! It will be worth it in the end.

Thanks for pointing out the right courses for me~ (I am currently a graduate student in CS, but the courses I've taken, "Parallel Programming" and "Operating Systems", haven't covered such low-level detail.) Any idea where I can find a good online course for this? :) Thanks


A context switch between processes is expensive (mainly because each process has its own memory space that must be saved/restored). A context switch between threads within the same process should be fairly efficient. If your producer and consumer threads are all in the same process, then you should be perfectly fine using condition variables and letting the OS handle it.

But this post suggests that a context switch between threads is also very costly. (Please correct me if I missed something important.)

Thanks

I think you're over-engineering. If you spend most of the time in a mutex, you are either using them too much (very fine-grained access to shared data is a bad idea), or your work-jobs are not big enough.

I think I won't spend too much time inside the mutex. The concern is that merely "touching" the mutex is very costly: think of a busy-waiting thread pool implementation with tiny jobs, or with the job queue empty; the job queue gets accessed very frequently, so with a mutex the overhead seems scary...

The other thing you may want to watch out for is premature optimization: making design decisions to avoid problems that do not yet exist. That is a highly dangerous game, where expensive mistakes are often made. The simple reason is that we humans cannot predict where the processing bottleneck is going to end up. I play the game of guessing where it is just before I profile, and in the past 30 years I have been consistently wrong. In other words, there is a big chance your assumptions are incorrect.

That's definitely true for me most of the time. But my advantage is that I am still in school, so no industry deadline pushes me :-) And I have found that always thinking about the best solution has forced me to keep learning new things while doing fun development~ But I will definitely take your advice on that once I go into industry :)

Thanks for pointing out the right courses for me~ (I am currently a graduate student in CS, but the courses I've taken, "Parallel Programming" and "Operating Systems", haven't covered such low-level detail.) Any idea where I can find a good online course for this? :) Thanks


That is discouraging. Well, I can't personally vouch for it, but MIT's OS class seems to hit all of the right areas: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-828-operating-system-engineering-fall-2012/

Good luck!
Geoff

The popular OSes we know are interrupt-driven. That means the OS schedules itself on the CPU: "wake me up in 16 ms". The interrupt causes the CPU to execute kernel code at the scheduled point in time. When woken up, the OS goes through the list of threads it needs to run and executes them. If there's nothing to run, it puts the core back to sleep (not without rescheduling itself before sleeping). On Windows, how frequently the OS reschedules itself is controlled (more like hinted) by timeBeginPeriod (it's a dreaded system-wide value). On Linux, it's defined at compile time. Both Linux and Windows now support a tickless kernel (which means the OS won't reschedule itself). But beware: that means all but one core are tickless. There must be at least one core rescheduling itself periodically.

Thanks Matias for such a detailed reply :)

So you mean there isn't a truly "immediate wakeup" even if you use a condition variable to wake up a thread, rather than waiting on a timer by calling "Sleep(ms)". Using a condition variable, then, seems like just asking the OS scheduler to do this "periodic checking", while calling "Sleep()" is doing the checking yourself? If that is the case, I would assume waiting on a condition variable won't be any better than calling Sleep with the right parameter in terms of responsiveness. (Please correct me if I am wrong.)

Not necessarily, because when you hit the condition variable's "wake up" functionality and enter the kernel, the OS may take that opportunity to check which threads need to be woken up and schedule them for immediate execution, rather than waiting for the next interrupt.

Kernel-side it's tricky, as the kernel needs to deal with a race condition: once the kernel starts messing with the scheduler, guess what could happen if an interrupt gets triggered (to re-evaluate the scheduler) while it's not yet done messing with the scheduler.

Polling with Sleep means the OS has no choice but to wait the time you requested. With condition variables, you may get woken sooner. So, no, they're not the same in terms of performance/efficiency/responsiveness.

Furthermore, Sleep(1) means you want to sleep for at least 1 ms. So if you just missed an interrupt and then call Sleep, you're going to have to wait 1.9999 ms (because by the next interrupt only 0.9999 ms will have elapsed, so it won't yet be time to wake up from the sleep). Whereas with condition variables you wait somewhere between 0 ms (if the kernel takes the chance when you entered the cond. var, or when another thread/process entered the kernel for an unrelated reason) and 0.9999 ms (if the kernel waits for the next interrupt).
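To make the contrast concrete, a sketch of the two approaches (shared globals for brevity; exact latencies depend on the timer period):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;

// Polling: latency is the sleep quantum, plus up to a full timer
// period if the interrupt was just missed.
void waitByPolling() {
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (ready) return;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Condition variable: the kernel may run this thread as soon as the
// notifier enters the kernel, potentially well before the next tick.
void waitOnCondVar() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });
}
```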

But this post suggests that a context switch between threads is also very costly. (Please correct me if I missed something important.)

That post is really about kernel-mode vs user-mode synchronization, which is related to, but separate from, the subject of context switching. Either method will block and potentially switch out a thread. On a side note, if you never create more threads than there are cores to run on, you'll never have a thread switched out, but you will still have to deal with synchronization.

Regarding synchronization, if the overhead of the synchronization is genuinely an issue, then the solution isn't in the method of synchronization but in the design of the system. Specifically, you've broken your tasks down too small. Take the trivial task of summing a million floats as an example. In the extreme, if we break the problem down to passing and summing pairs of floats on the worker threads, then synchronization becomes a major overhead no matter what solution you use. Never mind that you've completely destroyed your cache hit rate in the process (another topic entirely). If instead we block out large chunks of floats for each worker to process (i.e. send blocks of 250k floats to each of 4 workers), then synchronization becomes insignificant no matter what solution you use.
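A sketch of the chunked version (1,000,000 floats, 4 workers, one join per worker instead of per-pair synchronization):

```cpp
#include <numeric>
#include <thread>
#include <vector>

float parallelSum(const std::vector<float>& data) {
    const std::size_t workers = 4;
    const std::size_t chunk = data.size() / workers;
    std::vector<float> partial(workers, 0.0f);
    std::vector<std::thread> pool;

    for (std::size_t i = 0; i < workers; ++i) {
        // Each worker sums one large contiguous block, which also
        // keeps the cache hit rate healthy.
        const float* begin = data.data() + i * chunk;
        const float* end   = (i + 1 == workers)
                               ? data.data() + data.size()
                               : begin + chunk;
        pool.emplace_back([&partial, i, begin, end] {
            partial[i] = std::accumulate(begin, end, 0.0f);
        });
    }
    for (auto& t : pool) t.join();  // the only synchronization point
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```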

But this post suggests that a context switch between threads is also very costly. (Please correct me if I missed something important.)
That post is somewhat irrelevant insofar as it compares very different things.

A Win32 critical section is a user-mode spinlock which, after a thousand or so iterations, blocks on an NtKeyedEvent: a lightweight, process-specific kernel object that can block on a "key" (usually an address, but it need not be). Very much like a futex under Linux with FUTEX_PRIVATE_FLAG given (although, to be truthful, one must say a futex is like a keyed event, not the other way around; keyed events were there first). The keyed event does not need to fulfill the same guarantees as the Win32 mutex below.

A Win32 mutex is a heavyweight kernel object which can be duplicated and shared between different processes, and which guarantees that any access to it is synchronized accordingly and that overall order (across processes) is correct. This is a very different, much heavier thing.
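The difference shows up directly in the Win32 APIs. A sketch (the mutex name is made up, error checking omitted):

```cpp
#include <windows.h>

// Process-local critical section: the fast path stays in user mode.
void useCriticalSection() {
    CRITICAL_SECTION cs;
    InitializeCriticalSectionAndSpinCount(&cs, 4000);
    EnterCriticalSection(&cs);  // spins briefly, blocks only if contended
    // ... touch shared data ...
    LeaveCriticalSection(&cs);
    DeleteCriticalSection(&cs);
}

// Named kernel mutex: shareable across processes, but every acquire
// and release is a system call.
void useKernelMutex() {
    HANDLE mtx = CreateMutexA(nullptr, FALSE, "Global\\MyMutex");
    WaitForSingleObject(mtx, INFINITE);  // enters the kernel every time
    // ... touch shared data ...
    ReleaseMutex(mtx);
    CloseHandle(mtx);
}
```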

In short, a critical section costs 8-9 cycles in the best case (zero lock contention), a few dozen to a hundred cycles in the normal case, and ten thousand cycles followed by a mode switch in the very worst case. A mutex means a mode switch, plus rescheduling, every single time. User/kernel transitions are not as expensive as they used to be on modern operating systems and processors, since they are no longer (necessarily) context switches but merely mode changes. However, they are by no means cheap operations, rescheduling is not free, and in case a different thread is switched to, the mode change becomes a context switch (with the TLB gone and everything). Whenever you call a function that goes into the kernel, assume that you are possibly/probably burning something like 5,000-10,000 cycles.

Now, a context switch between two processes also includes moving pages from the working set to the standby list and the other way around, and killing the caches. Which, again, is not a free operation.

