C++, shared memory, threads
Posted by all_names_taken

The following program doesn't terminate when compiled with GCC on Linux using default compiler flags. However, it does terminate when a "sleep(1);" is added in the Thread1::operator() method after the std::cout line. My guess is that the variable "run" is only read once, before the while loop, rather than on every iteration. Making "run" volatile didn't help either. How should the program below be written (and what compiler flags should be used) to work as intended in a platform-independent manner (i.e. it should shut down thread1 when the main thread has finished executing its loop)?
#include <iostream>
#include <boost/thread.hpp>

bool run=true;
struct Thread1 {
	void operator()() const {
		while(run) {
			std::cout << "Hello from thread1\n";
		}
	}
};

int main(int argc, char *argv[]) {
	Thread1 t;
	boost::thread bt(t);

	for(int i=0; i<10; ++i) {
		std::cout << "Hello from thread2\n";
	}
	run=false;

	bt.join();

	return 0;
}


It terminates for me on the same setup. The program runs and finishes, although the spawned thread may not output anything because run gets set to false before Thread1 ever enters its while() loop. You don't need protection around the run boolean (a mutex or volatile) in this case. I would, however, be concerned about std::cout: as far as I am aware it is not thread-safe, and that could well explain the bad behaviour of not terminating.

Observe that the thread's operator() is not guaranteed to run at any particular time.

You construct the new boost::thread(), do some processing, set the flag, and then wait.

The act of creating a thread and preparing it to run takes a significant amount of work. It is likely that the system was still doing that initialization work when the flag was adjusted and the join() method was reached in the main thread.

Simply marking your run condition as volatile would not help, because that would not adjust program flow. It would force the compiler to fetch the value again, but that amount of work is trivial compared with the act of creating a new thread.


In your case, you need to use one of the synchronization objects to wait until the other thread is fully running. Avoid the mutex: it is often the first concept people learn for multithreading, yet it is the most expensive and time-consuming when executed. It appears that a boost::condition_variable would satisfy your needs.

You would want one before the main thread's loop, so that the child thread has time to become fully initialized and begin running. Your run flag would be another condition_variable, indicating when the processing is done.
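A minimal sketch of that arrangement, assuming Boost.Thread's boost::condition_variable and boost::mutex (the "started" flag and the overall structure are one illustrative way to wire it up, not the only one):

#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
#include <boost/thread/locks.hpp>

boost::mutex mtx;
boost::condition_variable cond;
bool started = false;	// set by the child thread once it is running
bool run = true;	// cleared by the main thread to request shutdown

struct Thread1 {
	void operator()() const {
		{
			// Tell the main thread we are fully up and running.
			boost::lock_guard<boost::mutex> lock(mtx);
			started = true;
		}
		cond.notify_one();

		for(;;) {
			{
				// Read the flag under the same mutex that guards writes to it.
				boost::lock_guard<boost::mutex> lock(mtx);
				if(!run) break;
			}
			std::cout << "Hello from thread1\n";
		}
	}
};

int main() {
	Thread1 t;
	boost::thread bt(t);

	{
		// Block until the child thread reports that it has started.
		boost::unique_lock<boost::mutex> lock(mtx);
		while(!started)
			cond.wait(lock);
	}

	for(int i=0; i<10; ++i)
		std::cout << "Hello from thread2\n";

	{
		// The mutex provides the memory barrier; no volatile needed.
		boost::lock_guard<boost::mutex> lock(mtx);
		run = false;
	}

	bt.join();
	return 0;
}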

Ok, I will try that and report back with the result soon.

But apart from that, in general: what is preventing the compiler from optimizing away the check of "run" unless it's volatile? Shouldn't it need to be volatile for this to work?

Probably nothing. Maybe the compiler saw threads in your code and decided to play it safe. Maybe the check will disappear if you increase the optimisation level.

But relying on volatile is a serious faux pas anyway. Don't do it unless you know exactly what you're doing, which, if you need to post a thread like this, you don't. (No offense.) If you want to communicate between threads, use something your threading library explicitly provides for the purpose. Shared data is the root of all concurrent-programming evil.

Quote:
Original post by Kylotan
But relying on volatile is a serious faux pas anyway. Don't do it unless you know exactly what you're doing, which, if you need to post a thread like this, you don't. (No offense.)

No offense taken. I know most of the theory about what can happen when compiler optimizations and multi-threading are combined, and I have a vague memory of reading an article about problems with volatile, but I can't recall the details, and that uncertainty is why I'm asking! In short: how can the above program be implemented so that it is guaranteed to work?

Quote:
Original post by all_names_taken
Does boost::thread at all provide anything for the purpose of shared memory?

What do you mean by "shared memory"?

It provides locking mechanisms and signals.

One locking mechanism is the shared lock (a reader/writer lock), which handles one of the safe ways to share access to memory: any number of threads may hold a read lock at once, but to write, a thread must take an exclusive write lock.

Does that answer your question? There are several other ways to transmit data between threads, and other ways to "share" memory; Boost offers a pretty comprehensive suite of the building blocks.
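A minimal sketch of that reader/writer idiom, assuming boost::shared_mutex (the shared_value, read_value and write_value names are made up for illustration):

#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>

boost::shared_mutex guard;	// reader/writer lock from Boost.Thread
int shared_value = 0;	// some data shared between threads

int read_value() {
	// Any number of threads may hold a shared (read) lock at once.
	boost::shared_lock<boost::shared_mutex> lock(guard);
	return shared_value;
}

void write_value(int v) {
	// A writer takes an exclusive lock, excluding readers and other writers.
	boost::unique_lock<boost::shared_mutex> lock(guard);
	shared_value = v;
}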

Quote:
Original post by all_names_taken
No offense taken. I know most of the theory about what can happen when compiler optimizations and multi-threading are combined, and I have a vague memory of reading an article about problems with volatile, but I can't recall the details, and that uncertainty is why I'm asking! In short: how can the above program be implemented so that it is guaranteed to work?
You need a memory barrier to actually transfer data between cores. Without one, an updated flag value (or a stale one) can end up stuck in a processor's cache. The volatile keyword is intended to disable compiler optimizations when dealing with things like hardware registers or when communicating with signal handlers; as such, it merely forces the compiler to issue read and write instructions in the appropriate places and has no direct effect on the cache.

Of course in this case you almost certainly want to use a higher-level synchronization primitive provided by your threading library, like a mutex or a condition variable. These will take care of the appropriate memory barriers for you and work with the operating system to avoid busy-waiting.
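A minimal sketch of the mutex-guarded flag (the helper names should_run and request_stop are made up for illustration):

#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>

boost::mutex run_mutex;
bool run = true;	// only ever touched while holding run_mutex

bool should_run() {
	// Locking and unlocking provide the needed memory barriers.
	boost::lock_guard<boost::mutex> lock(run_mutex);
	return run;
}

void request_stop() {
	boost::lock_guard<boost::mutex> lock(run_mutex);
	run = false;
}

The worker's loop then becomes while(should_run()) { ... }. Note that this version still polls the flag; it is the condition-variable approach that removes the busy-wait entirely.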

Quote:
Original post by frob
What do you mean by "shared memory"?

It provides locking mechanisms and signals.

One locking mechanism is the shared lock (a reader/writer lock), which handles one of the safe ways to share access to memory: any number of threads may hold a read lock at once, but to write, a thread must take an exclusive write lock.

Thanks, I'll take a look at that!

Quote:
Original post by frob
Does that answer your question?

Well, there are two issues with a test program of this type:
- The first is mutual exclusion: e.g. ensuring that there are no conflicts with thread A reading while thread B is writing, such that thread A reads something that is halfway updated. (Since a boolean write operation is most likely atomic, this part shouldn't be a problem even without a mutex lock.)

- The other issue is preventing the compiler from making optimizations that are only valid in a single-threaded program. The while loop may, for example, compile to the following assembly in an unoptimized build (pseudo-assembly):

...
start:
	read variable RUN into register X
	if register X is false goto end
	do instructions for loop body...
	goto start
end:
...

But since "RUN" isn't written to inside the loop body, in a single-threaded setting the compiler might want to be smart and hoist the read out of the loop, optimizing away the repeated reads of RUN from memory, giving this:


...
	read variable RUN into register X
start:
	if register X is false goto end
	do instructions for loop body...
	goto start
end:
...

This is disastrous if the intention is for the other thread to be able to modify RUN while this thread's while loop is executing, as a way to signal termination of the loop.
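Expressed back at the C++ level, the transformation looks like this (an illustration of what a single-thread-assuming optimizer may do, not literal compiler output):

#include <iostream>

bool run = true;	// modified by another thread, but the compiler
	// is allowed to assume single-threaded execution

void loop_as_written() {
	while(run) {	// intended: re-read run on every iteration
		std::cout << "Hello from thread1\n";
	}
}

void loop_as_optimized() {
	// Since nothing in the body writes to run, the optimizer may
	// hoist the read, effectively producing this:
	if(run) {
		for(;;) {
			std::cout << "Hello from thread1\n";
		}
	}
}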

Quote:
Original post by implicit
Quote:
Original post by all_names_taken
No offense taken. I know most of the theory about what can happen when compiler optimizations and multi-threading are combined, and I have a vague memory of reading an article about problems with volatile, but I can't recall the details, and that uncertainty is why I'm asking! In short: how can the above program be implemented so that it is guaranteed to work?
You need a memory barrier to actually transfer data between cores. Without one, an updated flag value (or a stale one) can end up stuck in a processor's cache. The volatile keyword is intended to disable compiler optimizations when dealing with things like hardware registers or when communicating with signal handlers; as such, it merely forces the compiler to issue read and write instructions in the appropriate places and has no direct effect on the cache.

Of course in this case you almost certainly want to use a higher-level synchronization primitive provided by your threading library, like a mutex or a condition variable. These will take care of the appropriate memory barriers for you and work with the operating system to avoid busy-waiting.

Ok, so mutexes provide the necessary memory barriers to protect against optimizations that remove the repeated reads from memory of the variable "run" in the example above? I didn't know that; I thought they were only intended to prevent concurrent access! I'll try this with boost::mutex as soon as I get to my Linux machine!

One thing I read, though, was that Java's semantics allow some degree of rearrangement of statements across the memory barrier in at least one direction. IIRC the compiler is allowed to move statements from after the end of the lock to inside it. Are there any similar gotchas to keep in mind when using mutexes from Boost.Thread?

Quote:
Original post by all_names_taken
- The first is mutual exclusion: e.g. ensuring that there are no conflicts with thread A reading while thread B is writing, such that thread A reads something that is halfway updated. (Since a boolean write operation is most likely atomic, this part shouldn't be a problem even without a mutex lock.)


No.


Relying on "a foo operation is atomic" to get correct multithreaded behaviour is not a good idea. There are many issues that you don't even begin to address by simply wishing for atomicity (because that write may not even be truly atomic, depending on what hardware you're on, what the cache looks like, how the compiler generates code, what the variable's alignment is, whether your compiler emits one-byte booleans or longer, int-sized booleans, etc. etc. etc.).

For instance, if all you do is hope that your reads/writes are atomic, you might run afoul of cache issues when your threads split across cores. One thread may see stale data (and therefore misbehave) if you don't use correct memory fences.

Another potential issue is that the CPU itself might rearrange reads and writes for pipeline efficiency, which destroys your assumed order of operations, and prevents data from moving correctly.

Never, ever, ever rely on atomic reads/writes to provide thread safety. If you have data that you need to touch from more than one thread, you need a synchronization primitive to do it correctly. (Note that this is true even in so-called "lockless" situations, where in reality you are locking at the CPU level instead of the OS/thread level, but still technically performing a locked operation.)
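As an illustration of what "locking at the CPU level" means, here is a toy spinlock sketch built from GCC's __sync atomic builtins (available since GCC 4.1); the spin_lock and spin_unlock names are made up for illustration, and lock-free algorithms are built from the same kind of CPU-locked instructions:

// 0 = free, 1 = held
static int lock_word = 0;

void spin_lock() {
	// Atomically write 1 and return the previous value (acquire barrier).
	// Keep spinning until the previous value was 0, i.e. the lock was free.
	while(__sync_lock_test_and_set(&lock_word, 1))
		;
}

void spin_unlock() {
	// Atomically write 0 with release semantics.
	__sync_lock_release(&lock_word);
}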

Quote:

One thing I read, though, was that Java's semantics allow some degree of rearrangement of statements across the memory barrier in at least one direction. IIRC the compiler is allowed to move statements from after the end of the lock to inside it. Are there any similar gotchas to keep in mind when using mutexes from Boost.Thread?

One of the problems with C++ is that it is standardised against an abstract machine, and this machine has only one thread of execution. This means the compiler is generally free to perform any optimisation whatsoever, as long as the resulting code behaves the same on that abstract machine.

This would obviously be a nightmare if statements could be reordered into and out of a conceptually "locked" block. I believe current compilers have some kind of intrinsic support for threads, so boost::thread will use the threading primitives provided by the platform, and the compiler understands these intrinsics and takes their meaning into account when optimising.

The exact details are not specified by the standard, so vary with each implementation.

But the next version of C++ is supposed to be thread-aware.
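That next version became C++0x (eventually C++11), which defines a memory model and atomic types. For reference, a sketch of the original program in that style, using std::atomic and std::thread rather than Boost:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> run(true);	// atomic: every load re-reads memory,
	// and stores become visible to other threads

int main() {
	std::thread worker([] {
		while(run.load()) {	// re-checked on every iteration
			std::cout << "Hello from thread1\n";
		}
	});

	for(int i=0; i<10; ++i)
		std::cout << "Hello from thread2\n";

	run.store(false);	// guaranteed to reach the worker
	worker.join();
	return 0;
}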

Quote:
Original post by all_names_taken
I'll try this with boost::mutex as soon as I get to my Linux machine! ... Are there any similar gotchas to keep in mind when using mutexes from Boost.Thread?
Argh!

As mentioned above, the mutex is usually the first thing the novice pulls from the shelf. It is also the worst choice in nearly every situation.

A mutex is a BIG thing. It requires the most system resources. It requires the biggest wait. It is the ultimate big synchronization blocker.

Avoid it whenever possible.



There is a highly relevant Intel article, Choosing Between Synchronization Primitives, that could be an interesting read for you. It focuses on Windows rather than Boost, but the information still applies.

Ideally you do not synchronize at all. There are many forms of asynchronous communication available. Use them. Rely on them. They are the One True Path to getting performance out of multithreaded code. If you must synchronize, use the fastest and lightest form available.

It looks like in the latest round of updates to Boost.Thread they decided to list ALL the synchronization primitives under "mutex". I think that is an unfortunate naming choice.

They very recently rewrote it, so the old performance comparisons no longer match the rewritten names and documentation.

It appears that:

timed mutex = Windows interlocked spinlock (fast)
shared mutex = Windows semaphore
condition variable = Windows semaphore
scoped mutex = Windows mutex (slow)
I don't think so. I'm no expert, but it doesn't look as if boost::mutex uses any of the Win32 synchronisation primitives besides events. That is, it doesn't seem to use the Win32 mutex, semaphore or critical section primitives.

I had a quick look through the source for boost::mutex. boost::timed_mutex and boost::mutex both share the same underlying implementation (boost::detail::basic_timed_mutex). When boost::mutex calls lock(), it actually calls timed_lock() with an infinite wait.

Inside timed_lock(), it appears that the designers at boost have created their own implementation of a semaphore, using Event primitives in Win32.

The semaphore keeps track of how many times it has been locked. I'm not entirely sure how it works, but it waits using WaitForSingleObject on a Win32 event object (created with CreateEvent). When you unlock, SetEvent is called if certain conditions are met, which unblocks everyone waiting in WaitForSingleObject.

This is using Boost 1.38 and Windows, of course. I'm not sure what any of this means for performance, only that the usual advice for native Windows primitives probably doesn't apply here.

