Inter-thread communication


For what it's worth, if anybody knows this software:

http://www.e-onsoftware.com/products/vue/

I'm the co-author of the OpenGL rendering, together with Christophe Riccio (http://www.g-truc.net/).

So basically, the OpenGL preview viewports of Vue have a queue system with a GUI-thread message producer and a render-thread message consumer, based on a dual std::vector that swaps during consumption (each frame). Each message push takes the lock (boost mutex + lock), and so does each consumption, just before swapping the queue. It just so happens that in common situations more than 200,000 messages are pushed per second, and it is in no way a bottleneck.
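A simplified sketch of that double-buffered shape (C++11 std::mutex standing in for the boost mutex; illustrative, not the actual Vue code):

#include <mutex>
#include <vector>

// Producers push under the lock; once per frame the consumer swaps the two
// vectors under the lock, then drains its private copy without holding it.
template <typename Msg>
class SwapQueue
{
    std::mutex mtx;
    std::vector<Msg> pending;   // filled by producer threads
    std::vector<Msg> draining;  // emptied by the consumer thread
public:
    void Push(const Msg &m)
    {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push_back(m);
    }

    template <typename Handler>
    void Consume(Handler &&handle)  // consumer thread, once per frame
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            draining.swap(pending);
        }
        for (const Msg &m : draining)
            handle(m);
        draining.clear();
    }
};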

I just mean to say: if you are afraid of 512 locks per frame... there is a serious problem in your "don't do too much premature optimization" thinking process that needs fixing.

I agree that it's great fun to attempt to write a lock-free queue, but for production code it's frankly not worth the risk, and plainly misplaced focus time.

Now, about the filter thing: one day, just to see, I made a filter to avoid useless duplication of messages. It worked, but it was slower than the raw, dumb queue. I don't say it's absolutely going to be the same in your case; just try... but in my case, being clever was being slower.


msg = disp.QueryFreeMessage();
if(msg)
    ....

T* QueryFreeMessage() { return &queue[iWriteMarker]; }

Since QueryFreeMessage() returns the address of a queue slot, it can never return null, so if(msg) will always be true.

Dispatch can fail, but the caller doesn't know.
Also, in this case, the caller has already used QueryFreeMessage to get a pointer to a queue-slot, and has written data into that slot, even though the queue is full (overwriting not-yet-consumed data).
You probably want to make QueryFreeMessage() able to fail (e.g. return null when the queue is full) to solve this, and change the error inside Dispatch into an assertion failure, because it shouldn't ever happen if the client is using the class correctly.

GetMessage increments the read cursor, which lets the other thread know that it's safe to override that slot... but if the write thread does reuse that slot before HandleMessage is called, then you'll have data that's being leaked (never actually getting consumed), and other data that gets consumed twice. To solve that, you'd have to only increment the read cursor after the data has been consumed.
Or, instead of returning T*s to the user, return a T by value that has been copied before the cursor is incremented.
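A minimal sketch of that fix, assuming a fixed-capacity single-producer/single-consumer ring buffer (the names here are illustrative, not from the original code):

#include <atomic>
#include <cstddef>

// The read cursor is only advanced after the slot's contents have been
// copied out, so the producer can never recycle a slot that hasn't been
// fully consumed yet. N should be a power of two so the modulo stays
// consistent when the cursors wrap around.
template <typename T, size_t N>
class SpscQueue
{
    T slots[N];
    std::atomic<size_t> readPos{0};
    std::atomic<size_t> writePos{0};
public:
    bool TryPush(const T &msg)  // producer thread only
    {
        const size_t w = writePos.load(std::memory_order_relaxed);
        if (w - readPos.load(std::memory_order_acquire) == N)
            return false;       // full -- caller can drop the message or retry
        slots[w % N] = msg;
        writePos.store(w + 1, std::memory_order_release);
        return true;
    }

    bool TryPop(T &out)         // consumer thread only
    {
        const size_t r = readPos.load(std::memory_order_relaxed);
        if (r == writePos.load(std::memory_order_acquire))
            return false;       // empty
        out = slots[r % N];     // copy out first...
        readPos.store(r + 1, std::memory_order_release);  // ...then free the slot
        return true;
    }
};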

Ah yes - indeed. Thanks for the keen observations. The first one I was aware of, and the solution I put in place was to silently drop the overflowed messages and dump a notification to the console. It isn't really my intention to have the final code do anything remarkable when QueryFreeMessage() fails (as it would after the amendments you propose), because I don't intend to implement any queue resizing or wait contingency anyway.

The broader picture I'm actually somewhat more concerned about is that, in terms of communicating between the input and update (render) threads, at the end of the day I don't think there's a way to make it lock-free. Or rather, I'm failing to grasp how to effectively go about synchronization.

Consider the two calls below. The example is simplistic, but it highlights the problem rather effectively: how can I guarantee that GetSize(), when called from an arbitrary thread, will actually return the correct values? While the SetSize()-GetSize() example is trivial, for any application that has several threads depending on the window's current width and height, this seems like a potential cluster-bork.

In other words - I think I've been far too concerned about making the updates non-concurrent, but what is instead much more worrisome is how the data can be reliably read.


//public, called by any thread
void Window::SetSize(int w, int h)
{
    //directly calls DoSetSize() if called by the consumer thread, otherwise posts the message to the queue
    DispatchWindowMessage(this, w, h);
}

//called by any thread
void Window::GetSize(int &w, int &h)
{
    //simple read, assumes both values are up-to-date
    w = width;
    h = height;
}

//only accessible to the consumer thread
void Window::DoSetSize(int w, int h)
{
    //can set these without a lock, but what if GetSize() is called at the same time from another thread?
    width = w;
    height = h;
}


A simple solution would be to leave this for the application to worry about, but I'm more interested in how this kind of synchronization is achieved in something more critical, like an operating system.



Thanks for the heads up! I was actually more surprised by the number of messages I was receiving for the number of windows in my GUI. The truth is that I simply didn't have a good grasp of how much communication was going on in my code. I hardly went so far as to optimize this - the filter in this case was rather more useful for debugging purposes :)



In this limited case you could just make the width and height a single 64-bit atomic variable. That will ensure that the size is always what a single call to DoSetSize() asked for. But, necessarily, without external synchronization GetSize() cannot guarantee that the size will remain what it returned one instruction later.
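A minimal sketch of that idea using C++11 std::atomic (the packing scheme and names are illustrative):

#include <atomic>
#include <cstdint>

// Both dimensions live in one 64-bit atomic, so a reader always sees a
// consistent (width, height) pair written by a single DoSetSize() call.
std::atomic<uint64_t> sizePacked{0};

void DoSetSize(int w, int h)
{
    sizePacked.store((uint64_t(uint32_t(w)) << 32) | uint32_t(h),
                     std::memory_order_release);
}

void GetSize(int &w, int &h)
{
    const uint64_t s = sizePacked.load(std::memory_order_acquire);
    w = int(s >> 32);
    h = int(uint32_t(s));
}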

There are two main categories of multi-threaded communication -- shared state and message passing (with the latter being the right default™, though the former is usually the first that's taught...).

You're mixing both of them here -- to change the window size you use message passing, to discover the window size you use shared state. Just pick one!

Either put a lock around the data, or use messages to retrieve the size as well as to set it (you could either send a 'getSize' message to the window containing the object that wants to know, and have the window send a 'setSize' message back to that object, or have objects register to receive 'onResized' messages).

Also, shared state requires synchronization. If some bit of data is going to be shared between multiple threads, it needs to be synchronized. Simply putting a mutex/critical-section around the width/height data is fine. Locking like this is only actually a huge performance penalty if there is contention (many threads trying to use that data at once). Keep in mind that lock-free synchronization generally also performs very poorly in situations of high contention!
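For the lock option, a minimal C++11 sketch (the class shape here is illustrative, not the poster's actual Window class):

#include <mutex>

// One mutex guards width/height as a pair, so any thread can read or write
// a consistent pair; uncontended lock acquisitions are cheap.
class WindowSize
{
    mutable std::mutex m;
    int width = 0;
    int height = 0;
public:
    void Set(int w, int h)
    {
        std::lock_guard<std::mutex> lock(m);
        width = w;
        height = h;
    }

    void Get(int &w, int &h) const
    {
        std::lock_guard<std::mutex> lock(m);
        w = width;
        h = height;
    }
};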

In this limited case you could just make the width and height a single 64-bit atomic variable. That will ensure that the size is always what a single call to DoSetSize() asked for. But, necessarily, without external synchronization GetSize() cannot guarantee that the size will remain what it returned one instruction later.

I kind of figured this myself after I did some toying around with my code. My initial idea was to pack all four values (position and size) into one qword, limiting both size and position to 16 bits (which frankly should be enough, unless the user has something like five 4K displays tiled horizontally). At least internally, my main concern isn't really whether a call to GetSize() returns the new or the old values, but rather that the returned values be correct for any given moment in time. I suspect Windows is doing something of this kind, because it only exposes one API call that can change both the size and position of a window at the same time (SetWindowPos()). Right off the bat I can't find information on what the maximum and minimum client coordinates or sizes are.



Yep - the thing is, I'm figuring this out as I go, and it was a two-step jump in logic to get to this conclusion. The thing I was fixated on was keeping helper parameters in my window class, such as rcClient and rcWindow, while the solution is to store both as combined volatiles and change derived calls such as GetWindowRect() to something like this:



volatile LONGLONG iSize, iPosition;

IRect GetWindowRect()
{
    //LDWORD()/HDWORD(): helper macros extracting the low/high 32 bits of a LONGLONG
    LONGLONG posLocal = iPosition;
    LONGLONG sizLocal = iSize;

    int x = LDWORD(posLocal);
    int y = HDWORD(posLocal);
    int w = LDWORD(sizLocal);
    int h = HDWORD(sizLocal);

    return IRect(x, y, x + w, y + h);
}

PS - as for contention: I'm not really worried about it in this case so much as I care about data correctness :)

This relies on reads and writes of LONGLONGs between the CPU and RAM being atomic operations.

That's the kind of thing where you're writing code that's making explicit assumptions about the CPU architecture that you're running on... This kind of code should only exist within low-level system-specific modules, such as the internal implementation of std::mutex, std::atomic, etc... i.e. code that you know you'll have to rewrite if you want to recompile for a different CPU.

P.S. writing to a LONGLONG is not one atomic operation on 32-bit x86 -- the data will be transferred in 2 operations (or 3 if it's unaligned), which means that this is exactly as unsafe as your previous code!

P.P.S. if you're using the volatile keyword (except in the above mentioned situation - writing CPU-dependent implementations of synchronisation primitives) then you have a bug. That's hyperbole, of course, but seriously, in many code bases the use of volatile is automatically rejected or flagged as a potential bug (because it doesn't do what most people think it does, and in 99% of cases, it's not actually useful in writing multithreaded code). In my entire engine, that keyword gets used once.

Volatile is not for thread synchronization; it's for IO. It's both too stringent and too weak for synchronization. It does not guarantee indivisibility, so it can split reads and writes into two operations. It excessively limits the ability of the compiler to move code around the volatile read or write, but then doesn't limit the CPU from doing the same reordering (because the CPU knows that it's not really doing IO). Atomic variables are the correct way to share data like that between threads.


That link was informative. As for the reason for using LONGLONG: I was going by the spec for InterlockedExchange64(), which seems to accept a LONGLONG pointer but internally operates on an __int64. The operation is apparently atomic regardless of the architecture (32- or 64-bit).

Also, there's apparently a fair amount of misinformation online regarding the use of volatile, and it is sometimes less than straightforward to figure out which author has it wrong. I did some Googling, and it's still somewhat unclear to me which library or lock mechanism, for that matter, to use - not only in this particular case (where I think I can get away with a lock-free solution, as there's little to no contention), but in general. Intel's Threading Building Blocks seems like a good library overall, with nice cross-platform support, although it's a bit heavy for what I'm doing. Overall my intention is to do simple buffer swapping between threads - in fact, so far the only place where some form of synchronization is needed is responding to user input.

As for locks, I take it most OS-specific mechanisms are better optimized than what most libraries can roll anyway. An interesting alternative seems to be a custom spin lock, which would do away with portability issues in one fell swoop, although this article highlights some of the potential performance costs.
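For reference, a minimal test-and-set spin lock in C++11 looks something like this (a sketch only - naive spin locks like this burn CPU while waiting and can behave badly under contention):

#include <atomic>

// Spins on an atomic flag until the current holder releases it. No fairness,
// no backoff -- fine for illustration, risky in production.
class SpinLock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ;  // spin until unlock() clears the flag
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};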

As for the reason for using LONGLONG: I was going by the spec for InterlockedExchange64(), which seems to accept a LONGLONG pointer but internally operates on an __int64. The operation is apparently atomic regardless of the architecture (32- or 64-bit).

Ah ok - yeah, if you use that intrinsic you can atomically move 64 bits of data, but a regular default assignment won't do it.
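A minimal sketch with the Win32 interlocked functions (variable names are illustrative); the compare-exchange-with-zero trick is a common way to get an atomic 64-bit load on 32-bit targets:

#include <windows.h>

volatile LONGLONG packedSize = 0;  // shared between threads

// Atomic 64-bit store, even on 32-bit x86.
void StoreSize(LONGLONG value)
{
    InterlockedExchange64(&packedSize, value);
}

// Atomic 64-bit load: if the current value is 0 it is swapped with 0 (a
// no-op); otherwise the exchange fails. Either way the current value is
// returned and the memory is left unchanged.
LONGLONG LoadSize()
{
    return InterlockedCompareExchange64(&packedSize, 0, 0);
}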

Also, there's apparently a fair amount of misinformation online regarding the use of volatile, and it is sometimes less than straightforward to figure out which author has it wrong. I did some Googling, and it's still somewhat unclear to me which library or lock mechanism, for that matter, to use.

If you're just after standard shared-memory stuff, then pthreads is probably a good choice of library -- it's basically the standard for threads, mutexes, etc. Alternatively, C++11 has actually standardized a lot of threading stuff, so there's std::thread, std::atomic, std::mutex and so on -- can't get more standard than that!
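A tiny C++11 example of those standard facilities working together (illustrative only): two threads bump a shared atomic counter, and join() synchronizes shutdown.

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<int> counter{0};
    auto work = [&counter] {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter.load() << "\n";  // always prints 200000
}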

Regarding volatile, IMHO unless you're writing a device driver, the only place you should use volatile is inside the implementation of std::atomic (or similar reinvented wheel).

As for locks, I take it most OS-specific mechanisms are better optimized than what most libraries can roll anyway. An interesting alternative seems to be a custom spin lock, which would do away with portability issues in one fell swoop, although this article highlights some of the potential performance costs.

I'm guilty of having one of those here [futex.h, atomic.h], and yes, I've managed to ruin the stability of Windows with it so badly that I was forced to hard-reboot a Win7 machine.

It's not like Windows (even 7) was ever really a robust OS:

http://blog.zorinaq.com/?e=74

No OOM killer, no completely fair scheduler, no full dynticks, a super-bad IO scheduler... it's just a BIG piece of 1993 lava-flow antipattern in the wild.

