Inter-thread communication



#1 irreversible   Crossbones+   -  Reputation: 1343


Posted 16 July 2014 - 04:46 PM

tl;dr: this is a question of how best to implement my own version of PostThreadMessage().

 

More precisely, my target problem right now is communication between the input and render threads in the context of a GUI. I'd like this same logic to be extensible to communication between the input and game logic threads, and between the load and render threads. Along the way I've settled on a fairly simple counter-based method, simply because it works.

 

My assumptions are:

 

- I don't want to use a locking mechanism

- instead, I either use polling in the render thread if running in real-time mode, or signaling via something like a waitable object to force a refresh if not running in real-time mode. This bit I'm somewhat worried about, as I'm not yet sure how portable it is.

- I'd like the communication scheme to be extensible to any many-to-one thread model where many threads can send messages to one target thread

 

With this in mind, the best (and so far only, really) way I've managed to think of is a FIFO stack with each sender thread having its own dispatch queue and a write marker that points into a circular buffer (e.g. the stack). The receiving thread then reads everything from its respective read marker up to the write marker at each iteration (or when notified). There is no synchronization between the two threads other than a possible SetEvent() called by the sender if the render thread doesn't already poll in a tight loop.

 

So far this has been a fairly straightforward and foolproof approach. The primary problem I see here is guesstimating the size of the message buffer and, in particular, deciding what to do when an overflow occurs. Before going about reallocating memory to fit more messages, I figured it'd make sense to see whether there might be more trickery involved in the way Windows does it (yeah, that was a question).

 

Also, in a way it feels a bit like a hack, even though it works :)




#2 Hodgman   Moderators   -  Reputation: 30349


Posted 16 July 2014 - 04:59 PM

What you've implemented is a "lock-free queue", which is a structure that's notoriously difficult to implement. If you have no synchronization code in there, then I can assure you that you likely have a bug ;P
For it to be portable, you need to insert memory fences at specific points to ensure the order of reads and writes between threads is consistent. The difficult part here is that *most* x86 CPUs will seem to work fine even without the fences, making it hard to tell if your code is correct or not...
I had a unit-test for my previous lock-free multiple-producer/multiple-consumer FIFO set to run every time I compiled my code. After passing continuously for 6 months, it finally failed once due to a tiny race condition... ;(

Resizable lock-free structures are even harder to design, so if this is for a game, I would advocate just "knowing your data" and setting a suitable upper limit (and asserting/crashing on overflow).

If this is for Windows only, there's a lock-free singly linked list somewhere in the Win32 API (I forget the name... Slist?), which is probably used by their own messaging stuff.

Some links:
http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
https://code.google.com/p/eight/source/browse/include/eight/core/thread/fifo_mpmc.h
https://code.google.com/p/eight/source/browse/include/eight/core/thread/fifo_spsc.h

The last two links are into my code-base. The read/write cursors are 'atomics', which means they insert memory fences when they are used.
The MPMC version (multiple producers/consumers) has an atomic flag inside each data item, so that the write cursor can be incremented, then the data written, then the data-flag set to true - so that multiple writer-threads can use it at once. Reader threads can then wait on these flags to be true before incrementing the read cursor.

In some cases though, it might actually be better to use an array of SPSC queues, because this way there will be no contention between writer-threads!
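For illustration, here's a minimal C++11 sketch of a bounded SPSC queue along those lines (my sketch of the idea, not the code from the links above) - the acquire/release operations on the cursors are where the memory fences come in:

#include <atomic>
#include <cstddef>

template<typename T, size_t Capacity>
struct SpscQueue
{
	T buffer[Capacity];
	std::atomic<size_t> writeCursor{0};
	std::atomic<size_t> readCursor{0};

	bool push(const T& item) // producer thread only
	{
		size_t w = writeCursor.load(std::memory_order_relaxed);
		size_t next = (w + 1) % Capacity;
		if(next == readCursor.load(std::memory_order_acquire))
			return false; // full - assert/crash/drop, per the advice above
		buffer[w] = item;                                   // write the data first...
		writeCursor.store(next, std::memory_order_release); // ...then publish it
		return true;
	}

	bool pop(T& out) // consumer thread only
	{
		size_t r = readCursor.load(std::memory_order_relaxed);
		if(r == writeCursor.load(std::memory_order_acquire))
			return false; // empty
		out = buffer[r]; // copy the data out first...
		readCursor.store((r + 1) % Capacity, std::memory_order_release); // ...then free the slot
		return true;
	}
};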

#3 Washu   Senior Moderators   -  Reputation: 5177


Posted 16 July 2014 - 05:17 PM


If this is for Windows only, there's a lock-free singly linked list somewhere in the Win32 API (I forget the name... Slist?), which is probably used by their own messaging stuff.

Yes, SList is correct.

 

Additionally, Boost has some lock-free containers in it, and if I recall correctly Intel released a whole library of lock-free data structures.


Edited by Washu, 16 July 2014 - 05:22 PM.



#4 mark ds   Members   -  Reputation: 1239


Posted 16 July 2014 - 05:30 PM

This is totally off the top of my head, so correct me if I'm wrong...

 

initialise the atomic to zero, then:

#include <atomic>

std::atomic<int> users(0);

void ReadWriteFromQueue( bool read )
{
    while( ++users != 1 ) // another thread has access already
    {
        --users;
    }

    if( read )
    {
        // read from queue code here
    }
    else
    {
        // write to queue here
        // if queue size is too small, resize (possibly during development to best guess max queue size)
    }

    --users;
}


Edited by mark ds, 16 July 2014 - 05:32 PM.


#5 SeanMiddleditch   Members   -  Reputation: 5741


Posted 16 July 2014 - 06:23 PM

This is totally off the top of my head, so correct me if I'm wrong...


This is not lock-free. You've just implemented a (very poor) spinlock-based critical section. That spinlock is going to burn through CPU since it has no back-off, it allows thread starvation, and the while loop will do a lot of cache bouncing. There are lock-free circular-buffer-based bounded concurrent queues and lock-free linked-list unbounded concurrent queues, and even more advanced wait-free concurrent queues.
 

With this in mind, the best (and so far only, really) way I've managed to think of is a FIFO stack


This is roughly what you should be doing. Some other languages call these "channels" or "ports" or just "concurrent queues." Everything that Hodgman said is true, though; these are difficult data structures to write and yours is almost certainly broken without you realizing it, so you should take one of the existing well-tested rock-solid implementations instead of trying to make your own.

Do be aware of a potential deadlock you can get into with this design. If your render thread is ever stuck waiting for the main thread and the main thread is stuck waiting for the render thread, your game will deadlock. You do end up needing some synchronization like this due to the peculiarities of what specifically is allowed to live on a render thread and which parts of rendering have to live on the main thread.

These kinds of deadlocks can happen even more frequently with bounded concurrent queues, since you can't push a new event into said queue when it's full but the other thread might _also_ be blocked trying to push into its queue, so you have to be smart about what you do when you need to push a value into a bounded queue. Unbounded queues don't necessarily solve this problem, either; they just make it harder to detect and the results more serious. You have to think a lot about the synchronization between threads (or processes, or services) even in "lock-free" contexts.

This is one of the reasons I prefer fork-join parallelism. You don't have a dedicated render thread but rather a pool of worker threads that can be tasked to do physics work, AI work, rendering work, etc, while the main thread continues to run serially over the game loop (the actual OS "main thread," e.g. the "input thread," may still need to be a separate thread from the game/main thread and you still need to deal with all the above problems, but at least it's localized to just one pair of threads and not for each subsystem).

#6 irreversible   Crossbones+   -  Reputation: 1343


Posted 17 July 2014 - 04:17 AM


If you have no synchronization code in there, then I can assure you that you likely have a bug ;P

 

 

 


Resizable lock-free structures are even harder to design, so if this is for a game, I would advocate just "knowing your data" and setting a suitable upper limit (and asserting/crashing on overflow).

 

I haven't gone over the links you provided yet, but with the SPSC approach I'm using I don't think I need atomics - even for the read/write markers, as it should be safe to assume these can never conflict. Right?

 

The toughest part of writing a multi-threaded application is comprehending the logic of execution flow, and I have indeed hit a wall, with strange effects showing up every now and then for apparently no reason. This is why I'm finally setting up a messaging system and am not afraid of taking my time with it. Thanks for going to the trouble of posting some reading!

With respect to handling overflow - I'm actually not all that concerned about it. It's just that my initial brute-force addition of GUI refresh messages for in-game windows somewhat surprised me: even though I only had 74 controls (hidden and visible) being created at application start, without filtering my dispatcher received almost 2000 refresh commands before the first frame was drawn. I figured I'd done a fairly thorough job of removing redundancy from my code, but cascading the dirty flag can really add up quickly. With simple filtering it wasn't hard to normalize the number of messages, and it helped me identify a simple missing check in the rest of the code, but the cold truth (and a lesson) is that my conservative queue length estimate of 512 messages overflowed quite quickly. And that made me worried.

 

I suppose the dispatch queue can also be a fairly efficient way of identifying redundancy that is otherwise somewhat hard to spot.

 

 

 


In some cases though, it might actually be better to use an array of SPSC queues, because this way there will be no contention between writer-threads!

 

That's actually what I'm doing right now: each producer thread has its own queue and the consumer iterates through each of them. I don't really see a reason to use an MPSC/MPMC queue in a game, as there aren't too many threads to start with. I mentioned this in my own words in the OP :)

 

I guess if my target were a new operating system with an unbounded thread count or some thread-heavy distributed system, I would need a less bloated approach to minimize memory redundancy and centralize messaging, but right now (especially once I've profiled my code a bit and know how many messages are actually flying around) I don't really have a problem with allocating relatively copious amounts of space to accommodate large enough queues. The way I'm thinking right now is along the lines of "simpler is better": a megabyte per thread-thread interaction can hold a fair number of messages and costs almost nothing.

 

 

 


These kinds of deadlocks can happen even more frequently with bounded concurrent queues since you can't push a new event into said queue when it's full but the other thread might _also_ be blocked trying to push into its queue, so you have to be smart about what you do when you need to push a value into a bounded queue. Unbounded queues don't necessarily solve this problem, either; they just change it to be harder to detect and the results to be more serious. You have to think a lot about the synchronization between threads (or processes, or service) even in "lock-free" contexts.

 

I do appreciate the potential complexity that can arise when the number of threads increases. Since I'm still dabbling with thread-safe messaging I think I'll do what Hodgman said and set up a hard limit to the queue length. Once I hit that during development I'll just analyze the reason and either work backwards to reduce the number of messages or increase the queue size as needed.

 

 

 


Do be aware of a potential deadlock you can get into with this design. If your render thread is ever stuck waiting for the main thread and the main thread is stuck waiting for the render thread, your game will deadlock.

 

With an SPSC queue, frankly, I don't see how a deadlock can occur, especially since neither of the threads is waiting for the other. I'm going for true asynchronous communication here, where the render thread just takes pre-calculated input at the start of each frame and applies it all at once to avoid mid-frame updates.


Edited by irreversible, 17 July 2014 - 04:28 AM.


#7 Hodgman   Moderators   -  Reputation: 30349


Posted 17 July 2014 - 04:52 AM

with the SPSC approach I'm using I don't think I need atomics - even for the read/write markers, as it should be safe to assume these can never conflict. Right?
 

Only if you can assume that your code executes in the order that you've written it in, which you can't. So - Not right :D

First up, the compiler is totally allowed to reorder your code, as long as the reordered version results in the same visible behavior from a single-threaded point of view.
In a single-threaded program, it doesn't matter whether you write new data and then update the cursor, or whether you do those two writes in the other order... The outcome is the same, so the compiler is free to play with that ordering.
In a multi-threaded program though, the wrong ordering can cause the consumer to try and read new data before it has even been written!
So, unless you're using lovingly hand-crafted ASM, you at the very least require a compile-time memory barrier to force the compiler to emit the instructions in the correct order.

Second, CPUs are damn fancy these days - they basically have an optimizer built in!
The CPU is totally allowed to reorder your code, as long as the reordered version results in the same visible behavior from a single-threaded point of view! It can reorder the instruction stream, and can also buffer up and reorder reads and writes to memory!
That means the same situation can occur, where data/cursors are read/written out of order and you end up with corrupt results.
Now, x86 is pretty nice and although it does a lot of reordering, it does create some kinds of implicit fences around certain memory accesses to avoid this problem, but there are many CPUs that don't. To write solid, portable lock-free code, you either have to read the manuals for your target CPUs so you know what assumptions are valid, or you need to insert the appropriate runtime memory fence instructions, which will tell the CPU not to reorder (if necessary).
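To make that concrete, here's a minimal sketch (illustrative names, not from any particular code-base) of publishing data through a flag - the release/acquire pair is exactly the fencing described above:

#include <atomic>

int payload = 0;                // plain data, written by one thread, read by another
std::atomic<bool> ready(false); // the "cursor"

void Producer()
{
	payload = 42;                                 // (1) write the data
	ready.store(true, std::memory_order_release); // (2) publish it - the release keeps (1) before (2)
}

void Consumer()
{
	if(ready.load(std::memory_order_acquire)) // the acquire keeps the read below after the flag check
	{
		int value = payload; // guaranteed to observe 42
		(void)value;
	}
}

With a plain bool (or a volatile one) in place of the atomic, either the compiler or the CPU would be free to reorder (1) and (2), which is exactly the corruption described above.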

#8 irreversible   Crossbones+   -  Reputation: 1343


Posted 17 July 2014 - 09:09 AM

 

with the SPSC approach I'm using I don't think I need atomics - even for the read/write markers, as it should be safe to assume these can never conflict. Right?
 

Only if you can assume that your code executes in the order that you've written it in, which you can't. So - Not right :D

First up, the compiler is totally allowed to reorder your code, as long as the reordered version results in the same visible behavior from a single-threaded point of view.
In a single-threaded program, it doesn't matter whether you write new data and then update the cursor, or whether you do those two writes in the other order... The outcome is the same, so the compiler is free to play with that ordering.
In a multi-threaded program though, the wrong ordering can cause the consumer to try and read new data before it has even been written!
So, unless you're using lovingly hand-crafted ASM, you at the very least require a compile-time memory barrier to force the compiler to emit the instructions in the correct order.

Second, CPUs are damn fancy these days - they basically have an optimizer built in!
The CPU is totally allowed to reorder your code, as long as the reordered version results in the same visible behavior from a single-threaded point of view! It can reorder the instruction stream, and can also buffer up and reorder reads and writes to memory!
That means the same situation can occur, where data/cursors are read/written out of order and you end up with corrupt results.
Now, x86 is pretty nice and although it does a lot of reordering, it does create some kinds of implicit fences around certain memory accesses to avoid this problem, but there are many CPUs that don't. To write solid, portable lock-free code, you either have to read the manuals for your target CPUs so you know what assumptions are valid, or you need to insert the appropriate runtime memory fence instructions, which will tell the CPU not to reorder (if necessary).

 

 

Point taken, understood and appreciated :).

 

I'm in a bit of a hurry, but just for kicks, here's my Q'n'D implementation. It seems to be working at first glance, but it's open for peer review. Note that I don't need to send generic messages (e.g. the range of message types is restricted to simple pointer/variable updates which are contained in a structure specific to each consumer thread), thus avoiding unnecessary casting.

 

Producer thread:

 

msg = disp.QueryFreeMessage();
if(msg)
	{ /* fill in msg, then call */ disp.Dispatch(); }

 

Consumer thread:

 

while(msg = disp.GetMessage()) { HandleMessage(msg); }

#define MAX_PRODUCER_THREADS	64
#define MAX_PRODUCER_MESSAGES	65536

//LINUX:	__sync_add_and_fetch()
//OSX:		OSAtomicAdd32
//WINDOWS:	InterlockedIncrement()
UINT MyAtomicIncrement(	IN volatile UINT * ptr);
void MyAtomicSet(		IN volatile UINT * ptr, IN UINT value);

template<class T>
struct IThreadDispatchQueueSPSC	{
	UINT iThreadID;

	volatile UINT iWriteMarker;
	volatile UINT iReadMarker;

	T queue[MAX_PRODUCER_MESSAGES];

	IThreadDispatchQueueSPSC()
		{
		iWriteMarker = 0;
		iReadMarker = 0;
		}

	T* QueryFreeMessage()
		{ return &queue[iWriteMarker]; }

	T* GetMessage()
		{
		if(iReadMarker == iWriteMarker)
			return 0;

		T* res = &queue[iReadMarker];

		UINT iNext = iReadMarker + 1;

		if(iNext >= MAX_PRODUCER_MESSAGES - 1)
			{ iNext = 0; }

		MyAtomicSet(&iReadMarker, iNext);

		return res;
		}

	void Dispatch()
		{
		UINT iNext = iWriteMarker + 1;

		if(iNext >= MAX_PRODUCER_MESSAGES - 1)
			{ iNext = 0; }

		if(iNext == iReadMarker)
			{ lout << "Warning: dispatch queue full ('" << TStackTrace::getThreadName(iThreadID) << "')" << endl; return; }

		MyAtomicSet(&iWriteMarker, iNext);
		}
	};

template <class T>
struct IThreadMessageQueue_MPSCArray	{
	UINT iConsumerThreadID;

	IThreadDispatchQueueSPSC<T> producerThreads[MAX_PRODUCER_THREADS];

	volatile UINT iNumProducerThreads;

	IThreadMessageQueue_MPSCArray()
		{
		iNumProducerThreads = 0;
		iConsumerThreadID = 0;

		ZeroMemory(producerThreads, sizeof(producerThreads));
		}

	bool RegisterProducerThread(UINT iThreadID)
		{
		if(iNumProducerThreads >= MAX_PRODUCER_THREADS)
			{
			lout << "Error: too many producer threads: " << iNumProducerThreads << endl;
			//THROW("Too many producer threads");
			return false;
			}

		int iIndex = MyAtomicIncrement(&iNumProducerThreads) - 1;
		lout << "new producer thread: " << iIndex << " " << iThreadID << endl;
		producerThreads[iIndex].iThreadID = iThreadID;

		return true;
		}

	//returns a pointer to the next free message, NULL if the stack overflows.
	T* QueryFreeMessage()
		{
		UINT iThreadID = GetCurrentThreadId();

		for(int i = 0; i < (int)iNumProducerThreads; i++)
			{
			IThreadDispatchQueueSPSC<T>* t = &producerThreads[i];

			if(iThreadID == t->iThreadID)
				{ return t->QueryFreeMessage(); }
			}

		//add this thread
		if(!RegisterProducerThread(iThreadID))
			return NULL;

		//try again
		return QueryFreeMessage();
		}

	//produce a message into the queue
	void Dispatch()
		{
		UINT iThreadID = GetCurrentThreadId();

		for(int i = 0; i < (int)iNumProducerThreads; i++)
			{
			IThreadDispatchQueueSPSC<T>* t = &producerThreads[i];
			if(t->iThreadID == iThreadID)
				{ t->Dispatch(); break; }
			}
		}

	//iterates through unhandled messages sent to this thread. Should only be called by the consumer thread.
	T* GetMessage()
		{
		for(int i = 0; i < (int)iNumProducerThreads; i++)
			{
			IThreadDispatchQueueSPSC<T>* t = &producerThreads[i];

			if(t->iReadMarker != t->iWriteMarker)
				{ return t->GetMessage(); }
			}

		//out of messages
		return NULL;
		}
};



#9 Hodgman   Moderators   -  Reputation: 30349


Posted 20 July 2014 - 08:48 AM

msg = disp.QueryFreeMessage();
if(msg)
....
T* QueryFreeMessage() { return &queue[iWriteMarker]; }

if(msg) will always be true.
 
Dispatch can fail, but the caller doesn't know :(
Also, in this case, the caller has already used QueryFreeMessage to get a pointer to a queue-slot, and has written data into that slot, even though the queue is full (overwriting not-yet-consumed data).
You probably want to make QueryFreeMessage actually return NULL to solve this, and change the error inside Dispatch into an assertion failure, because it shouldn't ever happen if the client is using the class correctly.
 
GetMessage increments the read cursor, which lets the other thread know that it's safe to overwrite that slot... but if the write thread does reuse that slot before HandleMessage is called, then you'll have data that's being leaked (never actually getting consumed), and other data that gets consumed twice. To solve that, you'd have to only increment the read cursor after the data has been consumed.
Or, instead of returning T*s to the user, return a T by value, copied before the cursor is incremented.
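Applied to the queue from post #8, that last suggestion might look like this (a sketch; GetMessageCopy is a hypothetical name, reusing that code's members):

//copy the slot out *before* publishing the new read marker, so the producer
//can never reclaim a slot the consumer is still reading
bool GetMessageCopy(T& out)
	{
	if(iReadMarker == iWriteMarker)
		return false; //no messages pending

	out = queue[iReadMarker]; //copy first...

	UINT iNext = iReadMarker + 1;

	if(iNext >= MAX_PRODUCER_MESSAGES - 1)
		{ iNext = 0; }

	MyAtomicSet(&iReadMarker, iNext); //...then release the slot

	return true;
	}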
 


Additionally, Boost has some lock-free containers in it, and if I recall correctly Intel released a whole library of lock-free data structures.

Cool - last time I looked at that, it was just at the idea / submission-for-review stage, not actually accepted into Boost yet.



#10 Washu   Senior Moderators   -  Reputation: 5177


Posted 20 July 2014 - 05:12 PM

 

Additionally, Boost has some lock-free containers in it, and if I recall correctly Intel released a whole library of lock-free data structures.

Cool - last time I looked at that, it was just at the idea / submission-for-review stage, not actually accepted into Boost yet.

 

Yeah, it's nice that they finally got it up and going... as for the Intel library I was thinking of... here's a link




#11 Lightness1024   Members   -  Reputation: 736


Posted 22 July 2014 - 08:27 AM

For what it's worth, if anybody knows this software:

http://www.e-onsoftware.com/products/vue/

I'm the co-author of the OpenGL rendering, together with Christophe Riccio (http://www.g-truc.net/).

So basically, the OpenGL preview viewports of Vue have a GUI-thread message producer and render-thread message consumer queue system, based on a dual std::vector that swaps during consumption (each frame); each message "push" takes the lock (boost mutex + lock), as does each consumption before swapping the queue. It just so happens that in common situations more than 200,000 messages are pushed per second, and it is in no way a bottleneck.
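A minimal sketch of that dual-vector scheme (my reconstruction from the description above, not Vue's actual code, with std::mutex standing in for the boost mutex):

#include <mutex>
#include <vector>

template<typename Msg>
struct SwappingQueue
{
	std::vector<Msg> pending; // filled by the GUI thread
	std::mutex lock;

	void Push(const Msg& m) // producer: takes the lock once per message
	{
		std::lock_guard<std::mutex> g(lock);
		pending.push_back(m);
	}

	template<typename Handler>
	void Consume(Handler handle) // consumer: swaps once per frame
	{
		std::vector<Msg> batch;
		{
			std::lock_guard<std::mutex> g(lock);
			batch.swap(pending); // O(1) swap under the lock
		}
		for(size_t i = 0; i < batch.size(); ++i)
			handle(batch[i]); // processed without holding the lock
	}
};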

I just mean to say: if you are afraid of 512 locks per frame, there is a serious problem in your "don't do too much premature optimization" thinking process that needs fixing.

I agree that it's great fun to attempt to write a lock-free queue but, for production code, it's frankly not worth the risk and plainly misplaced focus time.

 

Now, about the filter thing: one day, just to see, I made a filter to avoid useless duplication of messages. It worked, but was slower than the raw, dumb queue. I don't say it's necessarily going to be the same in your case - just try - but in my case being clever was being slower.



#12 irreversible   Crossbones+   -  Reputation: 1343


Posted 22 July 2014 - 11:05 AM

 

msg = disp.QueryFreeMessage();
if(msg)
....
T* QueryFreeMessage() { return &queue[iWriteMarker]; }

if(msg) will always be true.
 
Dispatch can fail, but the caller doesn't know :(
Also, in this case, the caller has already used QueryFreeMessage to get a pointer to a queue-slot, and has written data into that slot, even though the queue is full (overwriting not-yet-consumed data).
You probably want to make QueryFreeMessage actually return NULL to solve this, and change the error inside Dispatch into an assertion failure, because it shouldn't ever happen if the client is using the class correctly.
 
GetMessage increments the read cursor, which lets the other thread know that it's safe to overwrite that slot... but if the write thread does reuse that slot before HandleMessage is called, then you'll have data that's being leaked (never actually getting consumed), and other data that gets consumed twice. To solve that, you'd have to only increment the read cursor after the data has been consumed.
Or, instead of returning T*s to the user, return a T by value, copied before the cursor is incremented.

 

Ah yes - indeed. Thanks for the keen observations. The first one I was aware of, and the solution I put in place was to silently drop the overflowed messages and dump a notification to the console. It isn't really my intention to have the final code do anything remarkable when QueryFreeMessage() fails (as it would after the amendments you propose), because I don't intend to implement any queue resizing or wait contingency anyway.

 

The broader picture I'm actually somewhat more concerned about is communication between the input and update (render) threads - at the end of the day I don't think there's a way to make it lock-free. Or rather, I'm failing to grasp how to go about synchronization effectively.

 

Consider the two calls below. The following example is simplistic, but it highlights the problem rather effectively: how can I guarantee that GetSize(), when called from an arbitrary thread, will actually return the correct values? In fact, while the SetSize()-GetSize() example is trivial, for any application that has several threads that depend on the window's current width and height, this seems like a potential cluster-bork.

 

In other words - I think I've been far too concerned about making the updates non-concurrent, but what is instead much more worrisome is how the data can be reliably read.

//public, called by any thread
void Window::SetSize(int w, int h)
{
	//directly calls DoSetSize() if called by the consumer thread, otherwise posts the message to the queue
	DispatchWindowMessage(this, w, h);
}

//called by any thread
void Window::GetSize(int &w, int &h)
{
	//simple read, assumes both values are up-to-date
	w = width;
	h = height;
}

//only accessible to the consumer thread
void Window::DoSetSize(int w, int h)
{
	//can set these without a lock, but what if GetSize() is called at the same time from another thread?
	width = w;
	height = h;
}


A simple solution would be to leave this for the application to worry about, but I'm more interested in how this kind of synchronization is achieved in something more critical, like an operating system.

 

 


For what it's worth, if anybody knows this software:
http://www.e-onsoftware.com/products/vue/
I'm the co-author of the OpenGL rendering, together with Christophe Riccio (http://www.g-truc.net/).
So basically, the OpenGL preview viewports of Vue have a GUI-thread message producer and render-thread message consumer queue system, based on a dual std::vector that swaps during consumption (each frame); each message "push" takes the lock (boost mutex + lock), as does each consumption before swapping the queue. It just so happens that in common situations more than 200,000 messages are pushed per second, and it is in no way a bottleneck.
I just mean to say: if you are afraid of 512 locks per frame, there is a serious problem in your "don't do too much premature optimization" thinking process that needs fixing.
I agree that it's great fun to attempt to write a lock-free queue but, for production code, it's frankly not worth the risk and plainly misplaced focus time.
 
Now, about the filter thing: one day, just to see, I made a filter to avoid useless duplication of messages. It worked, but was slower than the raw, dumb queue. I don't say it's necessarily going to be the same in your case - just try - but in my case being clever was being slower.

 

Thanks for the heads-up! I was actually more surprised by the number of messages I was receiving for the number of windows I had in my GUI. The truth is that I simply didn't have a good grasp of how much communication was going on in my code. I hardly went so far as to optimize this - the filter in this case was rather more useful for debugging purposes :)



#13 King Mir   Members   -  Reputation: 2000


Posted 22 July 2014 - 09:27 PM


The broader picture I'm actually somewhat more concerned about is communication between the input and update (render) threads - at the end of the day I don't think there's a way to make it lock-free. Or rather, I'm failing to grasp how to go about synchronization effectively.

 

Consider the two calls below. The following example is simplistic, but it highlights the problem rather effectively: how can I guarantee that GetSize(), when called from an arbitrary thread, will actually return the correct values? In fact, while the SetSize()-GetSize() example is trivial, for any application that has several threads that depend on the window's current width and height, this seems like a potential cluster-bork.

 

In other words - I think I've been far too concerned about making the updates non-concurrent, but what is instead much more worrisome is how the data can be reliably read.

//public, called by any thread
void Window::SetSize(int w, int h)
{
	//directly calls DoSetSize() if called by the consumer thread, otherwise posts the message to the queue
	DispatchWindowMessage(this, w, h);
}

//called by any thread
void Window::GetSize(int &w, int &h)
{
	//simple read, assumes both values are up-to-date
	w = width;
	h = height;
}

//only accessible to the consumer thread
void Window::DoSetSize(int w, int h)
{
	//can set these without a lock, but what if GetSize() is called at the same time from another thread?
	width = w;
	height = h;
}


A simple solution would be to leave this for the application to worry about, but I'm more interested in how this kind of synchronization is achieved in something more critical, like an operating system.

 

In this limited case you could just make the width and height a single 64-bit atomic variable. That will ensure that the size is always what a single call to DoSetSize() asked for. But, necessarily, calling GetSize() cannot guarantee, without external synchronization, that the size remains what it returned one instruction later.
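A sketch of that suggestion with C++11 atomics (illustrative names, assuming int fits in 32 bits):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> packedSize(0);

void DoSetSize(int w, int h) // consumer thread
{
	packedSize.store((uint64_t(uint32_t(w)) << 32) | uint32_t(h),
	                 std::memory_order_release);
}

void GetSize(int& w, int& h) // any thread: always sees a consistent w/h pair
{
	uint64_t s = packedSize.load(std::memory_order_acquire);
	w = int(uint32_t(s >> 32));
	h = int(uint32_t(s));
}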



#14 Hodgman   Moderators   -  Reputation: 30349


Posted 22 July 2014 - 10:06 PM

There are two main categories of multi-threaded communication -- shared state and message passing (with the latter being the right default™, though the former usually being the first that's taught...).

 

You're mixing both of them here -- to change the window size you use message passing, to discover the window size you use shared state. Just pick one! :P

Either put a lock around the data, or use messages to retrieve the size as well as setting it (you could either send a 'getSize' message to the window containing the object who wants to know, and have the window send a 'setSize' message back to that object, or, have objects register to receive 'onResized' messages).

 

Also, shared state requires synchronization. If some bit of data is going to be shared between multiple threads, it needs to be synchronized. Simply putting a mutex/critical-section around the width/height data is fine. Locking like this is only actually a huge performance penalty if there is contention (many threads trying to use that data at once). Keep in mind that lock-free synchronization generally also performs very poorly in situations of high contention!
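For the shared-state option, a sketch with a mutex around the data (member names assumed from the earlier snippet):

#include <mutex>

class Window
{
	int width, height;
	mutable std::mutex sizeLock; // guards width/height

public:
	Window() : width(0), height(0) {}

	void DoSetSize(int w, int h)
	{
		std::lock_guard<std::mutex> g(sizeLock);
		width = w;
		height = h;
	}

	void GetSize(int& w, int& h) const
	{
		std::lock_guard<std::mutex> g(sizeLock); // cheap when uncontended
		w = width;
		h = height;
	}
};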



#15 irreversible   Crossbones+   -  Reputation: 1343


Posted 23 July 2014 - 01:07 AM

In this limited case you could just make the width and height a single 64-bit atomic variable. That will ensure that the size is always what a single call to DoSetSize() asked for. But, necessarily, calling GetSize() cannot guarantee, without external synchronization, that the size remains what it returned one instruction later.

 

 

I kind of figured this out myself after some toying around with my code. My initial idea was to pack all four (position and size) into one qword, limiting both size and position to 16 bits (which frankly should be enough, unless the user has something like five 4K displays tiled horizontally). At least internally, my main concern isn't really whether a call to GetSize() returns the new or old values, but rather that the returned values be correct for any given moment in time. I suspect Windows is doing something of this kind, because it only exposes one API call that can change both the size and position of a window at the same time (SetWindowPos()). Right off the bat I can't find information as to what the maximum and minimum client coordinates or size are.

 

 

 


There are two main categories of multi-threaded communication -- shared state and message passing (with the latter being the right default™, though the former usually being the first that's taught...).
 
You're mixing both of them here -- to change the window size you use message passing, to discover the window size you use shared state. Just pick one! 
Either put a lock around the data, or use messages to retrieve the size as well as setting it (you could either send a 'getSize' message to the window containing the object who wants to know, and have the window send a 'setSize' message back to that object, or, have objects register to receive 'onResized' messages).
 
Also, shared state requires synchronization. If some bit of data is going to be shared between multiple threads, it needs to be synchronized. Simply putting a mutex/critical-section around the width/height data is fine. Locking like this is only actually a huge performance penalty if there is contention (many threads trying to use that data at once). Keep in mind that lock-free synchronization generally also performs very poorly in situations of high contention!

 

Yep - the thing is, I'm figuring this out as I go, and it was a two-step jump in logic to get to this conclusion. The thing I was fixated on was keeping helper parameters in my window class, such as rcClient and rcWindow, while the solution is to store both as combined volatiles and change derived calls such as GetWindowRect() to something like this:


volatile LONGLONG iSize, iPosition;

IRect GetWindowRect()
{
	//read each packed value once; LDWORD()/HDWORD() extract the low/high 32 bits
	LONGLONG posLocal = iPosition;
	LONGLONG sizLocal = iSize;

	int x = LDWORD(posLocal);
	int y = HDWORD(posLocal);
	int w = LDWORD(sizLocal);
	int h = HDWORD(sizLocal);

	return IRect(x, y, x + w, y + h);
}

PS - as for contention: I'm not really worried about it in this case so much as I care about data correctness :)


Edited by irreversible, 23 July 2014 - 01:13 AM.


#16 Hodgman   Moderators   -  Reputation: 30349


Posted 23 July 2014 - 01:33 AM

This relies on reads and writes of LONGLONGs between CPU<->RAM being atomic operations.

 

That's the kind of thing where you're writing code that's making explicit assumptions about the CPU architecture that you're running on... This kind of code should only exist within low-level system-specific modules, such as the internal implementation of std::mutex, std::atomic, etc... i.e. code that you know you'll have to rewrite if you want to recompile for a different CPU.

 

P.S. writing to a LONGLONG is not one atomic operation on 32-bit x86 -- the data will be transferred in 2 operations (or 3 if it's unaligned), which means that this is exactly as unsafe as your previous code!

 

P.P.S. if you're using the volatile keyword (except in the above mentioned situation - writing CPU-dependent implementations of synchronisation primitives) then you have a bug. That's hyperbole, of course, but seriously, in many code bases the use of volatile is automatically rejected or flagged as a potential bug (because it doesn't do what most people think it does, and in 99% of cases, it's not actually useful in writing multithreaded code). In my entire engine, that keyword gets used once.


Edited by Hodgman, 23 July 2014 - 01:34 AM.


#17 King Mir   Members   -  Reputation: 2000


Posted 23 July 2014 - 01:37 AM

Volatile is not for thread synchronization; it's for IO. It's both too stringent and too weak for synchronization. It does not guarantee indivisibility, so it can split reads and writes into two operations. It excessively limits the ability of the compiler to move code around the volatile read or write, but doesn't limit the CPU from doing the same reordering (because the CPU knows that it's not really doing IO). Atomic variables are the correct way to share data like that between threads.



#18 irreversible   Crossbones+   -  Reputation: 1343


Posted 23 July 2014 - 05:00 AM

This relies on reads and writes of LONGLONGs between CPU<->RAM being atomic operations.

 

That's the kind of thing where you're writing code that's making explicit assumptions about the CPU architecture that you're running on... This kind of code should only exist within low-level system-specific modules, such as the internal implementation of std::mutex, std::atomic, etc... i.e. code that you know you'll have to rewrite if you want to recompile for a different CPU.

 

P.S. writing to a LONGLONG is not one atomic operation on 32-bit x86 -- the data will be transferred in 2 operations (or 3 if it's unaligned), which means that this is exactly as unsafe as your previous code!

 

P.P.S. if you're using the volatile keyword (except in the above mentioned situation - writing CPU-dependent implementations of synchronisation primitives) then you have a bug. That's hyperbole, of course, but seriously, in many code bases the use of volatile is automatically rejected or flagged as a potential bug (because it doesn't do what most people think it does, and in 99% of cases, it's not actually useful in writing multithreaded code). In my entire engine, that keyword gets used once.

 

That link was informative. As for the reason for using LONGLONG, I was going by the spec for InterlockedExchange64(), which seems to accept a LONGLONG pointer, but internally operates on an __int64. The operation is apparently atomic regardless of the architecture (32 and 64 bit).
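For example, a sketch of publishing and reading a packed value with those APIs (illustrative names; InterlockedCompareExchange64 with equal exchange/comparand values is a common idiom for an atomic 64-bit read on 32-bit Windows):

#include <windows.h>

volatile LONGLONG g_packedSize = 0;

void PublishSize(LONG w, LONG h)
{
	InterlockedExchange64(&g_packedSize, ((LONGLONG)w << 32) | (ULONG)h);
}

LONGLONG ReadPackedSize()
{
	//swaps in 0 only if the value is already 0, so nothing ever changes -
	//the return value is an atomic snapshot of the 64 bits
	return InterlockedCompareExchange64(&g_packedSize, 0, 0);
}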

 

Also, apparently there's a fair amount of misinformation online regarding the use of volatile and it is sometimes less than straightforward to figure out which author has it wrong. I did some Googling and it's still somewhat unclear to me which library or lock mechanism, for that matter, to use. Not only in this particular case (where I think I can get away with a lock-free solution as there's little to no contention), but in general. Intel's Threading Building Blocks seems like a good library overall with nice cross-platform support, although it's a bit heavy for what I'm doing. Overall my intention is to do simple buffer-swapping between threads - in fact so far the only place where some form of synchronization is needed is responding to user input.

 

As for locks, I take it most OS-specific mechanisms are better optimized than what most libraries can roll anyway. An interesting alternative seems to be a custom spin lock, which would do away with portability issues in one fell swoop, although this article highlights some of the potential performance costs.



#19 Hodgman   Moderators   -  Reputation: 30349


Posted 23 July 2014 - 05:13 AM

As for the reason for using LONGLONG, I was going by the spec for InterlockedExchange64(), which seems to accept a LONGLONG pointer, but internally operates on an __int64. The operation is apparently atomic regardless of the architecture (32 and 64 bit).

Ah, OK - yeah, if you use that instruction you can atomically move 64 bits of data, but a regular default assignment won't do it.

Also, apparently there's a fair amount of misinformation online regarding the use of volatile and it is sometimes less than straightforward to figure out which author has it wrong. I did some Googling and it's still somewhat unclear to me which library or lock mechanism, for that matter, to use.

If you're just after standard shared memory stuff, then pthreads is probably a good choice of library -- it's basically the standard for threads, mutexes, etc... Alternatively, C++11 has actually standardized a lot of threading stuff, so there's std::thread, std::atomic, std::mutex etc -- can't get more standard than that!
 
Regarding volatile, IMHO unless you're writing a device driver, the only place you should use volatile is inside the implementation of std::atomic (or similar reinvented wheel). 

As for locks, I take it most OS-specific mechanisms are better optimized than what most libraries can roll anyway. An interesting alternative seems to be a custom spin lock, which would do away with portability issues in one fell swoop, although this article highlights some of the potential performance costs.

I'm guilty of having one of those here [futex.h, atomic.h], and yes, I've managed to ruin the stability of Windows with it so badly that I was forced to hard-reboot a Win7 machine :S



#20 Lightness1024   Members   -  Reputation: 736


Posted 23 July 2014 - 07:52 AM

It's not like Windows (even 7) was ever really a robust OS:

http://blog.zorinaq.com/?e=74

 

no OOM killer, no completely fair scheduler, no full dynticks, a super-bad IO scheduler... it's just a BIG piece of 1993 lava-flow antipattern in the wild.


Edited by Lightness1024, 23 July 2014 - 07:53 AM.




