Double-check my lock-free voodoo

I glanced over your code and couldn't find any issues with it.
Quote:Original post by Moomin
Quote:Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.
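For the record, here's roughly what the intrinsic looks like -- I'm going from the docs, so treat this as a sketch: it's x64-only, and the destination must be 16-byte aligned.

#include <intrin.h>

// Sketch from the docs (so double-check me): x64 only, Destination must be
// 16-byte aligned, returns nonzero on success. On failure, ComparandResult
// is overwritten with the value actually found at Destination.
bool CAS128(__int64 volatile* destination,
            __int64 exchangeHigh, __int64 exchangeLow,
            __int64* comparandResult)
{
	return _InterlockedCompareExchange128(destination,
		exchangeHigh, exchangeLow, comparandResult) != 0;
}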
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Quote:Original post by Moomin
Quote:Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.


To quote another poster here whose name I can't recall: Visual Studio is pretty dope.

That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.
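You can at least roll your own on top of CAS, though. A hypothetical sketch for floats, built on the plain 32-bit _InterlockedCompareExchange intrinsic:

#include <intrin.h>

// Hypothetical helper: atomic float addition built from a CAS retry loop.
// Assumes 32-bit floats punned onto a 32-bit interlocked slot.
float AtomicAddFloat(float volatile* target, float addend)
{
	union { float f; long l; } oldval, newval;
	do
	{
		oldval.f = *target;               // snapshot the current value
		newval.f = oldval.f + addend;     // compute the desired value
	}
	while (_InterlockedCompareExchange(
		reinterpret_cast<long volatile*>(target),
		newval.l, oldval.l) != oldval.l); // retry if another thread won the race
	return newval.f;
}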
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
Quote:Original post by InvalidPointer
That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Is there a popular C++ compiler that does?
Some comments:
1> Why in the world don't your nodes initialize their pointers to NULL in their constructor?
2> Why are you doing inline assembly in the middle of a many-line function? Why not write a CAS function and call it?

3> Why is anything to do with PendingReads volatile? And why don't you have pop() return the payload? (Oh, you are using a std::stack. Why not store a std::stack of MessageInfo*s? I.e., move the cruft away from the mainline of your code.)

4> Why is anything to do with Read volatile? You do swap things into Read, but you only access Read in a single-threaded manner.

Hmm. That's because of C++ not wanting to let you remove volatile implicitly, isn't it? Meh. Oh well.

So this is what I'd prefer to read/write:
template<typename T>
bool CompareAndSet( T* location, T setValue, T expectedValue )
{
	// ASM goes here
}

class LocklessMailbox
{
	// Stores Node*s in a stack -- moves the single-threaded stuff out of the threading code:
	OneThreadNodeStack Read;

	// A stack of the actual stuff we want to deliver.  Once again, less crud in the tricky
	// threading code:
	std::stack<MessageInfo*> PendingReads;

	// The WriteHead is the hard threading part:
	typedef volatile Node NodeVol;
	typedef NodeVol *pNodeVol;
	typedef volatile pNodeVol pvNodeVol;
	pvNodeVol WriteHead;

public:
	LocklessMailbox()
	{
		WriteHead = new Node;
	}

public:
	// [SNIP]

	MessageInfo* PopTopPendingReads()
	{
		if (PendingReads.empty()) return NULL;
		MessageInfo* retval = PendingReads.top();
		PendingReads.pop();
		return retval;
	}

	// Notice that there is no more volatile in this function.  Nor is there any memory management:
	MessageInfo* GetMessage()
	{
		if (!PendingReads.empty())
			return PopTopPendingReads();

		if (Read.empty())
			SwapReadAndWrite();

		// pop_top() returns the data at the top of the Read stack,
		// and returns NULL if the stack is empty.  It deals with memory
		// management:
		while (MessageInfo* message = Read.pop_top())
		{
			PendingReads.push(message);
		}
		return PopTopPendingReads();
	}

	// Here is the dangerous threading code:
	void AppendMessage( MessageInfo* retval )
	{
		pNodeVol new_head = new Node( retval );
		while (true)
		{
			new_head->Next = WriteHead;
			bool success = CompareAndSet<Node*>( &WriteHead, new_head, new_head->Next );
			if (success)
				return;
		}
	}

	void SwapReadAndWrite()
	{
		while (true)
		{
			Node* ReadHead = Read.GetHead();
			Node* old_write = WriteHead;
			bool success = CompareAndSet<Node*>( &WriteHead, ReadHead, old_write );
			if (success)
			{
				Read.SetHead( old_write );
				return;
			}
		}
	}

	// [SNIP]
};


But this is just cleaning up the code.

On the other hand, if the code were cleaned up like this, I could find errors in your lock-free algorithm implementation far more easily -- I wouldn't have to also look for memory allocation errors and skip over boilerplate code.

With only one consumer, once things are in the Read stack _they are no longer a lock-free algorithm problem_. And the same with PendingReads. Get the code that is not a lock-free algorithm problem as far away from your tricky lock-free algorithm code as is reasonably possible.

Another thing I did was reduce, as much as possible, the amount of state carried across loop iterations. That makes checking for correctness easier.
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but MSDN seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free may not even be worth it in any particular case.

Anyone have any experience with how lock-free code compares to code using locks, performance-wise?


P.S. Oops, forgot the link to the MSDN article.

here
Quote:Original post by Red Ant
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but MSDN seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free may not even be worth it in any particular case.

Anyone have any experience with how lock-free code compares to code using locks, performance-wise?


P.S. Oops, forgot the link to the MSDN article.

here

I've kinda noticed that too. I just posted something on lock-free queues myself :X
Here I thought there might be one or two people who had experimented with them and could offer advice. What I did not expect, however, was that 30 or so people were *also* trying them out and (in some cases) getting hung up on the same issues.


DISCLAIMER: I'm not factoring thread spinning into the equation here; it's too implementation-dependent. All info presented here is AFAIK (read: done in academic happy-land) and may differ from what goes on 'in the real world.'

In regards to your assertion: in *theory,* a lock-free algorithm modifying a single value should be roughly twice as fast in terms of synchronization overhead. Since you must both enter and exit a lock, that's two interlocked operations right there, neither of which does any real work. In practice, however, results can go all over the place. If you do every operation in an interlocked manner, it's likely going to be slower, and locks would be the better option in that scenario. It depends on what you're trying to do and how frequently something's going to be modified.
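If you want a rough feel for the raw overhead, something like this crude, zero-contention sketch would do (GetTickCount resolution is coarse, so take the numbers with a huge grain of salt):

#include <windows.h>
#include <iostream>

// Crude, zero-contention sketch: measures raw synchronization overhead only.
// Real behavior under contention can look completely different.
int main()
{
	const int ITERATIONS = 10000000;
	volatile LONG counter = 0;

	CRITICAL_SECTION cs;
	InitializeCriticalSection(&cs);

	DWORD start = GetTickCount();
	for (int i = 0; i < ITERATIONS; ++i)
		InterlockedIncrement(&counter);    // one interlocked op per iteration
	DWORD lockfree = GetTickCount() - start;

	start = GetTickCount();
	for (int i = 0; i < ITERATIONS; ++i)
	{
		EnterCriticalSection(&cs);         // one interlocked op going in...
		++counter;
		LeaveCriticalSection(&cs);         // ...and another coming out
	}
	DWORD locked = GetTickCount() - start;

	DeleteCriticalSection(&cs);
	std::cout << "lock-free: " << lockfree << "ms, locked: " << locked << "ms\n";
	return 0;
}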
clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.
I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

It's taken a lot of careful thought, but I've got a mostly lockless system going; the only locks are when threads are started or stopped. During a thread start/stop, all access to message mailboxes is blocked. However, concurrent message passes from separate threads will not block each other. This means we can sustain a high throughput of messages, without sacrificing safety during thread init/cleanup code.
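To picture the gating, here's a hypothetical sketch -- not the actual Epoch code -- using Vista's slim reader/writer locks, where the mailbox and message types are stand-ins: message passes take the shared side, thread init/cleanup takes the exclusive side.

#include <windows.h>

// Hypothetical sketch only. Concurrent message passes share the gate and
// never block each other; thread start/stop takes it exclusively and stalls
// all traffic while the mailbox set changes. Requires Vista+ SRW locks.
SRWLOCK MailboxGate = SRWLOCK_INIT;

template <typename MailboxT, typename MessageT>
void PassMessage(MailboxT& box, MessageT* msg)
{
	AcquireSRWLockShared(&MailboxGate);    // any number of senders at once
	box.AddMessage(msg);                   // the lock-free CAS push
	ReleaseSRWLockShared(&MailboxGate);
}

void OnThreadStartOrStop()
{
	AcquireSRWLockExclusive(&MailboxGate); // blocks all message traffic
	// ... register or tear down the thread's mailbox here ...
	ReleaseSRWLockExclusive(&MailboxGate);
}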

In general, IMHO lock-free code is totally not worth the effort. You need some ridiculous contention on a locked resource before locking really becomes a performance issue, and up until that point it's much easier to get locking right than it is to get lock-free right.



NotAYakk - thanks for the feedback. I'll post up an improved version of the code shortly.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Right, here's the current implementation, straight from the Epoch language project:

//
// Helper function for atomic compare-and-swap
// Returns true on success
//
template <class DataType>
bool CompareAndSwap(DataType* field, DataType oldvalue, DataType newvalue)
{
	DataType retval;
	_asm
	{
		mfence
		mov eax, oldvalue
		mov ecx, newvalue
		mov edx, dword ptr [field]
		lock cmpxchg dword ptr [edx], ecx
		mov retval, eax
		mfence
	}
	return (retval == oldvalue);
}

//
// This class encapsulates a lockless mailbox algorithm for asynchronous message passing.
// Each thread is granted a mailbox when it is started up; this mailbox is used for all
// messages passed into that thread. Messages are dequeued in FIFO order.
//
// IMPORTANT: this algorithm is only safe for a many producer/single consumer scenario.
//            Critical assumptions made by the code are only valid if a single consumer
//            thread is used. However, any number of producer threads is permitted.
//
// The algorithm uses three distinct stacks: a read stack, a write stack, and a cache.
// Producers push items onto the write stack using a simple and well-known CAS method.
// When the consumer thread arrives to retrieve a message, it first checks the cache
// for any waiting messages. If the cache is empty, the consumer thread swaps the read
// and write stacks. At this point, the read stack is controlled entirely by the consumer
// thread, so we don't need to worry about contention for elements in the read stack.
// The entire read stack is then traversed and its contents pushed into the cache.
// Note that we achieve FIFO semantics only by using this extra cache stack.
//
template <class PayloadType>
class LocklessMailbox
{
// Construction and destruction
public:
	//
	// Construct and initialize the read and write stacks
	//
	LocklessMailbox()
	{
		std::auto_ptr<Node> wh(new Node);
		std::auto_ptr<Node> rh(new Node);
		WriteHead = wh.release();
		ReadHead = rh.release();
	}

	//
	// Clean up the read, write, and cache stacks, freeing any
	// remaining messages from each stack.
	//
	~LocklessMailbox()
	{
		Release();
	}

// Message passing interface
public:
	//
	// Register a message from a producer thread. Any number of threads
	// may call this function.
	//
	void AddMessage(PayloadType* info)
	{
		Node* msgnode = new Node(info);
		while(true)
		{
			msgnode->Next = WriteHead;
			bool success = CompareAndSwap<Node*>(&WriteHead, msgnode->Next, msgnode);
			if(success)
				return;
		}
	}

	//
	// Retrieve a message from the mailbox.
	// IMPORTANT: only ONE consumer thread (per mailbox) should call this function
	//
	PayloadType* GetMessage()
	{
		if(!PendingReads.empty())
			return PopPendingRead();

		if(ReadHead->Next == NULL)
			SwapReadAndWrite();

		Node* n = ReadHead;
		while(ReadHead->Next)
		{
			PendingReads.push(n);
			n = n->Next;
			ReadHead = n;
		}

		if(PendingReads.empty())
			return NULL;

		return PopPendingRead();
	}

// Internal helpers
private:
	//
	// Pop a pending read from the cache stack
	//
	PayloadType* PopPendingRead()
	{
		Node* readnode = PendingReads.top();
		PayloadType* payload = readnode->Payload;
		delete readnode;
		PendingReads.pop();
		return payload;
	}

	//
	// Internal helper for swapping the read/write stacks
	// See class comment for details
	//
	void SwapReadAndWrite()
	{
		while(true)
		{
			Node* readhead = ReadHead;
			Node* oldwrite = WriteHead;
			bool success = CompareAndSwap<Node*>(&WriteHead, oldwrite, readhead);
			if(success)
			{
				ReadHead = oldwrite;
				return;
			}
		}
	}

	//
	// Release the mailbox and free any remaining messages
	//
	void Release()
	{
		while(!PendingReads.empty())
		{
			Node* readnode = PendingReads.top();
			delete readnode->Payload;
			delete readnode;
			PendingReads.pop();
		}

		Node* n = WriteHead;
		while(n)
		{
			delete n->Payload;
			Node* nextn = n->Next;
			delete n;
			n = nextn;
		}

		n = ReadHead;
		while(n)
		{
			delete n->Payload;
			Node* nextn = n->Next;
			delete n;
			n = nextn;
		}
	}

// Internal tracking
private:
	struct Node
	{
		Node() : Next(NULL), Payload(NULL) { }
		Node(PayloadType* p) : Next(NULL), Payload(p) { }

		Node* Next;
		PayloadType* Payload;
	};

	Node* WriteHead;
	Node* ReadHead;
	std::stack<Node*> PendingReads;
};



The only thing I'm not sure of at this point is the memory fences, but that should be trivial to fix now that the CAS operation is centralized.
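For comparison, a possible intrinsic-based version of the same CAS, assuming a 32-bit pointer-sized DataType. Since lock cmpxchg already acts as a full memory barrier on x86, the explicit mfences should be redundant:

#include <intrin.h>

// Possible intrinsic-based CompareAndSwap, assuming DataType is a 32-bit
// pointer or integer. lock cmpxchg is already a full barrier on x86, so no
// separate mfence instructions should be needed.
template <class DataType>
bool CompareAndSwap(DataType* field, DataType oldvalue, DataType newvalue)
{
	return _InterlockedCompareExchange(
		reinterpret_cast<long volatile*>(field),
		reinterpret_cast<long>(newvalue),
		reinterpret_cast<long>(oldvalue)) == reinterpret_cast<long>(oldvalue);
}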

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

If you potentially have more than one thread per hardware thread, lock-free algorithms avoid priority inversion, and they also scale better under high contention. That said, it's debatable whether lock-free algorithms will bring any benefit on many-core architectures, so I guess we just have to wait until we get our hands on those to see how it turns out in practice (: But if you've already gone to the trouble of writing lock-free algorithms, they are trivial to convert to versions that use optimally short locks, so you can measure the difference yourself.
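As a concrete illustration of that last point, here's a hypothetical locked counterpart to the AddMessage above; the critical section spans just the two pointer writes:

// Hypothetical locked counterpart to LocklessMailbox::AddMessage above,
// for measuring the difference. Assumes a CRITICAL_SECTION member that
// was set up with InitializeCriticalSection in the constructor.
void AddMessage(PayloadType* info)
{
	Node* msgnode = new Node(info);    // allocation stays outside the lock
	EnterCriticalSection(&WriteLock);  // lock held for two pointer writes only
	msgnode->Next = WriteHead;
	WriteHead = msgnode;
	LeaveCriticalSection(&WriteLock);
}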
