Here's an implementation of a lock-free multi-reader/multi-writer array-based queue

21 comments, last by Prune 15 years, 5 months ago
Prune, how does your implementation compare with Herb Sutter's exploration of concurrency in the last few issues of Dr. Dobb's? Writing a Generalized Concurrent Queue

And can you go back and edit your initial post so that it doesn't stretch the page out so widely? [smile] The source boxes shouldn't stretch out like that, but they are. Try breaking up the "TODO" comments into multiple lines and placing some of the trailing comments on their own line. That might resolve the stretching.
"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Quote:Original post by LessBread
Prune, how does your implementation compare with Herb Sutter's exploration of concurrency in the last few issues of Dr. Dobb's? Writing a Generalized Concurrent Queue

If you look at page 4, where he references fully non-blocking queues, he refers to Michael and Scott's algorithm, which I also referred to above when I mentioned linked-list-based queues; that algorithm is also discussed in one of the verification papers I mentioned.
For both the array and linked-list implementations, Sutter suggests a DCAS is required, but he is in fact mistaken--it's not necessary. See the single-word CAS paper I referenced earlier (I'm going to forward that reference to Sutter). There is a subtlety here: though this last example is an array-based queue, it does use a linked list for the LL/SC emulation. The efficiency impact is small--indeed, this approach is actually faster than DCAS on systems with more than a few processors.
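For concreteness, this is the general shape of a single-word CAS commit--a minimal modern-C++ sketch of the read/compute/compare-exchange retry loop that these algorithms reduce to, not the paper's algorithm itself (the names are mine):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-word CAS retry loop: read, compute the new value, try to publish it,
// and retry if another thread committed first. Lock-free queue algorithms
// boil their commit step down to one of these on a single machine word.
inline uint32_t fetch_increment_mod(std::atomic<uint32_t>& word, uint32_t limit)
{
    uint32_t old = word.load(std::memory_order_relaxed);
    uint32_t next;
    do {
        next = (old + 1 == limit) ? 0 : old + 1; // wrap at limit
        // compare_exchange_weak reloads 'old' on failure, so each retry
        // recomputes 'next' against the latest published value.
    } while (!word.compare_exchange_weak(old, next,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
    return old; // the value we successfully replaced
}
```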

Quote:And can you go back and edit your initial post so that it doesn't stretch the page out so widely? [smile] The source boxes shouldn't stretch out like that, but they are. Try breaking up the "TODO" comments into multiple lines and placing some of the trailing comments on their own line. That might resolve the stretching.

Done.

[Edited by - Prune on November 17, 2008 3:43:20 PM]
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Cool. Hopefully Sutter will respond.

Thanks for the formatting clean up (there's more to be done but of secondary importance).

"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Quote:Original post by LessBread
Prune, how does your implementation compare with Herb Sutter's exploration of concurrency in the last few issues of Dr. Dobb's? Writing a Generalized Concurrent Queue
I can't speak for Prune's implementation, but Sutter teaches us that locality is very important for concurrency. Therefore I was confused by his choice of a linked list over an array when implementing the queue in his article - wouldn't an array be much more cache friendly?
I posted it on Sutter's blog.

Interestingly, Intel's compiler doesn't seem to have an intrinsic for the double quad-word CAS cmpxchg16b, even though MSVC does (_InterlockedCompareExchange128), though MSDN warns that it's slow. Perhaps this omission is an indication that Intel might drop the instruction from future x86-64 CPUs (there have been rumors going back to 2004) :(
Itanium only has cmp8xchg16 (_InterlockedCompare64Exchange128), which swaps two quad-words but compares only one, so it's basically useless for this algorithm; on IA64 the Evequoz algorithm would need to be used instead. Given the slowness consideration about cmpxchg16b, that algorithm might actually be comparable in speed on x86-64 as well.

Bleh, linking to other stuff that uses the intrinsics, such as SDL, gives

"error C2733: second C linkage of overloaded function '_InterlockedCompareExchange' not allowed"

So I had to change the arguments of the interlocked intrinsics in processor.h back to long from unsigned long.

[Edited by - Prune on November 17, 2008 6:45:09 PM]
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Going back to your sizeof(T)==4 restriction - AFAIK this isn't required.

The only variables that need to be atomic (i.e. need to use interlocked exchange) are the head/tail variables - the actual data array shouldn't require any special protection. See Sutter's words on "do work, then publish": the actual data array is the work, and head/tail are used to publish.
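As a minimal illustration of "do work, then publish" (a hedged sketch using std::atomic; the names are illustrative, not taken from my FIFO):

```cpp
#include <atomic>
#include <cassert>

// "Do work, then publish": the payload is written with plain stores, then a
// release store on the flag makes it visible; readers acquire-load the flag
// before touching the payload. The acquire/release pair orders the accesses.
struct Slot {
    int data = 0;                   // the "work"
    std::atomic<bool> ready{false}; // the "publish" flag
};

inline void publish(Slot& s, int value)
{
    s.data = value;                                 // do work
    s.ready.store(true, std::memory_order_release); // then publish
}

inline bool try_consume(Slot& s, int& out)
{
    if (!s.ready.load(std::memory_order_acquire))   // not published yet
        return false;
    out = s.data; // safe: acquire pairs with the writer's release
    return true;
}
```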

My FIFO uses a special Index class for the head/tail vars. This class uses a 32-bit compare-and-swap and allows the head/tail vars to be incremented while making sure that they cannot increment past each other (turning a full queue into an empty queue by accident). There's also some hackery to avoid the ABA problem. Locking is also provided to allow the array to be resized. 16 bits are given to the index, 15 to "solving" ABA, and 1 bit acts as a mutex.

N.B. The TAtomic class wraps up the CAS operation with the function SetIfEqual(new,old).
//copyright 2008 Hodgman - Free for personal or educational use only ;)
enum IncrementType
{
    Pre,
    Post
};
/** @brief Integer type that maintains a counter of modifications made.
 * The Counted Index is designed to avoid the "ABA problem" in lock-free code.
 * It can also act as a fast user-space mutex.
 * @todo finish documenting this class
 */
class CCountedIndex
{
public:
    CCountedIndex(uint32 v=0) : i(v) {}
    CCountedIndex(const CCountedIndex& o) : i(o.i) {}
    /** Tries to lock the mutex bit
     * @return true if the mutex was locked, else false */
    bool Lock()
    {
        uint v = i;
        if( v & LockMask )
            return false;
        return i.SetIfEqual( v | LockMask, v );   //set lock
    }
    /** Unlocks the mutex bit. The mutex is assumed to be locked prior to calling. */
    void Unlock()
    {
        uint v = i;
        ASSERT( v & LockMask );
        i = v & ~LockMask;                        //remove lock
    }
    /** Assigns a new value to the index bits. No thread-safety checks! */
    void Set( uint32 index )
    {
        uint v = i;
        ASSERT( v & LockMask );
        uint n = ((v+(1<<IndexBits))&CountMask) | //increment counter
                 ((index)&IndexMask)            | //set index
                 LockMask;                        //keep lock
        i = n;
    }
    /** Increment the index bits. May fail due to concurrent modifications.
     * @param limit One past the maximum index value. Index will wrap to 0 if limit is reached.
     * @return true if the index was incremented, else false */
    bool Increment( uint32 limit )
    {
        uint v = i;
        if( v & LockMask )
            return false;
        uint n = ((v+(1<<IndexBits))&CountMask) | //increment counter
                 Inc(v&IndexMask,limit);          //increment index
        return i.SetIfEqual( n, v );
    }
    /** . */
    template<IncrementType T, bool L>
    uint32 Increment( uint32 limit, uint32 full, bool& fail )
    {
        uint32 v = i;
        ASSERT( !L || (v & LockMask) );
        if( !L && (v & LockMask) )
        {
            return std::numeric_limits<uint32>::max();
        }
        uint oldIndex = v&IndexMask;
        uint newIndex = Inc(oldIndex, limit);     //increment index
        fail = (newIndex == full);
        if( fail )
            return std::numeric_limits<uint32>::max();
        uint n = newIndex |
                 ((v+(1<<IndexBits))&CountMask) | //increment counter
                 (L?LockMask:0);
        return i.SetIfEqual( n, v )
            ? (T==Pre?oldIndex:newIndex)
            : std::numeric_limits<uint32>::max();
    }
    /** . */
    template<IncrementType T, bool L>
    uint32 Increment( uint32 limit, const CCountedIndex& original )
    {
        uint32 v = i;
        ASSERT( !L || (v & LockMask) );
        if( v != original.i || (!L&&(v & LockMask)) )
            return std::numeric_limits<uint32>::max();
        uint oldIndex = v&IndexMask;
        uint newIndex = Inc(oldIndex, limit);     //increment index
        uint n = newIndex |
                 ((v+(1<<IndexBits))&CountMask) | //increment counter
                 (L?LockMask:0);
        return i.SetIfEqual( n, v )
            ? (T==Pre?oldIndex:newIndex)
            : std::numeric_limits<uint32>::max();
    }
    /** . */
    void IncrementAndUnlock( uint32 limit )
    {
        uint v = i;
        ASSERT( v & LockMask );
        uint n = ((v+(1<<IndexBits))&CountMask) | //increment counter
                 Inc(v&IndexMask,limit);          //increment index
        i = n;
    }
    /** . */
    bool IncrementAndLock( uint32 limit )
    {
        uint v = i;
        if( v & LockMask )
            return false;
        uint n = ((v+(1<<IndexBits))&CountMask) | //increment counter
                 Inc(v&IndexMask,limit)         | //increment index
                 LockMask;                        //set lock
        return i.SetIfEqual( n, v );
    }
    /** . */
    operator uint32() { return i&IndexMask; }
    /** . */
    bool operator==( uint32 j ) { return (uint32(i)&IndexMask) == j; }
private:
    CCountedIndex& operator=(const CCountedIndex& o);
    inline uint32 Inc(uint32 i, uint32 size) { ASSERT(i<IndexMask); return (i+1==size)?0:i+1; }
    TAtomic<uint32> i;
    const static uint IndexMask = 0x0000FFFF;
    const static uint CountMask = 0x7FFF0000;
    const static uint LockMask  = 0x80000000;
    const static uint IndexBits = 16;
    const static uint CountBits = 15;
    const static uint LockBits  = 1;
};
Quote:Original post by Hodgman
I can't speak for Prune's implementation, but Sutter teaches us that locality is very important for concurrency. Therefore I was confused by his choice of a linked-list over an array when implementing the queue in his article - wouldn't an array be much more cache friendly?


Well if you have two threads writing to the same cache line often (or simply one reading and one writing), that's very bad for concurrency. By putting the elements in a list, you reduce the chance of this happening. See ~1:18:40 in http://video.google.com/videoplay?docid=-4714369049736584770.
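For arrays, a common mitigation is to pad each element out to its own cache line. A hedged sketch, assuming 64-byte lines (C++17 offers std::hardware_destructive_interference_size; the names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Pad queue elements to a cache line so that neighbouring slots, written by
// different threads, never share a line (avoiding false sharing). 64 bytes is
// an assumption about the target; adjust for the actual hardware.
constexpr std::size_t kCacheLine = 64;

template <typename T>
struct alignas(kCacheLine) PaddedSlot {
    T value;
    // alignas rounds sizeof(PaddedSlot) up to a multiple of kCacheLine, so
    // consecutive elements of a PaddedSlot<T> array start on distinct lines.
};

static_assert(sizeof(PaddedSlot<int>) % kCacheLine == 0, "slot spans whole lines");
static_assert(alignof(PaddedSlot<int>) == kCacheLine, "slot is line-aligned");
```

The trade-off is obvious: the padding burns memory and reduces how many elements fit in cache, which is why striding only pays off when contention on neighbouring slots is the real bottleneck.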
Hodgman, I don't see how you'd handle the following situation:
You have a queue as follows: 00X0000 (X represents one element, 0 empty).
Counting from 0, head is at 2 and tail at 3.
Three threads initiate writes. The tail is incremented atomically three times to 6; the threads publish 3, 4, and 5 respectively for work. Then they are simultaneously writing into those locations. A reader thread can read locations 3-5 before their writes have completed, which is incorrect behavior.

You could do something with two heads and two tails, but then the code complexity becomes equivalent to the modified Shann algorithm I used.

Note also that there are several formally proved properties about that algorithm's correctness, meeting invariants of the definition of a queue, linearizability, and being fully non-blocking, which is what decided the issue for me. Finally, a 32-bit CAS can be used if you instead consider one 16 bit word to be the reference counter and the other, instead of a pointer or whatever, to be a 16 bit index into an array (or some other subdivision of the 32 bits).
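A rough sketch of that packing (illustrative names and field widths, not the modified Shann code itself): the counter and index share one 32-bit word, so a single CAS updates both atomically.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 16-bit ABA counter in the upper half, 16-bit array index in the lower half.
// One 32-bit CAS commits both: a stale index accompanied by a fresh counter
// no longer compares equal, which is what defeats the ABA problem.
constexpr uint32_t kIndexMask = 0x0000FFFFu;
constexpr uint32_t kCountStep = 0x00010000u; // +1 in the upper 16 bits

inline bool advance(std::atomic<uint32_t>& packed, uint32_t limit)
{
    uint32_t v = packed.load(std::memory_order_relaxed);
    uint32_t idx = v & kIndexMask;
    uint32_t next = (idx + 1 == limit) ? 0 : idx + 1;     // wrap the index
    uint32_t n = ((v + kCountStep) & ~kIndexMask) | next; // counter+1, new index
    // Fails (returns false) if another thread changed either field meanwhile.
    return packed.compare_exchange_strong(v, n, std::memory_order_acq_rel);
}

inline uint32_t index_of(uint32_t packed) { return packed & kIndexMask; }
```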

the_edd, one could use a strided "array" instead with elements spaced at least one cache line apart :P

BTW I'm really having a problem with the signed/unsigned thing. How do I avoid having to put reinterpret_cast for the first argument of every _InterlockedExchange??() ? (plus, would the standard conversion preserve bit patterns of the other, non-pointer, arguments? I think this would be the case on 2's complement architectures)
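As a spot check on the bit-pattern half of the question (hedged: unsigned-to-signed conversion of out-of-range values is implementation-defined before C++20, but all mainstream x86/x64 compilers wrap modulo 2^N):

```cpp
#include <cassert>
#include <cstring>

// Round-trip a value through the signed type and compare the raw bytes.
// On two's-complement targets, unsigned long -> long -> unsigned long
// preserves every bit pattern, so passing values through the signed
// intrinsic signatures does not corrupt them. A spot check, not a proof.
inline bool roundtrip_preserves(unsigned long u)
{
    long s = static_cast<long>(u); // implementation-defined pre-C++20 if u > LONG_MAX
    unsigned long back = static_cast<unsigned long>(s); // well-defined, modulo 2^N
    return std::memcmp(&u, &back, sizeof u) == 0;
}
```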
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Quote:Original post by Prune
Hodgman, I don't see how you'd handle the following situation:
You have a queue as follows: 00X0000 (X represents one element, 0 empty).
Counting from 0, head is at 2 and tail at 3.
Three threads initiate writes. The tail is incremented atomically three times to 6; the threads publish 3, 4, and 5 correspondingly for work. Then they are simultaneously writing into those locations. A reader thread can read locations 3-5 before their writes have completed, which is incorrect behavior.
Sorry, I forgot to mention that my array is a structure like this:
struct Node
{
    T data;
    CAtomic valid;
    Node() : valid(0) {}
    Node(const Node& o) : data(o.data), valid(o.valid) {}
};
TAtomic<Node*> array;
CCountedIndex  head, tail;
The 'valid' variable within the array is used as a flag to indicate whether the write has completed yet or not.
If a pop operation occurs on a node that is not yet valid, then it has to wait for the writer to complete its operation (and set the flag).
Push looks something like:
uint32 limit = 7; //size of array
uint32 full = head;
bool isFull;
uint32 write = tail.Increment<Pre,false>( limit, full, isFull );
if( write != std::numeric_limits<uint32>::max() ) //increment succeeded
{
    array[write].data = v;  //do work
    array[write].valid = 1; //then publish
    break; //this snippet is inside a while(true) loop
}
else if( isFull ) //increment failed because array is full
{
    //try to lock head+tail and resize array
}
But this is in essence a spin-lock, so your algorithm cannot be considered lock-free. Can you give guarantees that lock convoys, priority inversion, livelock/deadlock, etc. can never occur?
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)

