# Critique my threadsafe lockless queue!


## Recommended Posts

I'm practically a complete noob at multithreaded programming, but I'm trying to learn. So I attempted to build a threadsafe queue. It's intended to support multiple writers and a single reader. I want it to work on x86 platforms, and it would be even better if it worked on PPC (Xbox 360 and whatnot) even with all that CPU's crazy instruction reordering (one particular spot that worries me is whether the middle line inside Push() can get reordered AFTER the final increment of m_writer). So, what is wrong with it? What haven't I accounted for? How can I make it better/faster/cheaper?
```cpp
template <typename T, int MAX_SIZE = 1024>
class MyQueue
{
    T m_stuff[MAX_SIZE];

    // index of the next item that will be popped
    LONG m_reader;

    // index of the most recent completed write
    LONG m_writer;

    // index of the next available spot for writing
    LONG m_nextWrite;

public:

    MyQueue()
    {
        m_reader = 0;
        m_writer = 0;
        m_nextWrite = 0;
    }

    // attempt a push. Returns true if successful, or false if the queue is full.
    bool Push(T t)
    {
        // make sure we don't overwrite old values
        if (GetCount() < MAX_SIZE - 1)
        {
            // reserve a spot for writing
            LONG spotToWrite = InterlockedIncrement(&m_nextWrite) % MAX_SIZE;

            // do the write
            m_stuff[spotToWrite] = t;

            // update the "write" cursor
            InterlockedIncrement(&m_writer);

            return true;
        }

        return false;
    }

    // pop a value and return true, or return false if nothing left to pop
    // (this only supports single readers)
    bool Pop(T& t)
    {
        if (GetCount() > 0)
        {
            t = m_stuff[m_reader % MAX_SIZE];
            m_reader++;
            return true;
        }

        return false;
    }

    int GetCount() { return m_writer - m_reader; }

private:

    static LONG InterlockedIncrement(volatile LONG* target)
    {
        LONG start, newValue;
        do
        {
            start = *target;
            newValue = start + 1;
        } while (InterlockedCompareExchange(target, newValue, start) != start);

        return start;
    }
};
```


##### Share on other sites
Any particular reason you don't use the built-in Windows InterlockedIncrement function? Also, I don't see anything that guarantees that your values are aligned on a 32-bit boundary, which is necessary for the built-in interlocked functions to work correctly.

Your push function is not safe; suppose a thread is interrupted between the "do the write" step and the "update the write cursor" step. Since you want multiple writers, it is possible that two writers stomp on each other in this function.

Pop() may also give incorrect results if it is interrupted during the GetCount() check. Even though you just want a single reader there's a chance things can go wrong if the timing is just so.

You can find a working, debugged implementation of a lockless queue in the Epoch language SDK. The relevant files are found in Source/Shared/Utility/Threading.

##### Share on other sites
Quote:
 Original post by ApochPiQ: Any particular reason you don't use the built-in Windows InterlockedIncrement function?

Hmm, I suppose not. The only difference between my function and the Windows function is that I return the final value - 1, but the Windows function returns the final value. I think I could easily code around the difference.

Quote:
 Also, I don't see anything that guarantees that your values are aligned on a 32-bit boundary, which is necessary for the built-in interlocked functions to work correctly.

Aren't normal integers always aligned to 32-bit boundaries by default? If not, how would I align them? Using the __declspec(align(#)) stuff?
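For what it's worth, modern C++ has a portable answer to this question (a sketch of mine, not from the thread; alignas/alignof arrived with C++11, well after this discussion, and the struct name is made up):

```cpp
#include <cassert>

// alignas is the standard spelling of MSVC's __declspec(align(#)), and
// alignof lets you verify what the compiler actually did. Typical compilers
// do give a 32-bit integer member 4-byte alignment by default, unless a
// packing pragma says otherwise.
struct Cursors {
    alignas(4) long writer;     // explicitly request at least 4-byte alignment
    alignas(4) long nextWrite;
};

static_assert(alignof(Cursors) >= 4, "interlocked ops need 4-byte alignment");
```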

Quote:
 Your push function is not safe; suppose a thread is interrupted between the "do the write" step and the "update the write cursor" step. Since you want multiple writers, it is possible that two writers stomp on each other in this function. Pop() may also give incorrect results if it is interrupted during the GetCount() check. Even though you just want a single reader there's a chance things can go wrong if the timing is just so.

But the code gets the index to write to first before it does the write, so each thread should be guaranteed to be writing to a unique index (right?). The next InterlockedIncrement() call after the write is just for the benefit of the reader thread, not the writers.

I guess there is a chance that GetCount() could be incorrect, but worst case it should only be less than the actual correct value. So the worst that could happen in Pop() is that it incorrectly thinks the queue is empty. But that would probably only happen when the queue is nearly empty anyway.

Still, now I see a problem in Push(). Several threads could get past the "make sure we don't overwrite old values" check and end up overwriting values before the reader has a chance to read them. (I'm not sure the code was clear, but my goal with that "make sure we don't overwrite old values" check was to prevent the write cursor getting too far ahead of the read cursor and loop back around and start overwriting values written earlier, not to prevent multiple threads from overwriting data should they all get into the Push() method at the same time. That's what the "reserve a spot for writing" part is for.)

Quote:
 You can find a working, debugged implementation of a lockless queue in the Epoch language SDK. The relevant files are found in Source/Shared/Utility/Threading.

Wow, I was hoping I could get away with something really simple. Guess I have some studying to do.

Thanks!

##### Share on other sites
Quote:
 Original post by BradDaBug: But the code gets the index to write to first before it does the write, so each thread should be guaranteed to be writing to a unique index (right?). The next InterlockedIncrement() call after the write is just for the benefit of the reader thread, not the writers.

Quote:
```cpp
bool Push(T t)
{
    // make sure we don't overwrite old values
    if (GetCount() < MAX_SIZE - 1)
    {
        // reserve a spot for writing
        LONG spotToWrite = InterlockedIncrement(&m_nextWrite) % MAX_SIZE;

        // do the write
        m_stuff[spotToWrite] = t;

        // update the "write" cursor
        InterlockedIncrement(&m_writer);
```

Let MAX_SIZE be 3. Let queue be empty.

On a 4-core PC, four threads all call GetCount() and determine there is enough room (0 elements, MAX_SIZE=3).

```
Thread    Spot
1         0
2         1
3         2
4         0
```

Threads 1 and 4 now race for spot 0.

After all 4 threads push, the size of the queue is 4, but MAX_SIZE is 3. So the queue contains one more element than it can hold...

One possible flow:
```
CPU Tick   Thrd1            Thrd2            Thrd3            Thrd4
1          GetCount()       GetCount()       GetCount()       GetCount()
2                           InterlockedInc
3                                                             InterlockedInc
4                           m_stuff[] = t;   InterlockedInc
5          InterlockedInc                    m_stuff[] = t;   m_stuff[] = t;
6          m_stuff[] = t;
```

Always remember that without locks, you have no order guarantees.

And this is just the 50,000-foot view. It doesn't take into consideration memory or ordering issues and whatnot....

##### Share on other sites
I've been trying to tweak stuff to fix that bug in Push(), and the solution I'm looking at right now involves adding an extra CAS function in Push() and an extra one in Pop(). At what point does the overhead of all those CAS functions start to add up to the point where there's no advantage over a simple mutex?
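To make that trade-off concrete, here is a minimal sketch (using C++11 std::atomic and std::mutex as stand-ins for the Win32 primitives; function names are illustrative, not from the thread). Both versions are correct; only measuring under your real contention level can say which is faster.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// N threads bump a counter with a single atomic RMW per increment...
long run_atomic(int threads, int iters) {
    std::atomic<long> c{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) c.fetch_add(1);
        });
    for (auto& th : pool) th.join();
    return c.load();
}

// ...versus a mutex acquire/release per increment.
long run_mutex(int threads, int iters) {
    long c = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> g(m);
                ++c;
            }
        });
    for (auto& th : pool) th.join();
    return c;
}
```

The rough intuition: one uncontended CAS is much cheaper than a mutex round-trip, but a chain of several CAS operations under heavy contention can retry enough to erase the advantage, which is exactly the question being asked.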

##### Share on other sites
Here's my second attempt. Ignore what I said earlier about the extra CAS functions. This just avoids the overwriting old values bug in Push() by checking if the queue is full AFTER the m_nextWrite increment. If the queue is full then it un-increments m_nextWrite and bails out.

```cpp
template <typename T, int MAX_SIZE = 1024>
class MyQueue
{
    T m_stuff[MAX_SIZE];

    // index of the next item that will be popped
    LONG m_reader;

    // index of the most recent completed write
    LONG m_writer;

    // index of the next available spot for writing
    LONG m_nextWrite;

public:
    MyQueue()
    {
        m_reader = 0;
        m_writer = 0;
        m_nextWrite = 0;
    }

    // attempt a push. Returns true if successful, or false if the queue is full.
    bool Push(T t)
    {
        // reserve a spot for writing
        LONG spotToWrite = InterlockedIncrement(&m_nextWrite) - 1;

        // check if the queue is full
        // (it's possible for m_reader to be bigger IRL than the value we get
        // here, but worst case is the queue will appear full when it isn't)
        if (spotToWrite - m_reader >= MAX_SIZE - 1)
        {
            // queue is full, so back out
            InterlockedDecrement(&m_nextWrite);
            return false;
        }

        // do the write
        m_stuff[spotToWrite % MAX_SIZE] = t;

        // update the "write" cursor
        InterlockedIncrement(&m_writer);
        return true;
    }

    // pop a value and return true, or return false if nothing left to pop
    // (this only supports single readers)
    bool Pop(T& t)
    {
        if (GetCount() > 0)
        {
            t = m_stuff[m_reader % MAX_SIZE];
            m_reader++;
            return true;
        }
        return false;
    }

    int GetCount() { return m_writer - m_reader; }
};
```

Edit: fixed a tiny bug

[Edited by - BradDaBug on September 10, 2009 9:05:01 AM]

##### Share on other sites
Quote:
```cpp
LONG spotToWrite = InterlockedIncrement(&m_nextWrite) - 1;

if (spotToWrite - m_reader >= MAX_SIZE - 1)
{
    // queue is full, so back out
    InterlockedDecrement(&m_nextWrite);
    return false;
}
```

Let MAX_SIZE be 3.
Let queue have 2 elements in it.
m_reader = 0
spotToWrite = 2

3 threads on 3 cores push concurrently.
1 thread on a separate core reads.

IDX = spotToWrite
```
Tick   T1               T2               T3               T4
1      IDX=2            IDX=3            IDX=4
2      IDX-m_reader=2                                     pop/m_reader++
3      mstuff[2] = t    IDX-m_reader=2   IDX-m_reader=3
4                       mstuff[2] = t    return
```

Oops. T2 overwrote the value of T1.

The only way to solve this is to spin on both push and pop.

Simply put, instead of increment/decrement, the solution would be to make both push and pop spin on InterlockedExchange. But that is only part of the problem.

A more annoying problem is how to solve the case when queue is full. If push keeps spinning, then it has the potential of stalling, even dead-locking all threads that attempt to push, until an element is popped.

Typically, with fixed-size queues, one needs to make operations optional. Pseudo writer push code would look something like this:
```cpp
void someMainLoop() {
  vector<int> buffer;
  locklessqueue<int> q;
  while (true) {
    // the logic, ....
    while (!buffer.empty() && q.push(buffer.front()))
      buffer.pop_front();
    ...
    // stuff to put on queue
    buffer.push_back(...);
  }
};
```

Why all the mess? The queue might be full. We can either wait until it has room, or simply keep going and hope it will be available later.

So we keep the stuff destined for the queue privately, in a regular container. Every once in a while, we try to put as much data as possible on the queue. If the queue is full, we buffer the data for a while.

Many lock-free/wait-free/lock-less queues use a dynamic structure for this very reason. Limiting the size is quite tricky from a design perspective, unless one allows push() to fail. Of course, dynamic memory allocation has its own set of problems, and typically requires a thread-safe lock-less allocator, or incredibly complex hazard pointers.

##### Share on other sites
(no offense ;))
Unfortunately, it can't be lockless when "Interlocked(Inc|Dec)rement()" is, per MSDN, an atomic operation.

That is, if multiple threads do an atomic operation on shared data, then you have a lock.

##### Share on other sites
Quote:
Original post by phresnel
(no offense ;))
Unfortunately, it can't be lockless when "Interlocked(Inc|Dec)rement()" is, per MSDN, an atomic operation.

That is, if multiple threads do an atomic operation on shared data, then you have a lock.

Tend to disagree.
Following this reasoning would mean that a lockless algorithm cannot exist, by definition.

A lockless algorithm has to use atomic operations; otherwise it would have to use mutexes, critical sections, spin-locks, or something similar, making it non-lockless, eventually :).

##### Share on other sites
Quote:
Original post by tivolo

Tend to disagree.
Following this reasoning would mean that a lockless algorithm cannot exist, by definition.

A lockless algorithm has to use atomic operations, otherwise he'd have to use mutexes, critical sections, spin-locks, or something similar - making it non-lockless, eventually :).

Both of you are correct in my opinion. Captain Obvious would disagree, but lockless (or lockfree, more correctly) does not actually mean that no locks are used. Or that no atomic operations are used. It just means progress will always be made on the algorithm regardless of the state of any of the threads working on the problem.

Technically, you do have a lock if you do an atomic operation. Heck, the x86 assembly primitive instruction for this is even called "lock". But this doesn't preclude it from being used in a lockfree algorithm, because the guarantees of this particular type of lock are such that progress will always be made and hence it is a lock-free algorithm. It's actually a wait-free algorithm in this case, which is stronger than a lock-free algorithm. Technically, in a lock-free algorithm that's not wait-free, you're more than welcome to block as many threads as you want indefinitely using any synchronization primitive you can think of, it just has to be such that the other threads can pick up the work.
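The distinction reduces to a few lines (a sketch using C++11 atomics as a stand-in for the Interlocked functions; both helpers are names I made up): fetch_add completes in a bounded number of steps for every caller (wait-free), while a CAS retry loop only guarantees that some thread makes progress, since any individual thread can in principle retry forever under contention (lock-free).

```cpp
#include <atomic>

// Wait-free: a single hardware read-modify-write, always completes.
long wait_free_inc(std::atomic<long>& v) {
    return v.fetch_add(1);          // returns the pre-increment value
}

// Lock-free: the classic CAS loop. Some thread always succeeds, but this
// particular thread may lose the race an unbounded number of times.
long lock_free_inc(std::atomic<long>& v) {
    long old = v.load();
    while (!v.compare_exchange_weak(old, old + 1)) {
        // on failure, 'old' was reloaded with the current value; retry
    }
    return old;                     // pre-increment value, like above
}
```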

##### Share on other sites
Quote:
Original post by cache_hit
Both of you are correct in my opinion. Captain Obvious would disagree, but lockless (or lockfree, more correctly) does not actually mean that no locks are used. Or that no atomic operations are used. It just means progress will always be made on the algorithm regardless of the state of any of the threads working on the problem.

Technically, you do have a lock if you do an atomic operation. Heck, the x86 assembly primitive instruction for this is even called "lock". But this doesn't preclude it from being used in a lockfree algorithm, because the guarantees of this particular type of lock are such that progress will always be made and hence it is a lock-free algorithm. It's actually a wait-free algorithm in this case, which is stronger than a lock-free algorithm. Technically, in a lock-free algorithm that's not wait-free, you're more than welcome to block as many threads as you want indefinitely using any synchronization primitive you can think of, it just has to be such that the other threads can pick up the work.

Agreed, and thanks for the explanation. I wasn't thinking of x86's "lock" instruction, as I'm primarily working on PPC.

##### Share on other sites
Quote:
 Original post by Antheus:
 Let MAX_SIZE be 3. Let queue have 2 elements in it. m_reader = 0, spotToWrite = 2.
 3 threads on 3 cores push concurrently. 1 thread on a separate core reads.

```
IDX = spotToWrite
Tick   T1               T2               T3               T4
1      IDX=2            IDX=3            IDX=4
2      IDX-m_reader=2                                     pop/m_reader++
3      mstuff[2] = t    IDX-m_reader=2   IDX-m_reader=3
4                       mstuff[2] = t    return
```

 Oops. T2 overwrote the value of T1.

Why would T2 write to m_stuff[2]? Why wouldn't it write to m_stuff[0]?

##### Share on other sites
Quote:
 Original post by BradDaBugWhy would T2 write to m_stuff[2]? Why wouldn't it write to m_stuff[0]?

Yep, my bad. Looked at the wrong value.

But there might be another problem. Consider what happens if the queue is just full (let the queue be very large). Five threads attempt to write to it and fail. m_nextWrite is now 5 elements ahead of the maximum size, and all threads are just about to decrement it. Two succeed, 3 are still pending.

At this point, the reader thread pops everything. m_nextWrite is still +3.

On the next push, spotToWrite is +3. m_writer now gets incremented, and the next pop will dequeue element +1, which has not yet been written.
The remaining 3 threads now complete the decrement, ending with m_nextWrite+1.

I'm not going to make a table for this, ASCII art in these posts is somewhat cumbersome and too error prone.

##### Share on other sites
cache_hit:
Okay, maybe my definition of lockfree is too strict :)

Quote:
 Original post by tivolo: Tend to disagree. Following this reasoning would mean that a lockless algorithm cannot exist, by definition.

If you have algorithms that work solely on non-shared data, and only read (but not write) shared data, then you can get strictly lock free algorithms.

E.g., consider a real-time ray tracer where the tracing phase is subdivided into tiles, and each thread has a number of tiles to solve. Assume those tiles are non-overlapping. And assume that during the solver, the scene is read-only. Then that tracing algorithm is lockfree (including without atomic ops).

##### Share on other sites
Quote:
 Original post by phresnel: E.g., consider a real-time ray tracer where the tracing phase is subdivided into tiles, and each thread has a number of tiles to solve. Assume those tiles are non-overlapping. And assume that during the solver, the scene is read-only. Then that tracing algorithm is lockfree (including without atomic ops).

Then each thread is carrying out its own single-threaded operation. Whether they read from the same memory address or run on completely different computers is irrelevant. Imagine that there are 4 threads with 100 tiles each, but for some reason one of the threads only manages 50 when the others are done. The problem is getting the other 3 threads to help in completing the last 50 tiles, without computing the same tile twice, and without locking. That needs atomic operations.
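A sketch of that last point (illustrative names, C++11 atomics standing in for the era's intrinsics): one fetch_add per tile claim is enough to let an idle thread pick up remaining work without locks and without any tile being computed twice.

```cpp
#include <atomic>

// Shared work pool: each worker thread claims the next unrendered tile with
// a single atomic increment. Whichever thread finishes early simply keeps
// claiming until the pool is drained.
constexpr int kTotalTiles = 400;        // assumed tile count, for illustration
std::atomic<int> g_nextTile{0};

// returns a tile index to render, or -1 once every tile has been claimed
int claim_tile() {
    int t = g_nextTile.fetch_add(1);    // atomic: no two threads get the same t
    return t < kTotalTiles ? t : -1;
}
```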

##### Share on other sites
Lock-free is apparently properly named "non-blocking synchronization".

Quote:
 consider a real-time ray tracer where the tracing phase is subdivided into tiles, and each thread has a number of tiles to solve. Assume those tiles are non-overlapping. And assume that during the solver, the scene is read-only. Then that tracing algorithm is lockfree

Enter false sharing.

Isn't lockfree programming fun?
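A minimal sketch of the usual false-sharing fix (assuming 64-byte cache lines, which is my assumption, not something stated in the thread): even though the threads never logically share their counters, counters packed into one cache line are "shared" by the hardware, so each gets padded onto its own line.

```cpp
// Without the alignas, several of these counters could land on one cache
// line and ping-pong between cores on every write. alignas(64) forces each
// instance onto its own (assumed 64-byte) line.
struct PaddedCounter {
    alignas(64) long value;
};

static_assert(sizeof(PaddedCounter) >= 64,
              "each counter occupies its own assumed cache line");
```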

##### Share on other sites

Yes, lockfree without shared mutable data. Of course on top of it all, there must be an observer that watches how the process advanced. That one is not lockfree, so I was really talking about lockfree "sub"-algorithms.

I try with an ascii diagram:
```
         ray tracer
             |
            ...
             |
          spawner
      +------+------+
      |      |      |
   thread  thread  thread
    tile    tile    tile
    tile    tile    tile
    tile    tile    tile
```

Each thread would report its status into a distinct area, and does not communicate with other threads. This is not the optimal approach when some threads finish faster than others, so there might be too much waiting at the level of the spawner. Anyway, you can compensate for this and drastically reduce the amount of unnecessary waiting by giving each thread a number of tiles which are (pseudo-)randomly spread over the screen. Surely it ain't too easy to find the optimal ratio of tiles per thread (i.e. tile size).

Quote:
Original post by Antheus
Quote:
 consider a real-time ray tracer where the tracing phase is subdivided into tiles, and each thread has a number of tiles to solve. Assume those tiles are non-overlapping. And assume that during the solver, the scene is read-only. Then that tracing algorithm is lockfree

Enter false sharing.

No false sharing if each tile has its very own writing location in its own cache lines; plus, as said, the scene is immutable at that point. Note that by "lock free algorithm" I did not mean the ray tracer as a whole (I would have written it otherwise), but the very process of each thread rendering to some given tiles.

##### Share on other sites
Quote:
 Original post by Antheus:
 But there might be another problem. Consider what happens if queue is just full (let queue be very large). Five threads attempt to write to it and fail. m_nextWrite is now 5 elements ahead of maximum size, and all threads are just about to decrement it. Two succeed, 3 are still pending.
 At this point, reader thread pops everything. m_nextWrite is still +3.
 On next push, spotToWrite is +3. m_writer now gets incremented, and next pop will deque element +1, which has not yet been written.
 The remaining 3 threads now complete the decrement, ending with m_nextWrite+1.

Doh, you're right. How do you spot this stuff? Does it just come with experience?

Here's my third attempt. I added a spin lock to Push() to make sure that it reserves a spot ONLY if the queue isn't full. No trying to back out later. And if another thread happens to reserve a spot between the "is queue full" check and making the reservation then it has to do the check again.

The only problem I can see is that there's no guarantee that a thread won't sit inside the spin forever while other threads get to do all the writing.
```cpp
template <typename T, int MAX_SIZE = 1024>
class MyQueue
{
    T m_stuff[MAX_SIZE];

    // index of the next item that will be popped
    LONG m_reader;

    // index of the most recent completed write
    LONG m_writer;

    // index of the next available spot for writing
    LONG m_nextWrite;

public:
    MyQueue()
    {
        m_reader = 0;
        m_writer = 0;
        m_nextWrite = 0;
    }

    // attempt a push. Returns true if successful, or false if the queue is full.
    bool Push(T t)
    {
        // reserve a spot for writing
        LONG spotToWrite;
        while (true)
        {
            spotToWrite = m_nextWrite;

            // make sure the queue isn't full
            // (it's possible for m_reader to be bigger IRL than the value we get
            // here, but worst case is the queue will appear full when it isn't)
            if (spotToWrite - m_reader < MAX_SIZE - 1)
            {
                // looks like the queue has room for more, so attempt
                // to reserve a spot.
                if (InterlockedCompareExchange(&m_nextWrite, spotToWrite + 1, spotToWrite) == spotToWrite)
                    break;
            }
            else
            {
                // queue is full, so bail out
                return false;
            }

            // still here? that means we thought there was room in the
            // queue, but another thread wrote to the queue before we could
            // reserve a spot, so now we need to check again.
        }

        // do the write
        m_stuff[spotToWrite % MAX_SIZE] = t;

        // update the "write" cursor
        InterlockedIncrement(&m_writer);

        return true;
    }

    // pop a value and return true, or return false if nothing left to pop
    // (this only supports single readers)
    bool Pop(T& t)
    {
        if (GetCount() > 0)
        {
            t = m_stuff[m_reader % MAX_SIZE];
            m_reader++;
            return true;
        }

        return false;
    }

    int GetCount() { return m_writer - m_reader; }
};
```

##### Share on other sites
What happens if a thread gets interrupted between the "spotToWrite = m_nextWrite;" line and the subsequent subtraction in the following if? It's possible that things could go out of sync here and end up with incorrect behaviour. (Determining exactly how things would go wrong is left as an exercise to the reader [wink])

##### Share on other sites
Well, for one thing, if your LONG is merely 32 bits, after about 2G writes you end up with negative values, and % gives a negative result. Even if you use unsigned, after wraparound it will fail for non-power-of-2 sizes. I see no obvious threading bugs, though (assuming 1 reader thread). Good idea on monotonically incrementing read and write indexes; that simplifies it a lot.
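The modulo point is easy to demonstrate: C++ defines integer % to truncate toward zero (guaranteed since C++11), so the remainder takes the dividend's sign, and a wrapped-negative index would step off the front of the array.

```cpp
// Once a 32-bit index overflows and goes negative, idx % size is negative
// too, so m_stuff[idx % MAX_SIZE] would index before the array's start.
int ring_index(int idx, int size) {
    return idx % size;   // sign follows the dividend
}
```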

Quote:
 Original post by ApochPiQ: Any particular reason you don't use the built-in Windows InterlockedIncrement function? Also, I don't see anything that guarantees that your values are aligned on a 32-bit boundary

GCC arranges for 32-bit alignment (for 32-bit numbers) by default, and probably so does MSVC.
Quote:
 What happens if a thread gets interrupted between the "spotToWrite = m_nextWrite;" line and the subsequent subtraction in the following if?

Unless I'm horribly mistaken about the intent of this code, the whole idea of copying a value into a variable, checking and incrementing it, and then using atomic compare-exchange, is that the compare-exchange fails if m_nextWrite has been modified by another thread in that span, thus making it impossible for two threads to reserve the same spot. Looks fairly standard to me.
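That read-check-CAS idiom, reduced to its skeleton (a sketch with C++11 std::atomic standing in for InterlockedCompareExchange; try_reserve is a name I made up): reserve the next slot only while doing so keeps the index under the limit, and if another thread moved the index between our read and our CAS, the CAS fails and we simply re-check.

```cpp
#include <atomic>

// Try to claim slot 'next' as long as (next - reader) stays below 'limit'.
// On CAS failure, compare_exchange_weak reloads 'observed' with the current
// value, so the loop naturally re-runs the full-queue check.
bool try_reserve(std::atomic<long>& next, long reader, long limit, long& slot) {
    long observed = next.load();
    while (observed - reader < limit) {
        if (next.compare_exchange_weak(observed, observed + 1)) {
            slot = observed;    // reservation succeeded; slot is ours alone
            return true;
        }
    }
    return false;               // queue looked full at every attempt
}
```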

##### Share on other sites
Quote:
 Original post by BradDaBug: Here's my third attempt. I added a spin lock to Push() to make sure that it reserves a spot ONLY if the queue isn't full. No trying to back out later. And if another thread happens to reserve a spot between the "is queue full" check and making the reservation then it has to do the check again.

5 threads reserve spots: 0,1,2,3,4 (correct)
3 threads write: 4,3,2 (for some reason, this occurs in reverse order)
```
+---+---+---+---+---+
| ? | ? | 2 | 3 | 4 |
+---+---+---+---+---+
  ^ m_writer
```

```
+---+---+---+---+---+
| ? | ? | 2 | 3 | 4 |
+---+---+---+---+---+
      m_writer ^
```

pop() is called; it dequeues the first 3 elements, obtaining ?, ?, 2.

```
+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+
      m_reader ^      ^ m_writer
```

pop() is called, it returns 3,4.

A stack is somewhat simpler to implement in-place. For a queue, study the singly-linked list example (there is still no solid lock-free doubly-linked list). Then attempt to implement the linked list in-place using cursors.

##### Share on other sites

Quote:
Original post by Antheus
Quote:
 Original post by BradDaBug: Here's my third attempt. I added a spin lock to Push() to make sure that it reserves a spot ONLY if the queue isn't full. No trying to back out later. And if another thread happens to reserve a spot between the "is queue full" check and making the reservation then it has to do the check again.

5 threads reserve spots: 0,1,2,3,4 (correct)
3 threads write: 4,3,2 (for some reason, this occurs in reverse order)

Didn't see that one. Or the reader can even read some structure while it is in the process of being written, if T is not trivial.
It could maybe be fixed with some "write complete" flag in the structure, compare-and-set from true to false by the reader prior to popping (if the CAS fails, either return false from pop, or keep trying in a loop).
Edit: by the way, some memory barriers may be necessary as well.
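On the memory-barrier point, here is how the single-reader case looks with C++11 release/acquire atomics (which postdate this 2009 thread, so this is purely illustrative; note it is also single-*writer*, unlike the multi-writer queue under discussion): the release store to m_writer guarantees the element write is visible before the reader observes the new count, which is exactly the PPC reordering worry from the original post.

```cpp
#include <atomic>

template <typename T, int MAX_SIZE = 1024>
class SpscQueue {                         // single producer, single consumer
    T m_stuff[MAX_SIZE];
    std::atomic<long> m_reader{0};
    std::atomic<long> m_writer{0};
public:
    bool Push(const T& t) {               // called by the one writer thread
        long w = m_writer.load(std::memory_order_relaxed);
        if (w - m_reader.load(std::memory_order_acquire) >= MAX_SIZE)
            return false;                 // full
        m_stuff[w % MAX_SIZE] = t;        // plain write...
        m_writer.store(w + 1, std::memory_order_release);  // ...published here
        return true;
    }
    bool Pop(T& t) {                      // called by the one reader thread
        long r = m_reader.load(std::memory_order_relaxed);
        if (m_writer.load(std::memory_order_acquire) - r <= 0)
            return false;                 // empty
        t = m_stuff[r % MAX_SIZE];        // safe: acquire saw the release
        m_reader.store(r + 1, std::memory_order_release);
        return true;
    }
};
```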

Practically though, in any sort of practical setting... I'd just use a 'locked' queue.
If locking is a bottleneck, if the single reader thread itself won't be a bottleneck, if you can't eliminate the need for a queue... that's a lot of ifs before you'll need to optimize a locked queue into a 'lockless' one with a single reader.
Locking being a bottleneck and a single reader thread not being a bottleneck may well exclude each other in all the uses you'll encounter.
I myself use threading for a: a working GUI while doing something in the background, and b: making use of multicore by splitting a massive calculation between cores (with OpenMP). The former passes at most a few things per second from the GUI thread to the processing threads; the latter is best done without such queues, and is a *really* well developed subject (supercomputers have been 'multicore' for ages).

[Edited by - Dmytry on September 12, 2009 4:33:10 AM]

##### Share on other sites
Quote:
 Original post by Dmytry: Didn't see that one. Or reader can even read some structure in process of it being written, if T is not trivial.

That shouldn't happen:
```cpp
m_stuff[spotToWrite % MAX_SIZE] = t;
InterlockedIncrement(&m_writer);
```

spotToWrite is obtained atomically, so one spot can only go to a single writer.
The writing is completely sequential, and reader will only pop after m_writer was incremented.

The problem with this approach is simply in that it requires two consecutive atomic operations as well as a very long write operation.

In case of linked list or stack, "spotToWrite" is what 'new' returns, writing is done in separate thread, and update is performed atomically. But with in-place structure, this is much trickier.

Quote:
 Could maybe be fixed with some "write complete" flag in the structure

The main problem right now is that there are too many "flags".

There is a way: a linked list plus a stack. Unused elements are kept in the stack and (de)allocated in a lockless manner; writes can then be fully concurrent, and updates to the list involve well-understood pointer updates.
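The stack half of that design is the classic Treiber stack. A minimal sketch (mine, not from the thread), deliberately ignoring the ABA problem and node reclamation, which a real free-list must handle with tagged pointers or hazard pointers:

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Lock-free LIFO free list: push/pop each retry a single CAS on 'head'.
class TreiberStack {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // on failure, n->next is reloaded with the current head; retry
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
    Node* pop() {
        Node* n = head.load(std::memory_order_acquire);
        // on failure, n is reloaded with the current head; retry until
        // we swing head from n to n->next (or run out of nodes)
        while (n && !head.compare_exchange_weak(n, n->next,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
        }
        return n;   // nullptr if the stack was empty
    }
};
```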

##### Share on other sites
Quote:
Original post by Antheus
Quote:
 Original post by Dmytry: Didn't see that one. Or reader can even read some structure in process of it being written, if T is not trivial.

That shouldn't happen

I meant, during the sequence that you described, where it is reading non-written entries, because the other writer thread incremented the counter after having added another entry later in the queue.

Quote:
 The main problem right now is that there are too many "flags".There is a way. Linked list with stack. Unused elements are kept in stack and (de)allocated in lockless manner, writes can then be fully concurrent, and updates to list involve well understood pointer updates.

Yes, single linked list is easier to get right.

Out of curiosity. What would be the real world usage scenario where you would actually need this thing?

##### Share on other sites
Quote:
 Original post by Dmytry: Out of curiosity. What would be the real world usage scenario where you would actually need this thing?

Active Objects.

You have n objects that are handled by thread pool. At any given time, an active object will be active in at most one thread only.

Other active objects running in separate threads are free to post messages to other objects.

A lock-free queue here has an advantage over a mutex-based solution since it avoids deadlock via reverse-order locking. While the queue alone cannot cause this, posting to the queue from inside a critical section can. There are ways to prevent such a situation, but non-blocking synchronization can eliminate the problem altogether.

Performance benefits of lock-free structures are not guaranteed; the constant-factor overhead is only mitigated once you get access to more cores and potentially less waiting.

But still, find a third-party, real world tested implementation that has been in use for a while.