Critique my threadsafe lockless queue!

Started by
43 comments, last by Dmytry 14 years, 7 months ago
Quote:Original post by BradDaBug

Here's my third attempt. I added a spin lock to Push() to make sure that it reserves a spot ONLY if the queue isn't full. No trying to back out later. And if another thread happens to reserve a spot between the "is queue full" check and making the reservation then it has to do the check again.


5 threads reserve spots: 0,1,2,3,4 (correct)
3 threads write: 4,3,2 (for some reason, this occurs in reverse order)
+---+---+---+---+---+
| ? | ? | 2 | 3 | 4 |
+---+---+---+---+---+
  ^ m_writer


The 3 threads increment m_writer.
+---+---+---+---+---+
| ? | ? | 2 | 3 | 4 |
+---+---+---+---+---+
              ^ m_writer

pop() is called; it dequeues the first 3 elements, obtaining ?, ?, 2.

Remaining two threads complete:
+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+
              ^     ^
          m_reader  m_writer

pop() is called again; it returns 3, 4.


A stack is somewhat simpler to implement in-place. For a queue, study the singly-linked list example (there is still no solid lock-free doubly-linked list). Then attempt to implement the linked list in-place using cursors.
That's a really interesting thread, btw.

Quote:Original post by Antheus
Quote:Original post by BradDaBug

Here's my third attempt. I added a spin lock to Push() to make sure that it reserves a spot ONLY if the queue isn't full. No trying to back out later. And if another thread happens to reserve a spot between the "is queue full" check and making the reservation then it has to do the check again.


5 threads reserve spots: 0,1,2,3,4 (correct)
3 threads write: 4,3,2 (for some reason, this occurs in reverse order)

Didn't see that one. Or the reader can even read a structure in the process of being written, if T is not trivial.
Could maybe be fixed with some "write complete" flag in the structure, compare-and-swapped from true to false by the reader prior to popping (if the CAS fails, either return false from pop, or keep trying in a loop).
edit: by the way, some memory barriers may be necessary as well.

Practically though, in any sort of practical setting... I'd just use a locked queue.
If locking is a bottleneck, if a single reader thread itself won't be a bottleneck, if you can't eliminate the need for the queue... that's a lot of ifs before you'll need to optimize a locked queue into a 'lockless' one with a single reader.
Locking being a bottleneck and a single reader thread not being a bottleneck may well be mutually exclusive in all the uses you'll encounter.
I, myself, use threading for: a) a working GUI while doing something in the background; b) making use of multicore, splitting a massive calculation between cores (with OpenMP). The former passes at most a few things per second from the GUI thread to the processing threads; the latter is best done without such queues, and is a *really* well-developed subject (supercomputers have been 'multicore' for ages).

[Edited by - Dmytry on September 12, 2009 4:33:10 AM]
Quote:Original post by Dmytry
Didn't see that one. Or the reader can even read a structure in the process of being written, if T is not trivial.


That shouldn't happen:
m_stuff[spotToWrite % MAX_SIZE] = t;
InterlockedIncrement(&m_writer);

spotToWrite is obtained atomically, so one spot can only go to a single writer.
The writing is completely sequential, and the reader will only pop after m_writer has been incremented.

The problem with this approach is simply that it requires two separate atomic operations, with a potentially long write operation in between.

In the case of a linked list or stack, "spotToWrite" is what 'new' returns, the writing is done privately in the writing thread, and the update is performed atomically. But with an in-place structure, this is much trickier.

Quote:Could maybe be fixed with some "write complete" flag in the structure


The main problem right now is that there are too many "flags".

There is a way: a linked list with a stack. Unused elements are kept in a stack and (de)allocated in a lockless manner; writes can then be fully concurrent, and updates to the list involve well-understood pointer updates.
Quote:Original post by Antheus
Quote:Original post by Dmytry
Didn't see that one. Or the reader can even read a structure in the process of being written, if T is not trivial.


That shouldn't happen

I meant during the sequence you described, where it is reading non-written entries, because another writer thread has incremented the counter after having added a different entry later in the queue.

Quote:
The main problem right now is that there are too many "flags".

There is a way: a linked list with a stack. Unused elements are kept in a stack and (de)allocated in a lockless manner; writes can then be fully concurrent, and updates to the list involve well-understood pointer updates.

Yes, a singly-linked list is easier to get right.

Out of curiosity. What would be the real world usage scenario where you would actually need this thing?
Quote:Original post by Dmytry

Out of curiosity. What would be the real world usage scenario where you would actually need this thing?


Active Objects.

You have n objects that are handled by a thread pool. At any given time, an active object will be active in at most one thread.

Other active objects running in separate threads are free to post messages to other objects.

A lock-free queue here has an advantage over a mutex-based solution, since it avoids deadlock via reverse-order locking. While the queue alone cannot cause this, posting to a queue from inside a critical section can. There are ways to prevent such a situation, but non-blocking synchronization can eliminate the problem altogether.

Performance benefits of lock-free structures are not guaranteed; their constant-factor overhead may only be paid back by access to more cores and potentially less waiting.

But still, find a third-party, real-world-tested implementation that has been in use for a while.
Quote:Original post by Antheus
Quote:Original post by Dmytry

Out of curiosity. What would be the real world usage scenario where you would actually need this thing?


Active Objects.

You have n objects that are handled by thread pool. At any given time, an active object will be active in at most one thread only.

Other active objects running in separate threads are free to post messages to other objects.

Hmm. Sounds niche. I tried to find an implementation for the OP, and didn't find anything besides people making it because they think it's 1337 (and an implementation with a *single* reader and *single* writer thread, which is really too trivial a case). The worst examples use x86 assembly, MSVC-style, with the same assembly cut-and-pasted several times over the source, evidently under the assumption that writing it in assembly will ensure no reordering.
You've avoided the question, btw. Active Objects are only real-world when your real-world pointy-headed boss tells you to implement Active Objects because he read somewhere that they're a must... otherwise you still need a scenario that is best solved with active objects.
Quote:
A lock-free queue here has an advantage over a mutex-based solution, since it avoids deadlock via reverse-order locking. While the queue alone cannot cause this, posting to a queue from inside a critical section can. There are ways to prevent such a situation, but non-blocking synchronization can eliminate the problem altogether.

If the push method locks, pushes, then unlocks the queue's mutex, and locks nothing else in between, it won't deadlock.
Or are you talking about a critical section that disables interrupts, thus preventing another thread from completing the push? That seems outdated.
A lock-free queue *is* an order of magnitude more complicated to write and debug, so you should only bother if you need the performance boost. The performance difference can be *huge*, though.
I gave up on trying to get a really fast lock-free queue working, though. Every 6 months I'd find another flaw that would crash once in a billion runs...

I'm now using wait-free queues, which turn out to be easier: just use std::vector etc. and solve thread-safety with scheduling instead of synchronisation ;)
Quote:Original post by Dmytry
Quote:Original post by Antheus
Quote:Original post by Dmytry
Out of curiosity. What would be the real world usage scenario where you would actually need this thing?
Active Objects.
You have n objects that are handled by thread pool. At any given time, an active object will be active in at most one thread only.
Other active objects running in separate threads are free to post messages to other objects.
Active Objects are only real world when your real world pointy headed boss tells you to implement Active Objects coz he read somewhere that those are a must... otherwise you still need a scenario that is best solved with active objects.
I'm building a thread-safe entity/component system at the moment, which works pretty much the same as Active Objects or the Actor model. It's very handy because all of the game logic is implicitly multi-threaded and thread-safe. Inter-entity function calls are converted into message objects at compile time and written into these queues instead of being instantly executed.

I use something like the following for the queue structure. One write queue is created per thread, and a single read-only queue.
template<class T>
class ThreadLocalQueue
{
  typedef std::vector<T>      Vector;
  typedef std::vector<Vector> TlsVector;
  TlsVector m_Write;
  Vector    m_Read;
public:
  ThreadLocalQueue( int numThreads ) : m_Write(numThreads) {}

        Vector& Write()       { return m_Write[GetThreadId()]; }
  const Vector& Read () const { return m_Read; }

  void Flip()
  {
    //Barrier
    //assert worker threads aren't running!
    m_Read.clear();
    for( size_t n = 0; n != m_Write.size(); ++n )
    {
      // copy contents of m_Write[n] into m_Read, then clear m_Write[n]
      m_Read.insert( m_Read.end(), m_Write[n].begin(), m_Write[n].end() );
      m_Write[n].clear();
    }
    //Barrier
  }
};

To distribute work, each thread is given a different range of the read queue to execute. As the threads do their work, they write results into each of the write queues.
After all the threads have finished working (i.e. finished using the queue), the main thread calls Flip() and schedules up another cycle of work for the threads to execute.

The only caveats are to make sure memory re-ordering doesn't mess up Flip() (with appropriate memory fences), and that the vectors are aligned to the cache-line size. The GetThreadId() function (which makes sure each thread writes to a different queue) is only a few lines of code based on thread-local storage.

[Edited by - Hodgman on September 13, 2009 5:40:52 PM]
Quote:Original post by Dmytry
Active Objects are only real world when your real world pointy headed boss tells you to implement Active Objects coz he read somewhere that those are a must... otherwise you still need a scenario that is best solved with active objects.


Network servers built on top of asynchronous sockets. Distributed objects (think CORBA). A programming model like that of Stackless Python, but running concurrently on multiple cores.

However, it turns out it's simpler to implement a fully generic queue (n readers/n writers); in some languages it already comes with the standard library.

Quote:Original post by Hodgman
A lock-free queue *is* an order of magnitude more complicated to write/debug, so you should only bother if you need that performance boost. The performance difference can be *huge* though.

When a lot of threads are writing? Into a queue which has only 1 reader thread anyway?
Quote:

I gave up on trying to get a really fast lock-free queue working though. Every 6 months I'd find another flaw that would crash once in a billion...

I'm now using wait free queues, which turns out to be easier - just use std::vector/etc and solve thread-safety with scheduling instead of synchronisation ;)
Quote:Original post by Dmytry
Quote:Original post by Antheus
Quote:Original post by Dmytry
Out of curiosity. What would be the real world usage scenario where you would actually need this thing?
Active Objects.
You have n objects that are handled by thread pool. At any given time, an active object will be active in at most one thread only.
Other active objects running in separate threads are free to post messages to other objects.
Active Objects are only real world when your real world pointy headed boss tells you to implement Active Objects coz he read somewhere that those are a must... otherwise you still need a scenario that is best solved with active objects.
I'm building a thread-safe entity/component system at the moment, which works pretty much the same as Active Objects or the Actor model. It's very handy because all of the game logic is implicitly multi-threaded and thread-safe. Inter-entity function calls are converted into message objects at compile time and written into these queues instead of being instantly executed.

I use something like the following for the queue structure. One write queue is created per thread, and a single read-only queue.
*** Source Snippet Removed ***
To distribute work, each thread is given a different range of the read queue to execute. As the threads do their work, they write results into each of the write queues.
After all the threads have finished working (i.e. finished using the queue), the main thread calls Flip() and schedules up another cycle of work for the threads to execute.

The only caveats are to make sure memory re-ordering doesn't mess up Flip() with appropriate memory fences, and the vectors should be aligned to the cache-line size. The GetThreadId() function (which makes sure each thread writes to a different queue) is only a few lines of code based on thread local storage.

But you're not heavily using queues with multiple writing threads and 1 reading thread, right?

If I needed my game's updates to be parallel, I'd do it like this: when processing a simulation step, the current state is read-only and the next state is write-only. The next state depends only on the current state (which includes controller state and the network message queue). Then I could simply use OpenMP parallel for (or a reinvention thereof) to calculate the next state. There would be a chunks-to-process queue, but it'd be read by multiple threads, and it wouldn't have a lot of elements anyway (even 1000 per simulation step is overkill).

I won 5th place in a threaded optimization contest once, for 8 cores (lost to competitors who also did a lot of SSE optimization).
Quote:Original post by Antheus
Quote:Original post by Dmytry
Active Objects are only real world when your real world pointy headed boss tells you to implement Active Objects coz he read somewhere that those are a must... otherwise you still need a scenario that is best solved with active objects.


Network servers, built on top of asynchronous sockets. Distributed objects (think CORBA). Programming model like that of Stackless Python, but running concurrently on multiple cores.

Hmm. You'd rather rely heavily on queues with multiple readers for those cases (for [re]distributing tasks/messages/etc. between processing threads), no? Just what would be the case where you would need to send a LOT of messages concurrently from many threads (so many that push is often executed by multiple cores simultaneously) into one thread?

Multiple threads writing heavily and concurrently into a queue with a single reading thread: that's the most literal case of a 'bottleneck' I've ever seen.
Quote:
However, it turns out it's simpler to implement a fully generic queue (n-readers/n-writers), or it comes with standard library already in some languages.

Which possibly uses locks.
