Lockless FIFO Queue Implementation

26 comments, last by Hodgman 14 years, 10 months ago
Just out of curiosity, how much of your time is being spent inside of functions that are reading/writing from queues?
In this case, I think the question should be the reverse: Given a certain performance of a queue, how much queuing can you do in your application?

The reason is that future software architecture will look different from current software architecture. Small tasks that are assigned to worker thread pools are probably going to be an important part of that. Once you start writing your software like that, queuing costs are going to become noticeable.

Btw: Visual Studio 2010 is going to include a lightweight tasking/queuing library along these lines (but with more primitives than just queuing).
enum Bool { True, False, FileNotFound };
Quote:Original post by hplus0603
In this case, I think the question should be the reverse: Given a certain performance of a queue, how much queuing can you do in your application?

The reason is that future software architecture will look different from current software architecture. Small tasks that are assigned to worker thread pools are probably going to be an important part of that. Once you start writing your software like that, queuing costs are going to become noticeable.

Btw: Visual Studio 2010 is going to include a lightweight tasking/queuing library along these lines (but with more primitives than just queuing).


Oh, agreed entirely. My question was more an attempt to probe into whether this is a real problem that's being solved right now, or a case of premature optimization.
Quote:Original post by kyoryu
My question was more an attempt to probe into whether this is a real problem that's being solved right now, or a case of premature optimization.
I'm building an actor/entity system at the moment, where to allow parallel updates of entities, all inter-entity function calls must be queued (to be executed at a safe moment). This could blow out anywhere from 100 to 1M queue ops per frame.
Traditional locks can take hundreds of cycles to process. If we estimate 200 cycles of a 2.4GHz CPU, that's 1/12th of a second to do 1M lock operations.
Lock-free queues are an order of magnitude faster, and regular queues are an order of magnitude faster again.
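These ballpark numbers are easy to sanity-check yourself. Here's a rough single-threaded micro-benchmark sketch; std::mutex is an assumption on my part (code of this era would use CRITICAL_SECTION or a boost mutex instead), and the absolute numbers will vary wildly by platform. It's only a lower bound: real contention makes the locked case worse.

```cpp
#include <chrono>
#include <mutex>
#include <vector>

// Time n push_back calls with a lock/unlock pair around each one.
// Returns elapsed microseconds.
inline long long TimeLockedPushes(int n)
{
    std::vector<int> q;
    q.reserve(n);
    std::mutex m;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> lock(m);
        q.push_back(i);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// Time n plain push_back calls with no synchronisation at all.
inline long long TimePlainPushes(int n)
{
    std::vector<int> q;
    q.reserve(n);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        q.push_back(i);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Run both with n = 1M and compare; the gap is the per-operation locking overhead the posts above are estimating.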

Seeing as I'm making something so reliant on queues, I've ditched all lock-free containers up front. It is a premature optimisation ;), but it's common sense that with this frequency of usage I've got to avoid any scalability-busting approach (and that means no potential synchronisation).
Cool, I love the Actor model, I honestly believe that we will see a drift towards it in the near future, as it handles concurrency much better.

From my own experiences implementing Actor systems, I'd really recommend just starting out with a locking queue first, to get a performance baseline at least. If it's encapsulated in a class, you can get an idea of what performance is like, and then make improvements from there. It is very possible for a lockless implementation to be slower than a locked implementation, depending on the efficiency of the lockless queue.
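The "encapsulate it in a class" suggestion might look something like this minimal sketch. std::mutex is an assumption (on Windows of the era you'd wrap a CRITICAL_SECTION instead); the point is that callers never see the lock, so you can swap in a lock-free implementation later without touching them.

```cpp
#include <mutex>
#include <queue>

// Minimal locking FIFO queue. The lock is an implementation detail,
// so the class can later be reimplemented lock-free behind the same API.
template <typename T>
class LockedQueue
{
public:
    void Push(const T& value)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Queue.push(value);
    }

    // Non-blocking pop: returns false if the queue was empty.
    bool TryPop(T& out)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (m_Queue.empty())
            return false;
        out = m_Queue.front();
        m_Queue.pop();
        return true;
    }

private:
    std::mutex    m_Mutex;
    std::queue<T> m_Queue;
};
```

Profile with this first; only if the lock actually shows up in the profiler is the lock-free rewrite worth its complexity.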
There was a nice presentation at this year's GDC about lockless programming, and the gist was: "Use locks!" :)

This week I implemented a threaded task manager for my game and threw many, many small tasks at it - the profiler showed that the locks are nowhere near enough of a performance hit to be worried about.

So I would second the use locks suggestion, at least for the time being :)

HTH,
Marc
Quote:or a case of premature optimization


Knuth must be one of the most mis-quoted people on the planet. What he actually said in that quote is that "too often, we worry about efficiencies in the small" -- what he complained about was programmers breaking out the assembler before they even had a running program to profile.

In fact, that same article clearly argues for understanding the performance characteristics of your problem before starting implementation -- the current discussion would be one example of exactly what he wanted people to understand before implementing a system.

If you don't believe me, go and look it up :-)

Anyway: if your lock is a Windows CRITICAL_SECTION, then it ends up being no more expensive than a lock-free spin-lock in the uncontended case, and it only hits the kernel under contention. If you have multiple queues and multiple workers, the amount of contention will often be vanishingly small, and thus "regular" locks may well be totally adequate -- and less bug-prone than lockless programming.
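For anyone unsure what "no more expensive than a lock-free spin-lock" means concretely, here is a sketch of such a spin-lock; std::atomic_flag is an assumption (this predates C++11, where you'd use InterlockedExchange or similar). The uncontended CRITICAL_SECTION fast path is essentially the same atomic exchange; the difference is that under contention a CRITICAL_SECTION can sleep in the kernel, while a pure spin-lock can only burn CPU.

```cpp
#include <atomic>

// One atomic test-and-set to acquire, one store to release, no kernel call.
// This is roughly what the uncontended CRITICAL_SECTION path boils down to.
class SpinLock
{
public:
    void Lock()
    {
        // Busy-wait until the flag was previously clear.
        while (m_Flag.test_and_set(std::memory_order_acquire))
            ; // spin
    }

    void Unlock()
    {
        m_Flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag m_Flag = ATOMIC_FLAG_INIT;
};
```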
enum Bool { True, False, FileNotFound };
Quote:Original post by kyoryu
Cool, I love the Actor model, I honestly believe that we will see a drift towards it in the near future, as it handles concurrency much better.

From my own experiences implementing Actor systems, I'd really recommend just starting out with a locking queue first, to get a performance baseline at least. If it's encapsulated in a class, you can get an idea of what performance is like, and then make improvements from there. It is very possible for a lockless implementation to be slower than a locked implementation, depending on the efficiency of the lockless queue.
Like I said, I'm not using locking or lock-free queues ;) Both require synchronisation of some sort which is a scalability buster.
I'm just using plain old thread-unsafe containers - one per thread. At a single synch point these thread-local queues are merged together to be processed later.

I can't really just switch from using one type of container to another with this design, I've got to design it from the ground up so that thread-safety is taken care of by scheduling, not by locks of any sort (you can think of my single-synch point as the one lock in the system).

I guess you could kind of sum it up like this:
typedef std::vector<IMessage*> MessageQueue;
typedef std::vector<MessageQueue> ThreadLocalMessageQueue;

ThreadLocalMessageQueue m_PublicQueue( numThreads );
MessageQueue            m_PrivateQueue;

//executed by one thread, all others pause
void SynchPoint()
{
  m_PrivateQueue.clear();
  for each m_PublicQueue as Q
  {
    copy contents of Q into m_PrivateQueue
    Q.clear();
  }
}

//executed by all threads
void ParallelWork( int threadID, int numThreads )
{
  int workPerThread = m_PrivateQueue.size() / numThreads;
  int start = workPerThread * threadID;
  int end   = start + workPerThread;
  for( int i = start; i != end; ++i )
    m_PrivateQueue[i]->Execute(threadID); //this will push new items into m_PublicQueue[threadID]
  wait for all other threads to finish
  if( threadID == 0 )
    SynchPoint();
  else
    wait for synch point to finish
}
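The merge step in the pseudocode above compiles down to very little. Here's a self-contained sketch of just that step, with int standing in for IMessage* so it's trivially testable; the function name and signature are my own illustration, not Hodgman's actual code.

```cpp
#include <vector>

// Drain every thread-local public queue into the shared private queue.
// Runs on one thread while all workers are paused, so no locking is needed.
void SynchPoint(std::vector<std::vector<int>>& publicQueues,
                std::vector<int>& privateQueue)
{
    privateQueue.clear();
    for (std::vector<int>& q : publicQueues) {
        privateQueue.insert(privateQueue.end(), q.begin(), q.end());
        q.clear();
    }
}
```

Because only this one function ever touches more than one thread's queue, and it runs while everything else is paused, the per-thread queues themselves can stay completely unsynchronised.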

