ApochPiQ

Double-check my lock-free voodoo


I've been working on a lockless message passing system for the last couple of days, and finally came up with what seems like a pretty solid implementation.

First, some important background: this is designed for a one-consumer, multiple-producer scenario. It will assuredly not work in a multiple-consumer situation. However, from my checks and logic, it should be fully safe as long as only one thread is consuming data.

The theory is very simple. We maintain three stacks. The first stack is the "write stack." All producers append their messages to the write stack with a simple CAS-based method. The second stack is the "read stack." This stack is only visible to the consumer; all work by the consumer is done on the read stack. It is effectively not aware of the write stack. The third stack is a little bit of a hack. I want this system to have FIFO semantics, but the write stack is implemented as a LIFO stack. Therefore, the read process uses this third stack to essentially reverse messages that it reads. Consuming a message means popping an item off this third stack.

Anyone with basic familiarity with lock-free structures should be familiar with the CAS LIFO stack, so I won't bother explaining it. The consumer half is a bit wonky, though, so here's the logic. (I make heavy use of a "sentinel" node, which is just a node in either the read or write stack that has no attached data payload. This sentinel is basically the "end of list" marker.)
  • If the read head is the sentinel node, swap the pointers to the read and write stacks, and proceed as usual
  • Traverse the read stack (which is basically a singly linked list at this point) and push each item onto the third stack as it is visited.
  • If the third stack is empty, return NULL, indicating no pending messages
  • Pop the first item off the third stack, clean up some memory usage, and return it to the caller as the current message
I've reviewed this logic repeatedly, and stress-tested the implementation under 30 threads of load on my dual-core workstation. As near as I can tell the code is free of potential problems, primarily because of the magic constraint that we only have one consumer. So here's the code: [edit] Removed the original code; please see the updated version here. [/edit] I welcome any comments, suggestions, and/or gentle explanations that I am a moron and that this doesn't actually work [grin] [Edited by - ApochPiQ on March 20, 2009 6:22:50 PM]
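For readers less familiar with the producer half mentioned above, here is a minimal sketch of the CAS-based write-stack push, restated with std::atomic (which postdates the original post; the node and message types here are illustrative, not from the actual implementation):

```cpp
#include <atomic>

// Illustrative stand-ins for the real message and node types.
struct Message { int payload; };

struct Node
{
    Node* Next;
    Message* Payload;
};

// Producers push onto the shared write stack with a CAS loop.
void Push(std::atomic<Node*>& writeHead, Node* node)
{
    Node* expected = writeHead.load();
    do
    {
        node->Next = expected;
        // compare_exchange_weak reloads 'expected' on failure, so the
        // loop simply retries against the freshly observed head.
    } while(!writeHead.compare_exchange_weak(expected, node));
}
```

The sentinel node from the post would simply be the bottom-most `Node` that the head initially points at.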

Is there a particular reason why you don't just use a lock-free FIFO queue, which is also safe for the multiple-producer, multiple-consumer case?

What level of portability are you going for? volatile gives you different guarantees depending on the compiler you're using (or even the *version* of the compiler you're using). I'd try not to use it at all and instead insert the barriers explicitly.
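To illustrate the suggestion, here's a sketch of inserting the barriers explicitly in modern C++ (std::atomic_thread_fence; on compilers of the era this would be something like MemoryBarrier on MSVC or __sync_synchronize on GCC). The publish/consume pair below is a hypothetical example, not code from the thread:

```cpp
#include <atomic>

int g_payload = 0;
std::atomic<bool> g_ready(false);

void Publish(int value)
{
    g_payload = value;                                   // plain store
    std::atomic_thread_fence(std::memory_order_release); // payload visible first
    g_ready.store(true, std::memory_order_relaxed);
}

bool TryConsume(int& out)
{
    if(!g_ready.load(std::memory_order_relaxed))
        return false;
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    out = g_payload;
    return true;
}
```

The point is that the ordering guarantee lives in the fences themselves, not in whatever a particular compiler happens to promise about volatile.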

I'm not qualified to comment much further than that on the nitty gritty. You have some exception safety issues, though, unless you're assuming new will never throw (which isn't the case on Windows, for example).

I thought cmpxchg8b is quite widely supported on 32-bit x86 platforms. Not sure how well cmpxchg16b is supported for 64-bit platforms though. Do you have something to back up the "CAS2 is not widely supported" claim, e.g. a supported instruction table or something?

I was looking back through my notes and intertron history and realized that all the sources I have regarding CAS2 on 64-bit processors were pretty old, and likely outdated. Seems like cmpxchg16b is pretty common now.

In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]


Yeah, exception safety is a bit of a problem here; I'll be tuning that up along with lots of other code that needs it.

Manual memory fencing also sounds like a good idea. I'll add that to my list and hopefully get it hammered out pretty quickly.


Thanks all for the feedback thus far; although I'm still curious if anyone can find a problem in the implementation...

Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]

Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.

Quote:
Original post by ApochPiQ
In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]

I haven't checked through your code, so I can't tell if there are issues with it, but one thing you might want to try is to reproduce the ABA issue, which is why CAS2 is often used. The problem may arise only very rarely, so it's difficult to test for empirically. I implemented a lock-free FIFO queue in Spin-X Engine based on the "Optimized Lock-Free FIFO Queue continued" paper, though, and it's not terribly difficult.
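The ABA hazard mentioned here can be replayed deterministically in a single thread. This sketch (illustrative names, not from the original code) walks a CAS-stack pop through the classic interleaving: thread 1 snapshots the head and its successor, thread 2 pops two nodes and pushes the first one back, and thread 1's compare then succeeds against a structurally different stack:

```cpp
#include <cstddef>

struct AbaNode { AbaNode* Next; };

// Single-threaded replay, so a plain pointer stands in for the shared head
// and the CAS is just an equality test followed by a store.
bool SimulateAba(AbaNode& a, AbaNode& b, AbaNode& c, AbaNode*& headOut)
{
    AbaNode* head = &a; a.Next = &b; b.Next = &c; c.Next = nullptr;

    // Thread 1 begins a pop: snapshots head and head->Next, then stalls.
    AbaNode* observedHead = head;          // == &a
    AbaNode* observedNext = head->Next;    // == &b

    // Thread 2 runs to completion: pops A, pops B (B is now conceptually
    // freed), then reuses node A and pushes it back on top of C.
    head = head->Next;       // pop A -> head == &b
    head = head->Next;       // pop B -> head == &c
    a.Next = head;           // node A reused
    head = &a;               // push A -> head == &a again

    // Thread 1 resumes: its CAS compares head against observedHead. The
    // comparison succeeds (head is &a again), so the swap goes through even
    // though the stack underneath changed; head now points at "freed" B.
    if(head == observedHead)
        head = observedNext;

    headOut = head;
    return head == &b;       // true == ABA corruption occurred
}
```

A version-counter alongside the pointer (hence CAS2) makes the stale compare fail, which is the usual cure.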

I've walked through every combination of partial execution/preemption and can't find a case where ABA occurs. This is primarily because of the way the single consumer works - I have a guarantee during the SwapReadAndWrite operation that the only preemption can come from additional writers (since there's only the one consumer). Similarly, because of the way the read/write stacks are swapped by the consumer, I should never see an ABA pattern in the producer's push operation.

However, I'm fairly green when it comes to lockless programming, and we all know how pernicious and subtle lockless bugs can be - so I'd really appreciate it if someone else could double-check my logic and make sure I'm not overlooking something.

Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.

Quote:
Original post by Promit
Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.


To quote another poster here whose name I can't recall: Visual Studio is pretty dope.

That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Quote:
Original post by InvalidPointer
That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Is there a popular C++ compiler that does?

Some comments:
1> Why in the world do your nodes not initialize to NULL in their constructor?
2> Why are you doing inline assembly in the middle of a many-line function -- why not write CAS as its own function and call that?

3> Why is anything to do with PendingReads volatile? And why don't you have pop() return the payload? (Oh, you are using a std::stack. Why not store a std::stack of MessageInfo*s? I.e., move the cruft away from the mainline of your code.)

4> Why is anything to do with Read volatile? You do swap things into Read, but you only access Read in a single-threaded manner.

Hmm. That's because of C++ not wanting to let you remove volatile implicitly, isn't it? Meh. Oh well.

So this is what I'd prefer to read/write:

template<typename T>
bool CompareAndSet( T* location, T setValue, T expectedValue )
{
// ASM goes here
}

class LocklessMailbox
{
// stores Node*s in a stack -- moves the single-threaded stuff out of the threading code
OneThreadNodeStack Read;

// A stack of the actual stuff we want to deliver. Once again, less crud in the tricky
// threading code:
std::stack<MessageInfo*> PendingReads;

// The WriteHead is the hard threading part:
typedef volatile Node NodeVol;
typedef NodeVol *pNodeVol;
typedef volatile pNodeVol pvNodeVol;
pvNodeVol WriteHead;
public:
LocklessMailbox()
{
WriteHead = new Node;
}

public:
// [SNIP]
MessageInfo* PopTopPendingReads() {
if (PendingReads.empty()) return NULL;
MessageInfo* retval = PendingReads.top();
PendingReads.pop();
return retval;
}

// notice that there is no more volatile in this function. Nor is there any memory management:
MessageInfo* GetMessage()
{
if(!PendingReads.empty())
{
return PopTopPendingReads();
}

if(Read.empty())
SwapReadAndWrite();

// pop_top() returns the data at the top of the Read stack,
// and returns NULL if the stack is empty. It deals with memory
// management:
while(MessageInfo* message = Read.pop_top())
{
PendingReads.push(message);
}

return PopTopPendingReads();
}

// here is the dangerous threading code:
void AppendMessage( MessageInfo* retval )
{
pNodeVol new_head = new Node( retval );
while (true) {
new_head->Next = WriteHead;
bool success = CompareAndSet<Node*>( &WriteHead, new_head, new_head->Next );
if (success) {
return;
}
}
}

void SwapReadAndWrite()
{
while (true)
{
Node* ReadHead = Read.GetHead();
Node* old_write = WriteHead;
bool success = CompareAndSet<Node*>( &WriteHead, ReadHead, old_write );
if (success) {
Read.SetHead( old_write );
return;
}
}
}
// [SNIP]
};



But this is just cleaning up the code.

On the other hand, if the code was cleaned up like this, I could find errors in your lock-free algorithm implementation far easier -- I wouldn't have to also look for memory allocation errors and skip over boilerplate code.

With only one consumer, once things are in the Read stack _they are no longer a lock-free algorithm problem_. And the same with PendingReads. Get the code that is not a lock-free algorithm problem as far away from your tricky lock-free algorithm code as is reasonably possible.

Another thing I did is I reduced the amount of state that is saved over the loops by as much as possible. This makes checking for correctness easier.

It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but MSDN seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free may just not be worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

Quote:
Original post by Red Ant
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but msdn seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free just may not be even worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

I've kinda noticed that too. I just posted something on lock-free queues myself :X
Here I thought there might be one or two people who had experimented with them and could offer advice. What I did not expect, however, was that 30 or so people were *also* trying them out and (in some cases) were hung up on the same issues.


DISCLAIMER: I'm not factoring thread spinning into the equation here; it's too implementation-dependent. All info presented here is AFAIK (read: done in academic happy-land) and may differ from what goes on 'in the real world.'

In regards to your assertion: in *theory,* lock-free algorithms modifying a single value should be roughly twice as fast with regard to synchronization overhead. Since you must both enter and exit a lock, that's two interlocked operations right there, neither of which does any real work. In practice, however, it can pretty much go all over the place. If you do every operation in an interlocked manner, it's likely going to be slower; locks would be a better option in that scenario. It depends on what you're trying to do and how frequently something's going to be modified.
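The "two interlocked operations per lock" point in miniature, as a sketch with modern std::mutex / std::atomic stand-ins (not code from the thread): guarding a counter with a mutex pays an atomic acquire plus an atomic release per increment, while the lock-free version pays a single atomic read-modify-write.

```cpp
#include <atomic>
#include <mutex>

std::mutex g_mutex;
long g_lockedCounter = 0;
std::atomic<long> g_lockFreeCounter(0);

void IncrementLocked()
{
    // One interlocked op to acquire, one to release, plus the real work.
    std::lock_guard<std::mutex> guard(g_mutex);
    ++g_lockedCounter;
}

void IncrementLockFree()
{
    // A single interlocked add does the work and the synchronization at once.
    g_lockFreeCounter.fetch_add(1);
}
```

Whether that factor of two survives contention, cache-line ping-pong, and spinning is exactly the "all over the place" part.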

I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

It's taken a lot of careful thought, but I've got a mostly lockless system going; the only locks are when threads are started or stopped. During a thread start/stop, all access to message mailboxes is blocked. However, concurrent message passes from separate threads will not block each other. This means we can sustain a high throughput of messages, without sacrificing safety during thread init/cleanup code.

In general, IMHO lock-free code is totally not worth the effort. You have to have some ridiculous contention on a locked resource before it really becomes a performance issue, but up until that point it's much easier to get locking right than it is to get lock-free right.



NotAYakk - thanks for the feedback. I'll post up an improved version of the code shortly.

Right, here's the current implementation, straight from the Epoch language project:


//
// Helper function for atomic compare-and-swap
// Returns true on success
//
template <class DataType>
bool CompareAndSwap(DataType* field, DataType oldvalue, DataType newvalue)
{
DataType retval;

_asm
{
mfence
mov eax, oldvalue
mov ecx, newvalue
mov edx, dword ptr [field]

lock cmpxchg dword ptr[edx], ecx
mov retval, eax
mfence
}

return (retval == oldvalue);
}


//
// This class encapsulates a lockless mailbox algorithm for asynchronous message passing.
// Each thread is granted a mailbox when it is started up; this mailbox is used for all
// messages passed into that thread. Messages are dequeued in FIFO order.
//
// IMPORTANT: this algorithm is only safe for a many producer/single consumer scenario.
// Critical assumptions made by the code are only valid if a single consumer
// thread is used. However, any number of producer threads is permitted.
//
// The algorithm uses three distinct stacks: a read stack, a write stack, and a cache.
// Producers push items onto the write stack using a simple and well-known CAS method.
// When the consumer thread arrives to retrieve a message, it first checks the cache
// for any waiting messages. If the cache is empty, the consumer thread swaps the read
// and write stacks. At this point, the read stack is controlled entirely by the consumer
// thread, so we don't need to worry about contention for elements in the read stack.
// The entire read stack is then traversed and its contents pushed into the cache.
// Note that we achieve FIFO semantics only by using this extra cache stack.
//
template <class PayloadType>
class LocklessMailbox
{
// Construction and destruction
public:
//
// Construct and initialize the read and write stacks
//
LocklessMailbox()
{
std::auto_ptr<Node> wh(new Node);
std::auto_ptr<Node> rh(new Node);

WriteHead = wh.release();
ReadHead = rh.release();
}

//
// Clean up the read, write, and cache stacks, freeing any
// remaining messages from each stack.
//
~LocklessMailbox()
{
Release();
}

// Message passing interface
public:

//
// Register a message from a producer thread. Any number of threads
// may call this function.
//
void AddMessage(PayloadType* info)
{
Node* msgnode = new Node(info);
while(true)
{
msgnode->Next = WriteHead;
bool success = CompareAndSwap<Node*>(&WriteHead, msgnode->Next, msgnode);
if(success)
return;
}
}

//
// Retrieve a message from the mailbox.
// IMPORTANT: only ONE consumer thread (per mailbox) should call this function
//
PayloadType* GetMessage()
{
if(!PendingReads.empty())
return PopPendingRead();

if(ReadHead->Next == NULL)
SwapReadAndWrite();

Node* n = ReadHead;
while(ReadHead->Next)
{
PendingReads.push(n);
n = n->Next;
ReadHead = n;
}

if(PendingReads.empty())
return NULL;

return PopPendingRead();
}

// Internal helpers
private:

//
// Pop a pending read from the cache stack
//
PayloadType* PopPendingRead()
{
Node* readnode = PendingReads.top();
PayloadType* payload = readnode->Payload;
delete readnode;
PendingReads.pop();
return payload;
}

//
// Internal helper for swapping the read/write stacks
// See class comment for details
//
void SwapReadAndWrite()
{
while(true)
{
Node* readhead = ReadHead;
Node* oldwrite = WriteHead;
bool success = CompareAndSwap<Node*>(&WriteHead, oldwrite, readhead);
if(success)
{
ReadHead = oldwrite;
return;
}
}
}

//
// Release the mailbox and free any remaining messages
//
void Release()
{
while(!PendingReads.empty())
{
Node* readnode = PendingReads.top();
PendingReads.pop();
delete readnode->Payload;
delete readnode;
}

Node* n = WriteHead;
while(n)
{
delete n->Payload;
Node* nextn = n->Next;
delete n;
n = nextn;
}

n = ReadHead;
while(n)
{
delete n->Payload;
Node* nextn = n->Next;
delete n;
n = nextn;
}
}

// Internal tracking
private:
struct Node
{
Node() : Next(NULL), Payload(NULL) { }
Node(PayloadType* p) : Next(NULL), Payload(p) { }

Node* Next;
PayloadType* Payload;
};

Node* WriteHead;
Node* ReadHead;
std::stack<Node*> PendingReads;
};




The only thing I'm not sure of at this point is the memory fences; but that should be trivial to fix now that the CAS operation is centralized.
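For comparison, here is a portable restatement of the centralized CompareAndSwap above using std::atomic (a sketch only; std::atomic postdates this 2009 code). The default sequentially-consistent ordering of compare_exchange_strong subsumes the explicit mfence pair in the inline-assembly version, and it is not limited to 32-bit operands:

```cpp
#include <atomic>

template <class DataType>
bool CompareAndSwap(std::atomic<DataType>& field, DataType oldvalue, DataType newvalue)
{
    // compare_exchange_strong writes the observed value back into
    // 'oldvalue' on failure; the original interface returns a plain
    // bool, so we just discard that.
    return field.compare_exchange_strong(oldvalue, newvalue);
}
```

The field would need to be declared std::atomic<Node*> rather than a raw volatile pointer, which also resolves the volatile questions raised earlier in the thread.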

If you potentially have more than one thread per hardware thread, lock-free algorithms let you avoid priority inversion, and they also scale better under high contention. Anyway, it's debatable whether on many-core architectures lock-free algorithms will bring any benefit, so I guess we just have to wait until we get our hands on those to see how it turns out in practice (: But once you have gone to the trouble of writing lock-free algorithms, they are trivial to convert to algorithms that use optimally short locks, so you can see the difference.

Quote:
Original post by JarkkoL
Anyway, it's debatable if in many-core architectures lock-free algorithms would bring any benefit, so I guess we just have to wait to get our hands on those to see how it turns out in practice (:


Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Regards
elFarto

Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has been actually tested on such a system? Which system is that and is there some stats and details about the test? I would be very interested to see this.

Quote:
Original post by JarkkoL
Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has been actually tested on such a system? Which system is that and is there some stats and details about the test? I would be very interested to see this.

Yes. The guy that wrote it (Cliff Click) works for Azul, and they have 768- and 896-core machines. Here is a blog post he made about it.

Regards
elFarto

Quote:
I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

That's a nice theory, but unless you have actual profiler results to show, it is just speculation.

To play devil's advocate for a moment, if you use a lock judiciously (eg, only lock after you have created the message objects and can immediately write and let go) the difference might be negligible in practice.

However, multiple producers is a significant complication. Unless there's some crazy number of possible producers, why not just use multiple lockless queues? They're easy to do without x86 assembly tricks.
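The per-producer lockless queue the poster alludes to is usually a single-producer/single-consumer ring buffer: each index is written by only one side, so no CAS is needed at all. A hedged sketch (illustrative, not from the thread; one slot is sacrificed to distinguish full from empty):

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t Capacity>
class SpscQueue
{
    T Items[Capacity];
    std::atomic<size_t> WriteIndex; // written only by the producer
    std::atomic<size_t> ReadIndex;  // written only by the consumer

public:
    SpscQueue() : WriteIndex(0), ReadIndex(0) { }

    bool TryPush(const T& item)     // producer thread only
    {
        size_t w = WriteIndex.load(std::memory_order_relaxed);
        size_t next = (w + 1) % Capacity;
        if(next == ReadIndex.load(std::memory_order_acquire))
            return false;           // full (one slot kept empty)
        Items[w] = item;
        WriteIndex.store(next, std::memory_order_release);
        return true;
    }

    bool TryPop(T& out)             // consumer thread only
    {
        size_t r = ReadIndex.load(std::memory_order_relaxed);
        if(r == WriteIndex.load(std::memory_order_acquire))
            return false;           // empty
        out = Items[r];
        ReadIndex.store((r + 1) % Capacity, std::memory_order_release);
        return true;
    }
};
```

The single consumer then just polls each producer's queue in turn, which preserves per-producer FIFO order without any interlocked instructions.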

Quote:
Original post by elFarto
Yes. The guy that wrote it (Cliff Click) works for Azul, and they have a 768 and 896 core machines.

Would be nice to have such a system for testing my multi-threading stuff ;) I haven't implemented a lock-free hash map myself, but I believe that due to the nature of a hash map, contention is significantly reduced when you have a large number of stripes. Cliff uses a 4096-way hash map, so even with 768 CPUs there aren't, on average, many CPUs competing for a single stripe. This is quite different from a data structure where all the CPUs access the same variable (e.g. a single LIFO queue), which would really demonstrate how lock-free scales in comparison to e.g. short spinlocks in high-contention cases.

