Sign in to follow this  

Unity Double-check my lock-free voodoo

This topic is 3182 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I've been working on a lockless message passing system for the last couple of days, and finally came up with what seems like a pretty solid implementation. First, some important background: this is designed for a one consumer, multiple producer scenario. It will assuredly not work in a multiple-consumer situation. However, from my checks and logic, it should be fully safe as long as only one thread is consuming data. The theory is very simple. We maintain three stacks. The first stack is the "write stack." All producers append their messages to the write stack with a simple CAS-based method. The second stack is the "read stack." This stack is only visible to the consumer; all work by the consumer is done on the read stack. It is effectively not aware of the write stack. The third stack is a little bit of a hack. I want this system to have FIFO semantics, but the write stack is implemented as a LIFO stack. Therefore, the read process uses this third stack to essentially reverse messages that it reads. Consuming a message means popping an item off this third stack. Anyone with basic familiarity with lock-free structures should be familiar with the CAS LIFO stack, so I won't bother explaining it. The consumer half is a bit wonky, though, so here's the logic. (I make heavy use of a "sentinel" node which is just a node in either the read or write stack that has no attached data payload. This sentinel is basically the "end of list" marker.)
  • If the read head is the sentinel node, swap the pointers to the read and write stacks, and proceed as usual
  • Traverse the read stack (which is basically a singly linked list at this point) and push each item onto the third stack as it is visited.
  • If the third stack is empty, return NULL, indicating no pending messages
  • Pop the first item off the third stack, clean up some memory usage, and return it to the caller as the current message
I've reviewed this logic repeatedly, and stress-tested the implementation under 30 threads of load on my dual-core workstation. As near as I can tell the code is free of potential problems, primarily because of the magic constraint that we only have one consumer. So here's the code: [edit] Removed the original code; please see the updated version here. [/edit] I welcome any comments, suggestions, and/or gentle explanations that I am a moron and that this doesn't actually work [grin] [Edited by - ApochPiQ on March 20, 2009 6:22:50 PM]

Share this post


Link to post
Share on other sites
What level of portability are you going for? volatile gives you different guarantees depending on the compiler you're using (or even the *version* of the compiler you're using). I'd try not to use it at all and instead insert the barriers explicitly.

I'm not qualified to comment much further than that on the nitty gritty. You have some exception safety issues, though, unless you're assuming new will never throw (which isn't the case on Windows, for example).

Share this post


Link to post
Share on other sites
I thought cmpxchg8b is quite widely supported on 32-bit x86 platforms. Not sure how well cmpxchg16b is supported for 64-bit platforms though. Do you have something to back up the "CAS2 is not widely supported" claim, e.g. a supported instruction table or something?

Share this post


Link to post
Share on other sites
I was looking back through my notes and intertron history and realized that all the sources I have regarding CAS2 on 64-bit processors were pretty old, and likely outdated. Seems like cmpxchg16b is pretty common now.

In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]


Yeah, exception safety is a bit of a problem here; I'll be tuning that up along with lots of other code that needs it.

Manual memory fencing also sounds like a good idea. I'll add that to my list and hopefully get it hammered out pretty quickly.


Thanks all for the feedback thus far; although I'm still curious if anyone can find a problem in the implementation...

Share this post


Link to post
Share on other sites
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]

Share this post


Link to post
Share on other sites
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.

Share this post


Link to post
Share on other sites
Quote:
Original post by ApochPiQ
In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]

I haven't checked through your code, so I can't tell if there are issues with it, but one thing you might want to try to do is to reproduce the ABA issue, which is why CAS2 is often used. The problem may rise very rarely, thus it's difficult to test empirically. I implemented a lock-free FIFO queue in Spin-X Engine based on "Optimized Lock-Free FIFO Queue continued" paper though, and it's not terribly difficult.

Share this post


Link to post
Share on other sites
I've walked through every combination of partial execution/preemption and can't find a case where ABA occurs. This is primarily because of the way the single consumer works - I have a guarantee during the SwapReadAndWrite operation that the only preemption can come from additional writers (since there's only the one consumer). Similarly, because of the way the read/write stacks are swapped by the consumer, I should never see an ABA pattern in the producer's push operation.

However, I'm fairly green when it comes to lockless programming, and we all know how pernicious and subtle lockless bugs can be - so I'd really appreciate it if someone else could double-check my logic and make sure I'm not overlooking something.

Share this post


Link to post
Share on other sites
Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.

Share this post


Link to post
Share on other sites
Quote:
Original post by Promit
Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.


To quote another poster here whose name I can't recall: Visual Studio is pretty dope.

That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Share this post


Link to post
Share on other sites
Quote:
Original post by InvalidPointer
That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Is there a popular C++ compiler that does?

Share this post


Link to post
Share on other sites
Some comments:
1> Why in the world do your nodes not initialize to NULL in their constructor?
2> Why are you doing inline assembly in the middle of a many-line function -- why not write CAS and call the function?

3> Why is anything to do with PendingReads volatile? And why don't you have pop() return the payload? (oh you are using a std::stack. Why not store a std::stack of MessageInfo*s? Ie, move the cruft away from the mainline of your code)

4> Why is anything to do with Read volatile? You do swap things into read, but you only access Read in a single-threaded manner.

Hmm. That's because of C++ not wanting to let you remove volatile implicitly, isn't it? Meh. Oh well.

So this is what I'd prefer to read/write:

template<typename T>
bool CompareAndSet( T* location, T setValue, T expectedValue )
{
// ASM goes here
}

class LocklessMailbox
{
// stores Node*s in a stack -- moves the single-threaded stuff out of the threading code
OneThreadNodeStack Read;

// A stack of the actual stuff we want to deliver. Once again, less crud in the tricky
// threading code:
std::stack<MessageInfo*> PendingReads;

// The WriteHead is the hard threading part:
typedef volatile Node NodeVol;
typedef NodeVol *pNodeVol;
typedef volatile pNodeVol pvNodeVol;
pvNodeVol WriteHead;
public:
LocklessMailbox()
{
WriteHead = new Node;
}

public:
// [SNIP]
MessageInfo* PopTopPendingReads() {
if (PendingReads.empty()) return NULL;
MessageInfo* retval = PendingReads.top();
PendingReads.pop();
return retval;
}

// notice that there is no more volatile in this function. Nor is there any memory management:
MessageInfo* GetMessage()
{
if(!PendingReads.empty())
{
return PopTopPendingReads();
}

if(Read.empty())
SwapReadAndWrite();

// pop_top() returns the data at the top of the Read stack,
// and returns NULL if the stack is empty. It deals with memory
// management:
while(MessageInfo* message = Read.pop_top())
{
PendingReads.push(message);
}

return PendingReads.pop_top();
}

// here is the dangerous threading code:
void AppendMessage( MessageInfo* retval )
{
pNodeVol new_head = new Node( retval );
while (true) {
new_head->Next = WriteHead;
bool success = CompareAndSet<Node*>( &WriteHead, new_head, new_head->Next );
if (success) {
return;
}
}
}

void SwapReadAndWrite()
{
while (true)
{
Node* ReadHead = Read.GetHead();
Node* old_write = WriteHead;
bool success = CompareAndSet<Node*>( &WriteHead, ReadHead, old_write );
if (success) {
Read.SetHead( old_write );
return;
}
}
}
// [SNIP]
};



But this is just cleaning up the code.

On the other hand, if the code was cleaned up like this, I could find errors in your lock-free algorithm implementation far easier -- I wouldn't have to also look for memory allocation errors and skip over boilerplate code.

With only one consumer, once things are in the Read stack _they are no longer a lock-free algorithm problem_. And the same with PendingReads. Get the code that is not a lock-free algorithm problem as far away from your tricky lock-free algorithm code as is reasonably possible.

Another thing I did is I reduced the amount of state that is saved over the loops by as much as possible. This makes checking for correctness easier.

Share this post


Link to post
Share on other sites
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but msdn seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free just may not be even worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

Share this post


Link to post
Share on other sites
Quote:
Original post by Red Ant
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but msdn seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free just may not be even worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

I've kinda noticed that too. I just posted something on lock-free queues myself :X
Here I thought there might be one, two people that had experimented with them and could offer advice. What I did not expect, however, was the fact that 30 or so people were *also* trying them out and (in some cases) were hung up on the same issues.


DISCLAIMER: I'm not factoring thread spinning into the equation here; it's too implementation-dependent. All info presented here is AFAIK (read: done in academic happy-land) and may differ from what goes on 'in the real world.'

In regards to your assertion: In *theory,* lock-free algorithms modifying a single value should roughly be twice as fast in regards to synchronization overhead. Since you must both enter and exit a lock, that's two interlocked operations right there, neither of which do any real work. In practice, however, it can pretty much go all over the place. If you do every operation in an interlocked manner, it's likely going to be slower; locks would be a better option in that scenario. Depends on what you're trying to do and how frequently something's going to be modified.

Share this post


Link to post
Share on other sites
I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

It's taken a lot of careful thought, but I've got a mostly lockless system going; the only locks are when threads are started or stopped. During a thread start/stop, all access to message mailboxes is blocked. However, concurrent message passes from separate threads will not block each other. This means we can sustain a high throughput of messages, without sacrificing safety during thread init/cleanup code.

In general, IMHO lock-free code is totally not worth the effort. You have to have some ridiculous contention on a locked resource before it really becomes a performance issue, but up until that point it's much easier to get locking right than it is to get lock-free right.



NotAYakk - thanks for the feedback. I'll post up an improved version of the code shortly.

Share this post


Link to post
Share on other sites
Right, here's the current implementation, straight from the Epoch language project:


//
// Helper function for atomic compare-and-swap
// Returns true on success
//
template <class DataType>
bool CompareAndSwap(DataType* field, DataType oldvalue, DataType newvalue)
{
DataType retval;

_asm
{
mfence
mov eax, oldvalue
mov ecx, newvalue
mov edx, dword ptr [field]

lock cmpxchg dword ptr[edx], ecx
mov retval, eax
mfence
}

return (retval == oldvalue);
}


//
// This class encapsulates a lockless mailbox algorithm for asynchronous message passing.
// Each thread is granted a mailbox when it is started up; this mailbox is used for all
// messages passed into that thread. Messages are dequeued in FIFO order.
//
// IMPORTANT: this algorithm is only safe for a many producer/single consumer scenario.
// Critical assumptions made by the code are only valid if a single consumer
// thread is used. However, any number of producer threads is permitted.
//
// The algorithm uses three distinct stacks: a read stack, a write stack, and a cache.
// Producers push items onto the write stack using a simple and well-known CAS method.
// When the consumer thread arrives to retrieve a message, it first checks the cache
// for any waiting messages. If the cache is empty, the consumer thread swaps the read
// and write stacks. At this point, the read stack is controlled entirely by the consumer
// thread, so we don't need to worry about contention for elements in the read stack.
// The entire read stack is then traversed and its contents pushed into the cache.
// Note that we achieve FIFO semantics only by using this extra cache stack.
//
template <class PayloadType>
class LocklessMailbox
{
// Construction and destruction
public:
//
// Construct and initialize the read and write stacks
//
LocklessMailbox()
{
std::auto_ptr<Node> wh(new Node);
std::auto_ptr<Node> rh(new Node);

WriteHead = wh.release();
ReadHead = rh.release();
}

//
// Clean up the read, write, and cache stacks, freeing any
// remaining messages from each stack.
//
~LocklessMailbox()
{
Release();
}

// Message passing interface
public:

//
// Register a message from a producer thread. Any number of threads
// may call this function.
//
void AddMessage(PayloadType* info)
{
Node* msgnode = new Node(info);
while(true)
{
msgnode->Next = WriteHead;
bool success = CompareAndSwap<Node*>(&WriteHead, msgnode->Next, msgnode);
if(success)
return;
}
}

//
// Retrieve a message from the mailbox.
// IMPORTANT: only ONE consumer thread (per mailbox) should call this function
//
PayloadType* GetMessage()
{
if(!PendingReads.empty())
return PopPendingRead();

if(ReadHead->Next == NULL)
SwapReadAndWrite();

Node* n = ReadHead;
while(ReadHead->Next)
{
PendingReads.push(n);
n = n->Next;
ReadHead = n;
}

if(PendingReads.empty())
return NULL;

return PopPendingRead();
}

// Internal helpers
private:

//
// Pop a pending read from the cache stack
//
PayloadType* PopPendingRead()
{
Node* readnode = PendingReads.top();
PayloadType* payload = readnode->Payload;
delete readnode;
PendingReads.pop();
return payload;
}

//
// Internal helper for swapping the read/write stacks
// See class comment for details
//
void SwapReadAndWrite()
{
while(true)
{
Node* readhead = ReadHead;
Node* oldwrite = WriteHead;
bool success = CompareAndSwap<Node*>(&WriteHead, oldwrite, readhead);
if(success)
{
ReadHead = oldwrite;
return;
}
}
}

//
// Release the mailbox and free any remaining messages
//
void Release()
{
for(std::stack<Node*>::container_type::iterator iter = PendingReads.c.begin(); iter != PendingReads.c.end(); ++iter)
{
delete (*iter)->Payload;
delete *iter;
}

Node* n = WriteHead;
while(n)
{
delete n->Payload;
Node* nextn = n->Next;
delete n;
n = nextn;
}

n = ReadHead;
while(n)
{
delete n->Payload;
Node* nextn = n->Next;
delete n;
n = nextn;
}
}

// Internal tracking
private:
struct Node
{
Node() : Next(NULL), Payload(NULL) { }
Node(PayloadType* p) : Next(NULL), Payload(p) { }

Node* Next;
PayloadType* Payload;
};

Node* WriteHead;
Node* ReadHead;
std::stack<Node*> PendingReads;
};




The only thing I'm not sure of at this point is the memory fences; but that should be trivial to fix now that the CAS operation is centralized.

Share this post


Link to post
Share on other sites
If you got potentially more than one thread per HW thread, with lock-free algorithms you avoid priority inversion and it also scales better with high contention. Anyway, it's debatable if in many-core architectures lock-free algorithms would bring any benefit, so I guess we just have to wait to get our hands on those to see how it turns out in practice (: But if you have had the trouble of making lock-free algorithms, they are trivial to convert to algorithms that use optimally short locks to see the difference.

Share this post


Link to post
Share on other sites
Quote:
Original post by JarkkoL
Anyway, it's debatable if in many-core architectures lock-free algorithms would bring any benefit, so I guess we just have to wait to get our hands on those to see how it turns out in practice (:


Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Regards
elFarto

Share this post


Link to post
Share on other sites
Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has been actually tested on such a system? Which system is that and is there some stats and details about the test? I would be very interested to see this.

Share this post


Link to post
Share on other sites
Quote:
Original post by JarkkoL
Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has been actually tested on such a system? Which system is that and is there some stats and details about the test? I would be very interested to see this.

Yes. The guy that wrote it (Cliff Click) works for Azul, and they have a 768 and 896 core machines. Here is a blog post he made about it.

Regards
elFarto

Share this post


Link to post
Share on other sites
Quote:
I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

That's nice speculation, but unless you have actual profiler results to show, it is just speculation.

To play devil's advocate for a moment, if you use a lock judiciously (eg, only lock after you have created the message objects and can immediately write and let go) the difference might be negligible in practice.

However, multiple producers is a significant complication. Unless there's some crazy number of possible producers, why not just use multiple lockless queues? They're easy to do without x86 assembly tricks.

Share this post


Link to post
Share on other sites
Quote:
Original post by elFarto
Yes. The guy that wrote it (Cliff Click) works for Azul, and they have a 768 and 896 core machines.

Would be nice to have such a system for testing my multi-threading stuff ;) I haven't implemented lock-free hash map myself, but I believe that due to the nature of the hash map the contention is significantly reduced when you got large number of strips. Cliff uses 4096-way hash map so even with 768 cpu's there aren't many cpu's in average competing on a single strip. This is quite different when you are using a data structure where all the cpu's access the same variable (e.g. single lifo queue) and which would really demonstrate how lock-free scales in comparison to e.g. short spinlocks in high contention cases.

Share this post


Link to post
Share on other sites

This topic is 3182 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this  

  • Forum Statistics

    • Total Topics
      628658
    • Total Posts
      2984078
  • Similar Content

    • By arash khalaqhdoust
      hey guys i hope you doing all well. last night i released my first game in google app store, i really appreciate you guys  to download it. and share your reviews about it
      the idea of game comes from mini hackgame of Bioshock.
       link of download:
      https://play.google.com/store/apps/details?id=com.RVBinary.piperist
      many thanks
    • By ForgedInteractive
      Who We Are
      We are Forged Interactive, a small team of like-minded game developers with the sole purpose of making games we love! Currently, we're progressing very quickly with our first project and there are plenty of opportunities and work for new interested programmers. With this project, our development platform is Unity 5.5.2 and C# as our behavioral language. Since this project is our first release, the game itself is a smaller project though progress is moving quickly. We are looking to finalize the current project and get started on future projects in the near future and are expanding our team to do so.
       
      Who We Are Looking For:
      Programmer Level Designer  
      About the Game
      Ours is the tale of two siblings, thrown into a world of chaos. Living in the shadow of their parents' heroic deeds and their Uncle's colorful military career, Finn and Atia are about to become the next force to shape our world. How will you rise through the ranks of Hereilla and what will be your legacy? Once defeated your enemies turn coat and join you in your adventures. Players can enjoy a range of troops and abilities based on their gameplay style which become more important as maps introduce more challenging terrain, enemies and bosses. Strong orc knights, dangerous shamans, and even a dragon are out on the prowl. Knowing when to fight and when to run, and how to manage your army is essential. Your actions alone decide the fate of this world.
       
      Previous Work by Team
      Although we are working towards our first game as Forged Interactive, our team members themselves have worked on titles including and not limited to:
      Final Fantasy Kingsglaive FIFA 2017 Xcom 2 Civilization  
      What do we expect?
      Reference work or portfolio. Examples what have you already done and what projects you have worked on academic or otherwise. The ability to commit to the project on a regular basis. If you are going on a two-week trip, we don't mind, but it would be good if you could commit 10+ hours to the project each week. Willingness to work with a royalty based compensation model, you will be paid when the game launches. Openness to learning new tools and techniques
       
      What can we offer?
      Continuous support and availability from our side. You have the ability to give design input, and creative say in the development of the game. Shown in credits on websites, in-game and more. Insight and contacts from within the Industry.
       
      Contact
      If you are interested in knowing more or joining, please email or PM us on Skype. A member of our management team will reply to you within 48 hours.
       
      E-mail: Recruitment@ForgedInteractive.com
      Skype: ForgedInteractive
       
      Regards,
      David, Colin and Joseph
       
      Follow us on:
      Facebook: https://www.facebook.com/ForgedInteractive/
      Twitter: @ForgedInteract
      Youtube: https://www.youtube.com/channel/UCpK3zhq5ToOeDpdI0Eik-Ug?view_as=subscriber
      Reddit: www.reddit.com/user/Forged_Interactive

    • By dell96
      I'm trying to make my first project but I'm stuck i don't know how to make my crate to start to spawn again when i hit the start button after i die.
      hoping someone can help!!!
      Crate.cs
      CrateSpawn.cs
      Cratework.cs
      GameController.cs
      GameManager.cs
  • Popular Now