
Double-check my lock-free voodoo


I've been working on a lockless message passing system for the last couple of days, and finally came up with what seems like a pretty solid implementation.

First, some important background: this is designed for a single-consumer, multiple-producer scenario. It will assuredly not work in a multiple-consumer situation. However, from my checks and logic, it should be fully safe as long as only one thread is consuming data.

The theory is very simple. We maintain three stacks. The first stack is the "write stack." All producers append their messages to the write stack with a simple CAS-based method. The second stack is the "read stack." This stack is only visible to the consumer; all work by the consumer is done on the read stack, and the consumer is effectively not aware of the write stack. The third stack is a little bit of a hack. I want this system to have FIFO semantics, but the write stack is implemented as a LIFO stack. Therefore, the read process uses this third stack to essentially reverse messages that it reads. Consuming a message means popping an item off this third stack.

Anyone with basic familiarity with lock-free structures should be familiar with the CAS LIFO stack, so I won't bother explaining it. The consumer half is a bit wonky, though, so here's the logic. (I make heavy use of a "sentinel" node, which is just a node in either the read or write stack that has no attached data payload. This sentinel is basically the "end of list" marker.)
  • If the read head is the sentinel node, swap the pointers to the read and write stacks, and proceed as usual
  • Traverse the read stack (which is basically a singly linked list at this point) and push each item onto the third stack as it is visited.
  • If the third stack is empty, return NULL, indicating no pending messages
  • Pop the first item off the third stack, clean up some memory usage, and return it to the caller as the current message
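For anyone who wants to see the whole scheme in one place, here's a compact re-sketch in later-C++ (C++11) terms. The names are illustrative, not from the actual implementation, and `std::atomic::exchange` stands in for the sentinel-plus-CAS swap described above; the single-consumer restriction is exactly the same.

```cpp
#include <atomic>
#include <stack>

// Minimal single-consumer / multi-producer mailbox sketch.
template <typename T>
class Mailbox
{
    struct Node { T value; Node* next; };

    std::atomic<Node*> writeHead;  // the write stack: producers CAS-push here (LIFO)
    Node* readHead;                // the read stack: touched by the consumer only
    std::stack<T> pending;         // the third stack: reverses LIFO into FIFO

public:
    Mailbox() : writeHead(nullptr), readHead(nullptr) {}

    // Any number of producer threads may call this.
    void Put(const T& v)
    {
        Node* n = new Node{v, writeHead.load(std::memory_order_relaxed)};
        // classic CAS push: retry until our node becomes the head
        while (!writeHead.compare_exchange_weak(
                   n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }

    // Exactly ONE consumer thread may call this.
    bool Get(T& out)
    {
        if (pending.empty()) {
            if (!readHead)  // read stack exhausted: take over the write stack
                readHead = writeHead.exchange(nullptr, std::memory_order_acquire);
            while (readHead) {  // reverse the LIFO list into the pending stack
                pending.push(readHead->value);
                Node* next = readHead->next;
                delete readHead;
                readHead = next;
            }
        }
        if (pending.empty()) return false;  // no pending messages
        out = pending.top();
        pending.pop();
        return true;
    }
};
```

Producers call `Put` from any thread; only one thread may ever call `Get`.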
I've reviewed this logic repeatedly, and stress-tested the implementation under 30 threads of load on my dual-core workstation. As near as I can tell, the code is free of potential problems, primarily because of the magic constraint that we only have one consumer. So here's the code: [edit] Removed the original code; please see the updated version here. [/edit] I welcome any comments, suggestions, and/or gentle explanations that I am a moron and that this doesn't actually work [grin] [Edited by - ApochPiQ on March 20, 2009 6:22:50 PM]

Is there a particular reason why you don't just use a lock-free FIFO queue, which is also safe for the multiple-producer, multiple-consumer case?

Every lockless FIFO I've seen requires not just CAS but CAS2; since CAS2 is not widely supported (especially in 64-bit land), I've opted to try and stick to CAS for portability.

What level of portability are you going for? volatile gives you different guarantees depending on the compiler you're using (or even the *version* of the compiler you're using). I'd try not to use it at all and instead insert the barriers explicitly.
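For illustration, here is what "insert the barriers explicitly" looks like with standard atomics rather than volatile (a C++11 sketch, so it postdates this thread): volatile in ISO C++ says nothing about inter-thread ordering, and MSVC historically bolted acquire/release semantics onto it as an extension, which is exactly the compiler-specific behavior being warned about.

```cpp
#include <atomic>

std::atomic<int>  g_data(0);
std::atomic<bool> g_ready(false);

void Publish(int value)
{
    g_data.store(value, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // explicit barrier
    g_ready.store(true, std::memory_order_relaxed);
}

bool TryConsume(int& out)
{
    if (!g_ready.load(std::memory_order_relaxed))
        return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    out = g_data.load(std::memory_order_relaxed);
    return true;
}
```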

I'm not qualified to comment much further than that on the nitty gritty. You have some exception safety issues, though, unless you're assuming new will never throw (which isn't the case on Windows, for example).

I thought cmpxchg8b is quite widely supported on 32-bit x86 platforms. Not sure how well cmpxchg16b is supported for 64-bit platforms though. Do you have something to back up the "CAS2 is not widely supported" claim, e.g. a supported instruction table or something?

I was looking back through my notes and intertron history and realized that all the sources I have regarding CAS2 on 64-bit processors were pretty old, and likely outdated. Seems like cmpxchg16b is pretty common now.

In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]


Yeah, exception safety is a bit of a problem here; I'll be tuning that up along with lots of other code that needs it.

Manual memory fencing also sounds like a good idea. I'll add that to my list and hopefully get it hammered out pretty quickly.


Thanks all for the feedback thus far; although I'm still curious if anyone can find a problem in the implementation...

Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]

Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.

Quote:
Original post by ApochPiQ
In any case, I've got a working algorithm (I think) and don't care to go through all the trouble of building another one [wink]

I haven't checked through your code, so I can't tell if there are issues with it, but one thing you might want to try is to reproduce the ABA problem, which is why CAS2 is often used. The problem may arise very rarely, which makes it difficult to test for empirically. I implemented a lock-free FIFO queue in Spin-X Engine based on the "Optimized Lock-Free FIFO Queue continued" paper, though, and it's not terribly difficult.
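For what it's worth, the hazard itself can be shown mechanically in a single thread, since CAS compares bits, not history. In the pointer case, "changed and changed back" typically means a node was popped, freed, and a new node happened to be reallocated at the same address. An illustrative sketch:

```cpp
#include <atomic>

// Mechanical, single-threaded rendition of the ABA hazard: a CAS only checks
// that the value is bit-for-bit what we last saw, not that nothing happened
// in between.
bool AbaDemo()
{
    std::atomic<int> head(1);      // stand-in for a head pointer, value "A"
    int observed = head.load();    // thread 1 reads A, then gets preempted...

    head.store(2);                 // ...other threads change A -> B...
    head.store(1);                 // ...and back to A (same bits, new state)

    // Thread 1 resumes: its CAS succeeds even though the structure changed
    // twice underneath it. A version-tagged pointer (CAS2) would fail here.
    return head.compare_exchange_strong(observed, 3);
}
```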

I've walked through every combination of partial execution/preemption and can't find a case where ABA occurs. This is primarily because of the way the single consumer works - I have a guarantee during the SwapReadAndWrite operation that the only preemption can come from additional writers (since there's only the one consumer). Similarly, because of the way the read/write stacks are swapped by the consumer, I should never see an ABA pattern in the producer's push operation.

However, I'm fairly green when it comes to lockless programming, and we all know how pernicious and subtle lockless bugs can be - so I'd really appreciate it if someone else could double-check my logic and make sure I'm not overlooking something.

Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.

Quote:
Original post by Promit
Quote:
Original post by Moomin
Quote:
Original post by Promit
Here's a fun one: Visual C++ provides no intrinsic for cmpxchg16b (unless they added it very recently), nor will it compile 64 bit assembly. How much do you love MASM? [grin]


Your information is slightly dated. _InterlockedCompareExchange128 was added late 2007.
Well, that was awfully nice of them. Glad to know I'm wrong.


To quote another poster here whose name I can't recall: Visual Studio is pretty dope.

That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Quote:
Original post by InvalidPointer
That being said, they *still* don't have atomic addition/subtraction, etc. for any type of variable. Way to go.

Is there a popular C++ compiler that does?

Some comments:
1> Why in the world do your nodes not initialize to NULL in their constructor?
2> Why are you doing inline assembly in the middle of a many-line function -- why not write CAS and call the function?

3> Why is anything to do with PendingReads volatile? And why don't you have pop() return the payload? (oh you are using a std::stack. Why not store a std::stack of MessageInfo*s? Ie, move the cruft away from the mainline of your code)

4> Why is anything to do with Read volatile? You do swap things into read, but you only access Read in a single-threaded manner.

Hmm. That's because of C++ not wanting to let you remove volatile implicitly, isn't it? Meh. Oh well.

So this is what I'd prefer to read/write:

template<typename T>
bool CompareAndSet( T* location, T setValue, T expectedValue )
{
    // ASM goes here
}

class LocklessMailbox
{
    // stores Node*s in a stack -- moves the single-threaded stuff out of the threading code
    OneThreadNodeStack Read;

    // A stack of the actual stuff we want to deliver. Once again, less crud in the tricky
    // threading code:
    std::stack<MessageInfo*> PendingReads;

    // The WriteHead is the hard threading part:
    typedef volatile Node NodeVol;
    typedef NodeVol* pNodeVol;
    typedef volatile pNodeVol pvNodeVol;
    pvNodeVol WriteHead;

public:
    LocklessMailbox()
    {
        WriteHead = new Node;
    }

public:
    // [SNIP]
    MessageInfo* PopTopPendingReads()
    {
        if (PendingReads.empty()) return NULL;
        MessageInfo* retval = PendingReads.top();
        PendingReads.pop();
        return retval;
    }

    // notice that there is no more volatile in this function. Nor is there any memory management:
    MessageInfo* GetMessage()
    {
        if(!PendingReads.empty())
        {
            return PopTopPendingReads();
        }

        if(Read.empty())
            SwapReadAndWrite();

        // pop_top() returns the data at the top of the Read stack,
        // and returns NULL if the stack is empty. It deals with memory
        // management:
        while(MessageInfo* message = Read.pop_top())
        {
            PendingReads.push(message);
        }

        return PopTopPendingReads();
    }

    // here is the dangerous threading code:
    void AppendMessage( MessageInfo* retval )
    {
        pNodeVol new_head = new Node( retval );
        while (true)
        {
            new_head->Next = WriteHead;
            bool success = CompareAndSet<Node*>( &WriteHead, new_head, new_head->Next );
            if (success)
                return;
        }
    }

    void SwapReadAndWrite()
    {
        while (true)
        {
            Node* ReadHead = Read.GetHead();
            Node* old_write = WriteHead;
            bool success = CompareAndSet<Node*>( &WriteHead, ReadHead, old_write );
            if (success)
            {
                Read.SetHead( old_write );
                return;
            }
        }
    }
    // [SNIP]
};



But this is just cleaning up the code.

On the other hand, if the code was cleaned up like this, I could find errors in your lock-free algorithm implementation far easier -- I wouldn't have to also look for memory allocation errors and skip over boilerplate code.

With only one consumer, once things are in the Read stack _they are no longer a lock-free algorithm problem_. And the same with PendingReads. Get the code that is not a lock-free algorithm problem as far away from your tricky lock-free algorithm code as is reasonably possible.

Another thing I did is I reduced the amount of state that is saved over the loops by as much as possible. This makes checking for correctness easier.
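(The OneThreadNodeStack helper above is left undefined. Inferring its interface purely from the calls in the sketch -- empty, GetHead, SetHead, pop_top -- one plausible shape would be the following; the MessageInfo body is a stand-in, and the whole class is deliberately single-threaded, since the consumer owns these nodes outright once they've been swapped out of WriteHead:)

```cpp
#include <cstddef>

struct MessageInfo { int id; };  // stand-in for the real payload type

struct Node
{
    Node* next;
    MessageInfo* payload;  // NULL marks the sentinel / end-of-list node
    Node() : next(NULL), payload(NULL) {}
    Node(Node* n, MessageInfo* p) : next(n), payload(p) {}
};

// Guessed shape of the OneThreadNodeStack helper: no atomics or volatile
// needed anywhere, because only the single consumer ever touches it.
class OneThreadNodeStack
{
    Node* head;
public:
    OneThreadNodeStack() : head(NULL) {}

    bool  empty() const    { return head == NULL || head->payload == NULL; }
    Node* GetHead() const  { return head; }
    void  SetHead(Node* n) { head = n; }

    // Detach and free the top node, handing back its payload.
    // Returns NULL at the sentinel (or on an empty stack).
    MessageInfo* pop_top()
    {
        if (!head) return NULL;
        Node* n = head;
        MessageInfo* payload = n->payload;
        if (payload == NULL) return NULL;  // keep the sentinel in place
        head = n->next;
        delete n;
        return payload;
    }
};
```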

It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but MSDN seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free may simply not be worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

Quote:
Original post by Red Ant
It seems that everyone and their dog is going lock-free lately. Given that lock-free data structures and algorithms are a real PITA to get right, are you guys sure they'd actually gain you anywhere near as much as you think?

I haven't personally tried using lock-free algorithms and comparing the results to similar code using locks yet, but msdn seems to warn that, considering all the extra effort and the high risk of not getting it right, going lock-free just may not be even worth it in any particular case.

Anyone have any experience with lock-free code and how it compares to code using locks, performance-wise I mean?


P.S. Oops, forgot the link to the msdn article.

here

I've kinda noticed that too. I just posted something on lock-free queues myself :X
Here I thought there might be one or two people who had experimented with them and could offer advice. What I did not expect, however, was that 30 or so people were *also* trying them out and (in some cases) were hung up on the same issues.


DISCLAIMER: I'm not factoring thread spinning into the equation here; it's too implementation-dependent. All info presented here is AFAIK (read: done in academic happy-land) and may differ from what goes on 'in the real world.'

In regards to your assertion: in *theory,* lock-free algorithms modifying a single value should be roughly twice as fast in terms of synchronization overhead. Since you must both enter and exit a lock, that's two interlocked operations right there, neither of which does any real work. In practice, however, results can go all over the place. If you do every operation in an interlocked manner, it's likely going to be slower, and locks would be a better option in that scenario. It depends on what you're trying to do and how frequently something's going to be modified.
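A crude way to probe that overhead claim is to time N increments through a mutex against N relaxed atomic fetch-adds (C++11 notation). This is a sketch only: absolute numbers vary wildly by CPU, compiler, and contention level, so treat it as a probe, not proof.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>

// Time an arbitrary callable in nanoseconds.
template <typename F>
long long time_ns(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

long long counter = 0;
std::mutex counter_mutex;
std::atomic<long long> atomic_counter(0);

// n increments, each paying the full lock/unlock round trip.
long long run_locked(int n)
{
    return time_ns([n] {
        for (int i = 0; i < n; ++i) {
            std::lock_guard<std::mutex> g(counter_mutex);
            ++counter;
        }
    });
}

// n increments, each a single interlocked operation.
long long run_lockfree(int n)
{
    return time_ns([n] {
        for (int i = 0; i < n; ++i)
            atomic_counter.fetch_add(1, std::memory_order_relaxed);
    });
}
```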

I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

It's taken a lot of careful thought, but I've got a mostly lockless system going; the only locks are when threads are started or stopped. During a thread start/stop, all access to message mailboxes is blocked. However, concurrent message passes from separate threads will not block each other. This means we can sustain a high throughput of messages, without sacrificing safety during thread init/cleanup code.

In general, IMHO lock-free code is totally not worth the effort. You have to have some ridiculous contention on a locked resource before it really becomes a performance issue, and up until that point it's much easier to get locking right than it is to get lock-free right.



NotAYakk - thanks for the feedback. I'll post up an improved version of the code shortly.

Right, here's the current implementation, straight from the Epoch language project:


//
// Helper function for atomic compare-and-swap
// Returns true on success
//
template <class DataType>
bool CompareAndSwap(DataType* field, DataType oldvalue, DataType newvalue)
{
    DataType retval;

    _asm
    {
        mfence
        mov eax, oldvalue
        mov ecx, newvalue
        mov edx, dword ptr [field]

        lock cmpxchg dword ptr [edx], ecx
        mov retval, eax
        mfence
    }

    return (retval == oldvalue);
}


//
// This class encapsulates a lockless mailbox algorithm for asynchronous message passing.
// Each thread is granted a mailbox when it is started up; this mailbox is used for all
// messages passed into that thread. Messages are dequeued in FIFO order.
//
// IMPORTANT: this algorithm is only safe for a many producer/single consumer scenario.
//            Critical assumptions made by the code are only valid if a single consumer
//            thread is used. However, any number of producer threads is permitted.
//
// The algorithm uses three distinct stacks: a read stack, a write stack, and a cache.
// Producers push items onto the write stack using a simple and well-known CAS method.
// When the consumer thread arrives to retrieve a message, it first checks the cache
// for any waiting messages. If the cache is empty, the consumer thread swaps the read
// and write stacks. At this point, the read stack is controlled entirely by the consumer
// thread, so we don't need to worry about contention for elements in the read stack.
// The entire read stack is then traversed and its contents pushed into the cache.
// Note that we achieve FIFO semantics only by using this extra cache stack.
//
template <class PayloadType>
class LocklessMailbox
{
// Construction and destruction
public:
    //
    // Construct and initialize the read and write stacks
    //
    LocklessMailbox()
    {
        std::auto_ptr<Node> wh(new Node);
        std::auto_ptr<Node> rh(new Node);

        WriteHead = wh.release();
        ReadHead = rh.release();
    }

    //
    // Clean up the read, write, and cache stacks, freeing any
    // remaining messages from each stack.
    //
    ~LocklessMailbox()
    {
        Release();
    }

// Message passing interface
public:
    //
    // Register a message from a producer thread. Any number of threads
    // may call this function.
    //
    void AddMessage(PayloadType* info)
    {
        Node* msgnode = new Node(info);
        while(true)
        {
            msgnode->Next = WriteHead;
            bool success = CompareAndSwap<Node*>(&WriteHead, msgnode->Next, msgnode);
            if(success)
                return;
        }
    }

    //
    // Retrieve a message from the mailbox.
    // IMPORTANT: only ONE consumer thread (per mailbox) should call this function
    //
    PayloadType* GetMessage()
    {
        if(!PendingReads.empty())
            return PopPendingRead();

        if(ReadHead->Next == NULL)
            SwapReadAndWrite();

        Node* n = ReadHead;
        while(ReadHead->Next)
        {
            PendingReads.push(n);
            n = n->Next;
            ReadHead = n;
        }

        if(PendingReads.empty())
            return NULL;

        return PopPendingRead();
    }

// Internal helpers
private:
    //
    // Pop a pending read from the cache stack
    //
    PayloadType* PopPendingRead()
    {
        Node* readnode = PendingReads.top();
        PayloadType* payload = readnode->Payload;
        delete readnode;
        PendingReads.pop();
        return payload;
    }

    //
    // Internal helper for swapping the read/write stacks
    // See class comment for details
    //
    void SwapReadAndWrite()
    {
        while(true)
        {
            Node* readhead = ReadHead;
            Node* oldwrite = WriteHead;
            bool success = CompareAndSwap<Node*>(&WriteHead, oldwrite, readhead);
            if(success)
            {
                ReadHead = oldwrite;
                return;
            }
        }
    }

    //
    // Release the mailbox and free any remaining messages
    //
    void Release()
    {
        // Note: std::stack does not expose its underlying container, so we
        // drain the cache stack by popping rather than iterating it directly.
        while(!PendingReads.empty())
        {
            Node* pending = PendingReads.top();
            PendingReads.pop();
            delete pending->Payload;
            delete pending;
        }

        Node* n = WriteHead;
        while(n)
        {
            delete n->Payload;
            Node* nextn = n->Next;
            delete n;
            n = nextn;
        }

        n = ReadHead;
        while(n)
        {
            delete n->Payload;
            Node* nextn = n->Next;
            delete n;
            n = nextn;
        }
    }

// Internal tracking
private:
    struct Node
    {
        Node() : Next(NULL), Payload(NULL) { }
        Node(PayloadType* p) : Next(NULL), Payload(p) { }

        Node* Next;
        PayloadType* Payload;
    };

    Node* WriteHead;
    Node* ReadHead;
    std::stack<Node*> PendingReads;
};




The only thing I'm not sure of at this point is the memory fences, but that should be trivial to sort out now that the CAS operation is centralized.
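One note on those fences: on x86, `lock cmpxchg` is itself a full barrier, so the surrounding `mfence` instructions are redundant on that target. For comparison, in later C++ the whole helper collapses to a standard call where the ordering is stated explicitly instead of hand-placed (sketch, using C++11 atomics):

```cpp
#include <atomic>

// The same centralized CAS helper, expressed portably: the memory ordering
// is a parameter of the call rather than explicit fence instructions.
template <class DataType>
bool CompareAndSwap(std::atomic<DataType>* field, DataType oldvalue, DataType newvalue)
{
    // seq_cst mirrors the intent of the fully fenced inline-asm version;
    // acq_rel is usually sufficient for this mailbox's push/swap pattern.
    return field->compare_exchange_strong(oldvalue, newvalue,
                                          std::memory_order_seq_cst);
}
```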

If you potentially have more than one software thread per hardware thread, lock-free algorithms let you avoid priority inversion, and they also scale better under high contention. Anyway, it's debatable whether lock-free algorithms will bring any benefit on many-core architectures, so I guess we just have to wait to get our hands on those to see how it turns out in practice (: But once you've gone to the trouble of making lock-free algorithms, they are trivial to convert to algorithms that use optimally short locks, so you can see the difference for yourself.

Quote:
Original post by JarkkoL
Anyway, it's debatable if in many-core architectures lock-free algorithms would bring any benefit, so I guess we just have to wait to get our hands on those to see how it turns out in practice (:


Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Regards
elFarto

Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has actually been tested on such a system? Which system is that, and are there any stats and details about the test? I would be very interested to see this.

Quote:
Original post by JarkkoL
Quote:
Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing which is particularly impressive).

Do you mean it has been actually tested on such a system? Which system is that and is there some stats and details about the test? I would be very interested to see this.

Yes. The guy who wrote it (Cliff Click) works for Azul, and they have 768- and 896-core machines. Here is a blog post he made about it.

Regards
elFarto

Quote:
I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

That's nice speculation, but unless you have actual profiler results to show, it is just speculation.

To play devil's advocate for a moment, if you use a lock judiciously (eg, only lock after you have created the message objects and can immediately write and let go) the difference might be negligible in practice.
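For concreteness, the judicious-lock version might look like this (illustrative names, and a plain std::queue rather than the actual mailbox): the message is constructed entirely outside the critical section, so the mutex is held only for a single pointer push or pop.

```cpp
#include <mutex>
#include <queue>
#include <cstddef>

struct Message { int payload; };

class LockedMailbox
{
    std::mutex m;
    std::queue<Message*> q;
public:
    void Add(int payload)
    {
        Message* msg = new Message{payload};  // construct outside the lock
        std::lock_guard<std::mutex> g(m);     // lock spans only the push
        q.push(msg);
    }

    Message* Get()  // caller takes ownership; returns NULL when empty
    {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return NULL;
        Message* msg = q.front();
        q.pop();
        return msg;
    }
};
```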

However, multiple producers is a significant complication. Unless there's some crazy number of possible producers, why not just use multiple lockless queues? They're easy to do without x86 assembly tricks.
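A sketch of one such per-producer queue: a bounded single-producer/single-consumer ring, which needs one atomic index per side and no CAS at all (C++11 notation; capacity assumed to be a power of two).

```cpp
#include <atomic>
#include <cstddef>

// One ring per producer: since exactly one thread writes each index,
// plain stores with release/acquire ordering are enough.
template <typename T, size_t N>
class SpscRing
{
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<size_t> head;  // advanced by the consumer
    std::atomic<size_t> tail;  // advanced by the producer
public:
    SpscRing() : head(0), tail(0) {}

    bool push(const T& v)  // producer thread only
    {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;  // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out)  // consumer thread only
    {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;  // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

The consumer simply polls each producer's ring in turn, which trades a little fairness bookkeeping for never needing an interlocked read-modify-write.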

Quote:
Original post by elFarto
Yes. The guy that wrote it (Cliff Click) works for Azul, and they have a 768 and 896 core machines.

Would be nice to have such a system for testing my multi-threading stuff ;) I haven't implemented a lock-free hash map myself, but I believe that due to the nature of a hash map, contention is significantly reduced when you have a large number of stripes. Cliff uses a 4096-way hash map, so even with 768 CPUs there aren't many CPUs on average competing for a single stripe. This is quite different from a data structure where all the CPUs access the same variable (e.g. a single LIFO queue), which would really demonstrate how lock-free scales in comparison to, say, short spinlocks in high-contention cases.
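That striping point in miniature (stripe count and cache-line size below are assumptions, not anything from Cliff's code): spreading updates over many independent slots means few CPUs ever collide on the same cache line, unlike a single shared head pointer.

```cpp
#include <atomic>

struct StripedCounter
{
    static const int kStripes = 16;

    // alignas(64) keeps each stripe on its own (assumed 64-byte) cache line,
    // so updates to different stripes don't false-share.
    struct alignas(64) Stripe { std::atomic<long> v; };
    Stripe stripes[kStripes];

    StripedCounter()
    {
        for (int i = 0; i < kStripes; ++i) stripes[i].v.store(0);
    }

    // Each thread hashes to a stripe; contention occurs only within a stripe.
    void Add(int threadId, long n) { stripes[threadId % kStripes].v.fetch_add(n); }

    long Total() const  // not atomic as a whole -- a snapshot
    {
        long sum = 0;
        for (int i = 0; i < kStripes; ++i) sum += stripes[i].v.load();
        return sum;
    }
};
```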
