Double-check my lock-free voodoo

Quote:Original post by JarkkoL
Anyway, it's debatable if in many-core architectures lock-free algorithms would bring any benefit, so I guess we just have to wait to get our hands on those to see how it turns out in practice (:


Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing, which is particularly impressive).

Regards
elFarto
Quote:Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing, which is particularly impressive).

Do you mean it has actually been tested on such a system? Which system is that, and are there any stats and details about the test? I would be very interested to see this.
Quote:Original post by JarkkoL
Quote:Original post by elFarto
Here is a HashMap implementation for Java that scales linearly up to 768 real cores using lock-free algorithms (including resizing, which is particularly impressive).

Do you mean it has actually been tested on such a system? Which system is that, and are there any stats and details about the test? I would be very interested to see this.

Yes. The guy who wrote it (Cliff Click) works for Azul, and they have 768- and 896-core machines. Here is a blog post he made about it.

Regards
elFarto
Quote:I'm going lock-free here for a very particular reason: passing messages between tasks needs to be as efficient as possible. If message passing uses any locks, we introduce a severe bottleneck that limits scalability.

That's a reasonable hypothesis, but unless you have actual profiler results to show, it is just speculation.

To play devil's advocate for a moment, if you use a lock judiciously (e.g., only take the lock after you have created the message objects, so you can immediately write and let go of it), the difference might be negligible in practice.

However, supporting multiple producers is a significant complication. Unless there's some crazy number of possible producers, why not just use multiple lockless queues, one single-producer/single-consumer queue per producer? They're easy to do without x86 assembly tricks.
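For what it's worth, a single-producer/single-consumer ring buffer along those lines is pretty small. The sketch below is illustrative only (not code from this thread), and it is expressed with C++11-style atomics purely to make the required ordering explicit; the class name and interface are made up:

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer.  One instance per
// producer thread sidesteps the multi-producer problem entirely: only Push()
// runs on the producer and only Pop() runs on the consumer.
template <typename T, size_t CapacityPow2>
class SpscQueue
{
    static_assert((CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of two");
public:
    bool Push(const T& value)               // producer thread only
    {
        const size_t head = m_head.load(std::memory_order_relaxed);
        const size_t tail = m_tail.load(std::memory_order_acquire);
        if (head - tail == CapacityPow2)
            return false;                   // full
        m_slots[head & (CapacityPow2 - 1)] = value;
        m_head.store(head + 1, std::memory_order_release);   // publish
        return true;
    }

    bool Pop(T& out)                        // consumer thread only
    {
        const size_t tail = m_tail.load(std::memory_order_relaxed);
        const size_t head = m_head.load(std::memory_order_acquire);
        if (tail == head)
            return false;                   // empty
        out = m_slots[tail & (CapacityPow2 - 1)];
        m_tail.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T m_slots[CapacityPow2];
    std::atomic<size_t> m_head{0};
    std::atomic<size_t> m_tail{0};
};
```

With one such queue per producer, the consumer simply polls each queue in turn, and no CAS loop is needed anywhere.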
Quote:Original post by elFarto
Yes. The guy who wrote it (Cliff Click) works for Azul, and they have 768- and 896-core machines.

Would be nice to have such a system for testing my multi-threading stuff ;) I haven't implemented a lock-free hash map myself, but I believe that due to the nature of a hash map, contention is significantly reduced when you have a large number of stripes. Cliff uses a 4096-way hash map, so even with 768 CPUs there aren't, on average, many CPUs competing for a single stripe. This is quite different from a data structure where all the CPUs access the same variable (e.g. a single LIFO queue), which would really demonstrate how lock-free scales in comparison to, e.g., short spinlocks in high-contention cases.
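To make the striping point concrete, here is a toy sketch (mine, not Cliff's code, using GCC's __sync builtins): a single shared counter versus a 4096-way striped one. In the striped version the CPUs mostly hit different cache lines, which is essentially what a wide hash map gives you on every operation:

```cpp
#include <stdint.h>
#include <stddef.h>

// High contention: every CPU hammers the same cache line.
struct SharedCounter
{
    volatile int64_t value;
    void Add(int64_t n) { __sync_fetch_and_add(&value, n); }
};

// Low contention: 4096 stripes, each padded to its own cache line and picked
// by a per-thread hash -- analogous to CPUs landing in different hash-map
// buckets/stripes most of the time.
struct StripedCounter
{
    static const size_t kStripes = 4096;
    struct Stripe { volatile int64_t value; char pad[64 - sizeof(int64_t)]; };
    Stripe stripes[kStripes];

    void Add(size_t threadHash, int64_t n)
    {
        __sync_fetch_and_add(&stripes[threadHash & (kStripes - 1)].value, n);
    }

    int64_t Read() const    // approximate total; fine for stats
    {
        int64_t sum = 0;
        for (size_t i = 0; i < kStripes; ++i)
            sum += stripes[i].value;
        return sum;
    }
};
```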
A couple of things I've noticed:
*) You use "new" and "delete". I doubt those are lock-free (one possible workaround is sketched below).
*) You don't take any measures against compile-time or run-time instruction reordering.
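On the first point, one possible workaround (only a sketch, and it assumes messages can all be released together at a known quiet point such as the end of a frame; the names are made up) is to carve messages out of a preallocated arena with an atomic bump pointer, so the enqueue path never calls new/delete:

```cpp
#include <stdint.h>
#include <stddef.h>

// Fixed-size arena with an atomic bump pointer (GCC __sync builtins).
// Allocation on the enqueue path is a single fetch-and-add; nothing is freed
// individually -- the whole arena is reset at a point where no task is
// touching messages any more (e.g. once per frame).
class MessageArena
{
public:
    explicit MessageArena(size_t bytes)
        : m_buffer(new char[bytes]), m_size(bytes), m_offset(0) {}
    ~MessageArena() { delete[] m_buffer; }

    void* Allocate(size_t bytes)
    {
        const size_t aligned = (bytes + 7) & ~size_t(7);      // 8-byte align
        const size_t offset  = __sync_fetch_and_add(&m_offset, aligned);
        if (offset + aligned > m_size)
            return 0;                       // out of space for this frame
        return m_buffer + offset;
    }

    void Reset() { m_offset = 0; }          // call only while no tasks run

private:
    char* const     m_buffer;
    const size_t    m_size;
    volatile size_t m_offset;
};
```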
I don't know much about it, but I remember seeing somewhere that although some older AMD64 CPUs indeed don't support DWCAS (cmpxchg16b), their virtual address space is limited to 48 bits, making emulation possible.
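If I understand that right, the trick would be something along these lines (my illustration, not from any particular paper, and it assumes user-space pointers fit in the low 48 bits, which holds on current AMD64): pack a 16-bit version counter into the upper bits of the pointer so that a plain 64-bit CAS updates pointer and counter together, which covers the usual ABA case without needing DWCAS:

```cpp
#include <stdint.h>

// Pack a 48-bit user-space pointer and a 16-bit version counter into one
// 64-bit word so a single cmpxchg can swap both, avoiding DWCAS/cmpxchg16b.
typedef uint64_t TaggedPtr;

inline TaggedPtr MakeTagged(void* ptr, uint16_t version)
{
    return (uint64_t(version) << 48) | (uint64_t(uintptr_t(ptr)) & 0xFFFFFFFFFFFFull);
}
inline void*    GetPtr(TaggedPtr t)     { return (void*)(uintptr_t)(t & 0xFFFFFFFFFFFFull); }
inline uint16_t GetVersion(TaggedPtr t) { return uint16_t(t >> 48); }

// Example: CAS the head of a list to a new node, bumping the version so a
// concurrent pop/push pair that recycles the same pointer can't fool us (ABA).
inline bool CasHead(volatile TaggedPtr* head, TaggedPtr expected, void* newPtr)
{
    const TaggedPtr desired = MakeTagged(newPtr, uint16_t(GetVersion(expected) + 1));
    return __sync_bool_compare_and_swap(head, expected, desired);
}
```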
Referring to the code in the first post: Intel says not to use volatile: http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ (article by one of the TBB guys)
Use memory and compiler barriers instead, and wrap them into Acquire() and Release() operations (check my thread on lock-free queues from a while back; I'm not sure how to find it).
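For concreteness, such wrappers are usually along these lines (a sketch, not the code from that thread; the macro and function names are illustrative). On x86/x86-64, plain loads already have acquire semantics and plain stores have release semantics, so only the compiler has to be stopped from reordering; other architectures would need real fences here:

```cpp
#if defined(_MSC_VER)
    #include <intrin.h>
    #define COMPILER_BARRIER() _ReadWriteBarrier()
#else   // GCC and compatibles
    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

template <typename T>
inline T LoadAcquire(const T* addr)
{
    T value = *addr;        // the load itself
    COMPILER_BARRIER();     // later reads/writes may not be hoisted above it
    return value;
}

template <typename T>
inline void StoreRelease(T* addr, T value)
{
    COMPILER_BARRIER();     // earlier reads/writes may not sink below it
    *addr = value;          // the store that publishes the data
}
```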
I don't declare any variables volatile in my lock-free circular array queue.
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
Quote:Original post by Prune
Referring to the code in the first post: Intel says not to use volatile: http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ (article by one of the TBB guys)
Use memory+compiler barriers instead, and wrap them into Acquire() and Release() operations (check my thread from a while back on lock-free queue, I'm not sure how to find it)
I don't declare any variables volatile in my lock-free circular array queue.


I've posted updated code since then; I'll mark the OP as such to prevent further confusion [smile]

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Oops, I should have read more carefully.

You mentioned that all FIFO algorithms you saw require CAS2. C. Evequoz has published a paper (see my previous thread for the reference) that implements a wait-free, lock-free, multi-reader/multi-writer FIFO using only single-word CAS. This is useful even when the instruction set has CAS2 because, if I remember correctly, CAS2 is expensive and his algorithm scales better with the number of cores.
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)

