Here's an implementation of a lock-free multi-reader/multi-writer array-based queue

21 comments, last by Prune 15 years, 5 months ago
Most of my thoughts were already covered by the above posts (wicked thread and snippets, by the way! I love seeing this calibre of discussion on GDNet).

Cache-line-friendly arrays are definitely superior to linked lists for playing nice in the sandbox with the L2 cache (especially on the 360).

Also, I can definitely back up the use of memory barriers. I ran into this exact problem doing concurrency programming with a ring buffer on the 360: certain optimizations would reorder instructions, so the write index's increment would sometimes not be visible before the next read happened. Forcing a memory barrier in a key place flushed out the transaction.
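For anyone who hasn't hit this: here's a minimal single-producer/single-consumer sketch of that barrier placement, using C++11 atomics in place of the explicit barrier intrinsics the 360 code would have used. The names are illustrative, not from the original code; the point is that the index is published with release semantics only *after* the element is fully stored, and read with acquire semantics.

```cpp
#include <atomic>
#include <cstddef>

// SPSC ring buffer sketch: write/read are monotonic counters, so
// (write - read) is the number of queued elements.
template <typename T, std::size_t N>
class SpscRing {
    T buf_[N];
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
public:
    bool push(const T& v) {
        std::size_t w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == N)
            return false;                       // full
        buf_[w % N] = v;
        // Release barrier: the element store above cannot be reordered
        // past this index publication -- the reader never sees the new
        // index without also seeing the completed write.
        write_.store(w + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (write_.load(std::memory_order_acquire) == r)
            return false;                       // empty
        out = buf_[r % N];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }
};
```

Without the release/acquire pairing, the compiler (or a weakly ordered CPU like the Xenon) is free to make the index update visible before the element itself, which is exactly the missed-write symptom described above.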

Nice work!

~Graham
Quote:Original post by Prune
But this is in essence a spin-lock, so your algorithm cannot be considered lock-free. Can you give guarantees that lock convoys, priority inversion, livelock/deadlock, etc. can never occur?
This spin-lock only occurs when a pop operation tries to pop data that is still being written. Most of the time a write is a very simple operation, so this should be a rare occurrence (which is why I chose to spin rather than return immediately).

To be truly lock-free, I have to guarantee that at least one thread is always making progress, even if all the other threads are paused.
To satisfy this condition, the pop operation should just return false (failure) if the write is not yet completed, instead of spinning.
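A minimal sketch of that "fail instead of spin" pop, assuming each slot carries a ready flag that the writer sets (with release semantics) only after the element is fully stored. The names here are hypothetical, not from my actual FIFO code:

```cpp
#include <atomic>

// One queue slot: a value plus a flag published after the value.
struct Slot {
    int value;
    std::atomic<bool> ready{false};
};

void publish(Slot& s, int v) {
    s.value = v;
    s.ready.store(true, std::memory_order_release);
}

// Returns false when the slot's write is still in flight, so a paused
// writer can never stall a reader -- the lock-free condition above.
bool try_pop(Slot& s, int& out) {
    if (!s.ready.load(std::memory_order_acquire))
        return false;   // write not finished: report failure, don't spin
    out = s.value;
    s.ready.store(false, std::memory_order_relaxed);  // slot reusable
    return true;
}
```

The caller then decides whether to retry, do other work, or give up, which is exactly what the proposed boolean template argument would toggle.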

I should probably make this a boolean template argument so the user of my FIFO can choose if they want it to be truly lock free or not. ;)
This would require disabling my resize feature though, as I'm not sure how you would resize/relocate the array in a true lock-free design.
Hogman, I'll take a look at your code again tomorrow.

I played around a bit with alignment. In my testing I'm using two writers and two readers. For the following, I set it so that in most cases the queue would be neither full nor empty.

Putting head and tail on different cache lines (aligning to 32, since align() doesn't officially accept a larger value, and then inserting an intervening member variable) doesn't produce a speedup I can distinguish from the noise floor of trial-to-trial variation.
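For reference, here's how that separation looks with C++11 alignas, which does accept a full cache-line value (64 bytes on Core 2), instead of the compiler-specific align-plus-filler-member workaround. The struct and member names are illustrative:

```cpp
#include <atomic>
#include <cstddef>

// Head and tail forced onto separate 64-byte cache lines, so writers
// bumping tail don't invalidate the line that readers poll for head.
struct Indices {
    alignas(64) std::atomic<std::size_t> head{0};
    alignas(64) std::atomic<std::size_t> tail{0};
};

// alignas(64) on the second member guarantees at least a full line
// between the two counters.
static_assert(sizeof(Indices) >= 128, "head and tail share a line");
```

Whether the padding helps depends on how often the two indices are actually contended, which may explain why it stayed below the noise floor here.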

Strided array access, with a stride of 4 (assuming 64-byte Core 2 cache lines) and an array size of 999 (coprime with the stride, so every element of the array is used), gives a quarter-to-a-third speedup. However, an array size of 9999 has the opposite effect. So it seems worrying about cache-line thrashing is only worth it if the array fits within L1; otherwise the more frequent reads from L2 are the overriding consideration.

One problem with not knowing the size at compile time is that I can't set the stride at compile time either, and having the stride as a variable instead of a literal gives me a big performance hit--comparable to the gain I was getting from avoiding the cache-line thrashing in the first place! (Having the size as a literal makes much less of a difference--am I running out of registers here?) Any ideas on how best to resolve this?
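One possible way out, sketched below: make the stride a template parameter so the multiply folds to a constant (or a shift), while only the size stays runtime-variable. This is just an assumption about where the cost lies, not a tested fix; the function name is made up.

```cpp
#include <cstddef>

// Map logical index i to a strided physical index. Stride is a
// compile-time constant here, so the compiler can fold the multiply;
// only `size` remains a runtime value. As noted above, size must be
// coprime with Stride so every element is visited exactly once.
template <std::size_t Stride>
inline std::size_t strided_index(std::size_t i, std::size_t size) {
    return (i * Stride) % size;
}
```

The drawback is that callers must dispatch to the right instantiation, but if only a couple of stride values matter in practice (e.g. 4 for 16-byte elements on a 64-byte line) that's a small switch at queue-creation time.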

(Note: just in case, the temporaries in push/pop have their alignment specified as well, even though they should stay in registers, since they must meet those minimum alignments for the interlocked operations. Though I suppose even there it's only needed for the 64-bit one, since 4-byte basic types are aligned to 4-byte boundaries by default.)
"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

Looking for a high-performance, easy-to-use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)

