Ryan_001

std::atomic memory ordering question


Today I wrote a simple multi-threaded intrusive stack.  In doing so, though, I came across some questions that I can't seem to find answers to.
 
1) I used a typical compare and swap/CAS/compare and exchange to update the container's internals.  To that end, http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange has a typical CAS linked-list example:

node<T>* new_node = new node<T>(data);
new_node->next = head.load(std::memory_order_relaxed);          // snapshot the current head
while (!head.compare_exchange_weak(new_node->next, new_node,    // retry until head is unchanged;
                                   std::memory_order_release,   // on failure, the CAS reloads
                                   std::memory_order_relaxed)); // new_node->next for us

What confuses me is the memory ordering used.  They use relaxed for the initial load, and release/relaxed for the CAS.  I would assume for the initial load you would want acquire, to ensure you get the latest/most up-to-date value.  The release on success makes sense; any changes to the head need to be made visible to the other threads.  What confuses me is the relaxed on failure.  If the CAS fails, would it not require an acquire so that modifications made by other threads become visible on the next call to CAS?
 
2) What is the difference between memory_order_acq_rel and memory_order_seq_cst?  Under what situation would I need seq_cst over acq_rel?


I would assume for the initial load you would want acquire, to ensure you get the latest/most up to date value.

No, that is not what acquire does. Memory ordering can be a bit confusing, to be honest, but don't despair. The example is pretty bad in my opinion, too.
 
Acquire and relaxed (as well as release) are all exactly as up-to-date as each other. There is no difference there.
 
The difference is that when you say "acquire", you are also requesting that no later reads or writes are reordered to occur before the atomic operation, and when you say "release", you are requesting that no earlier reads or writes are reordered to occur after it (plus, if a target architecture requires a special dance, for example fence instructions or explicitly flushing caches, this is done). Worded differently, in this combination you can rely on any non-atomic stuff that happened before the atomic release being 100% up-to-date after the atomic acquire, when you try to read it (non-atomically).
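For illustration, here is the canonical release/acquire pairing in standard C++ (a minimal sketch of my own; the names are made up, not from the example):

#include <atomic>
#include <cassert>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // non-atomic store...
    ready.store(true, std::memory_order_release);  // ...cannot be reordered past this
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // pairs with the release above
        ;                                          // spin until the flag is set
    assert(payload == 42);  // guaranteed: the release/acquire pair made
                            // the non-atomic store visible to this thread
}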
 
In the example, they are getting the head pointer with relaxed ordering. That's fine. They're saying "give me that pointer, and screw the rest". Since they're not intending to look at the data the pointer points at, or any other data for that matter, that's legitimate.
 
Then they are compare-exchanging the pointer to something different (a new node to be added). This is done with "release/relaxed" ordering, in other words the operation must guarantee release semantics if it succeeds (but doesn't need to if it fails). What's the rationale? Well, if you set the head node pointer to some newly created node, someone is sooner or later going to get that pointer, and then they're going to dereference the pointer and read the pointed-to data. That data had better be properly written to memory before someone tries to read it! This is just what release (in combination with acquire) guarantees.
Now, one can argue that if the compare-exchange fails, the head pointer still bears its original, unchanged value, and thus points to a boring, unmodified object which already existed before we even started. Thus, as there are no non-atomic writes that we have to worry about, the operation can just as well be relaxed (it may still be release, but it need not be). Therefore, they use a mixed release/relaxed model in the compare-exchange.
However, in practice, this only adds to confusion and is somewhat nonsensical, because on the majority of architectures it isn't even possible to differentiate dynamically between the two cases¹, and where it is (partially) possible, it's quite a bit of work and involves a conditional jump, which is pretty much just as expensive as simply doing things with release semantics in the first place (which doesn't confuse people nearly as much).
 
The main problem is that the example omits the part of the code (the pop function) where the matching acquire operation happens, so it is pretty confusing. It looks as if you should combine release with relaxed, which isn't the case.
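For completeness, a matching pop might look roughly like this (a sketch of my own, not from cppreference; it ignores memory reclamation and the ABA problem discussed below):

template<typename T>
node<T>* pop(std::atomic<node<T>*>& head)
{
    // acquire pairs with the release in push: the node's contents
    // are fully visible before anyone dereferences the pointer
    node<T>* old_head = head.load(std::memory_order_acquire);
    while (old_head &&
           !head.compare_exchange_weak(old_head, old_head->next,
                                       std::memory_order_acquire,
                                       std::memory_order_acquire))
        ;  // on failure, old_head was reloaded by the CAS; retry
    return old_head;
}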
 
There is another issue with learning from such minimal (and incomplete) examples which I forgot to mention. Lock-free programming is not always as harmless as it seems. For example, there is no mention of the "ABA problem" on that page, which is one of those surprises you may encounter (especially with lock-free stacks). In one sentence, it boils down to the compare-exchange succeeding when it should in reality fail (you removed a node, freed it, allocated a new one, got the same address, and re-added it while some other thread is adding nodes as well, and is oblivious to what's going on). If you have never heard of this before, be sure to research "ABA" as well as "tagged pointers".
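To make the hazard concrete, the classic interleaving on the stack above looks like this (a comment-only timeline, my own illustration):

// Stack initially:  head -> A -> B -> C
//
// Thread 1 (pop):  reads old_head = A, next = B   ...then gets preempted.
// Thread 2:        pops A, pops B, frees both; allocates a new node, and
//                  the allocator happens to reuse A's address; pushes it:
//                  head -> A' -> C        (A' has A's exact bit pattern)
// Thread 1 wakes:  CAS(head, A, B) succeeds -- the pointers compare equal --
//                  and head now points at B, which has been freed. Corruption.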

                                                                             
¹ In general, once you have prevented the compiler from reordering stores by requesting that, if the operation succeeds, ordering must be some particular way, how are you going to undo that once you discover the atomic operation failed? That's not possible without recompiling. The only thing you could conceivably skip conditionally would be fences or explicit cacheline flushes on architectures which require them.

Edited by samoth


I appreciate the comprehensive response.  I have a few small questions so bear with me if you don't mind :)
 

The difference is that when you say "acquire", you are also requesting that no later reads or writes are reordered to occur before the atomic operation, and when you say "release", you are requesting that no earlier reads or writes are reordered to occur after it (plus, if a target architecture requires a special dance, for example fence instructions or explicitly flushing caches, this is done). Worded differently, in this combination you can rely on any non-atomic stuff that happened before the atomic release being 100% up-to-date after the atomic acquire, when you try to read it (non-atomically).

 
So what you are saying is that even with 'relaxed', changes to atomic variables are visible across all cores?  The acquire/release is just related to non-atomic data (so something like a compiler directive plus an LFENCE/SFENCE on x86)?  The website I linked states: "Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation".  Now the words 'no synchronization' seem to imply (to me at least) that any changes to the variable are not necessarily propagated to other cores (i.e. the variable could still just be sitting in a register and not yet flushed to memory/cache).  Also, on architectures with a weak memory ordering, this again sounds to me like release/acquire is necessary...  Not trying to be argumentative, just still a bit confused.  Also, if what you say is true, where does memory_order_consume fit into all this?  What you describe as relaxed is how I imagined consume to work.
 

Also, what is the point of memory_order_seq_cst?  If memory_order_acq_rel ensures no reads or writes can be reordered around that atomic variable, and any changes are seen by all other cores that also use an appropriate acq/rel, what additional functionality does memory_order_seq_cst provide?

 

I understand the ABA problem, and just worked around it by using a fast user mode spin-lock for pop.  The worst case scenario is of course slower than a true lock-free algorithm, but the average case scenario was much quicker.  From what I understand tagged pointers, hazard pointers, and reference counting all seemed to require knowledge of the underlying architecture/hardware and/or asm to get decent performance, none of which was available with just standard portable C++.  Though if I'm wrong about that and you can point me in the right direction I'd love to see how people are solving the ABA problem without resorting to assembly or embedding reference counters into unused bits in pointers.


So what you are saying is that even with 'relaxed', changes to atomic variables are visible across all cores?  The acquire/release is just related to non-atomic data (so something like a compiler directive plus an LFENCE/SFENCE on x86)?

Pretty much.
Write-release is basically a fence instruction followed by a write instruction. Read-acquire is a read instruction followed by a fence instruction.
 
Many other atomic APIs actually use this abstraction -- instead of doing a "read-acquire" operation on an atomic variable, you only have fencing functions and non-atomic variables. In these APIs you would write stuff like:
extern int myAtomic, myNonAtomic;

int i = myAtomic;    // the atomic read itself
AcquireFence();      // no later load may be moved above this point

int x = myNonAtomic; // so this read cannot be moved above the fence
 
However, x86 very nearly gives sequential consistency out of the box anyway - it does a lot of out-of-order processing, but as far as memory goes, it only ever moves stores to be after loads. It can't move a load past another load, a store past another store, or a load past a store - so the requirement to use fence instructions is pretty minimal. Because of this, a lot of people get away with "incorrect" threaded code, which would instantly break on a different architecture. The C++ standard is trying very hard to make sure that threaded code will work on ARM or PowerPC or Itanium just as well as it does on x86.

e.g. in the above example, we know that x86 never reorders reads past each other (but your compiler might!!) so on x86, that AcquireFence() function call only has to emit a compiler intrinsic that suppresses compile-time instruction reordering.
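In standard C++, the same pattern would look roughly like this (a sketch using std::atomic_thread_fence; the variable names mirror the example above):

#include <atomic>

std::atomic<int> myAtomic{0};
int myNonAtomic = 0;

int reader()
{
    int i = myAtomic.load(std::memory_order_relaxed);    // the atomic read itself
    std::atomic_thread_fence(std::memory_order_acquire); // no later load may move above this;
                                                         // on x86 this emits no instruction,
                                                         // only a compiler barrier
    int x = myNonAtomic; // safe, provided the writer stored it before its release
    return x + i;
}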
 

Also, what is the point of memory_order_seq_cst?  If memory_order_acq_rel ensures no reads or writes can be reordered around that atomic variable, and any changes are seen by all other cores that also use an appropriate acq/rel, what additional functionality does memory_order_seq_cst provide?

The difference only appears when using more than two threads. Write-release and read-acquire form a pairing that ensures the right memory order is observed, but sequential consistency works outside of a single pairing.
 
Setup: atomic x=0, y=0;
Thread A: x = 1;
Thread B: y = 1;
Thread C: if x==1 && y==0 then print("Thread A ran first");
Thread D: if x==0 && y==1 then print("Thread B ran first");
 
If this is using acquire/release semantics (A/B use write-release and C/D use read-acquire), then that forms a typical synchronization barrier between A&C/D and B&C/D. e.g. Any work that 'A' did prior to x=1 will be visible before x=1 is visible, which is typically what you want.
However, in this example there is no other work being synchronized - we're just interested in the atomic integers themselves.
With acquire/release semantics, it's possible that threads C and D will both print their statements -- C can see x=1 become visible before y=1 does, and D can see y=1 become visible before x=1 does.
Sequential consistency adds an extra guarantee that C and D will both see the results become visible in the same order.
Usually such guarantees are not required :wink:
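Spelled out in standard C++, the four threads would look like this (a sketch; with the seq_cst orderings shown, at most one of the two messages can ever print, while with acquire/release both could):

#include <atomic>
#include <cstdio>

std::atomic<int> x{0}, y{0};

void threadA() { x.store(1, std::memory_order_seq_cst); }
void threadB() { y.store(1, std::memory_order_seq_cst); }

void threadC()
{
    if (x.load(std::memory_order_seq_cst) == 1 &&
        y.load(std::memory_order_seq_cst) == 0)
        std::puts("Thread A ran first");
}

void threadD()
{
    if (x.load(std::memory_order_seq_cst) == 0 &&
        y.load(std::memory_order_seq_cst) == 1)
        std::puts("Thread B ran first");
}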


So what you are saying is that even with 'relaxed' changes to atomic variables are visible across all cores?

Yes. To be 100% correct, one must admit it's a little bit more complicated, but all in all it's "Yes!". Atomic writes, no matter what memory ordering, are of course not instantly visible everywhere, but that's not necessary (or even possible, I wouldn't know how physics would allow for this on a non-zero-dimension chip, since the speed of light is not infinite). They are, however, visible everywhere "pretty soon" ( = as instantly as you can get it), and what's most important, consistently, in a well-defined order, and with no garbage values or operations lost.

 

You will notice that while on e.g. x86 atomic loads and stores are just plain normal loads and stores, read-modify-write operations such as increments or compare-exchange are significantly slower (around 10 extra cycles). That's because guaranteeing a consistent, well-defined order is not at all trivial to do for read-modify-write with several participants. But somehow, the CPU manages, even for these :)

What you expect need not be what you get, but that's as intended. To guarantee consistency, it is a necessity. For example, imagine you do an atomic load and get 5 for some value, and then you (atomically) increment it.  Obviously, the value must be 6 now, and the returned previous value (pretty useless, isn't it!) must of course be 5. But surprise... the returned value is 6. What happened there? Some other thread also incremented the value at the same time, or rather the smallest amount of time earlier, just before you did. Any core looking at the value (including yourself!) the instant before the value was updated atomically saw 5, everybody looking at it the next instant saw 6, and anyone looking now will see 7 (or, in case another few threads modified it, yet another different value). It is, however, perfectly consistent at every point in time. No increments are "lost" under any circumstances.

Same for compare-swap. You may not immediately get what you wished for, but it is always absolutely well-defined (and observable by you!) what's going on. Either the value was exactly what you expected, and you succeeded, so now the value is exactly what you want it to be (though the next instant it might no longer be). Or, someone else was faster. But whatever it is, there are no doubts.
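A short sketch of that scenario with std::atomic (illustration only):

#include <atomic>

std::atomic<int> counter{5};

void increment()
{
    int seen = counter.load();        // we observe 5...
    int prev = counter.fetch_add(1);  // ...but another thread may increment
                                      // between our load and our fetch_add
    // prev may be 5 (we were first) or 6 (someone beat us to it), yet no
    // increment is ever lost: the final value reflects all of them.
    (void)seen; (void)prev;
}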

 

the wording no synchronization

It's admittedly a bit of an unlucky wording, but rest assured, when they say "no synchronization", they refer to the non-atomic writes/reads.

 

memory_order_seq_cst

See Hodgman's explanation, can't really say it much better than that. You very rarely, if ever, need that.

 

One should maybe note (or maybe not, it makes things more complicated, and there is not really that much to gain) that there is yet another mode called "consume", which is basically the same as "acquire", but with very minute fine-tuning that may allow for extra optimizations. It does not provide the happened-before guarantee with respect to everything, but only for dependent data. That means data which is e.g. pointed to by an atomically modified pointer has ordering guarantees, but other data doesn't. In practice, the difference is small, mostly theoretical, and usually zero. On the other hand, with this extra-smart mode, there's an opportunity for you to fantastically screw things up in a non-obvious way on the few architectures where it matters (while it "works fine" on others). That's in my opinion a high price to pay for little gain.
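The pointer-publication pattern consume was designed for looks roughly like this (a sketch of my own; note that current compilers typically just promote consume to acquire anyway):

#include <atomic>

struct Config { int value; };
std::atomic<Config*> current{nullptr};

void publish(Config* c)  // c->value was fully written before this call
{
    current.store(c, std::memory_order_release);
}

int read_config()
{
    // consume orders only reads that carry a data dependency on p,
    // such as p->value; unrelated data gets no guarantee
    Config* p = current.load(std::memory_order_consume);
    return p ? p->value : 0;
}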

So, all in all, I think it's best to use relaxed whenever you know for certain that you don't care about the ordering of other data, and acquire/release in every other case, and to reply to anyone stating that there are also other modes with: "yeah, I know". This is safe and fool-proof, 90% as good performance-wise in the worst case (on a few architectures), and 100% identical in the normal, average case.
 

 

I understand the ABA problem, and just worked around it by using a fast user mode spin-lock for pop.

In that case, you might as well not use atomic operations at all (except inside your spinlock, of course). When using a spinlock, there is no concurrent modification of the head pointer happening.

 

Which, funnily, often isn't even a bad solution, and often isn't even noticeably slower. Sometimes, depending on the usage pattern, it may even perform better, and it can allow for trivial implementations of operations which, depending on the structure, can be rather tricky to get right, such as adding 5 or 10 items in one go (on a stack, that's still pretty easy, but try on a bounded queue with several producers and consumers). Sometimes, locking and spinning isn't the worst solution, but the best (and even blocking can be!).
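A minimal test-and-set spinlock of the kind mentioned above could look like this (a sketch; a production version would add a pause/yield in the spin loop):

#include <atomic>

class spinlock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() noexcept
    {
        // acquire on success: everything the previous holder wrote
        // before unlock() becomes visible once we own the lock
        while (flag.test_and_set(std::memory_order_acquire))
            ;  // spin
    }
    void unlock() noexcept
    {
        flag.clear(std::memory_order_release);
    }
};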

 

tagged pointers, hazard pointers, and reference counting all seemed to require knowledge of the underlying architecture/hardware and/or asm

Knowledge of the hardware, yes. Asm not so much.

 

For example, the specification of x86_64 mandates that there be 48 logical address bits (those are merely for virtual addresses; a CPU will typically only have 38 or 39 physical address lines), and the most significant address bit is replicated into the topmost 16 bits to form a 64-bit pointer (a "canonical address"; the CPU will produce a fault if that's not the case!). The idea behind this weird stuff is actually quite ingenious: addresses are split into two halves (since operating systems usually reserve one half for kernel space, this is very handy), and those halves will "grow together" from both ends towards the middle if you decide to add more address bits later, in a couple of years from now. Which, however, is rather unlikely to happen any time soon; even the largest applications are not remotely coming close to these constraints (unless you know someone who needs to memory-map a 256TiB file?).

 

Now, the bottom line of that is that you have a guaranteed 16 unused bits in every pointer. Sure, these bits are not really unused, and tampering with them puts you very deep in undefined behavior land. But you know for certain that any piece of data that you allocated (and which is thus by definition not in kernel address space) will have all zeroes in there. So you can as well use these bits for counting and making ABA reasonably unlikely to occur -- as long as you remember to zero them out again before actually using the pointer!

 

The real problem with ABA is that compare-exchange does exactly what its name says, but that is not the same thing as what the programmer thinks. The programmer thinks of CAS as "comparing pointers makes sure these two are the same objects", but the processor merely says "yup, those two bit patterns are the same". The processor doesn't care that the bit pattern is maybe a pointer to something that really isn't what you expected it to be. But a simple counter, which guarantees a different bit pattern every time, very easily eliminates this problem with reasonably high likelihood.

 

If you are yet a bit more courageous, you can also exploit the knowledge that virtually all allocators align allocations of any kind on at least 8- if not 16-byte boundaries (MinGW and clang do 32 bytes; I am told, but haven't verified, that MSVC does 16 bytes). Which means you have another 2 to 5 bits which are unused in each pointer. Since you know which compiler you're using (and if you are super paranoid about portability, you can static_assert that the alignment of something is so-and-so much, just in case), that's of course still strictly undefined behavior, but it's undefined behavior that will work fine with no surprises. Putting everything together, you can for example write something like this in C++, with not a single line of asm:

#include <cassert>
#include <cstdint>

class packed_ptr
{
    // Assumes:
    // 1. AMD64 specification, i.e. the 16 MSBs are zero
    // 2. malloc works with (at least) 16-byte alignment, i.e. trailing 4 bits are zero too
    // ---> 20 bits that we can use.
    uint64_t ptr;
public:
    packed_ptr() : ptr() {}

    packed_ptr(void* p, unsigned int val = 0) noexcept : ptr(((uint64_t) p << 16) | (val & 0xfffff)) { assert( val <= 0xfffff); }

    void*        pointer() const noexcept { return (void*)((ptr >> 16) & ~0xf); }
    unsigned int value()   const noexcept { return (ptr & 0xfffff); }
    void*        raw()     const noexcept { return (void*)ptr; }

    template<typename T> operator T*() const noexcept { return (T*) pointer(); }
    operator unsigned int()            const noexcept { return value(); }

    void set_ptr(void* in)        noexcept { ptr = (ptr & 0x00000000000fffff) | ((uint64_t) in << 16); }
    void set_int(unsigned int in) noexcept { ptr = (ptr & 0xfffffffffff00000) | (in & 0xfffff); }

    void set(void* p, int val)    noexcept { ptr = ((uint64_t) p << 16) |  (val & 0xfffff); }

    explicit operator bool() const noexcept { return pointer(); }
    bool operator==(void* p) const noexcept { return pointer() == p; }

    // It's a bit unclear what one would like to increment/decrement here, the pointer (need to know sizeof(T), thus would have to templatize class!),
    // or the value -- so just don't let user do that.
    packed_ptr operator++() = delete;
    packed_ptr operator--() = delete;
};

Of course you will want to thoroughly test your tagged pointer class (which... *cough*... I haven't done with the above one, but it seems to work for what I'm using it with; at least so far I've not had a crash).
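For illustration, a pop using such a tagged pointer might look like this (hypothetical usage of my own, assuming a node type with a next pointer; std::atomic<packed_ptr> is lock-free here since the class is a single 64-bit word):

std::atomic<packed_ptr> head;

node* pop()
{
    packed_ptr old_head = head.load(std::memory_order_acquire);
    for (;;)
    {
        node* n = (node*) old_head.pointer();
        if (!n)
            return nullptr;
        // bump the tag: even if another thread frees n and the allocator
        // reuses its address, the changed tag makes this CAS fail
        packed_ptr new_head(n->next, (old_head.value() + 1) & 0xfffff);
        if (head.compare_exchange_weak(old_head, new_head,
                                       std::memory_order_acquire,
                                       std::memory_order_acquire))
            return n;
    }
}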


I really appreciate the in-depth answers.

 

Yes, I am aware that you can stuff counters into the unused bits of pointers; I've seen that technique referred to as 'tagged pointers' elsewhere, but I don't know if that's the official term.  You don't get too much to play with if you want to be portable.  The LSb is pretty much never used (always 0) and is a good one to use, but on 16-bit systems (if any still exist; I'm not sure) you don't even get the 2nd LSb, and on 32-bit systems you only get 2 bits.  Useful for flags, not so much for counters.  I'm hesitant to use the upper 16 bits, because even though they are reserved for now, that won't be forever.  Also, I don't know of an easy way to static_assert() that.  Probably not an issue I would ever face, TBH...
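For what it's worth, the alignment half of those assumptions can be checked at compile time; the "upper bits are zero" property is a runtime fact about each individual address, so the best you can do there is assert it (my own sketch):

#include <cstddef>
#include <cstdint>

static_assert(sizeof(void*) == 8, "tagging scheme assumes 64-bit pointers");
static_assert(alignof(std::max_align_t) >= 8,
              "expected at least 3 zero low bits from the allocator");

inline bool top_bits_free(const void* p)  // check an individual address
{
    return ((std::uintptr_t)p >> 48) == 0;
}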

 

I've been reading through the papers from http://www.liblfds.org/ to see how they're doing it.  Good information, but the solutions presented are quite complex for even the simplest of containers; I'm trying to decide how much time I want to spend on this :)

Yes, I am aware that you can stuff counters into the unused bits of pointers; I've seen that technique referred to as 'tagged pointers' elsewhere, but I don't know if that's the official term.  You don't get too much to play with if you want to be portable.  The LSb is pretty much never used (always 0) and is a good one to use, but on 16-bit systems (if any still exist; I'm not sure) you don't even get the 2nd LSb, and on 32-bit systems you only get 2 bits.  Useful for flags, not so much for counters.  I'm hesitant to use the upper 16 bits, because even though they are reserved for now, that won't be forever.  Also, I don't know of an easy way to static_assert() that.  Probably not an issue I would ever face, TBH...

I would recommend avoiding pointers and using integer indices/etc. instead. Many algorithms rely on a double-CAS operation, which is 128 bits for two pointers and often doesn't exist in practice or has a large perf penalty. If you're using a 16-bit index value, then a regular 32-bit CAS instruction is a double-CAS for your purposes :)

It also gives you complete peace of mind when it comes to stealing bits for other purposes, such as ABA counters, etc...

 

In most gamedev situations, you can know your worst case requirements ahead of time, pre-allocate all your memory, and know how many bits you need for an index (often: not many!). All of this means you can write code that's a hell of a lot simpler than a general purpose solution.
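As an illustration of that approach, a 16-bit index plus a 16-bit ABA counter fit into one 32-bit CAS (a sketch of my own; the pool layout and names are made up):

#include <atomic>
#include <cstdint>

struct Node { int data; uint16_t next; };

Node pool[65535];                     // pre-allocated; index 0xFFFF means "null"
std::atomic<uint32_t> head{0xFFFFu};  // low 16 bits: index, high 16 bits: counter

void push(uint16_t idx)
{
    uint32_t old_head = head.load(std::memory_order_relaxed);
    uint32_t new_head;
    do {
        pool[idx].next = uint16_t(old_head & 0xFFFFu);          // link to old top
        new_head = ((old_head + 0x10000u) & 0xFFFF0000u) | idx; // bump the counter
    } while (!head.compare_exchange_weak(old_head, new_head,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}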

Also, pointer-based container implementations often (not always) imply that you're dynamically allocating memory. Most memory allocators are locking, not lock-free, so it's always funny to see a call to new/malloc inside the implementation of a lock-free collection structure, as that undoes all your hard work :lol:

Edited by Hodgman
