
atomic write



#1 ultramailman   Prime Members   -  Reputation: 1437


Posted 11 December 2013 - 05:52 PM

Hello. A while ago I was curious about threading, particularly about synchronization without mutexes.

 

From what I understand, a full memory barrier ensures that all writes before the barrier complete before it, and that reads and writes after the barrier happen only after it.

 

So I wrote two helper functions, atomic_read and atomic_write.

typedef struct { volatile unsigned val; } atomic_t; // assumed definition; the original post doesn't show it

static inline unsigned atomic_read(atomic_t * ptr)
{
    __sync_synchronize(); // make sure all prior writes to this variable are done first
    return ptr->val;
}
static inline void atomic_write(atomic_t * ptr, unsigned val)
{
    __sync_synchronize(); // make sure this write happens after all the code before it
    ptr->val = val;
}

 

Then I looked at how other people implement it, and found that everyone else's atomic_write looks like this:

static inline void atomic_write(atomic_t * ptr, unsigned val)
{
    ptr->val = val;
    __sync_synchronize(); // notice how the barrier is after the write, as opposed to before
}

 

I'm guessing that putting the barrier after the write makes the write visible as soon as possible. But since atomic_read already has a barrier before the read, isn't the barrier after the write redundant?

 

For example, using my ordering, an atomic write followed by an atomic read would become:

    barrier
    write
    barrier
    read

and the reverse would be:

    barrier
    read
    barrier
    write

(Both look OK to me: every read and write is sandwiched between barriers.)

 

Using the ordering I observed in other implementations, a write followed by a read would be:

    write
    barrier
    barrier
    read

(This looks OK, just with a redundant barrier in the middle.)

and the reverse would be:

    barrier
    read
    write
    barrier

(Here it looks like the read and the write can be done out of order, because there isn't a barrier between them.)

 

So my question is this: which is the more correct placement of the full memory barrier, if it matters at all?




#2 Álvaro   Crossbones+   -  Reputation: 10654


Posted 11 December 2013 - 08:19 PM

I believe yours is wrong, but I am not an expert. I found a video that is kind of long, but very informative: http://concurrencyfreaks.blogspot.com/2013/02/the-new-memory-model-in-c11-and-c11.html

#3 Hodgman   Moderators   -  Reputation: 24059


Posted 11 December 2013 - 11:41 PM

Your functions should be named atomic_read_release and atomic_write_release.
If you moved the sync to be after the operation, they should then be called atomic_read_acquire and atomic_write_acquire.
If you sync on both sides of the operation, it could be called atomic_read_full and atomic_write_full.
 
Some algorithms require read_acquire and write_release, other (rarer) algorithms require read_release and write_acquire.
 
However, at the ASM level on x86, there usually isn't such a thing as an acquire or release fence; there are only full fences.
Instead of writing that as a full (acquire+release) fence intrinsic followed by a regular C read/write, there's probably an intrinsic for your compiler that does both the fence (full, acquire, or release) and the read/write as a single operation.
 

To make your examples easier to understand, assume the write+barrier and the barrier+read are occurring on different threads, and add in some other data.
e.g. let's say I've got an atomic variable called "ready", which tells another thread when some shared data has been produced.
Thread A:
  shared = 42;
  barrier // ensure shared has reached ram before proceeding
  ready = true;//write-release ordering

Thread B:
  local myReady = ready;//read-acquire ordering
  barrier // ensure ready is read from ram before shared is
  if( myReady )
    print shared
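
For reference, here is a minimal C11 sketch of the same pattern using <stdatomic.h>, with the acquire/release ordering attached to the load and store themselves (the names and values are just the ones from the pseudocode above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static int shared;                                  /* ordinary data produced by thread A */
static atomic_bool ready = ATOMIC_VAR_INIT(false);  /* flag that publishes it */

void thread_a(void)  /* producer */
{
    shared = 42;
    atomic_store_explicit(&ready, true, memory_order_release);  /* write-release */
}

void thread_b(void)  /* consumer */
{
    bool myReady = atomic_load_explicit(&ready, memory_order_acquire);  /* read-acquire */
    if (myReady)
        printf("%d\n", shared);  /* guaranteed to see 42 */
}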

Edited by Hodgman, 11 December 2013 - 11:48 PM.


#4 ultramailman   Prime Members   -  Reputation: 1437


Posted 12 December 2013 - 12:35 AM

@Alvaro: thanks, I watched the first one. Very informative indeed. So from what I understand, what I have there can be acquire/release barriers instead of full barriers.

 

@Hodgman: those namings kind of make sense to me now, after going through that vid Alvaro linked. At first I named them barrier_write and barrier_read. Everyone else named them atomic though, so I kind of followed along.

 

There are actually intrinsics in gcc for acquire and release, but they look kind of limited in usage (the acquire write is only guaranteed to support writing 1, and the release will only write zero). There are also intrinsics that pack a barrier into a write (compare-and-swap) or a barrier into a read (add-zero-and-fetch).
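
For example, a minimal spinlock sketch (my own illustration, assuming a plain int lock word) shows where that 1/0 restriction comes from:

static int lock = 0;

static void spin_lock(void)
{
    /* __sync_lock_test_and_set is an acquire barrier; some targets only
       guarantee support for storing the constant 1 */
    while (__sync_lock_test_and_set(&lock, 1))
        ; /* spin until we are the thread that takes the lock from 0 to 1 */
}

static void spin_unlock(void)
{
    /* __sync_lock_release is a release barrier that stores 0 */
    __sync_lock_release(&lock);
}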

 

In your example, are those barriers necessary if the variable "ready" is atomic?

 

Are there any advantages to intrinsics with both a barrier and a memory operation packed in one line (besides being more assured that it's working correctly)? Testing shows they are slower than using a full barrier explicitly.


Edited by ultramailman, 12 December 2013 - 01:27 AM.


#5 samoth   Crossbones+   -  Reputation: 4069


Posted 12 December 2013 - 04:22 PM

I think you can even use "consume" in this case, since you seem to have no dependency on non-atomic data. Though it will result in exactly the same code 99.9% of the time, it might be slightly faster under some conditions on some architectures.
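
For example (a sketch of my own, in C11 terms), a consume load only orders the reads that are data-dependent on the loaded value:

#include <stdatomic.h>

static _Atomic(int *) published_ptr;

int read_published(void)
{
    /* consume ordering: only operations data-dependent on p (like *p) are
       ordered after this load; in practice most compilers simply promote
       consume to acquire, hence "the same code 99.9% of the time" */
    int *p = atomic_load_explicit(&published_ptr, memory_order_consume);
    return p ? *p : 0;
}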

 

"Some algorithms require read_acquire and write_release, other (rarer) algorithms require read_release and write_acquire."

Are you sure? The C++ standard only defines load operations for acquire and consume, and store operations for release (and of course relaxed and seq_cst for any kind of operation) in 29.3. acquire+release (acq_rel) is only allowed for read-modify-write ("fetch-op" type) functions.

It also explicitly says e.g. under atomic_store (29.6/8): "the order shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel", and the respective opposite for atomic_load (29.6/12).
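
In C11's stdatomic terms (a sketch; the same restrictions apply as in the C++ clauses cited above):

#include <stdatomic.h>

static atomic_int flag;

void order_examples(void)
{
    /* loads may use relaxed, consume, acquire, or seq_cst */
    int v = atomic_load_explicit(&flag, memory_order_acquire);

    /* stores may use relaxed, release, or seq_cst -- not consume,
       acquire, or acq_rel */
    atomic_store_explicit(&flag, v + 1, memory_order_release);

    /* acq_rel is reserved for read-modify-write ("fetch-op") operations */
    atomic_fetch_add_explicit(&flag, 1, memory_order_acq_rel);
}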

 

Of course the mere fact that C++ doesn't allow something doesn't mean that no such thing could exist, but I also have trouble imagining what it should do. An acquiring load synchronizes with a releasing store. What does an acquiring store synchronize with? This doesn't seem to make sense to me. Can you give an example of an algorithm that actually needs such a weird ordering?



#6 frob   Moderators   -  Reputation: 16208


Posted 12 December 2013 - 06:54 PM

... And of course the fun doesn't end there.

When memory barriers and atomic actions and threading get involved, there are two wonderful companions you absolutely must remember.

First, NEVER forget your friendly and usually helpful optimizer within your compiler. Most of the time this wonderful system can completely transform readable but slow C++ code into indecipherable but speedy machine code, often by moving instructions around, interleaving work, and eliminating tasks entirely. Usually it does a great job. But when it comes to multi-step tasks that NEED to be atomic, always beware. While the optimizer is wonderful at parties and makes everyone feel happy, when you are alone it feels no guilt whatsoever about stabbing you in the back and dumping your corpse into a shredder. While the optimizer is perfectly happy to enforce locks and barriers in specific documented ways, when you start coding your way into dark corners you need to watch your back. Check the documentation for your implementation, because what happens to work one day may leave a serious mess another.

Other party favorites are those compilers that support processors and targets most people have never heard of. GCC is the most popular of those. Everyone at programming parties can talk about the x86 family and the ARM family, but these compilers can spin wild tales about Alpha and National Semiconductor and VAX systems. The tales aren't just about big machines, but about the microcontrollers on hard drives, the programs inside memory chips, and the code running inside those little unidentified components on motherboards. These well-traveled compilers also brag about life-support systems and antilock brake controllers and fly-by-wire piloting systems written in C++. The thing that can give you nightmares is knowing the stories are true.




Getting to the point:

First, even with C++'s new memory model, the language itself isn't enough. Every serious program I've ever looked at required some degree of operating-system support, compiler-specific flags, or custom intrinsics to ensure the optimizer doesn't destroy shared objects. Even though C++11 provides more functionality, it is still support rather than complete coverage. The compiler itself must be aware of the restrictions lest the optimizer break them, and every compiler is different. In the real world you must always take steps beyond the C++ standard library to write solid multithreaded code, even with the latest additions.

Second, C++ is a general-purpose programming language. It isn't just used for desktop apps, games, and web servers; it is used on more systems than is easily imagined. I have done embedded programming work that included remote controls and traffic-light controllers, both in C++. You might not think of flipping through channels as running C++ code, but it can be. You might not think of rolling through an intersection on a bike as using C++ code, but it can be. You might not think of slamming on the brakes in a car as using C++ code, but it can be. So when you think "I cannot understand why anyone would want a one-bit acquire and release," remember that software is everywhere. When GCC or another compiler vendor adds a feature, it is in response to a need for it. It might be a need on an obscure chip only used in an obscure industry, but the need exists somewhere. The two big chip families most of us use these days, x86 and ARM, both do a lot of heavy lifting for us that takes additional code on other platforms.



When it comes to things like atomic actions, multithreaded environments, and threading, just knowing the C++ language is not enough. Being able to cite the C++ standard about the new memory model is not enough. You need to get comfortable with reading the documentation for the OS, for the compiler, and for the other tools that are very specific to the thing being programmed.
Check out my personal indie blog at bryanwagstaff.com.

#7 samoth   Crossbones+   -  Reputation: 4069


Posted 13 December 2013 - 02:42 AM


"Are there any advantages to intrinsics with both a barrier and a memory operation packed in one line (besides being more assured that it's working correctly)? Testing shows they are slower than using a full barrier explicitly."

Yes and no (on both counts).

 

Intrinsics have the advantage that they are supported by older compiler versions and that they were not broken on the one broken compiler (yes, this happens, though luckily it's very rare) that I've used in my life. They have the disadvantage that they are not as fast as they could be in some situations, and that they are inherently non-portable, and some of them are limited or "special" functions as you've noted. You should not use them if you can help it.

 

The intrinsics are the equivalent of using seq_cst with the newer standard functions/classes, except where noted otherwise (notably __sync_lock_release). seq_cst is also the default for standard atomics because it is the safest mode with the strictest guarantees. This usually involves the equivalent of an mfence instruction, or, who knows... something else, maybe even a lock (though I doubt you'll ever encounter a system where that is the case). It is very much dependent on the architecture.
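
In C11 terms (a sketch), the plain functions default to seq_cst, so these two stores are equivalent:

#include <stdatomic.h>

static atomic_int x;

void store_both_ways(void)
{
    atomic_store(&x, 1);                                 /* defaults to seq_cst */
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  /* the same, spelled out */
}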

If you don't know what to use, just stay with seq_cst. Whatever it is, it will work ("work" insofar as seq_cst enforces the strictest possible guarantees on ordering, you can of course still write incorrect code with a correctly working atomic operation and strict ordering).

 

However, often such a strict ordering guarantee is not necessary, and a somewhat more relaxed model can be used. There is no observable difference (well, there is one, but in this case you said you don't care about it!), and the compiler may be able to generate faster code than if it had to be super strict. For example, you might just want to increment a counter from several threads (say, to count some events), and you want to be sure it counts correctly. So the increment kind of needs to be "atomic", but it doesn't really need anything else. You don't really care about what may have happened before or who sees changes when; all you want is for it to increment correctly. In that case you can still use seq_cst, but you don't need to. You can do it faster.
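
A sketch of that event counter in C11 (my own example): only the increment itself needs to be atomic, so relaxed ordering suffices:

#include <stdatomic.h>

static atomic_uint event_count;

void count_event(void)
{
    /* atomic increment; relaxed, because we only need the count to be
       correct, not any ordering with surrounding reads and writes */
    atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed);
}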

A less strict memory order might, for example, allow the compiler to merely use a different instruction (without doing a full sync), or it might only require a store to act as a barrier that prevents previous stores from being reordered (relying on loads arriving correctly because, incidentally, the target architecture works like that). Or something else.

 

These are a lot of implementation details that you probably don't know and don't even want to know. The nice thing is that unless you have a broken compiler, the compiler knows just fine, and you need to think only in terms of what guarantees you need, not in terms of what the hardware does or might do. Or what you think it does and doesn't.


Edited by samoth, 13 December 2013 - 02:47 AM.


#8 yusufaytas   Members   -  Reputation: 114


Posted 13 December 2013 - 09:17 AM

(quoting Hodgman's post #3 in full)

Thank you very much.





