volatile and consistency

Original post by Prune

There are various articles discouraging the use of the volatile keyword in multithreaded programming because it provides no ordering or atomicity guarantees; those require barriers and/or interlocked instructions.

However, in the Intel TBB internals one sees code equivalent to

template<typename T>
__forceinline T Acquire(T volatile const &p) {
    T t = p;              // volatile load
    _ReadWriteBarrier();  // compiler barrier: keeps later accesses after the load
    return t;
}

template<typename S, typename T>
__forceinline void Release(volatile S &p, T const t) {
    _ReadWriteBarrier();  // compiler barrier: keeps earlier accesses before the store
    p = t;                // volatile store
}

so volatile clearly does not seem to be useless. Is the barrier by itself sufficient when accessing a shared variable?

Say atomicity is already guaranteed because we're dealing with an aligned, dword-sized variable, and one thread assigns a value to x while another reads x in a loop. I've noticed that if I simply have x = 5; in one thread and while (x != 5) {...} in another (x being a class data member of type int in my test), there is sometimes a non-negligible delay between the assignment in the first thread and loop termination, much longer than the system scheduler's granularity. However, if I use the Acquire()/Release() from above, the loop terminates right away every time. My question is: if only a memory barrier is used, is that sufficient, or is volatile also necessary? What if there are locks around the variable access; is volatile still necessary then? Does it make a difference whether the variable is dereferenced through a pointer, or whether it's a member of an object?
In the memory hierarchy of register <-> L1 <-> L2 <-> RAM, since L2 is also shared between CPUs, wouldn't it be enough for the shared value to be updated in L2 by one thread for the other thread to see the change? Yet volatile would seem inefficient if it had to reach all the way out to RAM.
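For concreteness, here is a minimal sketch of the experiment described above (the names are made up; it assumes MSVC and the Acquire()/Release() templates quoted earlier):

struct Test {
    int x;  // aligned, dword-sized shared member
};

Test g;  // hypothetical shared instance

// Thread A
void writer() {
    Release(g.x, 5);  // compiler barrier, then store of 5
}

// Thread B
void reader() {
    while (Acquire(g.x) != 5) { /* spin until the store is observed */ }
}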

C and C++ make very few guarantees about things like the behavior of volatile, which is, I'd guess, why many people advise against it. To be sure things will work as expected, you'll need to use whatever conventions your compiler and platform require.

Using MSVC on Windows, I typically use the intrinsic _Interlocked* functions for atomic operations, along with the barrier intrinsics when I need to ensure ordering.
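For example, a hedged sketch of that pattern (the flag and payload names are made up; assumes MSVC on x86):

#include <intrin.h>

volatile long g_ready = 0;  // flag, touched only via _Interlocked* functions
long g_payload = 0;         // data published under the flag

void producer() {
    g_payload = 42;                     // ordinary store
    _ReadWriteBarrier();                // keep the store above the flag update
    _InterlockedExchange(&g_ready, 1);  // atomic read-modify-write of the flag
}

long consumer() {
    while (_InterlockedCompareExchange(&g_ready, 1, 1) == 0) { /* spin */ }
    _ReadWriteBarrier();                // keep the payload load below the flag check
    return g_payload;
}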

Volatile is not useless; however, it was never intended to provide ordering (with respect to the CPU pipeline and cache coherency) or atomicity. It is necessary not only in multithreaded programming but in any situation where memory contents can change independently of the currently executing code, e.g. memory-mapped I/O registers.

All volatile does is tell the compiler that the contents of a particular variable can change outside its field of vision (by another thread, by a device, etc.), which means it can't optimize away loads and stores and just keep things in registers, or reorder certain loads/stores. This is independent of memory barriers, memory coherency, or multi-processor visibility. Volatile prevents the compiler from doing what it loves: hoisting loads, sinking stores, scheduling, and so on. It doesn't make the CPU "reach all the way to memory"; it forces the compiler to be conservative with loads and stores of volatile variables, that's all. The loads/stores still hit cache like they normally would unless you do something to prevent it.

If Acquire didn't specify that the parameter p was volatile and you called it in a loop, nothing would stop a compiler from lifting the load out of the loop, and then you'd never see the value updated in the loop.
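To make the hoisting concrete, here is a hedged illustration (not from TBB; the names are made up) of what an optimizer may do with a plain variable versus a volatile one:

int flag = 0;            // plain: the loop below may legally become 'if (flag == 0) for (;;) {}'
volatile int vflag = 0;  // volatile: a fresh load must be emitted on every iteration

void spin_plain()    { while (flag == 0) {} }
void spin_volatile() { while (vflag == 0) {} }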

Quote:
Original post by outRider
If Acquire didn't specify that the parameter p was volatile and you called it in a loop, nothing would stop a compiler from lifting the load out of the loop, and then you'd never see the value updated in the loop.

Thanks, this is one of the answers I was looking for. (I didn't think it was obvious: one might guess that since the barrier prevents later reads from being reordered before this one, lifting the load out of the loop would violate that... no?)

Quote:
It doesn't make the CPU "reach all the way to memory"; it forces the compiler to be conservative with loads and stores of volatile variables, that's all. The loads/stores still hit cache like they normally would unless you do something to prevent it.

But the most common example given for the use of volatile involves hardware using DMA, where memory-mapped values can be modified by hardware external to the CPU. I don't see how the cache can come into play there, since the memory may not be modified through the CPU; how would the machine know whether an address is cacheable or not? If I write volatile int *p;, that could just as well refer to an address changed by hardware external to the CPU. How would the compiler/CPU know whether it's just plain old RAM that it can cache, so it wouldn't have to load from the actual address?

Either the cache and the DMA controller signal each other (e.g. the cache is notified/invalidated as the RAM is written to), or the OS performs some magic to ensure that programs don't access DMA-able regions during a transfer, manually flushing caches before letting the program access the results.


Because all this barrier stuff is non-standard, there are two kinds of reordering you've got to stop. Some barrier macros only stop runtime (out-of-order execution) reordering, some only stop compile-time reordering, and some stop both. Some of them only have an impact in the current scope, and some affect all code up the call stack.

Also, Microsoft decided to change the meaning of volatile in 2005, so now accessing a volatile variable actually does generate a barrier (acquire/release semantics)!

IIRC, the standard definition of volatile stops compile-time reordering, but only with respect to other volatiles, which isn't that helpful.
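For instance (a hedged sketch; the names are made up): standard volatile orders volatile accesses only against each other, so a plain store may legally be moved past a volatile one:

int data;            // plain variable
volatile int ready;  // volatile flag

void publish() {
    data = 42;   // plain store: the compiler may sink it below the next line
    ready = 1;   // volatile store: ordered only against other volatile accesses
}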


Well-tested lock functions will employ volatile/barriers/whatever so that you don't have to. Without locks, though, the requirements are very compiler-specific.

Quote:
Original post by Prune
Quote:
Original post by outRider
If Acquire didn't specify that the parameter p was volatile and you called it in a loop, nothing would stop a compiler from lifting the load out of the loop, and then you'd never see the value updated in the loop.

Thanks, this is one of the answers I was looking for. (I didn't think it was obvious: one might guess that since the barrier prevents later reads from being reordered before this one, lifting the load out of the loop would violate that... no?)


No. Lifting the read out of the loop doesn't violate anything: the read is already in front of the memory barrier, and lifting it out of the loop still keeps it ahead of the barrier. If the variable is declared volatile, the compiler knows not to assume anything about the value, and that reading the same address over and over in a loop is intentional and must be kept there. If it weren't volatile, any compiler worth its salt would lift that guy out of the loop.

Quote:
Original post by Prune
Quote:
It doesn't make the CPU "reach all the way to memory"; it forces the compiler to be conservative with loads and stores of volatile variables, that's all. The loads/stores still hit cache like they normally would unless you do something to prevent it.

But the most common example given for the use of volatile involves hardware using DMA, where memory-mapped values can be modified by hardware external to the CPU. I don't see how the cache can come into play there, since the memory may not be modified through the CPU; how would the machine know whether an address is cacheable or not? If I write volatile int *p;, that could just as well refer to an address changed by hardware external to the CPU. How would the compiler/CPU know whether it's just plain old RAM that it can cache, so it wouldn't have to load from the actual address?


That's an orthogonal issue and has nothing to do with the compiler.

Basically, either the device or bus controller is CPU-cache aware, meaning that if the device reads/writes main memory it will see/invalidate the copy in the CPU's cache (IIRC PCI/PCIe work like this),

or

the device/bus controller isn't cache aware (IIRC AGP), but the CPU can treat some blocks of memory as uncached, in which case the driver for the device should set this up. When you allocate a block of memory that both the CPU and the device will interact with, the driver will make sure that range is treated as uncached (using the MTRR/PAT registers on x86),

or

neither of the above is available, in which case somewhere in the driver for that device explicit CPU instructions (wbinvd on x86) are used to flush caches, and you will usually be required to use map/unmap, lock/unlock, or begin/end style functions to access this memory so the driver can flush as necessary (or there will be an API like readword()/writeword() and they'll tell you to use those).

As you can see, none of it has anything to do with the volatile keyword.

If you care about portability at all, do yourself a favor and pretend that volatile doesn't exist. Use proper synchronization primitives. As has been noted, volatile actually does kind of do what you want in all Visual Studio versions starting with VS 2005, but as soon as you start using any other OS/compiler combination you have NO such guarantee.
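As an aside, a hedged sketch of what "proper synchronization primitives" might look like on Windows (the names g_cs/g_shared are made up; CRITICAL_SECTION is the real Win32 type):

#include <windows.h>

CRITICAL_SECTION g_cs;  // initialize once with InitializeCriticalSection(&g_cs)
int g_shared = 0;       // no volatile needed: the lock provides ordering and visibility

void set_value(int v) {
    EnterCriticalSection(&g_cs);
    g_shared = v;
    LeaveCriticalSection(&g_cs);
}

int get_value() {
    EnterCriticalSection(&g_cs);
    int v = g_shared;
    LeaveCriticalSection(&g_cs);
    return v;
}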

Quote:
Original post by outRider
That's an orthogonal issue and has nothing to do with the compiler.

Basically, either the device or bus controller is CPU-cache aware, meaning that if the device reads/writes main memory it will see/invalidate the copy in the CPU's cache (IIRC PCI/PCIe work like this),

or

the device/bus controller isn't cache aware (IIRC AGP), but the CPU can treat some blocks of memory as uncached, in which case the driver for the device should set this up. When you allocate a block of memory that both the CPU and the device will interact with, the driver will make sure that range is treated as uncached (using the MTRR/PAT registers on x86),

or

neither of the above is available, in which case somewhere in the driver for that device explicit CPU instructions (wbinvd on x86) are used to flush caches, and you will usually be required to use map/unmap, lock/unlock, or begin/end style functions to access this memory so the driver can flush as necessary (or there will be an API like readword()/writeword() and they'll tell you to use those).

As you can see, none of it has anything to do with the volatile keyword.

The reason I asked was to get a better idea of the efficiency implications of volatile, in terms of whether caching is still used for regular memory. If it weren't, I would try harder to avoid it.

Red Ant, I was specifically asking about the standard C++ semantics of volatile, which concern memory consistency. The MSVC extension mentioned adds implicit barriers, and I use barriers explicitly, but that still requires the volatile keyword, as outRider explained, because the barriers by themselves are not sufficient; at least for local variables, which according to MSDN are only affected by the barrier if they are marked volatile: http://msdn.microsoft.com/en-us/library/f20w0x5e%28v=VS.80%29.aspx The same page also says that if the variable's address is accessible non-locally, it does not need volatile. I don't know whether other compilers follow that convention... any comments on this, outRider or others?

One interesting thing I noticed: at least for x86 and x86-64, while MSVC has separate _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier, GCC's versions of all three are macros that ultimately expand to the same thing, __asm__ __volatile__("" ::: "memory"), and Intel's compiler for both Windows and Linux uses __memory_barrier. Why are they the same for these compilers? I know that older Intel x86 chips did not reorder writes, but I thought x86-64 does.
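A hedged sketch of how one might wrap these per-compiler barriers behind a single macro (the macro name is made up; note the Intel check must precede the GCC one, since ICC also defines __GNUC__):

#if defined(_MSC_VER)
  #include <intrin.h>
  #define COMPILER_BARRIER() _ReadWriteBarrier()                    // MSVC
#elif defined(__INTEL_COMPILER)
  #define COMPILER_BARRIER() __memory_barrier()                     // ICC
#elif defined(__GNUC__)
  #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")  // GCC
#endif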

Quote:
Original post by Prune
The reason I asked was to get a better idea of the efficiency implications of volatile, in terms of whether caching is still used for regular memory. If it weren't, I would try harder to avoid it.

Red Ant, I was specifically asking about the standard C++ semantics of volatile, which concern memory consistency. The MSVC extension mentioned adds implicit barriers, and I use barriers explicitly, but that still requires the volatile keyword, as outRider explained, because the barriers by themselves are not sufficient; at least for local variables, which according to MSDN are only affected by the barrier if they are marked volatile: http://msdn.microsoft.com/en-us/library/f20w0x5e%28v=VS.80%29.aspx The same page also says that if the variable's address is accessible non-locally, it does not need volatile. I don't know whether other compilers follow that convention... any comments on this, outRider or others?


I wouldn't rely on other compilers being able to deduce the global visibility of local variables. This is really simpler than you think it is. Consider this kind of code:

int *p = ...;   // points at shared memory (initialization elided)

x = *p + 4;
...
j = f(*p);
...
*p = *p + 1;

What do you think the compiler will do with *p? Load it from memory every time you use it in an expression, or load it once, keep the value in a register, and reuse the register? As far as the compiler can see, most of those loads can be optimized out. If you declare *p volatile, it will generate a load each time, because you're telling it that the value in memory can change without the compiler's knowledge. That's completely different from what a barrier does, but the two work hand in hand. When you're dealing with concurrency and shared memory, you not only have to make sure things are seen in the right order by others (barriers), you also have to make sure that you see things properly and in a timely fashion: that you actually load things from memory and don't just hang on to stale data (volatile).
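For contrast, a hedged sketch of the volatile version of the same fragment (same elisions as above):

volatile int *p = ...;  // now every use of *p is a distinct load from memory

x = *p + 4;   // load
...
j = f(*p);    // another load; the compiler may not reuse the previous value
...
*p = *p + 1;  // load, then store, both kept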

What MSVC is saying is that it can save you the trouble of using volatile if it can deduce that a local variable can be updated without its knowledge (i.e. it's globally visible), and in those cases it will make sure the variable's uses don't cross a barrier. Now, I don't know whether this also means the compiler will reload the value at every use like volatile, or whether it's something in between.

Quote:
Original post by Prune
One interesting thing I noticed: at least for x86 and x86-64, while MSVC has separate _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier, GCC's versions of all three are macros that ultimately expand to the same thing, __asm__ __volatile__("" ::: "memory"), and Intel's compiler for both Windows and Linux uses __memory_barrier. Why are they the same for these compilers? I know that older Intel x86 chips did not reorder writes, but I thought x86-64 does.


Probably for granularity's sake. On weakly-ordered memory systems, barriers actually result in CPU instructions being emitted, because even if the compiler doesn't reorder loads and stores, the pipeline can. On strongly-ordered systems where barriers don't usually result in instructions, I guess you could give the compiler some latitude by being more precise about a barrier's function, but I wouldn't be surprised if _ReadBarrier/_WriteBarrier/_ReadWriteBarrier all did the same thing in MSVC for x86. Just a guess anyway; I don't know the inner workings of MSVC.

Quote:
Original post by Prune
One interesting thing I noticed: at least for x86 and x86-64, while MSVC has separate _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier, GCC's versions of all three are macros that ultimately expand to the same thing, __asm__ __volatile__("" ::: "memory"), and Intel's compiler for both Windows and Linux uses __memory_barrier. Why are they the same for these compilers? I know that older Intel x86 chips did not reorder writes, but I thought x86-64 does.
You'd have to check the assembly language ;)
IIRC, on x86 the commonly used barrier is a full (read-write) fence: mfence, or any lock-prefixed instruction. lfence and sfence also exist, but they mostly matter for weakly-ordered cases such as non-temporal stores. x64 might be much the same in mostly needing only the full fence.
Other architectures do have genuinely different fence instructions, though, which is why MSVC gives you the 3 different macros.
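For reference, a hedged sketch of those x86 fences via the SSE intrinsics (available in both MSVC and GCC):

#include <emmintrin.h>  // _mm_lfence, _mm_mfence (SSE2)
#include <xmmintrin.h>  // _mm_sfence (SSE)

void full_fence()  { _mm_mfence(); }  // orders all earlier loads/stores before all later ones
void load_fence()  { _mm_lfence(); }  // orders loads
void store_fence() { _mm_sfence(); }  // orders stores (notably non-temporal stores)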

Quote:
Original post by Red Ant
Prune, do you intend to write your own synchronization primitives from the ground up, or why are you worrying about all this stuff?

Because I am doing just that. When possible, I use alternatives to locks, such as lock-free structures, message passing, and double-buffering of shared data.
