[C++] How's my (lockless) driving?

Started by InvalidPointer
20 comments, last by NotAYakk 15 years ago
Hi. I made a lock-free MRSW (Multiple Reader, Single Writer) queue as part of my new and improved job management system. I'll probably throw that up here for critique later, but back to business. This is loosely based on Insomniac's SrSwFixedLookasideDequeue, with a handful of improvements (namely proper C++ encapsulation/data hiding, as I suspect the original was written in straight C) and modifications (the whole SRSW->MRSW bit) done by myself. Feel free to use it if it's helpful. Header:

/*========================================================*\
  Stack.h			[GMLib Containers: Generic]
  --------------------------------------------------------

  (MRSW) Thread-safe fixed-size stack container. Provides
  basic tracking functionality for an external array of
  objects with maximum array length given as a template
  parameter.

  --------------------------------------------------------

\*========================================================*/
#pragma once

//==============================================
//		-- INCLUDES --
//==============================================
#include "../CompilerDefs.h"
#include "../Functions/FunctionsInterlocked.h"
//----------------------------------------------

/*==============================================
//		-- USAGE --
//==============================================
  Does not implement copy constructors or
  assignment operators, as these are non-trivial
  in a thread-safe context.
//--------------------------------------------*/
//==============================================
//		-- NEW DATATYPES/FORWARD DECLARATIONS --
//==============================================
template<size_t length> class MRSWFixedSizeStack;
//----------------------------------------------

template<size_t length>
class MRSWFixedSizeStack
{
public:
	//==============================================
	//		-- CONSTRUCTORS/DESTRUCTORS --
	//==============================================
	MRSWFixedSizeStack();
	~MRSWFixedSizeStack();
	//----------------------------------------------

	//==============================================
	//		-- MEMBER FUNCTIONS/PUBLIC INTERFACES --
	//==============================================
	GMForceInlineHint bool IsFull() const;
	GMForceInlineHint GMNoAliasHint size_t GetMaxCapacity() const;
	GMForceInlineHint size_t GetElementCount();
	GMForceInlineHint size_t GetNextPushed();
	GMForceInlineHint void Clear();
	size_t Pop();
	void Push();
	//----------------------------------------------

private:
	MRSWFixedSizeStack<length>& operator=(const MRSWFixedSizeStack<length>&);
	MRSWFixedSizeStack(const MRSWFixedSizeStack<length>&);

	GMAlignInterlocked(size_t popCount);
	GMAlignInterlocked(size_t pushCount);
};
//==============================================
//		-- IMPLEMENTATIONS (INLINE) --
//==============================================
#include "Stack.inl"
//----------------------------------------------

Implementation: (Stack.inl)
template<size_t length>
MRSWFixedSizeStack<length>::MRSWFixedSizeStack() : pushCount(0), popCount(0)
{
}

template<size_t length>
MRSWFixedSizeStack<length>::~MRSWFixedSizeStack()
{
}

template<size_t length>
GMForceInlineHint bool MRSWFixedSizeStack<length>::IsFull() const
{
	// full when the writer has gotten a whole ring ahead of the readers
	size_t count = (pushCount - popCount);
	return (length == count);
}

template<size_t length>
GMForceInlineHint GMNoAliasHint size_t MRSWFixedSizeStack<length>::GetMaxCapacity() const
{
	return length;
}

template<size_t length>
GMForceInlineHint size_t MRSWFixedSizeStack<length>::GetElementCount()
{
	volatile size_t* popPtr = &popCount;
	volatile size_t* pushPtr = &pushCount;
	return ( (*popPtr) - (*pushPtr) );
}

template<size_t length>
GMForceInlineHint size_t MRSWFixedSizeStack<length>::GetNextPushed()
{
	volatile size_t* pushPtr = &pushCount;
	size_t elementIndex = ((*pushPtr) & (length-1));
	return elementIndex;
}

template<size_t length>
GMForceInlineHint void MRSWFixedSizeStack<length>::Clear()
{
	pushCount = popCount = 0;
}

template<size_t length>
size_t MRSWFixedSizeStack<length>::Pop()
{
	size_t elementIndex = (GMAtomicIncrement(popCount) - 1);
	return (elementIndex & (length-1));
	// kind of a fancy modulus here; the count will loop
	// (useful for repeated iteration, i.e. in the job pool)
}

template<size_t length>
void MRSWFixedSizeStack<length>::Push()
{
	volatile size_t* pushPtr = &pushCount;
	size_t elementIndex = ((*pushPtr)+1);
	*pushPtr = elementIndex;
}
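
Quick sketch of the intended usage, for context (a hypothetical example-- Job, jobs, and someJob are stand-ins, and GMWriteFence/GMReadFence are the 'supplied' fences I mention below):

// Hypothetical usage: the stack only hands out indices; the payload lives
// in an external array that the caller fences around.
MRSWFixedSizeStack<64> freeList;
Job jobs[64];                             // the external array being tracked

// Writer thread only:
jobs[freeList.GetNextPushed()] = someJob; // fill the slot first...
GMWriteFence();                           // ...make its contents visible...
freeList.Push();                          // ...then publish the new count

// Any reader thread:
size_t i = freeList.Pop();                // looping index into jobs[]
GMReadFence();
Job mine = jobs[i];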



Comments? Suggestions? Clarifications? There are a couple of usage caveats regarding some (supplied) memory fences in 'everyday' use, but other than that it should be fairly self-explanatory. I'm interested in possible bugs (especially push- and pop-wise), and I'm aware that the Clear() function isn't explicitly safe. Yes, I'm also aware that the increment in Push() isn't atomic either (though reads/writes to the counter will be).
Quote:
  volatile size_t* popPtr = &popCount;
  volatile size_t* pushPtr = &pushCount;
  return ( (*popPtr) - (*pushPtr) );

How is that not a random number? And why bother doing the pointer stuff?

What guarantees are you trying to pull off? And if your function guarantees nothing, shouldn't it be rather explicit about guaranteeing nothing?

Ie, between reading either of those variables and reading the other, the other counter could be increased by an arbitrary amount. Which means that you could get a value anywhere between -infinity and +infinity, regardless of the state of the stack when you enter the function or after you leave the function.

(Note that the order the values are read is undefined by the above code. I don't think there are any sequence points between a() and b() in the expression: a() - b())...

Ayep:
http://en.wikipedia.org/wiki/Sequence_point

Even if there was a sequence point (which you could cause by breaking the statement up a bit), it would still generate an interval of results that was open to infinity or negative infinity.
Quote: size_t elementIndex = ((*pushPtr) & (length-1));

Did you just assume length was a power of two?

You should check that. Ie:
// fails to compile if passed false:
template<bool b> struct AssertTrue;
template<> struct AssertTrue<true> {};

template<size_t n> struct AssertEven : AssertTrue<(n/2)*2 == n> {};

template<size_t n>
struct AssertPowerOfTwo : AssertEven<n>, AssertPowerOfTwo<n/2> {};
template<> struct AssertPowerOfTwo<1> : AssertTrue<true> {};

Then a compile-time assert that your length is a power of two becomes easy.
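
For instance, a sketch of how it would hook in (hypothetical-- your real class stays as it is, this just bolts on the check):

// Sketch: inheriting from the assert makes a non-power-of-two length fail
// at compile time instead of silently breaking the (length-1) masking.
template<size_t length>
class CheckedStack : private AssertPowerOfTwo<length>
{
	// ...
};

CheckedStack<64> ok;    // compiles: 64 is a power of two
//CheckedStack<48> bad; // error: AssertEven<3> derives from the never-defined AssertTrue<false>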

Quote:
volatile size_t* pushPtr = &pushCount;
size_t elementIndex = ((*pushPtr)+1);
*pushPtr = elementIndex;
If you are the only thread writing, you don't have to worry about saying things are volatile.

Quote: size_t elementIndex = (GMAtomicIncrement(popCount) - 1);
return (elementIndex & (length-1));

What guarantees, and or checks, do you have that you won't out-consume the producer?

Or that the producer won't think that you had consumed your element already, and happily write over it before you return?

...

If you are doing an always-incrementing stack with modular arithmetic, and are assuming that the ABA problem will never occur, what you need to do when you pop is:
1> Check that read pointer < write pointer. If not, fail (nothing to read!)
2> Get a copy of the data under the read pointer,
3> Do a CAS increment of the pointer, asserting it used to be pointing to the data you wanted.
4> If 3 fails, goto 1 and repeat until you succeed at step 3.

Then, on the push side:
1> Check that read pointer+length > write pointer (or is it >=? Hmm.). If not, fail (there is no room in the queue)
2> Once the above occurs, as only you are increasing the write pointer, nothing the other threads will do can break it. So now non-atomically:
3> Write to the data under the write pointer.
4> Increment the write pointer (the write has to be atomic -- and it is on most systems so long as your memory is aligned -- but the increment doesn't have to be)

I think the above works.

Note that a CAS increment is not a mere atomic increment. You need to ensure both that you are increasing it by 1, and that the old value was the data you made a local copy of.

This makes a couple assumptions (such as reading from the data is thread-safe, and works even if the data is being corrupted or written to by some other thread! I'd make your queue a queue of pointers, and just copy the pointers, and this works nicely.)
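
Here's a minimal sketch of that scheme (my names, and C++11 atomics for brevity-- adapt the primitives to whatever your GMAtomic* layer actually provides):

#include <atomic>
#include <cstddef>

// Sketch: fixed ring of pointers with always-incrementing counters, one
// producer, many consumers. Assumes length is a power of two and that the
// counters never wrap far enough for ABA to bite.
template<typename T, std::size_t length>
class MpcRingSketch
{
public:
	MpcRingSketch() : readCount(0), writeCount(0) {}

	// Consumer (any thread) -- steps 1..4 from the post:
	bool Pop(T*& out)
	{
		for (;;)
		{
			std::size_t r = readCount.load(std::memory_order_acquire);
			std::size_t w = writeCount.load(std::memory_order_acquire);
			if (r >= w)
				return false;                    // 1: nothing to read
			T* item = slots[r & (length - 1)];   // 2: copy the data out first
			if (readCount.compare_exchange_weak( // 3: CAS-increment the read counter
					r, r + 1, std::memory_order_acq_rel))
			{
				out = item;
				return true;
			}
			// 4: another consumer got there first; retry from step 1
		}
	}

	// Producer (one thread only) -- steps 1..4 from the post:
	bool Push(T* item)
	{
		std::size_t w = writeCount.load(std::memory_order_relaxed);
		std::size_t r = readCount.load(std::memory_order_acquire);
		if (w - r >= length)
			return false;                        // 1: no room in the queue
		slots[w & (length - 1)] = item;          // 3: write the slot first...
		writeCount.store(w + 1, std::memory_order_release); // 4: ...then publish
		return true;
	}

private:
	T* slots[length];
	std::atomic<std::size_t> readCount;
	std::atomic<std::size_t> writeCount;
};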
Sorry, but this is nowhere near thread safe. (There are also numerous potential bugs and crashes in the code, but I'll stick to the thread-safety issues for now.)

	volatile size_t* popPtr = &popCount;
	volatile size_t* pushPtr = &pushCount;
	return ( (*popPtr) - (*pushPtr) );


First of all, why bother with the pointer business? Why not just: return (popCount - pushCount); Also, volatile is not enough to ensure thread safety. There may still be out-of-order reads/writes done by the CPU, and volatile does not ensure that reads/writes are atomic, so they might get interrupted and cause corruption.

Unfortunately, even the snippet return (popCount - pushCount); is not thread safe. It's possible that a thread interrupts the operation and changes one of those two variables, and then you return to the original thread and get a bogus result. This is due again to the possibilities of out-of-order code execution on the CPU.

	volatile size_t* pushPtr = &pushCount;
	size_t elementIndex = ((*pushPtr) & (length-1));
	return elementIndex;


Again, what if the thread gets interrupted in the middle of calculating elementIndex? Then you return a bogus result.


	size_t elementIndex = (GMAtomicIncrement(popCount) - 1);
	return (elementIndex & (length-1));


Yet again, there are several places here that the thread might get interrupted and cause a bogus result; suppose you get pre-empted between the two lines of code? You will then return an invalid elementIndex.

	volatile size_t* pushPtr = &pushCount;
	size_t elementIndex = ((*pushPtr)+1);
	*pushPtr = elementIndex;


What happens if I push simultaneously from two threads, and one thread gets interrupted by the other in the middle of a push? You'll lose the position counters. Also, what happens if I pop during a push? More chaos will result.


In general, you need to do three things to write lock-free code correctly:
  1. Keep in mind that both the compiler and the CPU may execute instructions out of order for performance reasons

  2. volatile is useless for threading. Use actual CPU intrinsics or OS functions to implement true memory barriers/fences (a sketch follows at the end of this post)

  3. Check every sequence point in your code; you might get interrupted by another thread in the middle of that point. Unless you are insanely careful, chances are that getting interrupted will mean corruption. Even better, check the generated assembly for your code - you might get interrupted between any two instructions, or instructions may be executed out of order. Either one can cause bugs in lockless code.


Even this won't guarantee perfect results, but it should get you started [smile]
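
For example, point 2 in practice looks something like this (a sketch only-- exact intrinsics vary by compiler version and target):

// Sketch: real barriers instead of volatile.
#if defined(_MSC_VER)
	#include <intrin.h>
	#define COMPILER_BARRIER() _ReadWriteBarrier()  // blocks compiler reordering only
	#define CPU_FULL_FENCE()   _mm_mfence()         // blocks CPU reordering too (SSE2)
#elif defined(__GNUC__)
	#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
	#define CPU_FULL_FENCE()   __sync_synchronize() // GCC built-in full barrier
#endif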


32-bit reads/writes on 32-bit boundaries are atomic; the volatile trickery was Insomniac's solution to avoid the compiler meddling with those particular reads. Makes some sense, when you think about it-- the volatile pointer means that the compiler *must* fetch the value pointed to out of memory and can't cache it in a register. Since aligned 32-bit reads/writes are atomic, (as they are in this case) there's no chance the values will get messed with midway through. At the same time, you aren't locked into having *all* reads/writes be this way; it can just be done when it makes sense, i.e. in cases like this. The volatiles themselves have nothing to do with atomicity; merely to inform the compiler that these values should not be optimized.
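
To illustrate what I mean (a sketch of the compiler-caching issue only-- CPU reordering is a separate problem, as noted below):

// Sketch: without the volatile pointer, the compiler may hoist the load of
// pushCount out of the loop and spin forever on a stale register copy.
size_t SpinUntilWork(size_t& pushCount, size_t popCount)
{
	volatile size_t* pushPtr = &pushCount;
	while (*pushPtr == popCount)
	{
		// each iteration performs a fresh load of pushCount from memory
	}
	return *pushPtr;
}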

Quote:Original post by NotAYakk
[...]

The sequence points are a good point; I'll adjust that so the values are stored into an intermediate register so that there are some guarantees regarding execution order. See next bit. Also, the volatility is to prevent compiler optimization, nothing more. Again, see next bit :)

Quote:Original post by ApochPiQ
[...]

I'm kinda confused regarding a few points you make there (in particular some of the stuff about process switches/preempting) but more on that in a sec.

In particular, I am aware of the issues you mention with the Count() function; that being said, the issues are perfectly acceptable given the intended use. Worst-case scenario, there would be approximately a one-frame latency between when a task is added and when it's executed. These values are intended to be indices into an array. Additionally, the fact that this is a *single-writer* queue means that since no other threads are writing to this particular piece of data, there's no actual need to make the increments themselves atomic, as long as the reads/writes themselves are. Again, this can be assumed due to the width and alignment of the pushCount variable.

It appears I also neglected to mention that the array length is indeed constrained to be a power of two. The static assert is on the to-do list, but at the moment I don't have such a thing ready ;)
This may or may not matter to you, but it is perhaps worth pointing out that since C++ doesn't really have a well defined memory model, issues of thread safety are architecture dependent (especially when writing lock-free code). For example, you say that "32-bit reads/writes on 32-bit boundaries are atomic": this is true for x86, but do you know if it's true for PowerPC or ARM (I don't know the answer myself, the point is it's worth being sure if the code might end up running on them).

I don't have enough experience with lock-free programming to critique your code myself, but Charles Bloom has some fairly extensive and very useful information on his blog, which you may want to look at if you haven't already: this post should be a good place to start.

John B
Quote:32-bit reads/writes on 32-bit boundaries are atomic; the volatile trickery was Insomniac's solution to avoid the compiler meddling with those particular reads.


Unaligned reads and writes are very decidedly not atomic. What are you doing to ensure that none of your reads/writes are non-atomic? I see no mention of data alignment at all in your code.


Quote:Makes some sense, when you think about it-- the volatile pointer means that the compiler *must* fetch the value pointed to out of memory and can't cache it in a register. Since aligned 32-bit reads/writes are atomic, (as they are in this case) there's no chance the values will get messed with midway through. At the same time, you aren't locked into having *all* reads/writes be this way; it can just be done when it makes sense, i.e. in cases like this. The volatiles themselves have nothing to do with atomicity; merely to inform the compiler that these values should not be optimized.


This is very shaky logic. For one thing, volatile has a notoriously vague specification in the standard; what volatile means to one compiler may be completely different from what it means on another compiler. Even different compiler versions handle volatile differently. Relying on a particular compiler's behaviour is very, very bad practice.

Any decent compiler should include intrinsics/fence instructions that prevent reordering by the compiler. However, you still need to take into account reordering by the CPU.


Quote:I'm kinda confused regarding a few points you make there (in particular some of the stuff about process switches/preempting) but more on that in a sec.

In particular, I am aware of the issues you mention with the Count() function; that being said, the issues are perfectly acceptable given the intended use. Worst-case scenario, there would be approximately a one-frame latency between when a task is added and when it's executed. These values are intended to be indices into an array. Additionally, the fact that this is a *single-writer* queue means that since no other threads are writing to this particular piece of data, there's no actual need to make the increments themselves atomic, as long as the reads/writes themselves are. Again, this can be assumed due to the width and alignment of the pushCount variable.


Simply having one writer and multiple readers is not sufficient to guarantee safety. You also have to ensure that nobody reads during a partially-completed write operation. If someone reads stale/half-written data during a half-finished write, you will get corruption - not a mere 1 frame latency, but invalid state in your container. (Also, if your readers are calling Pop, you technically have a multi-writer scenario, because multiple threads can modify the state of the container - but that's another issue entirely.)

Typically this is ensured via locks. When doing a lock-free algorithm, you have to design it so that your code can be preempted at any time by another thread. This means relying on hardware or OS level atomic operations, as I mentioned before. Your code does nothing to avoid being preempted and therefore causing problems.

In particular, you can't tell if your code is vulnerable to preemption issues just from the raw C/C++ code. You have to also consider the generated assembly to make sure that it is safe to interrupt.



Quote:Original post by InvalidPointer
32-bit reads/writes on 32-bit boundaries are atomic; the volatile trickery was Insomniac's solution to avoid the compiler meddling with those particular reads. [...]

No. As mentioned, if they're unaligned, then they're not atomic.
Also, 32-bit alignment & volatile still don't prevent out-of-order execution.
Think of OoOE as a sort of anti-volatile at the hardware level, instead of at the compiler-optimization level.
To solve OoOE you need a memory barrier.

Cheers
Dark Sylinc
Quote:Original post by ApochPiQ
Unaligned reads and writes are very decidedly not atomic. What are you doing to ensure that none of your reads/writes are non-atomic? I see no mention of data alignment at all in your code.

In the header:
GMAlignInterlocked(size_t popCount);
GMAlignInterlocked(size_t pushCount);

That's a macro that aligns whatever's in the parentheses along a 4-byte (32-bit) boundary, with some additional handling for x64 and/or the processor architecture. Though it's a bit ugly, it papers over the differences in compiler syntax very cleanly-- MSVC's __declspec(align(n)) x and GCC's x __attribute__((aligned(n))) being the major examples. The stack-based stuff used in functions is thread-local, so AFAIK we don't need to worry about it, correct? I.e. it's irrelevant on the basis that no other thread will touch it.
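
Roughly speaking, it boils down to something like this (simplified sketch, not the actual GMLib source-- the real macro has extra x64/architecture cases):

// Simplified sketch of a GMAlignInterlocked-style macro: wrap the
// compiler-specific alignment syntax around a declaration.
#if defined(_MSC_VER)
	#define GMAlignInterlocked(decl) __declspec(align(4)) decl
#else // GCC and friends
	#define GMAlignInterlocked(decl) decl __attribute__((aligned(4)))
#endif

// So GMAlignInterlocked(size_t popCount); expands to:
//   __declspec(align(4)) size_t popCount;          (MSVC)
//   size_t popCount __attribute__((aligned(4)));   (GCC)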

Also, my current understanding is that the particular behavior of volatile I'm relying on is one of the few actually guaranteed by the standard, i.e. no caching of the value? Is this correct? I'm aware MSVC does extend this more than slightly, but again, I'm not explicitly relying on any of that extended behavior.

Quote:Original post by ApochPiQ
[...] When doing a lock-free algorithm, you have to design it so that your code can be preempted at any time by another thread. This means relying on hardware or OS level atomic operations, as I mentioned before. [...]

Again, that's the whole deal with atomic writes, is it not? They can't be interrupted. The InterlockedIncrement() also utilizes the MSVC intrinsic when available, which not only uses the atomic LOCK inc assembly on systems that support it, but also includes a full set of barriers to boot. I do have/use them:
#if !defined( _PPC ) && !defined( _POWERPC )
	#if defined( _X64 )
		#define GMWriteFence() __faststorefence()
	#else
		#define GMWriteFence() _mm_sfence()
	#endif
	#define GMReadFence() _mm_lfence()
#else
	#define GMWriteFence() __asm{lwsync}
	#define GMReadFence() __asm{lwsync}
#endif

but the implementation I based this on suggests these be used *externally* when actually reading/writing the 'managed' data. Should these be placed inside the push/pops?
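
For reference, moving them inside would look something like this (a sketch of the variant being asked about, not the shipped code):

// Sketch: fences inside the container instead of at the call sites.
// Push publishes the element contents before the count that announces them;
// Pop claims a slot before the caller reads the element behind it.
template<size_t length>
void MRSWFixedSizeStack<length>::Push()
{
	// caller has already written jobs[GetNextPushed()]
	GMWriteFence();                 // element contents become visible first...
	volatile size_t* pushPtr = &pushCount;
	*pushPtr = ((*pushPtr) + 1);    // ...then the count that publishes them
}

template<size_t length>
size_t MRSWFixedSizeStack<length>::Pop()
{
	size_t elementIndex = (GMAtomicIncrement(popCount) - 1);
	GMReadFence();                  // claim the slot before reading from it
	return (elementIndex & (length-1));
}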
You still need to look at the generated assembly. Just looking at the C++ is going to tell you very little. You have to examine the robustness of your algorithm at the assembly level, because the compiler may map your C++ code to machine code in ways you aren't handling.

Usually, given a chunk of assembly code, it's fairly easy to spot preemption problems. It's virtually impossible to see them reliably from just the C++ code.

So my suggestion is to post up the generated assembly and we can examine that; just making conjectures about the C++ isn't going to get us anywhere [smile]


But is atomicity not guaranteed on multi-core processors?

As for volatile, I remember two three four useful articles:

  • "Andrei Alexandrescu::volatile - Multithreaded Programmer's Best Friend"

  • "Arch Robison::Volatile: Almost Useless for Multi-Threaded Programmin"g

  • "Ian Lance Taylor::volatile"
    Quote:Ian Lance Taylor
    While volatile writes are guaranteed to occur in the program order for the core which is executing them, there is no guarantee that any other core will see the writes in the same order. Using volatile does not imply any sort of memory barrier; the processor can and will rearrange volatile memory accesses (this will not happen for address ranges used for memory mapped hardware, but it will for ordinary memory).

  • Bartosz Milewski::C++ Atomics and Memory Ordering
    Quote:Bartosz Milewski
    No, I'm not talking about the stock market. I mentioned before that C++ volatile has nothing to do with multithreading. This is not entirely true because some compiler vendors took the liberty to add non-standard semantics to volatile (out of pure desperation, since they had to support multi-processor code, and pre-0x C++ offered no help in this area). This new semantics, at least in the case of Microsoft compilers, did not include sequential consistency. Instead it guaranteed acquire and release semantics (which, on the x86, didn't result in any fencing). So, although the publication pattern will work on a Microsoft compiler with volatile variables, the Peterson lock won't! I thought that was an interesting piece of trivia.




Also, see this thread: ompf.org::Use of Volatile.
