Critique my lockless queue, Part II

Started by
19 comments, last by BradDaBug 14 years, 4 months ago
Quote:Original post by cache_hit
First you say that the usage of volatile to ensure that the compiler doesn't re-order access should be portable, then you say it's nonstandard behavior of volatile adding memory fences. That's exactly what memory fences are for, making sure operations aren't re-ordered.
There's compile-time reordering and run-time reordering. I was saying that volatile only inhibits compile-time reordering, whereas a memory-fence only inhibits run-time reordering. There's no contradiction there.
Quote:FWIW, it's Microsoft specific behavior, and as far as the C++ standard goes, it says nothing about re-ordering or anything of the sort. volatile is almost completely useless if you're only interested in its standard-defined behavior.
AFAIK, the behaviour where volatile adds fences (i.e. inhibits runtime reordering) is Microsoft-specific, but I thought all compilers inhibited compile-time reordering of volatile variables. Accessing a volatile variable is one of those points in a program where everything before it must be complete and nothing after it must have taken place (of course modern CPUs break this by optimising at run-time, which is why you need fences as well).

Can someone reference the part of the spec dealing with volatile? IIRC the rules of volatile are directly linked to the sequencing of the abstract machine...

[Edited by - Hodgman on December 6, 2009 7:44:19 PM]
The standard says a number of things about volatile:

Quote:
7.15.1.8 volatile is a hint to the implementation to avoid aggressive optimization involving the object
because the value of the object might be changed by means undetectable by an implementation. See 1.9 for
detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are
in C.


Of course, it's only a hint. So we refer back to 1.9 and it says the following:

Quote:
1.9 When the processing of the abstract machine is interrupted by receipt of a signal, the values of objects with
type other than volatile sig_atomic_t are unspecified, and the value of any object not of
volatile sig_atomic_t that is modified by the handler becomes undefined.


There are a number of surrounding comments regarding volatile as well:
Quote:
1.6 The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and
calls to library I/O functions.

1.7 Accessing an object designated by a volatile lvalue (3.10), modifying an object, calling a library I/O
function, or calling a function that does any of those operations are all side effects, which are changes in the
state of the execution environment. Evaluation of an expression might produce side effects. At certain
specified points in the execution sequence called sequence points, all side effects of previous evaluations
shall be complete and no side effects of subsequent evaluations shall have taken place.

1.11 The least requirements on a conforming implementation are:
— At sequence points, volatile objects are stable in the sense that previous evaluations are complete and
subsequent evaluations have not yet occurred.
— At program termination, all data written into files shall be identical to one of the possible results that
execution of the program according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take place in such a fashion that prompting
messages actually appear prior to a program waiting for input. What constitutes an interactive device is
implementation-defined.
[Note: more stringent correspondences between abstract and actual semantics may be defined by each
implementation. ]



I couldn't find anything else related enough. Furthermore, this is from the Microsoft documentation:

Quote:
Microsoft Specific

Objects declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object at the point it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.

Also, when optimizing, the compiler must maintain ordering among references to volatile objects as well as references to other global objects. In particular,

A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

Note
Although the processor will not reorder un-cacheable memory accesses, un-cacheable variables must be volatile to guarantee that the compiler will not change memory order.

End Microsoft Specific



It is for this reason that articles such as the one linked by swiftcoder above exist. In particular, that article states:

Quote:
Hans Boehm points out that there are only three portable uses for volatile. I'll summarize them here:

* marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp
* memory that is modified by an external agent, or appears to be because of a screwy memory mapping
* signal handler mischief



In practice, I do think a lot of compilers take the "hint" mentioned in 7.15.1.8 of the standard quoted above. I just don't think it's safe to say it's a "portable" use of the keyword.
Quote:Original post by cache_hit
— At sequence points, volatile objects are stable in the sense that previous evaluations are complete and
subsequent evaluations have not yet occurred.

Hans Boehm points out that there are only three portable uses for volatile.
* volatile may be used when variables may be "externally modified", but the modification in fact is triggered synchronously by the thread itself, e.g. because the underlying memory is mapped at multiple locations.
The way that I interpret that is: two reads of a volatile variable won't be optimised down to a single fetch. If you request the value of a volatile variable twice, it will be fetched from memory twice, and those fetches will occur in the sequence specified by the programmer.

If you simply used a memory fence instruction, but didn't use a volatile variable, the compiler is allowed to remove your "redundant" fetches. However, they're not really "redundant" because you know that the value might have been modified behind the scenes (by another thread -- i.e. "externally modified" in Hans Boehm's words), and you really do need to perform the subsequent "redundant" fetches.

Hence, both volatile access and memory fences are important for thread-safe code, which is why the Interlocked* API demands you use volatile.

In Herb Sutter's volatile vs. volatile, he also interprets the spec as saying that reads from a volatile variable can't be reordered (with respect to each other) or omitted by the compiler (unlike normal reads).

[edit]So I'm wrong about it inhibiting *all* compile-time reordering -- some compilers still allow regular variable accesses to be moved across a volatile access -- but all compilers should preserve the order of volatile accesses with respect to other volatile accesses, and most importantly, no compiler should omit any volatile accesses.

[edit #2]We really should get back to critiquing BradDaBug's actual code now --
I'll try to have an in-depth look at it tonight ;)

[Edited by - Hodgman on December 6, 2009 10:00:15 PM]
Ok BradDaBug, there unfortunately are a few bugs in your code that I can see.

1) Everywhere you've written "volatile Node*", you actually meant "Node* volatile".

The former is a (regular) pointer to a volatile value; the latter is a volatile pointer to a (regular) value. Seeing as the pointer is the thing being modified by multiple threads, it's the thing that needs to be volatile.

2) In your "pop" functions, you've got this code:
                n = m_head;
                if (InterlockedCompareExchangePointer((volatile PVOID*)&m_head, (void*)m_head->Next, (void*)n) == n)
                    break;
This is dangerous, as m_head is read twice - first its value is read so it can be copied into n, then it is read a second time at m_head->Next. It's possible for the value of m_head to be modified between these two accesses, leading to a race condition.

You can fix this by getting the value of Next via n, instead of accessing m_head a second time.
                n = m_head;
                Node* next = n->Next;


Here's my fixed (but untested) version. You'll notice that by fixing bug #1, none of your local variables need to use the volatile keyword any more:
template<typename T>
class MessageStack
{
    struct Node
    {
        Node* volatile Next;
        T Data;
    };

    Node* volatile m_head;
    SDL_sem* m_semaphore;

public:
    MessageStack()
    {
        m_semaphore = SDL_CreateSemaphore(0);
        m_head = NULL;
    }

    ~MessageStack()
    {
        SDL_DestroySemaphore(m_semaphore);
        while(m_head != NULL)
        {
            Node* n = m_head;
            m_head = m_head->Next;
            _aligned_free(n);
        }
    }

    void AddMessage(T message)
    {
        Node* n = (Node*)_aligned_malloc(sizeof(Node), sizeof(void*));
        n->Data = message;
        while(true)
        {
            n->Next = m_head;
            if (InterlockedCompareExchangePointer((volatile PVOID*)&m_head, n, n->Next) == n->Next)
                break;
        }
        // let any waiting readers know there's an available node
        SDL_SemPost(m_semaphore);
    }

    // blocks until a message is available, and then removes it from the list
    T RemoveMessage()
    {
        // block the reader until we have something available
        SDL_SemWait(m_semaphore);
        Node* n = NULL;
        while(true)
        {
            n = m_head;
            Node* next = n->Next;
            if (InterlockedCompareExchangePointer((volatile PVOID*)&m_head, next, n) == n)
                break;
        }
        T t = n->Data;
        _aligned_free(n);
        return t;
    }

    // attempts to get the next message, but won't block.
    // returns true if a message was received, false otherwise
    bool TryRemoveMessage(T* message)
    {
        if (SDL_SemTryWait(m_semaphore) == 0)
        {
            Node* n = NULL;
            while(true)
            {
                n = m_head;
                Node* next = n->Next;
                if (InterlockedCompareExchangePointer((volatile PVOID*)&m_head, next, n) == n)
                    break;
            }
            *message = n->Data;
            _aligned_free(n);
            return true;
        }
        return false;
    }
};


Before I'd declare any of these fancy data-structures to be truly thread safe though (even after peer review), I'd construct a stress test where lots and lots of threads hammer the thing with billions of push/pop operations, making sure no values ever get lost/corrupted in the process ;)
I've had a queue before that worked fine in a stress test for hours before one of the subtle bugs caused it to corrupt a value :/

[Edited by - Hodgman on December 7, 2009 6:50:35 AM]
Lock-free access for such a queue is best achieved with ATOMIC read-writes, and this is possible on compilers where 'volatile' means exactly that, e.g. VC++ 2010 or later (versions earlier than this have not offered atomic read-writes via this keyword; they have rather offered system-wide access to variables using it, i.e. not atomic).
Quote:Original post by ddlox
Lock-free access for such a queue is achieved at best with ATOMIC read-writes...
The Interlocked* functions (used in this code) are part of the Windows API for performing atomic operations on any Win32 compiler. Code written to use this API should assume that volatile has the standard meaning, not the MS-specific atomic meaning.
Quote:Original post by Hodgman
1) Everywhere you've written "volatile Node*", you actually meant "Node* volatile".

Oops, thanks.

Quote:2) In your "pop" functions, you've got this code:*** Source Snippet Removed ***This is dangerous as m_head is read from twice - first it's value is read so it can be copied into n, then it is read a second time at m_head->Next. It's possible for the value of m_head to be modified between these two accesses, leading to a race condition.

You can fix this by getting the value of Next via n, instead of accessing m_head a second time.*** Source Snippet Removed ***

Why would it matter if m_head changes between those two lines? Wouldn't the InterlockedCompareExchangePointer() detect that n is now different than m_head, then when it returned the current version of m_head it would be different than n, and the loop would continue and try again?

On a hopefully unrelated note, I've been doing some testing with several threads, and I've started getting an error message that says something like "Damage before 0xSomeMemoryAddress which was allocated by aligned routine". It either happens when I call _aligned_malloc() or _aligned_free() (I've seen it happen with either). I can't see any obvious places where memory could get corrupted. I'm allocating everything, including the stack itself, with _aligned_malloc(). Is there anything else I need to do to make sure the memory is aligned properly? Or is the issue something else?
Try running under application verifier.
Quote:Original post by BradDaBug
Why would it matter if m_head changes between those two lines? Wouldn't the InterlockedCompareExchangePointer() detect that n is now different than m_head, then when it returned the current version of m_head it would be different than n, and the loop would continue and try again?
Yeah, it won't be a big deal, unless m_head has been changed to NULL ;) It's also slightly worse performance-wise, due to m_head being volatile (whereas n is not).

To handle that case (where there is one item on the stack, and two threads simultaneously pop it), you'd have to do something like:
                n = m_head;
                if (!n)
                {
                    // one of:
                    return RemoveMessage(); // for RemoveMessage, try again
                    return false;           // for TryRemoveMessage, return failure
                }
                Node* next = n->Next;
[EDIT]Actually, does the semaphore take care of this for you? Just to make sure, you could put in: if(!n)printf("oh noes!!!"); ;)[/EDIT]


Quote:On a hopefully unrelated note, I've been doing some testing with several threads, and I've started getting an error message that says something like "Damage before 0xSomeMemoryAddress which was allocated by aligned routine".
Not exactly sure, but it sounds like a memory-overrun problem. Double-check you're not looping past the end of any arrays, using any pointers after they've been deleted, or doing any dangerous casts (e.g. allocating a Foo and casting the pointer to a Bar*).
Actually, the above bug, if it occurs, would lead to calling free twice on the same pointer, which is going to do god-knows-what to memory.


[edit #2]
Found another bug :( As before, it probably won't do anything bad because the Interlocked Exchange will save you... but you are potentially accessing invalid memory, which AFAIK is undefined behaviour.

This problem is actually written about in most of the papers on lock-free linked list structures... It's really hard to safely deallocate the nodes, because other threads may still hold pointers to them.
One (rough) solution is to garbage collect the nodes - instead of deallocating them instantly, you put them into a queue to be deleted in "some suitable amount of time in the future".
Here's the scenario:
    // Two threads call this function at the same time:
    T RemoveMessage()
    {
        SDL_SemWait(m_semaphore);
        Node* n = NULL;
        while(true)
        {
            n = m_head;
            // Both threads get here at the same time; they both read the same value of m_head into 'n'.
            // Thread #1 then goes to sleep here (see below where it wakes up in this scenario).
            Node* next = n->Next;
            // Thread #1 just accessed unallocated memory, hopefully the system didn't notice...
            if (InterlockedCompareExchangePointer((volatile PVOID*)&m_head, next, n) == n)
                break;
        }
        T t = n->Data;
        _aligned_free(n);
        // When Thread #2 reaches this point, Thread #1 wakes back up.
        // 'n' has now been deallocated, but Thread #1 is about to access 'n->Next', which is now undefined!
        return t;
    }


[Edited by - Hodgman on December 7, 2009 5:15:55 PM]
Yep, I think you're right.

I think I spotted a similar bug in AddMessage(). As soon as InterlockedCompareExchangePointer() executes, but before the n->Next part is evaluated on the other side of the ==, another thread can execute RemoveMessage() to remove and delete the n that was just added, so n->Next is invalid. Since the result returned by InterlockedCompareExchangePointer() is different from whatever is in n->Next (since the debug CRT has written 0xfeeefeee or whatever), the loop tries again with a node that's already been freed... and bad things happen. I think I fixed that one just by storing n->Next in another variable before the call to InterlockedCompareExchangePointer() and using that instead of n->Next.

