# cache friendly lock free concept?

This topic is 1537 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Mattias, doesn't that example apply to multiple threads running the same code? Slicer4ever's code has two threads running separate code. Hold on. Not an expert in multithreaded programming, but...
The network thread acts on LOCK_REQUESTED/UNLOCK_REQUESTED and sets to LOCKED/UNLOCKED, whereas the game thread acts on LOCKED/UNLOCKED and sets to UNLOCK_REQUESTED/LOCK_REQUESTED. A thread cannot renter its "critical sections" until the other thread unsets the [tt]LockState[/tt]. Assuming that [tt]LockState[/tt] is executed and committed after at the end of the threads' "critical sections", isn't the worse case scenario then a thread may need to reenter its loop? Or, am I misunderstanding something here? Edited by fastcall22

##### Share on other sites
fastcall22, oohh! I read the code wrong. I apologize
You're right. The access to LockState appears to not contain race conditions (can't say about the rest of the variables because I didn't look those carefully) as long as LockState is only accessed from two threads.

Nonetheless, he should put hardware fences/Memory barriers or otherwise the out of order execution unit from an x86 cpu can cause LockState to be set to a different value sooner than expected.

Furthermore, and as already pointed out, InPackets OutPackets and TempList are likely to be in the same cache line which means a lot of false cache sharing.

Another problem with this routine is that it appears the packets can be delayed several iterations because LockState is still being updated in its 4 state system; which can add noticeable lag.

##### Share on other sites

fastcall22, oohh! I read the code wrong. I apologize
You're right. The access to LockState appears to not contain race conditions (can't say about the rest of the variables because I didn't look those carefully) as long as LockState is only accessed from two threads.

Nonetheless, he should put hardware fences/Memory barriers or otherwise the out of order execution unit from an x86 cpu can cause LockState to be set to a different value sooner than expected.

Furthermore, and as already pointed out, InPackets OutPackets and TempList are likely to be in the same cache line which means a lot of false cache sharing.

Another problem with this routine is that it appears the packets can be delayed several iterations because LockState is still being updated in its 4 state system; which can add noticeable lag.

so.....now that we are finally past this=-\, can we get back to the root of the original question:

Furthermore, and as already pointed out, InPackets OutPackets and TempList are likely to be in the same cache line which means a lot of false cache sharing.

This is mostly what i care about, basically i'm exchanging the pointers, so let's say the threads are running on two separate cores, so basically, if i understand what's going on at that level(which i probably don't, hence these questions) ThreadA swaps the pointers, then in order for ThreadB to have these new pointer locations in it's cache, will ThreadA send an update message to ThreadB(doubtful, as in theory ThreadB doesn't know ThreadA also has these pointers), so to get the updated values, it has to read it back from ram(meaning ThreadA would write to ram), or potentially from a higher level cache? can someone explain to me what's going on at this level, and if i should be doing something else here to ensure i don't have more false cache sharing?

##### Share on other sites

This is mostly what i care about, basically i'm exchanging the pointers, so let's say the threads are running on two separate cores, so basically, if i understand what's going on at that level(which i probably don't, hence these questions) ThreadA swaps the pointers, then in order for ThreadB to have these new pointer locations in it's cache, will ThreadA send an update message to ThreadB(doubtful, as in theory ThreadB doesn't know ThreadA also has these pointers), so to get the updated values, it has to read it back from ram(meaning ThreadA would write to ram), or potentially from a higher level cache? can someone explain to me what's going on at this level, and if i should be doing something else here to ensure i don't have more false cache sharing?

No that's not what's going on.

A thread on Core A writes one of the 4 variables, which puts it in it's L1 cache. Then a thread on Core B reads one of the variables, but noticing that the cache line of that variable was modified by core A, it must get it from the write buffer of Core A, which depending on the detail of the implementation involve going though the L3 cache, or otherwise has a comparable latency. It might then write to that cache line itself, so that Core A's L1 cached version is invalidated, and it must read from the write buffer of B. Then Core A might write to the cache line, which invalidates Core B's l1 and L2 cache of that line. And back and forth.

That kind of thing is unavoidable on LockState, and to a lesser extent on TempList, but if all 4 variables are on the same cache line, they are all shared as if they were one variable, incurring the performance penalties of sharing on all of them whenever any one of them is accessed. It's like they are one variable to the cache. So the solution is to ensure that all four variables are on a separate cache line.

One way to do this is instead of using pointers and a primitive integer, use a class the size of a cache line. You can even make it your linked list, if you make swapping the list cheep, and put enough padding at the end of it to fill a cache line. Edited by King Mir

##### Share on other sites

This is mostly what i care about, basically i'm exchanging the pointers, so let's say the threads are running on two separate cores, so basically, if i understand what's going on at that level(which i probably don't, hence these questions) ThreadA swaps the pointers, then in order for ThreadB to have these new pointer locations in it's cache, will ThreadA send an update message to ThreadB(doubtful, as in theory ThreadB doesn't know ThreadA also has these pointers), so to get the updated values, it has to read it back from ram(meaning ThreadA would write to ram), or potentially from a higher level cache? can someone explain to me what's going on at this level, and if i should be doing something else here to ensure i don't have more false cache sharing?

No that's not what's going on.

A thread on Core A writes one of the 4 variables, which puts it in it's L1 cache. Then a thread on Core B reads one of the variables, but noticing that the cache line of that variable was modified by core A, it must get it from the write buffer of Core A, which depending on the detail of the implementation involve going though the L3 cache, or otherwise has a comparable latency. It might then write to that cache line itself, so that Core A's L1 cached version is invalidated, and it must read from the write buffer of B. Then Core A might write to the cache line, which invalidates Core B's l1 and L2 cache of that line. And back and forth.

That kind of thing is unavoidable on LockState, and to a lesser extent on TempList, but if all 4 variables are on the same cache line, they are all shared as if they were one variable, incurring the performance penalties of sharing on all of them whenever any one of them is accessed. It's like they are one variable to the cache. So the solution is to ensure that all four variables are on a separate cache line.

One way to do this is instead of using pointers and a primitive integer, use a class the size of a cache line. You can even make it your linked list, if you make swapping the list cheep, and put enough padding at the end of it to fill a cache line.

Thank you for such a detailed explanation of everything that's going on! =-)

##### Share on other sites
Yes. King Mir's explanation on false cache sharing is spot on. If you modify a single byte inside a cache line, it "dirties" the whole line; and modifying the same region of memory from all cores at very high frequencies adds a tremendous overhead.

It may be worth saying that most x86 archs have a cache line size of 64 bytes; though this isn't guaranteed (i.e. it could be 128 bytes on some systems, 32 bytes on others, etc); while in ARM it is more typical to find cache sizes of either 32 or 64 bytes. What I mean, padding of just 4 or 8 bytes won't be enough.

##### Share on other sites

Your design is logically good, but technically mistaken.  Lock-free structures generally depend on compare/exchange instructions to keep threads from reading the wrong value or overwriting someone else's.  "Lock-free" is a popular term these days, and while it is an awesome choice in the right circumstances, it is not always the best.  For e.g.:

* Lock-free design works best when the period of contention between threads is very small, such as when swapping a pointer value or updating a counter.  This is because there is always a brief spin whenever the protected value is found to in contention.

* Lock-based design works best when the period of contention will be longer, such as to execute some IO or do some processing, as the waiting thread is taken out of the scheduler and no core "spins" during this time.

Nonetheless, your use of Volatile will not do what you want. It is intended to force the compiler to do NO optimizations on that variable by declaring that the value might be updated at any time--such as by a separate piece of hardware--but the compiler does not place a memory fence around accesses to the variable. So, yes, your caches may be inconsistent.

In general, designing highly concurrent logic is hard. It's fun to do, and if that is your goal, please proceed!  I hold a patent in this area (owned by my previous employer) because I love it.  But most often, if your goal is just efficiency, classic producer/consumer logic (with or without lockinig) will meet any possible need you have related to networking--which lets face it, is incredibly slow compared to our CPUs these days :-)

##### Share on other sites

Everyone's already covered what's right and wrong (don't use volatile! It may work on MSVC due to extensions, but it's actually wrong) with the code, but I just have to point out that this code is not 'lock free'.

Lock free doesn't mean "doesn't use the existing lock structures"; it's a very specific term that basically means that if any of the threads happens to stall at any point in time, this will not impact the progress of any other thread.
The code in this thread is the opposite of that -- it hands control of the state variable back and forth between two threads exclusively. If you freeze the network thread while the state is not "unlocked", then the game thread also freezes. [edit: no it doesn't!]

The algorithm relies on the fact that only one thread "owns" the state variable at a time -- as mentioned by others, if his wasnt the case, you'd need some CAS/interlocked logic in there somewhere.

P.s. not being"lock free" isn't a bad thing. I'm just being a definition-nazi ;)

Edited by Hodgman

##### Share on other sites
Looks lock free to me. Freezing the network thread like you say would stop updating TempList, but the game thread would still continue to call UpdateGames().

I do think that blocking on LOCK_REQUESTED would be prudent though, especially if RetrievePendingPackets() is not lock free.

##### Share on other sites

King Mir is correct on that point.
If the network thread freezes the game thread will still call UpdateGames() and loop again; this part is not related to the state of LockState (we are of course assuming GameThread() is in some kind of loop or else none of this would be a discussion since it would be called once and that would be the end of it).

I’ve stopped replying in the interest of not debating with someone who simply refuses to understand what I am trying to explain, and I hoped that if others tried to explain the same thing he may come around.  But since I am here-

So I should point out that the volatile extension link show and example of a volatile variable being used to implement mutual exclusion, without any threading primitives.

That example is unrelated to his—it shows the variable being controlled by 1 thread while being accessed by another.  The 2nd thread never changes the value of “Sentinel”, it simply reads from it.  This example would work on any platform and any compiler, as it conforms entirely to the standard.  It has nothing to do with extensions or Microsoft®.  Also, please do not use misleading terms such as “mutual exclusion”—that is a far cry from what is happening in that sample code.  All it’s saying is that the 2nd thread will not load and keep in a register (as an optimization) the value of Sentinel in the while loop and thus will properly terminate whenever the 1st thread changes the value.

Let me put it another way.
An “if” statement requires a load, a compare, and a jump, in assembly.  Without a fence, a value could be loaded to a register, changed on another thread, but then have the CMP done on the previous value (loaded to a register) and perform its JMP based on that.  In fact that could happen even with a fence, especially if poorly placed.

If you assume Microsoft® extensions are aiming to solve this problem (which they are not), where exactly do you propose the compiler knows to place the fence.  On both threads.

It doesn’t.  Using fences correctly requires a human author, and even then it fails frequently unless the author is really really good at what he or she does.  There is no way the compiler could ever know where to place the fences on both threads, and there is no way the compiler would even try—it takes 100,000 compiler accuracies to make up for a single compiler bug.

If you need proof that Matias Goldberg’s first post is accurate, consider this line from that link:

A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

As he correctly stated, the compiled binary is the only thing related to the Microsoft® specifics.
My job is to optimize code on all platforms, including consoles, and I spend most of my day staring at compiled assembly. Not only that but I wrote MHS, including its assembler and disassembler, and have spent a lot of time looking at code from other games and applications including my own.

The compiler is absolutely 100% free to change the orders of reads and writes as long as the final result is the same, and I can promise you this happens on every single compiler I have encountered, for every platform.  In fact Nintendo 3DS’s compiler seems to go out of its way to do this, I assume for cache coherency.

This is the relationship between Microsoft®’s extensions and volatile.  It guarantees that in the compiled binary the code will be executed in the order specified by your code.

It does absolutely nothing to prevent the above situation between MOV, CMP, and JMP.  It adds no fences at all.  If you do not believe me, make your own program and check its disassembly.

Get back to us when you have done that.

L. Spiro

##### Share on other sites

I thought that UpdateGames() was inside if(LockState==UNLOCKED) -- this would mean that the main loop would stop updating the game if the network thread paused while the state was anything except UNLOCKED.

But I misread, so RetrievePendingPackets() and UpdateGames() with both continually run, regardless of the activity of the other thread. Freezing one thread won't impede the other.

So, yeah, as long as you say that the data being communicated is "a list of packets", then the communication between these two threads is lock-free.

If you wanted to be ridiculously picky, you could argue that the threads are actually communicating "individual packets", in which case there is locking, as you might have packets buffered up in InPackets/OutPackets that are stuck there due to the LockState variable being held by a thread that's gone to sleep (e.g. if the main thread is pre-empted right before LockState=UNLOCK_REQUESTED, then even though some data has been "shared" at this point, the network thread can't read it until the game thread wakes up  That's something for academia to argue about though

##### Share on other sites
@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

EDIT:Strike-through due to point made by Matias Goldberg bellow. Edited by King Mir

##### Share on other sites

The compiler is absolutely 100% free to change the orders of reads and writes as long as the final result is the same, and I can promise you this happens on every single compiler I have encountered, for every platform.

This is the relationship between Microsoft®’s extensions and volatile.  It guarantees that in the compiled binary the code will be executed in the order specified by your code.  [<-- and that's enough]

It does absolutely nothing to prevent the above situation between MOV, CMP, and JMP.  It adds no fences at all.  If you do not believe me, make your own program and check its disassembly.

x86 CPU's don't reorder memory ops though, and they don't have fence instructions.

... Actually "Loads may be reordered with older stores to different locations" and there are rarely used LFENCE/SFENCE/MFENCE instructions ...[/edit]

Any instruction with the lock prefix acts like a fence, but that prefix is only required when you need to perform an atomic operation on memory that could potentially be modified by two threads simultaneously.

Slicer's code works on mutual-exclusion. Assuming that there's no reordering, then there's no variables that are going to be written to by two threads at once -- every variable always has only 1 writer. So, it will work on x86 with just compile-time fences.

It won't be portable to other CPU's unless you add fences, so he really should be using a std::atomic still

Yeah, if his code was more complex and the state variable could have two potential writers, he would need to use CAS instead of an if and assignment.

The standard C/C++ volatile is not a proper compile-time fence, it only acts as a fence with regards to other volatiles -- reads/writes to volatiles aren't reordered past each other, but reads/writes to non-volatiles can be reordered past volatile reads/writes. The MSVC extension goes further than this turns turns volatile reads/writes into full compile-time fences.

A standard C++ compiler would be able to remove the =1 and =2 lines from this code, but MSVC keeps them, because there's a volatile write between them all.

	volatile int g_state_v = 0;
extern int g_state;
...
g_state = 1;
g_state_v++;
g_state = 2;
g_state_v++;
g_state = 3;
printf( "%d %d", g_state, g_state_v );
Edited by Hodgman

##### Share on other sites

@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

That's an overly simplification.
Section 8.2.2 from the Intel Intel 64 and IA-32 Architectures Software Developer’s Manual is very clear that:
"Reads may be reordered with older writes to different locations but not with older writes to the same location."

Here's even an example of how such reordering can affect the logic of an application even when just running 2 threads. Whether this issue affects the OP's code, can only be said by inspecting the assembly.

##### Share on other sites

@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

That's an overly simplification.
Section 8.2.2 from the Intel Intel 64 and IA-32 Architectures Software Developer’s Manual is very clear that:
"Reads may be reordered with older writes to different locations but not with older writes to the same location."
Here's even an example of how such reordering can affect the logic of an application even when just running 2 threads. Whether this issue affects the OP's code, can only be said by inspecting the assembly.

No, the code you linked to can be analyzed in a way that doesn't require looking at the disassebly to spot the problem, and so can the OP. In the op it's trivial to see that each access to the lock is an acquire-read on one variable followed by some critical section code, followed by a release-write on the same variable. That pattern is tried and true.

zeroWants = true; //release on zeroWants
victim = 0; //release on victim
while (oneWants && victim == 0) // aquire on  oneWants and victim
continue;
// critical code

It's a release followed by acquire, which is nothing. The acquire by itself would be usable to create a lock, but clearly here this lock does not work if you remove the first two lines.

But conceded, that does show that acquire-release semantics can differ from sequential consistency even with two threads. Thank you for the example.

Edited by King Mir

##### Share on other sites
hey guys, thanks for the continued input so far. i was reading up on false cache sharing to better understand what's going on, and what i can do(i saw the 64 bytes as most cache-lines). I found this article by intel: http://software.intel.com/sites/default/files/m/d/4/1/d/8/3-4-MemMgt_-_Avoiding_and_Identifying_False_Sharing_Among_Threads.pdf which gave me some interesting insights, most importantly was this passsage:

Avoid false sharing but use these
techniques sparingly. Overuse can hinder the effective use of
the processor’s available cache. Even with multiprocessor shared cache designs, avoiding false
sharing is recommended. The small potential gain for trying to maximize cache utilization on
multi processor shared cache designs does not generally outweigh the software maintenance
costs required to support multiple code paths for different cache architectures.

are they suggesting(as L. Spiro did before), that in the long run it's not really worth it to try to prevent false sharing. What i read from this(and i might be reading it wrong), as that optimizing for this only realistically provides a small performance gain anyway?

@Matias Goldberg, your link actually really helped me understand more of what's going on. I think one of my biggest problems with writing well-designed multi-processor code is that i've been relying on the x86 architecture's threading mechanisms, it's clear to me now that my locks could very easily fail as-is on other processors which don't have such strict store/load guidelines.

now then, let's say i added memory fences, would this be sufficient to prevent the issues that are presented with the double-checked locking problem:


violatile int LockState = UNLOCKED;

RetrievePendingPackets(InPackets);
if(LockState==LOCK_REQUESTED){
TempList = InPackets;
InPackets = Temp;
[write fence]
LockState = LOCKED;
}
if(LockState==UNLOCK_REQUESTED){
for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
SendPacket(Lnk->value);
}
TempList->Clear();
[write fence]
LockState = UNLOCKED;
}
return;
}

if(LockState==LOCKED){
for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
ProcessPacket(Lnk->value);
}
TempList->Clear();
TempList = OutPackets;
OutPackets = Temp;
[write fence]

LockState=UNLOCK_REQUESTED;
}
if(LockState==UNLOCKED){
[write fence]
LockState = LOCK_REQUESTED;
}
UpdateGames();  //Will push packets into the outpackets list.
}

i think i woudn't have to put read fence's in front of the access to LockState, as neither thread really care's when it see's the correct value, just so long as the linklist's are seen with the correct values when i begin working on them is all i care about...is my line of thinking correct here?

##### Share on other sites
It's easier to instead of using memory fences yourself, to use C++11 atomic primatives with the correct ordering specified. In this case you need minimally memory_order_aquire on each read of lock state (if statement condition), and memory_order_release on every write to lockState. The only exception is the if(LockState==UNLOCKED)LockState = LOCK_REQUESTED; Here, both can be memory_order_relaxed, because there is no critical section.

Alternatively don't bother with all that and just use the default behavior of atomic, changing violatile int LockState = UNLOCKED; to std::atomic<int> LockState = UNLOCKED; only.

With the memory barrier code you added, it's essentially correct, except you don't need the write fence before LockState = LOCK_REQUESTED.

EDIT: Speaking of if(LockState==UNLOCKED)LockState = LOCK_REQUESTED, you could just remove that, and reorder your code to only go back and forth once. Edited by King Mir