# cache friendly lock free concept?

## Recommended Posts

I suspect that the following concept is not very cache-friendly, but I'm no expert on the subject and would like some pointers.

Essentially, I have two threads: one for networking and one for the game simulation.  I want to process the packets in the game-simulation thread rather than in the network thread, so that the simulation remains deterministic.

Here's the concept I came up with for swapping packets in and out between the two threads:



volatile int LockState = UNLOCKED;

// Network thread:
void NetworkThreadTick(){
    RetrievePendingPackets(InPackets);
    if(LockState==LOCK_REQUESTED){
        Temp = TempList;     // Temp holds the spare list so the swap loses nothing
        TempList = InPackets;
        InPackets = Temp;
        LockState = LOCKED;
    }
    if(LockState==UNLOCK_REQUESTED){
        for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
            SendPacket(Lnk->value);
        }
        TempList->Clear();
        LockState = UNLOCKED;
    }
    return;
}

// Game-simulation thread:
void GameThreadTick(){
    if(LockState==LOCKED){
        for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
            ProcessPacket(Lnk->value);
        }
        TempList->Clear();
        Temp = TempList;
        TempList = OutPackets;
        OutPackets = Temp;
        LockState = UNLOCK_REQUESTED;
    }
    if(LockState==UNLOCKED) LockState = LOCK_REQUESTED;
    UpdateGames();  // Will push packets into the OutPackets list.
}



My thought is that if the pointers are swapped by a thread running on a different core, then the other core's cache has to be updated with the new pointer values before the second thread sees them.

So, thoughts on it are much appreciated.

Edited by slicer4ever

##### Share on other sites
Well, for one, you could move the data closer together by using a vector instead of a linked list. You could move the data even closer by using the vector to store the packets themselves, though managing it would be a bit tricky:
typedef unsigned char byte;
vector<byte> packet_store;

template<class T>
void write_memory( byte* at, const T& val ) {
    // Assumes T derives from Packet and T is POD.
    *reinterpret_cast<T*>(at) = val;
}

template<class T>
void store_packet( vector<byte>& packet_store, const T& packet ) {
    size_t total_size = sizeof(size_t) + sizeof(T);

    size_t idx = packet_store.size();
    packet_store.resize(packet_store.size() + total_size); // assumes T derives from Packet and T is POD

    byte* at = &packet_store[0] + idx;
    write_memory(at, total_size);
    write_memory(at + sizeof(size_t), packet);
}

template<class F>
void foreach_packet( vector<byte>& packet_store, F f ) {
    byte* at = &packet_store[0];
    byte* end = at + packet_store.size();
    for ( size_t total_size = 0; at < end; at += total_size ) {
        total_size = *reinterpret_cast<size_t*>(at);
        Packet* p = reinterpret_cast<Packet*>(at + sizeof(size_t));

        f(p);
    }
}

Beware, sample code is not alignment-aware.

.. and would like some pointers.

Inb4 XKCD or 0x1234ABCD huehuehue.

##### Share on other sites

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.

##### Share on other sites

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.

Mind pointing out why it won't work?  It's actively deployed right now (the above code is pseudo-code I wrote up that demonstrates what I'm doing; I'll go over it to make sure I didn't miss something).

@fastcall32, storing each packet in a single buffer would definitely be more cache-friendly, but for now I'd like to focus on the concept itself rather than on how the packets are stored.

##### Share on other sites

I guess I was just thrown off because I didn't read into the code enough.

##### Share on other sites

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.

##### Share on other sites

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.

Could you please expand on this?  I've been reading over the code several times and don't see where any race occurs; each stage of the swapping happens in discrete steps, so at no point do I see where the data can overlap.

##### Share on other sites

There is shared state in this synchronization (LockState). If you want to access it from both threads, you have to synchronize those accesses. One way is to use hardware memory barriers before and after each access of LockState. This blocks instruction and memory-access reordering by the compiler and processor and synchronizes memory. The other way is to use an atomic read/write, which locks the memory bus for the duration of the memory access (this is what a CAS cycle does). In its current state, the code synchronizes the two threads only by luck.
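To make the idea concrete, here is a minimal sketch of such a synchronized handoff using C++11 std::atomic with acquire/release ordering. The names (lockState, sharedValue, producer, consumer) are illustrative, not from the original code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> lockState(0); // 0 = UNLOCKED, 1 = LOCKED
int sharedValue = 0;           // plain data published through lockState

void producer() {
    sharedValue = 42;                              // ordinary write...
    lockState.store(1, std::memory_order_release); // ...published by the release store
}

void consumer() {
    // The acquire load pairs with the release store: once it observes 1,
    // the write to sharedValue is guaranteed to be visible to this thread.
    while (lockState.load(std::memory_order_acquire) != 1) { /* spin */ }
    assert(sharedValue == 42);
}
```

The release store and acquire load do for you exactly what the hand-rolled volatile flag cannot: they establish the happens-before edge between the two threads.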

##### Share on other sites
Firstly, you aren't using any atomic variables (or legit locks), so it will only work on very permissive architectures like x86, and even then it may not if you turn optimizations up high enough for the compiler to mess your code up. LockState needs to be atomic.

Secondly, you have the potential for false sharing with your four variables together like that. You need to ensure that each of them is on its own cache line. The usual way to do this is to put each in a class with an otherwise unused trailing field that acts as a spacer. There are other ways too, but beware that, in general, the compiler is free to remove unused variables.

Thirdly, use a vector (or another array structure), not a linked list, for cache performance.

Fortunately, assuming the atomic issue is fixed, I don't see a data race.
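As a concrete sketch of the padding suggestion above, assuming a 64-byte cache line (an assumption; it is common on x86 but not guaranteed, and C++17's std::hardware_destructive_interference_size can query it portably):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// alignas(64) rounds the struct's size and alignment up to 64 bytes, so
// two adjacent PaddedAtomic objects can never share a cache line.
struct alignas(64) PaddedAtomic {
    std::atomic<int> value;
};

static_assert(sizeof(PaddedAtomic) == 64, "one object per cache line");

// Each shared variable gets its own cache line, avoiding false sharing.
PaddedAtomic lockState;
PaddedAtomic packetCount;
```

This avoids the "unused spacer field" trick and lets the compiler enforce the layout for you.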

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

@King Mir, I was thinking that the swap might be susceptible to the pointers sitting on their own cache lines: when I do the swap and update the lock, the other thread will likely not have the swapped pointers in its cache.  Hmm, this is a potentially serious problem.

Edited by slicer4ever

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

Quote:

Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

Quote:

Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

@King Mir, I was thinking that the swap might be susceptible to the pointers sitting on their own cache lines: when I do the swap and update the lock, the other thread will likely not have the swapped pointers in its cache.  Hmm, this is a potentially serious problem.

It needs to be atomic. You can use acquire-release semantics instead of sequential consistency, but volatile is insufficient.

Quite the opposite: you want the data on separate cache lines. All four variables need to be on their own cache line; otherwise you could get false sharing and cache thrashing. It's not a correctness problem, but it is a performance problem.

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

Quote:
Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.90).aspx

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

Why are you using Microsoft specific compiler extensions when a standard solution is readily available?

##### Share on other sites

I marked LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

Quote:
Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.90).aspx

Haha, that's cute. No.

The Release and Acquire semantics of volatile mentioned in that MSDN article refer to the order of operations in which the compiler generates the assembly; they are compiler memory barriers. They do not guarantee that the proper hardware memory barriers are generated.
A quick look at the generated assembly makes this evident. The reason it's "Microsoft Specific" is that the standard doesn't guarantee it: under the standard, the optimizing compiler is allowed to move reads/writes of non-volatile variables to before or after the volatile read/write, which still matters when dealing with concurrent code. But in all cases, volatile does nothing in terms of hardware memory barriers, nor does it guarantee atomicity.

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.
The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.
You need a hardware-locked CAS to do that, or a mutex.
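A minimal sketch of such a CAS-based check-and-set, using portable C++11 atomics rather than a platform intrinsic (the constants mirror the pseudo-code above):

```cpp
#include <atomic>
#include <cassert>

enum { UNLOCKED = 0, LOCKED = 1 };
std::atomic<int> lockState(UNLOCKED);

// Only one thread can win: the compare and the store happen as a single
// atomic operation, so two threads can never both see UNLOCKED and both
// proceed, which is exactly the race described above.
bool tryLock() {
    int expected = UNLOCKED;
    // Atomically: if (lockState == expected) { lockState = LOCKED; return true; }
    //             else { expected = lockState; return false; }
    return lockState.compare_exchange_strong(expected, LOCKED);
}
```

On x86 this compiles down to a `lock cmpxchg`, the hardware-locked CAS mentioned above.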

##### Share on other sites

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

It still needs to be accessed atomically via the Interlocked* family of functions (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684122(v=vs.85).aspx).
Being volatile means the compiler can't optimize accesses to the variable, on the assumption that it may be accessed from multiple threads.
Being volatile is necessary for multi-threading, but it is only one part.  Notice how those functions require volatile parameters.

Most likely the only reason you have never noticed a race condition is that incoming packets are fairly rare compared to game loops, and when they do arrive they would have to do so at exactly the right time such that both threads are accessing that variable.
And because of how rare incoming packets are, your concerns about cache-friendliness are misplaced.

As long as you fix the race condition you shouldn’t worry too much more about this area.

L. Spiro Edited by L. Spiro

##### Share on other sites

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.

The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.
You need a hardware-locked CAS to do that, or a mutex.

Elaborate on this please: what "if" block is being entered in the second thread if LockState has not changed state yet?

##### Share on other sites

LockState can change due to another thread. You are currently using four different states, two in each thread, and the ifs are correct. However, it is really easy to miss this two months down the line when you are debugging and add a LOCKED check or UNLOCK_REQUESTED check in the other thread; that will cause a race condition.

By acquiring an actual HW barrier and releasing it when you are done with the section that has to complete before the threads can run in parallel again (e.g. a data copy from one thread to another), you guarantee that they will not create a flip situation like the one described above.

I used this template as my data copy between threads when I wrote a physics simulation, which ran on a separate thread from the input and rendering: http://pastebin.com/Uxgh6qC8 The class in this example protects memory access to its internal variable with critical sections, a HW barrier that is slightly easier to use than either a mutex or a semaphore on Windows. Generally, when doing memory operations from one thread to the other, you want to make sure that you either return a copy of the protected data, like in the example, or write directly to a class variable that is protected by a HW barrier on each access, to avoid stomping on data. The reason for returning a copy is that if you released the CS and then returned the m_data variable, another thread could already have acquired the CS and be midway through writing to the variable you are now returning, leaving you with garbage; the copy avoids this altogether.

As stated above, volatile doesn't guarantee a lock on the memory; all it guarantees is that the variable will be fetched from, or written to, memory, so you know the CPU doesn't have a register copy of the variable with a different value. It is sometimes useful to flag non-critical operations from one thread to another in this way, but even then you can end up with a race condition.
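A minimal sketch of the copy-on-return pattern described above, with std::mutex standing in for the Windows critical section (the class and member names are illustrative, not from the pastebin):

```cpp
#include <cassert>
#include <mutex>

template <class T>
class Guarded {
    mutable std::mutex m_mutex;
    T m_data;
public:
    void set(const T& v) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_data = v;
    }
    T get() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_data; // the copy is made before the lock is released
    }
};
```

Because get() returns by value while the lock is still held, the caller never holds a reference into memory that another thread may be rewriting.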

Edited by NightCreature83

##### Share on other sites

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

It still needs to be accessed atomically via the Interlocked* family of functions (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684122(v=vs.85).aspx).
Being volatile means the compiler can't optimize accesses to the variable, on the assumption that it may be accessed from multiple threads.
Being volatile is necessary for multi-threading, but it is only one part.  Notice how those functions require volatile parameters.

But as it says on the top of the link you posted "Simple reads and writes to properly-aligned 32-bit variables are atomic operations." And in the link slicer4ever linked to, Microsoft extends the definition of volatile to give it acquire and release semantics. These two things together should be sufficient for multithreading, especially with only two threads involved.

Of course, I don't recommend doing this. Just use C++11 std::atomic<int>.

-----

I also agree with NightCreature83 that the code will only work with the two threads. However, I think EnterCriticalSection is blocking, so that would make this no longer lock-free. But actually you might want to do that anyway, at least on the network side. Edited by King Mir

##### Share on other sites

But as it says on the top of the link you posted "Simple reads and writes to properly-aligned 32-bit variables are atomic operations." And in the link slicer4ever linked to, Microsoft extends the definition of volatile to give it acquire and release semantics. These two things together should be sufficient for multithreading, especially with only two threads involved.

See http://www.gamedev.net/topic/649586-cache-friendly-lock-free-concept/#entry5106076

The link in my article refers to all properly-aligned 32-bit variables on x86 machines, not just volatile ones, and it refers to only the actual instruction that performs the read or write. As in, these cases are not possible:
* Thread 1 performs “MOV [0xBADF00D], EAX” while thread 2 performs “MOV [0xBADF00D], EDX”—it is not possible for address 0xBADF00D to have part of the value from thread 1’s EAX and part of the value from thread 2’s EDX; it will have one or the other.
* Thread 1 performs “MOV [0xBADF00D], EAX” while thread 2 performs “MOV EDX, [0xBADF00D]”—the read and write will not happen concurrently in such a way that thread 2 could load part of the previous value at 0xBADF00D and part of the EAX value being written to it at the same time.

The article I posted refers to this and only this type of atomic operation, and is unrelated to volatile integers.
With care you can make his code work without the Interlocked* functions via proper fencing.  That in itself is a bit expensive but could be faster than Interlocked*, but the trade-off in performance is that it only works in the most controlled of cases.  Introducing yet another thread into the mix may break it, or an unforeseen case in which packets are received unexpectedly quickly at just the wrong moment (not saying I spot that case here but we had a similar case at work).

Also, the guy sitting next to me wrote this thorough article on threading, which covers basically everything you could ever want to know.

L. Spiro

Edited by L. Spiro

##### Share on other sites
So I should point out that the volatile extension link shows an example of a volatile variable being used to implement mutual exclusion, without any threading primitives.

You don't need any fencing to implement acquire-release semantics on x86. You generally do for sequential consistency, but here there are only two threads, so acquire-release is equivalent to sequential consistency.

So the OP's code, and Microsoft's example should work on x86. It shouldn't work on ARM, and Microsoft compiles for ARM, so maybe they should clarify that. (Unless they actually do add a memory fence on ARM).

In any case, nobody do this.

----

Also I disagree that "if( LockState == UNLOCKED ){ LockState = LOCKED }" needs to be atomic. It's the same logic as all the other branches, except that the body is empty. Edited by King Mir

##### Share on other sites
Thanks for the input so far. @L. Spiro, I'm reading over the articles you linked to. I however still don't understand the logic that Matias Goldberg put forth:

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.
The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.

I'd still like an explanation of what you mean by "The second thread can enter the 'if' block" when the second thread relies on LockState being set to LOCKED; how can it somehow enter that if branch, then? Could you please elaborate on this. Edited by slicer4ever

##### Share on other sites

You can get task-switched to another thread after any instruction (an if and an assignment can be several instructions long), so you need to do the check and the set of the new value in a single instruction.

Like this

InterlockedCompareExchange function

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683560%28v=vs.85%29.aspx

##### Share on other sites
Using compare exchange there would only make the program slower. There's no need for the operation to be atomic here.

##### Share on other sites

Thanks for the input so far. @L. Spiro, I'm reading over the articles you linked to. I however still don't understand the logic that Matias Goldberg put forth:

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.
The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.

I'd still like an explanation of what you mean by "The second thread can enter the 'if' block" when the second thread relies on LockState being set to LOCKED; how can it somehow enter that if branch, then? Could you please elaborate on this.

When you have the following code running on two threads at the same time:
if( LockState == UNLOCKED )
{
LockState = LOCKED;
doSomething();
}
The following can potentially happen:
1a) thread 1 reads LockState. Its value is UNLOCKED. Enters the if block.
1b) thread 2 reads LockState. Its value is UNLOCKED. Enters the if block.
2a) thread 1 sets LockState to LOCKED
2b) thread 2 sets LockState to LOCKED

There you go, two threads entering the if block. LockState is LOCKED.

Another case:
1a) thread 1 reads LockState. Its value is UNLOCKED. Enters the if block.
1b) thread 2 reads LockState. Its value is UNLOCKED. Doesn't enter the if block yet.
2a) thread 1 sets LockState to LOCKED.
2b) thread 2 gets rescheduled by the OS.
3b) thread 2 compares LockState, which was read in step 1b and is now in a register. It is still UNLOCKED even though in RAM it is now set to LOCKED. As a result, it enters the if block.
4a) thread 1 is still inside doSomething.

Again, both threads entered the if block at the same time.
I'm surprised I have to explain this because it's multithreading 101. It's in every tutorial and book out there. Edited by Matias Goldberg