# cache friendly lock free concept?


## Recommended Posts

I have suspicions that the following concept is not very cache friendly, but I'm no expert on the subject, and would like some pointers.

Essentially, I have two threads: one for networking, and one for the game simulation. I want to process the packets in the game simulation thread, rather than in the network thread, so that the simulation stays deterministic.

Here's the concept I came up with for swapping packets in and out between the two threads:



volatile int LockState = UNLOCKED;

/*-- Network thread --*/
void NetworkThreadUpdate(){
  RetrievePendingPackets(InPackets);
  if(LockState==LOCK_REQUESTED){
    /* Hand the pending incoming packets over to the simulation thread. */
    Temp = TempList;
    TempList = InPackets;
    InPackets = Temp;
    LockState = LOCKED;
  }
  if(LockState==UNLOCK_REQUESTED){
    /* Send the outgoing packets the simulation thread handed back. */
    for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
      SendPacket(Lnk->value);
    }
    TempList->Clear();
    LockState = UNLOCKED;
  }
}

/*-- Simulation thread --*/
void SimulationThreadUpdate(){
  if(LockState==LOCKED){
    for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=null; Lnk = Lnk->Next){
      ProcessPacket(Lnk->value);
    }
    TempList->Clear();
    /* Hand the outgoing packets back to the network thread. */
    Temp = TempList;
    TempList = OutPackets;
    OutPackets = Temp;
    LockState = UNLOCK_REQUESTED;
  }
  if(LockState==UNLOCKED) LockState = LOCK_REQUESTED;
  UpdateGames();  /* Pushes packets into the OutPackets list. */
}



My thought is that when the pointers are swapped by the thread on one core, the core running the other thread still holds the old pointer values in its cache, so its cache line has to be invalidated and re-fetched before that thread sees the swap.

So, any thoughts on it are much appreciated.


##### Share on other sites
Well, for one, you could move the data closer together by using a vector instead of a linked list. You could move it even closer by using the vector to store the packets themselves, though managing that would be a bit tricky:
#include <vector>
using std::vector;

typedef unsigned char byte;
vector<byte> packet_store;

template<class T>
void write_memory( byte* at, const T& val ) {
    // assumes Packet<-T and T is POD
    *reinterpret_cast<T*>(at) = val;
}

template<class T>
void store_packet( vector<byte>& packet_store, const T& packet ) {
    // each record: its total size, followed by the packet bytes
    size_t total_size = sizeof(size_t) + sizeof(T);

    size_t idx = packet_store.size();
    packet_store.resize(packet_store.size() + total_size); // assumes Packet<-T and T is POD

    byte* at = &packet_store[0] + idx;
    write_memory(at, total_size);
    write_memory(at + sizeof(size_t), packet);
}

template<class F>
void foreach_packet( vector<byte>& packet_store, F f ) {
    byte* at = &packet_store[0];
    byte* end = at + packet_store.size();
    for ( size_t total_size = 0; at < end; at += total_size ) {
        total_size = *reinterpret_cast<size_t*>(at);
        Packet* p = reinterpret_cast<Packet*>(at + sizeof(size_t));

        f(p);
    }
}

Beware, sample code is not alignment-aware.

.. and like some pointers.

Inb4 XKCD or 0x1234ABCD huehuehue.

##### Share on other sites

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.
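For what it's worth, a minimal single-producer/single-consumer ring buffer of that kind might look like this (a sketch using C++11 atomics; the class and member names are mine, not from any library):

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer (a sketch).
// Exactly one thread calls Push, exactly one calls Pop.
// Size must be a power of two for the wraparound arithmetic to hold.
template<class T, size_t Size>
class SPSCQueue {
    T m_Items[Size];
    std::atomic<size_t> m_Head{0}; // written only by the consumer
    std::atomic<size_t> m_Tail{0}; // written only by the producer
public:
    bool Push(const T& v) {
        size_t tail = m_Tail.load(std::memory_order_relaxed);
        if (tail - m_Head.load(std::memory_order_acquire) == Size)
            return false; // full
        m_Items[tail % Size] = v;
        // Release store publishes the item before the consumer sees the new tail.
        m_Tail.store(tail + 1, std::memory_order_release);
        return true;
    }
    bool Pop(T& out) {
        size_t head = m_Head.load(std::memory_order_relaxed);
        if (head == m_Tail.load(std::memory_order_acquire))
            return false; // empty
        out = m_Items[head % Size];
        m_Head.store(head + 1, std::memory_order_release);
        return true;
    }
};
```

The acquire/release pairing on the indices is what makes the handoff safe without any locks.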

##### Share on other sites

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.

Mind pointing out why it won't work? It's actively deployed right now (the above code is pseudo-code I wrote up that demonstrates what I'm doing; I'll go over it to make sure I didn't miss something).

@fastcall32, storing each packet in a single buffer would definitely be more cache-friendly, but for now I'd like to focus on the concept itself, rather than on how the packets are stored.

##### Share on other sites

I guess I was just thrown off because I didn't read into the code enough.

##### Share on other sites

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.

##### Share on other sites

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.

Could you please expand on this? I've been reading over the code several times, and don't see where any race occurs; the swapping happens in discrete steps, so at no point do I see where overlapping access to the data can happen.

##### Share on other sites

There is shared state in this synchronization (LockState). If you want to access it from both threads, you have to synchronize those accesses. One way is to use hardware memory barriers before and after each access of LockState; this blocks the instruction and memory-access reordering done by the compiler and processor and synchronizes memory. The other way is to use an atomic read/write, which locks the memory bus for the duration of the access (this is what a CAS cycle does). In its current state, the code synchronizes the two threads only by luck.
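For illustration, here is one way the LockState handshake could get those semantics with C++11 atomics (a sketch, not the original code; the function names and the elided swap are placeholders):

```cpp
#include <atomic>

enum { UNLOCKED, LOCK_REQUESTED, LOCKED, UNLOCK_REQUESTED };

std::atomic<int> LockState{UNLOCKED};

// Network thread: observe the request, then publish the handoff.
void NetworkSide() {
    if (LockState.load(std::memory_order_acquire) == LOCK_REQUESTED) {
        // ... swap InPackets/TempList here ...
        // The release store guarantees the swap above is visible to the
        // simulation thread before it can observe LOCKED.
        LockState.store(LOCKED, std::memory_order_release);
    }
}

// Simulation thread: request the packet lists.
void SimulationSide() {
    if (LockState.load(std::memory_order_acquire) == UNLOCKED)
        LockState.store(LOCK_REQUESTED, std::memory_order_release);
}
```

With a plain (or merely volatile) int, neither the compiler nor the CPU is obliged to order the list swap before the state change; the atomic acquire/release pair is what provides that guarantee.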

##### Share on other sites
Firstly, you aren't using any atomic variables (or legitimate locks), so it will only work on very permissive architectures like x86, and even then it may not if you turn optimizations up high enough for the compiler to mangle your code. LockState needs to be atomic.

Secondly, you have the potential for false sharing with your four variables packed together like that. You need to ensure that each of them sits on its own cache line. The usual way to do this is to put each in a class with an otherwise unused final field that acts as a spacer. There are other ways too, but beware that, in general, the compiler is free to remove unused variables.
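The spacer idea can be sketched with C++11's alignas (a sketch; the wrapper name is made up, the variable names are from the thread, and the 64-byte line size is an assumption -- adjust for your target hardware):

```cpp
#include <atomic>

// Each wrapped variable is padded out to occupy a full 64-byte cache line
// of its own, so a write to one no longer invalidates the line holding
// the others (the cause of false sharing). alignas forces sizeof to be a
// multiple of 64, which replaces the hand-written spacer field.
template<class T>
struct alignas(64) CacheLinePadded {
    T value;
};

CacheLinePadded<std::atomic<int> > LockState; // its own cache line
CacheLinePadded<void*>             TempList;  // a different line entirely
```

On pre-C++11 compilers you would fall back to the spacer-field approach described above, e.g. a trailing `char pad[64 - sizeof(...)]`.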

Thirdly, use a vector (or other array structure) not a linked list for cache performance.

Fortunately, assuming the atomic issue is fixed, I don't see a data race.

##### Share on other sites

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

@King Mir, I was suspecting the swap might be susceptible to something like that, with the pointers not sitting on their own cache lines: when I do the swap and update the lock, the other thread will likely not have the swapped pointers sitting in its cache. Hmm, this is a potentially serious problem.

