
cache friendly lock free concept?


41 replies to this topic

#1 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 09:57 AM

I have suspicions that the following concept is not very cache-friendly, but I'm no expert on the subject and would like some pointers.

 

Essentially, I have two threads: one for networking and one for the game simulation. I want to process the packets in the game-simulation thread, rather than in the network thread, so that the simulation can be kept deterministic.

 

Here's the concept I came up with for swapping packets in and out between the two threads:

 
 
 
volatile int LockState = UNLOCKED;
linklist<Packet*> *InPackets  = new linklist<Packet*>;
linklist<Packet*> *OutPackets = new linklist<Packet*>;
linklist<Packet*> *TempList   = new linklist<Packet*>;

void NetworkThread(){
    RetrievePendingPackets(InPackets);
    if(LockState==LOCK_REQUESTED){
        // Hand the freshly received packets over to the game thread.
        linklist<Packet*> *Temp = TempList;
        TempList = InPackets;
        InPackets = Temp;
        LockState = LOCKED;
    }
    if(LockState==UNLOCK_REQUESTED){
        // Send out the packets the game thread queued.
        for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=nullptr; Lnk = Lnk->Next){
            SendPacket(Lnk->value);
        }
        TempList->Clear();
        LockState = UNLOCKED;
    }
}

void GameThread(){
    if(LockState==LOCKED){
        for(link<Packet*> *Lnk = TempList->GetFirst(); Lnk!=nullptr; Lnk = Lnk->Next){
            ProcessPacket(Lnk->value);
        }
        TempList->Clear();
        linklist<Packet*> *Temp = TempList;
        TempList = OutPackets;
        OutPackets = Temp;
        LockState = UNLOCK_REQUESTED;
    }
    if(LockState==UNLOCKED) LockState = LOCK_REQUESTED;
    UpdateGames();  // Will push packets into the OutPackets list.
}
 

My concern is that when the pointers are swapped by a thread on one core, the cache on the core running the other thread still holds the old pointer values, which then have to be updated before the other thread sees the swap.

 

So, any thoughts on it are much appreciated.


Edited by slicer4ever, 31 October 2013 - 02:20 PM.

Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.


#2 fastcall22   Crossbones+   -  Reputation: 3969


Posted 31 October 2013 - 10:14 AM

Well, for one, you could move the data closer together by using a vector instead of a linked list. You could move the data even closer by using the vector to store the packets themselves, though managing it would be a bit tricky:
typedef unsigned char byte;
vector<byte> packet_store;

template<class T>
void write_memory( byte* at, const T& val ) {
    // assumes Packet<-T and T is POD
    *reinterpret_cast<T*>(at) = val;
}

template<class T>
void store_packet( vector<byte>& packet_store, const T& packet ) {
    size_t total_size = sizeof(size_t) + sizeof(T);

    size_t idx = packet_store.size();
    packet_store.resize(packet_store.size() + total_size); // assumes Packet<-T and T is POD

    byte* at = &packet_store[0] + idx;
    write_memory(at,total_size);
    write_memory(at+sizeof(size_t),packet);
}

template<class F>
void foreach_packet( vector<byte>& packet_store, F f ) {
    byte* at = &packet_store[0];
    byte* end = at + packet_store.size();
    for ( size_t total_size = 0; at < end; at += total_size ) {
        total_size = *reinterpret_cast<size_t*>(at);
        Packet* p = reinterpret_cast<Packet*>(at+sizeof(size_t));

        f(p);
    }
}
Beware, sample code is not alignment-aware.
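A variant of the same idea, using memcpy to sidestep the alignment caveat; the trivial `Packet` here is a stand-in for illustration, not the poster's actual type:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef unsigned char byte;
struct Packet { int id; };          // trivial POD stand-in

// Append one packet, prefixed by its total record size.
template<class T>
void store_packet(std::vector<byte>& store, const T& packet) {
    size_t total_size = sizeof(size_t) + sizeof(T);
    size_t idx = store.size();
    store.resize(idx + total_size);
    std::memcpy(&store[idx], &total_size, sizeof(size_t));        // size header
    std::memcpy(&store[idx] + sizeof(size_t), &packet, sizeof(T)); // payload
}

// Walk the records, invoking f on each packet in place.
template<class F>
void foreach_packet(std::vector<byte>& store, F f) {
    byte* at  = store.empty() ? nullptr : &store[0];
    byte* end = at + store.size();
    for (size_t total_size = 0; at < end; at += total_size) {
        std::memcpy(&total_size, at, sizeof(size_t)); // read the size header safely
        f(reinterpret_cast<Packet*>(at + sizeof(size_t)));
    }
}
```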

.. and like some pointers.

Inb4 XKCD or 0x1234ABCD huehuehue.

WW91J3ZlIGdvdCBhIHNlY3JldCBib251cyBwb2ludCE=


#3 Madhed   Crossbones+   -  Reputation: 2493


Posted 31 October 2013 - 10:30 AM

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.



#4 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 11:06 AM

Are you sure that this design will work? I think lock free queues are usually implemented with ringbuffers.

Mind pointing out why it won't work? It's actively deployed right now (the above code is pseudo-code I wrote up that demonstrates what I'm doing; I'll go over it to make sure I didn't miss something).

 

@fastcall22, storing each packet in a single buffer would definitely be more cache-friendly, but for now I'd like to focus on the concept itself, rather than how the packets are stored.


Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#5 Madhed   Crossbones+   -  Reputation: 2493


Posted 31 October 2013 - 11:46 AM

I guess I was just thrown off because I didn't read into the code enough.



#6 ApochPiQ   Moderators   -  Reputation: 14291


Posted 31 October 2013 - 11:56 AM

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.



#7 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 12:08 PM

As written, this is a giant race condition waiting to happen... I'm not surprised it "appears" to work, but it's a matter of time before you get some really bad bugs.

Could you please expand on this? I've been reading over the code several times and don't see where any race occurs; the swapping happens in discrete steps, so at no point do I see where the data can overlap.


Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#8 zaphod2   Members   -  Reputation: 578


Posted 31 October 2013 - 01:21 PM

There is shared state in this synchronization (LockState). If you want to access it from both threads, you have to synchronize those accesses. One way is to use hardware memory barriers before and after each access of LockState; this blocks the instruction and memory-access reordering done by the compiler and processor, and synchronizes memory. The other way is to use an atomic read/write, which locks the memory bus for the duration of the memory access (this is what a CAS cycle does). In its current state, the code synchronizes the two threads only by luck.

 



#9 King Mir   Members   -  Reputation: 1910


Posted 31 October 2013 - 02:01 PM

Firstly, you aren't using any atomic variables (or legitimate locks), so this will only work on very permissive architectures like x86, and even then it may not if you turn optimizations up high enough for the compiler to mangle your code. LockState needs to be atomic.

Secondly, you have the potential for false sharing with your four variables packed together like that. You need to ensure that each of them is on its own cache line. The usual way to do this is to put each in a class with an otherwise-unused final field that acts as a spacer. There are other ways too, but beware that, in general, the compiler is free to remove unused variables.

Thirdly, use a vector (or another array structure), not a linked list, for cache performance.

Fortunately, assuming the atomic issue is fixed, I don't see a data race.
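A minimal sketch of the first two points, assuming C++11 and a 64-byte cache line; `alignas` is one way to get the per-line spacing instead of a manual spacer field (all names here are illustrative):

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache-line size; real code would query or configure this.
constexpr std::size_t CACHE_LINE = 64;

// The shared flag gets its own cache line and is a real atomic,
// so the compiler/CPU can't tear or reorder accesses to it.
struct alignas(CACHE_LINE) SharedFlag {
    std::atomic<int> lock_state;   // UNLOCKED / LOCK_REQUESTED / ...
};

// Each list pointer likewise sits on its own line, so writes to one
// don't invalidate the line holding another (no false sharing).
struct alignas(CACHE_LINE) ListSlot {
    void* list;                    // stands in for linklist<Packet*>*
};

// alignas pads the struct out to a full line.
static_assert(sizeof(SharedFlag) % CACHE_LINE == 0, "one full line");
static_assert(alignof(ListSlot) == CACHE_LINE, "line-aligned");
```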

#10 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 02:18 PM

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

 

@King Mir, I was thinking the swap might be susceptible to exactly that: when I do the swap and update the lock, the other thread will likely not have the swapped pointer values sitting in its cache yet. Hmm, this is a potentially serious problem.


Edited by slicer4ever, 31 October 2013 - 02:19 PM.

Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#11 DoctorGlow   Members   -  Reputation: 703


Posted 31 October 2013 - 02:22 PM

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

 

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

 

Quote:

Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]



#12 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 02:26 PM

 

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

 

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable

 

Quote:

Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

 

http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx

 


This allows volatile objects to be used for memory locks and releases in multithreaded applications.


Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#13 King Mir   Members   -  Reputation: 1910


Posted 31 October 2013 - 02:29 PM

I mark LockState as volatile, which ensures it is usable for memory locking in this example.
 
@King Mir, I was thinking the swap might be susceptible to exactly that: when I do the swap and update the lock, the other thread will likely not have the swapped pointer values sitting in its cache yet. Hmm, this is a potentially serious problem.

It needs to be atomic. You can use acquire-release semantics instead of sequential consistency, but volatile is insufficient.

Quite the opposite: you want the data on separate cache lines. All four variables need to be on their own cache line; otherwise you could get false sharing and cache thrashing. It's not a correctness problem, but it is a performance problem.

#14 King Mir   Members   -  Reputation: 1910


Posted 31 October 2013 - 02:35 PM

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable
 
Quote:
Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.90).aspx
 
 

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

Why are you using Microsoft-specific compiler extensions when a standard solution is readily available?

#15 Matias Goldberg   Crossbones+   -  Reputation: 3007


Posted 31 October 2013 - 04:11 PM

I mark LockState as volatile, which ensures it is usable for memory locking in this example.

This is a wrong assumption. Of the many links about this, I picked one: http://en.wikipedia.org/wiki/Volatile_variable
 
Quote:
Operations on volatile variables are not atomic, nor do they establish a proper happens-before relationship for threading. This is according to the relevant standards (C, C++, POSIX, WIN32),[2] and this is the matter of fact for the vast majority of current implementations. Thus, the usage of volatile keyword as a portable synchronization mechanism is discouraged by many C/C++ groups.[3][4][5]

http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.90).aspx

Haha, that's cute. No.

The Release and Acquire semantics of volatile mentioned in that MSDN article refer to the compiler's ordering of the operations in the generated assembly; they're compiler memory barriers. They do not guarantee that the proper hardware memory barriers are generated.
A quick look at the generated assembly makes this evident. The reason it's "Microsoft Specific" is that the standard doesn't guarantee it. For non-volatile variables, the optimizing compiler is normally allowed to move reads/writes to before or after the volatile read/write, which is still something important when dealing with concurrent code; but in all cases volatile does nothing in terms of HW memory barriers, nor does it guarantee atomicity.

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.
The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.
You need a hardware-locked CAS to do that, or a mutex.
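That check-then-set can be made atomic with a CAS; a minimal C++11 sketch (the function name is hypothetical):

```cpp
#include <atomic>

enum LockStates { UNLOCKED, LOCK_REQUESTED, LOCKED, UNLOCK_REQUESTED };

std::atomic<int> lock_state{UNLOCKED};

// Performs "if (lock_state == UNLOCKED) lock_state = LOCK_REQUESTED"
// as one hardware-locked operation, so no other thread can slip in
// between the test and the assignment.
bool try_request_lock() {
    int expected = UNLOCKED;
    return lock_state.compare_exchange_strong(expected, LOCK_REQUESTED);
}
```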

#16 L. Spiro   Crossbones+   -  Reputation: 12257


Posted 31 October 2013 - 04:14 PM

This allows volatile objects to be used for memory locks and releases in multithreaded applications.

It still needs to be atomically accessed via the Interlocked* family of functions (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684122(v=vs.85).aspx).
Being volatile means the compiler can't perform any optimizations on it, on the assumption that it will be accessed across multiple threads.
Being volatile is necessary for multi-threading, but it is only one part.  Notice how those functions require volatile parameters.

Most likely the only reason you have never noticed a race condition is that incoming packets are fairly rare compared to game loops, and when they do arrive they would have to do so at exactly the right time, with both threads accessing that variable at once.
And because of how rare incoming packets are, your concerns about cache-friendliness are misplaced.

As long as you fix the race condition you shouldn’t worry too much more about this area.
 
 
L. Spiro

Edited by L. Spiro, 31 October 2013 - 04:20 PM.

It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#17 slicer4ever   Crossbones+   -  Reputation: 3205


Posted 31 October 2013 - 04:52 PM

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers; your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.

The second thread can enter the 'if' block because the first one didn't set LockState to LOCKED yet.
You need a hardware-locked CAS to do that, or a mutex.

 

 

Elaborate on this, please: what "if" block is being entered in the second thread if LockState has not changed its state yet?


Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#18 NightCreature83   Crossbones+   -  Reputation: 2672


Posted 31 October 2013 - 05:19 PM

LockState can change due to the other thread. You are currently using four different states, two in each thread, and the ifs are correct. However, it is really easy to miss this two months down the line when you are debugging and you add a LOCKED check or an UNLOCK_REQUESTED check in the other thread; that will cause a race condition.

 

By acquiring an actual HW barrier, and releasing it when you are done with the section that has to complete before the threads can run in parallel again (e.g. a data copy from one thread to another), you guarantee that they will not create a flip situation like the one described above.

 

I used this template as my data copy between threads when I wrote a physics simulation, which ran on a separate thread from the input and rendering: http://pastebin.com/Uxgh6qC8. The class in this example protects access to its internal variable with critical sections, which on Windows are slightly easier to use than a mutex or a semaphore. Generally, when doing memory operations from one thread to the other, you want to make sure that you either return a copy of the protected data, like in the example, or write directly to a class variable that is protected on each access, to avoid stomping on data. The reason for returning a copy is that if you release the CS and then return the m_data var, another thread could already have acquired the CS and be midway through writing to the variable you are now returning, so it would be garbage; the copy avoids this altogether.
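The copy-return idea from the linked template can be sketched with std::mutex instead of Win32 critical sections (class and member names here are hypothetical, not the pastebin's):

```cpp
#include <mutex>

// Every access goes through the lock, and Get() returns a COPY made
// while still holding the lock, so the caller never holds a reference
// that another thread might be overwriting.
template<typename T>
class ThreadCopy {
public:
    void Set(const T& data) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_data = data;
    }
    T Get() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_data;   // copied under the lock, then returned
    }
private:
    mutable std::mutex m_mutex;
    T m_data{};
};
```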

 

As stated above, volatile doesn't guarantee a lock on the memory; all it guarantees is that the variable will be fetched from, or written to, memory, so you know the CPU doesn't have a register copy of the variable with a different value. It is sometimes useful to mark non-critical signalling from one thread to another in this way, but even then you can end up with a race condition.


Edited by NightCreature83, 31 October 2013 - 05:32 PM.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, Mad Max

#19 King Mir   Members   -  Reputation: 1910


Posted 31 October 2013 - 06:43 PM

his allows volatile objects to be used for memory locks and releases in multithreaded applications.

It still needs to be atomically accessed via the Interlocked* family of functions (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684122(v=vs.85).aspx).
Being volatile means the compiler can't perform any optimizations on it, on the assumption that it will be accessed across multiple threads.
Being volatile is necessary for multi-threading, but it is only one part.  Notice how those functions require volatile parameters.

But as it says at the top of the link you posted, "Simple reads and writes to properly-aligned 32-bit variables are atomic operations." And in the link slicer4ever posted, Microsoft extends the definition of volatile to give it acquire and release semantics. These two things together should be sufficient for multithreading, especially with only two threads involved.

Of course, I don't recommend doing this. Just use C++11 std::atomic<int>.
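A minimal sketch of that handshake with std::atomic, assuming C++11 (names hypothetical): the release store publishes the pointer swap that preceded it, and the acquire load makes that swap visible to the reader.

```cpp
#include <atomic>

enum LockStates { UNLOCKED, LOCK_REQUESTED, LOCKED, UNLOCK_REQUESTED };

std::atomic<int> state{UNLOCKED};
int* shared_list = nullptr;        // stands in for the linklist pointers

void publish(int* fresh_list) {    // network-thread side
    shared_list = fresh_list;                        // plain write...
    state.store(LOCKED, std::memory_order_release);  // ...published here
}

int* consume() {                   // game-thread side
    if (state.load(std::memory_order_acquire) == LOCKED)
        return shared_list;        // guaranteed to see the write above
    return nullptr;
}
```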


-----

I also agree with NightCreature83 that the code will only work with the two threads. However, EnterCriticalSection is blocking, so that would make this no longer lock-free. But you might actually want to do that anyway, at least on the network side.

Edited by King Mir, 31 October 2013 - 06:55 PM.


#20 L. Spiro   Crossbones+   -  Reputation: 12257


Posted 31 October 2013 - 07:33 PM

But as it says on the top of the link you posted "Simple reads and writes to properly-aligned 32-bit variables are atomic operations." And in the link slicer4ever linked to, Microsoft extends the definition of volatile to give it acquire and release semantics. These two things together should be sufficient for multithreading, especially with only two threads involved.

See http://www.gamedev.net/topic/649586-cache-friendly-lock-free-concept/#entry5106076

The link in my post refers to all properly-aligned 32-bit variables on x86 machines, not just volatile ones, and it refers only to the actual instruction that performs the read or write. As in, these cases are not possible:

  • Thread 1 performs “MOV [0xBADF00D], EAX” while thread 2 performs “MOV [0xBADF00D], EDX”—it is not possible for address 0xBADF00D to end up with part of the value from thread 1’s EAX and part of the value from thread 2’s EDX; it will have one or the other.

  • Thread 1 performs “MOV [0xBADF00D], EAX” while thread 2 performs “MOV EDX, [0xBADF00D]”—the read and write will not happen concurrently in such a way that thread 2 could load part of the previous value at 0xBADF00D and part of the EAX value being written to it at the same time.


The article I posted refers to this and only this type of atomic operation, and is unrelated to volatile integers.
With care you can make his code work without the Interlocked* functions via proper fencing. That in itself is a bit expensive, but could be faster than Interlocked*; the trade-off is that it only works in the most controlled of cases. Introducing yet another thread into the mix may break it, as may an unforeseen case in which packets are received unexpectedly quickly at just the wrong moment (not saying I spot that case here, but we had a similar one at work).

 

 

Also, the guy sitting next to me just wrote this thorough article on threading, which covers basically everything you could ever want to know.

Multithreading

 

 

L. Spiro


Edited by L. Spiro, 31 October 2013 - 07:36 PM.

It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums



