cache friendly lock free concept?

40 comments, last by King Mir 10 years, 5 months ago
So I should point out that the volatile extension link shows an example of a volatile variable being used to implement mutual exclusion, without any threading primitives.

You don't need any fencing to implement acquire-release semantics on x86. You generally do for sequential consistency, but here there are only two threads, so acquire-release is equivalent to sequential consistency.

So the OP's code and Microsoft's example should work on x86. They shouldn't work on ARM, and Microsoft compiles for ARM, so maybe they should clarify that. (Unless they actually do add a memory fence on ARM.)

In any case, nobody should do this.
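
For the record, the portable way to get the same acquire-release behavior is C++11's std::atomic rather than volatile. A minimal sketch, with placeholder names:

#include <atomic>

std::atomic<int> flag{0}; // 0 = unlocked, 1 = locked (placeholder names)
int sharedData = 0;

void writer()
{
    sharedData = 42;                          // write the shared data first,
    flag.store(1, std::memory_order_release); // then publish it
}

void reader()
{
    while (flag.load(std::memory_order_acquire) != 1)
        ; // spin until the writer publishes
    // sharedData is now guaranteed to be visible as 42.
    // On x86 these compile to plain loads/stores; on ARM the compiler
    // emits the barriers that the volatile version silently omits.
}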

----

Also I disagree that "if( LockState == UNLOCKED ){ LockState = LOCKED }" needs to be atomic. It's the same logic as all the other branches, except that the body is empty.
Thanks for the input so far. @L. Spiro, I'm reading over the articles you linked to. However, I still don't understand the logic that Matias Goldberg put forth:

The code you posted is simply a race condition waiting to happen. Even if volatile actually generated hardware memory barriers, your code is still broken because "if( LockState == UNLOCKED ){ LockState = LOCKED }" is still not atomic.
The second thread can enter the 'if' block because the first one hasn't set LockState to LOCKED yet.


I'd still like an explanation of what you mean by "The second thread can enter the 'if' block". When the second thread relies on LockState being set to LOCKED, how can it somehow enter that if branch? Could you please elaborate on this?
Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

You can get task-switched to another thread after any instruction (an if and an assignment can span several instructions), so you need to do the check and the set of the new value in a single atomic instruction.

Like this

InterlockedCompareExchange function

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683560%28v=vs.85%29.aspx
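
Roughly, the check-and-set collapses into one atomic operation like this — a sketch, not the OP's actual code (doSomething and the constants are placeholders):

#include <windows.h>

enum { UNLOCKED = 0, LOCKED = 1 };
volatile LONG LockState = UNLOCKED;

void doSomething(); // placeholder for the protected work

void tryEnter()
{
    // Atomically: if (LockState == UNLOCKED) LockState = LOCKED;
    // The return value is what LockState held *before* the call, so
    // getting UNLOCKED back means this thread won the race.
    if (InterlockedCompareExchange(&LockState, LOCKED, UNLOCKED) == UNLOCKED)
    {
        doSomething();
        InterlockedExchange(&LockState, UNLOCKED); // release the lock
    }
}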

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley
Using compare exchange there would only make the program slower. There's no need for the operation to be atomic here.

I'd still like an explanation of what you mean by "The second thread can enter the 'if' block". When the second thread relies on LockState being set to LOCKED, how can it somehow enter that if branch? Could you please elaborate on this?


When you have the following code running on two threads at the same time:
if( LockState == UNLOCKED )
{
    LockState = LOCKED;
    doSomething();
}
The following can potentially happen:
1a) thread 1 reads LockState. Its value is UNLOCKED. Enters the if block.
1b) thread 2 reads LockState. Its value is UNLOCKED. Enters the if block.
2a) thread 1 sets LockState to LOCKED
2b) thread 2 sets LockState to LOCKED
3a) thread 1 calls doSomething
3b) thread 2 calls doSomething

There you go, two threads entering the if block. LockState is LOCKED.

Another case:
1a) thread 1 reads LockState. Its value is UNLOCKED. Enters the if block.
1b) thread 2 reads LockState. Its value is UNLOCKED. Doesn't enter the if block yet.
2a) thread 1 sets LockState to LOCKED
2b) thread 2 gets rescheduled by the OS
3a) thread 1 calls doSomething
4b) thread 2 resumes execution.
5b) thread 2 compares LockState, which was read in step 1b and is now in a register. The register copy is still UNLOCKED even though the value in RAM is now LOCKED. As a result, it enters the if block.
6a) thread 1 is still inside doSomething
6b) thread 2 calls doSomething

Again, both threads entered the if block at the same time.
I'm surprised I have to explain this because it's multithreading 101. It's in every tutorial and book out there.
Matias, doesn't that example apply to multiple threads running the same code? Slicer4ever's code has two threads running separate code. Hold on, I'm not an expert in multithreaded programming, but...
The network thread acts on LOCK_REQUESTED/UNLOCK_REQUESTED and sets LOCKED/UNLOCKED, whereas the game thread acts on LOCKED/UNLOCKED and sets UNLOCK_REQUESTED/LOCK_REQUESTED. A thread cannot reenter its "critical section" until the other thread changes the LockState. Assuming the write to LockState is committed at the end of each thread's "critical section", isn't the worst-case scenario that a thread may need to go around its loop once more? Or am I misunderstanding something here?
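
As I read it, the handshake is roughly this — my reconstruction from the description above, not the OP's actual code, with made-up function names:

enum LockStates { UNLOCKED, LOCK_REQUESTED, LOCKED, UNLOCK_REQUESTED };
volatile int LockState = UNLOCKED;

// Network thread: grants the lock and completes the unlock.
void networkTick()
{
    if( LockState == LOCK_REQUESTED )   LockState = LOCKED;
    if( LockState == UNLOCK_REQUESTED ) LockState = UNLOCKED;
}

// Game thread: requests the lock, does its work, requests the unlock.
void gameTick()
{
    if( LockState == UNLOCKED ) LockState = LOCK_REQUESTED;
    if( LockState == LOCKED )
    {
        // ... swap the packet lists here ...
        LockState = UNLOCK_REQUESTED;
    }
}

Each state value is only ever overwritten by one of the two threads, which is why the check-then-set doesn't need to be atomic here.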
fastcall22, oohh! I read the code wrong. I apologize.
You're right. The access to LockState appears not to contain race conditions (I can't say anything about the rest of the variables because I didn't look at those carefully) as long as LockState is only accessed from two threads.

Nonetheless, he should add hardware fences/memory barriers; otherwise the out-of-order execution unit of an x86 CPU can cause LockState to be set to a different value sooner than expected.
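
A minimal sketch of what I mean, assuming C++11 fences are available (on MSVC, MemoryBarrier() plays the same role). This keeps the volatile flag and only adds ordering, which is still not strictly portable C++ — std::atomic would be the cleaner fix. The function names are mine:

#include <atomic>

enum { UNLOCKED, LOCK_REQUESTED, LOCKED, UNLOCK_REQUESTED };
volatile int LockState = UNLOCKED;

void grantLock() // network thread side
{
    if( LockState == LOCK_REQUESTED )
    {
        // Release fence: all earlier writes to the packet lists become
        // visible before the other thread can observe LOCKED.
        std::atomic_thread_fence( std::memory_order_release );
        LockState = LOCKED;
    }
}

void enterCriticalSection() // game thread side
{
    if( LockState == LOCKED )
    {
        // Acquire fence: reads below cannot be reordered before the load
        // of LockState, so we see the lists the other thread published.
        std::atomic_thread_fence( std::memory_order_acquire );
        // ... swap the packet lists here ...
    }
}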

Furthermore, and as already pointed out, InPackets, OutPackets, and TempList are likely to be in the same cache line, which means a lot of false sharing.

Another problem with this routine is that packets can apparently be delayed by several iterations while LockState works through its 4-state cycle, which can add noticeable lag.

So... now that we are finally past this =-\, can we get back to the root of the original question:

Furthermore, and as already pointed out, InPackets, OutPackets, and TempList are likely to be in the same cache line, which means a lot of false sharing.


This is mostly what I care about. Basically, I'm exchanging the pointers. Let's say the threads are running on two separate cores. If I understand what's going on at that level (which I probably don't, hence these questions): ThreadA swaps the pointers; then, in order for ThreadB to have these new pointer locations in its cache, will ThreadA send an update message to ThreadB? (Doubtful, since in theory ThreadB doesn't know ThreadA also has these pointers.) Or, to get the updated values, does ThreadB have to read them back from RAM (meaning ThreadA would write to RAM), or potentially from a higher-level cache? Can someone explain what's going on at this level, and whether I should be doing something else here to ensure I don't have more false sharing?
Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.


No, that's not what's going on.

A thread on Core A writes one of the 4 variables, which puts the line in Core A's L1 cache. Then a thread on Core B reads one of the variables; noticing that the cache line was modified by Core A, it must get it from Core A's write buffer, which, depending on the details of the implementation, involves going through the L3 cache or something with comparable latency. Core B might then write to that cache line itself, so that Core A's L1 copy is invalidated and Core A must read it back from Core B's write buffer. Then Core A might write to the line, which invalidates Core B's L1 and L2 copies of that line. And back and forth.

That kind of thing is unavoidable on LockState, and to a lesser extent on TempList, but if all 4 variables are on the same cache line, they are all shared as if they were one variable, incurring the performance penalties of sharing on all of them whenever any one of them is accessed. It's like they are one variable to the cache. So the solution is to ensure that each of the four variables is on its own cache line.

One way to do this is, instead of using bare pointers and a primitive integer, to use a class the size of a cache line. You can even make it your linked list, if you make swapping the list cheap and put enough padding at the end of it to fill a cache line.
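
Something along these lines — a sketch assuming C++11's alignas and a 64-byte line, which is what current x86 CPUs use (older MSVC would spell it __declspec(align(64))):

struct Packet; // stand-in for the OP's packet type

// alignas(64) puts each object at the start of its own cache line and
// rounds sizeof up to 64, so neighbouring objects can never share a line.
struct alignas(64) PaddedList
{
    Packet* head;
};

struct alignas(64) PaddedFlag
{
    volatile int state;
};

PaddedList InPackets, OutPackets, TempList; // one cache line each
PaddedFlag LockState;                       // and one for the flag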

Thank you for such a detailed explanation of everything that's going on! =-)
Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

This topic is closed to new replies.
