cache friendly lock free concept?

Started by
40 comments, last by King Mir 10 years, 5 months ago
Yes. King Mir's explanation on false cache sharing is spot on. If you modify a single byte inside a cache line, it "dirties" the whole line; and modifying the same region of memory from all cores at very high frequencies adds a tremendous overhead.

It may be worth saying that most x86 archs have a cache line size of 64 bytes; though this isn't guaranteed (i.e. it could be 128 bytes on some systems, 32 bytes on others, etc); while in ARM it is more typical to find cache sizes of either 32 or 64 bytes. What I mean, padding of just 4 or 8 bytes won't be enough.
Advertisement

I only read about 1/2 of this thread.

Your design is logically good, but technically mistaken. Lock-free structures generally depend on compare/exchange instructions to keep threads from reading the wrong value or overwriting someone else's. "Lock-free" is a popular term these days, and while it is an awesome choice in the right circumstances, it is not always the best. For e.g.:

* Lock-free design works best when the period of contention between threads is very small, such as when swapping a pointer value or updating a counter. This is because there is always a brief spin whenever the protected value is found to in contention.

* Lock-based design works best when the period of contention will be longer, such as to execute some IO or do some processing, as the waiting thread is taken out of the scheduler and no core "spins" during this time.

Nonetheless, your use of Volatile will not do what you want. It is intended to force the compiler to do NO optimizations on that variable by declaring that the value might be updated at any time--such as by a separate piece of hardware--but the compiler does not place a memory fence around accesses to the variable. So, yes, your caches may be inconsistent.

In general, designing highly concurrent logic is hard. It's fun to do, and if that is your goal, please proceed! I hold a patent in this area (owned by my previous employer) because I love it. But most often, if your goal is just efficiency, classic producer/consumer logic (with or without lockinig) will meet any possible need you have related to networking--which lets face it, is incredibly slow compared to our CPUs these days :-)

Everyone's already covered what's right and wrong (don't use volatile! It may work on MSVC due to extensions, but it's actually wrong) with the code, but I just have to point out that this code is not 'lock free'.

Lock free doesn't mean "doesn't use the existing lock structures"; it's a very specific term that basically means that if any of the threads happens to stall at any point in time, this will not impact the progress of any other thread.
The code in this thread is the opposite of that -- it hands control of the state variable back and forth between two threads exclusively. If you freeze the network thread while the state is not "unlocked", then the game thread also freezes. [edit: no it doesn't!]

The algorithm relies on the fact that only one thread "owns" the state variable at a time -- as mentioned by others, if his wasnt the case, you'd need some CAS/interlocked logic in there somewhere.

P.s. not being"lock free" isn't a bad thing. I'm just being a definition-nazi ;)

Looks lock free to me. Freezing the network thread like you say would stop updating TempList, but the game thread would still continue to call UpdateGames().

I do think that blocking on LOCK_REQUESTED would be prudent though, especially if RetrievePendingPackets() is not lock free.

King Mir is correct on that point.
If the network thread freezes the game thread will still call UpdateGames() and loop again; this part is not related to the state of LockState (we are of course assuming GameThread() is in some kind of loop or else none of this would be a discussion since it would be called once and that would be the end of it).

I’ve stopped replying in the interest of not debating with someone who simply refuses to understand what I am trying to explain, and I hoped that if others tried to explain the same thing he may come around. But since I am here-

So I should point out that the volatile extension link show and example of a volatile variable being used to implement mutual exclusion, without any threading primitives.

That example is unrelated to his—it shows the variable being controlled by 1 thread while being accessed by another. The 2nd thread never changes the value of “Sentinel”, it simply reads from it. This example would work on any platform and any compiler, as it conforms entirely to the standard. It has nothing to do with extensions or Microsoft®. Also, please do not use misleading terms such as “mutual exclusion”—that is a far cry from what is happening in that sample code. All it’s saying is that the 2nd thread will not load and keep in a register (as an optimization) the value of Sentinel in the while loop and thus will properly terminate whenever the 1st thread changes the value.

Let me put it another way.
An “if” statement requires a load, a compare, and a jump, in assembly. Without a fence, a value could be loaded to a register, changed on another thread, but then have the CMP done on the previous value (loaded to a register) and perform its JMP based on that. In fact that could happen even with a fence, especially if poorly placed.

If you assume Microsoft® extensions are aiming to solve this problem (which they are not), where exactly do you propose the compiler knows to place the fence. On both threads.


It doesn’t. Using fences correctly requires a human author, and even then it fails frequently unless the author is really really good at what he or she does. There is no way the compiler could ever know where to place the fences on both threads, and there is no way the compiler would even try—it takes 100,000 compiler accuracies to make up for a single compiler bug.


If you need proof that Matias Goldberg’s first post is accurate, consider this line from that link:

A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

As he correctly stated, the compiled binary is the only thing related to the Microsoft® specifics.
My job is to optimize code on all platforms, including consoles, and I spend most of my day staring at compiled assembly. Not only that but I wrote MHS, including its assembler and disassembler, and have spent a lot of time looking at code from other games and applications including my own.

The compiler is absolutely 100% free to change the orders of reads and writes as long as the final result is the same, and I can promise you this happens on every single compiler I have encountered, for every platform. In fact Nintendo 3DS’s compiler seems to go out of its way to do this, I assume for cache coherency.

This is the relationship between Microsoft®’s extensions and volatile. It guarantees that in the compiled binary the code will be executed in the order specified by your code.

It does absolutely nothing to prevent the above situation between MOV, CMP, and JMP. It adds no fences at all. If you do not believe me, make your own program and check its disassembly.

Get back to us when you have done that.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

I thought that UpdateGames() was inside if(LockState==UNLOCKED) -- this would mean that the main loop would stop updating the game if the network thread paused while the state was anything except UNLOCKED.

But I misread, so RetrievePendingPackets() and UpdateGames() with both continually run, regardless of the activity of the other thread. Freezing one thread won't impede the other.

So, yeah, as long as you say that the data being communicated is "a list of packets", then the communication between these two threads is lock-free.

If you wanted to be ridiculously picky, you could argue that the threads are actually communicating "individual packets", in which case there is locking, as you might have packets buffered up in InPackets/OutPackets that are stuck there due to the LockState variable being held by a thread that's gone to sleep (e.g. if the main thread is pre-empted right before LockState=UNLOCK_REQUESTED, then even though some data has been "shared" at this point, the network thread can't read it until the game thread wakes up) laugh.png wink.png That's something for academia to argue about though tongue.png

@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

EDIT:Strike-through due to point made by Matias Goldberg bellow.

The compiler is absolutely 100% free to change the orders of reads and writes as long as the final result is the same, and I can promise you this happens on every single compiler I have encountered, for every platform.

This is the relationship between Microsoft®’s extensions and volatile. It guarantees that in the compiled binary the code will be executed in the order specified by your code. [<-- and that's enough]

It does absolutely nothing to prevent the above situation between MOV, CMP, and JMP. It adds no fences at all. If you do not believe me, make your own program and check its disassembly.

x86 CPU's don't reorder memory ops though, and they don't have fence instructions.

[edit]... Actually "Loads may be reordered with older stores to different locations" and there are rarely used LFENCE/SFENCE/MFENCE instructions ...[/edit]

Any instruction with the lock prefix acts like a fence, but that prefix is only required when you need to perform an atomic operation on memory that could potentially be modified by two threads simultaneously.

Slicer's code works on mutual-exclusion. Assuming that there's no reordering, then there's no variables that are going to be written to by two threads at once -- every variable always has only 1 writer. So, it will work on x86 with just compile-time fences.

It won't be portable to other CPU's unless you add fences, so he really should be using a std::atomic still biggrin.png

Yeah, if his code was more complex and the state variable could have two potential writers, he would need to use CAS instead of an if and assignment.

The standard C/C++ volatile is not a proper compile-time fence, it only acts as a fence with regards to other volatiles -- reads/writes to volatiles aren't reordered past each other, but reads/writes to non-volatiles can be reordered past volatile reads/writes. The MSVC extension goes further than this turns turns volatile reads/writes into full compile-time fences.

A standard C++ compiler would be able to remove the =1 and =2 lines from this code, but MSVC keeps them, because there's a volatile write between them all.


	volatile int g_state_v = 0;
	extern int g_state;
...
	g_state = 1;
	g_state_v++;
	g_state = 2;
	g_state_v++;
	g_state = 3;
	printf( "%d %d", g_state, g_state_v );

@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

That's an overly simplification.
Section 8.2.2 from the Intel Intel 64 and IA-32 Architectures Software Developer’s Manual is very clear that:
"Reads may be reordered with older writes to different locations but not with older writes to the same location."

Here's even an example of how such reordering can affect the logic of an application even when just running 2 threads. Whether this issue affects the OP's code, can only be said by inspecting the assembly.

@L. Spiro
That applies to many architectures, but it doesn't apply to x86. In x86 any MOV from memory by itself has acquire semantics. Any MOV into memory has release semantics. You don't need a fence. You only need a fence to implement sequential consistency, usually placed after each store, but with two threads, that's not needed, because the difference between acquire-release semantics and sequential consistency is only apparent with more threads.

So you're correct from the standpoint of writing portable C++, but on x86 the code will work as is. Because of the above, x86 will allow code that is technically undefined C++ to work as one would expect. (Let me reiterate that: the code as written is undefined behavior)

That's an overly simplification.
Section 8.2.2 from the Intel Intel 64 and IA-32 Architectures Software Developer’s Manual is very clear that:
"Reads may be reordered with older writes to different locations but not with older writes to the same location."
Here's even an example of how such reordering can affect the logic of an application even when just running 2 threads. Whether this issue affects the OP's code, can only be said by inspecting the assembly.

No, the code you linked to can be analyzed in a way that doesn't require looking at the disassebly to spot the problem, and so can the OP. In the op it's trivial to see that each access to the lock is an acquire-read on one variable followed by some critical section code, followed by a release-write on the same variable. That pattern is tried and true.

In the link you've got:


zeroWants = true; //release on zeroWants 
victim = 0; //release on victim
while (oneWants && victim == 0) // aquire on  oneWants and victim
continue;
// critical code

It's a release followed by acquire, which is nothing. The acquire by itself would be usable to create a lock, but clearly here this lock does not work if you remove the first two lines.

But conceded, that does show that acquire-release semantics can differ from sequential consistency even with two threads. Thank you for the example.

This topic is closed to new replies.

Advertisement