Visual Studio code optimization is breaking my code.

Checking and setting the flag can't be done in one operation, so what happens if Thread 1 is between checking and setting the flag, and Thread 2 checks the flag? Then both will enter the loop, too.
In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).
OK, serves me right for trying to debug code late at night :-)


Let's look at the starvation example. You seem to think that I'm suggesting that there is magic going on; this is not true. I do not mean to imply that variables are just mysteriously getting changed for no apparent reason. I'm stepping through the execution of the code, one line at a time, and describing the effects of each step.

You say that this lock is not used for any time-critical operations, so why are you worried about it? Why not use a proven-safe locking mechanism like an OS critical section/mutex? You're not going to do any better performance-wise with spinlocks unless you have a far richer understanding of the intricacies of threading behavior and nondeterminism. Spinlocks of indefinite duration are generally a bad idea, because if you're not in time-critical operation, you are by definition running your operations for potentially "long" periods of time. This means your busy wait will waste CPU core time when you could just have the thread put to sleep by the OS. More importantly, you can have the thread woken up precisely in the right circumstances by using a proven lock instead of this mechanism.
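For illustration, here is a minimal sketch of the "put the thread to sleep and wake it precisely when needed" idea using a Win32 auto-reset event; the handle and function names are invented for the example, not code from this thread.

[code]
#include <windows.h>

HANDLE g_workReady;   // auto-reset event, created once at startup

void InitEvent()
{
    g_workReady = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset, initially unsignalled
}

void WaitingThread()
{
    for (;;)
    {
        WaitForSingleObject(g_workReady, INFINITE);   // sleeps in the OS, burns no CPU while waiting
        // ... do the work the busy-waiting thread would have done ...
    }
}

void SignallingThread()
{
    // ... do stuff ...
    SetEvent(g_workReady);   // wakes the waiting thread exactly when there is something to do
}
[/code]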


I'm a bit busy at the moment so I'm not really able to study this in more detail; upon closer inspection of the loop invariants I don't think there's a trivial case where you can double-enter the lock, but that may not be guaranteed.

What's far more important is that this is effectively a really bad way of single-threading your program.

Run through the starvation condition and its inverse (where function B gets starved instead of A). If you step through the logic carefully, you'll notice that you actually force each function to wait until after the other thread has run at least once. You're basically halving your computational throughput here. You're not simply enforcing that the threads do not run simultaneously; you're forcing them to take turns alternating back and forth. And if that's the behavior you actually want, threading is not the right solution. You should just write everything in a single thread.

If that's not what you want, I strongly suggest you use a proven lock mechanism instead of trying to roll your own.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Sorry, with or without volatile, your code is not thread safe.
Volatile stops the compiler from reordering accesses to that variable, but you failed to put in a hardware memory barrier (which prevents the CPU itself from reordering your execution with its out-of-order execution machinery).
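To make that concrete, a minimal sketch of what the publishing side could look like, assuming MSVC/Win32; the Flag name just mirrors the pseudocode in this thread, and MemoryBarrier()/InterlockedOr() are standard Win32 calls, not anything from the OP's code.

[code]
#include <windows.h>

volatile LONG Flag = 0;

void SignalOtherThread()
{
    // ... the "Do Stuff" writes happen here ...

    MemoryBarrier();            // full hardware fence: the writes above become visible first
    InterlockedOr(&Flag, 1);    // atomic read-modify-write of the flag (itself also a full barrier)
}
[/code]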

Once that's fixed, you'll have to consider the following:

  1. If by "safe" you mean that, after FunctionB has just finished, the next iteration of FunctionA will "do stuff" and then tell FunctionB to run again, then you're wrong. FunctionA may iterate a million times before FunctionB sets Flag back to 0 ('A' would loop doing nothing), or sets Flag |= 1 ('A' would loop "doing stuff" but without telling B to go ahead). This is the starvation problem. If you don't mind that B may iterate several times until it sees it can actually unlock (perhaps because you're processing discardable data and you only care about the "latest" one, just like UDP video streaming in networking), then "yeah, it's sort of safe", but your threads can starve a lot. You need to yield them. On Windows you can use SwitchToThread() (OS level, requires entering kernel mode) or the pause instruction (HW level, not guaranteed to actually yield); see the sketch after this list.
  2. Your code has a lot of false cache sharing. The pause instruction is supposed to help with that, but you're reading and writing the same memory location too much.
  3. Your code won't work with more than 2 threads.
  4. What you're doing is called a busy spin-lock. You should beware of live locks. Live locks are harder to spot than deadlocks, because the debugger will tell you when your process looks dead, but during livelocks, everything seems to be working fine. Except it's not.
  5. Busy spin-locks with more threads than CPU cores (oversubscription) are a very bad idea. HyperThreading cores don't count as real cores for this case.
  6. Since thread A is waiting for B, and B then will wait for A, you negate any performance benefit from multi-threading while getting all the cons described above. You'll be better off with a kernel-based lock in this case. An implementation like yours may be useful in a case where you know there will be a very tiny lock every once in a while. I can hardly recommend using your implementation, because it becomes too error-prone and hard to maintain once it gets bigger, but if you're trying to learn how multithreading works and interacts, then it's fine :)
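As a concrete illustration of the yielding options from point 1, here is a sketch only, assuming Win32/MSVC; the function, the spin threshold and the flag layout are invented for the example.

[code]
#include <windows.h>

// Spin-wait that backs off: first the 'pause' hint, then a real OS yield.
void SpinUntilFlagSet(volatile LONG* flag)
{
    unsigned spins = 0;
    while ((*flag & 1) == 0)
    {
        if (++spins < 4000)
            YieldProcessor();    // expands to the pause instruction: HW-level hint only
        else
            SwitchToThread();    // OS-level yield: enters the kernel and gives up the time slice
    }
}
[/code]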

Checking and setting the flag can't be done in one operation, so what happens if Thread 1 is between checking and setting the flag, and Thread 2 checks the flag? Then both will enter the loop, too.

...how exactly do both enter the loop? I don't see how your line of thinking lines up with the code: functionB sets bit 1 of the flag, then functionA sets bit 2 of the flag; since functionA sets bit 2 at the end of the loop, it can not return until functionB unsets bit 2 of the flag. As such, I see no possible way that, after functionB locks itself out of the data, it can then somehow re-enter the data...


In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).

What exact predefined code causes such boundaries to be created, such that I can't just call that code myself?


OK, serves me right for trying to debug code late at night :-)

No worries mate, it was ~1am here when I posted this =-).


Let's look at the starvation example. You seem to think that I'm suggesting that there is magic going on;

Let's stop right here for a second. I have never once used the word magic to describe anything that is going on, so please, can we stay away from assuming that, just because I don't realize some of the things the OS is doing, I believe it's all magically happening?


this is not true. I do not mean to imply that variables are just mysteriously getting changed for no apparent reason. I'm stepping through the execution of the code, one line at a time, and describing the effects of each step.

Yes, I appreciate the stepping through; it allows me to see exactly what you are talking about. I understand perfectly what you are talking about now: if the pre-empted thread can temporarily store the variable "Flag", then upon resuming the bitwise operation it uses that stored state. As such, even though functionA would have changed the value of Flag, for that instance Flag is still locked until Thread 1 resumes (which in the case presented above it most likely would not, but in my actual working program, it does).


You say that this lock is not used for any time-critical operations, so why are you worried about it? Why not use a proven-safe locking mechanism like an OS critical section/mutex? You're not going to do any better performance-wise with spinlocks unless you have a far richer understanding of the intricacies of threading behavior and nondeterminism.

This is a classic catch-22: can't get experience without work, can't get work without experience. Exactly when am I supposed to learn and understand such mechanisms, if I'm told to simply use the existing ones?


Spinlocks of indefinite duration are generally a bad idea, because if you're not in time-critical operation, you are by definition running your operations for potentially "long" periods of time. This means your busy wait will waste CPU core time when you could just have the thread put to sleep by the OS. More importantly, you can have the thread woken up precisely in the right circumstances by using a proven lock instead of this mechanism.

Yes, it is true in theory that functionA is simply wasting cycles as it waits for unlocking, and functionB is wasting cycles waiting for locking. However, for my requirements, it meets my needs.


I'm a bit busy at the moment so I'm not really able to study this in more detail; upon closer inspection of the loop invariants I don't think there's a trivial case where you can double-enter the lock, but that may not be guaranteed.

Not a problem. I personally can't see an instance (at least within user-land space) where it's possible for functionA to re-enter the loop, considering it's a self-locking mechanism.



What's far more important is that this is effectively a really bad way of single-threading your program.

Run through the starvation condition and its inverse (where function B gets starved instead of A). If you step through the logic carefully, you'll notice that you actually force each function to wait until after the other thread has run at least once. You're basically halving your computational throughput here. You're not simply enforcing that the threads do not run simultaneously; you're forcing them to take turns alternating back and forth. And if that's the behavior you actually want, threading is not the right solution. You should just write everything in a single thread.

This section of the code is used by my editor to lock a world object away from my renderer so it can be deleted. Since it's a rare occasion (mostly on startups, or re-loads of the entire world), it's a non-issue for this particular instance.

Much of my threading is done by double-buffering matrices and using a swap variable. Essentially, I have three matrices: one is for the update thread itself, the other two are for drawing. When my update thread finishes, it checks whether one of the two other matrices is available (i.e. it has been swapped); if so, it takes the non-drawing matrix and writes to that, marks the flag for swapping, and the renderer swaps it the next time it comes along.
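For what it's worth, here is a hedged sketch of that three-matrix swap using C++11 atomics; every name below is a placeholder, not the actual editor/renderer code.

[code]
#include <atomic>

struct Matrix { float m[16]; };

Matrix g_buf[3];                            // the three matrices

const unsigned FRESH = 4;                   // bit 2 marks "newly written, not yet drawn"
std::atomic<unsigned> g_pending(2);         // slot index (0..2), possibly with FRESH set

unsigned g_writeSlot = 0;                   // owned by the update thread
unsigned g_drawSlot  = 1;                   // owned by the render thread

// Update thread: fill its private slot, publish it, and recycle the old pending slot.
void PublishUpdate()
{
    // ... write g_buf[g_writeSlot] here ...
    unsigned old = g_pending.exchange(g_writeSlot | FRESH, std::memory_order_acq_rel);
    g_writeSlot = old & 3;                  // take whichever slot was sitting in "pending"
}

// Render thread: swap only if something new was published since the last draw.
void PickLatestForDraw()
{
    if (g_pending.load(std::memory_order_acquire) & FRESH)
    {
        unsigned old = g_pending.exchange(g_drawSlot, std::memory_order_acq_rel);
        g_drawSlot = old & 3;               // now draw from the freshly written slot
    }
    // ... draw g_buf[g_drawSlot] ...
}
[/code]

The exchange keeps the three slots a strict rotation between the writer, the reader and the pending index, so neither thread ever touches a matrix the other is using.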



Sorry, with or without volatile, your code is not thread safe.
Volatile stops the compiler from reordering accesses to that variable, but you failed to put in a hardware memory barrier (which prevents the CPU itself from reordering your execution with its out-of-order execution machinery).

How exactly do I place a hardware barrier on that code section?


  1. If by "safe" you mean that, after FunctionB has just finished, the next iteration of FunctionA will "do stuff" and then tell FunctionB to run again, then you're wrong. FunctionA may iterate a million times before FunctionB sets Flag back to 0 ('A' would loop doing nothing), or sets Flag |= 1 ('A' would loop "doing stuff" but without telling B to go ahead). This is the starvation problem. If you don't mind that B may iterate several times until it sees it can actually unlock (perhaps because you're processing discardable data and you only care about the "latest" one, just like UDP video streaming in networking), then "yeah, it's sort of safe", but your threads can starve a lot. You need to yield them. On Windows you can use SwitchToThread() (OS level, requires entering kernel mode) or the pause instruction (HW level, not guaranteed to actually yield).


I completely understand that functionA can cycle a million times; that's acceptable in my application. I agree that yielding/switching is the better way performance-wise, but in my current tests the time spent waiting has not warranted such measures.


2. Your code has a lot of false cache sharing. The pause instruction is supposed to help with that, but you're reading and writing the same memory location too much.


Essentially, if I'm reading this right, the problem is that I am not yielding the thread that is waiting and reading, and as such I'm causing false cache sharing when I do write? I've never looked at optimizing code against local caches before, so I'll have to do some more reading on the subject.


3. Your code won't work with more than 2 threads.

Yes, this has been pointed out, and I completely understand the reasoning behind why.


4. What you're doing is called a busy spin-lock. You should beware of live locks. Live locks are harder to spot than deadlocks, because the debugger will tell you when your process looks dead, but during livelocks, everything seems to be working fine. Except it's not.

I've heard of deadlocks and starvation before, but never live locks; it's definitely something I'll keep my eye on, thanks for the info.



5. Busy spin-locks with more threads than CPU cores (oversubscription) are a very bad idea. HyperThreading cores don't count as real cores for this case.


How so, exactly? I can understand that it means both are executing at the same time, so one thread is always being wasted, but as long as it's only spin-locked, it's implied that something is progressing to eventually clear the lock.


6. Since thread A is waiting for B, and B then will wait for A, you negate any performance benefit from multi-threading while getting all the cons described above. You'll be better off with a kernel-based lock in this case. An implementation like yours may be useful in a case where you know there will be a very tiny lock every once in a while. I can hardly recommend using your implementation, because it becomes too error-prone and hard to maintain once it gets bigger, but if you're trying to learn how multithreading works and interacts, then it's fine :)

Yes, exactly. The case is small and very rare; it's not designed to be used anywhere close to actual real-time multi-threading, it's simply used when my entire world needs to be destroyed or loaded.
Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.
Leaving aside that a command queue would be a much better way to solve the problem as described above, I'll field the hyper-threading question.

As mentioned, a hyper-threaded core is not really two cores; the two threads share execution elements and work on the principle that during a normal thread's execution there is 'dead time' while values are being fetched from memory or otherwise progressing, and that using this 'dead time' to execute another command stream grants you better hardware utilisation.

However, with a tight spin-lock there is never any 'dead time' to take advantage of - sure, the second thread might well get scheduled in, but most of your time is going to be spent spinning the same instructions on the core, wasting power, time and resources in general. The Windows scheduler also tries to use as few physical cores as possible during execution, so if you spin up an app with two threads in it, chances are they will both end up on the same physical core and just burn resources for no real reason.

That's not to say spinlocks are a bad idea; if you know the lock won't be held for long they can do the job well, as it saves your thread getting swapped out and losing its quantum - however, a pure spin lock in a world of hyper-threaded cores isn't a good idea. I seem to recall that Windows' CRITICAL_SECTION is, under the hood, an adaptive spinlock, in that it will spin for a very short period of time before the thread gets swapped out and a real 'lock' is waited on. (It might even track history on locks to make this choice, I can't recall off hand.)

This also bleeds into atomics and false sharing: even if your two threads DO end up on separate cores spinning away, the data they want is in two separate L1 cache lines, so unless you tell the other core to reload its cache to 'sync', even when the bit gets set there is no guarantee the other thread will see it happen; either at all, or not for a 'long' time, depending on what the system does.
(False sharing isn't just a lock problem, it's a multi-threaded problem in general.)
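As an illustration of keeping each thread's flag on its own cache line, here is a sketch assuming C++11 alignas and a 64-byte line; the struct and variable names are invented for the example.

[code]
#include <atomic>

// Each flag gets its own 64-byte cache line, so one thread spinning on its flag
// doesn't keep bouncing the line the other thread is writing to.
struct alignas(64) PaddedFlag
{
    std::atomic<int> value;
    char             pad[64 - sizeof(std::atomic<int>)];
    PaddedFlag() : value(0) {}
};

PaddedFlag g_readyForUpdate;   // written by the render thread, polled by the update thread
PaddedFlag g_readyForDraw;     // written by the update thread, polled by the render thread
[/code]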

[quote name='brx' timestamp='1341245994' post='4954952']
In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).

What exact predefined code causes such boundaries to be created, such that I can't just call that code myself?
[/quote]

I am not 100% sure how it is in C++, but in Java accessing a volatile variable creates such a boundary. From Matias Goldberg's post I assume it is somewhat similar for C++, but it seems to be valid for the compiler optimizations only (not the reordering that might be done by the CPU).

One question I do have is why using a mutex instead of rolling your own seems completely out of the question for you?

The C++98 standard doesn't even mention memory barriers (or threads). Volatile isn't guaranteed to prevent re-ordering of memory accesses in general -- it only prevents reordering with respect to other volatile variables, which isn't much use at all for multi-threaded programming!

When writing standards compliant C++98 code, it's therefore impossible to implement a memory-barrier or sync primitive such as a mutex yourself -- in order to implement these, you need to use implementation-defined behaviour (i.e. use compiler-specific C++ code, instead of portable/standard C++ code).

Some compilers have (dubiously) decided to make it so that volatile inserts memory barriers, but IMO this is very harmful as it lures you into writing incorrect code that only works on that particular compiler.

In short: In C++ it's much better to use your OS's synchronisation primitives, such as critical-sections on Windows, than to try and build them yourself.
Having said that, I do actually use my own home-made mutex class instead of Windows' one...
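For reference, a minimal sketch of the Windows critical-section route, assuming Win32; the lock name and the spin count are invented, and the spin count ties in with the adaptive-spinlock behaviour mentioned above.

[code]
#include <windows.h>

CRITICAL_SECTION g_worldLock;

void InitLocks()
{
    // Spin briefly in user mode before falling back to a kernel wait (the count is just a guess).
    InitializeCriticalSectionAndSpinCount(&g_worldLock, 4000);
}

void DestroyWorldSafely()
{
    EnterCriticalSection(&g_worldLock);   // blocks (or briefly spins) until the renderer is out
    // ... delete / reload the world objects here ...
    LeaveCriticalSection(&g_worldLock);
}

void ShutdownLocks()
{
    DeleteCriticalSection(&g_worldLock);
}
[/code]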

However with a tight spin-lock there is never any 'dead time' to take advantage of - sure, the second thread might well get scheduled in, but most of your time is going to be spent spinning the same instructions on the core, wasting power, time and resources in general.

That's not to say spinlocks are a bad idea, if you know the lock won't be held for long they can do the job well as it saves your thread getting swapped out and losing its quantum - however a pure spin lock in a world of hyper-threaded cores isn't a good idea.
A properly written spinlock should issue _mm_pause/yield type instructions instead of nop instructions, which actually tell a hyper-threaded core to switch over to the other hardware thread.
In general they're still a bad idea, but when used properly they are actually an important part of taking full advantage of HT hardware. E.g. if threads A and B are both running on the same HT core, and A has to wait for B, then a spin-lock (using pause/yield) will actually be quite efficient, as long as B is expected to release the lock in a fairly short amount of time.
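For illustration, a sketch of such a spinlock, assuming MSVC/Win32 on x86; the class and its names are made up for the example, not anyone's production code.

[code]
#include <windows.h>
#include <emmintrin.h>   // _mm_pause

class SpinLock
{
    volatile LONG locked_;
public:
    SpinLock() : locked_(0) {}

    void Lock()
    {
        // InterlockedExchange is an atomic read-modify-write and a full barrier.
        while (InterlockedExchange(&locked_, 1) != 0)
            _mm_pause();                   // the pause hint: lets the sibling hyper-thread run
    }

    void Unlock()
    {
        InterlockedExchange(&locked_, 0);  // release the lock (also a full barrier)
    }
};
[/code]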
I just remembered reading that the C++11 std::atomic feature deals with instruction reordering. However, I have never used it and can't say anything about it. If you do have a C++11 compiler, the std::thread and std::mutex classes should be all you need anyway.

Less error-prone is the usage of std::async and std::future; however, from what you have written so far, your use-case seems to fall outside the scope of those two.
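A tiny sketch of that C++11 route, with invented names, would look something like this:

[code]
#include <mutex>

std::mutex g_worldMutex;

void UpdateWorld()
{
    std::lock_guard<std::mutex> lock(g_worldMutex);   // locks here, unlocks when 'lock' leaves scope
    // ... "do stuff" on the shared world data ...
}

void RenderWorld()
{
    std::lock_guard<std::mutex> lock(g_worldMutex);   // the other thread blocks instead of spinning
    // ... read the shared world data ...
}
[/code]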

A properly written spinlock should issue _mm_pause/yield type instructions instead of nop instructions, which actually tell a hyper-threaded core to switch over to the other hardware thread.
In general they're still a bad idea, but when used properly they are actually an important part of taking full advantage of HT hardware. E.g. if threads A and B are both running on the same HT core, and A has to wait for B, then a spin-lock (using pause/yield) will actually be quite efficient, as long as B is expected to release the lock in a fairly short amount of time.


Yes, my bad, I should have said a spinlock implemented the way the OP did it - as you say, the pause/yield instructions take care of the problem I mentioned :)

