# visual studio code optimization is breaking my code.

This topic is 2054 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

hello everyone, i'm having a weird issue with my code, and i believe i've tracked it down to the compiler optimizing out/pre-fetching/doing something that is breaking my code.

it's a multithreaded locking mechanism, so it basically looks like this:

 unsigned char Flag=0; void FunctionA(void){ while(1){ if(!(Flag&amp;2)){ //if not locked... //Do stuff if(Flag&amp;1) Flag|=2; //Locked this thread... } } } void FunctionB(void){ Flag|=1; //Request lock while(!(Flag&amp;2)); //Wait for lock <-- THIS IS BROKEN? //Do stuff Flag^=3; //Unlock } 

with debugging through output, i've discovered that FunctionA does indeed lock the variable, and FunctionB is inside of the loop, waiting for locking, but even after my output clearly tells me flag has been locked, my waiting thread simple does not break, i assume this is because flag might be pre-catched or something?

note, that without optimization this works as expected.

edit: ok, so in vs, i can even pause the application, and look directly at the flag, which is reported to the value i'm waiting on, yet, for w/e reason, i'm still inside the loop?! Edited by slicer4ever

##### Share on other sites
If you want the compiler to know that a variable may be modified outside the code it can see at the time, you need to mark that variable as volatile (or reference it through a volatile reference, etc.).

##### Share on other sites
ah, i didn't know that, thanks 100 fold, fixs my problem=-)

##### Share on other sites

There's at least one trivial deadlock condition and possibly also conditions in which your code can allow multiple threads inside the lock, and that's easy to spot without spending much time analyzing the behavior.

purely in the above code?, how can functionA interact until after m_Flag^=3 is done inside functionB?

Also, if your actual code is more complex than this, be warned that it's actually probably got more problems, not fewer.

yes, my code is far more complex, but it boils down to exactly the situation as the above code.

Threading is hard. Locking in pure user-space is not possible to do safely. You need to use kernel-level locks or hardware-level locks or you will introduce very, very nasty bugs into your programs.

i don't see why user-space locks can't work similar to kernel/hardware locks, the only exception being if the os, or an outside program interferes, but in my opinion, if someone wants to monkey around with their ram to modify my program at runtime, that's their business, and my program shouldn't try to prevent this, or any exceptions caused by such behavior. Edited by slicer4ever

##### Share on other sites
Threading is hard. Locking in pure user-space is not possible to do safely. You need to use kernel-level locks or hardware-level locks or you will introduce very, very nasty bugs into your programs.
Intel would like to disagree with you there, also, CRITICAL_SECTION based locking on a windows platform first seeks to do user-mode locking before it asks the kernel for help.

I would say safe user-mode locking is not possible without a lot of research and applied knowledge, instead of plainly stating its impossible.

So in this case, all those bit op's would need to become atomic for starters, but spinning can still be done on the "naked" volatile variable, you just need to use a atomic op to ensure its free when you try lock it for the current thread. Edited by Necrolis

##### Share on other sites
This has nothing to do with someone "interfering" with your program. It's a property of pre-emptive multitasking kernels that you cannot avoid.

In your code, consider the following sequence of events:

- Flag starts at 0
- In thread 1, function B runs
- Function B loops until the second bit of Flag is cleared (which is immediately true, since Flag is 0)
- At this exact moment, after breaking from the loop but prior to the "do stuff" portion, thread 1 is suspended by the kernel
- Thread 2 begins executing inside function A
- Flag & 2 is false, because as we just stated the loop exited in Function B
- Therefore we "do stuff" in function A
- Partway through "doing stuff", thread 2 is suspended by the kernel, and thread 1 resumes

Now both Thread 1 and Thread 2 are inside the "lock" and doing stuff.

I will say I was incorrect about deadlocks, but you can still starve threads trivially. For example, you can starve the execution of "doing stuff" in Function A with the following pattern:

- Thread 1 runs Function A
- Thread 2 on a concurrent core comes in and executes Function B
- Thread 2 running Function B suspends just prior to the xor in the last line
- Function A continues to run in thread 1 and has just executed the inner if() check BUT NOT the flag change
- At this point Flag is still 1!
- Flag is clobbered to 2 by the xor
- Thread 1 resumes, thread 2 continues on its merry way doing something unrelated
- Now, remember that function A was suspended just before the inner if() check
- Flag is now set back to 3
- A loops; because bit 2 is set, it skips the outer if() check and begins to spin
- Thread 2 comes back into function B
- Flag is "updated" to 3 by function B
- Thread 2 suspends just after exiting the loop (since the 2nd bit of Flag is still set)
- Thread 1 remains spinning until whenever Thread 2 wakes up and gets around to clearing the bits

I'm sure there are more problems that I could find if I took the time to look over all the interleaving possibilities of thread executions, but the presence of a serious lock failure and a starvation condition should be enough to convince you that your lock is busted.

 And that's only considering two-thread interactions. You can get all kinds of funky bugs by introducing a third thread. I highly urge you to try (mentally) running three threads with this lock pattern and seeing if you can find some of the race conditions. Edited by ApochPiQ

##### Share on other sites

[quote name='ApochPiQ' timestamp='1341206672' post='4954779']Threading is hard. Locking in pure user-space is not possible to do safely. You need to use kernel-level locks or hardware-level locks or you will introduce very, very nasty bugs into your programs.
Intel would like to disagree with you there, also, CRITICAL_SECTION based locking on a windows platform first seeks to do user-mode locking before it asks the kernel for help.

I would say safe user-mode locking is not possible without a lot of research and applied knowledge, instead of plainly stating its impossible.

So in this case, all those bit op's would need to become atomic for starters, but spinning can still be done on the "naked" volatile variable, you just need to use a atomic op to ensure its free when you try lock it for the current thread.
[/quote]

Intel thoroughly agrees with me. If you read their source listing, they're using hardware-level locks (aka interlocked instructions) to achieve atomicity. Interlocking is still a form of hardware-level locking, especially on modern memory buses and multi-layer cache architectures.

##### Share on other sites

This has nothing to do with someone "interfering" with your program. It's a property of pre-emptive multitasking kernels that you cannot avoid.

In your code, consider the following sequence of events:

- Flag starts at 0
- In thread 1, function B runs
- Function B loops until the second bit of Flag is cleared (which is immediately true, since Flag is 0)
- At this exact moment, after breaking from the loop but prior to the "do stuff" portion, thread 1 is suspended by the kernel
- Thread 2 begins executing inside function A
- Flag & 2 is false, because as we just stated the loop exited in Function B
- Therefore we "do stuff" in function A
- Partway through "doing stuff", thread 2 is suspended by the kernel, and thread 1 resumes

Now both Thread 1 and Thread 2 are inside the "lock" and doing stuff.

the problem here is that Function B loops until the second bit is SET, not when it is cleared, it sets bit 1 of flag requesting for a lock, and then cycles until bit 2 is set, in functionA, it sees bit 1 is set, and set's bit 2. at this point it cannot access it's data until bit 2 is cleared, and bit 2 is only cleared until the end of functionA, when it xor's the flag.

- Thread 1 runs Function A ----------------------------------------------------------------------------------------------------------//FunctionA starts
- Thread 2 on a concurrent core comes in and executes Function B--------------------------------------------------------- //FunctionB starts same time as A
- Thread 2 running Function B suspends just prior to the xor in the last line-------------------------------------------------// since it has reached xor(but not done it), we can assume flag is 3
- Function A continues to run in thread 1 and has just executed the inner if() check BUT NOT the flag change------//FunctionA not sure what you mean here?, do you mean at if(Flag&2) and it has not done Flag&2?, i was under the assumption the thread couldn't be suspended at such an operation?
- At this point Flag is still 1!-------------------------------------------------------------------------------------------------------------//Flag is still 3 since we have not xor'd
- Flag is clobbered to 2 by the xor-----------------------------------------------------------------------------------------------------//xor clears flag to 0 not 2
- Thread 1 resumes, thread 2 continues on its merry way doing something unrelated
- Now, remember that function A was suspended just before the inner if() check
- Flag is now set back to 3--------------------------------------------------------------------------------------------------------------//Can a thread resume actually overwrite a variable(again didn't know), wouldn't the 3 be in a temporary stored register, at this point, and not actually write the value?
- A loops; because bit 2 is set, it skips the outer if() check and begins to spin
- Thread 2 comes back into function B
- Flag is "updated" to 3 by function B-------------------------------------------------------------------------------------------------//Again, did not know a thread resume can directly overwrite a user-space variable.
- Thread 2 suspends just after exiting the loop (since the 2nd bit of Flag is still set)
- Thread 1 remains spinning until whenever Thread 2 wakes up and gets around to clearing the bits

I'm sure there are more problems that I could find if I took the time to look over all the interleaving possibilities of thread executions, but the presence of a serious lock failure and a starvation condition should be enough to convince you that your lock is busted.

i added some comments to the steps, but i did not know that thread resumes can overwrite a value by themselves, i was under the assumption that they couldn't perserve the variable in such a way(as well, i've actually never encountered this level of multi-threading problem)
edit: my spacing wasn't persevered, my comments next to each step is after //
edit2: if i read this correctly, on thread 2's resume it would fix the flag?(if so, i suppose my above issue is a bit out of context, as functionB is inside a loop of the thread, so in theory, this issue would resolve itself as long as thread 2 properly restores itself, and thread 1 doesn't keep gettting stuck at the inner-loop.

edit3: sorry if your in the process of responding with my edits, anyway, as to your bottom statement, i have to disagree about my lock being bust, at best it prevents the data it's locking from being accessed by the thread that isn't suppose to be able to, at worse(and only in situations where thread 2 is suspended permanently after it's operation's, is that it locks the data permanently.) however, since my situation(not the above code, my actual program) means that Thread 1 will always terminate before Thread 2, i can safely assume that any starvation is only a temporary occurence, and this particular thread locking is not used in any real-time conditions.

 And that's only considering two-thread interactions. You can get all kinds of funky bugs by introducing a third thread. I highly urge you to try (mentally) running three threads with this lock pattern and seeing if you can find some of the race conditions.

yes, i realize three threads can defiantly have a huge problems with this particular locking system, i've developed another method for handling n threads in a similar manner, essentially, each thread is given a bit to represent it's locked state, and on request, each thread locks itself when able to, and the requestor moves forward only after all the threads have succesfully locked themselves. Edited by slicer4ever

##### Share on other sites

Intel thoroughly agrees with me. If you read their source listing, they're using hardware-level locks (aka interlocked instructions) to achieve atomicity. Interlocking is still a form of hardware-level locking, especially on modern memory buses and multi-layer cache architectures.
somehow over looked the part about hardware locking, guess its too early in the morning for me . In which case I agree with your statement

##### Share on other sites
Checking and setting the flag can't be done in one operation, so what happens if Thread 1 is between checking and setting the flag, and Thread 2 checks the flag? Then both will enter the loop, too.

##### Share on other sites
In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).

##### Share on other sites
OK, serves me right for trying to debug code late at night :-)

Let's look at the starvation example. You seem to think that I'm suggesting that there is magic going on; this is not true. I do not mean to imply that variables are just mysteriously getting changed for no apparent reason. I'm stepping through the execution of the code, one line at a time, and describing the effects of each step.

You say that this lock is not used for any time-critical operations, so why are you worried about it? Why not use a proven-safe locking mechanism like an OS critical section/mutex? You're not going to do any better performance-wise with spinlocks unless you have a far richer understanding of the intricacies of threading behavior and nondeterminism. Spinlocks of indefinite duration are generally a bad idea, because if you're not in time-critical operation, you are by definition running your operations for potentially "long" periods of time. This means your busy wait will waste CPU core time when you could just have the thread put to sleep by the OS. More importantly, you can have the thread woken up precisely in the right circumstances by using a proven lock instead of this mechanism.

I'm a bit busy at the moment so I'm not really able to study this in more detail; upon closer inspection of the loop invariants I don't think there's a trivial case where you can double-enter the lock, but that may not be guaranteed.

What's far more important is that this is effectively a really bad way of single-threading your program.

Run through the starvation condition and its inverse (where function B gets starved instead of A). If you step through the logic carefully, you'll notice that you actually force each function to wait until after the other thread has run at least once. You're basically halving your computational throughput here. You're not simply enforcing that the threads do not run simultaneously; you're forcing them to take turns alternating back and forth. And if that's the behavior you actually want, threading is not the right solution. You should just write everything in a single thread.

If that's not what you want, I strongly suggest you use a proven lock mechanism instead of trying to roll your own.

##### Share on other sites
Volatile fixes the compiler reordering your variable, but you failed to put a Hardware memory barrier (which prevents the Hardware CPU from reordering your execution due to "Out of order execution" algorithms).

Once that's fixed, you'll have to consider the following:

1. If by "safe" you mean that, after FunctionB just finished, in the next iteration of functionA, it will "do stuff" and then tell FunctionB to run again, then you're wrong. FunctionA may iterate a million times before FunctionB sets Flag back to 0 ('A' would loop doing nothing), or sets Flag |= 1 ('A' would loop "doing stuff" but without telling B to go ahead). This is the starvation problem. If you don't worry that B may iterate several times until it sees it can actually unlock (perhaps because you're processing discardable data, and you only care about the "latest" one, just like UDP video streaming in networking), then "yeah, it's sort of safe", but your threads can starve a lot. You need to yield them. On Windows you can use SwitchToThread() (OS Level, requires entering Kernel mode) or the pause instruction (HW Level, not guaranteed to actually yield)
2. Your code has a lot of false cache sharing. The pause instruction is supposed to aid on that, but you're reading-and-writting to the same memory location too much.
4. What you're doing is called a busy spin-lock. You should beware of live locks. Live locks are harder to spot than deadlocks, because the debugger will tell you when your process looks dead, but during livelocks, everything seems to be working fine. Except it's not.
5. Busy spin-locks with more threads than CPU cores (oversubscription) is a very bad idea. HyperThreading cores don't count as a real core for this case.
6. Since thread A is waiting for B, and B then will wait for A, you negate any performance benefit from multi-threading, while getting all the cons described above. You'll be better off with a kernel-based lock in this case. An implementation like yours may be useful in a case where you know there will be a very tiny lock every once in a while. I would hardly recommend using your implementation because it's too error-prone once it gets bigger and hard to maintain, but if you're trying to learn how multithreading works and interacts, then it's fine
Edited by Matias Goldberg

##### Share on other sites

Checking and setting the flag can't be done in one operation, so what happens if Thread 1 is between checking and setting the flag, and Thread 2 checks the flag? Then both will enter the loop, too.

...how exactly do both enter the loop?, I don't see how your line of thinking lines up with the code: functionB sets bit 1 of flag, then functionA sets bit 2 of the flag, since functionA sets bit 2 at the end of the loop, it can not return until functionB unsets bit 2 of flag, as such, i see no possible way that after functionB locks itself from the data, it can then somehow re-enter the data....

In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).

what exact pre-definition codes cause such boundry's to be created?, that i can't just call such code as well?

OK, serves me right for trying to debug code late at night :-)

no worries mate, was ~1am here when i posted this=-).

Let's look at the starvation example. You seem to think that I'm suggesting that there is magic going on;

Let's stop right here for a second, i have never once used the word magic to describe anything that is going on, so please, can we stay away from assuming that simply because i do not realize some things the os is doing means i believe that it's just magically happening?

this is not true. I do not mean to imply that variables are just mysteriously getting changed for no apparent reason. I'm stepping through the execution of the code, one line at a time, and describing the effects of each step.

yes, i appreciate the stepping through, it allows me to see what you are exactly talking about. I understand perfectly what you are talking about now, if the pre-emptive thread can temporarly store the variable "Flag", then upon resume of the bitwise operation, it uses that stored state, as such, even though functionA would have changed the value of Flag, for that instance, Flag is still locked, until Thread 1 resumes(which in the case presented above, it would most likly not, but in my actual working program, it does.)

You say that this lock is not used for any time-critical operations, so why are you worried about it? Why not use a proven-safe locking mechanism like an OS critical section/mutex? You're not going to do any better performance-wise with spinlocks unless you have a far richer understanding of the intricacies of threading behavior and nondeterminism.

this is a classic catch-22: can't get experiance without work, can't get work without experiance. exactly when am i suppose to learn and understand such mechanism, if i'm told to simply use existing mechanisms?

Spinlocks of indefinite duration are generally a bad idea, because if you're not in time-critical operation, you are by definition running your operations for potentially "long" periods of time. This means your busy wait will waste CPU core time when you could just have the thread put to sleep by the OS. More importantly, you can have the thread woken up precisely in the right circumstances by using a proven lock instead of this mechanism.

yes, this is theoretically true that functionA is simply wasting cycles as it waits for unlocking, and functionB is wasting cycles waiting for locking. however, for my requirments, it meets my needs.

I'm a bit busy at the moment so I'm not really able to study this in more detail; upon closer inspection of the loop invariants I don't think there's a trivial case where you can double-enter the lock, but that may not be guaranteed.

not a problem, i personally can't recognize an instance(at least within user-land space) where it's possible for functionA to re-enter the loop, considering it's a self-locking mechanism.

What's far more important is that this is effectively a really bad way of single-threading your program.

Run through the starvation condition and its inverse (where function B gets starved instead of A). If you step through the logic carefully, you'll notice that you actually force each function to wait until after the other thread has run at least once. You're basically halving your computational throughput here. You're not simply enforcing that the threads do not run simultaneously; you're forcing them to take turns alternating back and forth. And if that's the behavior you actually want, threading is not the right solution. You should just write everything in a single thread.

this section of the code is used for my editor in order to lock an world object from my renderer to be deleted, since it's a rare-occasion(mostly on startup's, or re-loads of the entire world), it's a non-issue for this particular instance.

much of my threading is done with double-buffering matrix's, and using a swap variable, essentially, i have three matrix's, one is for the update thread itself, the other two are for drawing, when my update thread finishs, it checks if one of the two other matrix's are available(such that they have been swapped), if so, it takes the non-drawing matrix, and writes to that, marks the flag for swapping, and the renderer swaps it the next time it comes along.

Volatile fixes the compiler reordering your variable, but you failed to put a Hardware memory barrier (which prevents the Hardware CPU from reordering your execution due to "Out of order execution" algorithms).

how exactly do i place a hardware barrier on the code section?

1. If by "safe" you mean that, after FunctionB just finished, in the next iteration of functionA, it will "do stuff" and then tell FunctionB to run again, then you're wrong. FunctionA may iterate a million times before FunctionB sets Flag back to 0 ('A' would loop doing nothing), or sets Flag |= 1 ('A' would loop "doing stuff" but without telling B to go ahead). This is the starvation problem. If you don't worry that B may iterate several times until it sees it can actually unlock (perhaps because you're processing discardable data, and you only care about the "latest" one, just like UDP video streaming in networking), then "yeah, it's sort of safe", but your threads can starve a lot. You need to yield them. On Windows you can use SwitchToThread() (OS Level, requires entering Kernel mode) or the pause instruction (HW Level, not guaranteed to actually yield)

i completly understand that functionA can cycle a million times, that's acceptable in my application, i agree that yielding/switching is the better way performance wise, in my current test's, the time spent waiting has not caused for warranting such measures.

2. Your code has a lot of false cache sharing. The pause instruction is supposed to aid on that, but you're reading-and-writting to the same memory location too much.

essentially, if i'm reading this right, the problem is that i am not yielding the thread that is waiting and reading, as such, i causing false cache sharing when i do write? i've never looked at optimizing code against local cache's before, so i'll have to do some more reading on the subject.

yes, this has been pointed out, and i completly understand the reasoning behind why.

4. What you're doing is called a busy spin-lock. You should beware of live locks. Live locks are harder to spot than deadlocks, because the debugger will tell you when your process looks dead, but during livelocks, everything seems to be working fine. Except it's not.

i've heard of deadlocks and starvations before, but never live locks, it's defiantly something i'll keep my eye on, thanks for the info.

5. Busy spin-locks with more threads than CPU cores (oversubscription) is a very bad idea. HyperThreading cores don't count as a real core for this case.

how so exaclty?, i can understand that it means both are executing at the same time, as such it's always wasting one thread, but as long as it's only spin-locked, then it's implied that something is progressing to eventually clear the lock.

6. Since thread A is waiting for B, and B then will wait for A, you negate any performance benefit from multi-threading, while getting all the cons described above. You'll be better off with a kernel-based lock in this case. An implementation like yours may be useful in a case where you know there will be a very tiny lock every once in a while. I would hardly recommend using your implementation because it's too error-prone once it gets bigger and hard to maintain, but if you're trying to learn how multithreading works and interacts, then it's fine

yes, exactly, the case is small, and very rare, it's not designed to be used anywhere close to actual multi-threading in real-time, it's simply used when my entire world needs to be destroyed or loaded. Edited by slicer4ever

##### Share on other sites
Leaving aside that a command queue would be a much better way to solve the problem as described above I'll field the hyper-threading question.

As mentioned a hyper-threaded core is not a real two cores; the two threads share execution elements and work on the principle that during a normal thread's execution there is 'dead time' while values are being fetched from memory or otherwise progressing and that using this 'dead time' to execute another command stream grants you better hardware utilisation.

However with a tight spin-lock there is never any 'dead time' to take advantage of - sure the second thread might well get scheduled in but most of your time is going to be spent spinning the same instructions on the core wasting power, time and resources in general. The Window's scheduler also tries to use as few physical cores as possible during execution so if you spin up an app with two threads in it chances are they will both end up on the same physical core and just burn resources for no real reason.

That's not to say spinlocks are a bad idea, if you know the lock won't be held for long they can do the job well as it saves your thread getting swapped out and losing its quantum - however a pure spin lock in a world of hyper-threaded locks isn't a good idea. I seem to recall that Window's Critical_Section is under the hood an adaptive spinlock in that it will spin for a very short period of time before the thread gets swapped out and a real 'lock' is waited on. (It might even track history on locks to make this choice, I can't recall off hand).

This also bleeds into atomics and false sharing as even if your two threads DO end up on seperate cores spinning away the data they will want is in two seperate cache 1 lines so unles you tell the other core to reload it's cache to 'sync' even when the bit gets set there is no garentee the other thread will see it happen; either at all or for a 'long' time depending on what the system does.
(False sharing isn't just a lock problem it's a multi-threaded problem in general).

##### Share on other sites

[quote name='brx' timestamp='1341245994' post='4954952']
In FunctionB you *could* get a problem with compiler/CPU optimizations causing lines/instructions to be swapped around if they are independent of each other (some of the "Do Stuff"-stuff could be executed before the busy wait) which is guaranteed not to happen when the code involves critical sections (because they create boundaries for such optimizations).

what exact pre-definition codes cause such boundry's to be created?, that i can't just call such code as well?
[/quote]

I am not 100% sure how it is in C++, but in Java accessing a volatile variable creates such a boundary. From Matias Goldberg's post I assume it is somewhat similar for C++, but it seems to be valid for the compiler optimizations only (not the reordering that might be done by the CPU).

One question I do have is why using a mutex instead of rolling your own seems completely out of the question for you?

##### Share on other sites

I am not 100% sure how it is in C++, but in Java accessing a volatile variable creates such a boundary. From Matias Goldberg's post I assume it is somewhat similar for C++, but it seems to be valid for the compiler optimizations only (not the reordering that might be done by the CPU).

One question I do have is why using a mutex instead of rolling your own seems completely out of the question for you?
According to the C++98 standard, memory barriers (and threads) aren't even mentioned. Volatile isn't guaranteed to prevent re-ordering of memory accesses in general -- it only prevents reordering with respect to other volatile variables, which isn't much use at all for multi-threading programming!

When writing standards compliant C++98 code, it's therefore impossible to implement a memory-barrier or sync primitive such as a mutex yourself -- in order to implement these, you need to use implementation-defined behaviour (i.e. use compiler-specific C++ code, instead of portable/standard C++ code).

Some compilers have (dubiously) decided to make it so that volatile inserts memory barriers, but IMO this is very harmful as it lures you into writing incorrect code that only works on that particular compiler.

In short: In C++ it's much better to use your OS's synchronisation primitives, such as critical-sections on Windows, than to try and build them yourself.
Having said that, I do actually use my own home-made mutex class instead of Windows' one...

However with a tight spin-lock there is never any 'dead time' to take advantage of - sure the second thread might well get scheduled in but most of your time is going to be spent spinning the same instructions on the core wasting power, time and resources in general.

That's not to say spinlocks are a bad idea, if you know the lock won't be held for long they can do the job well as it saves your thread getting swapped out and losing its quantum - however a pure spin lock in a world of hyper-threaded locks isn't a good idea.
A properly written spinlock should issue [font=courier new,courier,monospace]mm_pause[/font]/[font=courier new,courier,monospace]yeild[/font] type instructions instead of [font=courier new,courier,monospace]nop[/font] instructions, which actually tells a hyper-threaded core to switch over to the other hardware thread.
In general, they're still a bad idea, but when used properly, they actually an important part of taking full advantage of HT hardware. e.g. if Threads A and B are both running on the same HT core, and A has to wait for B, then a spin-lock (using pause/yield) will actually be quite efficient, as long as B is expected to unblock the lock in a fairly short amount of time. Edited by Hodgman

##### Share on other sites
I just remembered that I read, that the C++11 std::atomic feature deals with instruction reordering. However, I have never used it and can't say anything about it. If you do have a C++11 compiler the std::thread and std::mutex classes should be all you need anyways.

Less error prone is the usage of std::async and std::future, however, from what you wrote so far your use-case seems to fall out of the scope of those two. Edited by brx

##### Share on other sites

A properly written spinlock should issue [font=courier new,courier,monospace]mm_pause[/font]/[font=courier new,courier,monospace]yeild[/font] type instructions instead of [font=courier new,courier,monospace]nop[/font] instructions, which actually tells a hyper-threaded core to switch over to the other hardware thread.
In general, they're still a bad idea, but when used properly, they actually an important part of taking full advantage of HT hardware. e.g. if Threads A and B are both running on the same HT core, and A has to wait for B, then a spin-lock (using pause/yield) will actually be quite efficient, as long as B is expected to unblock the lock in a fairly short amount of time.

Yes, my bad, I should have said a spinlock implimented as the OP had - as you say the pause/yield instructions take care of the problem I mentioned

##### Share on other sites

[quote name='Matias Goldberg' timestamp='1341284218' post='4955147']
Volatile fixes the compiler reordering your variable, but you failed to put a Hardware memory barrier (which prevents the Hardware CPU from reordering your execution due to "Out of order execution" algorithms).

how exactly do i place a hardware barrier on the code section?[/quote]

We're willing to explain you, but we also expect you would at least research the trivial parts (look for MSDN's link).

[quote name='Matias Goldberg' timestamp='1341284218' post='4955147']
2. Your code has a lot of false cache sharing. The pause instruction is supposed to aid on that, but you're reading-and-writting to the same memory location too much.

essentially, if i'm reading this right, the problem is that i am not yielding the thread that is waiting and reading, as such, i causing false cache sharing when i do write? i've never looked at optimizing code against local cache's before, so i'll have to do some more reading on the subject.[/quote]
No, it's multi-threading in general problem as someone already said; and if you're into multithreading then you should really look at optimizing cache usage. Today PCs are bottlenecked by cache misses, memory latency, and bandwidth even with one thread; not to mention they kill your scalability.
False cache sharing can't be solved with yielding. It can only be solved by not writting to a memory location you're also reading in another thread at the same time. Any kind of lock will need a variable that is shared of course (therefore false sharing becomes inevitable) but your mechanism performs non-atomic reads and writes waaaaaay too often to that variable. It can hurt you badly (in terms of performance). HW Locks hint the CPU that a variable is shared, and thus less susceptible to reducing performance.

[quote name='Matias Goldberg' timestamp='1341284218' post='4955147']
5. Busy spin-locks with more threads than CPU cores (oversubscription) is a very bad idea. HyperThreading cores don't count as a real core for this case.

how so exaclty?, i can understand that it means both are executing at the same time, as such it's always wasting one thread, but as long as it's only spin-locked, then it's implied that something is progressing to eventually clear the lock.[/quote]
Not in safety, but in terms of performance. Note that when you say "eventually clear the lock", I read "Thread B will loop doing nothing for like 10 seconds before it actually lets thread A do something. It will very very VERY slow.
If you don't believe me, use the task manager to lock your process to one core.

[quote name='Matias Goldberg' timestamp='1341284218' post='4955147']
6. Since thread A is waiting for B, and B then will wait for A, you negate any performance benefit from multi-threading, while getting all the cons described above. You'll be better off with a kernel-based lock in this case. An implementation like yours may be useful in a case where you know there will be a very tiny lock every once in a while. I would hardly recommend using your implementation because it's too error-prone once it gets bigger and hard to maintain, but if you're trying to learn how multithreading works and interacts, then it's fine

yes, exactly, the case is small, and very rare, it's not designed to be used anywhere close to actual multi-threading in real-time, it's simply used when my entire world needs to be destroyed or loaded.
[/quote]
Looks like a Critical Section or a mechanism using HW locks would be faster and easier; but Kudos for thinking out of the box.

##### Share on other sites
If this is just for a flag to say it's done loading, then just do that;

 volatile bool doneLoading =false; main() { startLoadingThread(); while(!doneLoading) { ShowSillyStuff(); } } loadThread() { //... doneLoading = true; } 

If you can restrict things so that one-and-only-one thread writes to a variable then you can keep-it-simple and volatile gets the job done.

If it's more complicated than that, read about mutexes, semaphores, & monitors.

Lastly, the optimizer will very rarely break your code.
More likely your code is broken and it broke the optimizer.

##### Share on other sites

If you can restrict things so that one-and-only-one thread writes to a variable then you can keep-it-simple and volatile gets the job done.
This is compiler-dependant.
MSVC will decide to insert a memory fence when writing to that volatile variable, but other compilers may not (and should not, as that's not what volatile is supposed to do).
On standards-compliant compilers, it's possible for the results of the "//..." section to be written to memory after the [font=courier new,courier,monospace]doneLoading=true[/font] assignment occurs, creating a race condition.
On the other hand, using your compiler's or OS's mutex/etc (to lock/unlock around each read/write operation) is guaranteed to work, or if you're familiar with the hardware, you can use your compiler's intrinsics for the appropriate assembler commands. Edited by Hodgman

##### Share on other sites
Wow so many replies...

SImply put, you absolutely cannot garuntee a locking mechinism without hardware protection. Intel has had this support for.. well just about forever now. The magic function you are most likely looking for is an assembler command called CMPXCHG. This command is a locking command -- this means that the address of the variable you are trying to access is 'locked' from other processes until this instruction executes, thus garunteeing exclusivity at the hardware level. Basically CMPXCHG compares the value in the AL/AX/EAX register with the first operand (destination operand). If the two values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL/AX/EAX register. So you call CMPXCHG passing in a 0 for the source operand (what you expect -- unlocked), and 1 for the dest operand (what you want -- locked). The function will only 'lock' (aka set the value to 1) if it is currently unlocked (currently set to 0). If the operation fails (something else locked it before this one did) it will set the ZF (zero flag) to false to indicate failure, otherwise zero flag will be 1.

 unsigned char CompareAndSwap( int *ptr, int oldvar, int newvar ) { unsigned char result; __asm__ __volatile__ ( "lock; cmpxchg %3, %1\n" "sete %b0\n" : "=r"(result), "+m"(*ptr), "+a"(oldvar) : "r"(newvar) : "memory", "cc" ); return result; }  Edited by Tim Sarbin

##### Share on other sites
It looks like you're trying to manually (re)implement (inter-)thread synchronization/notification using busy-waiting (which, as others have pointed out, isn't necessarily a very good idea).

Condition variables are usually a better solution (and don't waste CPU cycles as busy-waiting does), see the first answer here:
http://stackoverflow...le-vs-semaphore

If you can use C++11, they're already in the std library; examples:
http://cppannotation...lus18.html#l319
http://en.cppreferen...dition_variable // see the examples for wait, wait_for, and wait_until

If you can't use C++11, Boost.Thread library provides identical (just replace "std::" in C++11 examples with "boost::", that's it) feature:
http://www.boost.org...ion.condvar_ref

If you can't use Boost either (have fun...), see the source code for the implementation details relevant to your OS.

If you want a quick overview of C++11 multithreading features in general take a look at "C++11 Concurrency Tutorial":
http://www.justsoftw...concurrency.pdf Edited by Matt-D