That's probably the most common/acceptable use of volatile -- telling the compiler that it should definitely read that boolean each iteration, instead of optimising it to a single read -- especially in cases where reordering isn't a concern.
Is the following C++ pseudocode (assuming C++03) bad/evil/dangerous? ... Instruction reordering isn't bad in this example
This only works in practice though, and relies on assumptions about your hardware. There's no requirement in C++03 that when one thread writes a value of 'true' to the boolean, this value will ever become visible to other threads, volatile or not.
This is another hardware-specific detail, not specified by C++03.
since reads/writes are atomic
In your case, atomic<bool> and memory_order_relaxed. Deciding that locks are too slow, though, is an optimisation issue.
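As a rough sketch of what that looks like for a "keep looping until told to stop" flag (the names g_quit, run_worker_loop and run_and_stop_worker are made up for illustration, and this assumes the flag doesn't guard any other shared data, which is the only reason relaxed ordering is enough):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stop flag shared between a worker thread and its owner.
std::atomic<bool> g_quit{false};

// Loops until g_quit becomes true; returns how many iterations it ran.
unsigned long long run_worker_loop() {
    unsigned long long iterations = 0;
    // Relaxed is sufficient only because nothing else is published via the flag.
    while (!g_quit.load(std::memory_order_relaxed)) {
        ++iterations;
        std::this_thread::yield();
    }
    return iterations;
}

// Starts the worker, lets it run briefly, then asks it to stop.
bool run_and_stop_worker() {
    g_quit.store(false, std::memory_order_relaxed);
    unsigned long long iterations = 0;
    std::thread worker([&] { iterations = run_worker_loop(); });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    g_quit.store(true, std::memory_order_relaxed);
    worker.join();  // join() synchronises, so `iterations` is safe to read now
    return g_quit.load(std::memory_order_relaxed);
}
```

The load is explicit each iteration, so the compiler can't hoist it out of the loop, and unlike a volatile bool this is a defined way to communicate between threads in C++11.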
And what C++11 data types would be appropriate here?
One thing that bugs me is that MSVC's volatile has acted a lot like a C++11 atomic since VS2005 -- volatile reads get acquire semantics and volatile writes get release semantics (this is the /volatile:ms behaviour, which is the default when not targeting ARM). This is because too many people wrote bad volatile-based code that should be wrong due to re-ordering issues, so Microsoft changed the meaning of volatile to include ordering guarantees (later memory accesses can't be moved before a volatile read, and earlier ones can't be moved after a volatile write), to fix people's buggy code, which just encourages people to write more buggy code that will break on other C++ compilers...
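To illustrate the kind of re-ordering issue I mean, here's a minimal sketch of a flag that publishes other data (g_payload, g_ready and the function names are hypothetical). With a plain or standard-volatile flag, nothing stops the payload write being reordered after the flag write; with release/acquire atomics the ordering is guaranteed portably, without relying on MSVC's extension:

```cpp
#include <atomic>
#include <thread>

int g_payload = 0;                  // plain data, published via the flag
std::atomic<bool> g_ready{false};   // hypothetical "data is ready" flag

void producer() {
    g_payload = 42;                                  // 1. write the data
    g_ready.store(true, std::memory_order_release);  // 2. then publish it
}

int consumer() {
    // The acquire load pairs with the release store above, so once the flag
    // is seen as true, the payload write is guaranteed to be visible too.
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return g_payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int publish_and_consume() {
    g_payload = 0;
    g_ready.store(false);
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```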
With your loop, using MSVC's volatile or C++11's atomic, you get one ordered read per iteration. Using locks, and assuming no contention, you get a fenced read (taking the lock), a regular read, and a fenced write (releasing the lock) per iteration, which isn't much different. Taking contention into account though, you also might get a busy-wait with repeated fenced reads, and possibly a context-switch.
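For comparison, the lock-based version of the same loop could look something like this (a sketch only; LockedFlag and run_locked_worker are made-up names, and std::mutex is just one possible lock):

```cpp
#include <mutex>
#include <thread>

// Hypothetical quit flag guarded by a mutex instead of being atomic.
struct LockedFlag {
    std::mutex m;
    bool quit = false;

    void request_quit() {
        std::lock_guard<std::mutex> lock(m);
        quit = true;
    }

    // Each call takes the lock, does a regular read, then releases the lock.
    bool should_quit() {
        std::lock_guard<std::mutex> lock(m);
        return quit;
    }
};

// Spins a worker on should_quit(), asks it to stop, and waits for it.
bool run_locked_worker() {
    LockedFlag flag;
    unsigned long iterations = 0;
    std::thread worker([&] {
        while (!flag.should_quit()) {
            ++iterations;
            std::this_thread::yield();
        }
    });
    flag.request_quit();
    worker.join();
    return flag.should_quit();
}
```

Whether the extra lock/unlock per iteration matters is something to measure rather than assume.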
Aside from these performance differences, there are sometimes theoretical reasons to want a particular kind of non-blocking guarantee, which is a better reason to avoid locks. N.B. some kinds of lock-free systems will have worse performance than locking ones, but are used anyway because the guarantee is required for whatever reason.
Are there any valid use-cases for volatile in multi-threaded code, aside from ones like the above that can be replaced with atomic?
almost all of those volatile overloads I mentioned are for the C++11 threading library