
### Hodgman

Posted 08 October 2012 - 05:35 AM

> Such as the PowerPC's particular read/write ordering that allows a thread to read the value that was written while the write op is on its way to the cache (if I remember the talk correctly).

When writing C/C++ code for any architecture, you should assume that your reads/writes do not necessarily take place in the order that you've written them. They can occur behind the scenes in any order as long as the behaviour of the (single-threaded) program is still the same as if they did take place in the order you specified.
To ensure that your program is well behaved when using multiple threads, you need to use memory fences, which act as barriers to reordering: reads and/or writes can't be moved across the fence.
Your standard synchronisation objects like mutexes will use a memory fence internally, to ensure that operations performed inside the critical section are visible in RAM before the 'lock' is visibly released.

N.B. while x86 usually doesn't re-order reads/writes, this isn't always true, so you should still use memory fences at logical points (or standard synchronisation objects at logical points), so that you're always doing work and then publishing it.
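For illustration, here's a minimal C++11 sketch (mine, not from the talk) of that "do work, then publish it" pattern, using release/acquire semantics on an atomic flag:

```cpp
#include <atomic>
#include <cstdio>

int payload = 0;                // plain (non-atomic) data being published
std::atomic<bool> ready(false); // the 'publish' flag

void Producer()
{
    payload = 42;                                 // do the work...
    ready.store(true, std::memory_order_release); // ...then publish it; the release
                                                  // fence keeps the payload write above this line
}

void Consumer()
{
    while( !ready.load(std::memory_order_acquire) ) {} // spin until published; the acquire
    printf( "%d\n", payload );                          // fence keeps this read below -- prints 42
}
```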

> So what I'm interested to know is: does anyone know about the business side of parallel computing? Are we in a situation where proprietary technology is the only technology that is going to give programmers the best capabilities for squeezing the most out of the platforms, or what is the future of this technology?

Not really. x86 has a compare-and-swap instruction, while PPC/ARM have load-linked/store-conditional instructions, but the latter can be used to build the former. So if you write an algorithm in terms of CAS (and as above, use appropriate memory fences) then it will be portable.
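For example, a minimal sketch of a portable increment written purely in terms of CAS (using C++11 atomics here; on x86 the compare-exchange lowers to CMPXCHG, on PPC/ARM to an LL/SC loop):

```cpp
#include <atomic>

// Add 'amount' to 'value' without a mutex, expressed purely as a CAS retry loop.
void AtomicAdd( std::atomic<int>& value, int amount )
{
    int expected = value.load();
    // If another thread modified 'value' between our load and the CAS, the CAS
    // fails and reloads 'expected' with the current value, so we just retry.
    while( !value.compare_exchange_weak( expected, expected + amount ) )
    {
    }
}
```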

N.B. you've got to keep in mind what "lock-free" actually means. It's often taken to mean "I did this using raw CAS instead of mutexes!", but what it actually means is that if any one thread is arbitrarily put to sleep by the OS at any time, it will not impede the progress of all the other threads -- i.e. there will always be at least one thread that's able to make progress, regardless of the (lack of) progress in other threads.
When writing low-level 'thread-safe' structures using CAS, you can still come up with something that's faster than your standard mutex/etc. but still isn't strictly lock-free -- and this doesn't always matter (especially if you're writing on a real-time OS).
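A hypothetical sketch of exactly that situation: a spinlock is built from raw CAS and can beat a heavyweight mutex under low contention, but it isn't lock-free -- if the OS puts the lock-holder to sleep, every other thread spins without making progress.

```cpp
#include <atomic>

class SpinLock
{
    std::atomic<bool> locked{false};
public:
    void Lock()
    {
        bool expected = false;
        // Raw CAS, but not lock-free: a sleeping lock-holder stalls everyone.
        while( !locked.compare_exchange_weak( expected, true, std::memory_order_acquire ) )
            expected = false; // CAS failed -- someone else holds the lock; retry
    }
    void Unlock() { locked.store( false, std::memory_order_release ); }
};
```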

> If you have a single queue that a hundred threads read and write from hundreds of times each every millisecond, then a lock-free design starts to look attractive.

IMHO, lock-free doesn't always help too much here, because even though lock-freedom guarantees that progress will be made, it doesn't guarantee that all 100 threads will constantly be making progress. With a lock-free queue, for example, the 'write cursor' is shared between all 100 threads, and they're all competing to access it. This causes a ridiculous amount of cache contention, which slows everything down, and you still have to deal with traditional issues like starvation (e.g. where one particular thread's CAS operations constantly fail while other threads hog the queue).
The solution to these kinds of bad designs is often to remove the shared resources instead of micro-optimizing them.

> Things like lock-free or wait-free data structures are both extremely difficult to design, highly dependent on the implementation of the specific platform, and in many cases wholly unnecessary from a performance standpoint.

Wait-free is ridiculously easy in comparison.
Take the previous example where 100 threads are all trying to output thousands of results to a queue. By simply giving each thread its own queue, and delaying the consumption of the results until all producers are complete, you satisfy the wait-freedom requirement that no thread impede the progress of any other (at least during the "each thread" portion).
```cpp
//init:
std::vector<int> outputs[100];  // one private queue per producer thread
Atomic complete = 0;

//each thread (producer #t):
for( int i=0; i!=1000; ++i )
    outputs[t].push_back( Compute(i) ); // private queue -- no contention, no waiting
++complete;//atomic increment - implementation should include release fence
Consume();

//Consume:
WaitUntil( Equal(complete,100) ); // conceptually -- while(AtomicRead(complete)!=100) Sleep();
for( int t=0; t!=100; ++t )
    for( size_t i=0; i!=outputs[t].size(); ++i )
        printf( "%d ", outputs[t][i] );
```

Also, yes, lock-free structures are very hard to design; however, IMHO a lot of academic effort in this area has been completely wasted on trying to shoehorn non-parallel ideas like the doubly-linked list into lock-free versions. If you attack more sensible and useful problems, then the complexity is nowhere near that of these useless "general case" (pejorative quote marks) structures.
