Multi-Core Technology

Started by
4 comments, last by arosmagic 11 years, 7 months ago
There doesn't appear to be a hardware board, and this is pretty much a hardware question about the business side of hardware, but it relates to me via lockless programming.

I've been really interested in lockless programming for a long time now. I haven't actually tried it because I haven't had a reason to use it yet, but today I learnt about the Parallella project, and I reckon it's a pretty cool idea to try to create an open-source hardware framework for parallel processing. I worry, though, that the technological solutions that allow programmers to get the most out of multi-core processors may be licensed. Such as the PowerPC's particular read/write ordering that allows a thread to read the value that was written while the write op is still on its way to the cache (if I remember the talk correctly). As for proprietary barriers, take 3D printing, which is something that has opened up thanks partly to some patents expiring.

So what I'm interested to know is: does anyone know about the business side of parallel computing? Are we in a situation where proprietary technology is the only technology that is going to give programmers the best capabilities for squeezing the most out of these platforms, or what is the future of this technology? Is buying myself one of these Parallella boards going to let me play around with some ARM and RISC cores locklessly, just so I can feel awesome? Are we going to have the next 20+ years of Intel/AMD ruling the parallel processor market because they got in on the ground early and solved a bunch of early technological problems?

EDIT: Just realised this could fit in the business thread but I don't know if anyone other than programmers would know this.
I say Code! You say Build! Code! Build! Code! Build! Can I get a woop-woop? Woop! Woop!
The reason that Intel [and to a significantly lesser extent, AMD] are dominating the multicore domain with 4, 6, and 8 core processors, while things like the Parallella board, or the Tilera [looks pretty much identical to Parallella, and has been established for a really long time], or any of a dozen others with 100 cores are niche designs, has very little to do with legal restrictions, and a whole lot to do with a sea of invested resources on Intel's part and the lack of appeal to developers who are still effectively getting the performance they need out of a single core. Outside of game dev and embedded system design, very few people are willing to sweat and bleed for that last cycle in a pervasive way. Things like lock-free or wait-free data structures are extremely difficult to design, highly dependent on the implementation of the specific platform, and in many cases wholly unnecessary from a performance standpoint. These platforms have their place, but it isn't in consumer material [yet].

If you have a single queue that a hundred threads read and write from hundreds of times each every millisecond, then a lock-free design starts to look attractive. For most applications, though, a locking design is 'good enough': it is portable because there are standardized libraries, it is easy to work with and easy to verify because there exist language-level primitives, and it is something your average programmer is familiar with and thus comfortable working with. Lock-free designs look GREAT in micro-benchmarks where you have N threads that just read and write from a shared stack T times. They look significantly less meaningful when those reads and writes are separated by computation of non-trivial complexity, which tends to be the case in actual use [the difference being that the computation between the accesses dominates, and the 2-3X advantage a lock-free design has over a lock-based one is swallowed up, since the synchronization now constitutes only 1% of the computation time].
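To make the 'good enough' locking design concrete, here's a minimal sketch of a standard container wrapped in a mutex, using only C++ standard-library primitives (the class name is illustrative, not from any particular library):

```cpp
#include <mutex>
#include <optional>
#include <queue>

// A minimal mutex-guarded queue: portable, trivial to verify, and
// fast enough when non-trivial work happens between accesses.
template <typename T>
class LockingQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(value));
    }

    // Returns std::nullopt when the queue is empty.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }

private:
    std::mutex m_;
    std::queue<T> q_;
};
```

Every operation takes the lock for only a few instructions, so unless the queue really is hammered by a hundred threads per millisecond, contention stays negligible.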

Lock-free designs are ugly and highly specialized. Ugly and specialized to the degree that work has been done on procedurally generating, from a specification, lock-free designs that can be formally proven correct [the specification could itself be a locking design, which leads to the question: why bother?].

All this is layered on top of the fact that platforms like Parallella and Tilera aren't going to get performance out of shared-memory synchronization anyway. Message-passing is going to be THE way to get performance on these platforms.

Designs like this only end up being a lot more interesting in server markets or on supercomputing platforms. Even in those markets, though, there are a lot of reasons why you have a hard time seeing platforms like this crop up, and they have nothing to do with legality [ease of programming, tool support, lack of benchmarks that quantify energy savings, the fact that things like FPGAs still put the performance-per-watt stats shown by any CPU to shame, etc.]. You see SOME things like this, but very few in practice.

1. What I'm interested to know is: does anyone know about the business side of parallel computing? Are we in a situation where proprietary technology is the only technology that is going to give programmers the best capabilities for squeezing the most out of these platforms, or what is the future of this technology?
2. Is buying myself one of these Parallella boards going to let me play around with some ARM and RISC cores locklessly, just so I can feel awesome?
3. Are we going to have the next 20+ years of Intel/AMD ruling the parallel processor market because they got in on the ground early and solved a bunch of early technological problems?


1. Yes and no. The way most people use PCs --- to surf the web, answer email, and write documents --- does not benefit from parallel processing. Relatively few industries need parallel computing. Games use it for rendering and physics and such. High-performance scientific computing needs it. High-performance business computing needs it. These industries can afford to use proprietary technologies because they are either directly or indirectly in the business of writing custom software.

2. If that is what it takes to make you feel awesome, you may consider re-evaluating your life.

3. Probably, but it is more from the network effect. 2008 saw a major milestone: one billion PCs in use globally. Any new player in the server market or the desktop market must be compatible with the existing ecosystem. It is entirely possible that another company could end up creating their own chipsets, marketing them, and becoming competitive with AMD and Intel's current PC lineup, but I wouldn't bet on it. Even Apple has given up and moved to the x86 architecture.
Thanks oolala, that was a nice assessment of the usefulness of this stuff that I hadn't read anywhere else; most places are more concerned with technical details than design considerations.

frob: I try to feel awesome in lots of little things; tackling a new programming challenge with some extremely specialized hardware is one of those things that would add just a little bit of awesome to my day. Please hold back the insults about my lifestyle in the future, thanks.

So from what I can tell, parallel control boards are only useful for SIMD-type applications, and the majority of code in most applications isn't like that. And in terms of code architecture there basically aren't any benefits to be had from lock-free programming, more a loss to be suffered; and code architecture is somewhat more important than the possibility of having to cut back on features to meet a CPU budget.

Thanks guys. I'd still like to grab one of these and play around with it; I've been thinking it'll be useful for video processing, like processing the data from a webcam for touch detection in a DIY multi-touch PC. These things might be able to parse through that data a bit more easily than a processor running one thread in a brute-force fashion could.

EDIT: oh god, I sound like I'm marketing for this crap: it is utterly useless except for specialized SIMD, and you can probably get a cheap graphics card with OpenCL support and program that more easily than you could this. And as for the code being a market product, at least graphics cards are pretty standard.
Such as the PowerPC's particular read/write ordering that allows a thread to read the value that was written while the write op is still on its way to the cache (if I remember the talk correctly).
When writing C/C++ code for any architecture, you should assume that your reads/writes do not necessarily take place in the order that you've written them. They can occur behind the scenes in any order as long as the behaviour of the (single-threaded) program is still the same as if they did take place in the order you specified.
To ensure that your program is well behaved when using multiple threads, you need to use memory fences, which act as a barrier to reordering where reads and/or writes can't be moved across the barrier.
Your standard synchronisation objects like mutexes will use a memory fence internally, to ensure that operations performed inside the critical section are visible in RAM before the 'lock' is visibly released.

N.B. while x86 usually doesn't re-order reads/writes, this isn't always true, so you should still use memory fences (or standard synchronisation objects) at logical points, so that you're always doing work and then publishing it.
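The 'do work, then publish' pattern above can be sketched with C++11 atomics. This is a minimal sketch, assuming a C++11 compiler; the producer/consumer names are illustrative. The release store pairs with the acquire load to form the fence, so the plain write to payload can't be reordered past the flag:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain data, written before publishing
std::atomic<bool> ready(false);  // the 'publish' flag

void producer() {
    payload = 42;  // do the work first
    // Release fence: the write to payload is guaranteed visible
    // before any thread observes ready == true.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire pairs with the release above; spin until published.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // Here payload == 42 is guaranteed, even on weakly ordered CPUs.
}
```

A mutex-protected critical section gives you the same guarantee implicitly, which is why the locking version is so much easier to get right.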

So what I'm interested to know is: does anyone know about the business side of parallel computing? Are we in a situation where proprietary technology is the only technology that is going to give programmers the best capabilities for squeezing the most out of these platforms, or what is the future of this technology?
Not really. x86 has a compare-and-swap instruction, while PPC/ARM have load-linked/store-conditional instructions, but the latter can be used to build the former. So if you write an algorithm in terms of CAS (and as above, use appropriate memory fences) then it will be portable.
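As a sketch of what "written in terms of CAS" looks like, here is a portable atomic-maximum built on std::atomic's compare_exchange_weak, which compiles down to cmpxchg on x86 and an LL/SC retry loop on PPC/ARM (the function and variable names are illustrative):

```cpp
#include <atomic>

std::atomic<int> maxSeen(0);

// Record 'value' if it is larger than anything seen so far,
// without taking a lock.
void updateMax(int value) {
    int current = maxSeen.load(std::memory_order_relaxed);
    // Retry the CAS until it succeeds, or until some other thread
    // has already stored a value at least as large as ours.
    while (value > current &&
           !maxSeen.compare_exchange_weak(current, value)) {
        // On failure, compare_exchange_weak reloads 'current'
        // with the latest value, so the loop condition re-checks it.
    }
}
```

Because the algorithm is expressed purely through std::atomic, the same source runs unmodified on both strongly and weakly ordered hardware; the compiler emits whichever primitive the target provides.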

N.B. you've got to keep in mind what "lock-free" actually means. It's often taken to mean "I did this using raw CAS instead of mutexes!", but what it actually means is that if any one thread is arbitrarily put to sleep by the OS at any time, it will not impact the progress of all the other threads -- i.e. there will always be at least one thread that's able to make progress, regardless of the (lack of) progress in other threads.
When writing low-level 'thread-safe' structures using CAS, you can still come up with something that's more optimal than your standard mutex/etc, but still isn't strictly lock-free -- and this doesn't always matter (especially if you're writing on a real-time OS).
If you have a single queue that a hundred threads read and write from hundreds of times each every millisecond, then a lock-free design starts to look attractive.
IMHO, lock free doesn't always help too much here, because even though lock-free guarantees that progress will be made, it doesn't guarantee that all 100 threads will constantly be making progress. With a lock-free queue for example, the 'write cursor' is being shared between all 100 threads, and they're all competing to access it. This causes a ridiculous amount of cache-contention, which slows everything down, and you still have to deal with traditional issues like starvation (e.g. where one particular thread's CAS operations constantly fail, while other threads hog the queue).
The solution to these kinds of bad designs is often to remove the shared resources instead of micro-optimizing them.

Things like lock-free or wait-free data structures are both extremely difficult to design, highly dependant on the implementation of the specific platform, and in many cases wholly unnecessary from a performance stand point.
Wait-free is ridiculously easy in comparison.
Take the previous example where 100 threads are all trying to output thousands of results to a queue. By simply giving each thread its own queue, and delaying the consumption of the results until all producers are complete, you satisfy the wait-freedom requirement that no thread impede the progress of any other (at least during the "each thread" portion).
//init:
std::vector<int> outputs[100];
std::atomic<int> complete(0);

//each thread:
outputs[thisThread].reserve(100000);
for( int i=0; i!=100000; ++i )
    outputs[thisThread].push_back(i);
++complete; //atomic increment - implementation should include release fence
if( thisThread == 0 )
    Consume();

//Consume:
WaitUntil( Equal(complete,100) ); // conceptually -- while(complete.load()!=100) Sleep();
for( int thread=0; thread!=100; ++thread )
    for( int i=0; i!=(int)outputs[thread].size(); ++i )
        printf("%d ", outputs[thread][i] ); // print each thread's results in turn


Also, yes, lock-free structures are very hard to design; however, IMHO a lot of academic effort in this area has been completely wasted trying to shoehorn non-parallel ideas like the doubly-linked list into lock-free versions. If you attack more sensible and useful problems, the complexity is nowhere near as bad as that of these useless "general case" (pejorative quote marks) structures.
While the traditional von Neumann architecture can be considered canonical for single-processor computation, when it comes to parallel computers there is no such canonical model. There are so many ways you can design a parallel computer that different classes of algorithms are more optimal than others on any particular one. Here I'm talking in terms of real programs running on real machines rather than theoretical ones. Even so, no matter how clever you get with or without locks (even with hardware assist), you can't get around Amdahl's Law.

We, at Aros Magic, have figured out the hitherto intractable problem of decoding media files in parallel and are showcasing it as the world's fastest photo viewer. If the promotion of massively parallel computation interests you, please support us.

Yes, our technology is essentially lock-free :)

This topic is closed to new replies.
