_Sigma

MultiThreading, parallel computing questions


I'm looking into writing some scientific research models that take advantage of multiple cores by being highly threaded / potentially distributed. However, I'm a little lost and overwhelmed (in regards to the multithreaded part), so if anyone can help me out, it would be much appreciated!

I've been doing some research, and it appears that there are three main threading libraries: Boost threads, OpenMP, and Intel TBB. I'm a little confused as to the exact differences, as well as the pros and cons, between the three. I plan on developing primarily on Windows; however, being platform independent is a necessity. As well, since this is academic, open source + free/cheap is also needed. So if anyone could recommend one of the three (and tell me why!), or something else, it would be great. I'm currently leaning toward Intel's, only because I kinda get what is going on with it.

Also, can anyone recommend any good books on multithreading and parallel computing, including how to break down traditional algorithms into threaded ones?

Thanks all!

[Edited by - _Sigma on February 20, 2008 2:26:39 PM]

They don't do the same things. OpenMP is an API that can help in achieving task-level parallelism. It's fairly high-level, a lot of it is automated, and it requires compiler support. But from what I've seen, it isn't particularly effective when applied to games.

Boost threads, for all that it is, is just a cross-platform wrapper around the conventional idea of a thread. Using boost::thread you can spawn threads of execution in a platform-agnostic manner. It also gives you some basic synchronisation primitives to handle inter-thread communication. It doesn't get much more low-level than that.
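
For a flavour of it, here's a minimal sketch (assuming an old-style Boost.Thread; the worker function and its argument are invented for illustration):

```cpp
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/bind.hpp>
#include <iostream>

boost::mutex io_mutex; // one of Boost's basic synchronisation primitives

void worker(int id)
{
    // scoped_lock acquires the mutex and releases it when it leaves scope
    boost::mutex::scoped_lock lock(io_mutex);
    std::cout << "hello from thread " << id << std::endl;
}

int main()
{
    // spawn two threads of execution, platform-agnostically
    boost::thread t1(boost::bind(&worker, 1));
    boost::thread t2(boost::bind(&worker, 2));
    t1.join(); // wait for both to finish before exiting
    t2.join();
    return 0;
}
```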

I've never heard of or used Intel TBB, so I can't really comment on it.

Boost threads are just a cross-platform wrapper for the different OSes' threading implementations (see also pthreads for a portable threading solution).

OpenMP is a language/compiler extension that allows you to use threads automagically, such as writing a for-loop and telling the compiler that it can execute iterations concurrently. It does all the threading behind the scenes for you. I've seen some people recommend this as a prototyping tool to use in conjunction with profiling to find areas of your program that will benefit from concurrency.
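
Roughly, the idiom looks like this (a sketch only; compile with -fopenmp or /openmp, and the array and loop body are made up for illustration):

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000000;
    std::vector<double> a(n);
    double sum = 0.0;

    // One pragma tells the compiler the iterations are independent;
    // it creates the threads and divides the index range behind the scenes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        a[i] = i * 0.5;
        sum += a[i];
    }

    std::printf("sum = %g, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```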

Intel TBB is only free for non-commercial use, so if you're only doing research it should be ok.

I haven't actually used any of these 3 tools, so someone please correct me if I'm wrong ;)

Quote:
Original post by _Sigma
I'm looking into writing some scientific research models that take advantage of multiple cores by being highly threaded / potentially distributed.


Look at MPI for distributed parallelism. I think Boost has a C++ MPI wrapper, but I don't know what it's like to use; probably nicer than the native C interface, I'd expect.
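
For a taste of that native C interface, a minimal sketch (every process runs the same binary and picks its work by rank):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size); // how many of us are there?

    // Each process picks its slice of the problem by rank; the same
    // program runs unchanged on one multi-core box or across a cluster.
    std::printf("process %d of %d starting work\n", rank, size);

    MPI_Finalize();
    return 0;
}
```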

The other alternative, which you should seriously consider/investigate if possible, is to use a language other than C++. C++ doesn't even have a multithreaded memory model (yet).

There have been a bunch of threads recently on reading recommendations. You might want to have a look for those.

Quote:
There have been a bunch of threads recently on reading recommendations. You might want to have a look for those.

Do you know of any off the top of your head?

Thanks for all the replies. Does anyone see anything wrong with using Intel's TBB then?

Found this one.

There are others.

Quote:
Thanks for all the replies. Does anyone see anything wrong with using Intel's TBB then?


It doesn't have any mechanism to handle exceptions, which I find bothersome. You have to be very careful that each chunk of work does not throw.
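
To illustrate, here is a sketch of the defensive pattern that forces on you (the ApplyModel kernel is hypothetical, and this assumes the 2008-era TBB headers):

```cpp
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>
#include <cmath>

// Hypothetical kernel: each chunk of work must not let an exception
// escape, because TBB has no mechanism to carry it back to the caller.
struct ApplyModel
{
    double* data;
    void operator()(const tbb::blocked_range<size_t>& r) const
    {
        try
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] = std::sqrt(data[i]);
        }
        catch (...)
        {
            // record the failure in some thread-safe way; never rethrow here
        }
    }
};

int main()
{
    tbb::task_scheduler_init init; // must exist before using TBB algorithms
    std::vector<double> v(1000000, 2.0);
    ApplyModel body = { &v[0] };
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size(), 10000), body);
    return 0;
}
```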

OpenMP is pretty cool; for what you're doing it could be a good option. I wrote an article on it here.

However, you need to decide early on whether you want to pursue multithreaded or distributed solutions. I think there is an OMP-like distributed library somewhere.

Quote:
Original post by d000hg
However, you need to decide early on whether you want to pursue multithreaded or distributed solutions. I think there is an OMP-like distributed library somewhere.


Why? They're essentially the same thing. Is it because OpenMP only works with threads?

Quote:
Original post by _Sigma
I'm looking into writing some scientific research models that take advantage of multiple cores by being highly threaded / potentially distributed. However, I'm a little lost and overwhelmed, so if anyone can help me out, it would be much appreciated!

[caution] Generally, writing this type of thing is done by topical experts who have already written such models, or under the direction of someone who has already written in a related field (such as a graduate-school professor). It can take significant research on your part before you become expert enough on a topic to write such models. Also, part of 'scientific' involves peer review, which, left undone, can completely shred the reputation of a novice or upcoming professional who lacks the required experience and expertise.

Feeling lost and overwhelmed is probably a good sign that you are not expert enough to write them. For their first real research projects, graduate students generally feel capable and knowledgeable about the topic yet slightly wary of actually putting it out for public consumption.

Your choices for that list of "3 main threading libraries" show just how little you know about the field. Most parallel scientific computing models involve networks of machines. A single PC isn't going to do serious weather forecasting, analyze sea floor sonar scans for oil beds, review your medical scans searching for cancer, render next year's new animated movies, or perform most other serious real-world tasks.

Consequently, relatively few real-world scientific parallel computing models take advantage of those specific technologies except to help performance on the individual machine rather than the core algorithm. The main libraries for C and C++ based applications are PVM and MPI, with some other message passing libraries thrown in for flavor.

Fairly new topics include GPGPU programming for relatively small data-parallel programs on a single computer, but those are not really used in industry at present. Note that even GPGPU is being seen as an optimization for the individual machines in a cluster, not a replacement of the cluster.

Quote:
Original post by _Sigma
Also, can anyone recommend any good books on multithreading and parallel computing, including how to break down traditional algorithms into threaded ones?

A good (although slightly out of date) book is "Designing and Building Parallel Programs", available in dead-tree edition, and also generously made freely available by the author here. The book covers the major method of breaking down algorithms and rebuilding them in parallel ways.

Many other good books (I'm not going to list them, do some research on your own!) are available in dead-tree form.

Quote:
Original post by the_edd
The other alternative, which you should seriously consider/investigate if possible, is to use a language other than C++.

Here is a 10-year-old list of parallel languages. That list didn't pretend to be complete even back then, and many new languages have cropped up since, but it does cover the major ones. New languages are based on those.

I got the impression that the original poster may well be an academic who has legitimate data and will go through the relevant peer review channels, but has heard that parallel computing is the most efficient way to do this. So the fact that he doesn't know much about distributed computing doesn't necessarily reflect on the legitimacy of what he wants to do.

_Sigma, perhaps you could elaborate on (a) the nature and amount of the data you're expecting to have, and (b) the type of algorithm you will need to run on it (ignoring threading/distributed concerns).

I've used all three. As mentioned, they serve different purposes. I have a post about OpenMP vs. TBB on my blog, which is in my profile, if you want more information.

Boost.Threads is nice for task-based threading. In particular, its mutex-style classes and condition variables are great. TBB provides similar facilities, but TBB can be configured to use your standard Windows threads, giving it full compatibility with Boost's thread constructs. In addition, TBB gives you some concurrent containers and iterator-style loops that it can break up for you; it has a lot of useful facilities.

I've also used OpenMP. OpenMP is the simplest option if you have very simple loops that you're looking to break up. For the more complex C++-style cases, I haven't found OpenMP as useful as TBB, especially given that throwing an exception out of an OpenMP thread = death with no chance of recovery. One more thing to note is that mixing OpenMP threads with OS threads/boost threads/TBB carries the risk of running too many threads at once, since the schedulers don't know about each other.
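
For example, a sketch of mixing the two (the crunch function is invented for illustration, and the headers assume the TBB and Boost of the time):

```cpp
#include "tbb/concurrent_vector.h"
#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>

// TBB's containers don't care who created the calling thread, so they
// mix cleanly with boost::thread (or raw OS threads).
tbb::concurrent_vector<double> results;

void crunch(int chunk)
{
    double value = chunk * 2.0; // stand-in for real work
    results.push_back(value);   // concurrent push_back, no explicit lock
}

int main()
{
    boost::thread a(boost::bind(&crunch, 1));
    boost::thread b(boost::bind(&crunch, 2));
    a.join();
    b.join();
    return results.size() == 2 ? 0 : 1;
}
```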

Overall, I prefer to use Boost.Threads with TBB. OpenMP isn't bad, though, and it is standard, which is a good thing, so you shouldn't necessarily discount it. Just remember that OpenMP's focus isn't C++, even though C++ is supported.

Also, TBB is commercial, as noted. I believe it's $300 per developer for the commercial license.

Rydinare, thank you very much for your reply + blog. Has helped lots.
Quote:
Also, TBB is commercial, as noted. I believe it's $300 per developer for the commercial license.

However, there is a free open source version...is there a reason it wouldn't work for me?

Quote:
frob
...

Erm, I guess I didn't explain myself well? I feel like I just asked how to write a climate model running on my uber-l33t Pentium D.... O_o

Quote:
Your choices for that list of "3 main threading libraries" show just how little you know about the field

No shit, Sherlock. That is why I'm asking here. I didn't realize that knowing the answer to one's question was a prerequisite for posting on these forums.


Quote:
I got the impression that the original poster may well be an academic who has legitimate data and will go through the relevant peer review channels, but has heard that parallel computing is the most efficient way to do this. So the fact that he doesn't know much about distributed computing doesn't necessarily reflect on the legitimacy of what he wants to do.


I'm currently finishing up my undergrad, which is a degree specifically aimed at building a strong background for modelling physical processes. It is a combination of Physical Geography, Computer Science and Physics. After this, I will be actively looking into a master's degree.

This last summer I worked with a model to investigate the effects that increased shrub density has on end-of-winter snow-water-equivalents in the Mackenzie Delta. I presented a poster at the IUGG conference in Perugia, Italy. I'm currently working on testing a blowing snow model implementation: comparing it to satellite data and observations, and running parameter sensitivity analyses. In addition, I've implemented a radiation balance model to look at the diurnal patterns in glacial melt, as well as a basic geometry-rules-based wind model, for two undergraduate projects.

So whilst I'm not a PhD student, I do have experience working with models, as well as writing my own. I'm well aware of the peer review that is done, as well as the extreme complexities involved.

In regards to data, I am (was) working with $70,000 LiDAR DEM data sets, as well as other satellite imagery, etc. Perhaps that will stand as testament to the fact that I'm not just dicking around in my basement pretending I'm writing weather models.

Quote:
A single PC isn't going to do serious weather forecasting, analyze sea floor sonar scans for oil beds, review your medical scans searching for cancer, render next year's new animated movies, or perform most other serious real-world tasks.

Yes, I know. However, you can still do some very cool, and still meaningful, modelling experiments on a single computer. You don't need a supercomputer to investigate real-world phenomena.

After working last summer with that model (the one I presented in Perugia) I became completely disheartened with the quality of the model. In short, it was shit. It would randomly crash, it was slow, etc. It was the type of thing that wouldn't pass a 2nd year CS course. After talking to others who I was working with, it appeared that models like this were standard in my field (hydrology, snow research, stuff like that). I raised concern with my supervisor who pretty much told me to get used to it.

The people writing these models have NO CS background, they are all self taught. They lack fundamental CS skills. For example, the model I'm currently working with is developed by one fellow who doesn't have formal CS training. Therefore the model is:

1) Not under source control
2) Has no external and little internal documentation
3) Coded on the fly; there are no design docs, etc.
4) Tied to a dead compiler that, if you can even find it, costs $600
5) Does too much, and (poorly) reinvents the wheel extensively

As well, it crashes randomly and has quirks out the wazoo. I am honestly sceptical of the implementation of portions of the model (given what I've seen) and am unsure if I can trust its results.

Oh, did I mention it was closed source? Yeah, I can't even look at the code.

So I see this mess and go "I can do better".

Some models (such as the one I'm working with) are moving to what are called HRUs: Hydrological Response Units. Instead of having a 2 m DEM and working in a gridded fashion, they break the landscape up into areas that the author feels will all respond the same way to an event. The computations are very simple, but you lose the effects that topography has on the processes. You can break the landscape up into many more HRUs, but you still lose out on topography. You get a faster model at the sacrifice of being able to use topography...

I see no reason why a highly threaded, distributed model can't be done to surpass (in my mind) the clunky notion of a HRU. I'm under the impression that it is because of the extreme lack of CS skills in the field that no one really does it.

Therefore, I wish to start learning multithreaded/distributed programming (which I don't have any experience with) so I can take advantage of multithreaded platforms.

Quote:
Sigma, perhaps you could elaborate on (a) the nature and amount of the data you're expecting to have, and (b) the type of algorithm you will need to run on it (ignoring threading/distributed concerns)

Data can be 1000x1000 (or more) matrices.
The algorithms really depend on the model so it is difficult to comment without any specific cases...

I hope the above has further explained my point and goals...

Quote:
Found this one.

heh, yeah I actually posted there.

Quote:
I see no reason why a highly threaded, distributed model can't be done to surpass (in my mind) the clunky notion of a HRU. I'm under the impression that it is because of the extreme lack of CS skills in the field that no one really does it.


Academia. CS is considered no better than carpentry skills there. In general, people will argue that you don't need any CS knowledge: they did it themselves, so why would they need some useless guy in their lab writing code they cannot understand? It doesn't do anything more than their code did, and it has the huge disadvantage that it's obscure, requires strange libraries, and is too complex.

What you're going up against here is insurmountable resistance that you won't break. That's the way scientific community works. I'm certain that half of these models run on Fortran interpreters using code that was transcribed from punch cards.

Writing code in academic circles is considered a technical job. It falls right there with assembling a cabinet, replacing a light bulb or plugging in an electrical device.

Lastly, there's the ego thing. Propose to high-profile researchers that they re-use code, and they'll find huge problems with other people's code. For example, files will not be capitalized, something that is of crucial importance. So they'll rewrite everything themselves.

If you're developing this for yourself, and can apply it, then it'll be fine. But if you're hoping to revolutionize the way this research works, you're in for a rude awakening. You're most likely the only one who will ever see this code, no matter how many places you publish it in.

IMHO, I wouldn't try solving problems that don't need to be solved. Look into existing clustering solutions and study the concepts of concurrency so that you can split the model appropriately. This will give you most effect for minimal effort.

Quote:
If you're developing this for yourself, and can apply it, then it'll be fine.

Yes, this would be my immediate goal.
Quote:
Look into existing clustering solutions and study the concepts of concurrency so that you can split the model appropriately

Thanks for the link.
Can each Beowulf node take advantage of multithreaded stuff?

Quote:
Original post by _Sigma

Can each Beowulf node take advantage of multithreaded stuff?


I believe it's one process per core. It doesn't matter though; it's MPI-based and designed for arbitrary sizes.

For computationally heavy stuff, threads are undesirable since they decrease performance. For highest throughput, split the input data set into disjoint sets, then have individual processes work on each one. The highest performance and simplest design come from independent processes (either on multiple cores or across a network cluster), with each process getting full CPU time.

Threads help when you need responsive applications and where tasks are small and the cost of IPC is high. For computationally heavy applications, neither of these is true if the problem set is correctly defined and parallelized.

I have some experience in the field with large-scale clusters, though not that much experience with threading beyond working on a multi-threaded server and some OMP projects.

MPI will get the job done, but I found that you end up spending most of the time getting the message-passing timing correct and trying to balance load evenly. Then I did some work in Charm++; at first I hated it, as it is relatively young and rough around the edges. But the more I used it, and then tried to go back to MPI, the more I appreciated the object-oriented aspects of the language. A properly configured Charm installation on your cluster will automatically load-balance all the Charm objects (I think they had some weird name for those) as well. This freed me from some of those details and allowed me to concentrate on the app itself a little more.

I warn you though, Charm++ is a research language, but I have seen it in the wild.

As for my thoughts on OMP: I felt that while the threading pragmas can be useful for small improvements here and there, they don't remove the need to design your program with parallelism in mind to get the biggest performance improvement. Of course, we can only wish that every problem were embarrassingly parallel =)

Metalstorm, thanks. I'll look into that.

d000hg. That is fantastic. Cheers.

So does it seem like a reasonable approach to begin by learning basic multithreading for a SINGLE computer, get the hang of that, and then move to distributed programming?

Starting with TBB seems good, as I found this book.

Does this seem reasonable?

Quote:
However, you need to decide early on whether you want to pursue multithreaded or distributed solutions. I think there is an OMP-like distributed library somewhere.

To echo Kylotan, why is that?

I really appreciate all the replies here, exactly what I was looking for!

I wouldn't say that you have to choose whether to go distributed or just locally threaded. I think the most helpful thing to do when beginning to learn multi-threaded/multi-process programming is to understand the design of such programs rather than the specific library you end up using.

If you don't have a good grasp of the critical sections, semaphores, deadlocks, race conditions, and so on that show up in parallel computing, they can cause a much larger headache than just learning the library.
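
A classic illustration of a race condition, sketched here with boost::thread (the iteration counts are arbitrary):

```cpp
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <iostream>

long counter = 0;
boost::mutex counter_mutex;

void increment_racy()
{
    for (int i = 0; i < 100000; ++i)
        ++counter; // unguarded read-modify-write: updates can be lost
}

void increment_guarded()
{
    for (int i = 0; i < 100000; ++i)
    {
        boost::mutex::scoped_lock lock(counter_mutex); // critical section
        ++counter;
    }
}

int main()
{
    boost::thread a(&increment_racy);
    boost::thread b(&increment_racy);
    a.join();
    b.join();
    // On a multi-core machine this frequently prints less than 200000.
    // Swap in increment_guarded and it always prints exactly 200000.
    std::cout << counter << std::endl;
    return 0;
}
```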

I think the library comes second since each implementation has its advantages and disadvantages. You will probably end up learning a few different libraries and picking between them for which one works best.

Anyway, hope you enjoy the process of learning parallel computing, _sigma. It's really quite satisfying the first time you see an amazing speedup over the linear version of your program.

Quote:
Original post by _Sigma
Rydinare, thank you very much for your reply + blog. Has helped lots.
Quote:
Also, TBB is commercial, as noted. I believe it's $300 per developer for the commercial license.

However, there is a free open source version...is there a reason it wouldn't work for me?


If you're not doing commercial work, the open source version should be fine.

I'm glad you found my blog and post useful. Let me know if you have any more questions. [smile]

Quote:
Original post by Metalstorm
If you don't have a good grasp of the critical sections, semaphores, deadlocks, race conditions, and so on that show up in parallel computing, they can cause a much larger headache than just learning the library.

Do you know of a book that covers these topics in detail?

Quote:
Let me know if you have any more questions.

Will do. Many thanks.

_Sigma, I'm sorry you took my post the wrong way. We get many kids in here who barely understand what a matrix is and have only a basic grasp of middle-school algebra, and then ask questions that are substantially similar to your original post. It is good to know that you will at least have some reviews of your work.
Quote:
Original post by _Sigma
After working last summer with that model (the one I presented in Perugia) I became completely disheartened with the quality of the model. In short, it was shit. It would randomly crash, it was slow, etc. It was the type of thing that wouldn't pass a 2nd year CS course. After talking to others who I was working with, it appeared that models like this were standard in my field (hydrology, snow research, stuff like that). I raised concern with my supervisor who pretty much told me to get used to it.

The people writing these models have NO CS background, they are all self taught. They lack fundamental CS skills. For example, the model I'm currently working with is developed by one fellow who doesn't have formal CS training.

As was mentioned by others, that is par for the course in academia.

While in graduate studies I worked with several multi-gigabyte terrain models (which were expensive in pre-DVD days). These involved both regular DEMs and TINs, overlaid with DOQs and some stitched aerial photos. The tools were written by the individual students, and generally sucked. The research focused on terrain registration and decomposition of various alpine terrains.

I quit the lab while still in school and started work at a real company. They had real tools, real programmers, and real data with multiple terabytes spread across multiple servers. Tools were powerful and fast. Code was shared, reviewed, profiled, optimized, profiled, optimized, and profiled some more.

Even the worst tools at that job were better than the best tools at school -- and that's coming from a school that was (and still is) recognized in the top 25 schools in the field (in the US).


Also after I joined the real world, I discovered that a Master's thesis is roughly one month of "real world" work, and a PhD thesis is just over two months. That's all it takes if you have the proper tools. (I was admitted to the PhD program, but the masters program was sufficient for my game-programming goals.)




So getting back on topic to your concerns...

The online book I mentioned in my earlier post (DBPP) is a fairly short read and can help you split your algorithms up nicely. Once broken down, it is usually trivial to compose the pieces into a very scalable design.

If you are on a single multi-core computer, you can configure MPI to run as though each core were a separate node. If you are on a small cluster, MPI runs naturally enough in that environment. And of course, it runs just fine with a cluster size of 1.

Consequently, if you don't bother with any multithreading on your own, MPI can run as a single thread, or multiple threads on a single machine, or on a single thread per machine on a cluster, or even run one instance per core on every machine. Your program won't need to know or care about the difference. It is quite versatile.

If you were in industry there are many more optimizations I would recommend, but they are really unnecessary in academia. Those improvements range from simple cases, such as finding ways to approximate the first iterations of your solution for minimal cost and pre-populating your dataset with estimated values, to details like re-ordering your loops to take advantage of the CPU's cache lines, and restructuring your data to minimize cache misses and continuously stream data to the processor.





If your school doesn't have a license, you should strongly and loudly pitch a case to your department head about getting a license for Intel's Cluster Toolkit.

The toolkit includes a fast MPI implementation that works well on all the configurations described above. It also includes their Math Kernel Library, which can handle your million-element arrays like a child eating candy. Those engineers at Intel actually know that you can allocate a 4MB block as a single data page rather than the 1000 or more data pages that your academic programs probably use. Their math libraries are already written to take advantage of multi-processor computers, so you would only need one instance running per machine. Finally, it includes excellent profilers and tuning tools to help you find and fix the more obvious bottlenecks.
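
As a rough sketch of what calling MKL's BLAS looks like (this assumes the standard CBLAS interface exposed through mkl.h, and linking against MKL's threaded libraries):

```cpp
#include <mkl.h>   // assumption: MKL's umbrella header exposing CBLAS
#include <vector>

int main()
{
    const int n = 1000; // matrices of the size mentioned in this thread
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0*A*B + 0.0*C. MKL's dgemm is internally threaded, so one
    // process per machine can still saturate every core.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, &A[0], n,
                &B[0], n,
                0.0, &C[0], n);
    return 0;
}
```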

I would avoid using the other threading libraries (even Intel's TBB) mainly because they add complexity and really only serve as a minor optimization to the larger problem.

Quote:
I see no reason why a highly threaded, distributed model can't be done to surpass (in my mind) the clunky notion of a HRU. I'm under the impression that it is because of the extreme lack of CS skills in the field that no one really does it.

As far as HRUs are concerned, in the actual industry they work quite well with most other GIS solutions. Of course, you aren't likely to find any of those in academia either. GIS solutions work with contour regions and irregular networks, and doing so allows them to outperform raster grid solutions in terms of both size and speed. It is quite natural for that kind of data to be merged into regions to gain significant performance benefits. You might want to look into that before deciding it is a problem that needs fixing.


Oh, and incidentally, you might want to talk to some people at a local hydrology lab or even a big civil engineering firm, as they can probably instruct you better than your professors on this topic. As for your research, ask the industry professionals what they think would make a good two-week project. That's more than enough to satisfy most undergraduate requirements.

Quote:
Original post by bakery2k1
Quote:
Original post by Rydinare
If you're not doing commercial work, the open source version should be fine.


Is there any reason one can't use the open source version for "commercial work"?

Intel TBB is licensed under the GPL. The GPL is a viral license, meaning that any derivative works must also be made GPL. This also applies to "linking" against a GPL'd library.

I believe that in this case, Intel's TBB includes an explicit clause that grants an exemption to this, which means that yes, you are allowed to freely use the open source Intel TBB library without having to make your code GPL.

I'm not 100% certain on this, though, so you should check before doing anything.

Quote:
Original post by _Sigma
So does it seem like a reasonable approach to begin by learning basic multithreading for a SINGLE computer, get the hang of that, and then move to distributed programming?

[...]

Quote:
However, you need to decide early on whether you want to pursue multithreaded or distributed solutions. I think there is an OMP-like distributed library somewhere.

To echo Kylotan, why is that?


In theoretical terms, concurrent and distributed computing are one and the same. You have multiple processes running concurrently and multiple resources they need to access, and communication between processes either directly via messages or indirectly via the resources. For learning purposes, it helps to approach it like this, rather than getting bogged down in how threads are handled by CPUs, and so on.

However, in implementation terms, some approaches are specialised for one or the other, so you have to be careful. Generally, a distributed approach will work perfectly on one machine with several cores, as it makes no assumptions about local resources. However, a multi-threaded approach will not scale to multiple machines, because it makes several assumptions, such as (a) zero-latency access to, and instant responses from, implicitly shared resources (i.e. memory), and (b) the availability of synchronisation primitives for guarding that shared state which can be used instantly.

Therefore, I'd say no, skip the multi-threading, and just go for a multi-process approach. That will work just as well on 2 single core PCs as on 1 dual-core PC, which won't be the case with a threaded model.

If at all possible, the best performing model will look like this:

1) Split data up into areas that don't relate to each other at all.
2) Send it off to multiple systems
3.1, 3.2, ... 3.n) Systems all spend some time crunching their batch of data
4.1, 4.2, ... 4.n) Systems send data back to be aggregated
5) Data is aggregated as necessary.

Generally it's better to waste a little bit of time by 'unnecessarily' duplicating data in step 1 in order to send it off to different places, than it is to have people explicitly share data and have to negotiate over exclusive access to it. However, how easy this is depends on the sort of algorithm you're using, which is why I asked about it. An algorithm that simulates the effects of aging on many people will scale well, because each person can be done in isolation, but an algorithm that simulates a workplace of even far fewer people will scale less well, because of all the interactions between them.
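
To make that concrete, here is a sketch of steps 1-5 in MPI terms (the slice size and the 'crunching' are placeholders):

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_proc = 250000; // placeholder slice size
    std::vector<double> full;
    if (rank == 0)
        full.assign(per_proc * size, 1.0); // step 1: root holds all the data

    // Step 2: hand each process its own disjoint slice.
    std::vector<double> mine(per_proc);
    MPI_Scatter(rank == 0 ? &full[0] : 0, per_proc, MPI_DOUBLE,
                &mine[0], per_proc, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Steps 3.1..3.n: everyone crunches independently (placeholder work).
    for (int i = 0; i < per_proc; ++i)
        mine[i] *= 2.0;

    // Steps 4.x and 5: results flow back to the root for aggregation.
    MPI_Gather(&mine[0], per_proc, MPI_DOUBLE,
               rank == 0 ? &full[0] : 0, per_proc, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```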

[Edited by - Kylotan on February 21, 2008 7:39:07 AM]

