
very general OpenMPI question


7 replies to this topic

#1 Lord_Vader   Members   -  Reputation: 125


Posted 06 March 2012 - 11:40 AM

Hello all,

I have a very general question about OpenMPI...

Let's say we have a program that is supposed to run on multiple processors on a distributed system.

I know that we can use OpenMPI to make a program run on multiple processors, and OpenMP if we want the program to run
on, let's say, a multicore processor (shared memory).

Just a very general question that I have :P

How is it possible to utilize all the cores of every processor in a distributed system?
Do we have to use both OpenMPI and OpenMP, or will OpenMPI suffice?

Thanx

#2 Rekai   Members   -  Reputation: 102


Posted 06 March 2012 - 12:12 PM

First question:
Usually distributed systems have a queue system such as Sun Grid Engine (SGE) or NetBatch that lets you describe tasks in terms of how many slots (cores) you require, and then runs them on the system. You just need to tell the manager to allocate all the slots for you.

Second question:
MPI will do. You can also use OpenMP, which is useful when the nodes of the system are shared-memory machines.

#3 Lord_Vader   Members   -  Reputation: 125


Posted 06 March 2012 - 12:30 PM

Thanks for the answer,

Actually I am working on a scientific application that can run on several nodes (let's say 3). Each node has several CPUs (12) and each CPU has 6 cores, but I am not 100% sure that all the cores of the CPUs are utilized...
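
A rough sketch of the core-count arithmetic for that hardware, assuming the figures quoted above (3 nodes, 12 CPUs per node, 6 cores per CPU) are accurate. The function name `layouts` and the three layout labels are made up for illustration and are not part of OpenMPI or OpenMP. It only shows how many MPI ranks and threads per rank would be needed to occupy every core exactly once.

```python
def layouts(nodes, sockets_per_node, cores_per_socket):
    """Return (ranks, threads_per_rank) pairs that each occupy every core once.

    All names here are hypothetical; this is bookkeeping, not an MPI API.
    """
    cores_per_node = sockets_per_node * cores_per_socket
    total_cores = nodes * cores_per_node
    return {
        "total_cores": total_cores,
        # Pure MPI: one single-threaded rank per core.
        "pure_mpi": (total_cores, 1),
        # Hybrid: one rank per CPU socket, one thread per core of that socket.
        "rank_per_socket": (nodes * sockets_per_node, cores_per_socket),
        # Hybrid: one rank per node, one thread per core of that node.
        "rank_per_node": (nodes, cores_per_node),
    }


if __name__ == "__main__":
    # Figures taken from the post: 3 nodes, 12 CPUs each, 6 cores per CPU.
    for name, value in layouts(3, 12, 6).items():
        print(name, value)
```

If the program has no threading and is launched with fewer ranks than the pure-MPI count (for example via the `-np` argument to `mpirun`), the remaining cores will sit idle.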

#4 Antheus   Members   -  Reputation: 2369


Posted 06 March 2012 - 12:32 PM

How is it possible to utilize all the cores of every processor in a distributed system?


Nobody knows. Really, not even Google or supercomputer designers. If you name a problem and hardware layout/topology, then it's possible to examine past research and find a suitable algorithm, if one exists and was published.

Distributed programming is an unanswered question as of right now. We have good experience with specific designs for specific tasks, but no general answer.

If latency or consistency is important, there are currently hard limits on what can be done and costs tend to be prohibitive, so those are the first to go.

Hardware also has a huge impact. The relative speeds of network, memory and CPU, as well as their topology, determine the choice of algorithms.

That's about as good a general answer as possible. If you have a specific problem, then it might be possible to say something more specific.

scientific application


Which does what?

#5 Lord_Vader   Members   -  Reputation: 125


Posted 06 March 2012 - 12:44 PM

Basically it is an N-body simulator for gravity. It evolves an initial distribution of particles for a certain amount of time in a varying or static potential, and it can also fit data from telescopes while evolving. The particles are distributed to several processors on several nodes. The guy who made it was not a professional programmer, and neither am I, but I have some basic knowledge as a hobbyist programmer (unfortunately not of MPI yet) and I have to maintain and develop the program :)...

#6 Antheus   Members   -  Reputation: 2369


Posted 06 March 2012 - 01:42 PM

Basically It is an NBODY simulator for gravity.


Figured as much...

Look into existing libraries. For N-body you have 3 choices: brute force, Barnes-Hut (BH) or FMM (fast multipole method). Brute force doesn't really scale unless you have some ~1000 GPUs. BH and FMM depend on spatial distribution.
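
For reference, a minimal sketch of what "brute force" means here: every particle interacts with every other one, so the cost grows as O(N^2). This is an illustration only, not the simulator discussed in this thread; the function name `direct_accelerations`, the `softening` parameter and the choice of units (G = 1 by default) are assumptions.

```python
import math


def direct_accelerations(positions, masses, G=1.0, softening=1e-3):
    """Gravitational acceleration on every particle by direct summation.

    positions: list of (x, y, z) tuples; masses: list of floats.
    The double loop over all pairs is what makes this O(N^2).
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = softening * softening  # avoids division by zero at tiny separations
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            f = G * masses[j] / (r2 * math.sqrt(r2))  # G * m_j / r^3
            acc[i][0] += f * dx
            acc[i][1] += f * dy
            acc[i][2] += f * dz
    return acc
```

Doubling the particle count quadruples the number of pair evaluations, which is why the tree and multipole methods exist.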

For evenly spaced data FMM is fairly easy: distribute the data and run each node independently. But it suffers from non-uniform distributions, so in many situations it won't run optimally, and efficiency can get quite low, negating the benefits of multiple machines.

Barnes-Hut is better, but requires heuristics to resize regions. That may cause considerable data transfer between nodes.
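
The idea both methods build on is that a distant group of particles can be replaced by an aggregate. The sketch below shows only the lowest-order version of that idea in 2D (a cluster replaced by its total mass at its center of mass), plus the usual Barnes-Hut style opening test, size / distance < theta. It is not a tree code or an FMM implementation, and all function names and the `theta` default are assumptions made for illustration.

```python
import math


def exact_acceleration(target, positions, masses, G=1.0):
    """Sum the pull of every particle on a target point (2D)."""
    ax = ay = 0.0
    for (x, y), m in zip(positions, masses):
        dx, dy = x - target[0], y - target[1]
        r2 = dx * dx + dy * dy
        f = G * m / (r2 * math.sqrt(r2))
        ax += f * dx
        ay += f * dy
    return ax, ay


def monopole_acceleration(target, positions, masses, G=1.0):
    """Replace the whole cluster by its total mass at its center of mass."""
    total = sum(masses)
    cx = sum(m * x for (x, _), m in zip(positions, masses)) / total
    cy = sum(m * y for (_, y), m in zip(positions, masses)) / total
    return exact_acceleration(target, [(cx, cy)], [total], G)


def cluster_is_far(target, positions, masses, theta=0.5):
    """Opening test: cluster extent divided by distance to its center of mass."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    size = max(max(xs) - min(xs), max(ys) - min(ys))
    total = sum(masses)
    cx = sum(m * x for x, m in zip(xs, masses)) / total
    cy = sum(m * y for y, m in zip(ys, masses)) / total
    distance = math.hypot(cx - target[0], cy - target[1])
    return size / distance < theta


if __name__ == "__main__":
    # A 5x5 grid of unit masses inside the unit square around the origin.
    cluster = [(-0.5 + 0.25 * i, -0.5 + 0.25 * j) for i in range(5) for j in range(5)]
    masses = [1.0] * len(cluster)
    for target in [(20.0, 0.0), (1.0, 0.0)]:
        exact = exact_acceleration(target, cluster, masses)
        approx = monopole_acceleration(target, cluster, masses)
        print(target, cluster_is_far(target, cluster, masses), exact, approx)
```

The approximation is good for the distant target and poor for the nearby one, which is why a tree code subdivides nearby regions instead of aggregating them.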


N-body is a fairly well understood problem, and there are plenty of libraries out there. For larger scale, custom partitioning schemes are still commonly developed, simply because it's difficult to provide optimal bandwidth/CPU segmentation. Both BH and FMM are frequently I/O bound these days, even locally: for large sets (1GB+) the DRAM simply isn't fast enough.

As for OMP, I'm not a fan. For local computation it's way too easy to introduce false sharing (threads writing to different variables that happen to sit on the same cache line), which negates all benefits of multiple cores.

For the work I did, I inevitably ended up with a custom, problem-tailored implementation, gaining up to an order of magnitude improvement. For networked implementations it's quite difficult. There are various techniques, but finding what will work for your particular data set takes a bit of experimentation.

For local computation, especially for CPU/GPU hybrids, existing third-party libraries are quite competitive, and if your problem fits into GPU RAM they will outperform or at least be competitive with highly optimized high-end i7 versions.

In my experience, constant factors dominate these algorithms, so whichever implementation is used, it needs to be balanced for the given hardware. It may work out of the box, though.

#7 Lord_Vader   Members   -  Reputation: 125


Posted 06 March 2012 - 02:08 PM

Thanks for the comments. The potential solver of the simulator that I am using is based on a multipole expansion of the potential, but without a tree. It works well and has already been used to produce several papers, but it can be considered a hybrid gravity simulator because its main purpose is not to evolve but to fit models to observations.

#8 Antheus   Members   -  Reputation: 2369


Posted 06 March 2012 - 03:27 PM

A general disadvantage of FMM for distributed processing is that cells are of equal size and need to be fairly balanced to partition well; otherwise you're waiting for the slowest node.

On a shared-memory system it's simpler, minus the false sharing issues. Assign individual cells to different threads; when a thread is done with one cell, it picks another. FMM is a better fit for GPU as well.
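
A small sketch of why "pick another cell when done" helps with unbalanced cells. It does not run real threads; it only simulates finish times for a fixed up-front split versus workers that grab the next unclaimed cell as soon as they are free. The per-cell costs are made-up numbers and the function names are hypothetical.

```python
import heapq


def static_makespan(cell_costs, workers):
    """Contiguous blocks of cells fixed up front; everyone waits for the slowest."""
    n = len(cell_costs)
    loads = []
    for w in range(workers):
        start = w * n // workers
        stop = (w + 1) * n // workers
        loads.append(sum(cell_costs[start:stop]))
    return max(loads)


def dynamic_makespan(cell_costs, workers):
    """Each worker takes the next unclaimed cell as soon as it becomes free."""
    finish_times = [0] * workers  # min-heap of the times each worker frees up
    heapq.heapify(finish_times)
    for cost in cell_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    # Hypothetical cost per cell, e.g. proportional to particles in the cell.
    costs = [100, 90, 80, 70, 5, 5, 5, 5]
    print("static :", static_makespan(costs, 4))
    print("dynamic:", dynamic_makespan(costs, 4))
```

With these made-up costs the static split finishes at 190 time units and the dynamic one at 100, because the dense cells no longer land on the same worker.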

I've found that writes can be the most costly part of these simulations. Since the algorithms approach O(n) complexity, writing results for each particle isn't free, especially if you build interaction lists for each particle in a thread-local list. For a distributed system that matters less. A lot can be gained by keeping things in registers and doing computations on the fly, if the number of variables is small enough.

As said, I'm not up to date with recent libraries, the push these days appears to be mostly towards GPU clusters.


Since you're fitting models, exploring a pure matrix-based approach might be viable, depending on the solvers you're using, since those are among the most optimized code out there (BLAS et al). There's some overlap with FEM, which is likely to have more information on constraint solving. For a matrix-based approach, Matlab might be a viable option; it apparently supports multi-threading, but probably not networked distribution.

OMP or other APIs are by themselves a secondary concern here, since they simply depend too much on hardware and topology. At least, all the solutions I've seen were mostly hand-rolled, with third-party libraries used for local computation only.



