Lord_Vader

very general OpenMPI question



Hello all,

I have a very general question about OpenMPI...

Let's say that we have a program that is supposed to run on multiple processors in a distributed system.

I know that we can use OpenMPI to make a program run on multiple processors, and OpenMP if we want the program to run, let's say, on a multicore processor (shared memory).

Just a very general question that I have :P

How is it possible to utilize all the cores of every processor in a distributed system?
Do we have to use both OpenMPI and OpenMP, or will OpenMPI suffice?

Thanx

First question:
Usually distributed systems have a queue system, such as Sun Grid Engine (SGE) or NetBatch, that lets you describe tasks in terms of how many slots (cores) you require and then runs them on the system. You just need to tell the manager to allocate all the slots for you.

Second question:
MPI will do. You can also use OpenMP, which is useful when the nodes of the system are shared-memory machines.
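To make that concrete, here is a minimal hybrid sketch (my own illustration, not code from any particular cluster): one MPI process per node, with OpenMP threads covering the cores inside each node. It assumes Open MPI and a compiler with OpenMP support.

// Minimal hybrid MPI + OpenMP "hello" sketch (illustrative only).
// Compile with something like: mpicxx -fopenmp hybrid.cpp -o hybrid
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    // One MPI process per node; OpenMP spreads the work over that node's cores.
    #pragma omp parallel
    {
        std::printf("MPI rank %d of %d on %s, thread %d of %d\n",
                    rank, size, host,
                    omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

With Open MPI you would launch it with something like "mpirun -np 3 ./hybrid" and set OMP_NUM_THREADS to the number of cores per node; the exact flags for placing one process per node depend on your MPI version and on what the queue system hands you.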

Thanks for the answer,

Actually, I am working on a scientific application that can run on several nodes (let's say 3). Each node has several CPUs (12), and each CPU has 6 cores, but I am not 100% sure that all the cores of the CPUs are utilized...

"How is it possible to utilize all the cores of every processor in a distributed system?"

Nobody knows. Really, not even Google or supercomputer designers. If you name a problem and hardware layout/topology, then it's possible to examine past research and find a suitable algorithm, if one exists and was published.

Distributed programming is an unanswered question as of right now. We have good experience with specific designs for specific tasks, but no general answer.

If latency or consistency is important, there are currently hard limits on what can be done and costs tend to be prohibitive, so those are the first to go.

Hardware also has a huge impact. The relative speeds of network, memory, and CPU, as well as their topology, determine the choice of algorithms.

That's about as good a general answer as possible. If you have a specific problem, then it might be possible to say something more specific.

"scientific application"

Which does what?

Basically, it is an N-body simulator for gravity. It evolves an initial distribution of particles for a certain amount of time in a varying or static potential, and it can also fit data from telescopes while evolving. The particles are distributed to several processors on several nodes. The guy that made it was not a professional programmer, and neither am I, but I have some basic knowledge as a hobbyist programmer (unfortunately not in MPI yet), and I have to maintain and develop the program :)...


"Basically, it is an N-body simulator for gravity."


Figured as much...

Look into existing libraries. For N-body you have three choices: brute force, Barnes-Hut (BH), or the fast multipole method (FMM). Brute force doesn't really scale unless you have some ~1000 GPUs. BH and FMM depend on the spatial distribution.
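For reference, brute force is just the O(N^2) all-pairs loop below (a rough sketch I'm adding for illustration; the Particle layout, G, and the softening length eps are made up). In a distributed run, each rank would execute the outer loop only over its own particles while holding a copy of everyone's positions; the compute still grows as N^2, which is why it only pays off with massive hardware.

// Brute-force O(N^2) gravitational accelerations (illustrative sketch).
// 'eps' is a softening length; 'G' is the gravitational constant.
#include <vector>
#include <cmath>
#include <cstddef>

struct Particle { double x, y, z, mass, ax, ay, az; };

void accelerations(std::vector<Particle>& p, double G, double eps)
{
    for (std::size_t i = 0; i < p.size(); ++i) {
        double ax = 0.0, ay = 0.0, az = 0.0;   // accumulate locally, write once
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            ax += G * p[j].mass * dx * inv_r3;
            ay += G * p[j].mass * dy * inv_r3;
            az += G * p[j].mass * dz * inv_r3;
        }
        p[i].ax = ax; p[i].ay = ay; p[i].az = az;
    }
}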

For evenly spaced data FMM is fairly easy: distribute the data and run each node independently. But it suffers with non-uniform distributions, so in many situations it won't run optimally; efficiency can get quite low, negating the benefits of multiple machines.

Barnes-Hut is better, but it requires heuristics to resize regions. That may cause considerable data transfer between nodes.

N-body is a fairly well understood problem, and there are plenty of libraries out there. At larger scales, custom partitioning schemes are still commonly developed, simply because it's difficult to provide optimal bandwidth/CPU segmentation. Both BH and FMM are frequently I/O bound these days, even locally, where for large sets (1 GB+) the DRAM simply isn't fast enough.

As for OMP: I'm not a fan. For local computation it's way too easy to introduce false sharing, which negates all the benefits of multiple cores.
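To show what I mean by false sharing, here is a toy reduction (a made-up example of mine, not anyone's production code): in the first version every thread keeps writing to its own element of a shared array, so neighbouring elements sit on the same cache line and the cores fight over it; the second accumulates in a local variable and does a single shared write per thread.

// False-sharing illustration with a simple sum (toy example).
#include <omp.h>
#include <vector>

// Bad: per-thread slots are adjacent doubles -> they share cache lines.
double sum_false_sharing(const std::vector<double>& v)
{
    std::vector<double> partial(omp_get_max_threads(), 0.0);
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < (long)v.size(); ++i)
            partial[t] += v[i];          // repeated writes to a shared cache line
    }
    double s = 0.0;
    for (double x : partial) s += x;
    return s;
}

// Better: accumulate in a thread-local variable, one shared write per thread.
double sum_local(const std::vector<double>& v)
{
    double s = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;
        #pragma omp for nowait
        for (long i = 0; i < (long)v.size(); ++i)
            local += v[i];
        #pragma omp atomic
        s += local;
    }
    return s;
}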

For the work I did, I inevitably ended up with a custom, problem-tailored implementation, gaining up to an order of magnitude improvement. For networked implementations it's quite difficult. There are various techniques, but finding what will work for your particular data set takes a bit of experimentation.

For local computation, especially for CPU/GPU hybrids, existing third-party libraries are quite competitive, and if your problem fits into GPU RAM they will outperform, or at least be competitive with, highly optimized high-end i7 versions.

In my experience, constant factors dominate these algorithms, so whichever implementation is used, it needs to be balanced for the given hardware. It may work out of the box, though.

Thanks for the comments. The potential solver of the simulator that I am using is based on a multipole expansion of the potential, but without a tree. It works well and has already been used to produce several papers, but it can be considered a hybrid gravity simulator because its main purpose is not to evolve but to fit models to observations.

A general disadvantage of FMM for distributed processing is that cells are of equal size and need to be fairly balanced to partition well; otherwise you're waiting for the slowest node.

On a shared-memory system it's simpler, minus the false-sharing issues: assign individual cells to different threads, and when a thread is done with one cell, it picks another. FMM is a better fit for the GPU as well.
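The "pick another cell when done" part maps directly onto OpenMP's dynamic scheduling. A rough sketch (the Cell type and the per-cell work are placeholders I made up, not your solver's actual types):

// Cell-at-a-time work distribution with dynamic scheduling (illustrative).
#include <omp.h>
#include <vector>

struct Cell {
    std::vector<double> mass;   // stand-in for the per-cell particle data
    double result = 0.0;        // stand-in for the computed far/near-field sums
};

// Placeholder for the real per-cell work (multipole evaluation, etc.).
static void process_cell(Cell& c)
{
    double s = 0.0;
    for (double m : c.mass) s += m;
    c.result = s;
}

void process_all(std::vector<Cell>& cells)
{
    // schedule(dynamic) hands a thread a fresh cell as soon as it finishes one,
    // which tolerates uneven cell costs better than a static split.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)cells.size(); ++i)
        process_cell(cells[i]);
}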

I've found that writes can be the most costly part of these simulations. Since the algorithms approach O(n) complexity, writing results for each particle isn't free, especially if you build interaction lists for each particle in a thread-local list. For a distributed system that matters less. A lot can be gained by keeping things in registers and doing computations on the fly, if the number of variables is small enough.

As said, I'm not up to date with recent libraries; the push these days appears to be mostly towards GPU clusters.


Since you're fitting models, exploring a pure matrix-based approach might be viable, depending on the solvers you're using, since those are among the most optimized code out there (BLAS et al.). There's some overlap with FEM, which is likely to have more information on constraint solving. For a matrix-based approach, Matlab might be a viable option; it apparently supports multi-threading, but probably not networked distribution.
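If you do go the matrix route, most of the heavy lifting ends up as calls like the one below (standard CBLAS interface; the sizes and values are arbitrary, and you'd link against whatever BLAS your cluster provides):

// Minimal dense matrix multiply through CBLAS: C = A * B (arbitrary sizes).
// Link against a BLAS implementation, e.g. -lopenblas or vendor BLAS.
#include <cblas.h>
#include <vector>

int main()
{
    const int n = 512;                        // arbitrary problem size
    std::vector<double> A(n * n, 1.0);
    std::vector<double> B(n * n, 2.0);
    std::vector<double> C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major storage
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}

A threaded BLAS implementation (MKL, ATLAS, and the like) will use all local cores for calls like this without any OpenMP on your side.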

OMP or the other APIs are by themselves a secondary concern here, since they simply depend too much on hardware and topology; at least, all the solutions I've seen were mostly hand-rolled, with third-party libraries used for local computation only.
