Basically It is an NBODY simulator for gravity.
Figured as much...
Look into existing libraries. For nbody you have 3 choices: brute force, Barnes hut or FMM. Brute force doesn't really scale unless you have some ~1000 GPUs. BH and FMM depend on spatial distribution.
For evenly spaced data FMM is fairly easy. Distribute data, run each node independently. But it suffers from non-uniform distributions, so for many situations it won't run optimally, efficiency can get quite low, negating the benefits of multiple machines.
Barnes Hut is better, but requires heuristics to resize regions. That may cause considerable data transfer between nodes.
NBody is fairly well understood problem, there's plenty of libraries out there. For larger scale, custom partitioning schemes are still commonly developed simply because it's difficult to provide optimal bandwidth/CPU segmentation. both BH and FMM are frequently IO bound these days, even locally, where for large sets (1GB+) the DRAM simply isn't fast enough.
As for OMP - I'm not a fan. For local computation it's way too easy to introduce false sharing which negates all benefits of multiple cores.
For the work I did I inevitably ended up with custom problem-tailored implementation and gaining up to an order of magnitude improvement. For networked implementations it's quite difficult. There's various techniques, but finding what will work for your particular data set takes a bit of experimentation.
For local computation, especially for CPU/GPU hybrids, existing third-party libraries are quite competitive and if your problem fits into GPU RAM they will outperform or at least be competitive with highly optimized high-end i7 versions.
In my experience, constant factors dominate these algorithms, so whichever implementation is used, it needs to be balanced for given hardware. It may work out-of-box though.