Games development with OpenMP

Started by Eliad Moshe, 7 comments, last by Jan Wassenberg 14 years, 2 months ago
Is it possible to create a game engine that supports fully parallel processing using the OpenMP library? I saw that there are a lot of restrictions on for-loop design, most of them related to memory management. Can I make function calls inside iterations of a for-loop with "#pragma omp parallel for", or in this case should I use only macros?
I've used OpenMP quite a bit and absolutely love it. It's so incredibly easy to use. You can indeed call functions within your parallel sections, no problem.
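For example, a minimal sketch of what that looks like (shade and shade_all are made-up names; the point is just that the loop body calls an ordinary function):

```cpp
#include <cmath>
#include <vector>

// Hypothetical per-element work; any ordinary function can be called
// from inside the parallel loop body.
double shade(double x) {
    return std::sqrt(x) * 0.5;
}

std::vector<double> shade_all(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    // Iterations are independent, so OpenMP is free to split the range
    // across threads. Without OpenMP enabled the pragma is simply
    // ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(in.size()); ++i)
        out[i] = shade(in[i]);  // a function call per iteration
    return out;
}
```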

The ease of use of OpenMP comes at the cost of a limited feature set. There is no event-based support, so using it as part of a GUI engine isn't a good idea, and there is no support for monitors, semaphores, mutexes, etc. (though there are #pragma omp critical and #pragma omp atomic).
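As a rough sketch of those two pragmas (sum_squares is a made-up example): atomic protects a single memory update, while critical would serialize an arbitrary block.

```cpp
// Each thread computes its squares independently; only the shared
// accumulation is protected. '#pragma omp critical' around the same
// statement would also be correct, but it serializes a whole block
// rather than one memory operation.
long sum_squares(int n) {
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        long sq = static_cast<long>(i) * i;
        #pragma omp atomic
        total += sq;
    }
    return total;
}
```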

On the other hand, I've just recently been playing with some of Boost's message-passing and other multithreading support libraries (boost::asio especially), and I think that, used in conjunction with OpenMP, it'll be a winning combination for any project.
Dave.
Thanks Dave!

I have another question:

Can I use an object of some_class of mine inside such a for-loop?
Quote:Original post by Eliad Moshe
Thanks Dave!

I have another question:

Can I use an object of some_class of mine inside such a for-loop?


You can do anything you would normally do inside a loop body, but you have to be careful of data dependencies between iterations.

There are also restrictions on the form of the loop. In C and C++ for example, the counter has to be of signed integral type. There's a bunch of stuff like that. Both MSVC and g++ seem to give reasonable warnings about this kind of thing though.

I've only had a little experience with OpenMP and while it was easy to use, I wasn't particularly impressed with the way scheduling was handled, at least in Microsoft's implementation.
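To illustrate both points, here's a sketch (average is a hypothetical name): the counter is a signed int as OpenMP 2 requires, a reduction removes the data dependency on the accumulator, and a schedule clause overrides the implementation's default partitioning if you don't like it.

```cpp
// Canonical loop form: signed integer counter, simple bounds, fixed
// increment. reduction(+:total) gives each thread a private copy of
// 'total' that is combined at the end; schedule(static) pins down the
// work partitioning instead of leaving it to the runtime.
double average(const double* data, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (int i = 0; i < n; ++i)
        total += data[i];
    return n > 0 ? total / n : 0.0;
}
```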
Yeah, as the_edd stated, you can use it exactly as if there was no parallelism, but you have to watch out for the standard synchronization issues as with any multithreaded program.

And yeah, I forgot to mention in my previous post, as the_edd reminded me: the form of your loops is fairly strict. In particular, you can't directly do linked-list or other iterator-based looping. There are a couple of tricks around this, but the performance is several times (and I mean really several times) worse.

Unless the body of the loop is a very time-consuming algorithm (relatively speaking), parallelising for-loops on linked lists (as opposed to arrays or vectors) is simply not worth it. Believe me, I've tried in vain to find a way to parallelise std::list loops of 10s-100s of thousands of elements, and it always works out quicker to just use a single thread vs. 8 cores (!). I thought OpenMP 3 would be the key with its new 'tasking' feature, so I invested a lot in Intel's C++ compiler (since Visual C++ only supports OpenMP 2). But it turns out to be no better than the single-nowait trick you can do in OpenMP 2.
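For reference, the single-nowait trick looks roughly like this (process_list is a made-up name): every thread chases the same pointers, and each node's body is claimed by exactly one of them, which is why the traversal itself never gets faster.

```cpp
#include <list>

void process_list(std::list<int>& items) {
    #pragma omp parallel
    {
        // Every thread walks the entire list; 'single nowait' hands
        // each node's body to exactly one thread without a barrier
        // after it. The pointer chase stays serial in each thread.
        for (std::list<int>::iterator it = items.begin(); it != items.end(); ++it) {
            #pragma omp single nowait
            *it *= 2;  // stand-in for the expensive per-node work
        }
    }
}
```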
Dave.
The breaking up of algorithms is the hard part of multithreading, which is why I think it's almost pointless to bother with OpenMP or TBB.

They might be easy to use, but when you have something that really needs parallelism and works well with it, it's worth doing it right; and if you try to use them everywhere, you will slow things down a lot.

This is my thread. There are many threads like it, but this one is mine.

Quote:parallelising for-loops on linked lists (as opposed to arrays or vectors), is simply not worth it.

I'm wondering when lists are ever an appropriate data structure for time-critical applications. 'Unbounded' collections can efficiently be implemented by reserving address space and demand-mapping memory. If you need to insert at both ends, then std::deque-style schemes improve locality and decrease link overhead (though the VC STL's default bucket size of 16 seems insanely small). Deleting items in the middle is cheap if the last item can be copied into its place or holes are acceptable.
That said, even lists can be processed in parallel (-> "parallel list ranking"); skip lists may also help.
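The "copy the last item into its place" deletion mentioned above can be sketched in a few lines (erase_unordered is a hypothetical helper):

```cpp
#include <cstddef>
#include <vector>

// O(1) removal when element order doesn't matter: overwrite the hole
// with the last element, then shrink. Self-assignment when i is the
// last index is harmless for well-behaved types.
template <typename T>
void erase_unordered(std::vector<T>& v, std::size_t i) {
    v[i] = v.back();
    v.pop_back();
}
```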

Quote:The breaking up of algorithms is the hard part in multithreading which is why I have to think it's almost pointless to bother with openMP or TBB.
They might be easy to use but when you have something that really needs it and works well with it it's worth doing it right, and if you try to use it everywhere then you will slow down a lot.

And what would you propose instead? Replacing OpenMP parallel sections with do-it-yourself threading is probably less efficient due to thread-creation overhead unless you also implement a thread pool. However, that would be a fool's errand because OpenMP already does it well with no apparent overhead.
BTW, "doing it right" is certainly possible with OpenMP - simply start a parallel section, pin each thread to a core via POSIX or OS-specific affinity APIs and then add your own partitioning of the problem (ideally aware of shared caches and NUMA topology).
Quote:Original post by Jan Wassenberg
I'm wondering when lists are ever an appropriate data structure for time-critical applications. 'Unbounded' collections can efficiently be implemented by reserving address space and demand-mapping memory. If you need to insert at both ends, then std::deque-style schemes improve locality and decrease link overhead (though the VC STL's default bucket size of 16 seems insanely small). Deleting items in the middle is cheap if the last item can be copied into its place or holes are acceptable.
That said, even lists can be processed in parallel (-> "parallel list ranking"); skip lists may also help.


struct Node {
    size_t prev, next;
    T value;
};

std::vector< Node > nodes;
This solves the memory problem, while retaining linked list characteristics.

The advantage of this version is that custom order can be maintained, but if removal is implemented via pop-and-swap, then unordered traversal can still be used (on delete, replace removed element with last one, then fix the pointers).

Such representation might be more favorable than having two arrays, one for data, one for order.

There is not all that much use for this type of lists, but sometimes they can come handy.
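A fuller sketch of that representation and the pop-and-swap removal described above (VecList and NIL are made-up names): removal unlinks the node, moves the last array slot into the hole, then patches the neighbours' indices so the array stays dense.

```cpp
#include <cstddef>
#include <vector>

const std::size_t NIL = static_cast<std::size_t>(-1);

template <typename T>
struct Node {
    std::size_t prev, next;
    T value;
};

template <typename T>
struct VecList {
    std::vector< Node<T> > nodes;
    std::size_t head, tail;
    VecList() : head(NIL), tail(NIL) {}

    void push_back(const T& v) {
        std::size_t i = nodes.size();
        Node<T> n = { tail, NIL, v };
        nodes.push_back(n);
        if (tail != NIL) nodes[tail].next = i; else head = i;
        tail = i;
    }

    // Pop-and-swap removal: unlink node i, move the last array element
    // into slot i, then repair the indices that referred to it.
    void erase(std::size_t i) {
        std::size_t p = nodes[i].prev, nx = nodes[i].next;
        if (p != NIL) nodes[p].next = nx; else head = nx;
        if (nx != NIL) nodes[nx].prev = p; else tail = p;
        std::size_t last = nodes.size() - 1;
        if (i != last) {
            nodes[i] = nodes[last];
            if (nodes[i].prev != NIL) nodes[nodes[i].prev].next = i; else head = i;
            if (nodes[i].next != NIL) nodes[nodes[i].next].prev = i; else tail = i;
        }
        nodes.pop_back();
    }
};
```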
Interesting, but that's basically just a linked list using an (expandable) pool allocator. The above-mentioned demand-mapping scheme would be even more efficient in terms of allocation (and lack of copying). However, you still have the disadvantage of "wasting" a large proportion of the cache on those prev/next links. Sure, you could do crazy stuff like storing (n-1)-bit deltas and referring to a lookaside table instead if the top bit is set, but I haven't yet seen an application where container ordering was important and some other scheme wasn't applicable. (Example: insertion sort is actually n log n if you can leave some gaps in the output - see "library sort". Under the same assumption, "counting" sort can write elements to their final position without even needing to count their frequency - see http://en.wikipedia.org/wiki/Counting_sort#Single-pass_variant)

This topic is closed to new replies.
