Slow down when using OpenMP with stl vector per thread

Started by
28 comments, last by Tom Sloper 6 years, 5 months ago

Hi,

I'm new to OpenMP and for the second time i notice a slow down in code like this:

const Graph::GraphData &graph = graphPerPart[part];

#pragma omp parallel for
for (int borderIndex = 0; borderIndex < newBorders.size(); borderIndex++)

{
int source = borders[borderIndex].FirstVertex(*mesh);
int target = borders[borderIndex].LastVertex(*mesh);

std::vector<float> min_distance;
std::vector<int> previous;

Graph::DijkstraComputePaths (source, graph.vertex_adjacency_list, min_distance, previous); // this grows both vectors

newBorders[borderIndex].DoSomethingWithTheClosestPath (previous, target, ...); // this also grows some std::vectors
}

OMP code runs only 0.75 times as fasat as single threaded (VS 2013, quad core).

I guess it is because each thread has to manage memory for the stl::vectors - can this be true and is there a clue to fix this?

This is for a preprocessing tool and avoiding stl surely isn't worth the time,

but usually i get the expected speed ups with very little affort so i'm curious.

Also i'd like to know if OpenMP can get closely as fast as other multithreading techniques for runtime code, because it's so easy to use.

Advertisement
Threading is not a "magic wand". It will not speed up your code unless you design it to be threaded, and will most likely slow down your code if not utilized correctly (or worse, corrupt your data and crash your program).

What vectors are growing? What locking mechanisms are you using to make sure you don't corrupt memory? How are you sharing memory?

And most importantly - have you run it in a profiling tool to see where all your time is spent?

There is no synchronization or shared memory necessary here (I'm aware of those issues from GPGPU).

Actually the algorithm computes shortest paths to smooth cluster boundaries from mesh segmentation - no math here, just walking lists and storing data.

The growing vectors are the two shown in the listing, std::list in the new border to store the path, plus a std::set queue in Dijkstra.

I guess the heap allocation serializes the cpu?

Visual Studio Profiler is new to me as well. I'll take a look... :)

I guess the heap allocation serializes the cpu?


Heap allocation is serialized, yes. (May depend on your compiler and how compliant they are with C++11/14 standards)

OMP uses a kind of naive approach to MT. Basically it is trying to simplify something that can't really be simplified. You still have to know in detail what;s going on or you will go slower, but if you do know that much detail you are better off not using OMP. There is really no good reason for OMP to exist, the only thing I can think of is that it is nice and crossplatform, so you could add in some limited MT stuff to a tool library etc. and not have to worry about different implementations.

This is my thread. There are many threads like it, but this one is mine.

Partially correct. You DO need to understand what is actually happening. Otherwise race conditions, data splicing, lost updates, cache coherency problems, and much more can all destroy the logic.

However, also partially incorrect. OpenMP does have a great reason for existing, and for many uses it is a wonderful solution. As you wrote, it generalizes well. It can be leveraged nicely to take advantage of parallel processing on simple solutions that otherwise wouldn't justify the effort of writing parallel systems. If you are writing small code that has big data, OpenMP can dramatically simplify development.

It just tends to be a less-than-ideal fit for most game development, with enormous monolithic designs and implementations.

You could try using vector.reserve(n) if you know how large they need to be at the end (depends how much memory youre willing to waste if you approximate it too high) - this will ensure no allocations need to be done (apart from the first one) when resizing the vector until it grows bigger than n elements.

EDIT: You say its single threaded, how long execution time overall are we talking here? Theres probably some overhead for creating and running the thread etc. Hows the performance with 4 threads or whatever youre going to use in practise? (and this is in release mode right?)

o3o

Thanks for all the answers!

So OpenMP is as expected not the ideal choice for CPU game dev, so what is?

OpenCL (It's great for GPU but awkward for CPU)

AMP (Microsoft only? At least i know it's simplifying GPUs too much, so i have some doupts against it)

C++11 threads (not yet widely supported)

Native Thread API, or a lightweight crossplattform version like PThreads... the only right choice?


You could try using vector.reserve(n)

That should be the solution but is not worth to implement at the moment.

Actually single threaded takes 1.4 seconds and MT(i7-930) 2 seconds for the bunny model.

Ratio keeps constant if i use a larger mesh so it runs longer.

You could try using vector.reserve(n)


That should be the solution but is not worth to implement at the moment.

Wait, what? One line of code is not worth implementing?

devstropo.blogspot.com - Random stuff about my gamedev hobby

Are you sure its actually using more than one thread?

o3o

This topic is closed to new replies.

Advertisement