Slowdown when using OpenMP with an STL vector per thread


The function resize() is equivalent to (and sometimes implemented as) calling pop_back() a bunch of times. Reserve the space first. You have potentially just caused n reallocations to trigger, and potentially just triggered n! (n factorial) moves. Most standard library implementations use bigger buckets, but some of them only allocate in very small increments; a few memory-conscious implementations only increment by one by default.


First, I think you meant 'push_back', not 'pop_back'. And then, are you certain about that? I just checked the documentation and the promised complexity is 'linear in the difference between the current size and count', not 'amortized linear' as would be expected when using a bunch of push_backs.

I think, if you want to stick to the STL, you should replace the set with a priority queue and see what happens

That's it! :D

I'm not yet 100% sure I did it correctly, but the results match, and it's already much faster single-threaded and it also gives a speedup with OpenMP.

On a six-core machine, the priority_queue method takes 0.4 seconds single-threaded and 0.07 seconds multi-threaded - just perfect :)

(timings with the set method were 1.4 / 2 seconds)


but rather than using an all-pairs algorithm, you are using a single-source algorithm and repeating it for every pair.

Just to try out whether the idea works, and Dijkstra was already implemented. There was a note above the code that I've already replaced this with a faster method.

I've tried the other minor things you suggested, but they made no measurable difference in runtime.

It's a result of being lazy, paranoid and thinking "the compiler should be clever enough to detect this" ;D

I really appreciate your help guys!

In the future I'll look more closely at what happens under the hood of the STL (I use it very rarely).

Here are the changes:

EDIT: Fixed a bug - I just saw that Pink Horror had already noticed it while I was posting :)


    // Comparator: returns true when 'first' is farther away than 'second',
    // so a std::priority_queue using it keeps the nearest vertex on top
    // (i.e. it behaves as a min-heap on distance).
    struct FirstFar
    {
        bool operator() (const std::pair<int,float> &first,
                         const std::pair<int,float> &second) const
        {
            return first.second > second.second;
        }
    };

    static void DijkstraComputePaths2 (int source,
                              const adjacency_list_t &adjacency_list,
                              std::vector<float> &min_distance,
                              std::vector<int> &previous)
    {
        int n = adjacency_list.size();
        //min_distance.clear();
        min_distance.reserve(n);
        min_distance.resize(n, max_weight); // note: resize() only value-initialises elements beyond the current size, so these vectors are expected to come in empty
        min_distance[source] = 0;
        //previous.clear();
        previous.reserve(n);
        previous.resize(n, -1);
        
        // Min-heap of (vertex index, tentative distance) pairs, ordered by distance via FirstFar
        std::priority_queue< std::pair<int,float>, std::vector<std::pair<int,float> >, FirstFar> vertex_queue;
        vertex_queue.push (std::make_pair(source, min_distance[source]));

        while (!vertex_queue.empty())
        {
            int u = vertex_queue.top().first;
            float dist = vertex_queue.top().second;
            vertex_queue.pop();
            
            // Visit each edge exiting u
            const std::vector<Neighbour> &neighbors = adjacency_list[u];
            for (std::vector<Neighbour>::const_iterator neighbor_iter = neighbors.begin();
                 neighbor_iter != neighbors.end();
                 neighbor_iter++)
            {
                int v = neighbor_iter->target;
                float weight = neighbor_iter->weight;
                float distance_through_u = dist + weight;
                if (distance_through_u < min_distance[v])
                {
                    min_distance[v] = distance_through_u;
                    previous[v] = u;
                    vertex_queue.push(std::make_pair(v, min_distance[v]));
                }
            }
        }
    }
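
For reference, here is a minimal sketch of how a function like this can be driven from OpenMP so that every thread owns its result vectors. The ComputeAllSources wrapper and the all_distances output are my own illustration (not code from this thread), and it assumes DijkstraComputePaths2 is visible as a free function here:

    #include <vector>

    static void ComputeAllSources (const adjacency_list_t &adjacency_list,
                                   std::vector< std::vector<float> > &all_distances)
    {
        int n = (int)adjacency_list.size();
        all_distances.resize (n);

        // Each iteration uses its own min_distance / previous vectors, so the
        // threads only share the read-only adjacency list and write to
        // distinct slots of all_distances - no synchronization needed.
        #pragma omp parallel for schedule(dynamic)
        for (int source = 0; source < n; source++)
        {
            std::vector<float> min_distance;
            std::vector<int> previous;
            DijkstraComputePaths2 (source, adjacency_list, min_distance, previous);
            all_distances[source] = min_distance;
        }
    }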


> vertex_queue.erase(vertex_queue.begin());

Slow. You've just caused every element to be shifted by one position, calling bunches of move constructors and destructors in the process. You've got several of these in your algorithm. If you must do that pattern, start at the end and work backwards. Or use a data structure that has better handling of removal from the beginning, like a deque.
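
As a hedged illustration of that point (not the original container, as the next reply notes), removing the first element of a vector shifts everything behind it, while a deque can drop it in constant time:

    #include <deque>
    #include <vector>

    void FrontRemovalExample ()
    {
        std::vector<int> v = { 1, 2, 3, 4 };
        v.erase (v.begin());   // O(n): the remaining elements are all moved one slot forward

        std::deque<int> d = { 1, 2, 3, 4 };
        d.pop_front ();        // O(1): nothing is moved
    }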

vertex_queue was a set, not a vector.


First, I think you meant 'push_back', not 'pop_back'. And then, are you certain about that? I just checked the documentation and the promised complexity is 'linear in the difference between the current size and count', not 'amortized linear' as would be expected when using a bunch of push_backs.

You would have to do something similar to push_back, since you must initialize each and every item when resizing.

You also have to do an allocation if the vector is too small; there's no way to get around that, and that reallocation will be linear in the size of the entire vector.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).
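
For illustration, a minimal sketch of the two patterns being discussed (the fill value and count are arbitrary):

    #include <vector>

    void FillToN (int n)
    {
        // resize: at most one reallocation (if capacity is insufficient),
        // then every new element is initialised with the given value.
        std::vector<float> a;
        a.resize (n, 0.0f);

        // reserve + push_back: the reserve makes a single allocation up front,
        // so none of the subsequent push_backs have to reallocate.
        std::vector<float> b;
        b.reserve (n);
        for (int i = 0; i < n; i++)
            b.push_back (0.0f);
    }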


int u = vertex_queue.top().first;
vertex_queue.pop();
float dist = vertex_queue.top().second;

This doesn't look right.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).


Exactly. frob described a scenario where several reallocations would commonly happen in response to a resize(). I cannot really believe that since, as you said, doing something as simple as a reserve before adding the elements would already fix that issue. I cannot believe something that simple would not be part of anything worthy of calling itself a 'standard library'.
Well, maybe some of those broken console compilers have that issue, but that really should not be mentioned in a general programming context.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).


Exactly. frob described a scenario where several reallocations would commonly happen in response to a resize(). I cannot really believe that since, as you said, doing something as simple as a reserve before adding the elements would already fix that issue. I cannot believe something that simple would not be part of anything worthy of calling itself a 'standard library'.
Well, maybe some of those broken console compilers have that issue, but that really should not be mentioned in a general programming context.

Even if there were several reallocations, when the vector reserves more space the growth is exponential. You're still not going to get frob's n-factorial moves that way. In total it's a linear number of moves. Whatever moves you get in the future, the container's analysis accounted for them with the amortized constant time of each push_back that originally put something in the vector.
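
To see that geometric growth in practice, a small standalone test can print the capacity whenever it changes; the exact growth factor is implementation-defined, but a typical run shows it growing roughly exponentially:

    #include <cstdio>
    #include <vector>

    int main ()
    {
        std::vector<int> v;
        std::size_t last_capacity = 0;
        for (int i = 0; i < 1000; i++)
        {
            v.push_back (i);
            if (v.capacity () != last_capacity)
            {
                last_capacity = v.capacity ();
                std::printf ("size %zu -> capacity %zu\n", v.size (), last_capacity);
            }
        }
        // The capacity grows geometrically, which is why a long run of
        // push_backs costs amortized constant time per element.
        return 0;
    }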

Of course, the improvement here was with the set, not the vector. The vector's reallocation is not the problem, especially because all those moves happen in parallel on their own threads. (Edit: I don't mean to imply the moves for one container are done in parallel. I mean each copy of the container in this particular problem can do the move portion of its resize independently of the others.) Switching the set to a vector-backed container got rid of whatever thread synchronization was done for the heap for every insert and erase, which kept the OpenMP threads from doing much work in parallel.

Did anyone try CUDA Thrust instead of the STL to improve the performance of the DijkstraComputePaths2 algorithm? I have tried CUDA Thrust programming but got stuck on std::priority_queue; there is no alternative to it in CUDA Thrust.

 

2-year necro. Locking.

-- Tom Sloper -- sloperama.com

This topic is closed to new replies.
