Slowdown when using OpenMP with an STL vector per thread


The function resize() is equivalent to (and sometimes implemented as) calling pop_back() a bunch of times. Reserve the space first. You have potentially just caused n reallocations to trigger, and potentially just triggered n! (n factorial) moves. Most standard library implementations use bigger buckets, but some of them only allocate in very small increments; a few memory-conscious implementations only increment by one by default.


First, I think you meant 'push_back', not 'pop_back'. And then, are you certain about that? I just checked the documentation and the promised complexity is 'linear in the difference between the current size and count', not 'amortized linear' as would be expected when using a bunch of push_backs.

I think, if you want to stick to the STL, you should replace the set with a priority queue and see what happens

That's it! :D

I'm not yet 100% sure I did it correctly, but the results match, and it's already much faster single-threaded and it also gives a speedup with OpenMP.

On a six-core machine, the priority_queue method takes 0.4 seconds single-threaded and 0.07 seconds multi-threaded - just perfect :)

(timings with the set method were 1.4 / 2 seconds)


but rather than using an all-pairs algorithm, you are using a single-source algorithm and repeating it for every pair.

Just to try out whether the idea works, and Dijkstra was already implemented. There was a note above the code that I've already replaced this with a faster method.

I've tried the other minor things you suggested, but they made no measurable difference in runtime.

It's a result of being lazy, paranoid and thinking "the compiler should be clever enough to detect this" ;D

I really appreciate your help guys!

In the future I'll look more closely at what happens under the hood of the STL (I use it very rarely).

Here are the changes:

EDIT: Fixed a bug - I just saw that Pink Horror had already noticed it while I was posting :)


    // Comparator: returns true when 'first' is farther away than 'second',
    // so a std::priority_queue using it keeps the nearest vertex on top
    // (i.e. it behaves as a min-heap on distance).
    struct FirstFar
    {
        bool operator() (const std::pair<int,float> &first,
                         const std::pair<int,float> &second) const
        {
            return first.second > second.second;
        }
    };

    static void DijkstraComputePaths2 (int source,
                              const adjacency_list_t &adjacency_list,
                              std::vector<float> &min_distance,
                              std::vector<int> &previous)
    {
        int n = adjacency_list.size();
        //min_distance.clear();
        min_distance.reserve(n);
        min_distance.resize(n, max_weight); // note: resize() only value-initialises elements beyond the current size, so these vectors are expected to come in empty
        min_distance[source] = 0;
        //previous.clear();
        previous.reserve(n);
        previous.resize(n, -1);
        
        // Min-heap of (vertex index, tentative distance) pairs, ordered by distance via FirstFar
        std::priority_queue< std::pair<int,float>, std::vector<std::pair<int,float> >, FirstFar> vertex_queue;
        vertex_queue.push (std::make_pair(source, min_distance[source]));

        while (!vertex_queue.empty())
        {
            int u = vertex_queue.top().first;
            float dist = vertex_queue.top().second;
            vertex_queue.pop();
            
            // Visit each edge exiting u
            const std::vector<Neighbour> &neighbors = adjacency_list[u];
            for (std::vector<Neighbour>::const_iterator neighbor_iter = neighbors.begin();
                 neighbor_iter != neighbors.end();
                 neighbor_iter++)
            {
                int v = neighbor_iter->target;
                float weight = neighbor_iter->weight;
                float distance_through_u = dist + weight;
                if (distance_through_u < min_distance[v])
                {
                    min_distance[v] = distance_through_u;
                    previous[v] = u;
                    vertex_queue.push(std::make_pair(v, min_distance[v]));
                }
            }
        }
    }
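
For reference, here is a minimal sketch of how a function like this can be driven from OpenMP so that every thread owns its result vectors. The ComputeAllSources wrapper and the all_distances output are my own illustration (not code from this thread), and it assumes DijkstraComputePaths2 is visible as a free function here:

    #include <vector>

    static void ComputeAllSources (const adjacency_list_t &adjacency_list,
                                   std::vector< std::vector<float> > &all_distances)
    {
        int n = (int)adjacency_list.size();
        all_distances.resize (n);

        // Each iteration uses its own min_distance / previous vectors, so the
        // threads only share the read-only adjacency list and write to
        // distinct slots of all_distances - no synchronization needed.
        #pragma omp parallel for schedule(dynamic)
        for (int source = 0; source < n; source++)
        {
            std::vector<float> min_distance;
            std::vector<int> previous;
            DijkstraComputePaths2 (source, adjacency_list, min_distance, previous);
            all_distances[source] = min_distance;
        }
    }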


> vertex_queue.erase(vertex_queue.begin());

Slow. You've just caused every element to be shifted by one position, calling bunches of move constructors and destructors in the process. You've got several of these in your algorithm. If you must do that pattern, start at the end and work backwards. Or use a data structure that has better handling of removal from the beginning, like a deque.
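
As a hedged illustration of that point (not the original container, as the next reply notes), removing the first element of a vector shifts everything behind it, while a deque can drop it in constant time:

    #include <deque>
    #include <vector>

    void FrontRemovalExample ()
    {
        std::vector<int> v = { 1, 2, 3, 4 };
        v.erase (v.begin());   // O(n): the remaining elements are all moved one slot forward

        std::deque<int> d = { 1, 2, 3, 4 };
        d.pop_front ();        // O(1): nothing is moved
    }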

vertex_queue was a set, not a vector.


First, I think you meant 'push_back', not 'pop_back'. And then, are you certain about that? I just checked the documentation and the promised complexity is 'linear in the difference between the current size and count', not 'amortized linear' as would be expected when using a bunch of push_backs.

You would have to do something similar to push_back, since you must initialize each and every item when resizing.

You also have to do an allocation if the vector is too small; there's no way to get around that, and that reallocation will be linear in the size of the entire vector.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).
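
For illustration, a minimal sketch of the two patterns being discussed (the fill value and count are arbitrary):

    #include <vector>

    void FillToN (int n)
    {
        // resize: at most one reallocation (if capacity is insufficient),
        // then every new element is initialised with the given value.
        std::vector<float> a;
        a.resize (n, 0.0f);

        // reserve + push_back: the reserve makes a single allocation up front,
        // so none of the subsequent push_backs have to reallocate.
        std::vector<float> b;
        b.reserve (n);
        for (int i = 0; i < n; i++)
            b.push_back (0.0f);
    }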


int u = vertex_queue.top().first;
vertex_queue.pop();
float dist = vertex_queue.top().second;

This doesn't look right.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).


Exactly. frob described a scenario where several reallocations would commonly happen in response to a resize(). I cannot really believe that since, as you said, doing something as simple as a reserve before adding the elements would already fix that issue. I cannot believe something that simple would not be part of anything worthy of calling itself a 'standard library'.
Well, maybe some of those broken console compilers have that issue, but that really should not be mentioned in a general programming context.

Though I would expect a sane implementation to make at most one allocation, in effect doing a "reserve" internally before the "push_back"s (and maybe skipping the "do I need to reallocate" check in the "push_back"-equivalent code).


Exactly. frob described a scenario where several reallocations would commonly happen in response to a resize(). I cannot really believe that since, as you said, doing something as simple as a reserve before adding the elements would already fix that issue. I cannot believe something that simple would not be part of anything worthy of calling itself a 'standard library'.
Well, maybe some of those broken console compilers have that issue, but that really should not be mentioned in a general programming context.

Even if there were several reallocations, when the vector reserves more space the growth is exponential. You're still not going to get frob's n-factorial moves that way. In total it's a linear number of moves. Whatever moves you get in the future, the container's analysis accounted for them with the amortized constant time of each push_back that originally put something in the vector.
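
To see that geometric growth in practice, a small standalone test can print the capacity whenever it changes; the exact growth factor is implementation-defined, but a typical run shows it growing roughly exponentially:

    #include <cstdio>
    #include <vector>

    int main ()
    {
        std::vector<int> v;
        std::size_t last_capacity = 0;
        for (int i = 0; i < 1000; i++)
        {
            v.push_back (i);
            if (v.capacity () != last_capacity)
            {
                last_capacity = v.capacity ();
                std::printf ("size %zu -> capacity %zu\n", v.size (), last_capacity);
            }
        }
        // The capacity grows geometrically, which is why a long run of
        // push_backs costs amortized constant time per element.
        return 0;
    }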

Of course, the improvement here was with the set, not the vector. The vector's reallocation is not the problem, especially because all those moves happen in parallel on their own threads. (Edit: I don't mean to imply the moves for one container are done in parallel. I mean each copy of the container in this particular problem can do the move portion of its resize independently of the others.) Switching the set to a vector-backed container got rid of whatever thread synchronization was done for the heap for every insert and erase, which kept the OpenMP threads from doing much work in parallel.

Did anyone try CUDA Thrust instead of the STL to improve the performance of the DijkstraComputePaths2 algorithm? I have tried CUDA Thrust programming but got stuck on std::priority_queue; there is no alternative to it in CUDA Thrust.

 

2-year necro. Locking.

-- Tom Sloper -- sloperama.com

This topic is closed to new replies.
