# Strange slowness, std::vector, MSVC++9

## Recommended Posts

Hi. The following code is a render list, containing a list of RenderNodes:

```cpp
class RenderList;
class RenderNode;

#ifndef class_RenderList_h
#define class_RenderList_h 1
#include "../StaticMesh.h"
#include "../gMaths.h"

class RenderNode
{
public:
    RenderNode() {}
    RenderNode(StaticMesh &m, g_transform &p) : mesh(m), pos(p) {}

    StaticMesh mesh;
    g_transform pos;
};

class RenderList
{
public:
    RenderList()
    {
    }

    ~RenderList()
    {
        Flush();
    }

    void AddNode(StaticMesh& m, g_transform& p)
    {
        RenderNode newnode(m, p);
        nodes.push_back(newnode);
    }

    void Flush()
    {
        nodes.clear();
    }

    void ConsoleDump()
    {
        for (unsigned int index = 0; index < nodes.size(); index++)
        {
            std::cout << "RenderList contains node named " << nodes[index].mesh.name << "\n";
        }
    }

    // Version 1 -- TAKES 60ms
    void Append(const RenderList& list)
    {
        nodes.reserve(list.nodes.size());
        nodes.insert(nodes.end(), list.nodes.begin(), list.nodes.end());
    }

    // Version 2 -- TAKES 21ms WITH SAME DATA (comment one version out to test)
    void Append(const RenderList& list)
    {
        nodes.reserve(list.nodes.size());
        unsigned int end = list.nodes.size();
        for (unsigned int index = 0; index < end; ++index)
        {
            nodes.push_back(list.nodes[index]);
        }
    }

private:
    std::vector<RenderNode> nodes;
    friend class Renderer;
};
#endif
```


Now, for some reason the first version of RenderList::Append() is slower than the second: it takes 60ms to copy 227 elements into the first render list, compared with 21ms for the second version performing the same operation. All I am doing to test the two is commenting one out, leaving the other, and recompiling. Can anybody shed some light on why the first is slower than the second? Surely a single insert() should be faster than a series of push_back()? Thanks. [Edited by - speciesUnknown on August 9, 2008 10:07:44 PM]

---
If you call push_back over and over, the vector will need to be reallocated and copied each time its capacity is exceeded.

However, if you call insert, the vector can perform a single reallocation to accommodate all the new elements.

D'oh. That's what I get for skimming your post.

---
Quote:
 Original post by fpsgamer: If you call push_back several times, the vector may need to be resized as many as 'n' times. However, if you call insert it can perform a single resize to accommodate all the new elements.

I don't think I made myself clear. The first one uses insert() and takes 60ms; the second one uses a series of push_back() and takes 21ms. I would expect the opposite effect.

Quote:
 D'oh. That's what I get for skimming your post.

Yeah, I'm guilty of that occasionally.
But still, why the bizarre slowness? Everything I'm reading suggests that insert() should be faster than a load of single calls to push_back().

---
Well, it's no longer a performance issue but rather an absurdity, because by adding some basic optimisations elsewhere I have reduced the gather time to less than a millisecond.

But I am still wondering how insert() can be slower than a load of push_back() calls.

---
AFAIK, the standard library makes no performance claims regarding a single call to std::vector<T>::insert vs. repeatedly calling std::vector<T>::push_back. All I could find in Josuttis is a mention that calling insert can be expected to be faster when:

• Elements are inserted or removed at the end

• The capacity is large enough on entry

• Multiple elements are inserted by a single call rather than by multiple calls

Note that the last point is with respect to multiple calls to insert, rather than multiple calls of push_back.

One thing I did notice is that you call reserve on your existing list with just the size of the list to be appended. reserve() specifies the desired total capacity, not an extension to the existing capacity, so you should more properly write:

```cpp
nodes.reserve(nodes.size() + list.nodes.size());
```

---
Quote:
 Original post by legalize: [...] One thing I did notice is that you call reserve on your existing list with just the size of the list to be appended. reserve() specifies the desired total capacity, not an extension to the existing capacity, so you should more properly write: nodes.reserve(nodes.size() + list.nodes.size());

Right, I heeded what you said about reserve() being the total amount of space. However, this made no difference. I'll bear it in mind in the future. Thanks a lot.

By reducing the amount of data being passed around when copying StaticMesh (it uses VBOs, so I now clear the initial array), I have brought the gather time down to a millisecond for the push_back method, and 3ms for the insert() method. I'm still pondering the difference in speed, however.

---
STL containers are deathly slow when compiled in Debug mode. If you're not already using Release mode, try switching to it and then see if performance is still skewed unusually.

---
Quote:
 Original post by speciesUnknown: Right, I heeded what you said about reserve() being the total amount of space. However, this made no difference. I'll bear it in mind in the future. Thanks a lot.

Yeah, this was just a side observation; I didn't expect that to be the source of the problem.

Quote:
 By reducing the amount of data being passed around when copying StaticMesh (it uses VBOs, so I now clear the initial array), I have brought the gather time down to a millisecond for the push_back method, and 3ms for the insert() method. I'm still pondering the difference in speed, however.

The only thing I can think of is that the nature of the operations (insert being more general than push_back) causes insert to do more book-keeping overhead: insert must be able to invalidate iterators because it can insert items in the middle of a container, whereas push_back only invalidates iterators when it triggers a reallocation, because it can only append to the container. The discussion in Josuttis for insert talks about its iterator-invalidating consequences.

I know it's a bit of a rat's nest, but you do have the whole source for insert and push_back in the standard library, so you can compare what they do and see why they perform differently.

If I had been doing this, I probably would have written:

```cpp
std::copy(list.nodes.begin(), list.nodes.end(), std::back_inserter(nodes));
```

I wonder how that would perform.

---
It's because of the way that they decided to implement vector::insert. Here's the source code:

```cpp
template<class _Iter>
    void _Insert(const_iterator _Where, _Iter _First, _Iter _Last,
        input_iterator_tag)
    {   // insert [_First, _Last) at _Where, input iterators
 #if _HAS_ITERATOR_DEBUGGING
    if (_Where._Mycont != this
        || _Where._Myptr < _Myfirst || _Mylast < _Where._Myptr)
        _DEBUG_ERROR("vector insert iterator outside range");
 #endif /* _HAS_ITERATOR_DEBUGGING */
    if (_First != _Last)
        {   // worth doing, gather at end and rotate into place
        size_type _Oldsize = size();
        size_type _Whereoff = _Where._Myptr - _Myfirst;
        for (; _First != _Last; ++_First)
            _Insert_n(end(), (size_type)1, (value_type)*_First);
        _Reverse(_Myfirst + _Whereoff, _Myfirst + _Oldsize);
        _Reverse(_Myfirst + _Oldsize, _Mylast);
        _Reverse(_Myfirst + _Whereoff, _Mylast);
        }
    }
```

It's those calls to _Reverse at the end which are causing the slowdown. Here's how it works.

Let's say you want to insert the sequence [a b c] into the middle of a vector containing [1 2 3 4 5 6]:

```
1 2 3 4 5 6
      ^
    a b c
```

The input is appended to the end:

```
1 2 3 4 5 6 a b c
^     ^     ^     ^
|     |     |     _Mylast
|     |     _Myfirst + _Oldsize
|     _Where
_Myfirst
```

First reverse ([4 5 6] becomes [6 5 4]):

```
1 2 3 6 5 4 a b c
```

Second reverse ([a b c] becomes [c b a]):

```
1 2 3 6 5 4 c b a
```

Third reverse ([6 5 4 c b a] becomes [a b c 4 5 6]):

```
1 2 3 a b c 4 5 6
```

The problem is that even if you're inserting at the end, two of these three reverses still occur. Taking the above example, if [a b c] were inserted at the end, it'd be reversed to [c b a], then immediately re-reversed back to [a b c]. If you have non-trivial copy constructors, this can turn out to be a very slow operation.

I wonder what Dinkumware's justification for this method was. No matter what you do, this method of "rotating into place" will always involve more copies than simply resizing the vector, shifting across all the elements after _Where, and then copying the new elements directly into place. Let's take the worst-case scenario: inserting at the beginning of a vector.

Take for example, inserting [a b c d] into the beginning of a vector containing [1 2 3 4 5 6]. Dinkumware's implementation will cause 2+3+5 = 10 swaps. Each swap is comprised of 3 copies, so that's 30 copies. But if the vector was simply increased in size by 4 elements, then the sequence [1 2 3 4 5 6] moved across by 4 elements that would only be 4 copies. Then the sequence [a b c d] would then be inserted into the now empty first 4 elements.

[Edited by - Sc4Freak on August 10, 2008 12:18:57 AM]

---
Lots of helpful suggestions; however, I would like to clarify a few things:

1) I am compiling in release mode with debug symbols disabled.

2) I have disabled the secure SCL.

3) I am copying by value, but only a small amount of data. Class StaticMesh looks like this:

```cpp
class StaticMesh
{
public:
    StaticMesh();
    ~StaticMesh();
    void addTriangle(const std::string& texture, static_triangle triangle);
    // that single triangle is added
    void finalise();
    // mesh is finalised into a number of vertex buffers
    void OpenGLDraw(TextureTable * m);
    std::map<std::string, triangle_list> triangle_lists;
    std::map<std::string, VertexBufferObject> vert_arrays;
private:
    void m_immediate_mode_draw(TextureTable * m);
    void m_vertex_array_draw(TextureTable * m);
    // one vector of polygons per material
    // no use of vertex references, each polygon keeps its own data.
    GLuint VBOvertices;
    GLuint VBOtexcoords;
    bool finalised;
public:
    void merge(const StaticMesh& sauce);
};
```

The std::map of <std::string, triangle_list> is deleted after finalise() is called, and then cleared. When I was not doing this, I was getting gather times of 20ms with push_back and 60ms with insert(); now I get 1ms and 3ms respectively.

The map of <std::string, VertexBufferObject> also holds a small amount of data. This is what the VertexBufferObject class looks like:

```cpp
class VertexBufferObject
{
public:
    VertexBufferObject();
    VertexBufferObject(const triangle_list& source_data);
    void Draw();
private:
    void fillArrays(const triangle_list& in_data);
    void EnableStates();
    void DisableStates();
    GLuint length;
    GLuint VBOVertices;
    GLuint VBOTexCoords;
    friend class StaticMesh;
};
```

However, I want to highlight that the ratio is still 3:1 after I drastically shrank the amount of data stored in them (1ms and 3ms, compared with 20ms and 60ms).

---
Quote:
 Original post by Sc4Freak: Take for example, inserting [a b c d] into the beginning of a vector containing [1 2 3 4 5 6]. Dinkumware's implementation will cause 2+3+5 = 10 swaps. Each swap is comprised of 3 copies, so that's 30 copies. But if the vector was simply increased in size by 4 elements, then moving the sequence [1 2 3 4 5 6] across by 4 elements would only be 4 copies. Then the sequence [a b c d] would be inserted into the now empty first 4 elements.

Umm... I think you underestimated it a little. The sequence [1 2 3 4 5 6] will require 6 elements to be copied when it's moved down to make room for [a b c d] at the beginning. And if you need to increase the capacity of the vector, I would expect the shifting of the elements down the vector to happen as part of copying them into the new storage.

Generally I think of insert as an operation for linked-list type structures, and I don't expect it to be fast on a vector.

---
@Sc4Freak: I am also not sure why the Dinkumware library implements insert the way it does, but I think it may have to do with the fact that the objects stored in the vector do not need to be default-constructible.

---
One millisecond for gathering a render list is anything but trivial; that should be at most a tenth of that.

For a start, you should reserve the render list vector right at the beginning of your level or render frame, not with each insertion that expands it, as any actual reallocation is going to invoke the copy constructor for every contained object. Put in an assert for when the render list does need to expand, and manually change the start-up reserve size for your next run, or log the expansion for later checking.

std::string isn't just a wrapper around a char*. Some implementations also contain an internal array of around 12 chars to avoid unnecessary allocations when working with small strings. This adds to the non-triviality of your copy constructor, not to mention the overhead of actually allocating new storage for the new string's char* if the contained string is longer than that.

With this in mind, I'd seriously recommend that you don't use any data types larger than 4-16 bytes in an array that will undergo more than a few modifications per frame, and especially don't do so with non-POD (plain old data) objects.

If you can ensure the lifespan of your render objects, then you should be using references / pointers to them instead.

If you really need transient data, then you should keep a pool of them with a free stack that references the pool elements, grabbing and returning free elements by pushing and popping the back of the free list as and when you need them.

Having said that, I'm willing to bet that a pre-reserved vector of pointers to the render objects, new-ing / deleting them as you need them, will be faster than your current implementation, without the bother of creating a pool.

---
Quote:
 Original post by Glen Corby: One millisecond for gathering a render list is anything but trivial; that should be at most a tenth of that.

...assuming that it even matters. If profiling shows that you need to improve the things you're talking about, then by all means go ahead and do that. But what if it doesn't? Then you're just wasting your time.

First get it right, then get it tight.

Don't waste valuable time optimizing something that may not need it.

---
Quote:
Original post by legalize
Quote:
 Original post by Glen Corby: One millisecond for gathering a render list is anything but trivial; that should be at most a tenth of that.

Don't waste valuable time optimizing something that may not need it.

Sure thing, but there is an element of style at play here; we're not just talking about optimisation, we're talking about best practices. The pool idea is probably taking things a bit far when you're not doing final polish and optimisation, but avoiding unnecessary copying of nested structures is something you should strive for from the very first line of code you write.

All other things being equal, when considering design and implementation, you should start heading down the path that isn't immediately shooting you in the foot.

---
Quote:
 Original post by Glen Corby: ...avoiding unnecessary copying of nested structures is something you should strive for from the very first line of code you write.

Maybe, but it could be fussing over something that doesn't matter. There are very few universal rules in programming, particularly with C++ because it's so versatile and can be applied in so many ways.

I have no problem copying strings around until it's shown to be a problem. I prefer to treat strings as values.

---
Quote:
 Original post by legalize: Maybe, but it could be fussing over something that doesn't matter. There are very few universal rules in programming, particularly with C++ because it's so versatile and can be applied in so many ways. I have no problem copying strings around until it's shown to be a problem. I prefer to treat strings as values.

True, but if you choose not to fuss over things like this, languages such as Java and C# are much better suited to you; they are just as versatile, and their default handling of complex types ends up being faster in the worst case.

You probably won't see the overhead of string copies due to the speed of modern processors and memory access. And you're right, if it hasn't been shown to be a problem then you will most likely have bigger fish to fry elsewhere.

It's just that in this case the OP had an issue with copy-by-value that was crippling until he modified the way he did things, which reduced the crippling cost to a manageable one (20ms down to 1ms). The cost of a std::string copy, a std::map copy and a few unsigned int copies is in the manageable bracket for this case, so he can stop here.

But if the OP's design had been different from the outset, such that copy-by-value was avoided in core engine code where possible, then he wouldn't have had to take several hours out of his time midstream to figure out why something supposedly innocuous was taking so long.

I'm not saying this is an absolute rule; I'm just saying that this is a case of balancing cost, risk and reward. Once you start factoring in the general performance cost of copy-by-value, the development-time cost of avoiding it rapidly falls away as you adopt it as your default mode of operation; it becomes very easy and essentially free.

---
Quote:
 Original post by Glen Corby: True, but if you choose not to fuss over things like this, languages such as Java and C# are much better suited to you...

I didn't say I'd never fuss over it; I said I wouldn't fuss over it until it's shown to be a problem. If it's going to be a problem, it's going to be a problem in one structure that can be modified to use shared_ptr<std::string> instead of a std::string value. It's no big deal to refactor a single class because it's being copied all over the place and has an expensive copy for members that needn't be unique in each copy.

Quote:
 It's just that in this case the OP had an issue with copy-by-value that was crippling until he modified the way he did things, which reduced the crippling cost to a manageable one (20ms down to 1ms).

Crippling cost? Now you're reading into the original post. As I read it, they wondered why the one was much slower than the other. No mention of "crippling costs".

---
The "function call overhead" will be mitigated because all of that code will be inlined.

The reverse bit is odd; I think I would have just implemented insert as a for loop using push_back... Does the reverse trick avoid copy construction?

The way to get a faster copy is to exploit the CPU: you need to get the code to issue a `rep movsd` instruction, which a good memcpy will use.

You can forcibly .resize the vector and then bit-blast the data in (only safe if T is trivially copyable):

```cpp
size_t prevSize = v.size();
size_t newSize = prevSize + n;
v.resize(newSize);
memcpy(&v[prevSize], &newData[0], n * sizeof(T));
```

---
Quote:
 Original post by legalize: I didn't say I'd never fuss over it; I said I wouldn't fuss over it until it's shown to be a problem. If it's going to be a problem, it's going to be a problem in one structure that can be modified to use shared_ptr<std::string> instead of a std::string value...

I'm sorry for implying that you're using the wrong language, and see my concession:

Quote:
 And you're right, if it hasn't been shown to be a problem then you will most likely have bigger fish to fry elsewhere.

However I still strongly contend that it is a good habit to get into from the outset unless you have massive convenience benefits from copy by value.

Quote:
 Original post by legalize: Crippling cost? Now you're reading into the original post. As I read it, they wondered why the one was much slower than the other. No mention of "crippling costs".

I definitely do consider a real-time application that at best will run at 15-40 fps in release mode, if doing nothing else but gathering a render list, to be crippled.

---
Quote:
 Original post by Shannon Barber: You can forcibly .resize the vector then bit-blast the data in.

Yeah, you can do that. But if that's your approach to C++, just go back to C. Please.

---
Quote:
 Original post by Glen Corby: I definitely do consider a real-time application that at best will run at 15-40 fps in release mode, if doing nothing else but gathering a render list, to be crippled.

Well, that's a judgment for the original poster to make, not me or you. You're reading a lot of assumptions into his post that he didn't state. They may be true and the poster may be speaking from the same perspective as you. Or they may not.

I gave up trying to mind read people's specific situations in their posts long ago.

---
Guys and girls,

I'm aware that there is considerable overhead in my copy constructor. My only wonderment was why on earth the supposedly faster insert() is consistently three times slower than iterating and using push_back(). I'm aware that my gather time is too slow, and that this is caused by the amount of data I was passing around.

I'm reading your comments and taking them as advice on how to remove the other overheads. My initial question has been answered now: insert() has to work for element types that aren't default-constructible, and this implementation pays for that generality.

Basically, there will be a small number of render list nodes, fewer than 100. The mesh I am testing with has rather a lot because it loads the entire level and sends all nodes to be rendered every frame. However, in the finished product this display list will change only when the player moves into a new portalised chunk, so the 1ms gather time is optimised enough. For dynamic objects, I will store the meshes on the stack, and the render node will consist of just a view matrix and a reference to the appropriate object.

[Edited by - speciesUnknown on August 10, 2008 10:02:17 PM]

---
Quote:
 Original post by speciesUnknown: I'm reading your comments and taking them as advice on how to remove the other overheads.

IMO, you're already doing the most important thing: optimizing from measurements. The rest of us can only guess at what you're doing, but you've got the data in front of you to guide you on your quest. Just remember that order-of-magnitude improvements in performance generally come from choosing better algorithms or data structures (which is where using std::string as a value type or a reference type comes in).

---
I'm using std::string as a key type in various places, as well as to store names which are only changed once at load time. Is there a better alternative? I once wrote a C-library wrapper around char* for a bullshit exam where we had to use C strings, so perhaps I could use that instead and overload the < operator?
