STL Performance Thread

79 comments, last by ApochPiQ 14 years, 1 month ago
Quote:Original post by cache_hit
He's trying to create a hypothetical example where the vector is going to do a lot of copying due to having to rearrange lots of stuff. This is a pathological example of poor performance given the right circumstances and a sufficiently expensive-to-copy object. Regardless, the problem is non-existent with r-value references, so there's no point in even discussing it IMO.
It's also non-existent if you simply decide not to copy the objects while sorting them. Just sort indices or pointers instead of the actual heavyweight objects...
Design choice is obviously more important than library choice here.
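For example, a minimal sketch of the index-sort approach (the class and function names here are illustrative, not from any particular codebase):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Heavyweight {
    int key;
    std::vector<int> payload; // expensive to copy
};

// Comparator that orders indices by the key of the object they refer to.
struct ByKey {
    const std::vector<Heavyweight>* items;
    explicit ByKey(const std::vector<Heavyweight>& v) : items(&v) {}
    bool operator()(std::size_t a, std::size_t b) const {
        return (*items)[a].key < (*items)[b].key;
    }
};

// Sort a permutation of indices; the Heavyweight objects themselves never move.
std::vector<std::size_t> sorted_indices(const std::vector<Heavyweight>& items) {
    std::vector<std::size_t> idx(items.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        idx[i] = i; // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(), ByKey(items));
    return idx;
}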

If you do the equivalent operations without the STL (i.e. copying a vector == malloc/free) I don't see how there's going to be any performance difference. In his example, he's solving it in a stupid way with the STL and a smart way without it -- this doesn't prove the STL is slow, it just proves that one design is stupid.

Obvious fallacy is obvious.
Quote:Original post by Hodgman
Quote:Original post by cache_hit
He's trying to create a hypothetical example where the vector is going to do a lot of copying due to having to rearrange lots of stuff. This is a pathological example of poor performance given the right circumstances and a sufficiently expensive-to-copy object. Regardless, the problem is non-existent with r-value references, so there's no point in even discussing it IMO.
It's also non-existent if you simply decide not to copy the objects while sorting them. Just sort indices or pointers instead of the actual heavyweight objects...
Design choice is obviously more important than library choice here.

If you do the equivalent operations without the STL (i.e. copying a vector == malloc/free) I don't see how there's going to be any performance difference. In his example, he's solving it in a stupid way with the STL and a smart way without it -- this doesn't prove the STL is slow, it just proves that one design is stupid.

Obvious fallacy is obvious.


Oh, I agree, but that was too obvious to even mention :). I mentioned && refs because they make vector<> *hands down* superior. Try storing objects by value in a C-style array, and then let's see how your hand-written sort function performs compared to a move-enabled implementation.
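For anyone unfamiliar with what "move-enabled" means, a minimal sketch (C++0x; the class name is illustrative):

#include <utility> // std::move
#include <vector>

struct Heavy {
    std::vector<int> data;

    Heavy() {}
    // C++0x move constructor: steals the vector's internals (a few pointers)
    // instead of copying every element, so sorting never duplicates the payload.
    Heavy(Heavy&& other) : data(std::move(other.data)) {}
    Heavy& operator=(Heavy&& other) {
        data = std::move(other.data);
        return *this;
    }
};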



On another note, the subject of template code bloat is highly exaggerated. Yeah, it exists, but not nearly as much as you (and by "you" I mean the indefinite you) think. And news flash: if you need a list of ints and a list of floats, then guess what - the code has to exist somewhere. The difference is that the compiler can often share the same code between multiple parameterizations, while you can't - unless you store everything as void*, in which case there's no point even discussing it anymore.
Quote:
If you have code that proves that your mechanism is 150ms and the STL version is ridiculously slow, then post it, and I'm sure we can analyze it and tell you what you're doing differently and how to use STL more efficiently.

This, a thousand times over. When I first started using the STL I didn't know about all the tricks. You can reserve, resize, set custom allocators, replace iobufs, use range-checked accessors, etc. Not every operation is "fast" by default, but it IS "amortized fast" for a general set of operations. I think too many people put some esoteric case into their program and then complain that the STL is TOO SLOW!
For instance, they'll have a vector<int> and push_back() a million items without calling reserve(), then complain that new int[1000000] is faster. I know I've made the mistake of not reserving before.
Or they complain that std::list<int> is too slow and replace it with new int[1000]... well, you've changed data structures. Did you really want a list? Or an array? Maybe you actually need a tree? A map? A set?
Or they complain that find() is too slow. Well, it is a linear search. On sorted data you COULD use lower_bound, upper_bound, or equal_range and do a binary search (see the sketch below).
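A minimal sketch of the first and last of those points, assuming nothing beyond standard C++:

#include <algorithm>
#include <vector>

int main() {
    // reserve() up front: one allocation instead of repeated regrowth.
    std::vector<int> v;
    v.reserve(1000000);
    for (int i = 0; i < 1000000; ++i)
        v.push_back(i);

    // find() is a linear search; on sorted data, lower_bound is O(log n).
    std::sort(v.begin(), v.end());
    std::vector<int>::iterator it = std::lower_bound(v.begin(), v.end(), 42);
    bool found = (it != v.end() && *it == 42);
    (void)found;
    return 0;
}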
This type of benchmark is notoriously pointless anyway.

The first question to ask is why we are sorting this many things, or things of this type, at all. If this needs to be done frequently, first split the storage into hot and cold parts. The hot part holds a simple pair of a key and a pointer to the actual data. This pair should be POD.

The second question to ask is whether we can benefit from a different approach. Can we sort less? Can we use multiple cores? Can we change the complexity of the problem?

Finally, what does the user actually want to do? It is highly unlikely that they will actually want to sort a million gizmos. They will probably want to get the top 10 movies - which is a completely different problem altogether (see the sketch below).
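A minimal sketch of both ideas - a POD hot part pointing into cold storage, and partial_sort for a top-N query (all names here are illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Movie; // heavyweight cold data, defined elsewhere

// Hot part: a POD pair of sort key and pointer into the cold storage.
struct HotEntry {
    int rating;        // the key we sort on
    const Movie* data; // pointer to the actual record, which never moves
};

bool by_rating_desc(const HotEntry& a, const HotEntry& b) {
    return a.rating > b.rating;
}

// "Top 10 movies" only needs the first 10 in order - partial_sort does far
// less work than a full sort of a million entries.
void top10(std::vector<HotEntry>& hot) {
    std::size_t n = hot.size() < 10 ? hot.size() : 10;
    std::partial_sort(hot.begin(), hot.begin() + n, hot.end(), by_rating_desc);
}

int main() {
    std::vector<HotEntry> hot; // filled from the cold storage elsewhere
    top10(hot);
    return 0;
}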


As far as graphics algorithms go, there isn't much to say. Papers are written in an academic manner, but they usually involve only very trivial constructs, all of which can be implemented in an elegant and highly compact manner using in-place and implicit data structures.

A typical example would be the recursion-stack problem with quicksort. I find it simply unworthy of mention - quicksort and many other recursive algorithms can be trivially converted to an explicit-stack version (sketch below). This simply isn't worth discussing or worrying about; it's semantically equivalent to going from a for loop to a while loop.
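For completeness, a minimal sketch of that conversion (Lomuto partition; illustrative, not tuned):

#include <algorithm> // std::swap
#include <utility>   // std::pair, std::make_pair
#include <vector>

// Quicksort with an explicit stack of (lo, hi) ranges instead of recursion.
void quicksort(int* data, int n) {
    std::vector<std::pair<int, int> > stack;
    stack.push_back(std::make_pair(0, n - 1));
    while (!stack.empty()) {
        int lo = stack.back().first;
        int hi = stack.back().second;
        stack.pop_back();
        if (lo >= hi) continue;
        // Lomuto partition around data[hi].
        int pivot = data[hi];
        int i = lo;
        for (int j = lo; j < hi; ++j)
            if (data[j] < pivot) std::swap(data[i++], data[j]);
        std::swap(data[i], data[hi]);
        // Push the two subranges; these replace the recursive calls.
        stack.push_back(std::make_pair(lo, i - 1));
        stack.push_back(std::make_pair(i + 1, hi));
    }
}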

There is a very tiny edge case to be made for hard-coding (globals) and other implicit relations, but the benefits gained from them are simply no longer viable under today's deadlines and cost models.

Effects of architecture - cache, synchronous vs. asynchronous, dynamic vs. static, etc. - apply to all languages and libraries, and are mostly a design decision which should be taken into account at the planning stage to set the overall style of implementation. This is what will determine the overall characteristics of the application more than any individual choice of algorithm or optimization. It is also the area that is notoriously hard to get right the first time, since with today's hardware capacities the peak behavior will not be obvious from prototypes. This is the area where expertise does pay off, but very few pay attention to it or even consider it.

This willful ignorance of experience is mostly a consequence of abysmally cheap labor costs, which make it cheaper to rewrite five times than to hire a single expert at five times the price. Practice shows that this approach is economically superior as far as the business bottom line goes.

Whether this is good, bad or ugly is somewhat beside the point - just like complaining that rain is wet. Bring an umbrella - or learn why the business works this way, and try to find a way to convince them they should be doing it differently. Most won't care, most won't benefit, but those that do will likely be the people who understand the underlying problems themselves.
Quote:Original post by Hyrcan
Quote:If you want to convince yourself the STL is a joke, just make a class with a vector of ints and a couple of other primitive members - basically a typical object of some kind. Then make a vector of these classes, each with a dozen ints in the child vector. Add about a million to the vector, then sort it and see how long that takes to complete (for comparison, I can run the same code with my own implementation in about 150ms).

Funny thing is, that takes a few ms using the STL on my machine. So, what now?

You sorted a vector of a million classes on a random key value in the class, each containing a vector with 12 items in it, in a few milliseconds, without taking some kind of shortcut? Not that I don't believe you, but... I don't believe you, because I've recently benchmarked it pretty thoroughly in VC++ and all the benchmarks I see point to other implementations being about the same or even worse.

The result I get is that it takes about 135ms to sort 1 million ints with the STL.
Which to me is a pretty good result, but how often do you have a static array of a million ints?

It takes about 450ms if you use a vector instead of an array. Which is kind of crappy - not enough to slit your wrists over, but since this is the real-world case for any large data, it does kind of suck. Especially when you have sorting as part of a much more complex algorithm. A case that comes up again and again and again.

If you move to the scenario I describe, it will be executing till the day you die unless you replace std::swap and optimize the copy constructor (and doing that for hundreds of classes is not a fun prospect). Which to me is kind of ridiculous. The interface here is bad enough to suffer, but aside from benchmarks, almost everything you want to sort is going to be something more complicated - like a linear octree node, a vertex, that kind of thing.
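(For reference, the std::swap replacement described here is the standard pre-C++0x workaround - a minimal sketch, with an illustrative class name:)

#include <algorithm> // std::swap, std::sort
#include <vector>

struct Gizmo {
    int key;
    std::vector<int> payload; // expensive to copy
    bool operator<(const Gizmo& other) const { return key < other.key; }
};

namespace std {
    // Full specialization of std::swap: sorting now swaps the vectors'
    // internals (a few pointers) instead of copying every payload element.
    // Note it helps the swaps, but not the copies sort makes for small
    // subranges - which is why the copy constructor matters too.
    template<>
    void swap(Gizmo& a, Gizmo& b) {
        swap(a.key, b.key);
        a.payload.swap(b.payload); // O(1)
    }
}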


Quote:Original post by Codeka
Quote:Original post by thatguyfromthething
Quote:It's certainly not possible without having a profiler
The profiler is another misconception; it's like saying you need a debugger to figure out bugs. It's just raw data: it catches things that amount to mistakes, but it really tells you very little. Like the debugger, it's easy to misuse as a crutch that negates actual thought - it's not there to debug logic, but to catch simple mistakes, and that's it.
Are you telling me you don't use a profiler when doing performance analysis? If not a profiler then what? How do you know "Implementation A" is faster or slower than "Implementation B" if you don't profile?


You can't get very far just optimizing the obvious; whole-program performance is almost always a bigger issue, especially as your program becomes larger and more complex.

Code that runs great in a simple test might cause problems in a large, long-running program, and when people say things like "code size doesn't matter", or that there's no reason to make your own data structures, it tells me they haven't dealt with these issues before. Especially if they say there's no reason to code them at all. If all you use is one thing and you have never done it yourself, how can you even think you know anything about it?

Maybe the confusion is in thinking containers = all data structures.

[Edited by - thatguyfromthething on February 26, 2010 5:35:30 PM]

This is my thread. There are many threads like it, but this one is mine.

Quote:Original post by thatguyfromthething
You sorted a vector of a million classes on a random key value in the class, each containing a vector with 12 items in it, in a few milliseconds, without taking some kind of shortcut? Not that I don't believe you, but... I don't believe you, because I've recently benchmarked it pretty thoroughly in VC++ and all the benchmarks I see point to other implementations being about the same or even worse.

The result I get is that it takes about 135ms to sort 1 million ints with the STL.
Which to me is a pretty good result, but how often do you have a static array of a million ints?

It takes about 450ms if you use a vector instead of an array. Which is kind of crappy - not enough to slit your wrists over, but since this is the real-world case for any large data, it does kind of suck. Especially when you have sorting as part of a much more complex algorithm. A case that comes up again and again and again.

If you move to the scenario I describe, it will be executing till the day you die unless you replace std::swap and optimize the copy constructor (and doing that for hundreds of classes is not a fun prospect). Which to me is kind of ridiculous. The interface here is bad enough to suffer, but aside from benchmarks, almost everything you want to sort is going to be something more complicated - like a linear octree node, a vertex, that kind of thing.


Stop posturing and offering vague descriptions, and post your benchmark code.

At the risk of overstepping ApochPiQ's authority: I'm saying that in my Moderator Voice(TM). Put your money where your mouth is, or get out of the thread.

Other people posted actual code and results that disagree with your assertions. So far you have only words.

I do not like disinformation and rumour-mongering.
Quote:Original post by thatguyfromthething
The result I get is that it takes about 135ms to sort 1 million ints with the STL.
Which to me is a pretty good result, but how often do you have a static array of a million ints?

It takes about 450ms if you use a vector instead of an array.


No. MSVC 2008 Professional, standard release build settings:

#include <boost/timer.hpp>
#include <boost/random.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

int a[1000000]; // to avoid a stack crash

int main() {
    boost::mt19937 rng;
    boost::variate_generator<boost::mt19937&, boost::uniform_int<> > intrng(rng, boost::uniform_int<>(0, 99999999));
    std::vector<int> v(1000000);
    std::generate(v.begin(), v.end(), intrng);
    std::copy(v.begin(), v.end(), a + 0);

    boost::timer asort;
    std::sort(a + 0, a + 1000000);
    double asort_time = asort.elapsed();

    boost::timer vsort;
    std::sort(v.begin(), v.end());
    double vsort_time = vsort.elapsed();

    std::cout << "Time to sort A: " << asort_time << "\n";
    std::cout << "Time to sort V: " << vsort_time << "\n";
}

Time to sort A: 0.117
Time to sort V: 0.114
Press any key to continue . . .


#include <boost/timer.hpp>
#include <boost/random.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

int a[1000000]; // to avoid a stack crash

int main() {
    boost::mt19937 rng;
    boost::variate_generator<boost::mt19937&, boost::uniform_int<> > intrng(rng, boost::uniform_int<>(0, 9));
    std::vector<int> v(1000000);
    std::generate(v.begin(), v.end(), intrng);
    std::copy(v.begin(), v.end(), a + 0);

    boost::timer asort;
    std::sort(a + 0, a + 1000000);
    double asort_time = asort.elapsed();

    boost::timer vsort;
    std::sort(v.begin(), v.end());
    double vsort_time = vsort.elapsed();

    std::cout << "Time to sort A: " << asort_time << "\n";
    std::cout << "Time to sort V: " << vsort_time << "\n";
}


Time to sort A: 0.017
Time to sort V: 0.017
Press any key to continue . . .
Quote:Original post by thatguyfromthething
Quote:Are you telling me you don't use a profiler when doing performance analysis? If not a profiler then what? How do you know "Implementation A" is faster or slower than "Implementation B" if you don't profile?

You can't get very far just optimizing the obvious; whole-program performance is almost always a bigger issue, especially as your program becomes larger and more complex.
And what exactly do you think 'whole program performance' is? In the context of a sequential program, it is the sum of the performance of each component. Improve the performance of the slowest component, and you improve the whole.

You should be familiar with Amdahl's law, especially as it applies to sequential programs.
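(In brief: if a fraction p of total runtime is spent in the code you optimise, and you make that code s times faster, the overall speedup is 1 / ((1 - p) + p/s). Make something that accounts for 5% of the runtime infinitely fast, and the whole program still gets barely 5% faster - which is exactly why you profile to find the dominant components first.)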

Note that I am not recommending that you optimise these areas in isolation (as you implied). Instead, I am suggesting you use a profiler to locate the slowest areas in your *entire* program, and optimise them in that context.
Quote:If all you use is one thing and you have never done it yourself, how can you even think you know anything about it?
Pretty much everyone in this discussion has a theoretical background in data structures and algorithms, and I would warrant most of us have implemented plenty of each.

The onus is on you at this point to demonstrate that you have a theoretical and practical background in this subject, by bringing a concrete implementation and verifiable performance data to this discussion.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Show me teh code!
Quote:Original post by Zahlman
Quote:Original post by thatguyfromthething
You sorted a vector of a million classes on a random key value in the class, each containing a vector with 12 items in it, in a few milliseconds, without taking some kind of shortcut? Not that I don't believe you, but... I don't believe you, because I've recently benchmarked it pretty thoroughly in VC++ and all the benchmarks I see point to other implementations being about the same or even worse.

The result I get is that it takes about 135ms to sort 1 million ints with the STL.
Which to me is a pretty good result, but how often do you have a static array of a million ints?

It takes about 450ms if you use a vector instead of an array. Which is kind of crappy - not enough to slit your wrists over, but since this is the real-world case for any large data, it does kind of suck. Especially when you have sorting as part of a much more complex algorithm. A case that comes up again and again and again.

If you move to the scenario I describe, it will be executing till the day you die unless you replace std::swap and optimize the copy constructor (and doing that for hundreds of classes is not a fun prospect). Which to me is kind of ridiculous. The interface here is bad enough to suffer, but aside from benchmarks, almost everything you want to sort is going to be something more complicated - like a linear octree node, a vertex, that kind of thing.


Stop posturing and offering vague descriptions, and post your benchmark code.

At the risk of overstepping ApochPiQ's authority: I'm saying that in my Moderator Voice(TM). Put your money where your mouth is, or get out of the thread.

Other people posted actual code and results that disagree with your assertions. So far you have only words.

I do not like disinformation and rumour-mongering.


Not to mention that it is *completely nonsensical*. I mean, 3x as long to sort a vector of ints as an array of ints? That makes ABSOLUTELY NO SENSE. Notice the caps. You might as well say that unsigned addition is slower than signed addition. It shows a complete lack of understanding of... well, anything. Here's a hint: sorting a vector of ints is -binary equivalent- to sorting an array of ints.


Now, on to the other asinine claim - that you'll be here all day if you sort a vector of classes containing a vector of ints. I like how my repeated explanation that r-value references completely eliminate any performance problem has been so consistently ignored. Convenient. But nevertheless, I've decided to run some tests for you. And unlike you, I'm posting results.


#include "stdafx.h"#include <Windows.h>#include <vector>#include <algorithm>#include <numeric>#include <iostream>#include <time.h>using namespace std;LARGE_INTEGER freq;struct unmoveable_foo{public:	unmoveable_foo()	{		for (int i=0; i < 10; ++i)			vec_.push_back(rand());	}	vector<int> vec_;	int sum() const { return std::accumulate(vec_.begin(), vec_.end(), 0); }	bool operator<(const unmoveable_foo& other) const	{		return sum() < other.sum();	}};struct moveable_foo{	moveable_foo()	{		for (int i=0; i < 10; ++i)			vec_.push_back(rand());	}	moveable_foo(moveable_foo&& other)		: vec_(std::move(other.vec_))	{ }	moveable_foo& operator=(moveable_foo&& other)	{		vec_ = std::move(other.vec_);		return *this;	}	vector<int> vec_;		int sum() const { return std::accumulate(vec_.begin(), vec_.end(), 0); }	bool operator<(const moveable_foo& other) const	{		return sum() < other.sum();	}};struct rand_int{	rand_int() : i(rand()) {}	int i;	bool operator<(const rand_int& other) const	{		return i < other.i;	}};template<class T>void test_sort(){	cout << "Testing sort of " << typeid(T).name() << endl;	LARGE_INTEGER start;	LARGE_INTEGER stop;	std::cout << "Initializing vector..." << std::endl;	vector<T> vec(1000000);	std::cout << "Initializing array..." << std::endl;	T* array = new T[1000000];	std::cout << "Sorting..." << std::endl;	::QueryPerformanceCounter(&start);	sort(vec.begin(), vec.end());	::QueryPerformanceCounter(&stop);	cout << "vector: " << (float)(stop.QuadPart - start.QuadPart) / (float)freq.QuadPart << " seconds" << endl;	::QueryPerformanceCounter(&start);	sort(array, array+1000000);	::QueryPerformanceCounter(&stop);	cout << "array: " << (float)(stop.QuadPart - start.QuadPart) / (float)freq.QuadPart << " seconds" << endl;}int _tmain(int argc, _TCHAR* argv[]){	::QueryPerformanceFrequency(&freq);	srand(time(0));	test_sort<rand_int>();	test_sort<moveable_foo>();	test_sort<unmoveable_foo>();	std::cin.get();	return 0;}


Testing sort of struct rand_int
Initializing vector...
Initializing array...
Sorting...
vector: 0.0762761 seconds
array: 0.0781865 seconds
Testing sort of struct moveable_foo
Initializing vector...
Initializing array...
Sorting...
vector: 0.830706 seconds
array: 0.835166 seconds
Testing sort of struct unmoveable_foo
Initializing vector...
Initializing array...
Sorting...
vector: 1.42725 seconds
array: 1.51604 seconds


Please stop posting nonsense. Unless, that is, you feel like explaining why 0.83 seconds to sort 1,000,000 objects, each containing a vector of 10 items, with a comparison function that ADDS UP ALL 10 NUMBERS, is "executing till the day you die".


Quote:Original post by thatguyfromthething
I don't believe you because I've recently benchmarked it pretty thoroughly in VC++


quoted for hilarity.
