Sorting a std::vector efficiently

Started by
25 comments, last by cozzie 10 years, 1 month ago

One pass of bubble sort has exactly two cache misses since it traverses the data set exactly once, and strictly linearly. At the second cache miss, the hardware prefetcher kicks in.
For random input, it has on the average n/2 branch mispredictions if the compiler doesn't optimize properly, and zero with a compiler that uses CMOV.
In any case, its amortized runtime is constant. You can call it "temporal coherence", if you think that sounds more intelligent, but it's the same thing.

Since you like to childishly mangle quotes, be glad Bregma caught the fact that you don’t know what a stable sort is rather than myself. I’d not have been so kind.

However you decided to keep going, and now you’re not just arguing with me but with Wikipedia:


http://en.wikipedia.org/wiki/Bubble_sort#In_practice

Bubble sort is asymptotically equivalent in running time to insertion sort in the worst case, but the two algorithms differ greatly in the number of swaps necessary. Experimental results such as those of Astrachan have also shown that insertion sort performs considerably better even on random lists. For these reasons many modern algorithm textbooks avoid using the bubble sort algorithm in favor of insertion sort.

Bubble sort also interacts poorly with modern CPU hardware. It requires at least twice as many writes as insertion sort, twice as many cache misses, and asymptotically more branch mispredictions. Experiments by Astrachan sorting strings in Java show bubble sort to be roughly 5 times slower than insertion sort…


In any case, its amortized runtime is constant. You can call it "temporal coherence", if you think that sounds more intelligent, but it's the same thing.

No, they are not the same thing. An amortized constant run-time is the result of temporal coherence.

Frame-to-frame temporal coherence in the case of a render queue means using the sorted results from the previous frame to increase the performance of the sort on the current frame. By using temporal coherence you typically get the best case O(n) for both bubble sort and insertion sort, resulting in an amortized run-time constant.

Now the only question that remains is why you want to stick to a bubble sort so badly. It sounds as though you just want to argue by this point, especially by how you mangled my quote.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Advertisement

To answer the original question: just sort the indices, and use a radix sort. If your indices are 32-bit, you can use a three-pass radix sort with 11-bit buckets that will nicely fit into the CPU's L1 cache.

And you can also make the radix sort benefit from temporal coherence. But even if you don't, it will be much, much faster than a std::sort in comparison.

<off-topic>

(lots of aggressive stuff)

Look, if you weren't yelling "you idiots" every time someone says a word, and would instead read what people have to say instead of switching into aggressive mode, you might cause a lot less noise.

You claim that std::sort is excessively slow when it demonstrably isn't, based on an allegation (inability to inline comparisons) that isn't true at all. Several people tell you that you're wrong, and you turn around every time saying that it doesn't inline function pointers.

Someone (Ryan_001) mentioned you could even use Bubblesort for small data sets such as 64k items or less. While I don't fully agree with that (it's still a piss poor algorithm for a complete sort, so I whould still not use it when something better is readily available), I mentioned that Bubblesort has its merits as a 1-pass-per-frame sort if sorting depth is all you do.

This is a well-known and proven idiom. The technique has constant per-frame overhead, amortizing the work over several frames.

You didn't read or understand what I was saying at all because everybody except yourself is stupid (at least that's the message delivered by your answers). You stopped reading after "Bubblesort" and went into discussing completely unrelated details about a full Bubblesort, of which some, such as the cache misses, are outright wrong (and... irrelevant, because you're discussing a totally different thing).

Besides, when sorting indices (or pointers) with the intent of iterating the corresponding random-address objects in pseudo-random order subsequently, complaining about too many cache misses is a bit of a queer move anyway.

Similar story with this stable sort thing. It's not a problem in the first place. My example of what you need stable sorts for may or may not have been the 100% best fit (though I believe it is not wrong), but that doesn't make a difference. Sure, you like to play on sophistries, but to what avail. Whether or not std::sort is stable simply isn't an issue in this case.

Artefacts due to using a non-stable sort can only ever occur on transparent geometry. Further, it can only happen on different transparent objects which have exactly the same properties (as far as the renderqueue can tell) and occupy the same space, but still you treat them as "different" objects.

This is something that shouldn't happen in the first place, unless your model is "broken". Mind you, we're not talking about objects being somewhat close to each other or touching, or objects having similar properties.

But if it does happen (or can happen) for some reason there are at least two obvious solutions to it (well, three... stable sort would be the third). Either, you simply don't split something that to all appearances isn't two separate objects into two separable identical objects.

Or, you can include the object's index in the sort key as a preemptive measure, which automatically makes every sort stable, since objects that occupy the same space and have all the same properties are still binary different. No identical keys exist that could be ordered differently.

Either way, we're already 3 posts down discussing absolutely nothing except ego. Which really doesn't bring the thread forward.

(And it probably annoys everyone else as much as it annoys me, so I'm not going to continue that discussion.)

</off-topic>

To answer the original question: just sort the indices, and use a radix sort. If your indices are 32-bit, you can use a three-pass radix sort with 11-bit buckets that will nicely fit into the CPU's L1 cache.

This looks like indeed a nice idea for applying radix sort. There's so few places where it really fits.

I couldn't tell whether it's worth the trouble (still always using std::sort, which for anything practical surprisingly turns out just good enough every time, I've never had enough of a justification for actually using radix sort in my entire life), but from playing with it a decade or two ago, I remember that radix sort can easily be 3-4 times faster than e.g. quicksort/introsort.

So if the sort is indeed bottlenecking you out (that is, you don't meet your frame time, and profiling points to the sort), that would probably be a viable option.

Slight nitpick, though: One doesn't of course sort the indices, which would be somewhat pointless. You most certainly didn't mean to say that, but it kind of sounded like it.

One sorts the keys (moving indices). Or well, one sorts indices by key, or whatever one would call it.

Which means that most likely, 3 passes won't be enough, sadly (or you need bigger buckets), since you will almost certainly have at least 48 or 64-bit numbers to deal with (except if you have very few passes and render states, and are very unkind to your depth info).

Not using a stable radix sort to save temp memory may be viable (can instead concat the indices like described above if needed, even if this means an extra pass... the storage size difference alone likely outweights the extra pass because of fitting L1).


To answer the original question: just sort the indices, and use a radix sort. If your indices are 32-bit, you can use a three-pass radix sort with 11-bit buckets that will nicely fit into the CPU's L1 cache.

This looks like indeed a nice idea for applying radix sort. There's so few places where it really fits.

I couldn't tell whether it's worth the trouble (still always using std::sort, which for anything practical surprisingly turns out just good enough every time, I've never had enough of a justification for actually using radix sort in my entire life), but from playing with it a decade or two ago, I remember that radix sort can easily be 3-4 times faster than e.g. quicksort/introsort.

So if the sort is indeed bottlenecking you out (that is, you don't meet your frame time, and profiling points to the sort), that would probably be a viable option.

Slight nitpick, though: One doesn't of course sort the indices, which would be somewhat pointless. You most certainly didn't mean to say that, but it kind of sounded like it.

One sorts the keys (moving indices). Or well, one sorts indices by key, or whatever one would call it.

Which means that most likely, 3 passes won't be enough, sadly (or you need bigger buckets), since you will almost certainly have at least 48 or 64-bit numbers to deal with (except if you have very few passes and render states, and are very unkind to your depth info).

Not using a stable radix sort to save temp memory may be viable (can instead concat the indices like described above if needed, even if this means an extra pass... the storage size difference alone likely outweights the extra pass because of fitting L1).

Thanks for pointing out my mistake, yes I wanted to say that one should sort the keys :). What you get back are the indices which are then used to access your data.

Fully generic rendering-related data might not fit into a 32-bit key, that's true. But there are certain occasions where 32 bits are more than enough (or even 16 might suffice), e.g. for sorting particles.

As a quick note on performance, sorting 50k particles back-to-front using an ordinary std::sort on 16-bit keys takes 3.2ms on my machine, whereas an optimized radix-sort needs 0.25ms. If all you need to sort are keys that somehow index other data, it's almost always better to just sort that, and correctly index the data on access because this causes much fewer memory accesses during the sort.

See below.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

My, this thread turned nasty.

The point of a stable sort is that if for example you have a database of people who have a name ("John Smith") and an age ("45 years"), you first sort by name, and then later sort by age. If the sort is stable, this will give a list of people sorted by age, and within each group of people with the same age, names will be sorted alphabetically.

That's not a stable sort: that's a secondary sort. A stable sort will preserve the relative order of two items with the same primary sort key.

Just, you know, to be pedantic.

No, a secondary sort would result in people sorted by name first, then by age, if you're sorting in that order. You're right about the definition of a stable sort, but a result of that is what Samoth describes: the results of previous sorts remain in sorted order when they have the same value under the new sort criteria. It's a very good use case of a stable sort.

Wow, thanks a lot for all the input (and entertainment :rolleyes.gif )

Now let's see how I can use all this information.

What I've done so far:

- make the vectors I use for sorting, members of my class, no stack allocation each time I sort

- decided now to use std::sort instead of std::stable_sort, the chances that two floats with the distance of renderable to camera are exactly the same, are minimal.

In the near future I'll get into radix sorting and see if I can learn this.

- decided to implement some sort of 'temporal coherency', assuming that this means that I will reuse the sort results of the last time, to save time on sorting in the next iteration

- cull and sort each 2 frames instead of each frame (not noticable till now)

Here's the code I have now, with some flaws:


/** definition of the sorting vectors, in my class **/

	// Sorting vectors
	std::vector<int>	mBlendSortIndexTemp;	
	std::vector<int>	mBlendSortOrderTemp;	
	std::vector<float>	mBlendSortDistToCam;	

/** initial setting up the renderqueue **/

	// Vectors for sorting, note: size = total nr. of blended renderables
	mBlendSortIndexTemp.resize(nrBlendedRend);
	mBlendSortOrderTemp.resize(nrBlendedRend);
	mBlendSortDistToCam.resize(nrBlendedRend);

	for(int init=0;init<nrBlendedRend;++init) mBlendSortOrderTemp[init] = init;

/** the sorting function, called after the mBlendedBucket is filled with visible renderables **/

/**************************************************************************************/
/***							SORT BUCKET BLENDED									***/
/*** ==> usage: sort blended renderables, after the bucket is refilled 				***/
/*** ==> updates the order for blended renderables, back to front on cam distance	***/
/**************************************************************************************/

bool CRenderQueue::SortBucketBlended()
{
	if(mEffectDataCreated)
	{
		size_t bucketSize = mBucketBlended.size();

		for(size_t renderable=0;renderable<bucketSize;++renderable)
		{
			mBlendSortOrderTemp[renderable] = renderable;		// needed because of 'varying' bucket
			mBlendSortIndexTemp[renderable] = renderable;
			mBlendSortDistToCam[renderable] = mBucketBlended[renderable].DistToCam;
		}

		sort(mBlendSortOrderTemp.begin(), mBlendSortOrderTemp.begin() + bucketSize, [&](int a, int b)
			{ return mBlendSortDistToCam[a] > mBlendSortDistToCam[b]; });
		// replace "mBlendSortOrderTemp.being() + bucketSize" by ".end()", works only if bucket size/content doesn't vary
		// OR resize the vectors each time sorting is done?!

		mBucketBlendedIndex.resize(bucketSize);

		for(size_t results=0;results<bucketSize;++results)
			mBucketBlendedIndex[results] = mBlendSortIndexTemp[mBlendSortOrderTemp[results]];
			
		return true;
	}
	else return false;
}


The 'flaws'/ challenges I face;

- my mBlendedBucket content varies each frame, it contains only visible renderables at that moment

- my current order of doing things is:

** cull all renderables

** save the visible ones in the mBlendedBucket

** sort the mBlendedBucket into new index, NO temporal coherency (sort order is reset to 0, 1, 2, 3 etc. each new sort I do)

** render using the new index (don't sort the bucket objects itself)

With this situation, I have some disavantages:

- I have to resize the vector and/or reset the tempOrder (0, 1, 2 etc.)

- this throws away the possibility of having temporal coherency (/reusing the order from the last sort)

I've thought how to solve this, but can only conclude that this would only be possible when I always sort ALL renderables, instead of sorting only the visible ones.
I would appreciate some pointers on how to achieve my goal without having to sort ALL renderables (and not continue of the discussion std::sort versus insertion versus radix).

Any input is welcome.

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

/** Off topic.
I take full responsibility for how tragic this topic has become. Life is dealing me a double-fisted blow right now for the second time ever. I’m handling it better than the first time but somehow I was irresponsible enough to vent off a bit here. My problems aren’t over but I’m taking the reigns from here.
I could have replied in a much softer tone for sure, and my last post is just nonsensical (that’s not even the general idea of what I wanted to say).
So I need to apologize. It will not happen again. **/

I've thought how to solve this, but can only conclude that this would only be possible when I always sort ALL renderables, instead of sorting only the visible ones.

The solution is to fix this:

** sort the mBlendedBucket into new index, NO temporal coherency (sort order is reset to 0, 1, 2, 3 etc. each new sort I do)

I described it in my first post:

#1: Add render-queue items to a render queue.
- #a: If the total number of items is the same as in the last frame, continue.
- #b: If the total number of items is less than before, recreate the index list from 0 to total items, in order.
- #c: If the total number of items is greater than before, just add the new indices in order at the end of the list, not modifying the indices from the previous frame.

Here is the code I use:

#ifndef __LSSTD_INDEXSORTER_H__
#define __LSSTD_INDEXSORTER_H__

#include "../LSSTDStandardLib.h"
#include "LSSTDSearch.h"

namespace lsstd {

	/**
	 * Class CIndexSorter
	 * \brief Sorts data by index references.
	 *
	 * Description: Sorts data by index references.  Instead of moving the actual data around, it creates an index table
	 *	for that data and sorts the indices.  This is often faster than trying to sort data composed of many-byte elements.
	 *	The data being sorted must provide < and == operators.
	 */
	template <typename _tType>
	class CIndexSorter {
	public :
		// == Various constructors.
		LSE_CALLCTOR							CIndexSorter() :
			m_pui32Indices( NULL ),
			m_ui32TotalIndices( 0UL ),
			m_ui32AllocIndices( 0UL ) {
		}
		LSE_CALLCTOR							~CIndexSorter() {
			delete [] m_pui32Indices;
		}


		// == Operators.
		/**
		 * Assignment operator.
		 *
		 * \param _isOther The object to copy.
		 * \return Returns the copied instance.
		 */
		CIndexSorter<_tType> & LSE_CALL			operator = ( const CIndexSorter<_tType> &_isOther ) {
			if ( _isOther.m_pui32Indices > m_pui32Indices ) {
				// Have to allocate more.
				delete [] m_pui32Indices;
				m_pui32Indices = new LSUINT32[_isOther.m_pui32Indices];
				m_ui32AllocIndices = _isOther.m_pui32Indices;
			}
			m_ui32TotalIndices = _isOther.m_pui32Indices;
			for ( LSUINT32 I = m_ui32TotalIndices; I--; ) {
				m_pui32Indices[I] = _isOther.m_pui32Indices[I];
			}
			return (*this);
		}


		// == Functions.
		/**
		 * Sort the given data using an insertion sort.
		 *
		 * \param _ptData The data to sort.
		 * \param _ui32Total Total objects to sort.
		 * \return Returns a reference to this object.
		 */
		CIndexSorter<_tType> & LSE_CALL			InsertionSort( const _tType * _ptData, LSUINT32 _ui32Total ) {
			PrepareFor( _ui32Total );
			if ( _ui32Total <= 1UL ) { return (*this); }	// 0 or 1 items cannot be sorted.

			// Sort it.
			InsertionSort( m_pui32Indices, _ptData, m_ui32TotalIndices );
			return (*this);
		}

		/**
		 * Gets a pointer to the (sorted) index data.
		 *
		 * \return Returns a pointer to the (sorted) index data.
		 */
		const LSUINT32 * LSE_CALL				GetIndices() const { return m_pui32Indices; }

		/**
		 * Prepares to sort the given number of elements, but does not sort them.
		 *
		 * \param PARM The number of elements to prepare to sort.
		 */
		LSVOID LSE_CALL							PrepareFor( LSUINT32 _ui32Total ) {
			if ( _ui32Total > m_ui32AllocIndices ) {
				// Up-size the allocated amount.
				delete [] m_pui32Indices;
				m_pui32Indices = new LSUINT32[_ui32Total];
				m_ui32AllocIndices = _ui32Total;
				m_ui32TotalIndices = 0UL;
			}
			if ( m_ui32TotalIndices > _ui32Total ) {
				for ( LSUINT32 I = 0UL; I < _ui32Total; ++I ) {
					// Add indices that were not added before.
					m_pui32Indices[I] = I;
				}
			}
			else {
				for ( LSUINT32 I = m_ui32TotalIndices; I < _ui32Total; ++I ) {
					// Add indices that were not added before.
					m_pui32Indices[I] = I;
				}
			}
			m_ui32TotalIndices = _ui32Total;
		}


	protected :
		// == Members.
		/**
		 * Sorted indices.
		 */
		LSUINT32 *								m_pui32Indices;

		/**
		 * Total indices.
		 */
		LSUINT32								m_ui32TotalIndices;

		/**
		 * Total indices allocated.
		 */
		LSUINT32								m_ui32AllocIndices;


		// == Functions.

		/**
		 * Use an insertion sort to sort the array of indices.
		 *
		 * \param _pui32Indices The indices to sort.
		 * \param _ptValues The values to sort in-place.
		 * \param _uiptrTotal Total elements to which _ptValues and _pui32Indices point.
		 */
		static LSVOID LSE_CALL					InsertionSort( LSUINT32 * _pui32Indices, const _tType * _ptValues, LSUINTPTR _uiptrTotal ) {
			if ( _uiptrTotal <= 1UL ) { return; }
			LSUINT32 * pui32Cur = _pui32Indices + 1UL;
			LSUINT32 * pui32End = _pui32Indices + _uiptrTotal;
			while ( pui32Cur < pui32End ) {
				const _tType * ptKey = &_ptValues[(*pui32Cur)];
				for ( LSUINT32 * pui32NextDown = pui32Cur - 1UL; pui32NextDown >= _pui32Indices && (*ptKey) < _ptValues[(*pui32NextDown)]; --pui32NextDown ) {
					LSUINT32 ui32Temp = (*pui32NextDown);
					(*pui32NextDown) = pui32NextDown[1];
					pui32NextDown[1] = ui32Temp;
				}
				++pui32Cur;
			}
		}
	};

}	// namespace lsstd

#endif	// __LSSTD_INDEXSORTER_H__
The code that takes advantage of temporal coherence is inside PrepareFor().
You don’t have to set all the indices back to 0, 1, 2, etc. each time, just under certain circumstances.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Someone (Ryan_001) mentioned you could even use Bubblesort for small data sets such as 64k items or less. While I don't fully agree with that (it's still a piss poor algorithm for a complete sort, so I whould still not use it when something better is readily available), I mentioned that Bubblesort has its merits as a 1-pass-per-frame sort if sorting depth is all you do.

This is a well-known and proven idiom. The technique has constant per-frame overhead, amortizing the work over several frames.

In my defense I wasn't being serious about the bubble sort; rather what I was trying to say is that modern processors are so fast, that writing elaborate sorting routines (or otherwise) for less than 64k items is a waste of time because even something as slow as a bubble sort will blast through them in no time. But otherwise I generally agree with you.

Its also nice to see the under-appreciated radix/bucket being championed. Granted, like previous posters, I rarely use it as the ubiquitous std::sort has proven more than sufficient for most of my purposes. One thing that's interesting with a radix sort is that you can do an initial pass with a large buckets. Then each of the individual buckets can be sorted in parallel, and in place, using either std::sort or std::stable_sort. This helps immensely with cache coherency; and really, on modern processors there's few things that'll speed up algorithms as much as just throwing more processors at them ; ) It'd also be pretty easy to write a compute shader that can sort N buckets in parallel, if D3D11 or OpenGL 4 is an option.

@LSpiro: thanks (again), I concluded this point a bit late after understanding it all.

I've done the changes and got it all working, see below.

I've written "possible optimization: blend sort, replace std::sort by radix sort" on my list of remarks in my code documentation.

This topic was quite something, but in the end I can say that I'm happpy with the result :)

Any left over feedback is appreciated.


/**************************************************************************************/
/***							SORT BUCKET BLENDED									***/
/*** ==> usage: sort blended renderables, after the bucket is refilled 				***/
/*** ==> updates the order for blended renderables, back to front on cam distance	***/
/**************************************************************************************/

bool CRenderQueue::SortBucketBlended()
{
	if(mEffectDataCreated)
	{
		size_t bucketSize = mBucketBlended.size();
		size_t lastBucketSize = mBlendSortOrderTemp.size();

		// Scenario 1: new bucket size == last bucket:		don't change reference 'OrderTemp'
		// Scenario 2: new bucket size < then last bucket	refill 'OrderTemp' (invalid objects!)
		if(bucketSize < lastBucketSize)
		{
			mBlendSortOrderTemp.resize(bucketSize);
			for(size_t r=0;r<bucketSize;++r) mBlendSortOrderTemp[r] = r;
		}
		// Scenario 3: new bucket size > then last bucket	add indices to 'OrderTemp' to match bucket Size
		if(bucketSize > lastBucketSize)
		{
			for(size_t r = lastBucketSize;r<bucketSize;++r) mBlendSortOrderTemp.push_back(r);
		}

		for(size_t renderable=0;renderable<bucketSize;++renderable)
		{
			mBlendSortIndexTemp[renderable] = renderable;
			mBlendSortDistToCam[renderable] = mBucketBlended[renderable].DistToCam;
		}

		sort(mBlendSortOrderTemp.begin(), mBlendSortOrderTemp.end(), [&](int a, int b)
			{ return mBlendSortDistToCam[a] > mBlendSortDistToCam[b]; });

		// copy results to the new index
		if(bucketSize != lastBucketSize) mBucketBlendedIndex.resize(bucketSize);
		for(size_t results=0;results<bucketSize;++results)
			mBucketBlendedIndex[results] = mBlendSortIndexTemp[mBlendSortOrderTemp[results]];
			
		return true;
	}
	else return false;
}

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

This topic is closed to new replies.

Advertisement