sort algorithms

Started by
13 comments, last by maxgpgpu 12 years, 7 months ago
[font="Book Antiqua"]The further I get into writing my 3D simulation/game engine, the more places I end up needing sort routines. Even worse, they are finding their way into some very speed critical parts of the engine (like broad and narrow phase collision detection). In broad phase collision detection the sort is performed every frame, so the object never get grossly out of order. For this particular purpose, I have so far found the insertion sort I wrote for this purpose to be extremely fast and satisfying.

However, in narrow phase collision detection, sometimes the contents of the arrays I sort are fairly well ordered, other times they are [almost] in reverse order, and sometimes they are just all scrambled up. I have struggled my way through a lot of experimentation already and found some reasonable good solutions, but I'd like to improve the efficiency further. So I'll report my results here and ask for comments from anyone who has experimented with sort routines before.

#1: insertion sort
A: very fast when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow when array contents are highly randomized or [near] reversed.

#2: merge sort
A: adequately fast when array contents are not too far out of sort order.
B: fairly fast when array contents are moderately randomized.
C: fairly fast when array contents are highly randomized or [near] reversed.

#3: heap sort
A: barely adequately fast when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow but not terrible when array contents are highly randomized or [near] reversed.

#4: smooth sort
A: somewhat slow when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow when array contents are highly randomized or [near] reversed.

#5: quick sort
A: stunningly slow when array contents are not too far out of sort order.
B: slow to moderately fast when array contents are moderately randomized.
C: very fast when array contents are highly randomized.
D: stunningly slow when array contents are [near] reversed.

Fortunately for my application, I have a couple reliable measures of how randomized or reversed the array is before I call the sort routine. So my current "hack" (or "self-regulating heuristic" if I was inclined to speak academic) switches between insertion-sort and merge-sort to attempt to get best performance in every situation.

It is tempting to also include the quick-sort routine in my scheme, because now and then it beats merge sort by a fair margin on highly randomized array contents. However, quick-sort is prone to once in a while "taking forever" for no apparent reason. Clearly quick-sort can be thrown into some kind of thrashing situation by specific ordering of highly randomized array contents. This is peculiar and I don't understand why. But since the main goal here is to make sure the work is always done before the frame ends, quick-sort appears too problematic for my purposes, unless I'm missing something and can somehow predict in advance when these hyper-slow cases will occur.

Overall, I was very impressed by insertion sort and merge sort, somewhat let down by heap sort, very let down by smooth sort (which too some work to get running), and rather grossed out by the flagrant unpredictability of quick-sort.

One thing is for sure. Looking at the theoretical "best case", "average", "worst case" ratings in the wikipedia page for "sort routines" does not provide much insight into how fast a routine will work when completed. And yes, I have sort verification functions run after each sort to assure they are all working correctly. And yes, in my speed comparisons, the sort routines are always starting with exactly the same data. The arrays being sorted range from very short (32 elements) to moderate (500~2000 elements).

Anyone who has experimented with sort routines, please report your experiences. All my sort routines and testing are indirectly sorting arrays of double precision floating point values. I say indirectly because two arrays are involved, an array of integers and an array of double precision values. Each integer is the location in the array of double precision values. Therefore, the sort routines actually sort the integer array, but the sort is based upon the values in the array of double precision values. In other words, after the sort is complete, if you were to look at the integer array, you would see the integer values were all scrambled up (not sorted in any obvious way). In other words, int[0] to int[top] does not end up ordered, but double[int[0]] to double[int[top]] does end up ordered. Nonetheless, this is essentially equivalent to sorting an array of doubles, and all comparisons made to perform the sort are comparisons of double precision values.[/font]
Advertisement
radix sort!

I implemented that in AS3 and it was a clear winner speed wise overall.

I used it to sort polygons (tested upto about 50,000) and it was the fastest method I could find.

It was perfect as the performance was consistently good on both sorted and unsorted arrays, hope it helps.

radix
You should try a radix sort -- it has the neat aspect of being of [font="Courier New"]O(n)[/font] complexity, instead of [font="Courier New"]O(n[sup]2[/sup])[/font] or [font="Courier New"]O(n log n)[/font].

The radix-sort orders items based on the bitwise representation of their sort-keys, instead of comparing pairs of keys via a comparison function.
IIRC, the floating point standard fortunately has the property where more-signficant-bits of the representation are stored before less-significant-bits, which means that you can sort them as if they were binary integers (except for special cases, like NaN).
i.e. this should work, meaning a radix sort should work:float a = ..., b = ...;
bool aLessB = (a < b);
bool aLessBHack = ( (*(int*)&a) < (*(int*)&b) );
assert( aLessB == aLessBHack );
[edit]Ninja'd! smile.gif
How do you choose the pivot element for quicksort? If it gets slow for almost ordered and reverse ordered data I assume you choose the first or last element? Try median 3 pivoting with the first, last and "middle" element.
In practice, quick sort is still the fastest of those listed. During my benchmarks it loses to my bottom-up in-place stable merge sort only once every 20 times or so, though when merge sort loses it usually isn’t by much.
Remember to use a bottom-up merge sort instead of the traditional recursive one; by avoiding recursion and re-allocation you will save a lot of time.

I use quick sort when the stability of the sort does not matter, and merge sort when it does.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I figure that out when I read out the radix sort.

It is true that quick sort is the fastest as along as it is happy with the data. I find it sorta laughable that the cases that blows its mind are when the data is already sorted (correctly or in [roughly] reverse order). My measures agree with you - when data is well randomized merge sort is slower than quick-sort, but not by very much. And when quick-sort clogs up, merge sort is extremely stable. In fact, merge sort is the most stable sort I've tested by a fair margin. I just created a 3D plot and merge sort just doesn't have any "outliers" while all the others do.

I'll search for "median 3 pivoting" and see what I can find. Thanks for that tip.

I'll also try to find out what the bottom-up variant of the merge sort is, and check whether that's what I have.

I should also mention that I implemented a non-recursive variant of quick-sort, which is completely equivalent except it avoids function call overhead. I did so after reading a couple claims this variant is a bit faster.

Again, thanks for all the tips, and I'll report my results next time, including a new plot with radix sort if anyone cares to see it.[/font]

In practice, quick sort is still the fastest of those listed. During my benchmarks it loses to my bottom-up in-place stable merge sort only once every 20 times or so, though when merge sort loses it usually isn’t by much.
Remember to use a bottom-up merge sort instead of the traditional recursive one; by avoiding recursion and re-allocation you will save a lot of time.

I use quick sort when the stability of the sort does not matter, and merge sort when it does.


L. Spiro


In practice, it depends on temporal coherancy of your lists. If only two or three elements are out of place, even a bubble sort can be quicker than a qsort or merge sort.

[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I figure that out when I read out the radix sort.[/font]

As you rightfully mentioned in your initial post, you should not be sorting the values themselves, but indices into those values.
Especially if you are sorting triangles. You should be sorting the index buffer, not the vertex buffer, and if you are able to get indices to be 16 bits each, a radix sort will be extremely fast.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


[quote name='maxgpgpu' timestamp='1313627785' post='4850552']
[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I figure that out when I read out the radix sort.[/font]

As you rightfully mentioned in your initial post, you should not be sorting the values themselves, but indices into those values. Especially if you are sorting triangles. You should be sorting the index buffer, not the vertex buffer, and if you are able to get indices to be 16 bits each, a radix sort will be extremely fast.
L. Spiro
[/quote]
[font="Book Antiqua"]Unfortunately, I'll either need to continue sorting s32 integers like I am now, or have two routines, one for s16 and one for s32. In probably 99% of all real-world cases the s16 routine will work fine, but I also need to support arrays larger than 65536 elements. So yeah, if the radix sort with s16 values is significantly faster than the radix sort with s32 values, then I better create two routines and invoke the s16 version when (elements < 65536). But I'm not sure s16 versus s32 will make that big a difference, since I need to access and test (sort on) the f64 value for each s16 or s32 value refers to. But it isn't much extra work to implement both and find out, so I will... assuming I can figure out how to implement the radix sort at all. So far the information is fairly integer biased. I think I can make the f64 behave and sort like an s64 value, but first I need to sit down and make sure I'm not dreaming. If my initial guess is correct, if the value is negative, I just need to invert the most significant 13-bits in the f64/s64 bit pattern and sort based upon that. Or something like that. Maybe I should have the routine perform the sort one 13-bit wide radix at a time, then I only need to treat the first one different (MS13).
[/font]
[font="Book Antiqua"]Actually, I'm not sorting the index buffer OR vertex buffer. For every triangle I have its minimum.x and maximum.x in the s16 array (or s32 array). You might ask why do I have signed integer arrays (s16 and s32) instead of u16 and u32. Good question. The answer is, I represent the minimum.x of a triangle as ~triangleID in the array (a negative value) and the maximum.x of the triangle with triangleID (a positive value). That's how I distinguish whether a given element is the minimum.x or maximum.x which my collision routines definitely need to know! Effectively, the first 3 values in the index buffer end up being ~0 and 0 (triangleID == 0), the next 3 values end up being ~1 and 1 (triangleID == 1), and so forth. So it is very analogous to the index buffer all right, but not exactly.

I'm really jazzed about the radix sort, and hope it turns out as great as fast as it sounds. I wish I could find an f64 implementation. Maybe I will yet, but probably (and as usual), I'm probably better off taking the time to implement it myself from scratch.[/font]
Here's an un-tested implementation of a radix sort on doubles. I recommend implementing this yourself so that you understand the idea, but you can use this for reference.
Usage:double myData[5] = { 5, 3, 4, 1, 2 };
double buffer[5];//temp storage for sorting
RadixSort( myData, buffer, 5 );
[source lang=cpp]double* RadixSort( double* input, double* buffer, unsigned size )
{
if( size <= 1 )
return input;//no work to do here!
const unsigned passes = sizeof( double );//the algorithm sorts on a single byte-column per pass (8bit radix)
unsigned count[256];//how many times each possible byte occurs
unsigned index[256];//where to put each possible byte in the output array
double* buffers[2] = { input, buffer };
for( unsigned pass=0; pass!=passes; ++pass )
{
double* out = buffers[1];
double* begin = buffers[0];
double* end = begin + size;
//Build the histogram
memset( count, 0, sizeof(unsigned)*256 );
for( double* it=begin; it!=end; ++it )
{
const unsigned byte& key = *(((unsigned byte*)it) + pass);//extract the byte being used by this pass
++count[key];//count how many times that byte occurs in the input
}
//Build the partition lookup
index[0] = 0;//the zero-byte entries start at the start
for( unsigned i=1; i!=256; ++i )
{
index = index[i-1] + count[i-1];//the 'i' byte entries start at the end of previous, plus the space used by the previous
}
//Perform the sort - move values into the spaces decided above
for( double* it=begin; it!=end; ++it )
{
u8& key = *(((unsigned byte*)it) + pass);
unsigned idx = index[key]++;
out[idx] = *it;
}
//flip buffers for the next pass
double* temp = buffers[0];
buffers[0] = buffers[1];
buffers[1] = temp;
}
return buffers[0];
}[/source][edit]This won't work on negative float values, only positives will be sorted correctly. This page explains how to fix it, and here is another implementation.

[edit #2] N.B. for small data sets (~a hundred) I found the standard library's quicksort implementation to beat my radix sort.
I also tried using a larger radix of 16-bits instead of 8-bits, but this only became faster for very large data-sets (several hundred-thousand items).

This topic is closed to new replies.

Advertisement