# Sort algorithms


[font="Book Antiqua"]The further I get into writing my 3D simulation/game engine, the more places I end up needing sort routines. Even worse, they are finding their way into some very speed-critical parts of the engine (like broad and narrow phase collision detection). In broad phase collision detection the sort is performed every frame, so the objects never get grossly out of order. For that purpose, I have so far found the insertion sort I wrote to be extremely fast and satisfying.

However, in narrow phase collision detection, sometimes the contents of the arrays I sort are fairly well ordered, other times they are [almost] in reverse order, and sometimes they are just completely scrambled. I have struggled my way through a lot of experimentation already and found some reasonably good solutions, but I'd like to improve the efficiency further. So I'll report my results here and ask for comments from anyone who has experimented with sort routines before.

#1: insertion sort
A: very fast when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow when array contents are highly randomized or [near] reversed.

#2: merge sort
A: adequately fast when array contents are not too far out of sort order.
B: fairly fast when array contents are moderately randomized.
C: fairly fast when array contents are highly randomized or [near] reversed.

#3: heap sort
A: barely adequately fast when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow but not terrible when array contents are highly randomized or [near] reversed.

#4: smooth sort
A: somewhat slow when array contents are not too far out of sort order.
B: somewhat slow when array contents are moderately randomized.
C: slow when array contents are highly randomized or [near] reversed.

#5: quick sort
A: stunningly slow when array contents are not too far out of sort order.
B: slow to moderately fast when array contents are moderately randomized.
C: very fast when array contents are highly randomized.
D: stunningly slow when array contents are [near] reversed.

Fortunately for my application, I have a couple of reliable measures of how randomized or reversed the array is before I call the sort routine. So my current "hack" (or "self-regulating heuristic" if I were inclined to speak academic) switches between insertion sort and merge sort to get the best performance in every situation.
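For illustration, a cheap stand-in for such a measure (my real measures come from the collision system itself, so this function and its name are purely hypothetical) is the fraction of adjacent pairs that are out of order:

```cpp
#include <stddef.h>

//Illustrative sketch only: estimate how disordered an array is by counting
//adjacent out-of-order pairs. Returns 0.0 for an already-sorted array and
//1.0 for a strictly reversed one.
double AdjacentDisorder( const double* a, size_t n )
{
    if( n < 2 )
        return 0.0;
    size_t bad = 0;
    for( size_t i = 1; i < n; ++i )
        if( a[i] < a[i-1] )
            ++bad;
    return (double)bad / (double)( n - 1 );
}
```

A dispatcher could then pick insertion sort below some threshold and merge sort above it.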

It is tempting to also include the quick-sort routine in my scheme, because now and then it beats merge sort by a fair margin on highly randomized array contents. However, quick-sort is prone to occasionally "taking forever" for no apparent reason. Clearly quick-sort can be thrown into some kind of thrashing situation by specific orderings of highly randomized array contents. This is peculiar and I don't understand why. But since the main goal here is to make sure the work is always done before the frame ends, quick-sort appears too problematic for my purposes, unless I'm missing something and can somehow predict in advance when these hyper-slow cases will occur.

Overall, I was very impressed by insertion sort and merge sort, somewhat let down by heap sort, very let down by smooth sort (which took some work to get running), and rather grossed out by the flagrant unpredictability of quick-sort.

One thing is for sure: looking at the theoretical "best case", "average case", and "worst case" ratings on the Wikipedia page for sorting algorithms does not provide much insight into how fast a routine will actually run once implemented. And yes, I have sort verification functions run after each sort to assure they are all working correctly. And yes, in my speed comparisons, the sort routines always start with exactly the same data. The arrays being sorted range from very short (32 elements) to moderate (500~2000 elements).

Anyone who has experimented with sort routines, please report your experiences. All my sort routines and tests indirectly sort arrays of double precision floating point values. I say indirectly because two arrays are involved: an array of integers and an array of double precision values. Each integer is an index into the array of double precision values. The sort routines actually reorder the integer array, but the ordering is based upon the values in the array of double precision values. In other words, after the sort is complete, int[0] to int[top] does not end up ordered, but double[int[0]] to double[int[top]] does end up ordered (the integer array itself looks scrambled). Nonetheless, this is essentially equivalent to sorting an array of doubles, and all comparisons made to perform the sort are comparisons of double precision values.[/font]
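For clarity, here is a minimal sketch of that indirect scheme, using an insertion sort since that is the simplest of the routines above (the function name is illustrative, not my actual code):

```cpp
#include <stddef.h>

//Indirect insertion sort: reorders idx[] so that
//vals[idx[0]] <= vals[idx[1]] <= ... ; the vals[] array is never moved.
void IndexInsertionSort( int* idx, const double* vals, size_t n )
{
    for( size_t i = 1; i < n; ++i )
    {
        int key = idx[i];
        size_t j = i;
        while( j > 0 && vals[ idx[j-1] ] > vals[ key ] )
        {
            idx[j] = idx[j-1]; //shift larger entries right
            --j;
        }
        idx[j] = key;
    }
}
```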

##### Share on other sites

I implemented that in AS3 and it was a clear winner speed-wise overall.

I used it to sort polygons (tested up to about 50,000) and it was the fastest method I could find.

It was perfect as the performance was consistently good on both sorted and unsorted arrays, hope it helps.

##### Share on other sites
You should try a radix sort -- it has the neat aspect of being of [font="Courier New"]O(k·n)[/font] complexity (where k is the number of radix passes), instead of [font="Courier New"]O(n[sup]2[/sup])[/font] or [font="Courier New"]O(n log n)[/font].

The radix sort orders items based on the bitwise representation of their sort keys, instead of comparing pairs of keys via a comparison function.
IIRC, the floating point standard fortunately has the property that more-significant bits of the representation are stored before less-significant bits, which means that you can sort non-negative values as if they were binary integers (except for special cases, like NaN).
i.e. this should hold for non-negative values, meaning a radix sort should work:[code]float a = ..., b = ...;//assumes a >= 0 and b >= 0
bool aLessB = (a < b);
bool aLessBHack = ( (*(int*)&a) < (*(int*)&b) );
assert( aLessB == aLessBHack );[/code]Ninja'd! [img]http://public.gamedev.net/public/style_emoticons/default/smile.gif[/img]

##### Share on other sites
How do you choose the pivot element for quicksort? If it gets slow for almost-ordered and reverse-ordered data, I assume you choose the first or last element? Try median-of-three pivoting with the first, last, and "middle" elements.
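A minimal sketch of median-of-three selection (illustrative, not from any particular implementation):

```cpp
#include <stddef.h>

static void SwapD( double* a, double* b ) { double t = *a; *a = *b; *b = t; }

//Order a[lo], a[mid], a[hi] so the median lands in the middle, then use
//a[mid] as the pivot. This avoids the quicksort worst case on sorted and
//reverse-sorted input that first/last-element pivots trigger.
size_t MedianOfThree( double* a, size_t lo, size_t hi )
{
    size_t mid = lo + ( hi - lo ) / 2;
    if( a[mid] < a[lo] ) SwapD( &a[mid], &a[lo] );
    if( a[hi]  < a[lo] ) SwapD( &a[hi],  &a[lo] );
    if( a[hi]  < a[mid] ) SwapD( &a[hi],  &a[mid] );
    return mid; //a[lo] <= a[mid] <= a[hi] now holds
}
```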

##### Share on other sites
In practice, quick sort is still the fastest of those listed. During my benchmarks it loses to my bottom-up in-place stable merge sort only once every 20 times or so, though when merge sort loses it usually isn’t by much.
Remember to use a bottom-up merge sort instead of the traditional recursive one; by avoiding recursion and re-allocation you will save a lot of time.
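As a sketch of the bottom-up idea (illustrative only, assuming a caller-provided scratch buffer rather than per-call allocation; not my production code):

```cpp
#include <algorithm>
#include <cstring>

//Iterative (bottom-up) merge sort: merge runs of width 1, 2, 4, ... using
//one pre-allocated scratch buffer, so there is no recursion and no
//re-allocation. Taking from the left run on ties keeps the sort stable.
void MergeSortBottomUp( double* data, double* scratch, unsigned size )
{
    for( unsigned width = 1; width < size; width *= 2 )
    {
        for( unsigned lo = 0; lo < size; lo += 2 * width )
        {
            unsigned mid = std::min( lo + width, size );
            unsigned hi  = std::min( lo + 2 * width, size );
            //merge data[lo..mid) and data[mid..hi) into scratch[lo..hi)
            unsigned i = lo, j = mid, k = lo;
            while( i < mid && j < hi )
                scratch[k++] = ( data[j] < data[i] ) ? data[j++] : data[i++];
            while( i < mid ) scratch[k++] = data[i++];
            while( j < hi )  scratch[k++] = data[j++];
        }
        std::memcpy( data, scratch, size * sizeof( double ) );
    }
}
```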

I use quick sort when the stability of the sort does not matter, and merge sort when it does.

L. Spiro

##### Share on other sites
[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 values as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I'll figure that out when I read up on the radix sort.

It is true that quick sort is the fastest as long as it is happy with the data. I find it sorta laughable that the cases that blow its mind are when the data is already sorted (correctly or in [roughly] reverse order). My measurements agree with yours - when data is well randomized, merge sort is slower than quick-sort, but not by very much. And when quick-sort clogs up, merge sort remains extremely consistent. In fact, merge sort is the most consistent sort I've tested by a fair margin. I just created a 3D plot and merge sort just doesn't have any "outliers" while all the others do.

I'll search for "median 3 pivoting" and see what I can find. Thanks for that tip.

I'll also try to find out what the bottom-up variant of the merge sort is, and check whether that's what I have.

I should also mention that I implemented a non-recursive variant of quick-sort, which is completely equivalent except it avoids function call overhead. I did so after reading a couple claims this variant is a bit faster.

Again, thanks for all the tips, and I'll report my results next time, including a new plot with radix sort if anyone cares to see it.[/font]

##### Share on other sites
[quote name='YogurtEmperor' timestamp='1313574138' post='4850208']
In practice, quick sort is still the fastest of those listed. During my benchmarks it loses to my bottom-up in-place stable merge sort only once every 20 times or so, though when merge sort loses it usually isn’t by much.
Remember to use a bottom-up merge sort instead of the traditional recursive one; by avoiding recursion and re-allocation you will save a lot of time.

I use quick sort when the stability of the sort does not matter, and merge sort when it does.

L. Spiro
[/quote]

In practice, it depends on the temporal coherency of your lists. If only two or three elements are out of place, even a bubble sort can be quicker than a qsort or merge sort.

##### Share on other sites
[quote name='maxgpgpu' timestamp='1313627785' post='4850552']
[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 values as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I'll figure that out when I read up on the radix sort.[/font][/quote]
As you rightfully mentioned in your initial post, you should not be sorting the values themselves, but indices into those values.
Especially if you are sorting triangles. You should be sorting the index buffer, not the vertex buffer, and if you are able to get indices to be 16 bits each, a radix sort will be extremely fast.

L. Spiro

##### Share on other sites
[quote name='YogurtEmperor' timestamp='1313634778' post='4850580']
[quote name='maxgpgpu' timestamp='1313627785' post='4850552']
[font="Book Antiqua"]Thanks for all the comments and tips. I think I skipped the radix sort because it is "not a comparison sort". I thought that was a requirement and only integers could be sorted with non-comparison sorts, but maybe I'm wrong. I'll check into that. I certainly don't need to sort any of the special-case wacko double precision values like "not a number" or "infinity" or any such crazies.... just real numbers (vertex coordinates in most cases).

I assume I want to perform the radix sort on the f64 as if they were s64 values, right? That is, perform signed integer compare operations. Though... since they generate the same result... why does it matter whether the f64 values are compared as f64 values or as s64 values? Hopefully I figure that out when I read out the radix sort.[/font][/quote]
As you rightfully mentioned in your initial post, you should not be sorting the values themselves, but indices into those values. Especially if you are sorting triangles. You should be sorting the index buffer, not the vertex buffer, and if you are able to get indices to be 16 bits each, a radix sort will be extremely fast.
L. Spiro
[/quote]
[font="Book Antiqua"]Unfortunately, I'll either need to continue sorting s32 integers like I am now, or have two routines, one for s16 and one for s32. In probably 99% of all real-world cases the s16 routine will work fine, but I also need to support arrays larger than 65536 elements. So yeah, if the radix sort with s16 values is significantly faster than the radix sort with s32 values, then I had better create two routines and invoke the s16 version when (elements < 65536). But I'm not sure s16 versus s32 will make that big a difference, since I need to access and test (sort on) the f64 value that each s16 or s32 value refers to. But it isn't much extra work to implement both and find out, so I will... assuming I can figure out how to implement the radix sort at all. So far the information I've found is fairly integer-biased. I think I can make the f64 behave and sort like an s64 value, but first I need to sit down and make sure I'm not dreaming. If my initial guess is correct, when the value is negative I just need to invert the most-significant 13 bits in the f64/s64 bit pattern and sort based upon that. Or something like that. Maybe I should have the routine perform the sort one 13-bit-wide radix at a time; then I only need to treat the first one differently (the most significant 13 bits).
[/font]
[font="Book Antiqua"]Actually, I'm not sorting the index buffer OR the vertex buffer. For every triangle I have its minimum.x and maximum.x in the s16 array (or s32 array). You might ask why I have signed integer arrays (s16 and s32) instead of u16 and u32. Good question. The answer is, I represent the minimum.x of a triangle as ~triangleID in the array (a negative value) and the maximum.x of the triangle as triangleID (a positive value). That's how I distinguish whether a given element is the minimum.x or maximum.x, which my collision routines definitely need to know! Effectively, the first two values in the array end up being ~0 and 0 (triangleID == 0), the next two end up being ~1 and 1 (triangleID == 1), and so forth. So it is very analogous to the index buffer all right, but not exactly.

I'm really jazzed about the radix sort, and hope it turns out as great and as fast as it sounds. I wish I could find an f64 implementation. Maybe I will yet, but as usual, I'm probably better off taking the time to implement it myself from scratch.[/font]
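To make the encoding concrete, the helpers would look something like this (illustrative names, not my actual routines):

```cpp
//Illustrative helpers for the ~triangleID encoding described above:
//minimum.x entries are stored as ~triangleID (always negative, since
//~0 == -1), and maximum.x entries as triangleID itself (never negative).
int  EncodeMinX( int triangleID ) { return ~triangleID; }
int  EncodeMaxX( int triangleID ) { return triangleID; }
bool IsMinX( int element )        { return element < 0; }
int  DecodeID( int element )      { return ( element < 0 ) ? ~element : element; }
```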

##### Share on other sites
Here's an un-tested implementation of a radix sort on doubles. I recommend implementing this yourself so that you understand the idea, but you can use this for reference.
Usage:[code]double myData[5] = { 5, 3, 4, 1, 2 };
double buffer[5];//temp storage for sorting
RadixSort( myData, buffer, 5 );[/code][source lang=cpp]#include <string.h> //memset, needed below

double* RadixSort( double* input, double* buffer, unsigned size )
{
	if( size <= 1 )
		return input;//no work to do here!
	const unsigned passes = sizeof( double );//the algorithm sorts on a single byte-column per pass (8-bit radix)
	unsigned count[256];//how many times each possible byte occurs
	unsigned index[256];//where to put each possible byte in the output array
	double* buffers[2] = { input, buffer };
	for( unsigned pass=0; pass!=passes; ++pass )
	{
		double* out = buffers[1];
		double* begin = buffers[0];
		double* end = begin + size;
		//Build the histogram
		memset( count, 0, sizeof(count) );
		for( double* it=begin; it!=end; ++it )
		{
			const unsigned char key = *(((const unsigned char*)it) + pass);//extract the byte used by this pass (assumes little-endian)
			++count[key];//count how many times that byte occurs in the input
		}
		//Build the partition lookup
		index[0] = 0;//the zero-byte entries start at the start
		for( unsigned i=1; i!=256; ++i )
		{
			index[i] = index[i-1] + count[i-1];//each byte's entries start where the previous byte's entries end
		}
		//Perform the sort - move values into the slots decided above
		for( double* it=begin; it!=end; ++it )
		{
			const unsigned char key = *(((const unsigned char*)it) + pass);
			out[ index[key]++ ] = *it;
		}
		//flip buffers for the next pass
		double* temp = buffers[0];
		buffers[0] = buffers[1];
		buffers[1] = temp;
	}
	return buffers[0];//passes is even, so this is 'input' again
}[/source]This won't work on negative float values, only positives will be sorted correctly. [url="http://www.codercorner.com/RadixSortRevisited.htm"]This page[/url] explains how to fix it, and [url="http://stereopsis.com/radix.html"]here[/url] is another implementation.

[edit #2] N.B. for small data sets ([i]~a hundred[/i]) I found the standard library's quicksort implementation to beat my radix sort.
I also tried using a larger radix of 16-bits instead of 8-bits, but this only became faster for very large data-sets ([i]several hundred-thousand items[/i]).

##### Share on other sites
[quote name='Hodgman' timestamp='1313638502' post='4850592']
Here's an un-tested implementation of a radix sort on doubles. I recommend implementing this yourself so that you understand the idea, but you can use this for reference.
This won't work on negative float values, only positives will be sorted correctly. [url="http://www.codercorner.com/RadixSortRevisited.htm"]This page[/url] explains how to fix it, and [url="http://stereopsis.com/radix.html"]here[/url] is another implementation.

[edit #2] N.B. for small data sets ([i]~a hundred[/i]) I found the standard library's quicksort implementation to beat my radix sort.
I also tried using a larger radix of 16-bits instead of 8-bits, but this only became faster for very large data-sets ([i]several hundred-thousand items[/i]).
[/quote]

I've implemented, profiled, and optimized (with platform-specific assembly) a radix sort for our company a bunch of times over the past years, and even with a highly tuned radix sort, anything under 1000 elements is usually better done with another sorting algorithm. I say usually because getting the most out of a radix sort means making good use of cache prefetching. If your cache is shared with another thread then performance is unreliable.

I recommend radix sorts only for large data sets (e.g. when you know your minimum size is around 500). Hands down, an f32 radix sort of 1000 elements or more will always be the fastest.

-= Dave

##### Share on other sites
You certainly know that no sorting algorithm based on key comparison can be faster than O(n log n) in the average case. So your best bet for an average-case key-comparison sort is quick sort.

Everything else depends on the usage, desired behavior, and known information about the set to be sorted. With high temporal coherency (few elements out of place between frames), "insertion sort" is very good. Merge sort is a very stable sorting algorithm with guaranteed O(n log n) behaviour, while quick sort has some bad worst-case behavior. Bubble sort, on the other hand, is quite fast when sorting very small sets.

Bucket/radix sort is very good, and even when not using it to sort the complete set, it could be used to quickly fill buckets for certain ranges.

Just for fun, I would try out the following multithreaded sorting algorithm:
1. Use bucket/radix sort to quickly generate some buckets of certain ranges.
2. For each bucket start a thread to use merge sort.
3. For small merge-levels (4-8) use bubble sort/insertion sort.

Utilizing the GPU would be the next step (=> merge sort for low level sorting).

##### Share on other sites
Uhh you forgot to test Gnome Sort. [img]http://public.gamedev.net/public/style_emoticons/default/rolleyes.gif[/img]

[url="http://en.wikipedia.org/wiki/Gnome_sort"]http://en.wikipedia.org/wiki/Gnome_sort[/url]

##### Share on other sites
[font="Book Antiqua"]I went back to the drawing board (where I should and usually always go), and carefully wrote down on paper the steps of what I understand to be the radix sort. That got me clear about how it works, and stimulated a few thoughts about practical implementation.

First of all, it appears to me that sorting from the MSR (most significant radix) down to the LSR (least significant radix) is vastly smarter and should be considerably faster. Maybe I'm missing or misunderstanding something (like exactly how the LSR version works), but I'm surprised to find that almost all radix sorts I've found for f32 and f64 (floating point) data start at the LSR and work up. After all, there's rarely any need to keep identical values in their original unsorted order, and certainly not in my case.

For example, in my case of sorting on the value of f64 variables, running a single pass with 11-bits [or possibly 13-bits] of radix at a time will sort the input array into 2048 [or 8192] correctly ordered sections of the target array. If I'm sorting a few hundred to a few thousand values, the array will be close to completely sorted, and sometimes even completely sorted, with only 0, 1, 2 or at most a few elements in each bucket. So, after the first pass, it seems a lot smarter to just run a trivial insertion sort on each of the buckets that contains 2 or more elements (the 0 or 1 element buckets are obviously already sorted correctly and in their exactly correct final location in the target array).

This seems an incredibly fast and powerful hybrid approach! For sure, if the input array has tens or hundreds of thousands of elements, some of the buckets might contain dozens of elements. But my current tests say insertion sort is very fast for up to 100~200 elements, where merge sort becomes faster. Hmmmm.
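To pin down what I mean, here's a rough sketch of that hybrid (assuming the f64 keys have already been remapped to u64 so bitwise order matches numeric order; all names and the 11-bit choice are illustrative):

```cpp
#include <stdint.h>
#include <string.h>
#include <stddef.h>

enum { RADIX_BITS = 11, NUM_BUCKETS = 1 << RADIX_BITS };

//One MSR counting pass on the top RADIX_BITS, scattering each key near its
//final position, then a tiny insertion sort inside each bucket. Assumes the
//keys already sort correctly as unsigned integers.
void HybridMsdSort( uint64_t* keys, uint64_t* scratch, size_t n )
{
    size_t count[NUM_BUCKETS] = { 0 };
    size_t start[NUM_BUCKETS];
    for( size_t i = 0; i < n; ++i )                 //histogram the top bits
        ++count[ keys[i] >> ( 64 - RADIX_BITS ) ];
    start[0] = 0;
    for( size_t b = 1; b < NUM_BUCKETS; ++b )       //prefix-sum into offsets
        start[b] = start[b-1] + count[b-1];
    size_t offset[NUM_BUCKETS];
    memcpy( offset, start, sizeof( offset ) );
    for( size_t i = 0; i < n; ++i )                 //scatter into buckets
        scratch[ offset[ keys[i] >> ( 64 - RADIX_BITS ) ]++ ] = keys[i];
    memcpy( keys, scratch, n * sizeof( *keys ) );
    for( size_t b = 0; b < NUM_BUCKETS; ++b )       //finish each small bucket
    {
        size_t lo = start[b], hi = lo + count[b];
        for( size_t i = lo + 1; i < hi; ++i )
        {
            uint64_t key = keys[i];
            size_t j = i;
            while( j > lo && keys[j-1] > key ) { keys[j] = keys[j-1]; --j; }
            keys[j] = key;
        }
    }
}
```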

If I understand the LSR version, this kind of trick doesn't work, because the buckets are ordered by least-significant-bits, so they're in some wacko, intermediate order until the MSR is finally processed.

This also means (if I understand correctly) that the LSR version is moving elements far away from their current positions as more significant radix groups are processed. This will tend to make cache misses much more common than the MSR sort where elements are originally put very close to their final location and never moved very far (again, unless I'm missing something). In fact, if I understand this, the buckets will usually be so small that elements being swapped will fairly often be in the same cache line - talk about efficient!

I wrote a loop to print out a couple hundred select f64 values from "most negative" to "most positive", along with their integer hexadecimal form, and verified the simple tweak does indeed work correctly (to make them sort correctly). That is, treat the f64 value as an s64 integer (think "union"), and when the s64 value "sv" is negative, simply ~sv (bitwise NOT the entire s64 bit pattern); otherwise simply flip the sign bit (which can be done with xor on the s64 value, or simply "fv = -fv" on the floating point value). It might be faster to modify the original data, do the sort, then modify back to the original values, but I'm not gonna do that, just to make sure nothing bad happens if another process or thread is accessing that data simultaneously or concurrently. Oh boy would [i]that[/i] bug be a bugger to figure out!
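In code, the tweak I verified looks something like this (a sketch; the function name is illustrative, and a real routine would transform the keys in bulk rather than one at a time):

```cpp
#include <stdint.h>
#include <string.h>

//Map an f64 bit pattern to a u64 key whose unsigned order matches the
//numeric order of the original double: negative values get every bit
//flipped (~sv), non-negative values get only the sign bit flipped.
uint64_t DoubleToSortableKey( double fv )
{
    uint64_t bits;
    memcpy( &bits, &fv, sizeof( bits ) ); //the "union" trick, aliasing-safe
    const uint64_t sign = 0x8000000000000000ULL;
    return ( bits & sign ) ? ~bits : ( bits ^ sign );
}
```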

There are a couple of other possible optimizations. Since the code always knows how many elements will be in each radix bucket, when that number is 2 (and possibly 3), it might be easier to perform the sort as each element is inserted. When the bucket contains 0 elements, just insert the element. When the bucket contains 1 element, just perform a simple compare and either write the new value at offset = 1, or move the existing element to offset = 1 and write the new element into offset = 0. Since soooo many buckets are likely to contain 0, 1, or 2 elements, this could make the need to perform further sorts rare (on very few buckets). I'll have to try this both ways to see which is best.

It is tempting to start out with a wide radix like 13-bits or even 16-bits, which would tend to cause the entire sort to be essentially finished after only one radix (since most or all buckets would only contain 0,1,2 elements (for moderate size arrays)). However, I'm guessing 11-bit or 13-bit is optimum, and for relatively small input arrays, 8-bits might be better --- all due to cache considerations.

Anyway, all the above is pure theory (and possibly misunderstanding in the case of the LSR technique), so it is time for me to start typing code and find out the reality. I'm not gonna worry about GPU sorts at the moment, since I am very dubious that sorting several hundred to a few thousand elements on the GPU will be competitive given the overhead. However, clearly the work of a radix sort can be distributed to multiple CPU cores, so I'll think about that later if necessary. However, my tendency so far is to only distribute work to other CPU cores on a "subsystem basis" (a big chunk of independent processing) rather than trying to coordinate multiple cores on a single algorithm. But.... maybe. We'll see. I suspect the radix sort will be so fast that it won't matter --- maybe wishful thinking.

As it turns out, I just finished creating the f64 data that I now need to sort, so I could actually add code in the routine that created that data to create the bucket count array, which will be super-efficient (already have the variables in hand, so zero cache misses on the data, plus we'll get the count array into cache for the actual sort).

Oh, one politically incorrect comment. Those big-O ratings are just about useless as far as I can tell. I have algorithms running that have the same "best case", "average case", and "worst case" equations (often O(n log n)), but run at [i]very[/i] different speeds, often in non-intuitive ways. Who knows, maybe some sort routines have a constant factor of 1 and others have a constant factor of 38 or something, but whatever it is, I don't find those equations very useful in a practical sense.

Anyway, details to come once I have something limping along.
[/font]

##### Share on other sites
[font="Book Antiqua"]Well, my experience with the radix sort was not very satisfying. As I mentioned in my previous message, I implemented an MSR version of radix sort. After the initial sort on the MSR, I cycle through the groups and recursively call radix sort again to sort each group. However, before the code recursively calls radix sort it checks how many elements are in that radix group, and if the number of elements is less than <pick-a-number>, it performs an inline insertion sort on that radix group instead. The radix sort has quite a bit of overhead to set up for the sort (like creating the count[] and offset[] arrays), so the speed does indeed increase quite a bit when <pick-a-number> is chosen wisely (somewhere between 32 ~ 128 depending on how highly randomized the input array is).

But the problem isn't so much the radix sort per se as the format of floating-point numbers. I put in a bunch of printing diagnostics to try to understand the problem, and that did expose it. Just take this one example of an input array with 728 elements. After setting the count array, my diagnostics printed out its contents. And "holy smokes" is the answer! Of the 728 elements, all but 12 ended up in the same radix group (a count of 716), and the other 12 ended up in one other radix group! Yikes! No wonder it is slow. It performed that pass to achieve nearly [i]nothing[/i]. Even a level or two down, the values were far from evenly distributed. In fact, at the next level down the elements were distributed over only about 6 groups! And while the distribution gradually got better as it recursed down to less-significant radix fields in the values, the overhead of clearing and computing those count and offset arrays became a killer (at least I assume that must be the killer, cuz what else is there in the code besides that and the radix sort/move process?).

I'm betting this would work much better on integers... especially small integers (16-bit and 32-bit), and especially if their values are fairly well distributed over the range of 16-bit or 32-bit integer values. Unfortunately, to sort effectively, the array of floating point numbers would need to look like this:

[/font][font="Courier New"]+2.54e-301
-3.14e+116
-7.56e-211
-6.15e+017
+8.97e-111
+4.62e+284
-1.36e-164
-9.07e-029
+5.48e+083

[/font][font="Book Antiqua"]You get the idea, right?

But the reality is, most arrays have sets of numbers that have something in common - they represent measures of some phenomenon other than maximally randomized bit patterns (which should sort [i]great[/i] in radix-sort). And in my case, the numbers are x (or y or z) world coordinates of the triangles that compose a 3D object. Well, typically they are distributed over a range of (say) -0.65 to +1.17, or +2.38 to +4.79, or something similar. To sort that kind of array, radix sort royally sucks!

By tweaking the <pick-a-number> value I was able to make radix sort sometimes beat one of the other sorts on a particular set of values. However, I don't think radix sort ever beat even two of the 4 sorts I'm running on the same data (and printing cycle-counts for). Which means, radix sort was never the winner - never. Bummer.

So it seems like radix sort is brilliant in theory, but the nature of real-world arrays badly defeats its main mechanism (distribution of the values into many radix-groups)... especially with floating point format.

Well, that's what I found. If anyone has any suggestions to fix this practical problem, let me know. My first thought was to hash the sucker, but that totally ruins the sort process! I kept thinking there must be a way to hash but not wreck the sorting process, but I don't see it. But maybe I'm missing something. I hope so, cuz radix sort has so much theoretical potential.[/font]
