sort algorithms

Here's an untested implementation of a radix sort on doubles. I recommend implementing this yourself so that you understand the idea, but you can use this for reference.
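
The core of it, as a minimal untested sketch in C (radix_sort_f64 is just a placeholder name): an LSD sort that walks each double's 64-bit pattern one byte at a time, treating it as an unsigned integer.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Untested sketch: LSD radix sort on doubles, one byte of radix per pass.
   The bit patterns are treated as unsigned 64-bit integers, which is why
   only non-negative values come out in the right order (see the edit). */
void radix_sort_f64(double *data, size_t n)
{
    uint64_t *src = malloc(n * sizeof *src);
    uint64_t *dst = malloc(n * sizeof *dst);
    memcpy(src, data, n * sizeof *src);            /* reinterpret the bits */

    for (int shift = 0; shift < 64; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; ++i)             /* histogram this byte  */
            ++count[(src[i] >> shift) & 0xFF];

        size_t offset = 0;                         /* prefix sum -> offsets */
        for (int b = 0; b < 256; ++b) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }
        for (size_t i = 0; i < n; ++i)             /* stable scatter        */
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        uint64_t *t = src; src = dst; dst = t;     /* ping-pong the buffers */
    }
    memcpy(data, src, n * sizeof *src);  /* 8 passes, so src holds the result */
    free(src);
    free(dst);
}

Usage is just radix_sort_f64(values, count), sorting the array in place.
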
[edit] This won't work on negative float values; only positives will be sorted correctly. This page explains how to fix it, and here is another implementation.

[edit #2] N.B. for small data sets (around a hundred elements) I found the standard library's quicksort implementation to beat my radix sort.
I also tried using a larger radix of 16 bits instead of 8 bits, but this only became faster for very large data sets (several hundred thousand items).


I've implemented, profiled, and optimized (with platform-specific assembly) a radix sort for our company a number of times over the past years, and even with a highly tuned radix sort, anything under 1000 elements is usually better handled by another sorting algorithm. I say usually because getting the most out of a radix sort means making good use of cache prefetching. If your cache is shared with another thread, performance is unreliable.

I recommend radix sorts only for large data sets (e.g. when you know your minimum count is 500). Hands down, an f32 radix sort of 1000 elements or more will always be the fastest.

-= Dave
Graphics Programmer - Ready At Dawn Studios
You certainly know that no sorting algorithm based on key comparison can be faster than O(n log n) in the average case. So your best bet for an average-case, key-comparison sort is quicksort.
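
For anyone who wants the one-line reason: a comparison sort is a binary decision tree that has to distinguish all n! orderings of its input, so it needs at least n! leaves, and a binary tree with that many leaves has depth

\[ \log_2 n! \;\ge\; \frac{n}{2}\log_2\frac{n}{2} \;=\; \Omega(n \log n). \]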

Everything else depends on the usage, desired behavior, and what is known about the set to be sorted. When the set changes little from frame to frame (high temporal coherency), insertion sort is very good. Merge sort is a very stable sorting algorithm with guaranteed O(n log n) behaviour, and quicksort has some bad worst-case behavior. Bubble sort, on the other hand, is quite fast when sorting very small sets.

Bucket/radix sort is very good, and even when not using it to sort the complete set, it could be used to quickly fill buckets for certain ranges.

Just for fun, I would try out the following multithreaded sorting algorithm (rough sketch below):
1. Use bucket/radix sort to quickly generate some buckets of certain ranges.
2. For each bucket start a thread to use merge sort.
3. For small merge-levels (4-8) use bubble sort/insertion sort.

Utilizing the GPU would be the next step (=> merge sort for low level sorting).
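
On the CPU side, that recipe might look like this rough untested sketch in C (pthreads, with u32 keys to keep it simple; the bucket count and the small-run cutoff are arbitrary picks, and all the names are made up):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 16          /* step 1: top 4 bits pick the bucket -- arbitrary */
#define SMALL    8           /* step 3: below this, insertion sort the run      */

static void insertion_sort(uint32_t *a, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        uint32_t v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; --j; }
        a[j] = v;
    }
}

static void merge_sort(uint32_t *a, uint32_t *tmp, size_t n)
{
    if (n <= SMALL) { insertion_sort(a, n); return; }  /* step 3 */
    size_t h = n / 2;
    merge_sort(a, tmp, h);
    merge_sort(a + h, tmp, n - h);
    memcpy(tmp, a, h * sizeof *a);          /* merge left half from scratch */
    size_t i = 0, j = h, k = 0;
    while (i < h && j < n) a[k++] = (tmp[i] <= a[j]) ? tmp[i++] : a[j++];
    while (i < h) a[k++] = tmp[i++];        /* right leftovers are in place */
}

struct job { uint32_t *base, *tmp; size_t n; };

static void *sort_job(void *p)
{
    struct job *j = p;
    merge_sort(j->base, j->tmp, j->n);      /* step 2 */
    return NULL;
}

void hybrid_sort(uint32_t *a, size_t n)
{
    /* step 1: one counting pass buckets everything by the top 4 bits */
    size_t count[NBUCKETS] = {0}, start[NBUCKETS], fill[NBUCKETS];
    for (size_t i = 0; i < n; ++i) ++count[a[i] >> 28];
    size_t off = 0;
    for (int b = 0; b < NBUCKETS; ++b) { start[b] = off; off += count[b]; }
    memcpy(fill, start, sizeof fill);

    uint32_t *scratch = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; ++i) scratch[fill[a[i] >> 28]++] = a[i];
    memcpy(a, scratch, n * sizeof *a);      /* scratch is reused below */

    /* step 2: one thread per bucket (empty buckets get a trivial no-op) */
    pthread_t tid[NBUCKETS];
    struct job jobs[NBUCKETS];
    for (int b = 0; b < NBUCKETS; ++b) {
        jobs[b] = (struct job){ a + start[b], scratch + start[b], count[b] };
        pthread_create(&tid[b], NULL, sort_job, &jobs[b]);
    }
    for (int b = 0; b < NBUCKETS; ++b) pthread_join(tid[b], NULL);
    free(scratch);
}
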
Uhh, you forgot to test Gnome Sort.

http://en.wikipedia.org/wiki/Gnome_sort
[font="Book Antiqua"]I went back to the drawing board (where I should and usually always go), and carefully wrote down on paper the steps of what I understand to be the radix sort. That got me clear about how it works, and stimulated a few thoughts about practical implementation.

First of all, it appears to me that sorting from MSR to LSR is vastly smarter and should be considerably faster. Maybe I'm missing or misunderstanding something (like exactly how the LSR version works), but I'm surprised to find that almost all radix sorts I've found on f32 and f64 (floating point) data start at the LSR and work up. After all, there's rarely any need to keep identical values in their original unsorted order, and certainly not in my case.

For example, in my case of sorting on the value of f64 variables, running a single pass with 11-bits [or possibly 13-bits] of radix at a time will sort the input array into 2048 [or 8192] correctly ordered sections of the target array. If I'm sorting a few hundred to a few thousand values, the array will be close to completely sorted, and sometimes even completely sorted, with only 0, 1, 2 or at most a few elements in each bucket. So, after the first pass, it seems a lot smarter to just run a trivial insertion sort on each of the buckets that contains 2 or more elements (the 0 or 1 element buckets are obviously already sorted correctly and in their exactly correct final location in the target array).

This seems an incredibly fast and powerful hybrid approach! For sure, if the input array has tens or hundreds of thousands of elements, some of the buckets might contain dozens of elements. But my current tests say insertion sort is very fast for up to 100~200 elements, where merge sort becomes faster. Hmmmm.

If I understand the LSR version, this kind of trick doesn't work, because the buckets are ordered by least-significant-bits, so they're in some wacko, intermediate order until the MSR is finally processed.

This also means (if I understand correctly) that the LSR version is moving elements far away from their current positions as more significant radix groups are processed. This will tend to make cache misses much more common than the MSR sort where elements are originally put very close to their final location and never moved very far (again, unless I'm missing something). In fact, if I understand this, the buckets will usually be so small that elements being swapped will fairly often be in the same cache line - talk about efficient!

I wrote a loop to print out a couple hundred select f64 values from "most negative" to "most positive", along with their integer hexadecimal form, and verified that a simple tweak does indeed make them sort correctly. That is, treat the f64 value as an s64 integer (think "union"): when the s64 value "sv" is negative, simply ~sv (bitwise-not the entire 64-bit pattern); otherwise simply flip the sign bit (which can be done with xor on the s64 value, or simply "fv = -fv" on the floating-point value). It might be faster to modify the original data in place, do the sort, then modify back to the original values, but I'm not gonna do that myself, to make sure nothing bad happens if another process or thread is accessing that data simultaneously or concurrently. Oh boy, would that bug be a bugger to figure out!
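
In code, the tweak is just this (a sketch; f64_key and key_to_f64 are made-up names, and the memcpy is the safe way to spell the "union"):

#include <stdint.h>
#include <string.h>

/* Map a double's bits to a u64 whose unsigned order matches the numeric
   order of the doubles (NaNs aside). */
static uint64_t f64_key(double fv)
{
    uint64_t sv;
    memcpy(&sv, &fv, sizeof sv);        /* the "union" trick, safely     */
    if (sv & 0x8000000000000000ull)     /* negative: ~sv flips all bits  */
        return ~sv;
    return sv ^ 0x8000000000000000ull;  /* positive: flip just the sign  */
}

/* The inverse, to recover the double after sorting. */
static double key_to_f64(uint64_t key)
{
    uint64_t sv = (key & 0x8000000000000000ull)
                ? key ^ 0x8000000000000000ull   /* was positive */
                : ~key;                          /* was negative */
    double fv;
    memcpy(&fv, &sv, sizeof fv);
    return fv;
}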

There are a couple other possible optimizations. Since the code always knows how many elements will be in each radix bucket, when that number is 2 (and possibly 3), it might be easier to perform the sort as each element is inserted. When the bucket contains 0 elements, just insert the element. When the bucket contains 1 element, just perform a simple compare and either write the new value at offset 1, or move the existing element to offset 1 and write the new element at offset 0. Since soooo many buckets are likely to contain 0, 1, or 2 elements, this could make further sorting passes rare (needed on very few buckets). I'll have to try this both ways to see which is best.
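
The insert-time version might look something like this sketch (offset[] and placed[] would come from the usual counting pass; the names are made up):

#include <stddef.h>
#include <stdint.h>

/* Sketch: sort-on-insert for tiny buckets. offset[b] is where bucket b
   starts in dst; placed[b] is how many keys it holds so far. Buckets
   that grow past 2 elements still get insertion-sorted afterwards. */
static void place_key(uint64_t *dst, const size_t *offset, size_t *placed,
                      unsigned b, uint64_t key)
{
    size_t base = offset[b];
    if (placed[b] == 1 && key < dst[base]) {
        dst[base + 1] = dst[base];   /* bump the existing element up */
        dst[base]     = key;
    } else {
        /* empty bucket, key >= the one element, or 2+ already there
           (a later insertion-sort pass fixes those) */
        dst[base + placed[b]] = key;
    }
    ++placed[b];
}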

It is tempting to start out with a wide radix like 13 bits or even 16 bits, which would tend to leave the entire sort essentially finished after only one pass (since most or all buckets would contain only 0, 1, or 2 elements, for moderate-size arrays). However, I'm guessing an 11-bit or 13-bit radix is optimum, and for relatively small input arrays 8 bits might be better --- all due to cache considerations.

Anyway, all the above is pure theory (and possibly misunderstanding in the case of the LSR technique), so it is time for me to start typing code and find out the reality. I'm not gonna worry about GPU sorts at the moment, since I am very dubious that sorting several hundred to a few thousand elements on the GPU will be competitive given the overhead. Clearly, though, the work of a radix sort can be distributed to multiple CPU cores, so I'll think about that later if necessary. My tendency so far is to only distribute work to other CPU cores on a "subsystem basis" (a big chunk of independent processing) rather than trying to coordinate multiple cores on a single algorithm. But... maybe. We'll see. I suspect the radix sort will be so fast that it won't matter --- maybe wishful thinking.

As it turns out, I just finished creating the f64 data that I now need to sort, so I could actually add code in the routine that created that data to create the bucket count array, which will be super-efficient (already have the variables in hand, so zero cache misses on the data, plus we'll get the count array into cache for the actual sort).
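
i.e., roughly this (a sketch: compute_value stands in for whatever actually produces the data, and f64_key is the sign-fix transform from above):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { RADIX_BITS = 11, BUCKETS = 1 << RADIX_BITS };

extern double compute_value(size_t i);   /* hypothetical data producer  */
extern uint64_t f64_key(double fv);      /* sign-fix transform, above   */

/* Sketch: produce the data and build the first-pass histogram in the
   same loop, while each value is still in a register. */
void make_and_count(double *data, size_t n, size_t count[BUCKETS])
{
    memset(count, 0, BUCKETS * sizeof *count);
    for (size_t i = 0; i < n; ++i) {
        double v = compute_value(i);
        data[i] = v;
        ++count[f64_key(v) >> (64 - RADIX_BITS)]; /* top 11 bits = bucket */
    }
}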

Oh, one politically incorrect comment. Those big-O ratings are just about useless as far as I can tell. I have algorithms running that have the same "best case", "average case", and "worst case" equations (often O(n log n)) but run at very different speeds, often in non-intuitive ways. Who knows, maybe some sort routines have a hidden constant of 1 and others have a hidden constant of 38 or something, but whatever it is, I don't find those equations to be very useful in a practical sense.

Anyway, details to come once I have something limping along.
[font="Book Antiqua"]Well, my experience with the radix sort was not very satisfying. As I mentioned in my previous message, I implemented an MSR version of radix sort. After the initial sort on the MSR, I cycle through the groups and recursively call radix sort again to sort each group. However, before the code recursively calls radix sort it checks to see how many elements in that radix group, and if the number of elements is less than <pick-a-number>, it performs an inline insertion sort to sort that radix group. The radix sort has quite a bit of overhead to set up for the sort (like creating count[] and offset[] arrays), so the speed does indeed increase quite a bit when <pick-a-number> is chosen wisely (somewhere between 32 ~ 128 depending on how highly randomized the input array is).

But the problem isn't so much the radix sort per se as the format of floating-point numbers. I put in a bunch of printing diagnostics to try to understand the problem, and that did expose it. Just take this one example of an input array with 728 elements. After setting the count array, my diagnostics printed out its contents. And "holy smokes" is the answer! Of the 728 elements, all but 12 ended up in the same radix group (a count of 716), and the other 12 ended up in one other radix group! Yikes! No wonder it is slow. It performed that pass and achieved nearly nothing. Even a level or two down, the values were far from evenly distributed; in fact, at the next level down the elements were distributed over only about 6 groups! And while the distribution gradually got better as it recursed down to less-significant radix fields, the overhead of clearing and computing those count and offset arrays became a killer (at least I assume that must be the killer, cuz what else is there in the code besides that and the radix sort/move process?).

I'm betting this would work much better on integers... especially small integers (16-bit and 32-bit), and especially if their values are fairly well distributed over the full 16-bit or 32-bit range. Unfortunately, to sort effectively, the array of floating-point numbers would need to look like this:

[/font][font="Courier New"]+2.54e-301
-3.14e+116
-7.56e-211
-6.15e+017
+8.97e-111
+4.62e+284
-1.36e-164
-9.07e-029
+5.48e+083

[/font][font="Book Antiqua"]You get the idea, right?

But the reality is, most arrays have sets of numbers that have something in common - they represent measures of some phenomenon other than maximally randomized bit patterns (which should sort great in radix-sort). And in my case, the numbers are x (or y or z) world coordinates of the triangles that compose a 3D object. Well, typically they are distributed over a range of (say) -0.65 to +1.17, or +2.38 to +4.79, or something similar. To sort that kind of array, radix sort royally sucks!

By tweaking the <pick-a-number> value I was able to make radix sort sometimes beat one of the other sorts on a particular set of values. However, I don't think radix sort ever beat even two of the 4 sorts I'm running on the same data (and printing cycle-counts for). Which means, radix sort was never the winner - never. Bummer.

So it seems like radix sort is brilliant in theory, but the nature of real-world arrays badly defeats its main mechanism (distribution of the values into many radix-groups)... especially with floating point format.

Well, that's what I found. If anyone has any suggestions to fix this practical problem, let me know. My first thought was to hash the sucker, but that totally ruins the sort process! I kept thinking there must be a way to hash without wrecking the sort order, but I don't see it. But maybe I'm missing something. I hope so, cuz radix sort has so much theoretical potential.
