# Algorithm Radix sort discussion

## Recommended Posts

Posted (edited)

Over a few weeks I have been experimenting with radix sort in this gamedev.ru thread: https://gamedev.ru/flame/forum/?id=227644 .

I decided to repost the findings here, because:
1. Hopefully they will be helpful to someone.
2. I would like to get more feedback.
3. There seems to be lack of general discussion regarding implementation details of radix sort.

The primary artifact of the experiments is the following radix sort library:

https://rextester.com/HLQELD48880
This contains both the library and test harness.
Fair warning: it is undertested at the moment, and there might be stupid bugs (having people too look at it, both from standpoint of correctness, and performance is one of the reasons I'm posting this).

Here's also experimental testbed, which the above library grew from:
https://rextester.com/HEZLFE12709
Note: confusingly, it uses _msb and _lsb suffixes, even though algorithms operate on digits, not bits.

Now, to discuss some points of radix sort.
If you are not familiar with radix sort, wikipedia provides reasonable introduction: https://en.wikipedia.org/wiki/Radix_sort , so I'm not repeating that here.
Generally, radix sort comes in 2 flavors: LSD (sorting least significant digits first) and MSD (sorting most significant digits first). Both can be implemented as out-of-place stable sort. MSD can also be implemented as in-place (not stable) sort.

Here are some observations, in no particular order:
1. People often seem to be concerned with sorting a plain array of numbers. I find that a rather rare use-case: generally, you want to know, which objects these numbers correspond to. So I think struct{uint32_t key;uint32_t payload;} the most likely use-case. Here payload is some sort of handle (say, index in an array; it also can be a pointer, but that is bulkier on x64) to your actual "fat" object (not much reason for lugging those around).
2. I often encountered statement, that MSD is preferable for large arrays, for 2 reasons: it has better cache locality, and it can descend into fallback search for small sizes. That does not quite match the experiments: for struct{uint32_t key;uint32_t payload;} in my tests LSD is consistently faster for n>2000. For larger (>=64b) keys MSD does start to win. For larger payload with small key MSD sometimes wins.
3. I've seen fallback sort for MSD usually implemented as an insertion sort. It raises the problem, of when to switch to it: typical (for other alorithms, say introsort) threshold of ~16 is too small, and running radix sort on it incurs extra overhead; the sensible threshold for radix sort, ~64, makes insertion sort rather slow. I've implemented a merge sort (with threshold of 128/256), which further falls back to insertion sort for n<=18.
4. Prefetching helps. A lot. I'm talking like x3 speedup on large inputs. And, BTW, that's prefetching writing.
5. Input data matters. While radix sort (especially LSD) doesn't seem to care, whether the array of random keys is sorted or not, the keys being a sequence of 0..(n-1) slows down execution x3 (somewhat less, if shuffled). The current best guess is that it is due to a lot of power-of-two sized buckets causing cache aliasing. Not sure, how to fix it.
6. Optimization options (in this implementation) do not seem to matter much, beyond basic (as in -O1).
7. Replacing insertion sort with sorting network doesn't seem to help (on its own sorting network does slightly beat insertion sort).

Very broadly, on struct{uint32_t key;uint32_t payload;} for n>4000, (out-of-place) radix sort tends to be about 5 times faster, than std::sort.

Here are some charts (who doesn't like charts?):

Spoiler

Please note, that these correspond to somewhat outdated versions.

Most of the timings are from rextester.com above. I realize, that benchmarking on an online compiler is not the best idea, but these tests are transparent: everyone sees the same results, along with the knobs set to made them.
If you can find the time, I would appreciate results from your machine (preferably on larger range of sizes; rextester.com only allows ~5 seconds runtime). Ideally, you could also provide your CPU and RAM information (2 tabs from https://www.cpuid.com/softwares/cpu-z.html would do wonderfully).
Especially welcome are results from non-x64 machines.

I invite both general discussion regarding radix sort, and comments on this specific implementation.

Edited by FordPerfect

##### Share on other sites

Visual Studio does not compile testbed:

if (n & 1) ++c[p[-data_size] << 1]; // data_size is unsigned; compiler throws error over this

Here are my results.

CPU-Z:

Spoiler

Compiled ax x86:

Spoiler

System: 32bit
Estimated CPU frequency: 3.691 GHz
Algorithm                    |      16|      22|      31|      43|      60|      84|     118|     166|     234|     330|     466|     659|     932|    1318|    1864|    2636|    3728|    5272|    7456|   10544|   14912|   21089|   29825|   42181|   59655|   84369|  119321|  168753|  238664|  337539|  477376|  675146|  954849| 1350429| 1909892| 2701132
std_sort                     |  14.0  |  15.2  |  17.5  |  15.1  |  22.6  |  40.4  |  22.4  |  26.5  |  38.4  |  64.9  |  58.4  |  89.4  |  88.0  | 106.3  | 119.6  | 127.5  | 137.1  | 145.2  | 152.8  | 157.9  | 163.6  | 171.5  | 176.6  | 184.8  | 185.6  | 218.7  | 216.5  | 218.7  | 216.5  | 218.7  | 232.0  | 235.1  | 243.5  | 248.7  | 255.1  | 262.4
std_stable_sort              |  14.6  |  15.4  |  17.9  |  21.7  |  20.9  |  21.4  |  42.2  |  54.0  |  31.1  |  45.1  |  57.9  |  65.4  |  86.6  |  88.1  | 101.8  | 108.0  | 114.0  | 120.5  | 126.1  | 131.0  | 138.5  | 144.1  | 150.0  | 155.5  | 185.6  | 175.0  | 185.6  | 175.0  | 185.6  | 185.9  | 201.0  | 202.3  | 212.6  | 224.1  | 230.0  | 233.7
merge_sort_ins_full<18>      |  13.5  |  16.6  |  18.0  |  32.5  |  22.4  |  26.6  |  32.7  |  43.1  |  51.3  |  69.0  |  82.1  | 102.2  | 106.6  | 124.5  | 128.1  | 142.5  | 145.9  | 159.1  | 161.0  | 173.9  | 177.9  | 191.9  | 191.8  | 204.7  | 185.6  | 218.7  | 216.5  | 218.7  | 232.0  | 251.5  | 247.4  | 256.9  | 262.9  | 276.0  | 278.3  | 291.1
merge_sort_ins_new<18>       |  16.4  |  16.2  |  17.3  |  27.3  |  21.9  |  26.5  |  28.7  |  69.4  |  76.7  |  62.8  |  77.7  |  95.4  | 104.9  | 121.8  | 126.1  | 143.2  | 144.2  | 158.3  | 159.7  | 172.8  | 174.1  | 187.1  | 188.3  | 201.6  | 185.6  | 218.7  | 216.5  | 240.6  | 232.0  | 240.6  | 239.7  | 256.9  | 262.9  | 273.3  | 274.4  | 288.3
radix_sort_msb<4>            |  14.3  |  16.6  |  25.2  |  21.3  |  22.3  |  28.9  |  58.0  |  34.7  |  44.9  |  25.3  |  30.7  |  43.4  |  51.7  |  70.1  |  76.0  |  92.2  |  91.6  |  54.8  |  60.1  |  70.5  |  74.3  |  85.6  |  88.2  | 100.4  | 123.7  |  87.5  |  61.9  |  65.6  |  77.3  |  98.4  | 100.5  | 109.3  | 108.2  |  71.1  |  75.4  |  84.7
radix_sort_msb<5>            |  14.6  |  16.5  |  17.5  |  24.7  |  38.0  |  26.6  |  28.3  |  39.0  |  72.4  |  33.0  |  25.9  |  33.6  |  44.0  |  59.0  |  64.3  |  77.2  |  80.8  |  93.8  |  92.5  |  44.3  |  48.7  |  56.0  |  60.9  |  70.8  |  61.9  |  87.5  |  92.8  |  87.5  |  92.8  |  43.7  |  54.1  |  65.6  |  65.7  |  79.3  |  83.1  |  94.3
radix_sort_msb<8, 128>       |  14.3  |  16.5  |  17.5  |  21.1  |  40.3  |  48.0  |  28.0  |  17.6  |  21.6  |  14.9  |  15.3  |  26.8  |  29.1  |  27.3  |  31.3  |  34.7  |  39.3  |  45.7  |  51.3  |  61.1  |  65.5  |  77.0  |  74.5  |  46.5  |   0.0  |  43.7  |  30.9  |  43.7  |  46.4  |  43.7  |  46.4  |  43.7  |  46.4  |  57.4  |  61.8  |  69.7
radix_sort_msb<11, 256>      |  15.0  |  16.9  |  17.8  |  26.8  |  23.3  |  26.9  |  27.7  |  51.7  |  76.0  |  55.4  |  40.9  |  39.1  |  42.0  |  41.8  |  37.6  |  32.6  |  35.2  |  36.8  |  37.7  |  38.0  |  39.6  |  40.3  |  43.8  |  50.9  |  61.9  |  43.7  |  92.8  |  87.5  |  92.8  |  98.4  | 100.5  |  76.5  |  73.4  |  65.6  |  61.8  |  57.4
radix_sort_msb<11>           |  14.9  |  18.6  |  17.8  |  21.3  |  22.9  |  26.4  |  29.0  |  35.8  |  44.7  |  55.9  |  43.0  |  38.7  |  41.1  |  42.1  |  37.8  |  32.7  |  35.3  |  36.6  |  37.6  |  38.4  |  39.5  |  40.0  |  43.9  |  51.3  |  61.9  |  87.5  |  61.9  |  65.6  |  77.3  |  87.5  | 100.5  |  82.0  |  73.4  |  65.6  |  61.8  |  56.0
radix_sort_lsb<4>            |  71.6  |  68.6  |  71.5  |  66.1  |  63.0  |  60.7  |  60.5  |  59.0  |  57.7  |  56.9  |  56.5  |  56.5  |  56.8  |  56.3  |  55.9  |  56.0  |  56.1  |  56.5  |  56.9  |  56.1  |  56.5  |  56.7  |  57.7  |  56.9  |  61.9  |  43.7  |  61.9  |  43.7  |  61.9  |  54.7  |  61.9  |  65.6  |  61.8  |  62.9  |  61.8  |  61.5
radix_sort_lsb<5>            |  70.5  |  64.7  |  67.5  |  61.7  |  56.7  |  54.8  |  53.8  |  51.5  |  50.3  |  49.9  |  48.9  |  49.1  |  48.9  |  48.5  |  48.4  |  48.2  |  48.3  |  48.7  |  48.6  |  48.4  |  48.7  |  48.8  |  48.8  |  48.9  |  61.9  |  43.7  |  61.9  |  43.7  |  46.4  |  54.7  |  46.4  |  54.7  |  54.1  |  54.7  |  54.1  |  54.7
radix_sort_lsb<6>            |  78.4  |  67.7  |  59.7  |  52.5  |  48.1  |  45.3  |  45.1  |  42.8  |  40.8  |  39.7  |  39.7  |  39.1  |  38.5  |  38.2  |  38.1  |  38.5  |  38.3  |  38.4  |  38.5  |  38.8  |  39.3  |  39.5  |  39.5  |  39.3  |  61.9  |  43.7  |  61.9  |  43.7  |  46.4  |  43.7  |  46.4  |  43.7  |  46.4  |  46.5  |  46.4  |  46.5
radix_sort_lsb<7>            |  90.3  |  75.9  |  62.3  |  53.3  |  47.0  |  42.5  |  40.2  |  38.4  |  36.3  |  34.9  |  34.3  |  33.7  |  33.3  |  32.8  |  32.7  |  32.7  |  32.6  |  32.6  |  32.5  |  32.4  |  33.1  |  32.9  |  33.0  |  33.9  |  61.9  |  43.7  |  30.9  |  43.7  |  30.9  |  32.8  |  38.7  |  38.3  |  38.7  |  38.3  |  38.7  |  38.3
radix_sort_lsb<8>            | 136.4  | 102.2  |  80.6  |  65.4  |  53.5  |  46.1  |  40.6  |  36.0  |  32.9  |  31.2  |  29.5  |  28.3  |  27.4  |  26.9  |  26.8  |  26.5  |  26.2  |  26.2  |  26.2  |  26.8  |  26.8  |  26.9  |  27.6  |  28.5  |   0.0  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  23.2  |  32.8  |  30.9  |  32.8  |  30.9  |  31.4
radix_sort_lsb_pre<8>        | 117.8  |  97.1  |  79.2  |  65.2  |  54.2  |  47.0  |  42.1  |  38.2  |  35.1  |  33.3  |  31.8  |  31.0  |  30.4  |  29.9  |  29.6  |  29.8  |  29.5  |  29.4  |  29.5  |  29.2  |  29.7  |  29.6  |  29.5  |  29.7  |  61.9  |  43.7  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  27.3  |  34.8  |  32.8  |  30.9  |  32.8
radix_sort_lsb_loop<8>       | 243.6  | 187.3  | 143.0  | 111.4  |  89.4  |  73.3  |  62.9  |  54.2  |  47.9  |  43.7  |  40.5  |  38.4  |  36.7  |  35.7  |  35.1  |  34.8  |  34.2  |  34.2  |  34.7  |  34.0  |  34.8  |  34.7  |  34.6  |  35.7  |   0.0  |   0.0  |  30.9  |  21.9  |  46.4  |  43.7  |  38.7  |  38.3  |  42.5  |  41.0  |  40.6  |  39.6
radix_sort_lsb8_c            | 145.8#~| 113.5#~|  91.5#~|  73.3#~|  61.8#~|  54.9#~|  49.5#~|  44.7#~|  41.4#~|  38.6#~|  38.2#~|  37.1#~|  36.1#~|  35.7#~|  35.2#~|  34.8#~|  34.5#~|  34.6#~|  34.5#~|  35.6#~|  35.9#~|  35.8#~|  35.6#~|  35.8#~|   0.0#~|  43.7#~|  30.9#~|  43.7#~|  30.9#~|  32.8#~|  46.4#~|  43.7#~|  46.4#~|  43.7#~|  52.2#~|  51.9#~
radix_sort_lsb<9>            | 187.5  | 143.4  | 111.2  |  87.5  |  70.3  |  58.0  |  49.4  |  42.7  |  37.6  |  34.5  |  32.2  |  30.6  |  29.3  |  28.2  |  28.1  |  27.5  |  27.2  |  27.2  |  27.0  |  27.3  |  28.5  |  27.7  |  28.6  |  27.9  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  32.9  |  34.2
radix_sort_lsb<10>           | 327.8  | 247.9  | 182.9  | 138.9  | 107.2  |  84.0  |  67.7  |  55.7  |  47.3  |  41.1  |  36.9  |  34.0  |  31.5  |  30.0  |  29.5  |  28.6  |  28.6  |  28.7  |  28.3  |  27.9  |  28.7  |  29.1  |  29.0  |  29.1  |   0.0  |   0.0  |  30.9  |  43.7  |  30.9  |  32.8  |  30.9  |  38.3  |  30.9  |  35.5  |  34.8  |  35.5
radix_sort_lsb<11>           | 500.8  | 373.7  | 271.0  | 200.8  | 149.8  | 113.5  |  86.3  |  67.2  |  54.0  |  44.3  |  37.1  |  32.3  |  28.6  |  26.2  |  24.4  |  23.3  |  22.7  |  22.3  |  22.2  |  22.5  |  23.0  |  23.4  |  23.6  |  24.4  |  61.9  |  43.7  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  27.3  |  27.1  |  30.1  |  32.9  |  32.8
radix_sort_msb_inplace<8>    |  14.3  |  16.2  |  17.9  |  21.3  |  43.5  |  51.5  |  29.5  |  41.9  |  54.6  |  28.6  |  22.8  |  25.1  |  30.4  |  37.7  |  38.7  |  40.8  |  44.2  |  52.6  |  58.5  |  68.6  |  72.4  |  85.3  |  86.6  | 100.8  | 123.7  |  87.5  |  61.9  |  65.6  |  61.9  |  65.6  |  61.9  |  60.1  |  65.7  |  71.1  |  77.3  |  87.5
radix_sort_msb_inplace<8, 256>|  14.1  |  16.7  |  27.2  |  21.1  |  22.2  |  27.4  |  63.1  |  41.4  |  51.9  |  23.1  |  23.5  |  25.4  |  29.6  |  35.6  |  39.0  |  41.0  |  44.2  |  52.4  |  59.1  |  68.6  |  72.2  |  83.8  |  87.1  | 101.0  |  61.9  |  43.7  |  61.9  |  65.6  |  61.9  |  65.6  |  61.9  |  60.1  |  61.8  |  73.8  |  75.4  |  86.1
radix_sort_heuristic         |  16.0  |  16.6  |  17.7  |  22.1  |  22.2  |  26.5  |  28.0  |  29.2  |  15.1  |  14.9  |  24.0  |  26.8  |  22.5  |  27.7  |  26.8  |  27.0  |  26.3  |  26.2  |  26.2  |  26.2  |  27.6  |  26.7  |  26.9  |  27.2  |   0.0  |  43.7  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  27.3  |  30.9  |  30.1  |  32.9  |  32.8
radix_sort_heuristic_inplace |  14.8  |  17.5  |  26.2  |  21.2  |  22.4  |  27.7  |  62.3  |  26.0  |  22.7  |  26.8  |  30.4  |  25.3  |  30.1  |  34.8  |  39.0  |  40.6  |  43.9  |  44.5  |  46.3  |  47.3  |  47.9  |  48.4  |  53.9  |  64.9  |  61.9  |  43.7  |  61.9  |  43.7  |  61.9  |  54.7  |  61.9  |  54.7  |  65.7  |  73.8  |  77.3  |  73.8
radix_sort_heuristic_msb     |  16.9  |  16.4  |  18.0  |  21.2  |  22.4  |  27.0  |  27.7  |  17.7  |  15.4  |  15.3  |  17.4  |  26.2  |  23.1  |  27.9  |  31.2  |  35.2  |  39.6  |  36.9  |  38.3  |  39.3  |  39.8  |  40.0  |  43.6  |  51.5  |  61.9  |  43.7  |  30.9  |  21.9  |  30.9  |  43.7  |  46.4  |  43.7  |  46.4  |  57.4  |  59.9  |  56.0
radix_sort_lsb_separate<8>   |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  30.9  |  32.8  |  32.9  |  32.8
NOTE: # means not sorted.
~ means not stable.

Compiled as x64:

Spoiler

System: 64bit
Estimated CPU frequency: 3.691 GHz
Algorithm                    |      16|      22|      31|      43|      60|      84|     118|     166|     234|     330|     466|     659|     932|    1318|    1864|    2636|    3728|    5272|    7456|   10544|   14912|   21089|   29825|   42181|   59655|   84369|  119321|  168753|  238664|  337539|  477376|  675146|  954849| 1350429| 1909892| 2701132
std_sort                     |   7.9  |  12.4  |  11.2  |  10.6  |  14.2  |  15.7  |  36.9  |  20.3  |  60.2  |  31.8  |  79.4  |  67.3  |  83.2  |  97.1  | 111.0  | 118.3  | 125.6  | 131.8  | 139.7  | 144.5  | 148.5  | 157.8  | 161.6  | 169.2  | 123.8  | 175.0  | 185.6  | 175.0  | 201.1  | 207.8  | 208.8  | 213.2  | 220.4  | 226.9  | 233.9  | 240.5
std_stable_sort              |   8.8  |   9.7  |  19.7  |  13.2  |  26.1  |  19.7  |  21.2  |  48.2  |  20.5  |  65.9  |  42.6  |  76.2  |  85.6  |  86.3  |  95.4  | 102.2  | 109.7  | 116.0  | 122.6  | 126.0  | 133.9  | 139.8  | 146.2  | 151.3  | 123.8  | 131.3  | 185.6  | 175.0  | 170.1  | 185.9  | 201.1  | 202.3  | 208.8  | 216.0  | 224.2  | 229.6
merge_sort_ins_full<18>      |  10.1  |  12.7  |  12.5  |  19.0  |  35.2  |  15.8  |  48.0  |  31.9  |  38.1  |  53.1  |  78.2  |  81.9  |  90.1  | 100.9  | 104.9  | 116.3  | 118.4  | 128.1  | 129.6  | 140.0  | 144.0  | 154.2  | 155.4  | 165.1  | 123.8  | 175.0  | 185.6  | 196.9  | 185.6  | 207.8  | 201.1  | 213.2  | 212.6  | 224.2  | 224.2  | 235.1
merge_sort_ins_new<18>       |  11.1  |  11.0  |  12.4  |  16.8  |  14.6  |  22.9  |  24.4  |  57.7  |  29.4  |  69.9  |  60.2  |  88.2  |  92.7  |  98.2  | 104.7  | 116.3  | 118.7  | 128.6  | 130.1  | 139.4  | 141.6  | 151.9  | 154.4  | 163.1  | 123.8  | 175.0  | 154.7  | 175.0  | 185.6  | 196.9  | 201.1  | 213.2  | 212.6  | 221.4  | 222.3  | 233.7
radix_sort_msb<4>            |  11.8  |  12.4  |  12.5  |  19.4  |  14.7  |  15.8  |  46.7  |  25.3  |  62.9  |  16.4  |  38.4  |  32.6  |  50.8  |  54.9  |  63.8  |  74.9  |  74.6  |  44.4  |  49.1  |  57.0  |  60.0  |  71.0  |  71.4  |  81.2  | 123.8  |  87.5  |  61.9  |  43.8  |  61.9  |  76.6  |  77.3  |  87.5  |  88.9  |  57.4  |  59.9  |  69.7
radix_sort_msb<5>            |  12.3  |  10.7  |  19.5  |  12.8  |  15.2  |  30.1  |  17.0  |  57.0  |  31.7  |  26.0  |  17.2  |  35.3  |  29.8  |  48.3  |  49.7  |  62.7  |  67.0  |  76.2  |  74.5  |  34.0  |  38.7  |  45.7  |  49.3  |  56.9  |  61.9  |  43.8  |  61.9  |  65.6  |  77.3  |  43.7  |  46.4  |  54.7  |  58.0  |  62.9  |  67.6  |  76.5
radix_sort_msb<8, 128>       |  11.6  |  11.1  |  12.7  |  12.9  |  14.7  |  38.3  |  17.9  |  23.7  |  11.9  |  19.4  |  12.8  |  14.3  |  23.8  |  21.7  |  26.9  |  27.2  |  33.6  |  39.1  |  43.6  |  51.4  |  55.4  |  65.5  |  63.4  |  37.9  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  38.7  |  38.3  |  42.5  |  49.2  |  52.2  |  60.1
radix_sort_msb<11, 256>      |  10.6  |  12.7  |  12.6  |  12.8  |  32.1  |  15.8  |  47.6  |  26.3  |  62.0  |  43.2  |  32.0  |  27.1  |  34.8  |  23.3  |  28.2  |  23.6  |  27.2  |  27.2  |  28.5  |  28.8  |  30.9  |  32.3  |  36.4  |  42.9  |   0.0  |  43.8  |  61.9  |  65.6  |  77.3  |  76.6  |  85.1  |  65.6  |  58.0  |  54.7  |  50.3  |  46.5
radix_sort_msb<11>           |  12.6  |  12.5  |  12.6  |  12.8  |  33.0  |  16.0  |  47.3  |  24.3  |  62.4  |  43.2  |  42.5  |  38.3  |  24.1  |  31.2  |  22.5  |  26.9  |  25.5  |  27.4  |  28.0  |  28.7  |  30.9  |  32.1  |  36.7  |  43.2  |  61.9  |  43.8  |  61.9  |  65.6  |  61.9  |  76.6  |  77.3  |  60.1  |  58.0  |  51.9  |  48.3  |  46.5
radix_sort_lsb<4>            |  63.9  |  50.1  |  51.5  |  50.5  |  46.9  |  46.5  |  43.1  |  39.4  |  40.4  |  38.2  |  37.3  |  36.6  |  35.8  |  35.9  |  37.8  |  38.8  |  38.1  |  38.3  |  38.0  |  38.1  |  38.4  |  38.7  |  38.5  |  38.4  |  61.9  |  43.8  |  61.9  |  43.8  |  30.9  |  43.7  |  38.7  |  43.7  |  46.4  |  46.5  |  46.4  |  46.5
radix_sort_lsb<5>            |  69.3  |  61.6  |  55.9  |  48.6  |  48.5  |  45.6  |  38.7  |  35.7  |  34.5  |  33.7  |  31.5  |  32.7  |  31.5  |  32.0  |  33.5  |  33.1  |  33.5  |  34.0  |  33.4  |  33.6  |  33.7  |  33.5  |  33.8  |  33.7  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  38.7  |  43.7  |  42.5  |  43.7  |  42.5  |  42.4
radix_sort_lsb<6>            |  83.0  |  70.3  |  55.5  |  53.2  |  41.4  |  39.1  |  34.3  |  33.3  |  29.3  |  28.6  |  28.4  |  27.6  |  26.6  |  26.5  |  27.6  |  28.0  |  27.5  |  27.8  |  27.3  |  28.2  |  28.1  |  28.4  |  28.6  |  28.7  |   0.0  |  43.8  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  34.8  |  38.3  |  38.7  |  38.3
radix_sort_lsb<7>            | 108.8  |  90.4  |  67.4  |  57.6  |  47.2  |  38.0  |  32.6  |  30.2  |  27.6  |  25.6  |  24.3  |  23.4  |  22.5  |  22.2  |  21.7  |  21.8  |  22.4  |  23.4  |  23.2  |  22.6  |  23.5  |  23.3  |  23.8  |  23.6  |   0.0  |  43.8  |   0.0  |  21.9  |  30.9  |  21.9  |  30.9  |  27.3  |  34.8  |  32.8  |  32.9  |  34.2
radix_sort_lsb<8>            | 195.3  | 150.8  | 108.1  |  82.5  |  66.0  |  49.9  |  41.0  |  33.2  |  28.5  |  24.7  |  22.1  |  20.1  |  18.6  |  17.8  |  17.2  |  17.4  |  17.3  |  18.0  |  17.5  |  17.4  |  18.2  |  18.1  |  18.2  |  18.5  |  61.9  |  43.8  |   0.0  |   0.0  |  15.5  |  21.9  |  23.2  |  27.3  |  27.1  |  24.6  |  27.1  |  27.3
radix_sort_lsb_pre<8>        |  96.8  |  72.7  |  55.9  |  46.6  |  36.5  |  30.9  |  26.4  |  22.9  |  20.6  |  18.6  |  17.4  |  16.6  |  16.0  |  15.6  |  15.5  |  15.7  |  16.8  |  16.6  |  16.8  |  16.9  |  16.9  |  17.2  |  17.3  |  17.1  |   0.0  |   0.0  |   0.0  |  21.9  |  15.5  |  21.9  |  23.2  |  21.9  |  23.2  |  24.6  |  25.1  |  24.6
radix_sort_lsb_loop<8>       | 228.1  | 169.5  | 125.9  |  96.1  |  73.6  |  57.7  |  46.3  |  38.6  |  32.8  |  28.2  |  25.1  |  23.0  |  21.4  |  20.5  |  19.9  |  19.6  |  19.5  |  19.9  |  19.9  |  19.4  |  20.7  |  20.1  |  20.0  |  20.5  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  23.2  |  27.3  |  27.1  |  27.3  |  27.1  |  27.3
radix_sort_lsb8_c            | 198.5#~| 154.0#~| 113.2#~|  90.3#~|  70.9#~|  56.4#~|  48.9#~|  42.4#~|  37.3#~|  33.2#~|  31.4#~|  30.2#~|  28.7#~|  28.9#~|  28.1#~|  26.8#~|  26.7#~|  26.7#~|  26.6#~|  26.9#~|  27.4#~|  27.4#~|  27.9#~|  27.6#~|  61.9#~|  43.8#~|  30.9#~|  21.9#~|  30.9#~|  32.8#~|  30.9#~|  38.3#~|  38.7#~|  35.5#~|  48.3#~|  46.5#~
radix_sort_lsb<9>            | 294.8  | 218.6  | 160.1  | 119.9  |  90.5  |  69.6  |  54.9  |  43.4  |  36.0  |  29.9  |  25.9  |  23.4  |  21.2  |  19.9  |  19.2  |  19.0  |  19.4  |  19.9  |  19.9  |  20.6  |  20.6  |  21.0  |  20.8  |  20.8  |  61.9  |  43.8  |   0.0  |  21.9  |  30.9  |  21.9  |  23.2  |  21.9  |  23.2  |  27.3  |  27.1  |  27.3
radix_sort_lsb<10>           | 562.1  | 413.9  | 297.7  | 219.5  | 170.1  | 128.1  |  95.6  |  72.4  |  55.9  |  44.7  |  36.9  |  32.1  |  27.6  |  24.7  |  22.8  |  22.1  |  22.2  |  22.4  |  23.6  |  23.1  |  24.1  |  24.9  |  24.4  |  24.6  |  61.9  |  43.8  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  32.8  |  30.9  |  30.1  |  30.9  |  30.1
radix_sort_lsb<11>           | 928.5  | 680.1  | 486.9  | 354.9  | 258.4  | 189.6  | 139.4  | 103.7  |  77.5  |  59.0  |  45.9  |  36.8  |  30.1  |  24.6  |  21.7  |  19.7  |  19.2  |  19.6  |  19.7  |  19.4  |  20.8  |  20.7  |  21.1  |  21.5  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  23.2  |  21.9  |  30.9  |  30.1  |  30.9  |  30.1
radix_sort_msb_inplace<8>    |  12.4  |  10.7  |  19.5  |  12.9  |  14.7  |  31.2  |  17.5  |  57.3  |  33.6  |  29.7  |  20.8  |  25.5  |  22.5  |  31.8  |  28.0  |  35.4  |  38.7  |  44.8  |  49.5  |  57.7  |  61.1  |  70.1  |  73.5  |  83.9  |  61.9  |  43.8  |  61.9  |  43.8  |  46.4  |  54.7  |  46.4  |  54.7  |  58.0  |  62.9  |  67.6  |  75.2
radix_sort_msb_inplace<8, 256>|  10.8  |  10.7  |  20.8  |  19.4  |  14.8  |  16.4  |  46.4  |  26.9  |  62.5  |  20.6  |  28.9  |  21.8  |  30.6  |  23.9  |  33.4  |  33.4  |  39.4  |  45.0  |  49.7  |  57.4  |  61.5  |  70.3  |  72.7  |  84.0  | 123.8  |  43.8  |  61.9  |  65.6  |  46.4  |  54.7  |  54.1  |  54.7  |  58.0  |  62.9  |  65.7  |  75.2
radix_sort_heuristic         |  11.9  |  11.4  |  12.5  |  21.0  |  14.7  |  15.8  |  46.6  |  27.6  |  20.2  |  10.9  |  20.1  |  14.0  |  23.5  |  21.7  |  17.3  |  17.4  |  17.3  |  18.0  |  17.5  |  17.7  |  17.9  |  18.7  |  18.2  |  18.5  |   0.0  |   0.0  |   0.0  |  21.9  |  30.9  |  21.9  |  23.2  |  21.9  |  27.1  |  24.6  |  27.1  |  26.0
radix_sort_heuristic_inplace |  10.8  |  12.5  |  14.0  |  14.0  |  15.3  |  38.5  |  18.4  |  33.1  |  21.7  |  27.8  |  21.8  |  28.9  |  20.1  |  24.2  |  31.2  |  36.9  |  40.0  |  39.5  |  39.5  |  39.8  |  41.0  |  42.6  |  47.5  |  58.2  |  61.9  |  43.8  |  61.9  |  65.6  |  46.4  |  54.7  |  54.1  |  54.7  |  58.0  |  65.6  |  67.6  |  75.2
radix_sort_heuristic_msb     |  14.1  |  12.7  |  20.1  |  13.8  |  31.0  |  16.2  |  40.5  |  20.5  |  22.3  |  20.1  |  11.1  |  21.8  |  17.7  |  25.2  |  24.5  |  29.0  |  33.8  |  27.7  |  28.1  |  28.8  |  30.6  |  32.4  |  36.4  |  43.0  |  61.9  |  43.8  |  30.9  |  43.8  |  30.9  |  32.8  |  38.7  |  32.8  |  42.5  |  46.5  |  52.2  |  46.5
radix_sort_lsb_separate<8>   |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |  87.5  |  61.9  |  43.8  |  30.9  |  21.9  |  30.9  |  21.9  |  38.7  |  32.8  |  34.8  |  38.3  |  36.7  |  35.5
NOTE: # means not sorted.
~ means not stable.

##### Share on other sites
1 hour ago, Zaoshi Kaba said:

Visual Studio does not compile tes﻿tb﻿ed:﻿

Yes, radix_sort_lsb8_c was borked in more ways than one. Hopefully, fixed:

The results are about what I would expect (someone already posted data for i7 8700K in the original thread).

So, building all histograms in one go (radix_sort_lsb_pre<8>) does seem to consistently help on x64 in your data, but not on x86 (and the difference is minor, anyway).

And the win over std::sort is even more marked (its also amusing to see std::stable_sort slightly beating std::sort).

Thanks!

##### Share on other sites

Also, I looked at it, and apparently, actual write prefetch (PREFETCHW / _MM_HINT_ET1 / __builtin_prefetch(p,1,3);) hurts a lot, compared to normal _MM_HINT_T0. Go, figure.

Updated main library:

1 hour ago, Zaoshi Kaba said:

Here are my results.﻿

BTW, radix_sort_msb_inplace<8> and radix_sort_msb_inplace<8,256> are exactly the same in that version, so there shouldn't be a difference (beyound measurement noise).

##### Share on other sites

GPU version of radix sorting (with native NVIDIA Turing support, compute).

## Create an account or sign in to comment

You need to be a member in order to leave a comment

## Create an account

Sign up for a new account in our community. It's easy!

Register a new account

• ### Game Developer Survey

We are looking for qualified game developers to participate in a 10-minute online survey. Qualified participants will be offered a \$15 incentive for your time and insights. Click here to start!

• 9
• 56
• 17
• 28