Faster Sorting - at 88% of memory bandwidth

Started by Jan Wassenberg. 14 comments, last by Jan Wassenberg 13 years, 5 months ago.
I figure GameDev often calls for highly efficient algorithms, and am happy to present Faster Radix Sort via Virtual Memory and Write-Combining (6-page PDF). It works for 8..64-bit keys and also supports associated values. The kicker is that the throughput of each pass (which sorts 8 bits) is at least 88% of the system's theoretical peak memory bandwidth!
This means 8-bit keys can be copied into sorted order much faster than some memcpy() implementations.

Since healthy skepticism is warranted for "new sorting algorithms", I will note that this is a variant of counting sort with two improvements. The first avoids the initial counting phase: each value to be sorted is read and written exactly once, which is made possible by aggressive use of huge amounts of virtual address space. The second and more important improvement concerns the radix sort used to handle keys wider than 8 bits. The old "reverse sorting" trick from distributed-memory algorithms first applies an MSD pass followed by several LSD passes, which reduces communication and NUMA overhead. Indeed, the algorithm scales perfectly on a dual-socket i7 system.
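
Since the code isn't public, here is a minimal sketch of just the reservation idea, assuming Linux and mmap with MAP_NORESERVE (the paper additionally uses write-combined stores and other tricks, all omitted here; Buckets, ReserveBuckets and ScatterPass are illustrative names of mine, not the paper's):

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdint>

    static const size_t kNumBuckets = 256;   // one bucket per 8-bit digit

    struct Buckets
    {
        uint32_t* base[kNumBuckets];   // start of each bucket
        uint32_t* next[kNumBuckets];   // write cursor of each bucket
    };

    // Reserve worst-case room (n items) for EVERY bucket. With MAP_NORESERVE
    // (and the default overcommit policy), only pages that are actually
    // written receive physical memory, so the real footprint stays near n
    // items despite the huge 256*n reservation. Error checking omitted.
    void ReserveBuckets(Buckets& b, size_t n)
    {
        for (size_t d = 0; d < kNumBuckets; ++d)
        {
            void* p = mmap(0, n * sizeof(uint32_t), PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
            b.base[d] = b.next[d] = static_cast<uint32_t*>(p);
        }
    }

    // One pass over one 8-bit digit: every key is read and written exactly
    // once. No counting phase is needed because no bucket can overflow.
    void ScatterPass(const uint32_t* keys, size_t n, unsigned shift, Buckets& b)
    {
        for (size_t i = 0; i < n; ++i)
            *b.next[(keys[i] >> shift) & 0xFF]++ = keys[i];
    }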

This makes for a 1.5-fold speedup over Intel's recently published radix sort. The measured throughput also exceeds results reported for Knights Ferry (32-core) and Fermi GPUs (when sorting in main memory).
Unfortunately, the code is not publicly available, but reimplementation is an option, and my employer (Fraunhofer applied research labs) would be happy to discuss licensing.

Comments and discussion are welcome :)
That virtual memory trick is really slick.

I have a question though:
Quote:It works for 8..64-bit keys and also supports associated values.

How does it work for 64-bit keys? Wouldn't the allocation required exceed the 64-bit virtual address space if you used 64-bit keys?


Also, I'd be interested to read a copy of your student thesis on the subject of efficient file handling, but I can't find a copy.
@Nitage: it is on the right-side panel of the link he provided. It does look interesting; it is a shame I don't understand it fully. ):
Quote:How does it work for 64-bit keys? Wouldn't the allocation required exceed the 64-bit virtual address space if you used 64-bit keys?

Good question. The counting sort only needs to allocate storage for 8-bit "digits" of the key. After the initial MSD pass scatters items to the first set of buckets, the remaining LSD passes copy back and forth between 2 sets of buckets. 32-bit keys require no more than 3 sets in total because one can be reused after copying from it. In the 64-bit case, the same kind of renaming is possible, so there is no increase. Incidentally, a colleague is attempting to eliminate one of the 3 buffers by copying in-place.
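
To make the pass order concrete, here is a deliberately plain illustration of the reverse-sorting schedule as I understand it, using std::vector partitions instead of the virtual-memory buckets (so it shows only the ordering logic, none of the performance tricks; all names are mine):

    #include <cstdint>
    #include <vector>

    // One stable counting-sort pass on an 8-bit digit.
    static void LsdPass(std::vector<uint32_t>& v, unsigned shift)
    {
        size_t count[257] = {0};
        for (uint32_t k : v) ++count[((k >> shift) & 0xFF) + 1];
        for (size_t d = 1; d <= 256; ++d) count[d] += count[d - 1];
        std::vector<uint32_t> out(v.size());
        for (uint32_t k : v) out[count[(k >> shift) & 0xFF]++] = k;
        v.swap(out);
    }

    // MSD split on the top digit first (on a NUMA system this is the pass
    // that routes each item to its home node), then stable LSD passes run
    // independently inside each partition; since every item in a partition
    // shares the top digit, concatenating the partitions yields sorted order.
    void ReverseSort32(std::vector<uint32_t>& v)
    {
        std::vector<uint32_t> part[256];
        for (uint32_t k : v) part[k >> 24].push_back(k);

        v.clear();
        for (auto& p : part)
        {
            LsdPass(p, 0); LsdPass(p, 8); LsdPass(p, 16);
            v.insert(v.end(), p.begin(), p.end());
        }
    }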

Quote:I'd be interested to read a copy of your thesis on the subject of efficient file handling, but I can't find a copy.

Optimizing File Accesses via Ordering and Caching [144 KB PDF]

Quote:It is a shame I don't understand it fully

Which part needs better explaining?
It reminds me of several parallel-data sort papers from a decade back.

Much like yours, they assign special properties to keys and then perform the sort, giving O(n log log n) performance. Google finds this one among others.

I wonder what would happen if you could combine the results of the two. Unfortunately I don't think any can be made general purpose.
hm. Have you seen any implementations of those ideas? We are aware of the series of papers by Thorup/Andersson/Hagerup and Han, but they appear totally impractical due to huge constants and/or processor word size requirements.
I'll go out on a limb here and question their usefulness, because it doesn't look like much larger word sizes will be forthcoming (we won't have 2**64 worth of local storage for a while, so it doesn't make sense to have larger pointers; wide registers such as Knights Ferry's 512 bits are usually partitioned into lanes). Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth (which does not account for read-write turnarounds).
No, I don't know of any implementations of them, but that's mostly because fancy specialized sorting algorithms are not the focus in game development.

The standard library's sort functions, heap functions, and associative containers provide more than adequate sorting functionality for games. Plus, they are valid for the general case and use almost no additional memory, unlike all the algorithms above (including yours, which requires certain constraints and way too much virtual address space for most games).

Think of it more as a suggestion for future research and another paper.
Quote:standard library's sort functions, heap functions, and associative containers provide more than adequate sorting functionality for games.

Agreed, their functionality is certainly adequate, but what about performance? Are there not use cases (old example: polygon sorting in a software renderer) where speed is paramount?

Quote:including yours that requires certain constraints

Which ones are a problem for you or the applications you envision?

Quote:way too much virtual address space for most games

heh, that is true for everyone still stuck in 32-bit land. In that case, you can drastically reduce space requirements by selecting a smaller base (2**8 is convenient because it doesn't require bit masking and is the maximum you can afford with respect to TLBs) or by using a linked list of buckets.
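
A minimal sketch of the linked-list variant, for illustration only (chunk size and names are mine, not from the paper):

    #include <cstddef>
    #include <cstdint>

    static const size_t kChunkItems = 4096;   // items per chunk; tune to taste

    struct Chunk
    {
        uint32_t items[kChunkItems];
        size_t   used;
        Chunk*   next;
        Chunk() : used(0), next(0) {}
    };

    // Each bucket grows one small chunk at a time, so address-space use
    // tracks the actual item count instead of a worst-case reservation.
    // (Freeing the chunks is left out for brevity.)
    struct ChainedBucket
    {
        Chunk* head;
        Chunk* tail;
        ChainedBucket() : head(new Chunk), tail(head) {}

        void Push(uint32_t key)
        {
            if (tail->used == kChunkItems)   // chunk full: chain a new one
            {
                tail->next = new Chunk;
                tail = tail->next;
            }
            tail->items[tail->used++] = key;
        }
    };

    // The same scatter pass as before, but into chained buckets.
    void ScatterChained(const uint32_t* keys, size_t n, unsigned shift,
                        ChainedBucket buckets[256])
    {
        for (size_t i = 0; i < n; ++i)
            buckets[(keys[i] >> shift) & 0xFF].Push(keys[i]);
    }
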
I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).
FYI: It doesn't get an obvious mention in your description above, but this appears to be an external sort, with the main focus on improving the hard-disk I/O side of things with clever caching, asynchronous operations, etc.

Quote:Original post by Jan Wassenberg
Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth
Forgive me, but I'm not quite seeing how B suggests A, especially when one refers only to an algorithm and the other to your specific implementation.

Given "any reduction in file size due to compresion lessens I/O time at no cost", what are your comments on simply using a far superrior compression algorithm such as LZMA (which still has very fast decompression) and any existing external sorting algorithm, versus what you have done?
Since anything you compare your techinque to would have to use the same compression techinque, was it worth even using compresion here? Or was it to nullify any effects of driver-based compression?
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
