Faster Sorting - at 88% of memory bandwidth


I figure GameDev often calls for highly efficient algorithms, so I am happy to present Faster Radix Sort via Virtual Memory and Write-Combining (6-page PDF). It works for 8..64 bit keys and also supports associated values. The kicker is that the throughput of each pass (which sorts 8 bits) is at least 88% of the system's theoretical peak memory bandwidth!
This means 8-bit keys can be copied into sorted order much faster than some memcpy() implementations.

Since healthy skepticism is warranted for "new sorting algorithms", I will note that this is a variant of counting sort with two improvements. The first avoids the initial counting phase: each value to be sorted is read and written exactly once, which is made possible by aggressive use of huge amounts of virtual address space. The second and more important improvement concerns the radix sort that allows sorting keys wider than 8 bits. The old "reverse sorting" trick from distributed-memory algorithms applies an MSD pass first, followed by several LSD passes, which reduces communication and NUMA overhead. Indeed, the algorithm scales perfectly on a dual-socket i7 system.
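
To give a concrete picture of the first improvement, here is a rough, single-threaded sketch of one 8-bit pass - not the actual implementation; a std::vector with a worst-case reserve() merely stands in for the reserved-but-uncommitted virtual address space, and the names are my own:

#include <cstdint>
#include <vector>

// One 8-bit pass without a prior counting phase: scatter each key into the
// bucket selected by one byte, then concatenate the buckets. Every key is
// read and written exactly once before the final concatenation.
inline void sortByByte(const std::vector<uint32_t>& keys, unsigned shift,
                       std::vector<uint32_t>& sorted)
{
    std::vector<std::vector<uint32_t>> buckets(256);
    for (auto& b : buckets)
        b.reserve(keys.size());            // stand-in for address-space reservation

    for (uint32_t key : keys)              // scatter: one read + one write per item
        buckets[(key >> shift) & 0xFFu].push_back(key);

    sorted.clear();
    for (const auto& b : buckets)          // concatenation yields the sorted order
        sorted.insert(sorted.end(), b.begin(), b.end());
}

In the real thing the per-bucket reservation costs no physical memory until pages are actually touched, and the scatter goes through the write-combining buffers described in the paper.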

This makes for a 1.5-fold speedup over Intel's recently published radix sort. The measured throughput also exceeds results reported for Knights Ferry (32-core) and Fermi GPUs (when sorting in main memory).
Unfortunately, the code is not publicly available, but reimplementation is an option, and my employer (Fraunhofer applied research labs) would be happy to discuss licensing.

Comments and discussion are welcome :)

That virtual memory trick is really slick.

I have a question though:
Quote:
It works for 8..64 bit keys and also supports associated values.

How does it work for 64 bit keys? Wouldn't the allocation required exceed the 64 bit virtual address space if you used 64 bit keys?


Also, I'd be interested to read your student thesis on the subject of efficient file handling, but I can't find a copy.

@Nitage it is in the right-side panel at the link he provided. It does look interesting; it's a shame I don't understand it fully. :(

Quote:
How does it work for 64 bit keys? Wouldn't the allocation required exceed the 64 bit virtual address space if you used 64 bit keys?

Good question. The counting sort only needs buckets for one 8-bit "digit" of the key per pass, so the bucket count is independent of the key width. After the initial MSD pass scatters items to the first set of buckets, the remaining LSD passes copy back and forth between two sets of buckets. 32-bit keys therefore require no more than 3 sets in total, because a set can be reused once it has been copied from. In the 64-bit case, the same kind of renaming is possible, so there is no increase. Incidentally, a colleague is attempting to eliminate one of the 3 buffers by copying in-place.
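
To make the renaming concrete, here is a rough sketch (my own code, not the paper's) of how one MSD bucket of 64-bit keys is finished: once the MSD pass has grouped items by their top byte, the remaining seven LSD passes ping-pong between just two scratch bucket sets, which is why 64-bit keys need no more sets than 32-bit keys.

#include <cstdint>
#include <utility>
#include <vector>

using BucketSet = std::vector<std::vector<uint64_t>>;

// Scatter 'items' into 256 buckets according to byte 'byteIndex' of each key.
static void scatterByByte(const std::vector<uint64_t>& items, unsigned byteIndex,
                          BucketSet& out)
{
    out.assign(256, {});
    for (uint64_t v : items)
        out[(v >> (8 * byteIndex)) & 0xFFu].push_back(v);
}

// Read the buckets back in ascending digit order (what the next pass would do).
static std::vector<uint64_t> flatten(const BucketSet& set)
{
    std::vector<uint64_t> flat;
    for (const auto& b : set)
        flat.insert(flat.end(), b.begin(), b.end());
    return flat;
}

// Sort one MSD bucket (all items already share their top byte) by the
// remaining 7 bytes, least significant first, reusing two scratch sets.
static std::vector<uint64_t> lsdWithinMsdBucket(const std::vector<uint64_t>& items)
{
    BucketSet a, b;
    BucketSet* src = &a;
    BucketSet* dst = &b;
    scatterByByte(items, 0, *src);
    for (unsigned byteIndex = 1; byteIndex <= 6; ++byteIndex)
    {
        scatterByByte(flatten(*src), byteIndex, *dst);
        std::swap(src, dst);               // the set just copied from is reused next
    }
    return flatten(*src);
}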

Quote:
I'd be interested to read a copy of your thesis on the subject of efficient file handling, but I can't find a copy.

Optimizing File Accesses via Ordering and Caching [144 KB PDF]

Quote:
It's a shame I don't understand it fully

Which part needs better explaining?

It reminds me of several parallel-data sort papers from a decade back.

Much like yours, they assign special properties to keys and then perform the sort, giving O(n log log n) performance. Google finds this one among others.

I wonder what would happen if you could combine the results of the two. Unfortunately I don't think any can be made general purpose.

hm. Have you seen any implementations of those ideas? We are aware of the series of papers by Thorup/Andersson/Hagerup and Han, but they appear totally impractical due to huge constants and/or processor word size requirements.
I'll go out on a limb here and question their usefulness, because it doesn't look like much larger word sizes will be forthcoming (we won't have 2**64 worth of local storage for a while, so it doesn't make sense to have larger pointers; wide registers such as Knights Ferry's 512 bits are usually partitioned into lanes). Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth (which does not account for read-write turnarounds).

No, I don't know of any implementations of them, but that's mostly because fancy specialized sorting algorithms are not the focus in game development.

The standard library's sort functions, heap functions, and associative containers provide more than adequate sorting functionality for games. Plus, they are valid for the general case and use almost no additional memory, unlike all the algorithms above (including yours that requires certain constraints and way too much virtual address space for most games).

Think of it more as a suggestion for future research and another paper.

Quote:
standard library's sort functions, heap functions, and associative containers provide more than adequate sorting functionality for games.

Agreed, their functionality is certainly adequate, but what about performance? Are there not use cases (old example: polygon sorting in a software renderer) where speed is paramount?

Quote:
including yours that requires certain constraints

Which ones are a problem for you or the applications you envision?

Quote:
way too much virtual address space for most games

heh, that is true for everyone still stuck in 32-bit land. In that case, you can drastically reduce space requirements by selecting a smaller base (2**8 is convenient because it doesn't require bit masking and is the maximum you can afford with respect to TLBs) or just using a linked-list of buckets.
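
To illustrate the masking point (a sketch of my own, assuming a little-endian machine): with radix 2**8 the digit for a pass is simply one byte of the key, whereas a smaller radix such as 2**4 needs a shift and mask; it also shrinks the reservation per bucket set in proportion to the bucket count (16 instead of 256 buckets), at the price of twice as many passes over the data.

#include <cstdint>

// Digit extraction for pass 'pass' of an LSD radix sort (illustrative only).
inline uint32_t digitRadix256(uint32_t key, unsigned pass)
{
    // radix 2^8: the digit is byte 'pass' of the key - no masking needed when
    // the key is accessed byte-wise (little-endian assumed here).
    return reinterpret_cast<const uint8_t*>(&key)[pass];
}

inline uint32_t digitRadix16(uint32_t key, unsigned pass)
{
    // radix 2^4: 16 buckets instead of 256, but shift + mask and twice the passes.
    return (key >> (4 * pass)) & 0xFu;
}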

I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).

FYI: It doesn't get an obvious mention in your description above, but this appears to be an external sort, with the main focus on improving the hard-disk I/O side of things with clever caching, asynchronous operations, etc.

Quote:
Original post by Jan Wassenberg
Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth
Forgive me but I'm not quite seeing how B suggests A, especially when one just refers to an algorithm and the other refers to your specific implementation.

Given "any reduction in file size due to compresion lessens I/O time at no cost", what are your comments on simply using a far superrior compression algorithm such as LZMA (which still has very fast decompression) and any existing external sorting algorithm, versus what you have done?
Since anything you compare your techinque to would have to use the same compression techinque, was it worth even using compresion here? Or was it to nullify any effects of driver-based compression?

Quote:
I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).

Bummer. The 6144*N storage (for 8 processors and radix 2**8) is admittedly excessive even on PCs. However, the virtual-memory trick is not an integral part of the algorithm. It is just a neat way to avoid the isFull checks required when organizing buckets as a deque-style container. Engineering the fastest sorting implementation, bar none, requires pulling out all the stops :)

However, the memory requirements and expensive page faults + VirtualAlloc do indeed motivate another approach. As suggested above, replacing the reserved virtual address space with expandable deques would decrease memory use to 3*(N + radix*(cacheLineSize + 2)) + internal fragmentation in the deque's allocations.

With such a modification, even consoles could profit from the main ideas of the algorithm (reverse-sorting and write-combining), which remain unaffected.
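
To spell out the numbers (my own reading of the figures above, not literal code from the paper): the 6144*N reservation is simply 3 bucket sets * 8 processors * 256 buckets, each sized for the worst case of N elements, whereas the deque-based variant only ever holds the data itself plus a small per-bucket overhead.

#include <cstddef>

// Reserved address space (in elements) with the virtual-memory scheme,
// assuming radix 2^8, 8 processors and 3 bucket sets: 3*8*256 = 6144 per element.
constexpr std::size_t reservedSlots(std::size_t N)
{
    return 3 * 8 * 256 * N;
}

// Deque-based alternative from the formula above; for large N this is roughly
// 3*N plus the deques' internal fragmentation.
constexpr std::size_t dequeSlots(std::size_t N, std::size_t cacheLineSize = 64)
{
    return 3 * (N + 256 * (cacheLineSize + 2));
}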

Quote:
FYI: It doesn't get an obvious mention in your description above, but this appears to be an external sort, with the main focus on improving the hard-disk I/O side of things with clever caching, asynchronous operations, etc.

Whoa, let's not confuse the recent sort paper (which abuses virtual memory but is by no means an external/out-of-core algorithm) with my four-year-old study thesis (a system for speeding up IO).

Quote:
Quote:
Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth

Forgive me but I'm not quite seeing how B suggests A, especially when one just refers to an algorithm and the other refers to your specific implementation.

I think it's reasonable to say an algorithm entails a certain constant factor, i.e. no implementation can beat a certain value (of course a bad implementation can involve arbitrarily high factors, but that shouldn't reflect upon the algorithm). The Andersson et al. "signature sort" and its variants necessarily involve a higher constant factor than simpler algorithms because they have to perform hashing, merge very large words, and construct/traverse a trie.
With radix sort, once we get close to the metal, it no longer makes sense to simply count operations of cost O(1). For example, a VM counting-sort pass only needs to read and write each element exactly once, yet it is faster to introduce an additional (!) non-temporal write that copies cache-line-sized buffers to the final destination, thus avoiding cache pollution. For this reason, I consider the constant factor to be the slope of the line relating actual runtime to N. If the algorithm transfers the minimum amount of data required to solve the problem and the implementation maxes out the machine's peak memory throughput, then its constant factor is clearly as low as it can get (and, as argued above, that attribute extends to the algorithm).
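
To illustrate that additional non-temporal write (a minimal sketch assuming SSE2 and 64-byte cache lines; the function and its layout are mine, not the paper's code): items bound for a bucket are first appended to a small cache-line-sized staging buffer, and only a full buffer is flushed to its destination with streaming stores, so the destination lines never displace useful data from the cache.

#include <cstdint>
#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_stream_si128

// Flush one 64-byte staging buffer to 'dst' with non-temporal (streaming)
// stores. 'stage' must be 16-byte aligned; 'dst' should be 64-byte aligned so
// that entire cache lines are written in one go, bypassing the cache.
inline void flushLine(const uint32_t* stage, uint32_t* dst)
{
    const __m128i* src = reinterpret_cast<const __m128i*>(stage);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    for (int i = 0; i < 4; ++i)            // 4 x 16 bytes = one 64-byte line
        _mm_stream_si128(out + i, _mm_load_si128(src + i));
}

After the last flush, an _mm_sfence() makes the streamed data visible before the next pass reads it back.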

Quote:
Given "any reduction in file size due to compresion lessens I/O time at no cost", what are your comments on simply using a far superrior compression algorithm such as LZMA (which still has very fast decompression) and any existing external sorting algorithm, versus what you have done?
Since anything you compare your techinque to would have to use the same compression techinque, was it worth even using compresion here? Or was it to nullify any effects of driver-based compression?

OK, let's switch to the separate IO work.
The goal there was to speed up the loading of a certain dataset (files accessed by the 0 A.D. game). Having less data to load thanks to compression is great as long as decompression outpaces the hard disk. Especially with the advent of SSDs, we have to be very careful not to worsen performance. That would be a problem with LZMA, because its decompression is actually not "very fast" - more like a factor of 1.5-3 behind Zip and the other fast LZ variants (see http://tukaani.org/lzma/benchmarks.html or http://www.maximumcompression.com/data/summary_mf4.php). In this case, interoperability (making it easy for anyone to create an archive of the dataset) was also an important argument in favor of Zip.

Quote:
Original post by Jan Wassenberg
Quote:
I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).

With such a modification, even consoles could profit from the main ideas of the algorithm (reverse-sorting and write-combining), which remain unaffected.

That wasn't really my emphasis.


Sorting generally isn't a bottleneck for games.

The very few places in games where sorting is actually a bottleneck are also the times where virtual address space would be the most painful to use.

In almost every case the trivial-to-use general-purpose sort functions work just fine. Don't spend time trying to fix what isn't broken. If you can improve the general-purpose sort routines then I will take it; otherwise the maintenance effort will not be worth the human cost of dealing with the code.




An example is in order:

One of the rare cases in games where sorting speed becomes an issue is when working with a particle system. Assume you have 10K particles. You only have about 10 microseconds during your frame to sort, prune, AND process them.

It is simply impractical to run any of the sorts above. The full processing pipeline must be very aggressive, running on cache-friendly memory-restricted data formats. The data needs to be cut down to the bare minimum and processed completely as the data hits the cache, or the algorithm will fail from simple memory bandwidth limits. Trying to use MORE memory in this environment would be a disaster.



For environments with offline processing or non-interactive speeds, I agree that these sort algorithms are great research topics. When sorting already takes seconds, minutes, or even hours, speeding it up with a bit of virtual memory is awesome. But that is not the games environment.



When your total processing budget is measured in nanoseconds, the words "virtual memory" are a joke.

Thanks for the feedback! I am a bit surprised that there hasn't been more mention of time-critical sorting applications here (that it is generally not a bottleneck is clear, because people make do with what they have).

Quote:
Trying to use MORE memory in this environment would be a disaster.

I am afraid you are misunderstanding the algorithm's use of memory. It does not use more physical memory, just more address space - which basically means the freedom to write "anywhere" (depending on the keys' values), not just sequentially somewhere. However, the amount to be written remains exactly the same - and the utilization of available bandwidth would do even memcpy() proud.

Quote:
When your total processing budget is measured in nanoseconds, the words "virtual memory" are a joke.

heh, only for a certain understanding of "virtual memory". The image segmentation algorithm of which this sort was originally a part runs at 40 MPixel/s = 25 ns per pixel. It makes heavy use of the MMU - besides the trick mentioned here, all arrays use pre-reserved address space and on-demand commits to ensure they are contiguous, expandable and never invalidated nor copied.
However, there is no paging going on, and no additional cost over "normal"/naive use of memory.
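
For the curious, the reserve/commit idea boils down to something like this Windows-only sketch (simplified and with my own names, not the actual implementation): a large range of address space is reserved once, physical pages are committed only as the array grows, and the storage therefore stays contiguous without ever being copied or invalidated.

#include <cstddef>
#include <windows.h>

// Expandable, contiguous array: reserve address space up front, commit pages
// on demand. Growing never moves the data, so existing pointers stay valid.
class ReservedArray
{
public:
    explicit ReservedArray(std::size_t maxBytes)
        : base_(static_cast<char*>(VirtualAlloc(nullptr, maxBytes,
                                                MEM_RESERVE, PAGE_NOACCESS)))
        , reserved_(maxBytes), committed_(0)
    {
    }

    ~ReservedArray() { VirtualFree(base_, 0, MEM_RELEASE); }

    // Ensure at least 'bytes' of the reservation are usable; returns the
    // (unchanged) base address. Costs physical memory only for committed pages.
    void* grow(std::size_t bytes)
    {
        if (bytes > committed_ && bytes <= reserved_)
        {
            VirtualAlloc(base_, bytes, MEM_COMMIT, PAGE_READWRITE);
            committed_ = bytes;
        }
        return base_;
    }

private:
    char* base_;
    std::size_t reserved_, committed_;
};

Usage is simply: reserve once, then grow() as items arrive.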

Do you have any comparative data on when this new sort beats existing ones?

For example, for random 32-bit keys, std::sort beats my 8-bit radix sort until the number of items exceeds about 200. Also, my 16-bit radix sort isn't effective until there are about 70k items.
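
For anyone who wants to find such break-even points themselves, a minimal harness along these lines is enough (a sketch with my own names; it only times std::sort - run the radix sort under test on a copy of the same input and compare):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Time std::sort on n random 32-bit keys.
static double secondsToSort(std::size_t n)
{
    std::mt19937 rng(42);
    std::vector<uint32_t> keys(n);
    for (auto& k : keys)
        k = static_cast<uint32_t>(rng());

    const auto t0 = std::chrono::steady_clock::now();
    std::sort(keys.begin(), keys.end());
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    // Sweep the item count; the break-even point is where the radix sort's
    // time (measured the same way) drops below these figures.
    for (std::size_t n = 64; n <= (1u << 20); n *= 2)
        std::printf("%zu\t%g s\n", n, secondsToSort(n));
}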

Good idea! I haven't determined the break-even point because we're always sorting millions of items, but it would be interesting to know. Will hopefully be able to look into it starting Tuesday.

Update: for 32-bit keys with 32-bit payloads, ICC 12's std::sort is faster until about 56K items.
When there are 64M items to be sorted, the new method outperforms std::sort by a factor of 11.
