Faster Sorting - at 88% of memory bandwidth

Quote:I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).

Bummer. The 6144*N storage (for 8 processors and radix 2**8) is admittedly excessive even on PCs. However, the virtual-memory trick is not an integral part of the algorithm. It is just a neat way to avoid the isFull checks required when organizing buckets as a deque-style container. Engineering the fastest sorting implementation, bar none, requires pulling out all the stops :)
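For illustration, here is a minimal sketch of how I understand that trick on Windows; the type and function names are my own invention, not the paper's code. Each digit value gets a worst-case-sized region of address space up front, so the scatter loop can write blindly without any isFull check, and pages that are never touched never receive physical frames (the actual implementation apparently commits on demand, hence the page-fault + VirtualAlloc cost mentioned below).

// Hypothetical sketch (not the paper's code): per-digit buckets backed by
// reserved virtual address space, so the scatter loop needs no isFull check.
#include <windows.h>
#include <cstdint>
#include <cstddef>

static const size_t RADIX = 256;

struct ReservedBuckets
{
    uint32_t* base[RADIX];   // start of each bucket's reserved region
    uint32_t* write[RADIX];  // current write position within the bucket

    void Init(size_t maxElementsPerBucket)
    {
        for (size_t d = 0; d < RADIX; ++d)
        {
            // Address space is claimed immediately; physical pages are only
            // assigned when first written, so mostly-empty buckets cost
            // (almost) no real memory.
            base[d] = static_cast<uint32_t*>(
                VirtualAlloc(nullptr, maxElementsPerBucket * sizeof(uint32_t),
                             MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
            write[d] = base[d];
        }
    }

    void Scatter(const uint32_t* keys, size_t n, unsigned shift)
    {
        for (size_t i = 0; i < n; ++i)
        {
            const unsigned digit = (keys[i] >> shift) & (RADIX - 1);
            *write[digit]++ = keys[i];   // write "anywhere", no bounds check
        }
    }
};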

However, the memory requirements and expensive page faults + VirtualAlloc do indeed motivate another approach. As suggested above, replacing the reserved virtual address space with expandable deques would decrease memory use to 3*(N + radix*(cacheLineSize + 2)) + internal fragmentation in the deque's allocations.
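A minimal sketch of what such a deque-style bucket might look like (hypothetical names and an arbitrary block size, not the proposed implementation): the bucket grows one fixed-size block at a time, so the only overhead in the hot loop is a cheap end-of-block check, and memory use is proportional to the bucket's actual contents plus one partially filled block.

// Hypothetical sketch of an expandable deque-style bucket.
#include <cstdint>
#include <vector>

static const size_t BLOCK_ELEMENTS = 4096;   // arbitrary block size

class BlockBucket
{
public:
    void Push(uint32_t key)
    {
        if (pos_ == end_)        // the amortized "isFull" check
            Grow();
        *pos_++ = key;
    }

private:
    void Grow()
    {
        // Append a new fixed-size block; existing blocks are never moved.
        blocks_.emplace_back(BLOCK_ELEMENTS);
        pos_ = blocks_.back().data();
        end_ = pos_ + BLOCK_ELEMENTS;
    }

    std::vector<std::vector<uint32_t>> blocks_;
    uint32_t* pos_ = nullptr;
    uint32_t* end_ = nullptr;
};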

With such a modification, even consoles could profit from the main ideas of the algorithm (reverse-sorting and write-combining), which remain unaffected.

Quote:FYI: It doesn't get an obvious mention in your description above, but this appears to be an external sort, with the main focus on improving the hard-disk I/O side of things with clever caching and asynchronous operations etc.

Whoa, let's not confuse the recent sort paper (which abuses virtual memory but is by no means an external/out-of-core algorithm) with my four-year-old study thesis (a system for speeding up IO).

Quote:
Quote:Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth

Forgive me but I'm not quite seeing how B suggests A, especially when one just refers to an algorithm and the other refers to your specific implementation.

I think it's reasonable to say an algorithm entails a certain constant factor, i.e. no implementation will beat a certain value (of course a bad implementation could involve arbitrarily high factors, but that shouldn't reflect upon the algorithm). The Andersson et al. "signature sort" and its variants necessarily involve a higher constant factor than simpler algorithms because they have to perform hashing, merge very large words, and construct/traverse a trie.
With radix sort, once we get close to the metal, it no longer makes sense to just count operations of cost O(1). For example, a counting-sort pass of the virtual-memory sort only needs to read and write each element exactly once, but it is faster to introduce an additional (!) non-temporal write that copies cache-line-sized buffers to the final destination, thus avoiding cache pollution. For this reason, I consider the constant factor to be the slope of the measured runtime-vs-N line. If the algorithm transfers the minimum amount of data required to solve the problem and the implementation maxes out the machine's peak memory throughput, then its constant factor is clearly as low as it can get (an attribute that extends to the algorithm, as argued above).
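To make the write-combining point concrete, here is a hedged sketch (my own illustration, assuming SSE2 intrinsics and hypothetical names, not the paper's code): elements destined for a bucket are staged in one cache line and then flushed with non-temporal stores, so the output stream bypasses the cache entirely.

// Illustration of write-combining during the scatter pass: a 64-byte staging
// buffer per bucket is flushed to the destination with non-temporal stores.
#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_stream_si128
#include <cstdint>
#include <cstddef>

static const size_t LINE_UINTS = 64 / sizeof(uint32_t);   // 16 keys per cache line

struct WCBucket
{
    alignas(64) uint32_t staging[LINE_UINTS];   // cache-line-sized staging buffer
    size_t fill = 0;                            // keys currently staged
    uint32_t* dst = nullptr;                    // 64-byte-aligned output cursor

    void Push(uint32_t key)
    {
        staging[fill++] = key;
        if (fill == LINE_UINTS)
            Flush();
    }

    void Flush()
    {
        // Four 16-byte non-temporal stores copy the full line to memory
        // without pulling the destination into the cache.
        const __m128i* src = reinterpret_cast<const __m128i*>(staging);
        __m128i* out = reinterpret_cast<__m128i*>(dst);
        for (size_t i = 0; i < 4; ++i)
            _mm_stream_si128(out + i, _mm_load_si128(src + i));
        dst += LINE_UINTS;
        fill = 0;
        // (A real implementation also has to flush a final partial line.)
    }
};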

Quote:Given "any reduction in file size due to compresion lessens I/O time at no cost", what are your comments on simply using a far superrior compression algorithm such as LZMA (which still has very fast decompression) and any existing external sorting algorithm, versus what you have done?
Since anything you compare your techinque to would have to use the same compression techinque, was it worth even using compresion here? Or was it to nullify any effects of driver-based compression?

OK, let's switch to the separate IO work.
The goal there was to speed up the loading of a certain dataset (files accessed by the 0 A.D. game). Having less data to load thanks to compression is great as long as decompression outpaces the hard disk. Especially with the advent of SSDs, we have to be very careful not to worsen performance. That would be a problem with LZMA, because its decompression is actually not "very fast": more like a factor of 1.5-3 slower than Zip and the other fast LZ variants (see http://tukaani.org/lzma/benchmarks.html or http://www.maximumcompression.com/data/summary_mf4.php). In this case, interoperability (making it easy for anyone to create an archive of the dataset) was also an important argument in favor of Zip.
Quote:Original post by Jan Wassenberg
Quote:I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).

With such a modification, even consoles could profit from the main ideas of the algorithm (reverse-sorting and write-combining), which remain unaffected.

That wasn't really my emphasis.


Sorting generally isn't a bottleneck for games.

The very few places in games where sorting is actually a bottleneck are also the times where virtual address space would be the most painful to use.

In almost every case the trivial-to-use general-purpose sort functions work just fine. Don't spend time trying to fix what isn't broken. If you can improve the general-purpose sort routines then I will take it; otherwise the maintenance effort will not be worth the human cost of dealing with the code.

An example is in order:

One of the rare cases in games where sorting speed becomes an issue is when working with a particle system. Assume you have 10K particles. You only have about 10 microseconds during your frame to sort, prune, AND process them.

It is simply impractical to run any of the sorts above. The full processing pipeline must be very aggressive, running on cache-friendly memory-restricted data formats. The data needs to be cut down to the bare minimum and processed completely as the data hits the cache, or the algorithm will fail from simple memory bandwidth limits. Trying to use MORE memory in this environment would be a disaster.

For environments with offline processing or non-interactive speeds, I agree that these sort algorithms are great research topics. When sorting already takes seconds, minutes, or even hours, speeding it up with a bit of virtual memory is awesome. But that is not the games environment.

When your total processing budget is measured in nanoseconds, the words "virtual memory" are a joke.
Thanks for the feedback! I am a bit surprised that there hasn't been more mention of time-critical sorting applications here (it is clear that sorting is generally not a bottleneck, because people make do with what they have).

Quote:Trying to use MORE memory in this environment would be a disaster.

I am afraid you are misunderstanding the algorithm's use of memory. It does not use more physical memory, just more address space - which basically means the freedom to write "anywhere" (depending on the keys' values), not just sequentially somewhere. However, the amount to be written remains exactly the same - and the utilization of available bandwidth would do even memcpy() proud.

Quote:When your total processing budget is measured in nanoseconds, the words "virtual memory" are a joke.

heh, only for a certain understanding of "virtual memory". The image segmentation algorithm of which this sort was originally a part runs at 40 MPixel/s = 25 ns per pixel. It makes heavy use of the MMU - besides the trick mentioned here, all arrays use pre-reserved address space and on-demand commits to ensure they are contiguous, expandable and never invalidated nor copied.
However, there is no paging going on, and no additional cost over "normal"/naive use of memory.
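For readers unfamiliar with this use of the MMU, here is a hedged sketch of the "reserve up front, commit on demand" array idea (Windows API, hypothetical class, not the actual library code): the address range is fixed once, so growth never copies elements or invalidates pointers, and only committed, touched pages consume physical memory.

// Hypothetical sketch of an expandable array built on reserved address space.
#include <windows.h>
#include <cstddef>

class ReservedArray
{
public:
    explicit ReservedArray(size_t maxBytes)
        : maxBytes_(maxBytes), committedBytes_(0)
    {
        // Reserve address space only; no physical memory or commit charge yet.
        base_ = static_cast<char*>(
            VirtualAlloc(nullptr, maxBytes_, MEM_RESERVE, PAGE_NOACCESS));
    }

    // Make at least `bytes` of the array usable by committing more pages
    // from the already-reserved range; the base pointer never changes.
    void* Require(size_t bytes)
    {
        if (bytes > committedBytes_)
        {
            VirtualAlloc(base_, bytes, MEM_COMMIT, PAGE_READWRITE);
            committedBytes_ = bytes;   // (page-size rounding omitted for brevity)
        }
        return base_;
    }

    ~ReservedArray() { VirtualFree(base_, 0, MEM_RELEASE); }

private:
    char* base_;
    size_t maxBytes_;
    size_t committedBytes_;
};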
Do you have any comparative data on when this new sort beats existing ones?

For example, for random 32-bit keys, std::sort beats my 8-bit radix sort until the number of items is larger than about 200. Also, my 16-bit radix sort isn't effective until there's about 70k items.
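For reference, an 8-bit LSD radix sort of this kind looks roughly like the following generic sketch (illustrative only, not the implementation measured here). The four histogram + scatter passes over all N keys plus 256 counters are the fixed overhead that lets std::sort win at small N.

// Minimal 8-bit LSD radix sort for 32-bit keys, shown only for illustration.
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

void RadixSort32(std::vector<uint32_t>& keys)
{
    const size_t n = keys.size();
    std::vector<uint32_t> temp(n);
    uint32_t* src = keys.data();
    uint32_t* dst = temp.data();

    for (unsigned shift = 0; shift < 32; shift += 8)
    {
        size_t count[256] = {};
        for (size_t i = 0; i < n; ++i)          // histogram of the current digit
            ++count[(src[i] >> shift) & 0xFF];

        size_t offset = 0;                      // exclusive prefix sum -> start offsets
        for (size_t d = 0; d < 256; ++d)
        {
            const size_t c = count[d];
            count[d] = offset;
            offset += c;
        }

        for (size_t i = 0; i < n; ++i)          // stable scatter into the other buffer
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        std::swap(src, dst);                    // ping-pong between the two buffers
    }
    // After an even number of passes, the sorted keys are back in `keys`.
}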
Good idea! I haven't determined the break-even point because we're always sorting millions of items, but it would be interesting to know. Will hopefully be able to look into it starting Tuesday.
Update: for 32-bit keys with 32-bit payloads, ICC 12's std::sort is faster until about 56K items.
When there are 64M items to be sorted, the new method outperforms std::sort by a factor of 11.
