Quote:I suspect frob is referring to consoles, where the virtual address space requirements make this algorithm a non-starter. Even when those platforms have sufficient virtual address space, it's generally not usable in the same way as on a PC OS (either no automatic backing store, or the implementation is much slower).
Bummer. The 6144*N storage (for 8 processors and radix 2**8) is admittedly excessive even on PCs. However, the virtual-memory trick is not an integral part of the algorithm. It is just a neat way to avoid the isFull checks required when organizing buckets as a deque-style container. Engineering the fastest sorting implementation, bar none, requires pulling out all the stops :)
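For anyone curious how the trick works, here is a minimal sketch (Windows-only C++; the names are mine, and the actual implementation may differ, e.g. by committing pages on demand from an exception handler rather than up front):

#include <windows.h>
#include <cstdint>
#include <cstddef>

// Hypothetical sketch of the virtual-memory trick: reserve and commit the
// worst-case bucket size up front. The commit charge is paid immediately,
// but physical pages are only allocated on first touch, so sparsely used
// buckets cost little RAM, and appending never needs an isFull check.
struct VmBucket {
    uint32_t* base = nullptr;  // start of the reserved range
    size_t    pos  = 0;        // number of elements written so far

    bool Init(size_t maxElements) {
        base = static_cast<uint32_t*>(VirtualAlloc(nullptr,
            maxElements * sizeof(uint32_t),
            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
        return base != nullptr;
    }

    void Append(uint32_t value) {
        base[pos++] = value;   // no capacity check needed
    }

    void Free() {
        if (base) VirtualFree(base, 0, MEM_RELEASE);
    }
};

Because the full worst-case range is addressable from the start, Append is branch-free; the OS takes care of backing pages with physical memory as they are first touched.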
However, the memory requirements and the cost of page faults plus VirtualAlloc calls do indeed motivate another approach. As suggested above, replacing the reserved virtual address space with expandable deques would decrease memory use to 3*(N + radix*(cacheLineSize + 2)) plus internal fragmentation in the deque's allocations.
With such a modification, even consoles could profit from the main ideas of the algorithm (reverse-sorting and write-combining), which remain unaffected.
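To make that concrete, here is a rough sketch of such a deque-style bucket (the chunk size and names are my own assumptions, not taken from the paper):

#include <cstdint>
#include <vector>

// Hypothetical sketch of an expandable deque-style bucket: fixed-size
// chunks are appended as needed, so each bucket wastes at most one
// partially filled chunk instead of reserving worst-case address space.
class ChunkedBucket {
    static const size_t kChunkElements = 4096;  // tuning parameter (assumption)
    std::vector<uint32_t*> chunks;
    size_t posInChunk = kChunkElements;         // forces allocation on first Append

public:
    void Append(uint32_t value) {
        if (posInChunk == kChunkElements) {     // the reintroduced isFull check
            chunks.push_back(new uint32_t[kChunkElements]);
            posInChunk = 0;
        }
        chunks.back()[posInChunk++] = value;
    }

    ~ChunkedBucket() {
        for (uint32_t* c : chunks) delete[] c;
    }
};

The isFull check returns, but it is a cheap and highly predictable branch; the reverse-sorting and write-combining steps proceed exactly as before.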
Quote:FYI: It doesn't get an obvious mention in your description above, but this appears to be an external sort, with the main focus on improving the hard disk I/O side of things with clever caching and asynchronous operations etc.
Whoa, let's not confuse the recent sorting paper (which abuses virtual memory but is by no means an external/out-of-core algorithm) with my four-year-old study thesis (a system for speeding up IO).
Quote:Quote:Also, radix sort's constant factors are extremely low, as evidenced by this implementation nearly maxing out theoretical memory bandwidth
Forgive me but I'm not quite seeing how B suggests A, especially when one just refers to an algorithm and the other refers to your specific implementation.
I think it's reasonable to say an algorithm entails a certain constant factor, i.e. no implementation can beat a certain value (of course a bad implementation could involve arbitrarily high factors, but that shouldn't reflect upon the algorithm). The Andersson et al. "signature sort" and its variants necessarily involve a higher constant factor than simpler algorithms because they have to perform hashing, merge very large machine words, and construct/traverse a trie.
With radix sort, once we get close to the metal, it no longer makes sense to just count operations of cost O(1). For example, the counting sort pass of the virtual-memory sort only needs to read and write each element exactly once, but it is faster to introduce an additional (!) non-temporal write that copies cache-line-sized buffers to the final destination, thus avoiding cache pollution. For this reason, I consider an implementation's constant factor to be the slope of its measured runtime-vs.-N line. If the algorithm transfers the minimum amount of data required to solve the problem and the implementation maxes out the machine's peak memory throughput, then its constant factor is clearly as low as it can get (and that attribute extends to the algorithm itself, as argued above).
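To illustrate the extra non-temporal write (this is my own reconstruction of the idea, not code from the paper): instead of writing elements straight to the output array, we gather a cache line's worth in a small aligned buffer and flush it with streaming stores, so the large destination array never displaces useful cache lines:

#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstddef>

// Sketch of software write-combining: accumulate 64 bytes in a
// cache-line-aligned buffer, then flush it to the destination with
// non-temporal stores that bypass the cache hierarchy.
const size_t kLineBytes = 64;
const size_t kLineElems = kLineBytes / sizeof(uint32_t);  // 16

struct WcBuffer {
    alignas(kLineBytes) uint32_t line[kLineElems];
    size_t fill = 0;

    // dst must be cache-line-aligned; it is advanced here on each flush.
    void Append(uint32_t value, uint32_t*& dst) {
        line[fill++] = value;
        if (fill == kLineElems) {
            for (size_t i = 0; i < kLineElems; i += 4) {
                __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(line + i));
                _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
            }
            dst += kLineElems;
            fill = 0;
        }
    }
};
// Note: a single _mm_sfence() is required after the final flush to make
// the streaming stores globally visible before the output is read back.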
Quote:Given "any reduction in file size due to compresion lessens I/O time at no cost", what are your comments on simply using a far superrior compression algorithm such as LZMA (which still has very fast decompression) and any existing external sorting algorithm, versus what you have done?
Since anything you compare your technique to would have to use the same compression technique, was it worth even using compression here? Or was it to nullify any effects of driver-based compression?
OK, let's switch to the separate IO work.
The goal there was to speed up the loading of a certain dataset (files accessed by the 0 A.D. game). Having less data to load thanks to compression is great as long as decompression outpaces the hard drive. Especially with the advent of SSDs, we have to be very careful not to worsen performance. This would be a problem with LZMA, because its decompression is actually not "very fast"; it is more like a factor of 1.5-3 slower than Zip and the other fast LZ variants (see http://tukaani.org/lzma/benchmarks.html or http://www.maximumcompression.com/data/summary_mf4.php). In this case, interoperability (making it easy for anyone to create an archive of the dataset) was also an important argument in favor of Zip.
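A quick back-of-envelope model shows why. With IO and decompression fully overlapped, load time is bounded by the slower of the two stages; all throughput numbers and ratios below are illustrative assumptions on my part, not measurements:

#include <algorithm>
#include <cstdio>

// Back-of-envelope model: with reading and decompression overlapped,
// total load time is the maximum of the two stage times. Compression
// only helps while the decompressor keeps pace with the drive.
double LoadSeconds(double uncompressedMB, double ratio,
                   double diskMBps, double decompressMBps) {
    const double compressedMB = uncompressedMB * ratio;
    return std::max(compressedMB / diskMBps,
                    uncompressedMB / decompressMBps);
}

int main() {
    // Illustrative figures only: HD ~100 MB/s, SSD ~250 MB/s,
    // Zip ~200 MB/s vs. LZMA ~80 MB/s decompression.
    printf("HD  + Zip : %.2f s\n", LoadSeconds(1000, 0.60, 100, 200)); // 6.00
    printf("HD  + LZMA: %.2f s\n", LoadSeconds(1000, 0.45, 100, 80));  // 12.50
    printf("SSD + Zip : %.2f s\n", LoadSeconds(1000, 0.60, 250, 200)); // 5.00
    printf("SSD + LZMA: %.2f s\n", LoadSeconds(1000, 0.45, 250, 80));  // 12.50
}

Under these (made-up but plausible) numbers, LZMA's better ratio never pays for its slower decompressor, and the gap only widens on an SSD.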