Billion word sort

37 comments, last by scottdewald 17 years, 5 months ago
I have this project for school where I need to write a sorting algorithm that will work for one billion 5-character strings [a-u, A-U]. I'm not looking for answers, more just ideas some of you might have. It's supposed to be done in C++ on a single computer with a single hard drive, and the algorithm needs to be black boxed: it should work with 100, 1000, 10000, etc. 5-character strings, and it needs to figure that out at runtime on its own. After it's sorted I need to search for 2 strings hidden in the billion strings.

What I'm planning now is a merge sort of some kind. Before I start reading from the file, I'll create a bunch of files on the hard drive which will be responsible for holding strings starting with a certain character, e.g. filenames [a.dat, b.dat, c.dat, ... G.dat, H.dat, ...]. As I pull the data out of the main file, I'll throw it into the appropriate temp file; the string abHgu would go to the a.dat file. Once I reach the end of the main file, I'll go and sort each individual file and then merge them back into one file. When I'm reading the initial data, I'll probably open a couple hundred streams (you can open 509 simultaneously in VS2005; 3 are reserved for cin, cout, cerr) and have each one read every nth string.

Anyway, any hints or ideas people might like to share, anything that might push me in a better direction, or anything I'm getting wrong that you can point out, would be very much appreciated.
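Here's roughly the distribution step I'm picturing, just a sketch assuming one word per line. The bucket_N.dat names are placeholders; I went with an index instead of the letter itself so that 'a' and 'A' don't end up in the same file on a case-insensitive filesystem.

    #include <fstream>
    #include <map>
    #include <string>

    // Split the big input into one temp file per starting character (42 buckets max).
    void splitByFirstLetter(const char* inputPath) {
        std::ifstream in(inputPath);
        std::map<char, std::ofstream> buckets;   // one output stream per first character
        std::string word;
        while (std::getline(in, word)) {
            if (word.empty()) continue;
            std::ofstream& out = buckets[word[0]];
            if (!out.is_open())
                out.open("bucket_" + std::to_string(static_cast<int>(word[0])) + ".dat");
            out << word << '\n';
        }
    }

With only 42 possible first characters this stays well under the 509-stream limit.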
I like your merge sort approach, but consider whether it is really necessary to create a bunch of files preemptively. Why not just create files when needed? Also, do you really need files keyed on the first letter ('a' or whatever it is)? What happens if you instead end up with many files, each full of sorted sets of strings?
If you can use the Windows API (or any other OS's API, really), I'd strongly suggest using a single memory-mapped file. If the file format is fixed (such as one 5-character word per line, with the line terminator being "\n"), you can easily figure out how many words you have, and from there you can do something like load as many as will fit into memory, sort them, and write them to your mapped file (storing the offset and size in an in-memory 'master index'), then repeat until the end. After that, a simple merge can combine the sorted chunks from the first memory-mapped file into a final destination file.
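A minimal Windows sketch of the mapping idea, assuming fixed 6-byte records (5 characters plus '\n'); sortMappedFile and the run size are just illustrative, and most error handling is trimmed:

    #include <windows.h>
    #include <algorithm>
    #include <cstddef>
    #include <cstring>

    struct Record { char text[6]; };   // 5 characters + '\n'

    bool recordLess(const Record& a, const Record& b) {
        return std::memcmp(a.text, b.text, 5) < 0;   // compare only the 5 letters
    }

    // Map the file and sort it in fixed-size runs directly in the mapping.
    // A billion 6-byte records is roughly 6 GB, so this needs a 64-bit build.
    bool sortMappedFile(const char* path) {
        HANDLE file = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE) return false;
        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, 0, NULL);
        Record* base = static_cast<Record*>(
            MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0));
        if (base) {
            std::size_t count = static_cast<std::size_t>(size.QuadPart) / sizeof(Record);
            const std::size_t runRecords = 16 * 1024 * 1024;   // "as many as fit in memory"
            for (std::size_t i = 0; i < count; i += runRecords)
                std::sort(base + i, base + std::min(count, i + runRecords), recordLess);
            UnmapViewOfFile(base);
        }
        CloseHandle(mapping);
        CloseHandle(file);
        return base != NULL;
    }

After this the file holds sorted runs of 16M records each, and the final merge pass combines them into the destination file.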

You could use block read/write (fread/fwrite or ifstream::read/ofstream::write) instead of memory mapping, but it's more complicated because it means manually 'paging' things into and out of memory as needed, and it could end up much slower if you do it wrong. The upside is that it would be standard C or standard C++ and thus would be platform independent.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
I agree with Colin. What would happen if all the words start with "a"? You'd have a.dat being equal to the original file. You can't recurse using this same sort function, and other standard sorting functions might not fit in memory.

Also consider that the item size is fixed (5 chars). This means we could treat the file as an array (and use fseek to get at any element) and use heapsort or another in-place algorithm to sort the whole file, but at the cost of very slow operation, since lots of disk accesses have to be made. This might not be feasible to actually implement on a billion-item list.

Personally, I would (a rough sketch follows after the notes below):
1. Get the max amount of memory possible.
2. Allocate a large percentage of it (say 70-80%) as a buffer.
3. Read as much data as possible into the buffer.
4. Sort the buffer using a standard sort (quicksort, etc.).
5. Write the buffer into a temp file.
6. Clear the buffer and read the next batch of data, sort, write (steps 3-5).
7. You now have several sorted temp files.
8. Merge the sorted files (and in the process search for the 2 hidden strings).

Notes:
* Since the items are always 5 chars, you can use binary search and fseek to find the hidden strings. As an added bonus, you can read a lot of words at a time using fread, to avoid reading line by line and maybe get a small speed boost.

In the best case, all the data fits into the buffer, so the sort is done entirely in memory. This can be modified to use no additional disk space by writing the sorted data back into the main file.
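Sketched out in plain C++ streams it would look roughly like this (the run*.tmp names and wordsPerRun are just whatever fits your memory budget; phase 2 is a k-way merge using a small heap):

    #include <algorithm>
    #include <fstream>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Phase 1: cut the input into sorted temp files ("runs").
    std::vector<std::string> makeRuns(const char* inputPath, std::size_t wordsPerRun) {
        std::ifstream in(inputPath);
        std::vector<std::string> runNames;
        std::vector<std::string> buffer;
        std::string word;
        for (;;) {
            buffer.clear();
            while (buffer.size() < wordsPerRun && std::getline(in, word))
                buffer.push_back(word);
            if (buffer.empty()) break;
            std::sort(buffer.begin(), buffer.end());
            runNames.push_back("run" + std::to_string(runNames.size()) + ".tmp");
            std::ofstream out(runNames.back());
            for (std::size_t i = 0; i < buffer.size(); ++i) out << buffer[i] << '\n';
        }
        return runNames;
    }

    // Phase 2: k-way merge of the sorted runs into the final file,
    // always pulling the smallest current word from whichever run holds it.
    void mergeRuns(const std::vector<std::string>& runNames, const char* outputPath) {
        typedef std::pair<std::string, std::size_t> Entry;   // (word, run index)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
        std::vector<std::ifstream> runs(runNames.size());
        std::string word;
        for (std::size_t i = 0; i < runNames.size(); ++i) {
            runs[i].open(runNames[i]);
            if (std::getline(runs[i], word)) heap.push(Entry(word, i));
        }
        std::ofstream out(outputPath);
        while (!heap.empty()) {
            Entry top = heap.top();
            heap.pop();
            out << top.first << '\n';   // this is also where you'd check for the 2 hidden strings
            if (std::getline(runs[top.second], word)) heap.push(Entry(word, top.second));
        }
    }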

--------------- Magic is real, unless declared integer. - The collected sayings of Wiz Zumwalt
The strings are all 5 characters?

The characters are restricted to [a-u, A-U]?

So 42 different code points per character.

How many different 5 character strings are there?

42^5, or roughly 13 million, or roughly 2^27, strings.

If you only need to know existence, you can store this in 2^19 bytes of data, or 1/2 a meg. Get a string. Encode it as a number from 0 to 2^27. Set that bit in your 1/2 megabyte array. Repeat.

If you need to have a count of duplicates, things get harder. :)

Well, not much harder.

4 bytes per node, 2^27 nodes, is 2^29 bytes, or about half a gigabyte of data. Easy to fit on a hard drive.

Optimization I leave to you.

The algorithm I'm using is known as "bucket sort". It relies on the fact that you have more words than possible distinct words, which lets you beat the O(n lg n) theoretical bound on comparison-based sorting.
NotAYakk has the right idea (I didn't want to give it all away) but has a couple of wrong numbers.

Quote: Original post by NotAYakk (numbers corrected)
42^5, or roughly 130 million, or roughly 2^27, strings.

If you only need to know existence, you can store this in 2^24 bytes of data, or 16 megs. Get a string. Encode it as a number from 0 to 2^27. Set that bit in your 16 megabyte array. Repeat.
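A minimal sketch of the encode-and-set-a-bit idea with those corrected sizes; the base-42 mapping below (a-u to 0..20, A-U to 21..41) is just one obvious choice:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Combine the five characters into one base-42 index in [0, 42^5).
    std::uint32_t encode(const std::string& word) {
        std::uint32_t code = 0;
        for (std::size_t i = 0; i < word.size(); ++i) {
            char c = word[i];
            std::uint32_t digit = (c >= 'a') ? (c - 'a') : (21 + (c - 'A'));
            code = code * 42 + digit;
        }
        return code;
    }

    int main() {
        const std::uint32_t kCodes = 42u * 42u * 42u * 42u * 42u;   // 42^5, about 130 million
        std::vector<bool> seen(kCodes, false);                      // packed bits, roughly 16 MB

        // For each word read from the input file:
        //     seen[encode(word)] = true;

        // Checking for a hidden string is then a single lookup, e.g.:
        // bool found = seen[encode("abHgu")];
        return 0;
    }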


I'm pretty sure the optimal solution would be to make a lexicographic tree (a trie).
    struct WordTree {
        bool IsWord;
        WordTree* letters[26];   // assuming all 26 letters
    };


If you have a maximum word length, it makes it even easier. Basically, the word is in the list if you can traverse its letters without hitting a NULL pointer and the IsWord flag is set on the final node.
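Sketched out for this assignment's 42-character alphabet (a-u, A-U) instead of 26, lookup and insertion would look roughly like this:

    #include <cstddef>
    #include <string>

    struct WordTree {
        bool IsWord;
        WordTree* letters[42];   // a-u map to 0..20, A-U map to 21..41
        WordTree() : IsWord(false) { for (int i = 0; i < 42; ++i) letters[i] = 0; }
    };

    int letterIndex(char c) {
        return (c >= 'a') ? (c - 'a') : (21 + (c - 'A'));
    }

    void insert(WordTree* root, const std::string& word) {
        WordTree* node = root;
        for (std::size_t i = 0; i < word.size(); ++i) {
            int idx = letterIndex(word[i]);
            if (!node->letters[idx]) node->letters[idx] = new WordTree();   // grow the path
            node = node->letters[idx];
        }
        node->IsWord = true;
    }

    bool contains(const WordTree* root, const std::string& word) {
        const WordTree* node = root;
        for (std::size_t i = 0; i < word.size(); ++i) {
            node = node->letters[letterIndex(word[i])];
            if (!node) return false;   // hit a NULL pointer: not in the tree
        }
        return node->IsWord;
    }

Since the words all have the same fixed length, this stores the same membership information as the flat bit array above, just with pointers instead of arithmetic.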

Oh yeah, the best part is that searching AND inserting AND deleting are all constant time (you only ever walk the fixed word length) regardless of the # of words.

Hope that helps.
------- Harmotion - Free 1v1 top-down shooter! | Double Jump Studios | Blog
That "bucket sort" is essentially a counting sort (a close cousin of radix sort).

Create a function that takes the 5-character string and encodes it as a 30-bit number: encode each character (42 possibilities) in 6 bits and shift-and-or the 5 of them together into one 30-bit value.

Make a single 1GB file and zero it.
Read in the first string, encode it, and add 1 to the byte at that position in the 1GB file.
Do this for all 1 billion strings. If the counts are reasonably evenly distributed this will work without issue.

Sorting is now O(n) and searching is O(1). Take the string you are looking for, encode it, look at that spot in the file, and if it's > 0, it's there.

Now, your instructor might be hip to this killer optimization, and he might toss in something insane like 500,000,000 copies of the same value, so we need a way for this to scale.

When you read in the byte value and add 1 to it, you need to ensure you're not overflowing 255. If you read 255, you need to create a second 1GB file. At first you might think to just continue counting in the new file starting with 1 and leave the 255 in the first file. Don't do that; instead use the second file as the second byte of the number: 0xFF + 1 = 0x0100, so write 0x00 into the first file and 0x01 into the second file. Scale this up to four 1GB files and you'll handle up to 4 billion repetitions of the same string.
(Just make sure your read routine takes the bonus files into account, if they exist, so you read back the correct count.)
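A rough sketch of the increment-with-carry, assuming the 30-bit encode described above and four pre-created, zero-filled 1GB count files (the streams are passed in; the file names are up to you):

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Add 1 to the count stored for 'code', treating the k-th file as the
    // k-th byte of a little-endian counter spread across up to 4 files.
    void incrementCount(std::vector<std::fstream>& countFiles, std::streamoff code) {
        for (std::size_t level = 0; level < countFiles.size(); ++level) {
            char raw = 0;
            countFiles[level].seekg(code);
            countFiles[level].read(&raw, 1);
            unsigned char value = static_cast<unsigned char>(raw) + 1;   // may wrap 255 -> 0
            raw = static_cast<char>(value);
            countFiles[level].seekp(code);
            countFiles[level].write(&raw, 1);
            if (value != 0) break;   // no overflow, so no carry into the next (bonus) file
        }
    }

Reading a count back is the same walk in reverse: read one byte per file at offset 'code' and combine them as a little-endian integer.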

If you code it efficiently you can bust out the sort in about 30 seconds.
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
I would probably use an approach similar to bubble sort, only operating on blocks. Let N be the maximum number of words that you can safely fit in memory at any given moment without hitting the swap.


1. Open file "input"
2. Open file "output"
3. Create two buffers A and B of size N/2 each.
4. Read two buffers from "input".
5. Sort buffer A using any in-place method of your choice.
6. Begin Loop
7. Sort buffer B using any in-place method of your choice.
8. Merge A and B by outputting their first N/2 elements into "output".
9. Merge the rest of A and B, sorted, into A.
10. If (Read N/2 elements from "input" into B) then goto 6
11. Else Write A to "output".


The above performs one bubble-sort-style iteration over the data. Once you do (2*Size/N) iterations, the file will be sorted. For greater speed, you can combine this with your intended approach of writing elements to 42 different files based on their first letter, which will decrease the number of necessary iterations.
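A rough sketch of one such iteration, assuming one word per line and that 'half' words fit in each of the two buffers (the names are just illustrative):

    #include <algorithm>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    bool readBlock(std::ifstream& in, std::vector<std::string>& buf, std::size_t count) {
        buf.clear();
        std::string word;
        while (buf.size() < count && std::getline(in, word))
            buf.push_back(word);
        return !buf.empty();
    }

    void writeBlock(std::ofstream& out, const std::vector<std::string>& buf) {
        for (std::size_t i = 0; i < buf.size(); ++i)
            out << buf[i] << '\n';
    }

    // One iteration of steps 1-11 above; after (2*Size/N) such passes the file is sorted.
    void onePass(const char* inputPath, const char* outputPath, std::size_t half) {
        std::ifstream in(inputPath);
        std::ofstream out(outputPath);
        std::vector<std::string> a, b, merged;

        readBlock(in, a, half);
        std::sort(a.begin(), a.end());
        while (readBlock(in, b, half)) {
            std::sort(b.begin(), b.end());
            // Merge A and B: the smallest half goes to "output",
            // the largest half stays in A for the next loop iteration.
            merged.clear();
            std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged));
            std::size_t keep = a.size();
            writeBlock(out, std::vector<std::string>(merged.begin(), merged.begin() + keep));
            a.assign(merged.begin() + keep, merged.end());
        }
        writeBlock(out, a);   // whatever is left is the largest block of this pass
    }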
I strongly suspect that operating-system-specific APIs such as those required for memory mapping are not to be used. Furthermore, this may be intended as an external sorting exercise, which would mean that they do not want you to try to sort the whole thing in memory. I would seek clarification on this issue.
I myself have done such an assignment in the past, and it had to be external sorting.

The approach suggested to us was to merge to and from files. You simply build runs as long as possible by appending each item to the first file whose previous item is less than the new item. If no such file exists AND you already have a certain number of files, you append the item to the file with the largest previous item (starting a new run in that file); about 16 files max should be plenty. Then close those files, reopen them as input, and repeat. You continue until you complete one iteration without having to write to a second file, at which point you can rename the one remaining file to the final filename, as it is fully sorted.
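A sketch of just that first run-distribution pass (the run*.txt names and the 16-file cap are arbitrary choices; the later merge iterations reopen these files as input in the same way):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    const std::size_t kMaxRunFiles = 16;

    void distributeRuns(const char* inputPath) {
        std::ifstream in(inputPath);
        std::vector<std::ofstream> files;
        std::vector<std::string> lastWord;   // last word written to each run file
        files.reserve(kMaxRunFiles);

        std::string word;
        while (std::getline(in, word)) {
            if (word.empty()) continue;
            // Extend the run in the first file whose previous word doesn't exceed this one.
            std::size_t target = files.size();
            for (std::size_t i = 0; i < files.size(); ++i)
                if (lastWord[i] <= word) { target = i; break; }

            if (target == files.size()) {
                if (files.size() < kMaxRunFiles) {
                    // No suitable file yet and room for more: start a brand new run file.
                    files.emplace_back("run" + std::to_string(files.size()) + ".txt");
                    lastWord.push_back(word);
                    files.back() << word << '\n';
                    continue;
                }
                // At the file limit: break a run by appending to the file with the
                // largest previous word (this starts a new run in that file).
                target = 0;
                for (std::size_t i = 1; i < files.size(); ++i)
                    if (lastWord[i] > lastWord[target]) target = i;
            }
            files[target] << word << '\n';
            lastWord[target] = word;
        }
    }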
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
