Jump to content
  • Advertisement
Sign in to follow this  
klayAlloy

Billion word sort

This topic is 4325 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I have this project for school where I need to write a sorting algorithm that will work for one billion 5 character strings [a-u, A-U]. I'm not looking for answers, more just ideas some of you might have. It's supposed to be done in C++ on a single computer with a single hard drive. And the algorithm needs to be black boxed. It should work with 100, 1000, 10000, etcetera 5 character strings. It needs to figure it out at runtime on it's own. After it's sorted I need to search for 2 strings hidden in the billion strings. What i'm planning now is a merge sort of some kind. Before I start reading from the file, i'll create a bunch of files on the hard drive which will be responsible for holding strings starting with a certain character. eg. filenames[a.dat, b.dat, c.dat, ... G.dat, H.dat,...]. As I pull the data out of the main file, i'll throw it into the appropriate temp file...the string abHgu would go to the a.dat file. Once I reach the end of the main file, i'll go and sort each individual file and then merge them back into one file. When i'm reading the initial data, i'll probably open a couple hundred streams (you can open 509 simultaneously in vs2005...3 are reserved for cin, cout, cerr) and have each one read like every nth string. Anyway, any hints or ideas people might like to share, or point out anything that might push me in a better direction or point something out I'm getting wrong, would be very much appreciated.

Share this post


Link to post
Share on other sites
Advertisement
I like your merge sort approach but consider whether or not it is necessary to create a bunch of files preemptively. Is it really necessary? Why not just make files when needed? Also, is it necessary that you use files based on words starting with 'a' (or whatever the first letter is) - what happens if you have many files full of sorted sets of strings?

Share this post


Link to post
Share on other sites
If you can use the Windows API (or any other OS's API, really), I'd strongly suggest using a single memory mapped file. If the file format is fixed (such as one 5-character word per line, with the line terminator being "\n"), you can easily figure out how many words you have, and from there you can easily do something like load in as many as you can fit into memory, sort them, then write those to your mapped file (storing the offset and size in an in-memory 'master index'), then repeat until the end. After that, a simple merge could combine the stacks from the first memory mapped file into a final destination file.

You could use block read/write (fread/fwrite or ifstream::read/ofstream::write)instead of memory mapping, but it's more complicated because it means manually 'paging' things into and out of memory as needed, and it could end up much slower if you do it wrong. The upside is that it would be standard C or standard C++ and thus would be platform independant.

Share this post


Link to post
Share on other sites
I agree with Colin. what would happen if all the words start with "a"? you'd have a.dat being equal to the original file. you can't recurse using this same sort function and other standard sorting functions might not fit in memory.

also consider that the item size is fixed (5 chars) this means we could treat the file as an array (and use fseek to get the element)and use heapsort or other in-place algorithims to sort the whole file but at the cost of very slow operation since lots of disk access have to be made. This might not be feasible to actually implement on a billion size list.

personaly, I would:
1. get the max amount of memory possible
2. allocate a large percent (say 70-80%) in a buffer
3. read as much data as possible into the buffer
4. sort the buffer using standard sorting (quicksort,etc)
5. write the buffer into a temp file
6. clear the buffer and read the next batch of data, sort, write (steps 3-5)
7. you now have several sorted temp files
8. merge the sorted files (and in the process search for the 2 hidden strings)

notes:
*since the items are always 5 chars, you can use binary search and fseek to find
the hidden strings. As an added bonus, can a lot of words at the same time using fread to avoid reading line by line to maybe get a small speed boost.

In the best case, all data fits into the buffer so sorting is just all in memory. this can be modified to use no additional disk space by writing the sorted data back into the main file.

Share this post


Link to post
Share on other sites
The strings are all 5 characters?

The characters are restricted to [a-u, A-U]?

So 42 different code points per character.

How many different 5 character strings are there?

42^5, or roughly 13 million, or roughly 2^27, strings.

If you only need to know existance, you can store this in 2^19 bytes of data, or 1/2 a meg. Get a string. Encode it as a number from 0 to 2^27. Set that bit in your 1/2 megabyte array. Repeat.

If you need to have a count of duplicates, things get harder. :)

Well, not much harder.

4 bytes per node, 2^27 nodes, is 2^31 bytes. Or about 2 gigabytes of data. Easy to fit in a hard drive.

Optimization I leave to you.

The algorithm I'm using is known as "bucket sort". It relies on the fact that you have more words than you have possibly different words in order to break the O(n lgn) theoretical restriction on the speed of sorting.

Share this post


Link to post
Share on other sites
NotAYakk has the right idea (I didn't want to give it all away) but a couple of wrong numbers.

Quote:
Original post by NotAYakk
42^5, or roughly 130 million, or roughly 2^27, strings.

If you only need to know existance, you can store this in 2^24 bytes of data, or 16 megs Get a string. Encode it as a number from 0 to 2^27. Set that bit in your 16 megabyte array. Repeat.


Share this post


Link to post
Share on other sites
I'm pretty sure the most optimal solution would be to make a lexigraphic tree.

struct WordTree
{
bool IsWord;
WordTree * letters[26]; //assuming all 26 letters
};


If you have a word max length, it makes it even easier. Basically, the word is in the list if you can traverse its letters without hitting a NULL pointer and the (IsWord) var is set.

Oh yeah, the best part is searching AND inserting AND deleting are performed in constant time regardless of the # of words.

Hope that helps.

Share this post


Link to post
Share on other sites
bucket sort is aka radix sort.

Create a function that takes the 5 character string and encodes it to a 30 bit string. Encode the 42 possibilities to a 6 bit string and or-shift the 5 of them together into 1 30 bit number.

Make a single 1GB file and zero it.
Read in the first string, encode it and add 1 to the byte position in the 1GB file.
Do this for all 1billion numbers. If the numbers are evenly distributed this will work without issue.

Sorting is now O(n) and searching is O(1). Take the string you are looking for, encode it, look in that spot in the file and if it's >0 it's there.

Now, your instructor might be hip to this killer optimization, and he might toss in something insane like 500,000,000 of the same value so we need a way for this to scale.

When you read in the byte value and add 1 to it, you need to ensure you're not over-flowing 255. If you read in 255, then you need to create a second 1GB file. At first you might think to just continue counting in the new file starting with 1 and leave the 255 in the first file. Don't do that - instead use the second file as the second byte of the number. 0xFF + 1 = 0x0100. So write a 00 into the first file and a 01 into the second file. Make this scale up to 4 1GB files and you'll handle up to 4billion of the same string repeating itself over & over.
(Just make sure your read routine takes into account the bonus files if they exist so you read back the correct count.)

If you code it efficently you can bust out the sort in about 30 seconds.

Share this post


Link to post
Share on other sites
I would probably use an approach similar to bubble sort, only it's on blocks. Let N be the maximum number of words that you can safely fit in memory at any given moment without hitting the swap.


1. Open file "input"
2. Open file "output"
3. Create two buffers A and B of size N/2 each.
4. Read two buffers from "input".
5. Sort buffer A using any in-place method of your choice.
6. Begin Loop
7. Sort buffer B using any in-place method of your choice.
8. Merge A and B by outputting their first N/2 elements into "output".
9. Merge the rest of A and B, sorted, into A.
10. If (Read N/2 elements from "input" into B) then goto 6
11. Else Write A to "output".


The above performs a bubble sort iteration on the data. Once you do (2*Size/N) iterations, the file will be sorted. For greater speed, you can use your intended approach of writing elements to 52 different files based on their first letter, which will decrease the number of necessary iterations.

Share this post


Link to post
Share on other sites
I strongly suspect that operating system specific APIs such as those required for memory mapping are not to be used. Furthurmore, this may be intended as an external sorting exercize, which would mean that they do not want you to try and sort the whole thing in memory. I would seek clarification on this issue.
I myself have in the past done such an assignment, and it had to be external sorting.

The approach suggested to us was merge to and from files. You simply build runs as long as possible by appending each item in the first file which has a previous item less than the new item. Or if no such file exists AND you already have a certain number of files, you append it to the one with the largest previous item (starting a new run in the same file). About 16 files max should be plenty. Then close those files and open them as input, and repeat. You continue until you complete one iteration without having to write a second file, at which point you can rename the one file to the final filename as it is fully sorted.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!