Fast way to load in a dictionary
#1 Members - Reputation: 101
Posted 12 September 2011 - 10:44 AM
This is the code (very simple):
string[] words = dictionaryRef.text.Split("\n"[0]);
_words =newList<string>(words);
This allocates on start around 12mb of storage, iphone has around 43mb I think. So that + textures + sounds + the OS it tends to break.
Speed wise, accessing using a binary search is fine. But its storing it in memory more efficiently (and loading it more efficiently).
The text.Split appears to take up alot of heap memory.
Any advice?
-> side note, the actual amount of memory isnt too bad once its loaded, but loading it is quite memory intensive. I have tried readline() and by character but those methods are super super slow.
Any advice appreciated.
#2 Members - Reputation: 2369
Posted 12 September 2011 - 10:55 AM
Given element at index n, left child is at index 2n and right child at index 2n+1.
A C representation of string is zero terminated. So to store strings, one creates words separated with \0. For example, "Hello\0World\0Everyone\0" If a string is empty, it would result in just \0.
All that's left is to tell where the words begin: [0, 6, 12] in above example.
So now we have two plain arrays. One is char[] the other is int[].
char * text = new char[12MB]; int * offsets = new int[170k];One can now use fread to load entire array in one go.
To search this tree, start at offsets[1]. Left child is offset[2], right child is offset[3]. And so on. To simplify indexing, first element is at offset 1, not 0.
Adjust for the language/classes you're using.
#4 Members - Reputation: 2042
Posted 12 September 2011 - 12:31 PM
Failing that, you'll either need to load on demand (perhaps with an MRU list), or come up with a compression scheme.
What are you using the dictionary for? How heavily are you using it? etc
#5 Members - Reputation: 1349
Posted 12 September 2011 - 01:33 PM
char * text = new char[# of chars in text file];
int * offsets = new int[# of words in text file];
C++: A Dialog | C++0x Features: Part1 (lambdas, auto, static_assert) , Part 2 (rvalue references) , Part 3 (decltype) | Write Games | Fix Your Timestep!
#6 Members - Reputation: 5802
Posted 12 September 2011 - 01:49 PM
#7 Members - Reputation: 1300
Posted 12 September 2011 - 10:34 PM
What Antheus/nobodynews proposed is very reasonable, and it would take something like 2MB of memory. If that's still too much, you can exploit the fact that large blocks of words will have common prefixes. For instance, you could use a trie for the first few characters, and place Antheus-style sub-dictionaries at the leaves.But I doubt the memory savings are worth the extra complexity of this approach.
Well, it sounds like just the situation that tries are made for. Downside is the overhead at each node which might cancel out most of the benefit unless done in some clever way.
#8 Members - Reputation: 109
Posted 13 September 2011 - 01:46 AM
This still allocates 12mb of data in one go. Which is what im trying to avoid. Its the string.split that is extremely inefficient but every other option seems too slow too.
How large is the text file? You shouldn't need more memory than that to store the entire dictionary in memory.
Code a simple program that reads a line, does ftell() and writes the index in binary form to a new file. Call this file dictionary.index or something and then read the entire dictionary like this:
FILE *dictionaryFile = fopen("dictionary", "rb");
FILE *indexFile = fopen("dictionary.index", "rb");
fseek(dictionaryFile, 0, SEEK_END);
fseek(indexFile, 0, SEEK_END);
int dictionarySize = ftell(dictionaryFile);
int indexSize = ftell(indexFile) / sizeof int;
fseek(dictionaryFile, 0, SEEK_SET);
fseek(indexFile, 0, SEEK_SET);
char* dictionaryWords = new char[dictionarySize];
int* wordIndexes = new int[indexSize];
fread(dictionaryFile, 1, dictionarySize, dictionaryWords);
fread(indexFile, 4, indexSize, wordIndexes);No parsing required, get a word by using dictionaryWords[wordIndexes[wordNumber]]. The length of the word is wordIndexes[wordNumber + 1] - wordIndexes[wordNumber]. You must add 1 to the size of wordIndexes and set the last entry to the size of the dictionaryWords file when creating it (or simply write 0x00000000 to the end of the .index file). If you're really tight for memory, you can mmap the actual dictionary instead.
#9 Members - Reputation: 101
Posted 13 September 2011 - 02:43 AM
- It is not the memory once the file is loaded but rather the memory WHILE it is being loaded and STRING.SPLIT() into the words. (12mb it costs for some reason).
I was looking for a FAST and LOW MEMORY way of loading each word.
while(word = readline()) // too slow
array = String.Split() //too much memory
Please also note. The file is already loaded. Its a Unity Text Asset. So I access the file via its string.
So I have 1 very large string with each new word on a new line.
I need a memory efficient fast way of splitting this into 170k words.
Many thanks for your inputs so far
#10 Members - Reputation: 120
Posted 13 September 2011 - 03:45 AM
If the second, use a co-routine.
Moreover, I doubt that
while(myString[count++] != '\0')
{
if (myString[count] == '\n') {
myArrayListOfNewLineIndex.add(count);
}
}
is way slower than split.
And you gain memory ;)
#12 Members - Reputation: 120
Posted 13 September 2011 - 04:12 AM
The Co-routine will spread the work across many frames, si if you don't need your text immediately, it will take [long] time, but it won't block your loading or your game ( so users won't notice it ).
#13 Members - Reputation: 1410
Posted 13 September 2011 - 05:04 AM
1. Set the upper and lower bound to the string.length - 1 and zero.
2. Set the current offset to half way between the upper and lower bound.
3. Move the current offset backwards one character at a time to find the start of the word. That is you either hit offset 0 or find the '\n' at the end of the previous word.
4. Compare the word you just found against the search word. If it doesn't match set the upper or lower bound to the current offset as you would for a standard binary search.
5. Repeat from step 2 until the binary search is complete (i.e. you've found the word, or the upper and lower bound are equal).






