
Fast way to load in a dictionary


12 replies to this topic

#1 AshehRS   Members   -  Reputation: 101


Posted 12 September 2011 - 10:44 AM

I'm having trouble with memory on older-generation iPhones (iPod touch 1st gen, 2nd gen, etc.). This is due to the amount of memory allocated when I load and store a 170k-word dictionary.

This is the code (very simple):

string[] words = dictionaryRef.text.Split("\n"[0]);
_words = new List<string>(words);
This allocates around 12MB at startup; the iPhone has around 43MB available, I think. So with that plus textures, sounds, and the OS, it tends to break.

Speed-wise, accessing it with a binary search is fine. The problem is storing it in memory more efficiently (and loading it more efficiently).

The text.Split appears to take up a lot of heap memory.

Any advice?




-> Side note: the actual amount of memory isn't too bad once it's loaded, but loading it is quite memory intensive. I have tried readline() and reading character by character, but those methods are super slow.




Any advice appreciated.




#2 Antheus   Members   -  Reputation: 2385


Posted 12 September 2011 - 10:55 AM

Remember that a binary tree can be represented using an array.

Given an element at index n, its left child is at index 2n and its right child at index 2n+1.

A C representation of a string is zero-terminated. So to store the strings, one creates the words separated with \0, for example "Hello\0World\0Everyone\0". An empty string would be just a \0.

All that's left is to record where the words begin: [0, 6, 12] in the above example.

So now we have two plain arrays: one is a char[], the other an int[].

char * text = new char[12 * 1024 * 1024]; // ~12MB of packed, sorted words
int * offsets = new int[170000];          // one start index per word

One can now use fread to load each entire array in one go.

To search this tree, start at offsets[1]. The left child is offsets[2], the right child offsets[3], and so on. To simplify indexing, the first element is at index 1, not 0.

Adjust for the language/classes you're using.
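To make that concrete in the OP's language, here is a minimal C# sketch of the two-array layout. Antheus describes an implicit tree (children at 2n and 2n+1); since the word list is sorted, the sketch below uses the equivalent plain binary search over the offsets array instead. All names are illustrative, not from the thread.

// Sketch: 'text' holds the words packed as "...\0...\0" in sorted order;
// 'offsets' holds the start index of each word within 'text'.
static int CompareAt(char[] text, int start, string word)
{
    int i = 0;
    while (true)
    {
        char c = text[start + i];
        if (c == '\0') return i == word.Length ? 0 : -1; // stored word ended first
        if (i == word.Length) return 1;                  // search word ended first
        if (c != word[i]) return c < word[i] ? -1 : 1;
        i++;
    }
}

static bool Contains(char[] text, int[] offsets, string word)
{
    int lo = 0, hi = offsets.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = CompareAt(text, offsets[mid], word);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;  // stored word < search word
        else hi = mid - 1;          // stored word > search word
    }
    return false;
}

Both arrays can be filled with a single bulk read each, and no per-word string objects are ever created.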

#3 AshehRS   Members   -  Reputation: 101


Posted 12 September 2011 - 12:17 PM

This still allocates 12MB of data in one go, which is what I'm trying to avoid. It's the string.Split that is extremely inefficient, but every other option seems too slow.

#4 edd   Members   -  Reputation: 2097


Posted 12 September 2011 - 12:31 PM

Does the iPhone not already have a built-in dictionary? A minute of googling produced UITextChecker. Can this be used?

Failing that, you'll either need to load on demand (perhaps with an MRU list), or come up with a compression scheme.

What are you using the dictionary for? How heavily are you using it? etc

#5 nobodynews   Crossbones+   -  Reputation: 1572


Posted 12 September 2011 - 01:33 PM

How large is the raw text file? 170k words at 10 chars a word is only 1.7MB, and I wouldn't be surprised if the average word is shorter than 10 characters. You also mention that memory usage goes down after everything is loaded, in which case I think Antheus has the best idea; he just over-estimated the memory usage. He really should have said:

char * text = new char[# of chars in text file];
int * offsets = new int[# of words in text file];
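(Worked through: 170,000 words at ~10 chars plus a \0 terminator each is about 1.9MB of char data, and 170,000 four-byte offsets add about 0.7MB, so the whole structure comes to roughly 2.5MB.)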



#6 Álvaro   Crossbones+   -  Reputation: 10598


Posted 12 September 2011 - 01:49 PM

What Antheus/nobodynews proposed is very reasonable, and it would take something like 2MB of memory. If that's still too much, you can exploit the fact that large blocks of words have common prefixes. For instance, you could use a trie for the first few characters and place Antheus-style sub-dictionaries at the leaves. But I doubt the memory savings are worth the extra complexity of this approach.
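As a minimal illustration of a one-level version of that idea (names are mine, and it assumes lowercase ASCII words in the sorted packed arrays from Antheus's post): index the offsets array by first letter, so each lookup only binary searches words sharing that letter. The actual memory saving Álvaro describes would come from additionally stripping the shared prefix characters out of the stored words; this sketch shows only the bucketing structure.

// start[c]..start[c+1]-1 are the positions in 'offsets' of the words
// beginning with the letter 'a' + c.
static int[] BuildFirstLetterIndex(char[] text, int[] offsets)
{
    int[] start = new int[27];
    for (int c = 0, w = 0; c <= 26; c++)
    {
        while (w < offsets.Length && text[offsets[w]] - 'a' < c) w++;
        start[c] = w;
    }
    return start;
}

A lookup for word w then binary searches only offsets[start[w[0] - 'a']] through offsets[start[w[0] - 'a' + 1] - 1]; a two-letter table (26 × 26 entries) follows the same pattern.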

#7 Trienco   Crossbones+   -  Reputation: 1944


Posted 12 September 2011 - 10:34 PM

Quote:
What Antheus/nobodynews proposed is very reasonable, and it would take something like 2MB of memory. If that's still too much, you can exploit the fact that large blocks of words have common prefixes. For instance, you could use a trie for the first few characters and place Antheus-style sub-dictionaries at the leaves. But I doubt the memory savings are worth the extra complexity of this approach.


Well, it sounds like just the situation that tries are made for. The downside is the overhead at each node, which might cancel out most of the benefit unless done in some clever way.

#8 Vectorian   Members   -  Reputation: 109


Posted 13 September 2011 - 01:46 AM

Quote:
This still allocates 12MB of data in one go, which is what I'm trying to avoid. It's the string.Split that is extremely inefficient, but every other option seems too slow.


How large is the text file? You shouldn't need more memory than that to store the entire dictionary in memory.

Code a simple program that reads a line, does ftell(), and writes the index in binary form to a new file. Call this file dictionary.index or something, and then read the entire dictionary like this:
FILE *dictionaryFile = fopen("dictionary", "rb");
FILE *indexFile = fopen("dictionary.index", "rb");

fseek(dictionaryFile, 0, SEEK_END);
fseek(indexFile, 0, SEEK_END);
int dictionarySize = ftell(dictionaryFile);
int indexSize = ftell(indexFile) / sizeof(int);

fseek(dictionaryFile, 0, SEEK_SET);
fseek(indexFile, 0, SEEK_SET);

char* dictionaryWords = new char[dictionarySize];
int* wordIndexes = new int[indexSize];
fread(dictionaryWords, 1, dictionarySize, dictionaryFile); /* fread takes (buffer, size, count, stream) */
fread(wordIndexes, sizeof(int), indexSize, indexFile);
No parsing required; get a word via &dictionaryWords[wordIndexes[wordNumber]]. The length of the word is wordIndexes[wordNumber + 1] - wordIndexes[wordNumber]. You must add 1 to the size of wordIndexes and set the last entry to the size of the dictionaryWords file when creating it (or simply append the dictionary file's size as the final entry of the .index file).

If you're really tight for memory, you can mmap the actual dictionary instead.
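For the index-building step Vectorian describes, here is a build-time sketch in C# rather than C (the file paths, the trailing-newline assumption, and the name WriteIndex are mine). Note that BinaryWriter emits little-endian ints, which must match how the reading side interprets the file.

using System.IO;

// Scan the dictionary once, writing the byte offset of each word's first
// character as a 4-byte int, plus the final sentinel entry described above.
static void WriteIndex(string dictPath, string indexPath)
{
    byte[] data = File.ReadAllBytes(dictPath); // assumes a trailing '\n'
    using (var w = new BinaryWriter(File.Create(indexPath)))
    {
        int wordStart = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte)'\n')
            {
                w.Write(wordStart); // start of this word
                wordStart = i + 1;  // next word begins after the '\n'
            }
        }
        w.Write(data.Length);       // sentinel, so word lengths can be computed
    }
}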

#9 AshehRS   Members   -  Reputation: 101


Posted 13 September 2011 - 02:43 AM

Nobody appears to have listened :D

- It is not the memory once the file is loaded, but rather the memory WHILE it is being loaded and String.Split() into the words (it costs 12MB for some reason).

I was looking for a FAST and LOW MEMORY way of loading each word.

while (word = readline())  // too slow

array = String.Split()     // too much memory


Please also note: the file is already loaded. It's a Unity TextAsset, so I access the file via its string.

So I have one very large string with each word on a new line.

I need a memory-efficient, fast way of splitting this into 170k words.


Many thanks for your inputs so far :D

#10 Christuff   Members   -  Reputation: 120


Posted 13 September 2011 - 03:45 AM

Why do you need it fast? For immediate use, or for speeding up load time?
If the latter, use a coroutine.

Moreover, I doubt that

// C# strings aren't zero-terminated, so iterate over Length instead
for (int count = 0; count < myString.Length; count++)
{
	if (myString[count] == '\n')
		myArrayListOfNewLineIndex.Add(count);
}

is way slower than Split().
And you save memory ;)

#11 AshehRS   Members   -  Reputation: 101


Posted 13 September 2011 - 04:01 AM

How will a co-routine help?

myString[count], how do I fill this array originally without using Split()?

#12 Christuff   Members   -  Reputation: 120


Posted 13 September 2011 - 04:12 AM

MyString[count] will be YourTextAssets.text[count] in your case.

The coroutine will spread the work across many frames, so if you don't need your text immediately it will take [long] time, but it won't block your loading or your game (so users won't notice it).
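A minimal Unity sketch of that (the class name, batch size, and newlineIndexes list are illustrative, not from the thread): scan the TextAsset's string inside a coroutine and yield every so many characters, so the indexing work is spread across frames.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DictionaryLoader : MonoBehaviour
{
    public TextAsset dictionaryRef;
    private List<int> newlineIndexes = new List<int>();

    IEnumerator BuildIndex()
    {
        string text = dictionaryRef.text;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\n')
                newlineIndexes.Add(i);
            if (i % 50000 == 0)    // tune to your per-frame budget
                yield return null; // resume on the next frame
        }
    }

    void Start()
    {
        StartCoroutine(BuildIndex());
    }
}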

#13 Adam_42   Crossbones+   -  Reputation: 2179


Posted 13 September 2011 - 05:04 AM

The most memory-efficient way to search for words in that dictionary string is to not split it into a list of words at all. Assuming you just need to look up words in the dictionary and the word list is sorted, you can do it directly on the source data with a tweaked binary search. Pseudocode:

1. Set the upper and lower bounds to string.length - 1 and zero.
2. Set the current offset to halfway between the upper and lower bounds.
3. Move the current offset backwards one character at a time to find the start of the word, i.e. until you either hit offset 0 or find the '\n' at the end of the previous word.
4. Compare the word you just found against the search word. If it doesn't match, set the upper or lower bound to the current offset as you would for a standard binary search.
5. Repeat from step 2 until the binary search is complete (i.e. you've found the word, or the upper and lower bounds are equal), as in the sketch below.
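A C# sketch of that pseudocode against the OP's single newline-separated string (identifier names are mine; it assumes the words are sorted and separated by '\n' alone, with no '\r'):

// Binary search directly on the raw dictionary string: no Split(),
// no per-word allocations.
static bool ContainsWord(string text, string word)
{
    int lo = 0, hi = text.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        // Walk back to the start of the word containing 'mid'.
        int start = mid;
        while (start > 0 && text[start - 1] != '\n') start--;

        // Find the end of that word.
        int end = text.IndexOf('\n', start);
        if (end < 0) end = text.Length;

        // Compare the search word against text[start..end) char by char.
        int len = end - start;
        int n = len < word.Length ? len : word.Length;
        int cmp = 0;
        for (int i = 0; i < n && cmp == 0; i++)
            cmp = text[start + i].CompareTo(word[i]);
        if (cmp == 0) cmp = len.CompareTo(word.Length);

        if (cmp == 0) return true;
        if (cmp < 0) lo = end + 1;  // dictionary word < search word
        else hi = start - 1;        // dictionary word > search word
    }
    return false;
}

Each probe touches a single word and allocates nothing, so it can run straight off the TextAsset's string with no load-time cost at all.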



