Word Lists
Hi,
I can't seem to find a decent word list. I'm looking for a word list that also accounts for the different tenses and plurals of words (e.g. punch, punched, puncher, punchers, etc.), like the one this game might use: http://www.lumosity.com/brain-games/flexibility-games/word-bubbles. Do you guys have anything?
I found these while searching for the canonical "dictionary.txt" file:
http://wordlist.sourceforge.net/
http://www.outpost9.com/files/WordLists.html
Early search engines used lemmatization and stemming algorithms to generate English word variants automatically.
The problem is known as morphological parsing.
There are probably existing implementations available.
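As an illustration of the idea (this is a toy sketch, not the Porter algorithm or any real morphological parser), a crude suffix-based variant generator might look like:

```python
def variants(stem):
    """Return a small set of plausible English inflections of a stem.

    Toy rules only: real stemmers/lemmatizers handle far more cases
    (consonant doubling, irregular verbs, '-y' -> '-ies', etc.).
    """
    forms = {stem}
    # Words ending in a sibilant take "-es" for the plural/3rd person.
    if stem.endswith(("ch", "sh", "s", "x", "z")):
        forms.add(stem + "es")
    else:
        forms.add(stem + "s")
    forms.update({stem + "ed", stem + "ing", stem + "er", stem + "ers"})
    return forms

print(variants("punch"))
```

For the OP's example, `variants("punch")` yields punch, punches, punched, punching, puncher, and punchers; going the other direction (variant back to stem) is exactly what a stemmer does.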
Thanks guys! I *think* I found a decent word list, and the Porter 2 stemming algorithm seems perfect(-enough)!
Now the next task becomes efficiently loading the word list at startup. Do you guys think writing the dictionary to a binary file offline, then loading that file at runtime, would be the way to go?
How big is the list? I can't imagine that reading it in the naïve way would be all that slow. Have you tested it? If it takes < 1 second to load, there's probably no point optimizing it...
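A quick way to test that claim is to time the naive approach directly. A minimal sketch (the file name and word count are made up; here a small list is synthesized so the snippet runs as-is):

```python
import os
import tempfile
import time

# Synthesize a placeholder word list so the snippet is self-contained;
# in practice you would point `path` at your real dictionary file.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "w") as f:
    f.write("\n".join(f"word{i}" for i in range(100000)))

# Naive load: read the whole file and split into a set.
start = time.perf_counter()
with open(path) as f:
    words = set(f.read().split())
elapsed = time.perf_counter() - start

print(f"loaded {len(words)} words in {elapsed:.3f}s")
```

If that number is already well under a second for the real list, there is nothing to optimize.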
The txt file is 2.2 MB, and I've been able to transform it into a 1.5 MB binary file, which still freezes my computer for a while parsing it. This word list is huge.
What I'm thinking about is separating the word list into a file for each letter, then concurrently loading the files and merging the results back into a list.
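That split-and-merge idea could be sketched roughly like this (the per-letter file names are assumptions, and a few tiny files are synthesized so the snippet runs as-is; note that in CPython, threads mostly overlap disk IO, not CPU-bound parsing, because of the GIL):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Synthesize per-letter files like "a.txt", "b.txt", ... as placeholders.
directory = tempfile.mkdtemp()
for letter in "abc":
    with open(os.path.join(directory, f"{letter}.txt"), "w") as f:
        f.write("\n".join(f"{letter}word{i}" for i in range(1000)))

def load(letter):
    """Load one per-letter word file into a list."""
    with open(os.path.join(directory, f"{letter}.txt")) as f:
        return f.read().split()

# Load the files concurrently, then merge the chunks back into one list.
with ThreadPoolExecutor() as pool:
    chunks = pool.map(load, "abc")

words = [w for chunk in chunks for w in chunk]
print(len(words))
```

Whether this actually helps depends on where the time goes, which is the question the next reply raises.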
Quote:Original post by bronxbomber92
The txt file is 2.2 MB, and I've been able to transform it into a 1.5 MB binary file, which still freezes my computer for a while parsing it. This word list is huge.
What I'm thinking about is separating the word list into a file for each letter, then concurrently loading the files and merging the results back into a list.
What exactly is the bottleneck? Disk IO? Tree construction? Memory allocations?
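One way to answer that is to time each stage separately: raw disk read, tokenizing, and container construction. A rough sketch (again synthesizing a placeholder file so it runs as-is):

```python
import os
import tempfile
import time

# Placeholder word list so the snippet is self-contained.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "w") as f:
    f.write("\n".join(f"word{i}" for i in range(100000)))

t0 = time.perf_counter()
with open(path, "rb") as f:
    raw = f.read()                 # stage 1: disk IO
t1 = time.perf_counter()
tokens = raw.decode().split()      # stage 2: parsing/tokenizing
t2 = time.perf_counter()
words = set(tokens)                # stage 3: container construction
t3 = time.perf_counter()

print(f"read {t1 - t0:.3f}s  parse {t2 - t1:.3f}s  build {t3 - t2:.3f}s")
```

Whichever stage dominates tells you what to optimize; splitting files only helps if the bottleneck is something concurrency can overlap.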