Word Lists

Started by
7 comments, last by bronxbomber92 14 years, 5 months ago
Hi, I can't seem to find a decent word list. I'm looking for one that also accounts for the different tenses and plurals of each word (e.g., punch, punched, puncher, punchers, etc.), the kind of word list that this game might use: http://www.lumosity.com/brain-games/flexibility-games/word-bubbles. Do you guys have anything?
I found these while searching for the canonical "dictionary.txt" file:
http://wordlist.sourceforge.net/
http://www.outpost9.com/files/WordLists.html

Early search engines used lemmatization and stemming algorithms to handle English word variants automatically.
The general problem is known as morphological parsing.
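To make the stemming idea concrete, here is a minimal sketch of a toy suffix stripper. It is not the Porter algorithm or any real stemmer; the suffix list, the `MIN_STEM` cutoff, and the function names are all made up for illustration. It only shows how inflected forms can collapse onto one stem.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
# The suffix list and the minimum stem length are assumptions.
SUFFIXES = ("ers", "ing", "ed", "er", "es", "s")  # "es" must be tried before "s"
MIN_STEM = 3  # hypothetical cutoff so short words like "bus" or "red" are left alone


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem
    return word


def group_by_stem(words):
    """Map each stem to the list of input words that reduce to it."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups
```

With the example from the first post, `group_by_stem(["punch", "punched", "puncher", "punchers"])` puts all four words under the stem "punch". A real stemmer handles far more cases (doubled consonants, "-ies", irregular forms), so treat this as a picture of the idea rather than something to ship.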

There are probably existing implementations available.
A radix tree might be useful for you.
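As a rough sketch of why a tree helps here: the code below is a plain, uncompressed trie, the simpler cousin of a radix tree (a radix tree additionally merges chains of single-child nodes to save memory). The class and method names are invented for this example. It supports exact lookup and "which words start with this prefix" queries, which is what a word game typically needs.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if the path from the root spells a word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._find(prefix) is not None

    def words_with_prefix(self, prefix):
        """Return all stored words starting with prefix, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        results = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)
```

Usage would look like `trie = Trie(["pun", "punch", "punched"])`, then `"punch" in trie` or `trie.words_with_prefix("punch")`. If only exact membership matters, a built-in `set` is simpler and probably faster; the tree earns its keep once prefix queries are needed.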
Thanks guys! I *think* I found a decent word list, and the Porter2 stemming algorithm seems perfect(-enough)!
Now the next task becomes efficiently loading the word list at startup. Do you guys think writing the dictionary to a binary file offline and then loading that file at runtime would be the way to go?
How big is the list? I can't imagine that reading it in the naïve way would be all that slow. Have you tested it? If it takes < 1 second to load, there's probably no point optimizing it...
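A quick way to test the naive approach is to time it. This is a minimal sketch that assumes a one-word-per-line UTF-8 text file; `load_words` and `timed_load` are names made up for the example.

```python
import time


def load_words(path):
    """Read a one-word-per-line text file into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def timed_load(path):
    """Return the loaded words and the wall-clock seconds the load took."""
    start = time.perf_counter()
    words = load_words(path)
    return words, time.perf_counter() - start
```

Calling `words, seconds = timed_load("words.txt")` (the file name is a placeholder) and printing `seconds` gives a baseline to compare any binary format or multi-file scheme against.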
The txt file is 2.2 MB, and I've been able to transform it into a 1.5 MB binary file, which still freezes my computer for a while when parsing it. This word list is huge.

What I'm thinking about is separating the word list into one file per letter, then loading the files concurrently and merging the results back into a single list.
Quote: Original post by bronxbomber92
The txt file is 2.2 MB, and I've been able to transform it into a 1.5 MB binary file, which still freezes my computer for a while when parsing it. This word list is huge.

What I'm thinking about is separating the word list into one file per letter, then loading the files concurrently and merging the results back into a single list.


What exactly is the bottleneck? Disk IO? Tree construction? Memory allocations?
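One way to answer that question is to time each phase separately instead of the load as a whole. The sketch below assumes a whitespace-separated UTF-8 word file and uses a `set` as a stand-in for whatever structure is actually being built; `profile_load` and the phase names are invented for the example.

```python
import time


def profile_load(path):
    """Load a word file and report seconds spent in each phase."""
    timings = {}

    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()                    # phase 1: raw disk read
    t1 = time.perf_counter()
    timings["disk_read"] = t1 - t0

    words = raw.decode("utf-8").split()   # phase 2: decode and split into words
    t2 = time.perf_counter()
    timings["parse"] = t2 - t1

    lookup = set(words)                   # phase 3: build the lookup structure
    t3 = time.perf_counter()
    timings["build"] = t3 - t2

    return lookup, timings
```

If `disk_read` dominates, splitting the list into several files is unlikely to help much; if `build` dominates, the data structure (or its allocation pattern) is the place to look. Keep in mind that a second run may read the file from the operating system's cache, so the first run is usually the more honest measure of disk cost.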
Never mind, I got the disk loading figured out!

Thanks all, I'll come back if I have problems with the morpheme extraction.

This topic is closed to new replies.
