Jump to content
  • Advertisement
Sign in to follow this  
Vincent_M

Keeping Track of a Dictionary

This topic is 1849 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I've always been a fan of games like Scrabble, Words with Friends, and Bookworm. One thing they all seem to have in common that fascinates are their real-time spelling algorithms. One thing I'd like to know though: where did the dictionary of valid words come from? Was it an extern file that was shipped with the game, hardcoded, a feature built into the OS, etc?

 

Also, how did they manage to make the search algorithm so fast? You'd think having a giant dictionary of tens of thousands (possibly more in Words with Friends), they wouldn't be searching through 1 giant library-like list.

 

I'd assume it's just a large file that's alphabetized, but with smaller reference lists, similar to MySQL queries. There'd probably possibly in 26 separate lists (one for each letter) organizing all of the 'a' characters in one, 'b' in another, and so on... Then, maybe have 26 lists for each of the original 26 lists for the second one in the list to organize each list up alphabetically from the second word.

 

Here's a thread started over at StackOverflow on spell checking:

http://stackoverflow.com/questions/2294915/what-algorithm-gives-suggestions-in-a-spell-checker

Share this post


Link to post
Share on other sites
Advertisement

(1) http://en.wikipedia.org/wiki/Words_%28Unix%29

 

(2) Parse that, and put them into a trie with prefix compression (or PATRICIA, critbit-tree or anything similar). If your trie/tree structure is such that you can directly dump it onto disk, do that (faster startup).

 

This will result in an extremely small file (prefix compression means you only need to store in each node the one character (or bit) that's different after the common prefix) which can be searched more or less independently of the number of words. There are some variations of the different algorithms, but basically, you need not compare N words or log(N) words, or any such thing, but only as many bits (or characters) that differ, starting from the beginning. Plus, with some structures, a final comparison.

Or, to find all candidate words that start with some characters, you only need to compare as many characters as you already have. In other words O(N) in respect of your word's length, which in practice (words usually aren't much longer than 8-10 characters) is pretty much the same as "instantaneous".

Edited by samoth

Share this post


Link to post
Share on other sites

A while ago, somebody posted here about the Moby project, which provides public-domain word lists.  (Thanks, jwezorek.)

 

I used one of their dictionaries in a squiggly-line spell-checker I'd written.  By using a binary search, I was able to store all the words in a single file without compromising performance.

Edited by shuma-gorath

Share this post


Link to post
Share on other sites

If you're hunting for anagrams, there's a simple trick to find them much quicker than the obvious approach of looking up all permutations of the source letters, which gets slow with long words as the number of permutations is N factorial.

 

The trick is to note that if you sort all the letters within a word into alphabetical order, you get a result which is identical regardless of the order of the letters within the original word. For example "melon" and "lemon" both end up as "elmno" when sorted.

 

That means an appropriately populated std::map<string, vector<string>> can find you all valid anagrams of an input word with one O(log(N)) lookup, once you've sorted the letters in the input word into alphabetical order.

 

You could also populate it so that you get all valid anagrams of any subset of the input word, although that increases memory usage because of the duplication, and will take longer to generate (especially if you don't generate the entries for shorter words first to avoid repeating work).

Share this post


Link to post
Share on other sites

To search a database of data, it is quick to do with a binary tree.  Some people actually use a SQL database which I think is overkill.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!