Word database

Hi,
I am trying to implement a program that creates a huge database of words.

One of the features is to merge words from existing word lists, e.g. if FILE 1 has a list of hundreds of words
and there is another FILE 2 which also has hundreds of words, then my program will take these files and:

1) check for duplicate words and remove them
2) add the words into a new file which will contain all the words from FILE 1 and FILE 2


I am confused about how to implement it, like:

---> Should I use a standard C/C++ library? If yes, which one would be better?

---> OR should I take one word from the first file and compare it against the second file, like a linear search?

---> OR should I sort by the size of the word first and work from there?

---> OR any other logic?

My main objective is to have a good run time even if the files become HUGE in size, with millions of words.


All suggestions are welcome....

Indie Game Developer

Game Studio: Freakout Games

It's fairly simple to do efficiently. Just insert each word into an associative container like a std::set - that should get you O(N log N) performance.
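For example, a minimal sketch of that approach (the file names and the assumption that words are whitespace-separated are mine, not the OP's):

#include <fstream>
#include <set>
#include <string>

int main()
{
    // Read whitespace-separated words from both files into a std::set,
    // which keeps them sorted and silently drops duplicates.
    std::set<std::string> words;
    for (const char* path : {"FILE1.txt", "FILE2.txt"})
    {
        std::ifstream in(path);
        std::string word;
        while (in >> word)
            words.insert(word);
    }

    // Write the merged, de-duplicated list to a new file, one word per line.
    std::ofstream out("merged.txt");
    for (const std::string& word : words)
        out << word << '\n';
}

Reading into the set handles both de-duplication and ordering in one pass, so writing the merged file is just iterating over it.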
Well, if you're on unix or linux of sorts...

"cat FILE1 FILE2 | sort | uniq"

[quote]
I am trying to implement a program that creates a huge database of words.
...
My main objective is to have a good run time even if the files become HUGE in size, with millions of words.
[/quote]


You are unlikely to have millions (plural) of words, even if you include every conjugation of verbs in languages that modify the word.

Consider that many word lists are under 1MB of total length. Even the MOBY word list (with all variations in spelling of five languages including German) is just barely over 5 million words. Some of the largest English common-word lists, such as the official Scrabble word list, are under one megabyte.

If you are looking for just a simple language dictionary, the data structure you are looking for is called a trie. You can iterate over it in alphabetical order, and checking whether a word exists takes time linear in the length of the word. The penalty is in runtime memory, but it is a much smaller penalty than storing unique copies of every full word.

An alternative yet similar structure is the ternary search tree, which is a trie of order 3.


Google can find you many existing implementations if you don't care to roll your own. IIRC there is a ternary search tree implementation in Boost.
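For illustration, a bare-bones trie sketch along those lines; the node layout, the lowercase a-z alphabet, and the insert/contains names are assumptions for the example rather than any particular library's API:

#include <array>
#include <memory>
#include <string>

// Minimal trie over lowercase ASCII words.
struct TrieNode
{
    std::array<std::unique_ptr<TrieNode>, 26> children{};
    bool isWord = false;
};

class Trie
{
public:
    void insert(const std::string& word)
    {
        TrieNode* node = &root;
        for (char c : word)
        {
            auto& child = node->children[c - 'a'];
            if (!child)
                child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isWord = true;   // mark the end of a complete word
    }

    // Runs in time proportional to the length of the queried word.
    bool contains(const std::string& word) const
    {
        const TrieNode* node = &root;
        for (char c : word)
        {
            const auto& child = node->children[c - 'a'];
            if (!child)
                return false;
            node = child.get();
        }
        return node->isWord;
    }

private:
    TrieNode root;
};

Walking the children arrays in index order gives the alphabetical traversal mentioned above, and a lookup touches one node per character of the word.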

[quote]
Well, if you're on unix or linux of sorts...

"cat FILE1 FILE2 | sort | uniq"
[/quote]

Well, I think Bash (or other *sh stuff) isn't fast enough for this purpose; however, nowadays on a good PC it might in fact not be a problem.

I would recommend Python, as it is known for its advanced text-processing features. Python itself is implemented in C, so using algorithms from its standard libs can be quite efficient. My friend has done something similar using Python, though the word lists were in the hundreds of thousands, not millions. It worked well.

[quote]
Well, I think Bash (or other *sh stuff) isn't fast enough for this purpose; however, nowadays on a good PC it might in fact not be a problem.
[/quote]
Bash is merely creating pipes and invoking the utility programs. Typically, GNU utilities would be written in C. On my machine:


user@host:~$ file `which sort uniq cat`
/usr/bin/sort: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
/usr/bin/uniq: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped

[quote]
[quote name='Antheus' timestamp='1327016583' post='4904402']
Well, if you're on unix or linux of sorts...

"cat FILE1 FILE2 | sort | uniq"
[/quote]
Well, I think Bash (or other *sh stuff) isn't fast enough for this purpose; however, nowadays on a good PC it might in fact not be a problem.

I would recommend Python, as it is known for its advanced text-processing features. Python itself is implemented in C, so using algorithms from its standard libs can be quite efficient. My friend has done something similar using Python, though the word lists were in the hundreds of thousands, not millions. It worked well.
[/quote]

You might be surprised.
What I don't understand is why you are not using an actual database. All of what you are trying to do is what DBs do best.

j.

[quote]
What I don't understand is why you are not using an actual database. All of what you are trying to do is what DBs do best.
[/quote]


I need to move a couch from one side of the room to the other, so I'll just slide it over.

Why am I not using FedEx and their container ship logistics network? Kinda overkill.

[quote]
I need to move a couch from one side of the room to the other, so I'll just slide it over.

Why am I not using FedEx and their container ship logistics network? Kinda overkill.
[/quote]


Not overkill at all. He wants to store possibly millions of records and quickly search them. That is what DBs are for. There is no reason to roll your own solution when you can just install SQL Express or MySQL and go to town. Instead of writing all of that code, you write a few simple SQL statements to do all the work.

j.
