Jump to content
  • Advertisement
Sign in to follow this  
marcus12

Word database

This topic is 2515 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi,
I am trying implement a program where it creates a huge databse of words.

One of the features is to add words from available list of words eg: If FILE 1 has a word list of having hundreds of word
and there is another FILE 2 which also has hundreds of word. Then my program will take these files and implement

1) check for duplicate words and truncate them
2) add words into a new file which will contain all words from FILE 1 and FILE 2


I am confused about how to implement it..like

---> Should I use a standard C/C++ library if yes which will be better ??

----> OR whether to get one word from first file and then compare it with the second file...like a linear search??

----> OR whether to go with size of the word and then sort it out???

-----> OR any other logic???

My main objective is to have a good run time even if the files becomes HUGE in size with millions of words.


All suggestions are welcome....

Share this post


Link to post
Share on other sites
Advertisement
It's fairly simple to do efficiently. Just insert each word into an associative container like a std::set - that should get you O(N log N) performance.

Share this post


Link to post
Share on other sites

I am trying implement a program where it creates a huge databse of words.

One of the features is to add words from available list of words eg: If FILE 1 has a word list of having hundreds of word
and there is another FILE 2 which also has hundreds of word. Then my program will take these files and implement

1) check for duplicate words and truncate them
2) add words into a new file which will contain all words from FILE 1 and FILE 2


I am confused about how to implement it..like

My main objective is to have a good run time even if the files becomes HUGE in size with millions of words.


You are unlikely to have millions (plural) of words, even if you take all conjugations of verbs in lanugages that modify the word.

Consider that many word lists are under 1MB of total length. Even the MOBY word list (with all varitions in spelling of five languages including German) is just barely over 5 million words. Some of the largest English common-word lists, such as the official Scrabble word list, are under one megabyte.

If you are looking for just a simple language dictionary, the data structure you are looking for is called a trie. You can iterate over it in alphabetical order. Searches to check for the existance of words is linear time. The penalty is in runtime memory, but it is much smaller penalty than storing unique copies of every full word.

An alternative yet similar structure is the ternary search tree, which is a trie of order 3.


Google can find you many existing implementations if both you don't care to roll your own. IIRC there is a TRT implementation in Boost.

Share this post


Link to post
Share on other sites

Well, if you're on unix or linux of sorts...

"cat FILE1 FILE2 | sort | uniq"

Well i think Bash (or other *sh stuff) isn't fast enough for this purpose, however nowdays on a good PC it in fact might not be a problem.

I would recomend Python, as it is known from its advanced text operation features. Python itself is implemented in C so using algorithms from its standard libs can be quite efficient. My friend has done something silmilar using Python, but the word lists were about hundreds of thousands, not milions. But it worked well.

Share this post


Link to post
Share on other sites

Well i think Bash (or other *sh stuff) isn't fast enough for this purpose, however nowdays on a good PC it in fact might not be a problem.
[/quote]
Bash is merely creating pipes and invoking the utility programs. Typically, GNU utilities would be written in C. On my machine:


user@host:~$ file `which sort uniq cat`
/usr/bin/sort: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
/usr/bin/uniq: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
/bin/cat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped

Share this post


Link to post
Share on other sites

[quote name='Antheus' timestamp='1327016583' post='4904402']
Well, if you're on unix or linux of sorts...

"cat FILE1 FILE2 | sort | uniq"

Well i think Bash (or other *sh stuff) isn't fast enough for this purpose, however nowdays on a good PC it in fact might not be a problem.

I would recomend Python, as it is known from its advanced text operation features. Python itself is implemented in C so using algorithms from its standard libs can be quite efficient. My friend has done something silmilar using Python, but the word lists were about hundreds of thousands, not milions. But it worked well.
[/quote]

You might be surprised.

Share this post


Link to post
Share on other sites
What I don't understand is why you are not using an actual database? All of what you are trying to do is what DBs do best.

j.

Share this post


Link to post
Share on other sites

What I don't understand is why you are not using an actual database? All of what you are trying to do is what DBs do best.


I need to move couch from one side of the room to another, so I'll just slide it over.

Why am I not using Fedex and their container ship logistics network? Kinda overkill.

Share this post


Link to post
Share on other sites

I need to move couch from one side of the room to another, so I'll just slide it over.

Why am I not using Fedex and their container ship logistics network? Kinda overkill.


Not overkill at all. He wants to have possibly a records or so and quickly search the records. That is what DBs are for. There is no reason to role your own solution, when you can just install SQL Express or mySQL and go to town. Instead of writing all of that code to do it, you make a few simple SQL statements to do all the work.

j.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!