Finding a string in a file given x tags


Ok, the basic programming problem is this. I have a file filled with words. Each word has between 1 and a theoretically infinite number of tags that describe it.

Something like:

Hello : Tags: Greeting, Polite

Now I have a class, Foo, that wants a word given that it knows the tags that the word is associated with. The function GetWord() takes the tags as an argument and returns the word.

So, the function opens the words file and must find a word that matches every tag it's given, including words that have extra tags associated with them.

What would be the best way to implement this function? I was thinking of something along the lines of an unordered_map that takes a variable number of keys and returns everything that has those keys, including entries that have those keys plus some extra ones, as long as every given key matches.
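Roughly, the interface I have in mind looks something like this (the class and parameter names are just placeholders, nothing is settled yet):

#include <string>
#include <vector>

class Foo
{
public:
    // Return one word whose tag set contains every tag in 'tags';
    // the word may carry additional tags beyond the ones requested.
    std::string GetWord(const std::vector<std::string>& tags);
};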

If this makes any sense at all I could use some help.


How big is the data file? Are we talking about 10 datasets or about a million?

If the amount of data sets is not too big, I personally would use some kind of list containing pairs of the words and their tags.

Get the first tag, iterate over the list, and remove all datasets which do not contain the tag.

Repeat this on the reduced list, and use the result as the input for the next tag.

After considering each tag, the remaining list is your search result.
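Very roughly, and assuming the file has already been read into a list of (word, tags) pairs, that filtering could look like this (the type and function names are just made up for illustration):

#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct Entry
{
    std::string word;
    std::set<std::string> tags;
};

// Keep only the entries that carry every requested tag.
std::vector<Entry> Filter(std::vector<Entry> entries, const std::vector<std::string>& wanted)
{
    for (const std::string& tag : wanted)
    {
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                          [&](const Entry& e) { return e.tags.count(tag) == 0; }),
                      entries.end());
    }
    return entries;
}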

If your amount of data is really big, I would consider setting up a database with something like SQLite, for example, and getting the results using SQL commands.
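For the SQLite route, here is a rough sketch of what the lookup could look like through the sqlite3 C API; the schema, table names, and function names are only assumptions for illustration:

#include <string>
#include <vector>
#include <sqlite3.h>

// Assumed schema:
//   CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL);
//   CREATE TABLE word_tags (word_id INTEGER REFERENCES words(id), tag TEXT NOT NULL);

// Return every word that carries ALL of the requested tags.
std::vector<std::string> GetWords(sqlite3* db, const std::vector<std::string>& tags)
{
    std::vector<std::string> result;
    if (tags.empty())
        return result;

    // Build one "?" placeholder per tag.
    std::string placeholders;
    for (size_t i = 0; i < tags.size(); ++i)
        placeholders += (i == 0 ? "?" : ",?");

    // A word matches if the number of distinct requested tags attached to it
    // equals the number of tags asked for; extra tags on the word are allowed.
    const std::string sql =
        "SELECT w.word FROM words w "
        "JOIN word_tags t ON t.word_id = w.id "
        "WHERE t.tag IN (" + placeholders + ") "
        "GROUP BY w.id "
        "HAVING COUNT(DISTINCT t.tag) = ?";

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql.c_str(), -1, &stmt, nullptr) != SQLITE_OK)
        return result;

    int param = 1;
    for (const std::string& tag : tags)
        sqlite3_bind_text(stmt, param++, tag.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_int(stmt, param, static_cast<int>(tags.size()));

    while (sqlite3_step(stmt) == SQLITE_ROW)
        result.push_back(reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));

    sqlite3_finalize(stmt);
    return result;
}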

It also depends on how many queries you expect during one session. If this is the whole point or a major feature of the software, physically searching through files on your drive would be a big no-go, and the more interesting question becomes: what kind of data structure is a good compromise between fast lookups and memory usage?

A quick and dirty approach that is heavy on memory:

map<string, set<string>> table;   // tag -> set of words carrying that tag
 
set<string> words = table[first_tag];
for each additional tag
{
    // keep only the words that also carry this tag
    words = intersection(words, table[additional_tag]);
}
return words;

The sets in the table should probably be set<string*>, so every word is only kept in memory once, with the sets only storing pointers to the words (or store the words in a vector and keep only indices in the sets, though that only makes sense if you expect fewer than 64k words).
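For completeness, here is a compilable version of the quick and dirty lookup above, with std::set_intersection standing in for the pseudo "intersection" call; the function and variable names are arbitrary:

#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using TagTable = std::map<std::string, std::set<std::string>>;  // tag -> words

std::set<std::string> FindWords(const TagTable& table, const std::vector<std::string>& tags)
{
    if (tags.empty())
        return {};

    auto first = table.find(tags.front());
    if (first == table.end())
        return {};                                   // unknown tag -> no matches

    std::set<std::string> words = first->second;
    for (size_t i = 1; i < tags.size() && !words.empty(); ++i)
    {
        auto it = table.find(tags[i]);
        if (it == table.end())
            return {};

        // Keep only the words present under both the current result and this tag.
        std::set<std::string> narrowed;
        std::set_intersection(words.begin(), words.end(),
                              it->second.begin(), it->second.end(),
                              std::inserter(narrowed, narrowed.begin()));
        words = std::move(narrowed);
    }
    return words;
}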

Why the 64k limit, Trienco?

For the sake of argument and the scalability of the system, let's say that the data sets include every word of every written language on earth. They can be broken into multiple files, but you are going to have several thousand words in each file, and opening new files is expensive, especially since you have to generate the sentences dynamically. As for the number of queries, imagine you have to build an entire 50,000 word book on the fly. This means that you have to open somewhere in the range of 100 files containing some 5000 words each, grab them in the correct order, and drop them in a buffer to be displayed in the correct order.

The system will definitely be much smaller, but I also want it to be sustainable and not need to be completely rewritten in the event that it needs to be updated for faster, more efficient hardware.


Well, in that case an SQL database would definitely be a really good idea; it should at least be the best trade-off between performance/memory usage and implementation effort.

Why the 64k limit, Trienco?

Based on the admittedly random assumption of a 32-bit build. Indexing the words instead of storing pointers means you can get away with a 16-bit index rather than a 32-bit pointer (but obviously only if 16 bits are enough, hence the 64k limit). Of course, if we're talking 64-bit pointers, then a 32-bit index would still be an improvement.

On the other hand, if the system does need to deal with a huge number of words, the approach as such is probably not a good idea in the first place. As I said, it would be my first instinct quick and dirty way to just get it done, without much regard to resource usage.
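As a small sketch of that indexing idea (names made up, and assuming fewer than 64k words so a 16-bit index suffices):

#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> words;                        // every word stored exactly once
std::map<std::string, std::set<std::uint16_t>> table;  // tag -> indices into 'words'

// Store a word once and get back its index.
std::uint16_t AddWord(const std::string& word)
{
    words.push_back(word);
    return static_cast<std::uint16_t>(words.size() - 1);
}

// Attach a tag to an already stored word.
void TagWord(std::uint16_t wordIndex, const std::string& tag)
{
    table[tag].insert(wordIndex);
}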


The function GetWord() takes the tags as an argument and returns the word.

Shouldn't GetWord() return multiple words? What if there are multiple words with the same tags?

Hello: Tags: Greeting, Polite

Hi: Tags: Greeting

GetWord("Greeting")

The 'best' way to implement this varies depending on the scope of the problem. There are already several solutions out there that can perform a map/reduce query, and that is perfect for situations like this. However, you make this sound like a homework/exercise problem, in which case using a third-party library/database probably wouldn't make sense, or would be beyond what was being asked.

The GetWord() function only needs to return one word or phrase made up of one string. If there are multiple correct answers to a query, it just picks one at random and returns it, assuming that the person who gave the tags knew this could happen. I would prefer to use only the libraries I have worked with for the rest of the project: the standard library, SDL, and OpenGL. If any of these has a built-in answer, that would be nice, but I am fully prepared to write one from scratch.
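For the "pick one at random" part, I am picturing something along these lines, given the set of matching words from the intersection step (the names are placeholders):

#include <cstddef>
#include <iterator>
#include <random>
#include <set>
#include <string>

std::string PickRandom(const std::set<std::string>& matches)
{
    if (matches.empty())
        return "";                                    // no word carries all the tags

    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, matches.size() - 1);

    auto it = matches.begin();
    std::advance(it, pick(rng));                      // step to a random element
    return *it;
}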

If you want to do this from scratch for learning purposes, then what Trienco suggested will do the job. Read through the data file, and populate the map.

Given this example:

Hello : Tags: Greeting, Polite

Hi : Tags: Greeting

World : Tags: Object

Your map would look like this:

map["Greeting"] = [ "Hello", "Hi" ]

map["Polite"] = ["Hello"]

map["Object"] = ["World"]

Then finding the word given a list of tags is just a matter of getting the intersection of the words.
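To sketch the loading step (assuming one "<word> : Tags: <tag>, <tag>, ..." entry per line; the parsing here is deliberately naive):

#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Strip leading/trailing spaces and tabs.
static std::string Trim(const std::string& s)
{
    const auto first = s.find_first_not_of(" \t");
    const auto last  = s.find_last_not_of(" \t");
    return first == std::string::npos ? "" : s.substr(first, last - first + 1);
}

std::map<std::string, std::set<std::string>> LoadTable(const std::string& path)
{
    std::map<std::string, std::set<std::string>> table;    // tag -> words
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
    {
        const auto sep = line.find(": Tags:");
        if (sep == std::string::npos)
            continue;                                       // skip malformed lines

        const std::string word = Trim(line.substr(0, sep));
        std::stringstream tags(line.substr(sep + 7));       // text after ": Tags:"
        std::string tag;
        while (std::getline(tags, tag, ','))
            table[Trim(tag)].insert(word);
    }
    return table;
}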

SDL and OpenGL won't help you here.

