Jump to content
  • Advertisement
Sign in to follow this  
gretty

Algorithm to search unordered string vector

This topic is 3091 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hello

I have a large vector of strings(RecentWords) that are arranged according to how frequenly they occur in a paragraph (the most frequent word occupies the 1st element & so on).

I am trying to find a quick searching algorithm to find if a string already exists in the vector RecentWords.

I have forgotten alot of the main searching algoritms (except binary, quick)

What do you think would be best considering that the vector of strings(RecentWords) is not arranged alphabetically?

I am especially interested to learn some new algorithms if you think one might be relevant.

Share this post


Link to post
Share on other sites
Advertisement
Sorting is the key to efficient searching. With unsorted data, you have to search the entire thing.

One solution is to create a master list of items, and then create different indexes on it. If the total size is small you could simply maintain two copies of the data, one sorted by recency and the other sorted alphabetically. However, if the total size is small a brute force approach could suffice too.

For larger data sets, you could store the strings in a master container, then have two vectors of indices into it. You might even get away with using the recent list as the master list, depending on your removal patterns. The important thing is to guarantee the the different containers are in sync. Using the alphabetically sorted list as the master wouldn't be a great idea, as every insert invalidates lots of indexes and that takes time to fix.

Everything here is use-pattern dependent. How frequent are searches by time? What about searches for existing words? How often are entries appended to the time list? Are they ever inserted into the middle? Are removals handled like a queue or a stack, or can they be arbitrary? What is the expected size of the data set, and what is likely to be the peak size? Are entries removed when they become too old or stale?

Share this post


Link to post
Share on other sites
Thanks for the reply.

I think I was going down the road you are explaining but I gave up because it confused me. But just to clarify are you saying I create a vector of WordInfo structures & store 2 different indexes for the word. Ie...




struct WordInfo
{
string word;
unsigned int FreqIndex;
unsigned int AlphaIndex;
};

vector <WordInfo> WordList; // sorted by Frequency

// PS, I am a little confused are you saying I should use 2 vectors (one
// arranged by Frequency & the other Alphabetically? Or just one vector?






I think I am going to go down this path, but what algorithm would you suggest I use to sort the WordInfo structures? Merge, Insertion?

I am having a little difficulty getting my head around using one vector, because I cant solve how I would find if a word exists in WordList when it is arranged by word frequency.

So I am thinking if I had 2 vectors; Frequency arraged & Alphabetically, I can use a binary search to check if a WordInfo struct already exists & an insertion sort to add a new WordInfo struct to both vectors - but does 2 vectors defeat the purpose of a having a WordInfo struct?? Uh this type of thinking does my head in

Share this post


Link to post
Share on other sites
You can sort with std::sort and then do a binary search with std::lower_bound or upper_bound.

RipOff's trying to say (if i read it right)

struct WordInfo
{
std::string word;
unsigned int freqcount;
};

std::vector<WordInfo> words;
std::vector<int> sortIndex;

bool freq_compare( const WordInfo &a, const WordInfo &b )
{
return a.freqcount < b.freqcount;
}

class lexical_compare
{
public:
lexical_compare( std::vector<WordInfo> &source ) :
m_source(source){}

bool operator()(const int &a, const int &b )
{
return m_source.at(a) < m_source.at(b);
}

private:
std::vector<WordInfo> &m_source;
};
template <typename T>
struct gen {
T x;
gen(T seed) : x(seed) { }

T operator ()() { return x++; }
};

sortIndex.resize(words.size());
std::generate(sortIndex.begin(), sortIndex.end(), gen<int>(0));

std::sort( words.begin(), words.end(), freq_compare );
std::sort( sortIndex.begin(), sortIndex.end(), lexical_compare(words) );





But really it all depends on the amount of inserting and searching you need to do? and why? Could you get away with inserting all words, then sorting on frequency? Do you search by frequency MORE than you ever search by word? or is it more often by word, and less often by frequency?
Could you get away with storing the words in a std::map (or hash_map), then lookup of any information by "word" would be fast.

Share this post


Link to post
Share on other sites
How many words? If less than 1000, then just do linear search.

Otherwise, keep a std::set around. Keep all known words in there. Add and remove to set as needed.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!