fast comparison

Started by
25 comments, last by random_thinker 18 years, 7 months ago
Quote:Original post by guyaton
for clarification for myself, make a hash of the "comparing string" as i used in my example? havn't used a hash table in C++ but i know it is a member of the stl. how would i do it using c++ stl?

thanks again
~guyaton
No, you store the words in the string "some sample string for testing purposes" into a hash table. Like hash["some"]=0;hash["sample"]=0;... and then check if the comparing strings can be found in the hash: "hash.find("noun") != hash.end()", "hash.find("something") != hash.end()" and so on
Advertisement
Why can't you just load up your reference strings in an array, and just cycle through any phrase something like this:

unsigned int i;std::string my_phrase = "some sample string for testing purposes";std::vector<std::string>my_words;my_words.push_back("for");my_words.push_back("noun");my_words.push_back("something");my_words.push_back("is");my_words.push_back("test");my_words.push_back("blah");for ( i = 0; i < my_words.size(); ++i){  if (my_phrase.find(my_words) != std::string::npos)  {      // Do something if found.  }}


--random_thinker

Edit: No, that won't work, it will show "test" is there...hang on...

[Edited by - random_thinker on August 29, 2005 3:20:09 PM]
--random_thinkerAs Albert Einstein said: 'Imagination is more important than knowledge'. Of course, he also said: 'If I had only known, I would have been a locksmith'.
random_thinker: i was just trying to find something fairly efficiant. If I have 30 keyworks and say 40,000 works (even if i loaded it all line by line) it could take a while to search thru each word. This is why I am trying to find something efficiant.

~guyaton

edit: i came up with this real quick...still questioning how fast it is:

struct tempStr{	string foundString;	unsigned char location;	tempStr( string &str, unsigned char loc )	{		foundString = str;		location = loc;	}};//.....//elsewhere....map<string, unsigned char> myHash;	myHash.insert( pair<string, unsigned char>("class", (unsigned char)0 ) );	myHash.insert( pair<string, unsigned char>("#include", (unsigned char)1) );	string someString = "thist is a classy function class";	string::size_type seeker = 0;		vector<tempStr> output;	while( seeker != someString.npos )	{		string::size_type tempSeeker;		string subsubstring = someString.substr( seeker, ( tempSeeker = someString.find_first_of( ' ', seeker ) ) - seeker );				//find( myHash.begin(), myHash.end(), someString );		map<string, unsigned char>::iterator someIter = myHash.find( subsubstring );		if( someIter == myHash.end() )		{			seeker = tempSeeker+1;		}		else		{			tempStr test = tempStr( subsubstring, (unsigned char)seeker ) ;			output.push_back( test );		}	}


edit #2: been reading on the Boyer-Moore Algorythm and it sounds pretty fast and cool. might check that out.

[Edited by - guyaton on August 29, 2005 4:45:13 PM]
~guyaton
Oh, I see. You could also try using an ifstream that would only pick up 'whole words', something like the following (you might need to strip punctuation from current_string):

std::string current_string;std::string my_words = "for noun something is test blah";  // or put these in a std::map<>, and later do a find(), depending upon the info you need.std::ifstream my_text;my_text.open("my_file.txt");while (!my_text.fail()){   my_text >> current_string;   if (my_words.find(current_string) != std::string::npos)  // Could do a  find() on a std::map<> of my_words here.   {      // Do something if found.   }}


--random_thinker
--random_thinkerAs Albert Einstein said: 'Imagination is more important than knowledge'. Of course, he also said: 'If I had only known, I would have been a locksmith'.
if anyone else is interested found a really good tut of what the Boyer-Moore algorithm is compliments of www.google.com

http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

~guyaton
~guyaton
guyaton,

I would have thought that if you could limit your string search to the 30 words rather than the 40,000 by streaming the 40,000 one at a time then the C++ string search algorithm should be fast enough, no?

--random_thinker
--random_thinkerAs Albert Einstein said: 'Imagination is more important than knowledge'. Of course, he also said: 'If I had only known, I would have been a locksmith'.
Quote:Original post by random_thinker
guyaton,

I would have thought that if you could limit your string search to the 30 words rather than the 40,000 by streaming the 40,000 one at a time then the C++ string search algorithm should be fast enough, no?

--random_thinker


Depends upon what's fast enough, but your way requires (worst case) as number of keywords times number of text words string comparisons (which are sloooow). That's about as slow as you can do it, without repeating yourself.
You can do things like find_first_of() which might speed things up...

--random_thinkerAs Albert Einstein said: 'Imagination is more important than knowledge'. Of course, he also said: 'If I had only known, I would have been a locksmith'.
You don't need/want boyer-moore here because you're searching for specific words.. And std::map is not a hash map, it's a tree which also keeps the elements in order (a feature you don't need here). There's no standard hash map in STL (yet) although many STL implementations place some sort of hash_map into the std-namespace. hash map has less indirection than tree structures, so it's easy on cache. It should be a very fast solution.. Whether you should place the comparing strings in the hash map or the words in the other string, depends on which ones change less often (you want to build the hash only rarely)
Quote:Original post by Anonymous Poster
You don't need/want boyer-moore here because you're searching for specific words.. And std::map is not a hash map, it's a tree which also keeps the elements in order (a feature you don't need here). There's no standard hash map in STL (yet) although many STL implementations place some sort of hash_map into the std-namespace. hash map has less indirection than tree structures, so it's easy on cache. It should be a very fast solution.. Whether you should place the comparing strings in the hash map or the words in the other string, depends on which ones change less often (you want to build the hash only rarely)
In all honesty that's generally quite good advice, and I must admit, I didn't read every previous post before suggesting the Boyer-Moore algorithm (particularly the ones about it always being single words). I also didn't post a link because there's bound to be HEAPS of stuff on the net about it.

I'm a big fan of hash-tables myself too, however the Boyer-Moore, or other related algorithms are very likely to be around as quick as hashing the words when you consider the time taken for hashing every word.
The Boyer-Moore algorithm has a reasonably good payback for a relatively smallish setup cost, not to mention a relatively smallish memory overhead compared to hashing.

It's all about tradeoffs between memory usage / speed / flexability / simplicity etc. I'm merely saying that it's still an option, depending on how important the things I just mentioned are.[smile]
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

This topic is closed to new replies.

Advertisement