Sign in to follow this  
Domarius

Text search algorithm?

Recommended Posts

So how can my c++ app act like the web search engines do? I want to generate a "best match" for some search term, against a huge list of strings. It would be best if partial word matches worked, eg. jump would match jumper something like 75%, instead of just ignoring it altogether just because it's not exactly the same word. Never mind the storing/retrieving of data - its just the actual act of determining a "best match" that I'm curious about. The only way I can think of would be to step through the source string and look for a best match starting from character 0 in the target string, record the result, then start again but starting from character 1 in the target string, record the result if higher than last, etc. etc.

Share this post


Link to post
Share on other sites
You can probably just use a regular expression library. If you have never used regex's before it will take a little learning and practice but they're really useful. I've only used them in perl, but I'm sure there's a c++ library out there somewhere.

Share this post


Link to post
Share on other sites
Thanks for the quick reply! (Too quick for me - I edited my first post since)

I will check out this one
http://www.tropicsoft.com/Components/RegularExpression/

I think this will do :) Thanks again.

Share this post


Link to post
Share on other sites
I think you might be right... it allows any part of one word to match any part of another?

Looks like it makes lots of copies of each part... does this tree structure really grow huge quickly?

Do you think this is how the www search engines work?

Share this post


Link to post
Share on other sites
Search engines use a combination of a number of techniques however the primary one to my knowledge is based on linear algebra, the basic idea is to treat the strings as a vector space. In the simpilest implementation you have a basis vector for each keyword then each for each article you can create a vector representing it based on the number of times each keyword appears and when you have to search for something you construct the vector representing it and take the Inner product between the two from which you can calculate the cos of the "angle" between them which gives you the indication of how close a match it is.

You still need some other algorithm to find the keywords when building the vectors though.

Example:

using euclidian inner product (dot) for simplicity

keywords = "Search", "String", "Implementation"

text1 = "Searching strings is implemented using vector spaces"
text2 = "I like strings"

text1_vector = (1, 1, 1)
text2_vector = (0, 1, 0)

normalized_text1_vector = (1/sqrt(3), 1/sqrt(3), 1/sqrt(3))
normalized_text2_vector = ( 0, 1, 0)

search_string = "Search"
search_vector = (1, 0, 0)
normalized_search_string = (1, 0, 0)

normalized_search_vector . normalized_text1_vector =
1 * 1/sqrt(3) + 0 * 1/sqrt(3) + 0 * 1/sqrt(3) = 1/sqrt(3)

normalized_search_vector . normalized_text2_vector =
1 * 0 + 0 * 1 + 0 * 0 = 0

text1_relevence = 57.7%
text2_relevence = 0%


If you want I can dig up the notes from my linear algebra class which had a section on this and link you to a copy.
(note: I dont recomend you use this way but it can be interesting)

Share this post


Link to post
Share on other sites
Quote:
Original post by Domarius
I think you might be right... it allows any part of one word to match any part of another?

Looks like it makes lots of copies of each part... does this tree structure really grow huge quickly?


One efficient implementation I've seen of this stores the whole string to be searched once, and stores the suffixes as the first and last index of the string in the whole string.

If all you want is to tell if the string exists in your corpus this is a good C-like implementation to build from. http://marknelson.us/1996/08/01/suffix-trees/

Another concept you might find useful is this: http://en.wikipedia.org/wiki/Levenshtein_distance

These can help you identify keywords that are close to what the user typed that are actually in the corpus, it's commonly used in those "fix as you type" tools and spell checkers that suggest corrections.

Share this post


Link to post
Share on other sites
Wow this is a very evolved concept. Thanks for the insight on how these things work guys. It's very interesting, and I don't think I have the ability to take on making my own yet, but I started reading http://marknelson.us/1996/08/01/suffix-trees/ and I was actually following it, hehe... it's very good. But it's nearly midnight and my capacity for such things is rapidly narrowing. Though I have saved this thread page for later reference. This is certainly satisfying my curiosity on the topic. I always imagined there would have to be some very clever solution for trying every possible match between some search term and huge amounts of text.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this