Text search algorithm?

Started by Domarius
6 comments, last by Domarius 16 years, 11 months ago
So how can my C++ app act like the web search engines do? I want to generate a "best match" for some search term against a huge list of strings. It would be best if partial word matches worked, e.g. "jump" would match "jumper" at something like 75%, instead of being ignored altogether just because it's not exactly the same word. Never mind the storing/retrieving of data - it's just the actual act of determining a "best match" that I'm curious about. The only way I can think of would be to step through the source string and look for a best match starting from character 0 in the target string, record the result, then start again from character 1 in the target string, record the result if it's higher than the last, and so on.

For local multiplayer retro themed games, visit Domarius Games.

You can probably just use a regular expression library. If you have never used regexes before it will take a little learning and practice, but they're really useful. I've only used them in Perl, but I'm sure there's a C++ library out there somewhere.
Thanks for the quick reply! (Too quick for me - I edited my first post since)

I will check out this one
http://www.tropicsoft.com/Components/RegularExpression/

I think this will do :) Thanks again.


I think what you want is a suffix tree:

Suffix Tree
I think you might be right... it allows any part of one word to match any part of another?

Looks like it makes lots of copies of each part... does this tree structure really grow huge quickly?

Do you think this is how the www search engines work?


Search engines use a combination of techniques, but the primary one, to my knowledge, is based on linear algebra. The basic idea is to treat the strings as a vector space. In the simplest implementation you have a basis vector for each keyword. For each article you create a vector based on the number of times each keyword appears in it. When you search for something, you construct the vector representing the query, take the inner product of the two, and from that calculate the cosine of the "angle" between them, which gives you an indication of how close a match it is.

You still need some other algorithm to find the keywords when building the vectors though.

Example:
using euclidean inner product (dot) for simplicity

keywords = "Search", "String", "Implementation"
text1 = "Searching strings is implemented using vector spaces"
text2 = "I like strings"

text1_vector = (1, 1, 1)
text2_vector = (0, 1, 0)
normalized_text1_vector = (1/sqrt(3), 1/sqrt(3), 1/sqrt(3))
normalized_text2_vector = (        0,         1,         0)

search_string = "Search"
search_vector = (1, 0, 0)
normalized_search_vector = (1, 0, 0)

normalized_search_vector . normalized_text1_vector
    = 1 * 1/sqrt(3) + 0 * 1/sqrt(3) + 0 * 1/sqrt(3) = 1/sqrt(3)
normalized_search_vector . normalized_text2_vector
    = 1 * 0 + 0 * 1 + 0 * 0 = 0

text1_relevance = 57.7%
text2_relevance = 0%


If you want I can dig up the notes from my linear algebra class which had a section on this and link you to a copy.
(note: I don't recommend you use this approach, but it can be interesting)
Quote:Original post by Domarius
I think you might be right... it allows any part of one word to match any part of another?

Looks like it makes lots of copies of each part... does this tree structure really grow huge quickly?


One efficient implementation I've seen of this stores the whole string to be searched only once, and represents each suffix by its first and last index into that single stored string.
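That store-indices-not-copies idea can be sketched with a suffix array (a simpler cousin of the suffix tree, used here purely as illustration): every suffix is just a starting index into the one stored copy of the text, and the indices are sorted so substrings can be found by binary search. The function names are my own.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Each suffix of `text` is represented only by its starting index into the
// single stored string -- no per-suffix copies are made. The indices are
// sorted lexicographically by the suffixes they point at.
std::vector<std::size_t> buildSuffixArray(const std::string& text)
{
    std::vector<std::size_t> suffixes(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) suffixes[i] = i;
    std::sort(suffixes.begin(), suffixes.end(),
              [&](std::size_t a, std::size_t b)
              {
                  return text.compare(a, std::string::npos,
                                      text, b, std::string::npos) < 0;
              });
    return suffixes;
}

// Binary-search the sorted suffix indices for any occurrence of `pattern`.
bool contains(const std::string& text,
              const std::vector<std::size_t>& suffixes,
              const std::string& pattern)
{
    auto it = std::lower_bound(
        suffixes.begin(), suffixes.end(), pattern,
        [&](std::size_t s, const std::string& p)
        { return text.compare(s, p.size(), p) < 0; });
    return it != suffixes.end()
        && text.compare(*it, pattern.size(), pattern) == 0;
}
```

The naive sort here is O(n^2 log n) in the worst case; a proper suffix tree (as in the Mark Nelson article below) gets construction down to linear time.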

If all you want is to tell whether a string exists in your corpus, this is a good C-like implementation to build from: http://marknelson.us/1996/08/01/suffix-trees/

Another concept you might find useful is this: http://en.wikipedia.org/wiki/Levenshtein_distance

These can help you identify keywords in the corpus that are close to what the user typed; it's the approach commonly used in "fix as you type" tools and spell checkers that suggest corrections.
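The Levenshtein distance linked above has a compact dynamic-programming implementation; here's a minimal sketch (my own, not from the linked page), keeping only two rows of the DP table:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic dynamic-programming edit distance: the minimum number of
// single-character insertions, deletions, and substitutions needed to
// turn `a` into `b`. Only the previous and current DP rows are kept.
std::size_t levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,          // deletion
                                 curr[j - 1] + 1,      // insertion
                                 prev[j - 1] + cost }); // substitution
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

For the OP's example, `levenshtein("jump", "jumper")` is 2 (two insertions), so a score like `1 - distance / longer_length` would rate the match at about 67%.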
Wow, this is a very evolved concept. Thanks for the insight into how these things work, guys. It's very interesting, and I don't think I have the ability to take on making my own yet, but I started reading http://marknelson.us/1996/08/01/suffix-trees/ and I was actually following it, hehe... it's very good. But it's nearly midnight and my capacity for such things is rapidly narrowing. I have saved this thread page for later reference, though. This is certainly satisfying my curiosity on the topic - I always imagined there would have to be some very clever solution for trying every possible match between a search term and huge amounts of text.


This topic is closed to new replies.
