Sign in to follow this  
gamma_strahler

An approach for a search algorithm like google does it

Recommended Posts

Hi, in our company we want to redesign the search logic for our application. Part of our concept is the idea to find the database entries which contain ALL specified search terms in the database entries (ANDed search, something like google does it). so i searched the web for existing algorithms for this approach but didn´t find anything explaning this strategy. Can you give me some starting points and hints of how i could accomplish this? Thanks in advance! gamma_strahler

Share this post


Link to post
Share on other sites
The algorithm google uses is called 'pagerank' searching on that should provide some more useful results.

However, you might be better using a prebuilt solution like http://lucene.apache.org/ rather than trying to make a search engine yourself.

Share this post


Link to post
Share on other sites
Quote:
Original post by gamma_strahler

Can you give me some starting points and hints of how i could accomplish this?


Usually you don't search the database.

You parse the data and build an index. Then you lookup within the index using the criteria you want.

If you just need a trivial search, then SQL gives you LIKE.

Share this post


Link to post
Share on other sites
Quote:
The algorithm google uses is called 'pagerank' searching on that should provide some more useful results.
That's their ranking algorithm.

To do boolean queries (and, or, not) you want an inverted index (google for it, there's loads of info). They let you do merge joins quickly.

But yeah, use prebuilt stuff if possible, a robust inverted index + support code is rather a chore to write from scratch. Lucene was mentioned, postgresql comes with builtin text search, I assume many dbs come with built in text search capabilities.

Share this post


Link to post
Share on other sites
I think google may be working towards some sort of api thats out of the box, as Adm_42 suggests a packaged solution may be best.

developing a search algorithm will take bags of time. But then again if you develop an algorithm thats better than google after a few years of programming you'll make millions. ;-)

It might also be worth investigating the features that are included with the database software you already have. in sql server 2005 its possible too setup full text indexed searches for tables (with a little bit of shoehorning)

Share this post


Link to post
Share on other sites
phpbb and IPB are two examples of php boards that implement their own fulltext searches... The theory behind it is quite simple: create a word list and keep track of all (articles/content pages) that contain that word, and also keep track of the frequency of use of words. That makes up your fulltext index. The process from there on is very simple: search for each search term alphabetically in the index and, based on the boolean keywords (AND, OR, etc), find out which pages to include or exclude from the results. Use the word frequency to exclude common words, like "the" or "is." Words that appear in more than 50% of records are usually safe to discard from a search. You can also discard words that are less than 4 characters in length, numbers, URLs... etc...

The only "drawback" is that the index needs to be updated every time a relevant record is created, deleted or modified. Things can get messy there... Make sure you optimize ("defragment") your databases often.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this