An approach for a search algorithm like google does it

Started by
4 comments, last by kiwibonga 15 years, 11 months ago
Hi, in our company we want to redesign the search logic for our application. Part of our concept is the idea to find the database entries which contain ALL specified search terms in the database entries (ANDed search, something like google does it). so i searched the web for existing algorithms for this approach but didn´t find anything explaning this strategy. Can you give me some starting points and hints of how i could accomplish this? Thanks in advance! gamma_strahler
Advertisement
The algorithm google uses is called 'pagerank' searching on that should provide some more useful results.

However, you might be better using a prebuilt solution like http://lucene.apache.org/ rather than trying to make a search engine yourself.
Quote:Original post by gamma_strahler

Can you give me some starting points and hints of how i could accomplish this?


Usually you don't search the database.

You parse the data and build an index. Then you lookup within the index using the criteria you want.

If you just need a trivial search, then SQL gives you LIKE.
Quote:The algorithm google uses is called 'pagerank' searching on that should provide some more useful results.
That's their ranking algorithm.

To do boolean queries (and, or, not) you want an inverted index (google for it, there's loads of info). They let you do merge joins quickly.

But yeah, use prebuilt stuff if possible, a robust inverted index + support code is rather a chore to write from scratch. Lucene was mentioned, postgresql comes with builtin text search, I assume many dbs come with built in text search capabilities.
I think google may be working towards some sort of api thats out of the box, as Adm_42 suggests a packaged solution may be best.

developing a search algorithm will take bags of time. But then again if you develop an algorithm thats better than google after a few years of programming you'll make millions. ;-)

It might also be worth investigating the features that are included with the database software you already have. in sql server 2005 its possible too setup full text indexed searches for tables (with a little bit of shoehorning)
http://www.fotofill.co.uk
phpbb and IPB are two examples of php boards that implement their own fulltext searches... The theory behind it is quite simple: create a word list and keep track of all (articles/content pages) that contain that word, and also keep track of the frequency of use of words. That makes up your fulltext index. The process from there on is very simple: search for each search term alphabetically in the index and, based on the boolean keywords (AND, OR, etc), find out which pages to include or exclude from the results. Use the word frequency to exclude common words, like "the" or "is." Words that appear in more than 50% of records are usually safe to discard from a search. You can also discard words that are less than 4 characters in length, numbers, URLs... etc...

The only "drawback" is that the index needs to be updated every time a relevant record is created, deleted or modified. Things can get messy there... Make sure you optimize ("defragment") your databases often.

This topic is closed to new replies.

Advertisement