boolean

[web] ASP+Access: Search page, ordered by relevance


I have a search form on a page, and when the results are returned I want them ordered by the relevance of the entries (so those with the most keyword hits come up first). I can't for the life of me figure out how I would do this. Has anyone ever used one of these? I tried googling, but all the info assumes I am using an Index Server.

Well, at the moment it is just searching the same text that is returned to the page for the user to read. I probably should make a separate keyword table.

Normally I'd suggest using full-text indexing.

This works with SQL Server, although it's a bit annoying to set up and it violates some of SQL Server's normal rules (because it runs in an external service called "Microsoft Search").

But it works.

I dunno about MS Access, but I strongly recommend not using it as the backend for an ASP page - it will result in poor performance and (in my experience) data corruption.

Mark

I second the recommendation against using Access. I once had to build an ASP front end for an Access database at work. It ran for about 3 weeks before falling flat on its face! The only way I could get it back was to create a new database and import all the data across. We switched to MySQL after that, and it's been running non-stop for a year and a half now!

The only thing I can think of is doing multiple queries.
I wouldn't recommend it, but I'm afraid it's the only way possible.
You could split up the keywords and first search for records containing all of them, then search for records containing only some of them, or just one.
To clarify my poor explanation, using some pseudo-SQL:
keywords: "A", "B"

SELECT [name], 0 AS [relevance] FROM tblEntry
WHERE [text] LIKE '%A%' AND [text] LIKE '%B%'

UNION ALL

SELECT [name], 1 AS [relevance] FROM tblEntry
WHERE ([text] LIKE '%A%' AND [text] NOT LIKE '%B%') OR ([text] LIKE '%B%' AND [text] NOT LIKE '%A%')

ORDER BY [relevance], [name]

I don't think I have to tell you that this may seem reasonable for one or two keywords, but it becomes dreadful with three or more (the number of keyword combinations, and thus WHERE branches, grows exponentially).
So for the love of God, don't do it! [grin]
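That said, if you do need this in plain Jet/Access SQL, there is a less explosive variant: Access's IIf function can award one point per matched keyword and sum the points, so the query grows linearly with the keyword count instead of exponentially. A sketch of building such a query (reusing the table and column names from the pseudo-SQL above; in real code the keywords must be sanitized before being spliced into SQL):

```python
def build_scored_query(keywords):
    """Sketch: build one Access/Jet-style query that scores a point per
    matched keyword via IIf, instead of one UNION branch per keyword
    combination. Keywords are assumed pre-escaped; real code must
    sanitize user input before splicing it into SQL."""
    score = " + ".join(f"IIf([text] LIKE '%{kw}%', 1, 0)" for kw in keywords)
    where = " OR ".join(f"[text] LIKE '%{kw}%'" for kw in keywords)
    return (f"SELECT [name], {score} AS [relevance] FROM tblEntry "
            f"WHERE {where} ORDER BY {score} DESC, [name]")

print(build_scored_query(["A", "B"]))
```

(Jet accepts the '%' wildcard when queried through ADO, which is how the thread's pseudo-SQL is written anyway.)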

There's no automagic way of doing this with Access that I know of. Aside from using multiple queries, if the recordsets are small enough you could order them yourself in ASP.
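The "order them yourself" route is simple enough for small recordsets: count the keyword hits per record in application code and sort on the count. A sketch of that logic (in Python rather than the thread's VBScript, purely to show the idea; the row layout is made up):

```python
def rank_by_hits(rows, keywords):
    """Score each row by total keyword occurrences and sort best-first."""
    def score(row):
        text = row["text"].lower()
        return sum(text.count(kw.lower()) for kw in keywords)
    return sorted(rows, key=score, reverse=True)

rows = [
    {"name": "intro", "text": "render basics"},
    {"name": "shading", "text": "shading and more shading for render"},
]
print([r["name"] for r in rank_by_hits(rows, ["shading", "render"])])
# -> ['shading', 'intro']
```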

There are a bunch of ways of doing it - you could create entries in a temporary table (or temporary entries in a permanent table) for the items that match keywords, accumulating a score for each,
then join against this temporary table and order by the score you've accumulated.

Mark

Quote:
Original post by markr
There are a bunch of ways of doing it - you could create entries in a temporary table (or temporary entries in a permanent table) containing the items with keywords and add them up,
then join this temporary table, ordering by the score you've accumulated

Mark


This doesn't make sense to me.. =P Could you elaborate a little more on this?

---
Michael Tanczos

Quote:
Original post by boolean
I have a search form on a page that, when the results are returned, I want to have it ordered by the relevance of the entries (So those with the most keyword hits come up first).

I can't for the life of me figure out how I would do this. Anyone ever used one of these? I tried googling, but all the info assumes I am using an Index Server.


The best thing you might be able to do is create keyword lists by hand for each item. Then create one table representing the individual records you want to search, using the primary key "ARTICLE_ID" (simplified for this example). Then create a lexicon that contains each and every keyword. The lexicon is basically a table with the format: WORD, ARTICLE_ID, HITCOUNT

So if an article has keywords "shading, 3d, render", then you create an indexer that, for each article, calculates the hit count for each keyword and adds it to the lexicon.

To do a search you do something like: "SELECT ARTICLE_ID FROM Lexicon WHERE WORD LIKE 'searchterm' ORDER BY HITCOUNT DESC"

Doing searches based on multiple terms would require a way to reconcile the relative value of the hit counts. If "3d" appears 30 times in one article, and "shading" appears 5 times in that same article.. is it better than an article that has "shading" 30 times but "3d" only 5 times?

This might be your closest bet..
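The indexing step could look something like this (a sketch only; index_article is a hypothetical helper, and the tuples mirror the WORD, ARTICLE_ID, HITCOUNT layout above):

```python
import re
from collections import Counter

def index_article(article_id, text, keywords):
    """Count how often each keyword occurs in one article and emit
    lexicon rows in the (WORD, ARTICLE_ID, HITCOUNT) layout."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w in keywords)
    return [(word, article_id, hits) for word, hits in counts.items()]

rows = index_article(7, "3D shading: render the 3D scene",
                     {"shading", "3d", "render"})
print(sorted(rows))
# -> [('3d', 7, 2), ('render', 7, 1), ('shading', 7, 1)]
```

Each row would then be inserted into the Lexicon table, so the ORDER BY HITCOUNT query above can rank articles without rescanning their text.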

---
Michael Tanczos

Hi all.

What Mark said was right. It isn't his fault if it made no sense to you.

I recently had the same problem and used this same solution.

The reason it is better to create the temp table is that your relevance algorithm may grow more complicated than just counting how many of the keywords matched.
For example, with the UNION approach a search that matched on all keywords would give every result the same score, and so would a single-word search.

Suppose you add other factors, like how close together the words are, how many times each word appears, whether a word is weighted high or low in a separate keywords list, how many of the keywords fell adjacent to each other in the searched text, and in what order (and many more).

That's when the concept of relevance grows more complex than "how many of the keywords were there".

So a good solution in any of these scenarios is to build the temp table off the original recordset, parse them in ASP (Or whatever), score them, and record the record ID and score as 2 columns in a new table. Then on display, either join the tables in the second query (as Mark said) or just individually retrieve each record's data as needed (the way I did it). Also in this method, you could page out your results based on the query of the temp table, so that you don't need to run your taxing query (where keyword = x and x and x and x). You only run that query once and then all subsequent pages just query the next 20 records out of the temp table.
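In case a concrete shape helps, here is a rough sketch of that flow (Python standing in for ASP, and all names are illustrative): run the expensive keyword query once, score the records in application code, keep (record ID, score) pairs as the two-column temp table, and serve every subsequent page from that:

```python
def build_score_table(records, score_fn):
    """Score every record once and keep (record_id, score) pairs,
    sorted best-first -- the stand-in for the two-column temp table."""
    scored = [(rec["id"], score_fn(rec)) for rec in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

def page_of(score_table, page_num, page_size=20):
    """Serve one page of results from the precomputed table, so the
    taxing keyword query never has to run again."""
    start = page_num * page_size
    return score_table[start:start + page_size]

# Toy data: score is just the text length here; a real score_fn would
# apply whatever relevance factors you settle on.
records = [{"id": i, "text": "x" * i} for i in range(1, 6)]
table = build_score_table(records, lambda rec: len(rec["text"]))
print(page_of(table, 0, page_size=2))
# -> [(5, 5), (4, 4)]
```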

Quote:
Original post by Wengang
What Mark said was right. It isn't his fault if it made no sense to you.


It makes no sense because he explained it poorly. That would be his fault.

Quote:
So a good solution in any of these scenarios is to build the temp table off the original recordset, parse them in ASP (Or whatever), score them, and record the record ID and score as 2 columns in a new table. Then on display, either join the tables in the second query (as Mark said) or just individually retrieve each record's data as needed (the way I did it).


This isn't particularly helpful either, as the sentence structure is somewhat broken. Maybe it makes sense to you, and that's fine for you, but when you are explaining things it pays to be precise and create more of a roadmap than a scavenger hunt. "..score them, and record the record ID and score as 2 columns in a new table" is nebulous. You probably mean store, but what two columns? Are you doing the indexing on every single search?


Just so it doesn't seem like I'm picking on you for the sake of picking on you, I'll give you an example:

"Suppose you add other factors, like how close together the words are, how many times each word appears, whether a word is weighted high or low in a separate keywords list, how many of the keywords fell adjacent to each other in the search text, and in what order."

How exactly do you store all of this? How do you account for factors such as word weighting, proximity, and other unalike metrics? The useful answer is that writing a search engine in ASP that parses recordsets is not a particularly easy task unless you plan on using factors such as word hit counts as your sole metric for relevancy. MS Access is also designed for small business databases, and lexicons can easily grow to substantial sizes even if you filter words through stop lists. Also, storing exact word positions requires an associated list for each particular word, which eats up even more space in a database that is really made for small applications to begin with.

Just so it doesn't seem like I'm talking out of my ass, I'll add that I have written an enterprise class search engine that indexes multiple gigabytes of hypertextual data and can handle several million web pages simultaneously. The memory footprint and processing speed are low enough that really the disk space and disk access times are the critical limiting factors (though the company ran out of cash before we could really start pushing the limits of what it could handle). When you start dealing on the scale of millions of pages, damn near everything becomes an issue.. especially the index inversion process. Text retrieval was my major focus for my CS degree as well and I have given talks on the subject for the PSU IEEE when I was a student.

---
Michael Tanczos

