Suggestion algorithm

Started by
7 comments, last by Goof Prog 10 years, 10 months ago

I'm currently in the process of designing a web service that will allow users to give binary information about whether they enjoyed a game/movie/show/anime/music/book/etc. Then the service will turn around and offer suggestions of other media that the user would have a good probability of enjoying.

Netflix/Amazon/IMDB/a lot of other services provide this feature.

The algorithm I'm looking at using for suggesting new media for the user based on what they "liked" is the somewhat well known cosine algorithm. This algorithm basically makes the assumption that there is a reasonably strong correlation between the cosine of the angle between two vectors and their "similarity."

Basically imagine a grid where one axis is all users and the other axis is all media. Each intersection is a 1 if that media was liked by that user and a 0 if not.

Now if a user likes media item 'X,' then the vector for X is taken against the vector for every other media item. For each pair, the dot product is taken, then divided by the product of their magnitudes. This results in the cosine of the angle between the two vectors.

These are then ranked and the top candidates are suggested to the user in order.

My question/problem:

This will become prohibitively expensive quickly.

First off, the cosine(angle) needs to be computed for every pair of vectors every so often.

Then for every user, the suggestions based on their "likes" need to be sorted and presented every so often. Basically for this we need to look through the database of cosine(angle)'s and grab all the scores that include as one of their items, one that the user has liked. Then sort that list and present the top few to the user.

For millions of users and thousands of items, this could grow way out of control.

Are their cheaper ways of doing this? Or am I going about it algorithmically wrong. Or is it just something that has to be dealt with directly via cacheing/segmenting/etc.?

Advertisement
This class of problems is called collaborative filtering.

Leaving performance to the side for a minute, I think there are some problems with the algorithm to begin with. Your grid is probably very very sparse, and you can't indicate "0" to mean both "the user didn't like the item" and "the user didn't rate the item" (or perhaps you haven't described precisely enough what's in the grid).

This can be viewed as a machine learning problem, where you have a many-parameter model that predicts ratings given user and item. If you have a database of user-item-rating triplets, you can adjust the parameters to try to make the model fit the database.

I haven't programmed one of these things myself, but I have heard enough about it that I know where I would start. Assign each user and each item a "factor vector", which is just a list of a few (say, 20) numbers. To estimate a rating given a user and an item, compute the dot product of their factor vectors, then add a per-item bias (some items are more popular than others) and a per-user bias (some people give higher ratings than others), and finally pass the result through a sigmoid function, like 1/(1+exp(-x)) to obtain a probability of the user liking the item.

Where do the factor vectors come from? They start as random data and you tweak them with some learning algorithm that aims to reduce the error rate when evaluating the model over the database.

I think this algorithm will result in much better probability estimates, and it is also much faster to evaluate (to compare two items you basically need to compute the dot product of two vectors of length 20, not "millions").

So this factor vector would be refined kind of similar to ELO?

Probably not...

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

Was actually referring to http://en.wikipedia.org/wiki/Elo_rating_system

But I like the band too

You can define the error rate of your model to be the average amount of information per rating given the model (a.k.a. cross entropy). If the model estimates that the rating will be X with probability p, the amount of information associated with the rating being X is -log2(p). Now you can compute the derivative of this number with respect to each of the numbers involved in the computation of p, and use some form of gradient descent to adjust those numbers. Iterate through the data a few times until the numbers converge.

You should keep some of your data for testing the quality of your predictions. You can use cross entropy as the measure of performance.

Machine learning is a difficult subject, but I think this might be a good project to get into it, because the learning is pretty straight forward.

Are you familiar with Bayesean Inference at all? That's one approach to this kind of problem.

Unless I misunderstand your approach, I think how you meant to model your approach is not as a grid -- vectors in the grid become relative, unless I'm missing something, and hence, not much use. Instead, if each user is a vector of all media, you can then compare the preferences of a given user, and then make suggestions based upon the likes of similar users; in principle:

  • For the target user, form a vector composed of all media they've rated, and identical vectors for all other users, or a sufficiently-large sample of other users.
  • Compare this vector to the vector of each other user for similarity (dot product), choose a set of users that are most closely matched. If you like, you can weight the users according to whether the similarity is strong or weak.
  • From the set of similar users, suggest to the target user media that similar users have liked, but which the target user has not consumed/rated themselves. Rank recommendations by the consistency with which the similar users preferred each media.

You can do this based on simple thumbs-up/thumbs-down of individual media, or by a fuzzy rating. There are pros and cons to both approaches. You can also decompose media into representative traits (animation or live-action, genre, themes, writers, directors, actors, etc) and then form your vectors based on the traits, rather than the media itself (then, to recommend, you do the reverse, looking for media which has similar traits -- in fact, to some extent you can rely less on the comparison to similar users).

throw table_exception("(? ???)? ? ???");

Take a look at http://www.netflixprize.com/ which I found via http://stackoverflow.com/questions/626220/how-do-recommendation-systems-work

I do not think it will work. You are looking into data relationships not the actual trig function. It isn't a math algorithm it is an object algorithm.

This topic is closed to new replies.

Advertisement