I'm currently designing a web service that lets users give binary feedback on whether they enjoyed a game, movie, show, anime, piece of music, book, etc. The service then turns around and offers suggestions of other media that the user has a good probability of enjoying.
Netflix/Amazon/IMDB/a lot of other services provide this feature.
The algorithm I'm looking at for suggesting new media based on what a user "liked" is the fairly well-known cosine similarity. It rests on the assumption that there is a reasonably strong correlation between the cosine of the angle between two vectors and the "similarity" of the things they represent.
Basically imagine a grid where one axis is all users and the other axis is all media. Each intersection is a 1 if that media was liked by that user and a 0 if not.
Now if a user likes media item X, the vector for X is compared against the vector for every other media item: for each pair, the dot product is taken and divided by the product of the two magnitudes, which yields the cosine of the angle between the two vectors.
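A minimal sketch of that pairwise computation in Python with NumPy (the toy `likes` matrix and all names here are my own, just for illustration):

```python
import numpy as np

# Toy data: rows are users, columns are media items.
# A 1 means that user "liked" that item.
likes = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

# Compare item 0's column against every other item's column.
x = likes[:, 0]
sims = [cosine_similarity(x, likes[:, j]) for j in range(1, likes.shape[1])]
```

Since the vectors are all non-negative, every score lands in [0, 1], with 1 meaning the two items were liked by exactly proportional sets of users.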
These are then ranked and the top candidates are suggested to the user in order.
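The rank-and-suggest step for a single liked item might look like this (again a sketch with made-up data; the tie-breaking and `top_n` cutoff are my own choices):

```python
import numpy as np

# Toy likes matrix: rows = users, columns = items.
likes = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

def top_suggestions(item_index, likes, top_n=2):
    """Rank every other item by cosine similarity to `item_index`."""
    x = likes[:, item_index]
    scores = []
    for j in range(likes.shape[1]):
        if j == item_index:
            continue
        y = likes[:, j]
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        scores.append((j, 0.0 if denom == 0 else float(x @ y) / denom))
    scores.sort(key=lambda t: t[1], reverse=True)
    return [j for j, _ in scores[:top_n]]
```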
This will become prohibitively expensive quickly.
First off, the cosine of the angle needs to be computed for every pair of item vectors every so often.
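That all-pairs precomputation can at least be vectorized: normalize each item column, and the Gram matrix of the normalized matrix holds every pairwise cosine at once (toy data again; this doesn't change the O(items²) storage, only the constant factor):

```python
import numpy as np

# Toy likes matrix: rows = users, columns = items.
likes = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

# Normalize each column, then M^T M gives every cos(angle) in one shot.
norms = np.linalg.norm(likes, axis=0)
norms[norms == 0] = 1.0              # guard against items nobody liked
normalized = likes / norms
item_sims = normalized.T @ normalized  # shape: (n_items, n_items), symmetric
```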
Then, for every user, the suggestions based on their "likes" need to be sorted and presented every so often. That means scanning the database of stored cosines, grabbing every score where one item of the pair is one the user has liked, sorting that list, and presenting the top few to the user.
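In outline, that per-user lookup over a store of precomputed pair scores might look like the sketch below. The `pair_scores` dict, the item names, and the choice to keep the best score per candidate (rather than, say, summing scores) are all my own assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical precomputed pair scores, keyed by a frozenset of item ids.
pair_scores = {
    frozenset({"X", "A"}): 0.91,
    frozenset({"X", "B"}): 0.40,
    frozenset({"Y", "A"}): 0.30,
    frozenset({"Y", "C"}): 0.85,
}

def suggestions_for_user(liked, pair_scores, top_n=3):
    """Rank candidate items by their best score against any liked item."""
    best = defaultdict(float)
    for pair, score in pair_scores.items():
        if len(pair & liked) == 1:        # exactly one side is a liked item
            (candidate,) = pair - liked
            best[candidate] = max(best[candidate], score)
    ranked = sorted(best.items(), key=lambda t: t[1], reverse=True)
    return [item for item, _ in ranked[:top_n]]

print(suggestions_for_user({"X", "Y"}, pair_scores))  # -> ['A', 'C', 'B']
```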
For millions of users and thousands of items, this could grow way out of control.
Are there cheaper ways of doing this? Am I going about it algorithmically wrong? Or is it just something that has to be dealt with directly via caching/segmenting/etc.?