The problems with rigid categorization - sorting content items into distinct categories as 'containers' - are fairly well-known:
- How do you decide what categories there should be? GDNet only creates new forums when there's sufficient traffic in one area to warrant it; we do this for good reason, but until the traffic reaches critical mass, the category on a topic isn't as precise as it could be.
- How do you decide which category something should be in? When you've got categories as vaguely defined as 'Game Programming' and 'General Programming,' it's easy to see how people can get confused.
- What do you do when a content item should appear in more than one category? And what if it should appear in each category to an unequal extent?
- How do categories relate to one another? If something in one category is commonly in another category, perhaps they should be nested? If something is in the nested category, is it always also in the parent category?
A different approach is flexible category annotations, or 'tags.' Instead of viewing categories as containers that content items are sorted into, they're viewed as indexes into the content pool, fuzzy sets that describe the data rather than housing it.
What am I telling you this for? It's pretty well-known stuff by now, I guess. I'm bringing it up because over the past few days I've been working mostly on the tagging and search engines for V5.
The tagging engine has a pretty simple set of responsibilities:
- Store and retrieve the tags associated by a user with a given resource.
- Calculate some set of 'aggregated' tags for a resource, using the tags applied to that resource by all users.
- Find the resources most relevant to a tag or set of tags.
The implementation I've written so far is a naive one, but it'll suffice for the time being. Aggregation is simply the average of all user-applied tag weights: crude, but open to tweaking later. Finding the most relevant resources is little more than a SELECT query that scores relevance by the mean squared error between each tag set and the supplied search tags, with a lower error meaning a closer match. There are problems, but they can be fixed later.
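To make the averaging and scoring concrete, here is a minimal in-memory sketch in Python. It is not the V5 code: the real lookup is described as a SELECT query, and the function names, the treatment of a missing tag as weight 0, and the choice to measure error only over the tags in the query are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(user_tagsets):
    """Average each tag's weight across all users' tag sets.

    Assumption: a user who did not apply a tag contributes weight 0 to it.
    """
    if not user_tagsets:
        return {}
    totals = defaultdict(float)
    for tagset in user_tagsets:
        for tag, weight in tagset.items():
            totals[tag] += weight
    count = len(user_tagsets)
    return {tag: total / count for tag, total in totals.items()}


def mean_squared_error(resource_tags, query_tags):
    """Mean squared difference in weight, taken over the query's tags only.

    Assumption: a tag missing from the resource counts as weight 0.
    """
    if not query_tags:
        return 0.0
    squared = [
        (resource_tags.get(tag, 0.0) - weight) ** 2
        for tag, weight in query_tags.items()
    ]
    return sum(squared) / len(squared)


def most_relevant(aggregated_by_uri, query_tags, limit=10):
    """Return resource URIs ordered from lowest error (best match) upward."""
    scored = [
        (mean_squared_error(tags, query_tags), uri)
        for uri, tags in aggregated_by_uri.items()
    ]
    scored.sort()
    return [uri for _, uri in scored[:limit]]
```

Scoring only over the query's tags means a resource is not penalised for carrying extra tags the searcher never asked about; scoring over the union of both tag sets would be an equally plausible reading.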
One nice trick resulting from the RESTful schema for the site is that each resource has a nice, clear URI - ideal for using as a key. So each tagset is the association of a set of (Tag, Weight) pairs with a Uri. The result is completely content-agnostic; the tagging engine knows nothing about the kinds of content the site offers.
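As a rough illustration of that shape, the sketch below keys everything on the resource's URI and never inspects what the URI points at. The class name, method names, and example URIs are hypothetical, and a plain dictionary stands in for whatever storage the real engine uses.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TagStore:
    """Hypothetical content-agnostic store: URI -> user -> {tag: weight}."""

    _by_uri: Dict[str, Dict[str, Dict[str, float]]] = field(default_factory=dict)

    def set_tags(self, uri: str, user: str, tags: Dict[str, float]) -> None:
        # Replace whatever this user previously applied to this resource.
        self._by_uri.setdefault(uri, {})[user] = dict(tags)

    def get_tags(self, uri: str, user: str) -> Dict[str, float]:
        # Unknown URIs or users simply yield an empty tag set.
        return dict(self._by_uri.get(uri, {}).get(user, {}))

    def all_user_tags(self, uri: str) -> Dict[str, Dict[str, float]]:
        # Every user's tag set for one resource, ready to be aggregated.
        return {user: dict(tags) for user, tags in self._by_uri.get(uri, {}).items()}


store = TagStore()
store.set_tags("/forums/topic/123", "alice", {"physics": 1.0, "c++": 0.5})
store.set_tags("/articles/42", "alice", {"physics": 0.8})
```

The forum thread and the article are stored identically; nothing in the store knows or cares which kind of content sits behind each URI.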
The tagging engine's last responsibility - finding resources - is obviously closely related to the search engine. Not all searches are tag-related, though: Active Topics is a search for all discussion threads updated in the past 24 hours, and it's easy to imagine other searches based on the author of the content or similar. So, there is a separate search service that stores, maintains, and performs all saved and transient searches, using the tagging engine when appropriate.
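One way to picture that split is sketched below, under several assumptions: the `Thread` record, the `SearchService` class and its method names are invented for illustration, saved searches are modelled as plain callables, and the tagging engine is reduced to a single injected function that maps search tags to URIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List


@dataclass
class Thread:
    uri: str
    author: str
    updated: datetime


class SearchService:
    """Hypothetical search service that delegates only tag queries."""

    def __init__(self, threads: List[Thread],
                 find_by_tags: Callable[[Dict[str, float]], List[str]]):
        self._threads = threads
        self._find_by_tags = find_by_tags  # stand-in for the tagging engine
        self._saved: Dict[str, Callable[["SearchService", datetime], List[str]]] = {}

    def active_topics(self, now: datetime) -> List[str]:
        # Threads updated in the past 24 hours; no tags involved.
        cutoff = now - timedelta(hours=24)
        return [t.uri for t in self._threads if t.updated >= cutoff]

    def by_author(self, author: str) -> List[str]:
        return [t.uri for t in self._threads if t.author == author]

    def by_tags(self, tags: Dict[str, float]) -> List[str]:
        # The only search kind that is handed off to the tagging engine.
        return self._find_by_tags(tags)

    def save(self, name: str,
             search: Callable[["SearchService", datetime], List[str]]) -> None:
        self._saved[name] = search

    def run_saved(self, name: str, now: datetime) -> List[str]:
        return self._saved[name](self, now)
```

A saved search such as Active Topics would then be registered with something like `service.save("Active Topics", lambda s, now: s.active_topics(now))`, while a transient search just calls the relevant method directly.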