Identifying document category

I'm currently working on a mobile web project and I need to categorize various articles. For example, the categories could be: Politics, Sports, Weather, Movies. Given a document containing an article, the system has to automatically identify which category the article belongs to. Since I'm new to AI, I'm not sure how to start. It would be of great help if someone could suggest how to proceed.

Thank you,
Eshwar
Bayesian filters are one great approach. These are what people use to train email programs to recognise 'spam' and 'not-spam', but you can also train them to recognise any arbitrary categories, such as politics, sports, etc. Obviously they require a bit of training in the first place, but once that's done, they require little or no maintenance from then on.

You may want to look at open source applications like POPFile to see how it's used in email, although I believe the code is in Perl so it's perhaps not easy to understand!

For Python, you can use the Reverend library. You'll see some of the ideas on that page are quite close to what you want to do.
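From memory, Reverend's API boils down to a couple of calls; the sketch below shows the rough shape of it, but the method names are from memory, so check the library's own examples before relying on them.

```python
# Rough sketch of training and classifying with Reverend (names from memory,
# so treat this as an illustration rather than exact API).
from reverend.thomas import Bayes

guesser = Bayes()

# Train each category on example text pulled from articles you already have.
guesser.train('politics', 'the president signed the bill into law')
guesser.train('sports', 'the home team won the championship match')
guesser.train('weather', 'heavy rain and strong winds are expected tomorrow')
guesser.train('movies', 'the film opens in cinemas this weekend')

# guess() returns the candidate categories with their scores, best first.
print(guesser.guess('parliament will debate the new tax bill'))
```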
I would try to use a technique similar to Bayesian spam filtering.

You basically start with a prior probability that the article belongs to each of your categories. If 40% of the articles are about Politics, and the other categories are 20% each, start with a vector (0.4,0.2,0.2,0.2) to represent your current estimates of the probability of the article being in each category.

Now every time you read a word, you use Bayes's formula to update your probabilities. You need to know how frequently each word happens in each category. For instance, the probability of the next word being "president" might be 0.000129 in Politics, 0.000008 in Sports, etc. The update to the probabilities looks like this:

P'(Politics) = P(Politics)*P("president"|Politics)/P("president")

Notice that in the end you just want to compare probabilities, not estimate their exact value. Because all four probabilities are going to be divided by P("president"), you can skip that part.

What if a word has never appeared before in a particular category? Well, with the formula above, you would be multiplying by 0, but it is perfectly possible that an unknown word would appear in an article, so you need to establish a minimum probability to use in the formula, say 10^-9 (you probably should test several values of this number, but I doubt the system will be very sensitive to it).
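Here is a small Python sketch of that update loop, with made-up priors and word-frequency tables just to show the shape of it (in practice the tables would come from counting words in your training articles):

```python
# Prior probability of each category, estimated from the training collection.
priors = {'politics': 0.4, 'sports': 0.2, 'weather': 0.2, 'movies': 0.2}

# P(word | category), estimated by counting words in the training articles.
# The numbers here are made up purely for illustration.
word_given_category = {
    'politics': {'president': 0.000129, 'election': 0.000200},
    'sports':   {'president': 0.000008, 'match':    0.000300},
    'weather':  {'rain':      0.000500},
    'movies':   {'film':      0.000400},
}

FLOOR = 1e-9  # minimum probability for a word never seen in a category


def classify(article_words, priors, word_given_category, floor=FLOOR):
    # Start from the priors and multiply in P(word | category) for each word.
    # The shared denominator P(word) is dropped because it doesn't change the ranking.
    scores = dict(priors)
    for word in article_words:
        for category in scores:
            scores[category] *= word_given_category[category].get(word, floor)
    return max(scores, key=scores.get)


print(classify(['president', 'election'], priors, word_given_category))  # -> 'politics'
```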

You will see that for long articles, the numbers you are keeping track of become ridiculously small. You may actually underflow whatever floating-point type you are using, and end up with 0. A common solution for this consists of keeping track of the logarithm of the probability, instead of the probability itself. The formula now looks like this:

log(P'(Politics)) = log(P(Politics)) + log(P("president"|Politics))

The quantity you are keeping track of now is called "information" (except for a sign flip).
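For comparison, here is the same loop in log space (reusing the made-up tables from the sketch above), which sidesteps the underflow problem:

```python
import math


def classify_log(article_words, priors, word_given_category, floor=1e-9):
    # Track log-probabilities so long articles don't underflow to zero.
    scores = {c: math.log(p) for c, p in priors.items()}
    for word in article_words:
        for category in scores:
            scores[category] += math.log(word_given_category[category].get(word, floor))
    return max(scores, key=scores.get)
```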

Here is another way to think about the whole process, which will allow us some generalization. If the initial probabilities are equal, the entire method I just described is roughly equivalent to this: Compute the frequencies of words in each category and create a Huffman encoding for each category (the length of each word's code is roughly -log2(frequency)). Encode the article with each of the Huffman encodings and classify the article as belonging to the class that resulted in the shortest message.

Now you can replace Huffman encoding by any other compression technique, and you get a different classification system. You could go about it like this: take all the articles in Politics in some random order, compress them using your favorite compression algorithm and measure the length of the result. Now you add the article you are trying to classify to the collection and see how much longer the compressed representation is. Do this for each class, and pick the class that resulted in the smallest increment in length.

In practice, you wouldn't compress the entire collection of articles every time. Instead, you could keep whatever state the compression algorithm has at the end of compressing the collection, and just compress the new article starting from that state.

If you want to recover the ability to have prior probabilities, all you have to do is subtract the base-2 logarithm of the prior probability from the length computed with the method above (i.e., add -log2(prior), so unlikely categories pay a small extra penalty), and that should basically give you the correct result.
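A crude way to try the compression idea with Python's zlib (any general-purpose compressor would do); the prior adjustment at the end is the -log2 trick just mentioned:

```python
import math
import zlib


def compressed_size(text):
    # Size in bytes of the zlib-compressed text.
    return len(zlib.compress(text.encode('utf-8'), 9))


def classify_by_compression(article, training_texts, priors):
    # training_texts maps each category to one big string of all its training articles.
    best_category, best_cost = None, float('inf')
    for category, corpus in training_texts.items():
        # Extra bytes needed when the new article is appended to this category's corpus.
        increment = compressed_size(corpus + article) - compressed_size(corpus)
        # Convert to bits and charge -log2(prior) bits for naming the category itself.
        cost = 8 * increment - math.log2(priors[category])
        if cost < best_cost:
            best_category, best_cost = category, cost
    return best_category
```

This recompresses each category's corpus for every new article; to speed it up you could keep one zlib compression object per category, feed it the training corpus once, and copy() it before feeding the new article, along the lines of the "keep the compressor state" idea above.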

Well, that's what I would try.

[EDIT: Kylotan beat me to it, but I hope you still find the explanation useful.]

[Edited by - alvaro on March 5, 2007 9:58:15 AM]

Just a slight correction...

Quote:Original post by alvaro
The update to the probabilities looks like this:

P'(Politics) = P(Politics)*P("president"|Politics)/P("president")


The quantity on the left hand side will be a conditional probability and is the a posteriori estimate of the probability of the event 'Politics' given that we have observed the word 'president'. Thus, it would be written
P(Politics | 'president') = P(Politics) * P('president' | Politics) / P('president')


Cheers,

Timkin
Quote:You will see that for long articles, the numbers you are keeping track of become ridiculously small. You may actually underflow whatever floating-point type you are using, and end up with 0. A common solution for this consists of keeping track of the logarithm of the probability, instead of the probability itself. The formula now looks like this:

Couldn't you also limit yourself to the first paragraph or X bytes for typical articles?
Quote:Original post by Timkin
The quantity on the left hand side will be a conditional probability and is the a posteriori estimate of the probability of the event 'Politics' given that we have observed the word 'president'. Thus, it would be written
P(Politics | 'president') = P(Politics) * P('president' | Politics) / P('president')


Yes, that's how the formula is usually written. I thought it would be more clear to think about it as "our new estimate of the probability of the category being Politics". At least that's how I think about it.

Thank you everyone for sharing your ideas.

I'll try to work with your guidelines and come back to you in case of difficulties.

Once again, thank you guys!

Cheers,
Eshwar