Sunday, October 17, 2010

Text Classification using Naive Bayes Classifier

I received some emails related to my spam filter post. Some of them asked me to submit a code related to it. A very simple implementation of Spam Filter in Python can be found in Collective Intelligence (I highly recommend this book). However, I wrote a simple text classification application in C# which can be used to create a Spam Filter.

I used 'spamassasin' datasets for training and then tested the same datasets using the naive based classification. The results are quiet interesting. All of the 'ham' mails are classified as 'ham', however not all 'spam' emails are classified as 'spam'. The results for spam classification aren't that bad because as mentioned in the slides, it is ok to have misclassified a 'spam' email than a 'ham' email. Further, the classification model can be improved by including features such as letter case, html, terms co-occurrence, terms being considered for classification, etc.

You have to first train the system before testing it. So click the 'Train' button before clicking on 'Test'. The application requires two datasets: 'easy_ham_2' and 'spam_2' both of which can be found in spamassasin website. The two datasets are also used for testing. You can make changes in the code to test a different dataset.

I only used the top 15 words for classification. You can change that in the 'docprob' function that calculates the document 'spam' and 'ham' probabilities.

The output of testing the datasets is printed on the console.

This is just a proof-of-concept application and so it is not perfect. i didn't pay much attention on the design (UI) but UI doesn't have much to do with spam filtering functionality anyways. You can use other datasets from spamassasins website for testing the trained system.

Also this code can be used for text classification in general using as many classes as one wants. Any suggestions are more than welcomed.

Here is the complete project file.