Mining the Web: Naive Bayes Classifier

Sunday, October 17, 2010

Text Classification using Naive Bayes Classifier

I received some emails about my spam filter post, and several of them asked me to share the code. A very simple Python implementation of a spam filter can be found in Programming Collective Intelligence (I highly recommend this book). However, I wrote a simple text classification application in C# which can be used to create a spam filter.

I used the SpamAssassin datasets for training and then tested on the same datasets using the naive Bayes classifier. The results are quite interesting. All of the 'ham' mails are classified as 'ham'; however, not all 'spam' emails are classified as 'spam'. The results for spam classification aren't that bad because, as mentioned in the slides, it is better to misclassify a 'spam' email than a 'ham' email. Further, the classification model can be improved by including features such as letter case, HTML, term co-occurrence, the choice of terms considered for classification, etc.
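To make the idea concrete, here is a minimal Python sketch of a naive Bayes spam/ham classifier. It is not the C# application from this post: the class name, the tokenizer, the add-one smoothing and the `spam_margin` parameter are all assumptions made for illustration. The margin is one simple way to lean towards 'ham' when the two scores are close, in the spirit of preferring a missed spam over a lost ham.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Illustrative two-class ('spam'/'ham') naive Bayes classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_score(self, text, label):
        """log P(label) + sum of log P(word | label), with add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in tokenize(text):
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text, spam_margin=1.0):
        """Return 'spam' only if spam beats ham by spam_margin (log units)."""
        spam = self.log_score(text, "spam")
        ham = self.log_score(text, "ham")
        return "spam" if spam - ham > spam_margin else "ham"


nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("win money now cheap offer", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday with the team", "ham")
print(nb.classify("buy cheap pills now"))
print(nb.classify("agenda for the monday meeting"))
```

Raising `spam_margin` makes the filter more cautious about calling something spam; setting it to 0 gives the plain "pick the higher score" rule. Both labels must have at least one training document before `classify` is called.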

You have to train the system before testing it, so click the 'Train' button before clicking 'Test'. The application requires two datasets, 'easy_ham_2' and 'spam_2', both of which can be found on the SpamAssassin website. The same two datasets are also used for testing. You can change the code to test a different dataset.

I only used the top 15 words for classification. You can change that in the 'docprob' function that calculates the document 'spam' and 'ham' probabilities.
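As a rough illustration of limiting the calculation to the top words, here is a small Python sketch. It is not the 'docprob' function from the C# project. It assumes that "top" means the words whose spam probability is furthest from 0.5, that per-word spam probabilities are already available in a dictionary, and that spam and ham have equal priors; the names `top_words`, `doc_prob` and `spam_prob` are made up for this example.

```python
import math


def top_words(words, spam_prob, n=15):
    """Pick the n distinct words whose spam probability is furthest from 0.5.

    Words missing from spam_prob are treated as neutral (0.5).
    """
    unique = sorted(set(words))  # sort first so ties break deterministically
    return sorted(unique,
                  key=lambda w: abs(spam_prob.get(w, 0.5) - 0.5),
                  reverse=True)[:n]


def doc_prob(words, spam_prob, n=15):
    """Combine the top n word probabilities into P(spam | document).

    Assumes equal priors for spam and ham and independent words.
    """
    log_spam = 0.0
    log_ham = 0.0
    for w in top_words(words, spam_prob, n):
        # Clamp so that a probability of 0 or 1 never reaches math.log(0)
        p = min(max(spam_prob.get(w, 0.5), 0.01), 0.99)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the two log scores back into a normalised probability
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical per-word spam probabilities, for illustration only
spam_prob = {"viagra": 0.99, "free": 0.9, "meeting": 0.05, "the": 0.5}
print(top_words(["the", "free", "viagra", "the"], spam_prob, n=2))
print(doc_prob(["free", "viagra", "the"], spam_prob))
```

Changing `n` plays the same role as changing the number 15 in the application: fewer words means the decision rests on the strongest signals only, more words lets weaker evidence accumulate.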

The output of testing the datasets is printed on the console.

This is just a proof-of-concept application, so it is not perfect. I didn't pay much attention to the design (UI), but the UI doesn't have much to do with the spam filtering functionality anyway. You can use other datasets from the SpamAssassin website to test the trained system.

This code can also be used for text classification in general, with as many classes as one wants. Any suggestions are more than welcome.

Here is the complete project file.

Saturday, March 20, 2010

Creating Spam Filter using Naive Bayes Classifier

A few months ago I gave a lecture on data mining to CS students. I decided to show how a spam filter can be built using a simple data mining technique called the naive Bayes classifier. It was an interactive lecture, and I was surprised by the students' interest in the field and by the questions they asked.

Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestions regarding the types of features that should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc., and these were indeed among the responses I got. But I wasn't expecting them to suggest taking into account the "case sensitivity", "color scheme", "co-occurrence" and "spelling" of words in the email body. These responses came from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.
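A few of the features the students suggested can be sketched in Python as follows. This is only an illustration and is not taken from the slides: the function name, the feature prefixes, the three-letter minimum word length and the 30% upper-case threshold are all arbitrary assumptions. Color scheme and spelling are left out because they need HTML parsing and a dictionary.

```python
import re


def extract_features(body):
    """Return a set of string features for one email body."""
    features = set()

    # Email addresses, kept whole and lower-cased
    for address in re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", body):
        features.add("addr:" + address.lower())

    # Individual words, lower-cased (letters only, at least 3 characters)
    raw_words = re.findall(r"[A-Za-z]{3,}", body)
    words = [w.lower() for w in raw_words]
    features.update("word:" + w for w in words)

    # Co-occurrence: pairs of neighbouring words among the kept words
    for first, second in zip(words, words[1:]):
        features.add("pair:" + first + " " + second)

    # Letter case: flag bodies where over 30% of the words are all upper-case
    shouted = [w for w in raw_words if w.isupper()]
    if raw_words and len(shouted) / len(raw_words) > 0.3:
        features.add("meta:mostly_uppercase")

    return features


for feature in sorted(extract_features("FREE MONEY NOW mail bob@Example.com")):
    print(feature)
```

Each feature is just a string, so the resulting set can be fed to a naive Bayes classifier in place of plain words.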

Here are the slides: