Mining the Web: spam filter

Sunday, October 17, 2010

Text Classification using Naive Bayes Classifier

I received some emails about my spam filter post, and some of them asked me to share the code. A very simple implementation of a spam filter in Python can be found in Programming Collective Intelligence (I highly recommend this book). However, I wrote a simple text classification application in C# which can be used to create a spam filter.

I used the SpamAssassin datasets for training and then tested on the same datasets using naive Bayes classification. The results are quite interesting. All of the 'ham' mails are classified as 'ham'; however, not all 'spam' emails are classified as 'spam'. The results for spam classification aren't that bad because, as mentioned in the slides, it is better to misclassify a 'spam' email than a 'ham' email. Further, the classification model can be improved by including features such as letter case, HTML, term co-occurrence, the terms being considered for classification, etc.
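To make the asymmetry between the two kinds of errors concrete, here is a minimal Python sketch (not taken from the C# project). The function names and the `spam_threshold` value are my own assumptions for illustration: a message is labelled 'spam' only when its spam score clearly dominates its ham score, and the two kinds of mistakes are counted separately.

```python
def classify(spam_score, ham_score, spam_threshold=3.0):
    """Label a message 'spam' only when the spam score clearly dominates.

    spam_score and ham_score are assumed to be non-negative class scores
    produced by some classifier. spam_threshold is a hypothetical knob:
    the spam score must be that many times larger than the ham score.
    """
    if spam_score > spam_threshold * ham_score:
        return "spam"
    return "ham"


def error_counts(results):
    """Count both kinds of mistakes from (actual, predicted) label pairs."""
    ham_as_spam = sum(1 for actual, predicted in results
                      if actual == "ham" and predicted == "spam")
    spam_as_ham = sum(1 for actual, predicted in results
                      if actual == "spam" and predicted == "ham")
    return {"ham_as_spam": ham_as_spam, "spam_as_ham": spam_as_ham}
```

Raising the threshold trades more missed spam for fewer good emails lost, which is the direction you usually want for a mail filter.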

You have to train the system before testing it, so click the 'Train' button before clicking on 'Test'. The application requires two datasets, 'easy_ham_2' and 'spam_2', both of which can be found on the SpamAssassin website. The same two datasets are also used for testing. You can make changes in the code to test a different dataset.

I only used the top 15 words for classification. You can change that in the 'docprob' function that calculates the document 'spam' and 'ham' probabilities.
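For readers who want to see the idea without opening the project, here is a rough Python sketch of a `docprob`-style function. It is not a port of the C# code. In particular, I am assuming that "top" means the words whose probabilities differ most between the two classes, and the class name `NaiveBayes`, the tokenizer and the add-one smoothing are illustrative choices of mine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one training document under 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def word_prob(self, word, label):
        # Add-one smoothing so unseen words never get probability zero.
        return (self.word_counts[label][word] + 1) / (self.doc_counts[label] + 2)

    def docprob(self, text, label, top_n=15):
        """Log-probability score of the document for the given label,
        using only the top_n most discriminating words (an assumption)."""
        other = "ham" if label == "spam" else "spam"

        def strength(word):
            ratio = self.word_prob(word, label) / self.word_prob(word, other)
            return abs(math.log(ratio))

        # Sort alphabetically first so ties are broken the same way each call.
        ranked = sorted(sorted(tokenize(text)), key=strength, reverse=True)
        total_docs = sum(self.doc_counts.values())
        # Smoothed class prior, then one factor per selected word.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for word in ranked[:top_n]:
            score += math.log(self.word_prob(word, label))
        return score

    def classify(self, text, top_n=15):
        return max(("ham", "spam"),
                   key=lambda label: self.docprob(text, label, top_n))
```

Changing `top_n` here plays the same role as changing the 15 in the application: fewer words make the decision depend on the strongest signals only, more words let weaker evidence accumulate.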

The output of testing the datasets is printed on the console.

This is just a proof-of-concept application, so it is not perfect. I didn't pay much attention to the design (UI), but the UI doesn't have much to do with the spam filtering functionality anyway. You can use other datasets from the SpamAssassin website to test the trained system.

This code can also be used for text classification in general, with as many classes as you want. Any suggestions are more than welcome.

Here is the complete project file.

Saturday, March 20, 2010

Creating Spam Filter using Naive Bayes Classifier

A few months ago I gave a lecture to CS students about data mining. I decided to show how a spam filter can be built using a simple data mining technique called the naive Bayes classifier. It was an interactive lecture, and I was surprised by the students' interest in the field and the questions that they asked.

Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestions regarding the types of features which should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc., and they indeed were among the responses that I got. But I wasn't expecting them to suggest taking into account "case sensitivity", "color scheme", "co-occurrence" and "spelling" of words in the email body. These responses were from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.
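A few of the suggested feature types are easy to sketch in code. The Python snippet below is only an illustration, not something from the lecture: the function name `extract_features`, the feature prefixes, the simple email pattern and the 30% uppercase cutoff are all assumptions of mine. It covers individual words, email addresses, letter case and co-occurrence of adjacent words.

```python
import re


def extract_features(body):
    """Return a set of string features for one email body."""
    features = set()
    words = re.findall(r"[A-Za-z']+", body)
    lowered = [w.lower() for w in words]

    # Individual words, case-folded.
    for word in lowered:
        features.add("word:" + word)

    # Email addresses, matched with a deliberately simple pattern.
    for address in re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", body):
        features.add("email:" + address.lower())

    # Letter case: flag messages where many words are written in all caps.
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    if words and len(shouted) / len(words) > 0.3:
        features.add("case:mostly_uppercase")

    # Co-occurrence: pairs of adjacent words.
    for first, second in zip(lowered, lowered[1:]):
        features.add("pair:" + first + " " + second)

    return features
```

Each feature is just a string, so the same counting scheme used for single words in a naive Bayes classifier can be applied to all of them.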

Here are the slides:

Sunday, February 17, 2008

IR or Spam Filter

I haven't been updating my blog lately. The start of the spring 2008 semester has something to do with that, but the real reason is that I have been busy researching some interesting topics for my AI project.

Last semester I wanted to create an Image Spam Filter for my "Data mining & Pattern Recognition" class. My idea was to apply OCR techniques to capture the text from the image, after which it really becomes a text spam-filter problem. I was thinking of using neural networks for image text recognition and Bayes' theorem for spam filtering. But my idea was unanimously rejected by my group, and instead we developed a Handwriting Recognition system (which was still a better project to work on than our previous plan to tell time by reading an image of an analogue clock).

So now I have a chance to have another go at my Image Spam Filter project. But I am still debating it, because I am also fond of Information Retrieval problems. I am thinking of working on a search engine and automatic indexing of a technical book/manual. Since this is an individual project, I can do whatever I want (of course, my professor has to approve it).

When I think about it, I find both IR and spam filtering interesting. So it is not a clash of interests but more a question of what I will gain from each and what I want to do in the future.

While I try to analyze this, feel free to give me your suggestions. Maybe you can give me an insight which may eventually help me reach a decision.