Saturday, March 20, 2010

Creating Spam Filter using Naive Bayes Classifier

Few months ago I gave a lecture to CS students about data mining. I decided to show how a spam filter can be built using simple data mining technique called naive bases classifier. It was an interactive lecture and I was surprised by the students' interest in the field and the questions that they asked.

Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestion regarding the types of features which should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc and they indeed were among the responses that I got. But I wasn't expecting them to suggest taking into account "case sensitivity", "color scheme", "co-0ccurance" and "spelling" of words in email body. These responses were from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.

Here are the slides: