Mining the Web

Tuesday, November 16, 2010

Using Language Models for Information Retrieval (Python)

Continuing my regard for Bayes' Theorem, I decided to write a small python program that will crawl a list of urls and then will allow the user to search that list. I used the query likelihood Language Model for information retrieval.

A good explanation of query likelihood can be found in the Manning's Introduction to Information Retrieval book. I used the example mentioned in the "Language models for information retrieval" chapter. Here is the excerpt from that chapter:

In the query likelihood model, we construct from each document d in the collection a language model Md. We rank documents by P(d | q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule , we have:

P(d | q) = P(q | d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored.

In this model, Zero probabilities of words can be a problem. This can be solved by smoothing. Smoothing not only avoids zero probability but also it provides a term weighting component. More information about smoothing can be found in Manning's book.

I implemented the query likelihood model in python. I used BeautifulSoap to parse the crawled urls. You can download the python code from here. Once you have downloaded the code you can then use the code as follows:

1. import LangModel

2. Define a list of urls you want to search:
u=['http://cnn.com','http://techcrunch.com']

3. Call the crawl function to crawl the list:
LangModel.crawl(u)

4. Type the query you want to search and press the return key

5. The output will show you a ranked list of urls and the score of each according to query likelihood model.

6. You can search the list again by using the search function:
LangModel.search('your query')

That is it. Another application of Bayes' Theorem!

Stay tuned for more python code :)

Sunday, November 14, 2010

Bayes' theorem : A Love Story

A few days ago, I saw the video of Hilary Mason presenting the history of machine learning. As far as I know, she has covered the significant developments in the field in the most simplistic, understandable and casual way possible. She shed some light on the fundamental math and algorithmic tools employed in the field.

Hilary suggested, that after the AI winter storm, the best thing that happened in computer science was when computer scientists started invading the field of statistics. That is they starting using probability theories to built application and solve problems. She went on to give an example of how Bayes Theorem was used (and can be used) to classify text, detect spam, etc. I can very well identify with her point.

When I started my MS in computer science, I was supposed to take courses like Software Design, Advance DB, Advance Operating in my first semester. I couldn't get in the Advance Operating Sys class and so I enrolled in Data Mining class. The data mining techniques used, the algorithms that solve complex problems and simply finding interesting patterns in raw data just blew me away. Needless to say, taking DM was probably the best decision of my academic career. I changed my concentration from "Software Engineering" to "Machine Intelligence", worked on making application become smarter (rather than just building a smart application to pass a course) and read more about interesting concepts than that was covered in the class.

My first DM project was building a Spam Filter using Bayes' Theorem. Even though I built much complex DM applications after that, it remains one of my favorite things that I have done. I remember that project presentation was an option to get extra credit (project executable and report however accounted for 40% of the grade). I was doing pretty well in that class but I really wanted to present my Spam Filter. I have never felt so excited about implementing and presenting a project, so I knew that this is something special. And hence, even today when I take on projects that require use of more sophisticated algorithms like SVM, etc. I still perform a quick test using Bayes' Theorem just to get some insight. Maybe it is not a correct approach but I have felt that it has helped me fine tune the algorithms I have to use.

Being a big fan of all things Bayes, I wanted the undergrad and grad students in my college to be excited about using statistics in their projects as well. And hence I gave short tutorial about creating a spam filter using naive bayes theorem during one of the ACM open house meetings. I did that because, I realized that not every CS student enrolled in data mining. The meeting was pretty successful and I tend to believe that I at least made a handful of students excited about using Bayes' theorem.

So if you are an undergard (or even grad) and you are thinking about what project to work on, I'll suggest that you look up some machine learning techniques and see how you can use them to make applications related to your course. There is no better feeling than seeing your application run through mountains of raw(unstructured) data and then providing you with amazing insights and solution.

Sunday, October 17, 2010

Text Classification using Naive Bayes Classifier

I received some emails related to my spam filter post. Some of them asked me to submit a code related to it. A very simple implementation of Spam Filter in Python can be found in Collective Intelligence (I highly recommend this book). However, I wrote a simple text classification application in C# which can be used to create a Spam Filter.

I used 'spamassasin' datasets for training and then tested the same datasets using the naive based classification. The results are quiet interesting. All of the 'ham' mails are classified as 'ham', however not all 'spam' emails are classified as 'spam'. The results for spam classification aren't that bad because as mentioned in the slides, it is ok to have misclassified a 'spam' email than a 'ham' email. Further, the classification model can be improved by including features such as letter case, html, terms co-occurrence, terms being considered for classification, etc.

You have to first train the system before testing it. So click the 'Train' button before clicking on 'Test'. The application requires two datasets: 'easy_ham_2' and 'spam_2' both of which can be found in spamassasin website. The two datasets are also used for testing. You can make changes in the code to test a different dataset.

I only used the top 15 words for classification. You can change that in the 'docprob' function that calculates the document 'spam' and 'ham' probabilities.

The output of testing the datasets is printed on the console.

This is just a proof-of-concept application and so it is not perfect. i didn't pay much attention on the design (UI) but UI doesn't have much to do with spam filtering functionality anyways. You can use other datasets from spamassasins website for testing the trained system.

Also this code can be used for text classification in general using as many classes as one wants. Any suggestions are more than welcomed.

Here is the complete project file.

Sunday, September 26, 2010

DT-Tree: A Semantic Representation of Scientific Papers

This year I have been really busy working on some research projects, hence the delay in blog posts. Recently, one of my research works got accepted at the 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010). The conference proceedings were published by IEEE-CS and you can find the link to paper here. I am really grateful to my advisor, Dr. Shawn X. Wang, whose guidance and support made me achieve my goals. The paper was about a new indexing structure for scientific documents that we developed. Following is the abstract of the paper:

With the tremendous growth in electronic publication, locating the most relevant references is becoming a challenging task. Most effective document indexing structures represent a document as a vector of very high dimensionality. It is well known that such a representation suffers from the curse of dimensionality. In this paper, we introduce DT-Tree (DocumentTerm-Tree) - a new structure for the representation of scientific documents. DT-Tree represents a document using the 50 most frequent terms in that specific document. These terms are grouped into a tree structure according to where they appear in the document, such as title, abstract, or section title, etc. The distance between two documents is calculated based on their DT-Trees. Two DTTrees are compared using Dice coefficient between the corresponding nodes of the trees. To verify the effectiveness of our similarity measure, we conducted experiments to cluster 150 documents in three categories, namely biology [1], chemistry [2-3] and physics [3]. The experimental results indicated 100% accuracy.

I am still working on some more research projects and will post more information about them here.

Saturday, March 20, 2010

Creating Spam Filter using Naive Bayes Classifier

Few months ago I gave a lecture to CS students about data mining. I decided to show how a spam filter can be built using simple data mining technique called naive bases classifier. It was an interactive lecture and I was surprised by the students' interest in the field and the questions that they asked.

Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestion regarding the types of features which should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc and they indeed were among the responses that I got. But I wasn't expecting them to suggest taking into account "case sensitivity", "color scheme", "co-0ccurance" and "spelling" of words in email body. These responses were from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.

Here are the slides:

Friday, January 29, 2010

Parsing Robots.txt File

Crawling is an essential part of search engine development. The more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages of the site by identifying those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, you can see what the robots.txt file for CNN looks like here.

While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.

You can find complete detail about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is a portion from this webpage:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent

The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

Using this description we can easily write the code to create a list of disallowed URLs. All we need to do is parse the robots.txt file and split the file first by using "User-agent" and then again by "Disallow". This will give us the list which we can then use in our crawler. We just need to check this list every time we are about to crawl a page.

Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
class Program
{
public static HttpWebRequest req;
public static HttpWebResponse res;
static Stream resStream;

static void Main(string[] args)
{
String baseUrl = "http://www.cnn.com/";
baseUrl += "/robots.txt";

getDisallowedUrls(baseUrl);
}

static void getDisallowedUrls(string baseUrl)
{
if (isValidUrl(baseUrl))
{
urlOpen();
}

String RobotTxtContent = read();

List disallowed = new List(); // List that holds Urls which shouldn't be crawled
String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:");
String userAgents = "";
foreach (String agent in user_agents)
{
if (agent.Trim().StartsWith("*"))
{
userAgents = agent.Trim().Substring(1);
}
}

String[] disallow = Regex.Split(userAgents, "Disallow:");

foreach (String item in disallow)
{
if (item != "\n")
{
disallowed.Add(item.Trim());
Console.WriteLine(baseUrl + item.Trim());
}
}

Console.ReadLine();

}

public static String read()
{
StreamReader sr = new StreamReader(resStream);
String strText = sr.ReadToEnd();
return strText;
}

public static void urlOpen()
{
resStream = res.GetResponseStream();
}

public static bool isValidUrl(String url)
{
try
{
req = (HttpWebRequest)HttpWebRequest.Create(url);
res = (HttpWebResponse)req.GetResponse();
return (res.StatusCode == HttpStatusCode.OK);
}
catch (Exception ex)
{
Console.WriteLine("Not a Valid URL:" + ex.Message + " - " + url);
return false;
}
}
}
}

Monday, November 23, 2009

Duplicate Detection using MD5 and Jaccard Coefficient in C#

Duplicate documents are significant issue in context of the Web. Detecting near duplicate documents in a large data set, like the web, is a long standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, MD5 algorithm and Jaccard Coefficient using C#.

Method
One of the simplest and efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as shingle) is a sequence of consecutive words of size 'n'. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:

- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school

These shingles can then be converted to a hash number using MD5 algorithm. Once we have encoded n-grams of the entire document we can then compare them with encoded n-grams of another documents using Jaccard Coefficient.

Code
Using Linq you can perform 'union' and 'intersection' functions in order to calculate Jaccard Coefficient. Here is the code for creating n-grams from 2 documents and calculating Jaccard Coefficient:

public double Detect(string page1, string page2)
{
List s1 = CreateShingles(page1);
List s2 = CreateShingles(page2);

var commonelements = s1.Intersect(s2);
var totalElements = s1.Union(s2);

double jaccardCoef = (Convert.ToDouble(commonelements.Count())) / (Convert.ToDouble(totalElements.Count()));

return jaccardCoef;
}

You can implement your own MD5 algorithm but many languages provide built-in functionality to use MD-5. In C# you can use MD5 function using System.Security.Cryptography. Here is the code:

public string CalculateMD5Hash(string input)
{
// step 1, calculate MD5 hash from input
MD5 md5 = System.Security.Cryptography.MD5.Create();
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hash = md5.ComputeHash(inputBytes);

// step 2, convert byte array to hex string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hash.Length; i++) { sb.Append(hash[i].ToString("X2")); } return sb.ToString(); }

The 'input' here is the 3-gram phrase. Here is the complete Code.
In case if you just like to see how it works, here is the executable file.

The program just look for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. Since some text ads change on every refresh the duplication percentage may differ by a few decimal points every time the two pages are compared.

A better way is to extract the main content from the web pages and then extract individual words from it.