Sunday, September 26, 2010

DT-Tree: A Semantic Representation of Scientific Papers

This year I have been really busy working on some research projects, hence the delay in blog posts. Recently, one of my research papers was accepted at the 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010). The conference proceedings were published by the IEEE Computer Society, and you can find the link to the paper here. I am really grateful to my advisor, Dr. Shawn X. Wang, whose guidance and support made this work possible. The paper is about a new indexing structure for scientific documents that we developed. Following is the abstract of the paper:

With the tremendous growth in electronic publication, locating the most relevant references is becoming a challenging task. Most effective document indexing structures represent a document as a vector of very high dimensionality. It is well known that such a representation suffers from the curse of dimensionality. In this paper, we introduce the DT-Tree (Document-Term Tree), a new structure for the representation of scientific documents. A DT-Tree represents a document using the 50 most frequent terms in that specific document. These terms are grouped into a tree structure according to where they appear in the document, such as the title, abstract, or section titles. The distance between two documents is calculated based on their DT-Trees: two DT-Trees are compared using the Dice coefficient between the corresponding nodes of the trees. To verify the effectiveness of our similarity measure, we conducted experiments to cluster 150 documents in three categories, namely biology [1], chemistry [2-3] and physics [3]. The experimental results indicated 100% accuracy.
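For readers unfamiliar with the measure, the Dice coefficient between two term sets A and B is 2|A ∩ B| / (|A| + |B|). The following C# snippet is only a minimal sketch of that standard formula applied to two sets of terms (assuming System.Collections.Generic and System.Linq are imported); it is not the node-comparison code from the paper:

public static double DiceCoefficient(IEnumerable<string> termsA, IEnumerable<string> termsB)
{
    // Dice = 2 * |A intersect B| / (|A| + |B|)
    var setA = new HashSet<string>(termsA);
    var setB = new HashSet<string>(termsB);
    if (setA.Count + setB.Count == 0)
        return 0.0;
    int common = setA.Intersect(setB).Count();
    return (2.0 * common) / (setA.Count + setB.Count);
}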

I am still working on some more research projects and will post more information about them here.

Saturday, March 20, 2010

Creating Spam Filter using Naive Bayes Classifier

A few months ago I gave a lecture to CS students about data mining. I decided to show how a spam filter can be built using a simple data mining technique called the Naive Bayes classifier. It was an interactive lecture, and I was surprised by the students' interest in the field and the questions they asked.

Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestions regarding the types of features that should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc., and those were indeed among the responses I got. But I wasn't expecting them to suggest taking into account the "case sensitivity", "color scheme", "co-occurrence", and "spelling" of words in the email body. These responses came from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.
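The slides walk through the calculations by hand, but for readers who prefer code, here is a minimal C# sketch of the same idea. It is not the example from the lecture; the training emails and the add-one smoothing are just illustrative assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesSpamSketch
{
    static Dictionary<string, int> spamCounts = new Dictionary<string, int>();
    static Dictionary<string, int> hamCounts = new Dictionary<string, int>();
    static int spamEmails = 0, hamEmails = 0;

    static void Train(string email, bool isSpam)
    {
        // count how often each word appears in spam and in ham
        var counts = isSpam ? spamCounts : hamCounts;
        if (isSpam) spamEmails++; else hamEmails++;
        foreach (string w in email.ToLower().Split(' '))
            counts[w] = counts.ContainsKey(w) ? counts[w] + 1 : 1;
    }

    static bool IsSpam(string email)
    {
        // Naive Bayes: compare P(spam) times the product of P(word|spam) with
        // the same quantity for ham, using logs to avoid underflow and
        // add-one smoothing for unseen words
        double logSpam = Math.Log((double)spamEmails / (spamEmails + hamEmails));
        double logHam = Math.Log((double)hamEmails / (spamEmails + hamEmails));
        int spamTotal = spamCounts.Values.Sum(), hamTotal = hamCounts.Values.Sum();
        int vocabulary = spamCounts.Keys.Union(hamCounts.Keys).Count();

        foreach (string w in email.ToLower().Split(' '))
        {
            int sc = spamCounts.ContainsKey(w) ? spamCounts[w] : 0;
            int hc = hamCounts.ContainsKey(w) ? hamCounts[w] : 0;
            logSpam += Math.Log((sc + 1.0) / (spamTotal + vocabulary));
            logHam += Math.Log((hc + 1.0) / (hamTotal + vocabulary));
        }
        return logSpam > logHam;
    }

    static void Main()
    {
        Train("win money now", true);               // made-up spam example
        Train("meeting notes for tomorrow", false); // made-up ham example
        Console.WriteLine(IsSpam("win free money") ? "spam" : "ham");
    }
}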

Here are the slides:

Friday, January 29, 2010

Parsing Robots.txt File

Crawling is an essential part of search engine development. The more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages of the site by identifying those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, you can see what the robots.txt file for CNN looks like here.

While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.

For complete details about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is a portion from that page:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
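Putting the two fields together, a hypothetical record that applies to all robots might look like this (the paths are made up for illustration):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private.html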

Using this description, we can easily write the code to create a list of disallowed URLs. All we need to do is parse the robots.txt file, splitting it first on "User-agent" and then on "Disallow". This gives us a list that we can then use in our crawler: we just need to check it every time we are about to crawl a page.

Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
    class Program
    {
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;

        static void Main(string[] args)
        {
            String baseUrl = "http://www.cnn.com";
            baseUrl += "/robots.txt";

            getDisallowedUrls(baseUrl);
        }

        static void getDisallowedUrls(string baseUrl)
        {
            if (!isValidUrl(baseUrl))
            {
                return; // nothing to parse if the URL could not be opened
            }
            urlOpen();

            String RobotTxtContent = read();

            // List that holds URLs which shouldn't be crawled
            List<string> disallowed = new List<string>();
            String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:");
            String userAgents = "";
            foreach (String agent in user_agents)
            {
                if (agent.Trim().StartsWith("*"))
                {
                    // keep only the record that applies to all robots
                    userAgents = agent.Trim().Substring(1);
                }
            }

            String[] disallow = Regex.Split(userAgents, "Disallow:");

            foreach (String item in disallow)
            {
                if (item != "\n")
                {
                    disallowed.Add(item.Trim());
                    Console.WriteLine(baseUrl + item.Trim());
                }
            }

            Console.ReadLine();
        }

        public static String read()
        {
            StreamReader sr = new StreamReader(resStream);
            String strText = sr.ReadToEnd();
            return strText;
        }

        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)HttpWebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                return (res.StatusCode == HttpStatusCode.OK);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Not a Valid URL: " + ex.Message + " - " + url);
                return false;
            }
        }
    }
}

Monday, November 23, 2009

Duplicate Detection using MD5 and Jaccard Coefficient in C#

Duplicate documents are a significant issue in the context of the Web. Detecting near-duplicate documents in a large data set, like the Web, is a long-standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, the MD5 algorithm, and the Jaccard coefficient in C#.

Method
One of the simplest and most efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as a shingle) is a sequence of 'n' consecutive words. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented by the following 3-gram phrases:

- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school

These shingles can then be converted to hash values using the MD5 algorithm. Once we have encoded the n-grams of the entire document, we can compare them with the encoded n-grams of another document using the Jaccard coefficient.

Code
Using LINQ, you can perform the 'union' and 'intersection' operations needed to calculate the Jaccard coefficient. Here is the code for creating n-grams from two documents and calculating the Jaccard coefficient:

public double Detect(string page1, string page2)
{
    // hashed shingles for each page
    List<string> s1 = CreateShingles(page1);
    List<string> s2 = CreateShingles(page2);

    // Jaccard coefficient = |intersection| / |union|
    var commonElements = s1.Intersect(s2);
    var totalElements = s1.Union(s2);

    double jaccardCoef = Convert.ToDouble(commonElements.Count()) / Convert.ToDouble(totalElements.Count());

    return jaccardCoef;
}
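The CreateShingles() helper used above is part of the complete code linked below. As a rough sketch of how such a helper might look (the details here are my assumption, not necessarily the linked implementation), it splits the page into words, builds 3-grams, and hashes each one with the CalculateMD5Hash() function shown next:

public List<string> CreateShingles(string page)
{
    // split the page into words, ignoring punctuation and empty strings
    string[] words = Regex.Split(page, @"\W+").Where(w => w != "").ToArray();
    List<string> shingles = new List<string>();

    // slide a window of three consecutive words over the text
    for (int i = 0; i + 2 < words.Length; i++)
    {
        string gram = words[i] + " " + words[i + 1] + " " + words[i + 2];
        shingles.Add(CalculateMD5Hash(gram)); // store the hashed shingle
    }
    return shingles;
}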

You can implement the MD5 algorithm yourself, but many languages provide built-in functionality for it. In C# you can use the MD5 class from System.Security.Cryptography. Here is the code:

public string CalculateMD5Hash(string input)
{
    // step 1: calculate the MD5 hash of the input
    MD5 md5 = System.Security.Cryptography.MD5.Create();
    byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
    byte[] hash = md5.ComputeHash(inputBytes);

    // step 2: convert the byte array to a hex string
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.Length; i++)
    {
        sb.Append(hash[i].ToString("X2"));
    }
    return sb.ToString();
}

The 'input' here is the 3-gram phrase. Here is the complete code.
If you would just like to see how it works, here is the executable file.

The program just looks for words in the web pages being compared. This may lead to slightly erratic results due to text ads within the main content. Since some text ads change on every refresh, the duplication percentage may differ slightly every time the two pages are compared.

A better way is to extract the main content from the web pages and then extract individual words from it.

Thursday, October 15, 2009

Text Summarization using Vector Space Model


The rapid growth of the World Wide Web has resulted in information overload. One cannot possibly read every news article, document, or paper that is out there on the web. That is why text summarization has become essential in this information age. Even search engines show a summary of each search result to help users select the most appropriate link.

However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using the Vector Space Model.

We will first extract the sentences from the document. The summary generated for each document will actually consist of sentences extracted from it. Using the VSM, we will try to pick the sentences that are more representative of the document than others.

In the previous examples, there were two main components: a list of documents and the query. We assumed that there exists a string array "docs" which contains the query's text (at index zero) and the documents' text. We are going to use the same data structure ("docs"), but this time, instead of the query, index zero will hold the text of the document we are trying to summarize, and the remaining indices will hold the sentences extracted from it.

We will then rank the sentences using the VSM according to their similarity with the document. The similarity decides whether a sentence is included in the summary or not.
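As a rough illustration of that ranking step (this is not the exact code from the VSM post, and BuildTermVector is a hypothetical helper that turns a piece of text into a term-frequency vector over the shared word list), the cosine similarity between each sentence and the document could be computed like this:

private List<string> RankSentences(string document, List<string> sentences)
{
    double[] docVector = BuildTermVector(document); // hypothetical helper
    var scored = new List<KeyValuePair<double, string>>();

    foreach (string sentence in sentences)
    {
        double[] sentVector = BuildTermVector(sentence);
        double dot = 0, docNorm = 0, sentNorm = 0;

        // cosine similarity between the sentence vector and the document vector
        for (int i = 0; i < docVector.Length; i++)
        {
            dot += docVector[i] * sentVector[i];
            docNorm += docVector[i] * docVector[i];
            sentNorm += sentVector[i] * sentVector[i];
        }
        double cosine = (docNorm == 0 || sentNorm == 0)
            ? 0
            : dot / (Math.Sqrt(docNorm) * Math.Sqrt(sentNorm));

        scored.Add(new KeyValuePair<double, string>(cosine, sentence));
    }

    // the highest-scoring sentences form the summary (top five here)
    return scored.OrderByDescending(p => p.Key)
                 .Take(5)
                 .Select(p => p.Value)
                 .ToList();
}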

The code from the VSM example is almost the same; the only new function is the sentence extractor that I had to write for this example. The sentence extractor uses the following heuristics:

1. If the character is "?" or "!", then mark this as the end of the sentence.
2. If the character is ".", then check whether it is preceded by "Mr", "Mrs", "Dr", etc.
  • if yes, then don't mark this as the end of the sentence
  • if no, then check whether the "." is followed by a number
    • if yes, then don't mark this as the end of the sentence
    • if no, then mark this as the end of the sentence

The program requires a valid URL from which it reads the page and displays the top five sentences that it thinks best summarize the document. For this simple program, I assume that "p" tags are used to display paragraphs.
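Since I assume paragraphs sit inside "p" tags, a simple (and admittedly fragile) way to pull them out is a regular expression. This is only a sketch of that assumption, not robust HTML parsing:

private List<string> ExtractParagraphs(string html)
{
    // grab the text inside <p>...</p> tags and strip any markup left inside
    List<string> paragraphs = new List<string>();
    foreach (Match m in Regex.Matches(html, @"<p[^>]*>(.*?)</p>",
                                      RegexOptions.IgnoreCase | RegexOptions.Singleline))
    {
        string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", " ");
        paragraphs.Add(text.Trim());
    }
    return paragraphs;
}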

As can be seen from the program's output, the summary produced by this simple method is not too bad :)

Here is the code.

Saturday, October 10, 2009

Extracting sentences from a document

Extracting words from a document is quite common and a fairly easy task; using variations of Regex.Split(documentText, @"\W"), one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted the first and the last sentence, but I thought it would be easier to first split the entire document into sentences using the Split function as follows:

char[] delimiter = { '.', '?', '!' };
string[] sentences = document.Split(delimiter);

These two lines of code will work in most cases, since '?', '!', and '.' are usually the characters that mark the end of a sentence. However, splitting on the period (.) can lead to erroneous results, since a dot can also appear in a decimal number (3.5, 100.03, etc.) or an abbreviation (Mr., Mrs., Dr., etc.).

Unfortunately, the corpus that I was using had a lot of instances of decimal numbers and abbreviations, so I had to come up with a solution. After unsuccessfully trying to Google (or Bing) for some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpus, but they may require some modifications to work for other types of documents.

The main function is 'parseSentences()', which splits the document into sentences. It requires that the entire document text be passed as a string parameter.

The 'checkInitials()' function is called when the program comes across a period. This function checks whether the characters before the period represent an initial or title (like Dr., Miss., Mrs., Mr.) and returns a boolean value based on its finding.

The 'checkNum()' function checks whether the characters immediately before and after the encountered period are digits, i.e., whether the period is a decimal point.

If both 'checkInitials()' and 'checkNum()' return false, then the period represents the end of a sentence, and that sentence is added to a List. If either of them returns true, then the period is considered part of the sentence and the counter moves on to the next character.

Here is the code:

private void parseSentences(string document)
{
    string para = document;
    string sentence = "";                         // keeps track of the current sentence
    List<string> sentences = new List<string>();  // stores the extracted sentences
    int counter = 0;                              // index of the character being read

    CharEnumerator charEnum = para.GetEnumerator(); // enumerator to read each character
    while (charEnum.MoveNext())
    {
        char currChar = para[counter]; // the current character being read

        if (currChar == '.')
        {
            // check if the characters around the period are digits
            if (checkNum(para, currChar, counter))
            {
                // the period is a decimal point, not the end of a sentence
                sentence += currChar.ToString();
            }
            // check if the characters before the period are initials of some sort
            else if (checkInitial(para, currChar, counter))
            {
                sentence += currChar.ToString();
            }
            else
            {
                // end of sentence: add it to the list
                sentences.Add(sentence);
                sentence = "";
            }
        }
        else if (currChar == ' ')
        {
            sentence += currChar.ToString();
        }
        else if ((currChar == '!') || (currChar == '?'))
        {
            sentences.Add(sentence);
            sentence = "";
        }
        else
        {
            sentence += currChar.ToString();
        }
        counter++;
    }

    if (sentence != "")
    {
        sentences.Add(sentence);
    }

    foreach (string s in sentences)
    {
        Console.WriteLine(s);
    }
}

private bool checkNum(string para, char currChar, int counter)
{
    // a space right after the period means it is not a decimal number
    try
    {
        if (para[counter + 1] == ' ')
            return false;
    }
    catch (Exception)
    {
        return false; // the period is the last character of the document
    }

    char currNext = para[counter + 1];

    if (currNext >= '0' && currNext <= '9')
    {
        char currPrev = para[counter - 1];

        if (currPrev >= '0' && currPrev <= '9')
            return true; // digits on both sides: the period is a decimal point
    }
    return false;
}

private bool checkInitial(string para, char currChar, int counter)
{
    string word1 = ""; // last two characters before the period (for "Mr", "Dr")
    string word2 = ""; // last three characters before the period (for "Mrs")
    string word3 = ""; // last four characters before the period (for "Miss")

    for (int i = 4; i > 0; i--)
    {
        try
        {
            if (i < 3)
            {
                word1 += para[counter - i].ToString();
            }
            if (i < 4)
            {
                word2 += para[counter - i].ToString();
            }
            word3 += para[counter - i].ToString();
        }
        catch (Exception e)
        {
            Console.WriteLine(e.ToString());
        }
    }

    if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") ||
        (word2.ToLower() == "mrs") || (word3.ToLower() == "miss"))
    {
        return true;
    }
    return false;
}

Monday, September 7, 2009

Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#

LSI is used in a variety of information retrieval and text processing applications. It uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

Tutorials for LSI and SVD can be found online. I will use the Bluebit .NET Matrix Library to calculate the SVD and the other matrix and vector operations needed in LSI.

I implemented LSI in C# using the example described in Mi Ishlita. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by the similarity values. As in the previous example, the program assumes that there is a string array "docs" which contains the query's text (at index zero) and the documents' text.

To use the code (below), you need to download and install the .NET Matrix Library (NML™) 4.3 and then reference it from within your project using the following instructions:
  1. Open Visual Studio and create a new Visual Basic or C# project.
  2. Select Project | Add Reference...
  3. Select the .NET tab, find Bluebit .NET Matrix Library in the list, double-click on it, and then press OK.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Bluebit.MatrixLibrary;
using System.Collections;
using System.Text.RegularExpressions;

namespace LSI
{
    class Program
    {
        static double[,] A; // term-document array
        static double[] q;  // term-query array
        static List<string> wordlist = new List<string>(); // list of terms found in the documents
        static SortedDictionary<double, string> sortedList = new SortedDictionary<double, string>(); // documents ranked by similarity value
        static string[] docs = {"gold silver truck",                            // Query
                                "shipment of gold damaged in a fire",           // Doc 1
                                "delivery of silver arrived in a silver truck", // Doc 2
                                "shipment of gold arrived in a truck"};         // Doc 3

        static void Main(string[] args)
        {
            createWordList();
            createVector();
            LatentSemanticIndexing();

            foreach (KeyValuePair<double, string> kvp in sortedList)
            {
                Console.WriteLine(kvp.Value + " " + kvp.Key);
            }
            Console.ReadLine();
        }

        public static void createWordList()
        {
            foreach (string doc in docs)
            {
                wordlist = getWordList(wordlist, doc);
            }
            wordlist.Sort(); // sort the wordlist alphabetically
        }

        public static List<string> getWordList(List<string> wordlist, string query)
        {
            Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
            MatchCollection MCollection = exp.Matches(query);

            foreach (Match match in MCollection)
            {
                if (!wordlist.Contains(match.Value))
                {
                    wordlist.Add(match.Value);
                }
            }
            return wordlist;
        }

        public static void createVector()
        {
            double[] queryvector;
            q = new double[wordlist.Count];
            A = new double[wordlist.Count, docs.Length - 1];

            for (int j = 0; j < docs.Length; j++)
            {
                queryvector = new double[wordlist.Count];

                for (int i = 0; i < wordlist.Count; i++)
                {
                    // calculate the term frequency
                    double tf = getTF(docs[j], wordlist[i]);

                    if (j == 0) // if it is a query term, add it to the query array
                    {
                        q[i] = tf;
                    }
                    else        // if it is a document term, add it to the document array
                    {
                        A[i, j - 1] = tf;
                    }
                }
            }
        }

        private static void LatentSemanticIndexing()
        {
            // Singular Value Decomposition: A = U S VT
            Matrix docMatrix = new Matrix(A);
            SVD svd = new SVD(docMatrix);

            Matrix U = svd.U;
            Matrix S = svd.S;
            Matrix V = svd.V;
            Matrix VT = svd.VH;

            // Dimensionality reduction: compute Uk, Sk, Vk and VkT
            Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);
            Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);
            Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);
            Matrix VkT = Vk.Transpose();

            // q = qT Uk Sk-1
            Matrix queryMatrix = new Matrix(q, q.Length, 1);
            queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();

            // sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities
            for (int i = 0; i < V.Rows; i++)
            {
                Vector docVector = Vk.RowVector(i);
                Vector queryVector = queryMatrix.RowVector(0);
                double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);

                Console.WriteLine("Doc " + (i + 1).ToString() + " :" + sim);
            }
        }

        private static double getTF(string document, string term)
        {
            string[] queryTerms = Regex.Split(document, "\\s");
            double count = 0;

            foreach (string t in queryTerms)
            {
                if (t == term)
                {
                    count++;
                }
            }
            return count;
        }
    }
}

After you have sorted the results in descending order by similarity value, the ranking will be as follows:
d2 > d3 > d1