Sunday, September 26, 2010
DT-Tree: A Semantic Representation of Scientific Papers
Saturday, March 20, 2010
Creating Spam Filter using Naive Bayes Classifier
Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestion regarding the types of features which should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc and they indeed were among the responses that I got. But I wasn't expecting them to suggest taking into account "case sensitivity", "color scheme", "co-0ccurance" and "spelling" of words in email body. These responses were from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.
Here are the slides:
Friday, January 29, 2010
Parsing Robots.txt File
While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.
You can find complete detail about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is a portion from this webpage:
The record starts with one or more User-agent
lines, followed by one or more Disallow
lines, as detailed below. Unrecognised headers are ignored.
- User-agent
- The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '
*
', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt
" file. - Disallow
- The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example,
Disallow: /help
disallows both/help.html
and/help/index.html
, whereasDisallow: /help/
would disallow/help/index.html
but allow/help.html
.Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
namespace ParsingRobotTxt
{
class Program
{
public static HttpWebRequest req;
public static HttpWebResponse res;
static Stream resStream;
static void Main(string[] args)
{
String baseUrl = "http://www.cnn.com/";
baseUrl += "/robots.txt";
getDisallowedUrls(baseUrl);
}
static void getDisallowedUrls(string baseUrl)
{
if (isValidUrl(baseUrl))
{
urlOpen();
}
String RobotTxtContent = read();
List
String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:");
String userAgents = "";
foreach (String agent in user_agents)
{
if (agent.Trim().StartsWith("*"))
{
userAgents = agent.Trim().Substring(1);
}
}
String[] disallow = Regex.Split(userAgents, "Disallow:");
foreach (String item in disallow)
{
if (item != "\n")
{
disallowed.Add(item.Trim());
Console.WriteLine(baseUrl + item.Trim());
}
}
Console.ReadLine();
}
public static String read()
{
StreamReader sr = new StreamReader(resStream);
String strText = sr.ReadToEnd();
return strText;
}
public static void urlOpen()
{
resStream = res.GetResponseStream();
}
public static bool isValidUrl(String url)
{
try
{
req = (HttpWebRequest)HttpWebRequest.Create(url);
res = (HttpWebResponse)req.GetResponse();
return (res.StatusCode == HttpStatusCode.OK);
}
catch (Exception ex)
{
Console.WriteLine("Not a Valid URL:" + ex.Message + " - " + url);
return false;
}
}
}
}
Monday, November 23, 2009
Duplicate Detection using MD5 and Jaccard Coefficient in C#
Method
One of the simplest and efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as shingle) is a sequence of consecutive words of size 'n'. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:
- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school
These shingles can then be converted to a hash number using MD5 algorithm. Once we have encoded n-grams of the entire document we can then compare them with encoded n-grams of another documents using Jaccard Coefficient.
Code
Using Linq you can perform 'union' and 'intersection' functions in order to calculate Jaccard Coefficient. Here is the code for creating n-grams from 2 documents and calculating Jaccard Coefficient:
public double Detect(string page1, string page2)
{
List
List
var commonelements = s1.Intersect(s2);
var totalElements = s1.Union(s2);
double jaccardCoef = (Convert.ToDouble(commonelements.Count())) / (Convert.ToDouble(totalElements.Count()));
return jaccardCoef;
}
You can implement your own MD5 algorithm but many languages provide built-in functionality to use MD-5. In C# you can use MD5 function using System.Security.Cryptography. Here is the code:
public string CalculateMD5Hash(string input)
{
// step 1, calculate MD5 hash from input
MD5 md5 = System.Security.Cryptography.MD5.Create();
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hash = md5.ComputeHash(inputBytes);
// step 2, convert byte array to hex string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hash.Length; i++) { sb.Append(hash[i].ToString("X2")); } return sb.ToString(); }
The 'input' here is the 3-gram phrase. Here is the complete Code.
In case if you just like to see how it works, here is the executable file.
The program just look for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. Since some text ads change on every refresh the duplication percentage may differ by a few decimal points every time the two pages are compared.
A better way is to extract the main content from the web pages and then extract individual words from it.
Thursday, October 15, 2009
Text Summarization using Vector Space Model

The rapid growth of World Wide Web has resulted in information overload. One cannot possibly read every news article, document, paper, etc that is out there on the web. That is why text summarization has become essential in this information age. Even the search engines show the summary of each of the search results to help users select the most appropriate link.
However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using Vector Space Model.
We will first extract the sentences from the document. The summary generated for each document will actually consist of sentences that are extracted from it. We will try to pick the sentences are more representative of the document than others using the VSM.
In the previous examples, there were two main components: list of documents and the query. We then assumed that there exist a string array "docs" which contains the query's (at zero index) and the documents' text. We are going to use the same data structure ("docs") but this time instead of the query we will have the document's text (which we are trying to summarize) at the zero index and the various sentences that we extracted from the document in the remaining indices.
We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.
The code from VSM example is almost the same, the only new function is the sentence extractor that I had to write for this example. The sentence extractor is created using the following heuristics:
1. If the character is "?" or "!" then mark this as the end of the sentence
2. If the character is "." the check that it is not preceded by "Mr", "Mrs", "Dr", etc
- if Yes, then don't mark this as the end of the sentence
- if No, then check that "." is not followed by a number
- if Yes, then don't mark this as the end of the sentence
- if No, then mark this as the end of the sentence
As can be seen from the image above, the summary produced by this simple method is not too bad :)
Here is the code.
Saturday, October 10, 2009
Extracting sentences from a document
char[] delimiter={'.','?','!'};
string[] sentences = document.Split(delimiter);
These two lines of code will work on most of the cases since '?','!' and '.' are usually the characters that represent the end of a sentence. However, using the period (.) can lead to erroneous results since dot '.' can be used to represent a decimal number ( 3.5 , 100.03, etc) or abbreviations (Mr., Mrs., Dr., etc).
Unfortunately, the corpora that I was using has a lot of instances of decimal numbers and abbreviations. So I have to come up with a solution. After unsuccessfully trying to Google (or Bing) for the some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpora but it may require some modifications to work for other types document.
The main function is 'parseSenteces()' that splits the document into sentences. It requires that the entire document text be sent as a string parameter.
The 'checkInitils()' function is called when the the program comes across a period. This function checks whether the character before the period represent any initial (like Dr., Miss., Mrs., Mr.) or not and returns a boolean value upon its finding.
The 'checkNum' function checks whether the character before the encountered period are numbers or not.
If the 'checkInitials()' and 'checkNum()' returns false then that represent end of sentence, and that sentence is added in a List. If any of them returns true, then the period is considered a part of sentence and the counter moves on to check the next character.
Here is the code:
private void parseSentences(string document)
{
string para = document;
string sentence = ""; //keeps track of current sentence
List words = new List(); //stores different sentences
int counter = 0; //keeps track of index of the character being read
CharEnumerator charEnum = para.GetEnumerator();//Enumerator to read each character
while (charEnum.MoveNext())
{
char currChar = para[counter]; //holds the current character being read
if (currChar == '.')
{
//check if the character before period are numbers
if (checkNum(para, currChar, counter))
{
//if not then period represents a number and not end of sentence
sentence += currChar.ToString();
}
//check if the character before period are initials of some sort
else if (checkInitial(para, currChar, counter))
{
sentence += currChar.ToString();
}
else
{
//add the sentence to the list
words.Add(sentence);
sentence = "";
}
}
else if (currChar == ' ')
{
sentence += currChar.ToString();
}
else if ((currChar == '!') || (currChar == '?'))
{
words.Add(sentence);
sentence = "";
}
else
{ sentence += currChar.ToString(); }
counter++;
}
if (sentence != "")
{ words.Add(sentence); }
string list = "";
foreach (string s in words)
{
Console.WriteLine(s);
}
}
private bool checkNum(string para, char currChar, int counter)
{ //space means not a decimal number
try
{
if (para[counter + 1] == ' ')
return false;
} catch(Exception)
{ return false; }
char currNext= para[counter+1];
if ((currNext >= '0' & currNext <= '9'))
{ char currPrev=para[counter-1];
if ((currPrev >= '0' & currPrev <= '9'))
return true;
}
return false;
}
private bool checkInitial(string para, char currChar, int counter)
{
string word1 = ""; string word2 = "";
for (int i = 3; i > 0; i--)
{
try
{
if (i < 3)
{
word1 += para[counter - i].ToString();
}
word2 += para[counter - i].ToString();
}
catch(Exception e)
{
Console.WriteLine(e.toString());
}
}
if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))
{
return true;
}
return false;
}
Monday, September 7, 2009
Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#
Tutorials for LSI and SVD can be found online. I will use Bluebit .Net Matrix Library to calculate SVD and other matrix and vector operations needed in LSI.
I implemented LSI in C# using the example described in Mi Ishlita. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by the similarity values. Like in the previous example, the program assumes that there is a string array "docs" which contains the query's (at zero index) and the documents' text.
To use the code (below), you need to download and install .NET Matrix Library (NML™) 4.3 and then reference .NET Matrix Library from within your project using the following instructions:
- Open the Visual Studio a and open a new Visual Basic or C# project.
- Select Project | Add Reference...
- Select the .NET tab and find Bluebit .NET Matrix Library listed, double click on it and then press OK.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Bluebit.MatrixLibrary;
using System.Collections;
using System.Text.RegularExpressions;
namespace LSI
{
class Program
{
static double[,] A; //term-document Array
static double[] q; //term-query Array
static List
static SortedDictionary
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3
static void Main(string[] args)
{
createWordList();
createVector();
LatentSemanticIndexing();
foreach (KeyValuePair
{
Console.WriteLine(kvp.Value + " " + kvp.Key);
}
Console.ReadLine();
}
public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
wordlist.Sort(); //sort the wordlist alphabetically
}
public static List
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);
foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}
return wordlist;
}
public static void createVector()
{
double[] queryvector;
q = new double[wordlist.Count];
A = new double[wordlist.Count, docs.Length-1];
for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];
for (int i = 0; i < wordlist.Count; i++)
{
//calculate Term Frequency
double tf = getTF(docs[j], wordlist[i]);
if (j == 0) //if Query term then add it to query array
{
q[i] = tf;
}
else //if document term then add it to document array
{
A[i, j-1] = tf;
}
}
}
}
private static void LatentSemanticIndexing()
{
//Singular Value Decomposition
Matrix docMatrix = new Matrix(A);
SVD svd = new SVD(docMatrix);
//A = U S VT
Matrix U = svd.U;
Matrix S = svd.S;
Matrix V = svd.V;
Matrix VT = svd.VH;
//Dimensionality Reduction: Computing Uk, Sk, Vk and VkT
Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);
Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);
Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);
Matrix VkT = Vk.Transpose();
//q = qT Uk Sk-1
Matrix queryMatrix = new Matrix(q,q.Length,1);
queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();
//sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities
for (int i = 0; i < V.Rows; i++)
{
Vector docVector = Vk.RowVector(i);
Vector queryVector = queryMatrix.RowVector(0);
double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);
Console.WriteLine("Doc " + (i + 1).ToString() + " :" + sim);
}
}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;
foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;
}
}
}
After you have sorted the results in descending order using the similarity values, your results will be as follows:
d2>d3>d2