Monday, November 23, 2009

Duplicate Detection using MD5 and Jaccard Coefficient in C#

Duplicate documents are a significant issue in the context of the Web. Detecting near-duplicate documents in a large data set, like the Web, is a long-standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, the MD5 algorithm and the Jaccard Coefficient in C#.

Method
One of the simplest and most efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as a shingle) is a sequence of 'n' consecutive words. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:

- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school

These shingles can then be converted to hash values using the MD5 algorithm. Once we have encoded the n-grams of an entire document, we can compare them with the encoded n-grams of another document using the Jaccard Coefficient.

Code
Using LINQ you can perform the 'union' and 'intersection' operations needed to calculate the Jaccard Coefficient. Here is the code for creating n-grams from two documents and calculating the Jaccard Coefficient:

public double Detect(string page1, string page2)
{
    List<string> s1 = CreateShingles(page1);
    List<string> s2 = CreateShingles(page2);

    var commonElements = s1.Intersect(s2);
    var totalElements = s1.Union(s2);

    double jaccardCoef = Convert.ToDouble(commonElements.Count()) / Convert.ToDouble(totalElements.Count());

    return jaccardCoef;
}
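
The CreateShingles() helper is not shown in the original snippet. Here is a minimal sketch of how it could look, assuming the input is plain text, that 3-grams are wanted, and that the CalculateMD5Hash() function shown below is available (System.Linq and System.Text.RegularExpressions are needed):

private List<string> CreateShingles(string page)
{
    // split the page text into words, dropping punctuation and empty entries
    string[] words = Regex.Split(page, @"\W+").Where(w => w.Length > 0).ToArray();

    List<string> shingles = new List<string>();
    int n = 3; // 3-gram shingles

    for (int i = 0; i <= words.Length - n; i++)
    {
        // join n consecutive words and store the MD5 fingerprint of the shingle
        string shingle = string.Join(" ", words, i, n);
        shingles.Add(CalculateMD5Hash(shingle));
    }
    return shingles;
}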

You can implement your own MD5 algorithm, but many languages provide built-in functionality for MD5. In C# you can use the MD5 class from System.Security.Cryptography. Here is the code:

public string CalculateMD5Hash(string input)
{
    // step 1: calculate the MD5 hash of the input
    MD5 md5 = System.Security.Cryptography.MD5.Create();
    byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
    byte[] hash = md5.ComputeHash(inputBytes);

    // step 2: convert the byte array to a hex string
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.Length; i++)
    {
        sb.Append(hash[i].ToString("X2"));
    }
    return sb.ToString();
}

The 'input' here is the 3-gram phrase. Here is the complete code.
In case you just want to see how it works, here is the executable file.

The program just looks for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. Since some text ads change on every refresh, the duplication percentage may differ by a few decimal points every time the two pages are compared.

A better way is to extract the main content from the web pages and then extract individual words from it.

Thursday, October 15, 2009

Text Summarization using Vector Space Model


The rapid growth of the World Wide Web has resulted in information overload. One cannot possibly read every news article, document, paper, etc. that is out there on the web. That is why text summarization has become essential in this information age. Even search engines show a summary of each search result to help users select the most appropriate link.

However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using Vector Space Model.

We will first extract the sentences from the document. The summary generated for each document will consist of sentences extracted from it. We will use the VSM to pick the sentences that are more representative of the document than others.

In the previous examples, there were two main components: the list of documents and the query. We assumed that there exists a string array "docs" which contains the query (at index zero) and the documents' text. We are going to use the same data structure ("docs"), but this time instead of the query we will have the document's text (which we are trying to summarize) at index zero and the various sentences extracted from the document in the remaining indices.

We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.
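
As a rough illustration of that ranking step, suppose a tf-idf vector has already been built for the document (index zero) and for each sentence, using the same code as in the VSM post. A hypothetical helper could then pick the top sentences like this (cosinetheta() is the cosine similarity function from the VSM post; System.Linq is required):

private static int[] TopSentences(double[][] vectors, int topN)
{
    // vectors[0] is the document vector, vectors[1..] are the sentence vectors;
    // return the indices of the topN sentences most similar to the document
    return Enumerable.Range(1, vectors.Length - 1)
                     .OrderByDescending(i => cosinetheta(vectors[0], vectors[i]))
                     .Take(topN)
                     .ToArray();
}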

The code from the VSM example is almost the same; the only new function is the sentence extractor that I had to write for this example. The sentence extractor is created using the following heuristics:

1. If the character is "?" or "!" then mark this as the end of the sentence
2. If the character is "." then check whether it is preceded by "Mr", "Mrs", "Dr", etc.
  • if Yes, then don't mark this as the end of the sentence
  • if No, then check whether the "." is surrounded by digits (a decimal number)
    • if Yes, then don't mark this as the end of the sentence
    • if No, then mark this as the end of the sentence

The program requires a valid URL from which it will read the content and display the top 5 sentences that it thinks best summarize the document. For this simple program, I assume that "p" tags are used to display paragraphs.

As can be seen from the image above, the summary produced by this simple method is not too bad :)

Here is the code.

Saturday, October 10, 2009

Extracting sentences from a document

Extracting words from a document is quite common and a fairly easy task: using variations of Regex.Split(documentText, @"\W") one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted to extract the first and the last sentence, but I thought it would be easier to first split the entire document into sentences using the Split function as follows:

char[] delimiter = { '.', '?', '!' };
string[] sentences = document.Split(delimiter);

These two lines of code will work in most cases since '?', '!' and '.' are usually the characters that mark the end of a sentence. However, using the period (.) can lead to erroneous results since a dot '.' can also be part of a decimal number (3.5, 100.03, etc.) or an abbreviation (Mr., Mrs., Dr., etc.).

Unfortunately, the corpus that I was using had a lot of instances of decimal numbers and abbreviations, so I had to come up with a solution. After unsuccessfully trying to Google (or Bing) for some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpus but may require some modifications to work for other types of documents.

The main function is 'parseSentences()', which splits the document into sentences. It requires that the entire document text be passed as a string parameter.

The 'checkInitials()' function is called when the program comes across a period. This function checks whether the characters before the period represent an initial or title (like Dr., Miss, Mrs., Mr.) and returns a boolean value based on its finding.

The 'checkNum()' function checks whether the characters surrounding the encountered period are digits, i.e. whether the period is a decimal point.

If both 'checkInitials()' and 'checkNum()' return false, then the period marks the end of a sentence, and that sentence is added to a List. If either of them returns true, then the period is considered part of the sentence and the counter moves on to the next character.

Here is the code:

private void parseSentences(string document)
{
    string para = document;
    string sentence = ""; //keeps track of the current sentence
    List<string> words = new List<string>(); //stores the extracted sentences
    int counter = 0; //keeps track of the index of the character being read

    CharEnumerator charEnum = para.GetEnumerator(); //enumerator to read each character
    while (charEnum.MoveNext())
    {
        char currChar = para[counter]; //holds the current character being read

        if (currChar == '.')
        {
            //check if the characters around the period are digits (a decimal number)
            if (checkNum(para, currChar, counter))
            {
                //the period is a decimal point, not the end of a sentence
                sentence += currChar.ToString();
            }
            //check if the characters before the period are initials of some sort
            else if (checkInitial(para, currChar, counter))
            {
                sentence += currChar.ToString();
            }
            else
            {
                //end of sentence: add it to the list
                words.Add(sentence);
                sentence = "";
            }
        }
        else if (currChar == ' ')
        {
            sentence += currChar.ToString();
        }
        else if ((currChar == '!') || (currChar == '?'))
        {
            words.Add(sentence);
            sentence = "";
        }
        else
        {
            sentence += currChar.ToString();
        }
        counter++;
    }

    if (sentence != "")
    {
        words.Add(sentence);
    }

    foreach (string s in words)
    {
        Console.WriteLine(s);
    }
}

private bool checkNum(string para, char currChar, int counter)
{
    try
    {
        //a space right after the period means it is not a decimal number
        if (para[counter + 1] == ' ')
            return false;
    }
    catch (Exception)
    {
        return false; //the period is the last character of the document
    }

    char currNext = para[counter + 1];

    if (currNext >= '0' && currNext <= '9')
    {
        char currPrev = para[counter - 1];

        if (currPrev >= '0' && currPrev <= '9')
            return true;
    }
    return false;
}

private bool checkInitial(string para, char currChar, int counter)
{
    string word1 = ""; string word2 = "";

    //collect the last two (word1) and three (word2) characters before the period
    for (int i = 3; i > 0; i--)
    {
        try
        {
            if (i < 3)
            {
                word1 += para[counter - i].ToString();
            }
            word2 += para[counter - i].ToString();
        }
        catch (Exception e)
        {
            Console.WriteLine(e.ToString());
        }
    }

    if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))
    {
        return true;
    }
    return false;
}
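
A quick, hypothetical test of the function: the input below should come out as three sentences, with "Dr." and "3.5" left intact rather than treated as sentence breaks.

parseSentences("Dr. Smith arrived at 3.5 pm. He was late! Was the meeting over?");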

Monday, September 7, 2009

Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#

LSI is used in a variety of information retrieval and text processing applications. It uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

Tutorials for LSI and SVD can be found online. I will use the Bluebit .NET Matrix Library to calculate the SVD and the other matrix and vector operations needed in LSI.

I implemented LSI in C# using the example described in Mi Islita. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by the similarity values. As in the previous example, the program assumes that there is a string array "docs" which contains the query (at index zero) and the documents' text.

To use the code (below), you need to download and install .NET Matrix Library (NML™) 4.3 and then reference the .NET Matrix Library from within your project using the following instructions:
  1. Open Visual Studio and create a new Visual Basic or C# project.
  2. Select Project | Add Reference...
  3. Select the .NET tab, find Bluebit .NET Matrix Library in the list, double click on it and then press OK.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Bluebit.MatrixLibrary;
using System.Collections;
using System.Text.RegularExpressions;

namespace LSI
{
class Program
{
static double[,] A; //term-document Array
static double[] q; //term-query Array
static List<string> wordlist = new List<string>(); //List of terms found in the documents
static SortedDictionary<double, string> sortedList = new SortedDictionary<double, string>(); //Documents ranked by their similarity to the query
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
LatentSemanticIndexing();


foreach (KeyValuePair<double, string> kvp in sortedList)
{

Console.WriteLine(kvp.Value + " " + kvp.Key);
}
Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
wordlist.Sort(); //sort the wordlist alphabetically
}

public static List<string> getWordList(List<string> wordlist, string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}


return wordlist;
}

public static void createVector()
{
double[] queryvector;
q = new double[wordlist.Count];
A = new double[wordlist.Count, docs.Length-1];
for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];

for (int i = 0; i < wordlist.Count; i++)
{
//calculate Term Frequency
double tf = getTF(docs[j], wordlist[i]);

if (j == 0) //if Query term then add it to query array
{
q[i] = tf;
}
else //if document term then add it to document array
{
A[i, j-1] = tf;
}
}

}

}

private static void LatentSemanticIndexing()
{
//Singular Value Decomposition
Matrix docMatrix = new Matrix(A);
SVD svd = new SVD(docMatrix);

//A = U S VT
Matrix U = svd.U;
Matrix S = svd.S;
Matrix V = svd.V;
Matrix VT = svd.VH;

//Dimensionality Reduction: Computing Uk, Sk, Vk and VkT
Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);
Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);
Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);
Matrix VkT = Vk.Transpose();


//q = qT Uk Sk-1
Matrix queryMatrix = new Matrix(q,q.Length,1);
queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();

//sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities
for (int i = 0; i < V.Rows; i++)
{
Vector docVector = Vk.RowVector(i);
Vector queryVector = queryMatrix.RowVector(0);
double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);

sortedList.Add(sim, "Doc " + (i + 1).ToString()); //store the score so that Main can print the ranked list
Console.WriteLine("Doc " + (i + 1).ToString() + " :" + sim);
}



}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;


foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;

}


}
}

After you have sorted the results in descending order by the similarity values, the ranking is as follows:
d2 > d3 > d1

Monday, August 17, 2009

Classic Vector Space Model in C#

Vector Space Model (VSM) is widely used in Information Retrieval systems. The model creates a space in which both documents and queries are represented by vectors. VSM treats the query as a document as well, and it determines the similarity between the query and the other documents by calculating the deviation of angles between each document vector and the original query vector. It then uses this deviation to rank the documents.

I implemented the model in C# using the example described in Mi Islita. The program will display the elements of a dictionary object that holds the name of each document and its similarity value.

You can use any sorting algorithm to sort the dictionary by the similarity values. The program also assumes that there is a string array "docs" which contains the query (at index zero) and the documents' text.

Here is the code:


using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Linq;
namespace VectorSpaceModel
{
class Program
{
static Hashtable DTVector = new Hashtable(); //Hashtable to hold the Document Term Vectors
static List<string> wordlist = new List<string>(); //List of terms found in the documents
static Dictionary<double, string> sortedList = new Dictionary<double, string>(); //Documents with their similarity values
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
classify();

var dict = sortedList;
foreach (var x in dict.Reverse())
{
Console.WriteLine("{0} -> Doc{1}", x.Key, x.Value);
}


Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
}

public static List<string> getWordList(List<string> wordlist, string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}

return wordlist;
}

public static void createVector()
{
double[] queryvector;

for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];

for (int i = 0; i < wordlist.Count; i++)
{

double tfIDF = getTF(docs[j], wordlist[i]) * getIDF(wordlist[i]);
queryvector[i] = tfIDF;
}

if (j == 0) //is it a query?
{
DTVector.Add("Query", queryvector);
}
else
{
DTVector.Add(j.ToString(), queryvector);
}
}
}

public static void classify()
{
double temp = 0.0;

IDictionaryEnumerator _enumerator = DTVector.GetEnumerator();

double[] queryvector = new double[wordlist.Count];

Array.Copy((double[])DTVector["Query"], queryvector, wordlist.Count);

while (_enumerator.MoveNext())
{
if (_enumerator.Key.ToString() != "Query")
{
temp = cosinetheta(queryvector, (double[])_enumerator.Value);

sortedList.Add(temp, _enumerator.Key.ToString());

}
}
}

public static double dotproduct(double[] v1, double[] v2)
{
double product = 0.0;
if (v1.Length == v2.Length)
{
for (int i = 0; i < v1.Length; i++)
{
product += v1[i] * v2[i];
}
}
return product;
}

public static double vectorlength(double[] vector)
{
double length = 0.0;
for (int i = 0; i < vector.Length; i++)
{
length += Math.Pow(vector[i], 2);
}

return Math.Sqrt(length);
}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;


foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;

}

private static double getIDF(string term)
{
double df = 0.0;
//get term frequency of all of the sentences except for the query
for (int i = 1; i < docs.Length; i++)
{
if (docs[i].Contains(term))
{
df++;
}
}

//Get sentence count
double D = docs.Length - 1; //excluding the query

double IDF = 0.0;

if (df > 0)
{
IDF = Math.Log(D / df);
}

return IDF;
}

public static double cosinetheta(double[] v1, double[] v2)
{
double lengthV1 = vectorlength(v1);
double lengthV2 = vectorlength(v2);

double dotprod = dotproduct(v1, v2);

return dotprod / (lengthV1 * lengthV2);

}
}
}

Friday, July 3, 2009

Extract Hyperlinks Using WebBrowser Object in C#

In my previous post, I mentioned using regular expressions to extract links from a web page. While I like using regular expressions, there is a much easier way of doing this in C#. All you need is a 'WebBrowser' object to read a webpage and extract all of the hyperlinks present in it. However, you need to create a Windows Forms application to use the WebBrowser object and then add the following code (on a button click event).

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;

namespace WebbrowserTest
{
public partial class WebBrowserSample : Form
{

public WebBrowserSample()
{
InitializeComponent();
}

private void button1_Click(object sender, EventArgs e)
{
WebBrowser web = new WebBrowser();
web.NewWindow += new CancelEventHandler(web_NewWindow);
web.Navigate("http://www.cnn.com");

//wait for the browser object to load the page
while (web.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}

HtmlElementCollection ht = web.Document.Links;
List<string> urls = new List<string>();
foreach (HtmlElement h in ht)
{
try
{
if ((h.GetAttribute("href") != null) )
{
urls.Add(h.GetAttribute("href"));
}
}
catch
{ }
}

foreach (string url in urls)
{
Console.WriteLine(url);
}



}

void web_NewWindow(object sender, CancelEventArgs e)
{
e.Cancel = true;
}
}
}



The 'NewWindow' event is used to discard popups which may open while navigating a web page. The 'Document.Links' property of the WebBrowser is used to get the HtmlElementCollection of all of the links present in a webpage. Once you have this collection of links, they can be processed/viewed using the 'href' attribute of each 'HtmlElement' in the collection.

Even though this is pretty straightforward code, I found it a little slow. I think the reason is that you have to make sure the WebBrowser object has read the entire webpage before you can start using it (achieved by the 'while (web.ReadyState != WebBrowserReadyState.Complete)' loop).

Also, if you are thinking about link processing in a non-Windows Forms application, then this program is not for you. For all the Windows Forms application developers, enjoy the code :)

Saturday, April 18, 2009

A simple web crawler

It is becoming difficult to write more about what I have been reading/researching related to Information Retrieval, but IR continues to be my area of interest. There is tons of literature out there, and each research paper that I read brings with it more insight into IR. That made me want to test the various theories and algorithms mentioned in those papers. And to test them, the first thing I needed was data (web documents to be precise). Now, crawling is in itself a very important area of research. One could opt to write a very efficient crawler, but I am a full-time student who works part-time to pay his bills, and on top of that I have to start working on my dissertation as well. So there was no way I could allocate more time to writing a crawler in order to test the various theories.

Instead I chose to write a very simple crawler. A simple crawler can just make use of the link structure of the web. And since I am not the only one who thinks that or does that, I thought it was a good place to start. What I really needed was to extract < a > tags from a given web page. The following regular expression allowed me to do that:

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
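
As a quick, illustrative test of the expression (assuming tag is set to "a" and System.Text.RegularExpressions is imported), the URL ends up in the first capture group:

string tag = "a";
Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Match m = extractTags.Match("<a href=\"http://www.techcrunch.com/about\">About</a>");
Console.WriteLine(m.Groups[1].Value); // prints http://www.techcrunch.com/about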


The main function of the crawler loops through the list of predefined pages, calling the addtoindex function on each one. That function can be used to store links and their text, but for the sake of this post it just prints the URL. The crawler then uses the regular expression mentioned above to get all the links on each page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.

Here is the complete code in C#:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Net;
using System.IO;

namespace Crawler
{
class Program
{
private static String strText;
static MatchCollection tagCollection;
public static HttpWebRequest req;
public static HttpWebResponse res;
static Stream resStream;
public static string baseUrl;

static void Main(string[] args)
{
//add the specific site that you want to crawl
baseUrl = "http://www.techcrunch.com/";

ArrayList pages = new ArrayList();
pages.Add(baseUrl);

//start crawling
crawl(pages, 20);

Console.WriteLine("/nIndexing Complete!!");
Console.ReadLine();
}

public static void crawl(ArrayList pages, int depth)
{
MatchCollection mc;
ArrayList links = new ArrayList();

//Breadth-first search algorithm to crawl the pages and collect links
for (int i = 0; i < depth; i++)
{
ArrayList newpages = new ArrayList();

foreach (String page in pages)
{
try
{
if (isValidUrl(page))
{
urlOpen();
}
}
catch (Exception ex)
{
System.Console.WriteLine("Couldnot open {0} because {1}", page, ex.ToString());
continue;
}

string pagecontent = read();

//adding the page in the index
addtoindex(page, pagecontent);

mc = tagList(pagecontent, "a");

links = getAttributeValue(mc, "href", baseUrl);

foreach (string link in links)
{
String url, linktext;
url = linktext = null;


if (link.Contains("#"))
{
try
{
url = link;

}
catch (Exception ex)
{
Console.WriteLine("Error in Crawl " + ex.Message + " - " + url);
}
}
else
{
url = link;
}

try
{
if ((url.Substring(0, 4) == "http") && (isindexed(url) == false))
{

newpages.Add(url);
}

}
catch (Exception ex)
{
Console.WriteLine("Couldnot add new page " + url + " b/c {0}", ex.ToString());
}
linktext = gettextonly(pagecontent);
}

}
pages = newpages;

}
}

//Returns false for now, but can be modified to query a database to check whether a page has already been indexed
public static bool isindexed(string url)
{
return false;
}

//Add page to the index, this is where a database or file system can be used
public static void addtoindex(string url, string pagecontent)
{

Console.WriteLine("Indexing : " + url);

}

//Get the collection of < a > tags in a page
public static MatchCollection tagList(String HTMLcontent, String tag)
{

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
try
{
tagCollection = extractTags.Matches(HTMLcontent);

return tagCollection;
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
return null;
}

//Gets the HREF value from each <> tag
public static ArrayList getAttributeValue(MatchCollection mc, String Attr, string url)
{
ArrayList links = new ArrayList();//{ ""};

foreach (Match match in mc)
{

string temp = match.Value;

try
{
if (temp.Contains("http"))
{
links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - temp.IndexOf("http") - 1));

}

else if (temp.Contains("://"))
{
links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7)));
}
else
{
string strTemp = temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7));
url.Replace("\n\r", "");
if (strTemp[0] != '/' && url[url.Length - 1] != '/')
{
strTemp = url + "/" + strTemp;
}
else
{
strTemp = url + strTemp;
}
links.Add(strTemp);
}

}
catch (Exception ex)
{
Console.WriteLine("Error in GetAttributes :" + ex.Message + " - " + url);
}

}
return links;

}

//strips the HTML tags from the page content and returns the text only
public static string gettextonly(string pagecontent)
{

string pattern = @"<(.|\n)*?>";
return Regex.Replace(pagecontent, pattern, String.Empty);

}


public static String read()
{
StreamReader sr = new StreamReader(resStream);
strText = sr.ReadToEnd();
return strText;
}

public static void urlOpen()
{
resStream = res.GetResponseStream();
}

public static bool isValidUrl(String url)
{
try
{
req = (HttpWebRequest)HttpWebRequest.Create(url);
res = (HttpWebResponse)req.GetResponse();
return (res.StatusCode == HttpStatusCode.OK);
}
catch (Exception ex)
{
Console.WriteLine("Error in ISValidURL " + ex.Message + " - " + url);
return false;
}

}
}
}


Any suggestions are welcome.

Friday, March 6, 2009

Peter Norvig; "we don't think it's a big advance to be able to type something as a question as opposed to keywords."

Some time ago I found myself in a healthy argument with a friend of mine about the future of search. Both of us are crazy about information retrieval, and naturally our discussions more often than not drift towards search. While we usually agree with each other's views, today I was shocked to find a major difference in our opinions.

I sincerely believe that any breakthrough in search will have something to do with Natural Language Processing. However, my friend thinks that will not be the case. This was not at all a big deal. But what shocked me was the evidence he used... Peter Norvig.

Now I am a big fan of Peter Norvig, but while I am crazy about all things Peter Norvig, I was disappointed to read what he said in his interview with Technology Review:

"TR: Companies such as Ask and Powerset are betting that the future is in natural-­language search, which lets people use real, useful sentences instead of potentially ambiguous keywords. What is Google doing with natural language?

PN: We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing "What is the capital of France?" won't get you better results than typing "capital of France." But understanding how words go together is important. To give some examples, "New York" is different from "York," but "Vegas" is the same as "Las Vegas," and "Jersey" may or may not be the same as "New Jersey." That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."

Apparently, Peter Norvig thinks that typing complete questions will just be a change in interface. I seriously doubt it. These days, I hardly use single-keyword searches. I like Powerset and really wish that I could google something much more intelligent than a few keywords. But if the director of research at Google doesn't think that this will bring any big advancement to search, then maybe it is just me. I am sure Peter has substantial data to back his comments, but I for one would like to type in a question (and not something as simple as "Who is the 44th president of the USA?") and expect a relevant answer.

But that's just me... The interview was conducted in 2008; here is hoping that since then Peter has given a second thought to some serious NLP usage for search.

Monday, February 2, 2009

Information Retrieval - A word about Word Distribution

Word distributions
Words are not distributed evenly in documents. The same goes for letters of the alphabet, city sizes, wealth, etc. Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people, or it takes 80% of the effort to build the easier 20% of the system). For example, in Shakespeare's famous play Romeo and Juliet we find the following word frequencies:

Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; ........ Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63;…..................
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; ...

Stop words
There are 250-300 very common words in English that can account for 50% or more of a given text.
For example, "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" account for another 10%; the next 12 words for another 10%.

For example, Moby Dick Ch. 1 has 859 unique words (types) and 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (> 50%).
The token/type ratio is therefore 2256/859 = 2.63.
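
As an illustration, the token and type counts for any plain-text file can be computed in a few lines of C# (a minimal sketch; the file name is hypothetical, and System.IO, System.Linq and System.Text.RegularExpressions are required):

string text = File.ReadAllText("mobydick_ch1.txt"); // hypothetical input file
string[] tokens = Regex.Split(text.ToLower(), @"\W+").Where(t => t.Length > 0).ToArray();
int types = tokens.Distinct().Count();
Console.WriteLine("Tokens: {0}, Types: {1}, Token/Type ratio: {2:F2}", tokens.Length, types, (double)tokens.Length / types);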

Zipf’s law

Rank x Frequency = Constant

Zipf's law is fairly general. It states that if t1 is the most common term in the collection, t2 the next most common, and so on, then the collection frequency (cf) of the i-th most common term is proportional to 1/i:

cf(i) ∝ 1/i


Heaps' law
A better way of getting a handle on the vocabulary size M is Heaps' law, which estimates the vocabulary size as a function of the collection size:

M = k * T^b

where T is the number of tokens in the collection. Typical values for the parameters k and b are 30 ≤ k ≤ 100 and b ≈ 0.5.
Regardless of the values of the parameters for a particular collection, Heaps' law suggests that: (i) the dictionary size will continue to increase with more documents in the collection, rather than reaching a maximum vocabulary size, and (ii) the size of the dictionary will be quite large for large collections.

In English, k is between 10 and 100, b is between 0.4 and 0.6.
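
For example (illustrative values only): with k = 50 and b = 0.5, a collection of T = 1,000,000 tokens would be expected to contain roughly M = 50 * 1,000,000^0.5 = 50 * 1,000 = 50,000 distinct terms.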


IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination:

idf(k) = log2(N / d(k)) + 1 = log2 N - log2 d(k) + 1

N: number of documents
d(k): number of documents containing term k
f(i,k): absolute frequency of term k in document i
w(i,k): weight of term k in document i, typically w(i,k) = f(i,k) * idf(k)
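
A minimal sketch of the idf computation in C# (my own illustration; docs is assumed to be an array of document strings as in the other posts, and System.Linq is required):

private static double GetIDF(string[] docs, string term)
{
    // d(k): number of documents that contain the term
    double dk = docs.Count(d => d.Contains(term));
    if (dk == 0) return 0.0;

    // idf(k) = log2(N / d(k)) + 1
    return Math.Log(docs.Length / dk, 2) + 1;
}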


Vector-based matching

A popular measure of similarity for text clustering is the cosine of the angle between two vectors. The cosine measure is given by

s(xa, xb) = (xa · xb) / (||xa|| * ||xb||)

and captures a scale-invariant notion of similarity. An even stronger property is that the cosine similarity does not depend on the vector lengths: s(α·xa, xb) = s(xa, xb) for α > 0. This allows documents with the same composition but different totals to be treated identically, which makes this the most popular measure for text documents. Also, due to this property, samples can be normalized to the unit sphere for more efficient processing.

References
Introduction to Information Retrieval, Ch. 5: Index Compression

Friday, January 30, 2009

Information Retrieval - Document Preprocessing

In this post I will touch briefly on document preprocessing and indexing concepts related to IR.

Document Preprocessing
The content of a webpage read by the crawler has to be converted into tokens before an index can be created for the keywords. This is challenging because at this step we have to deal with various formatting and encoding issues. For example, what delimiter do we use to separate words? Tokenizing on white space seems to work fine for "Paul's", "555-1212" or "can't". But what about "Willow Dr.", "Dr. Willow", "New York", or "ad hoc"? A white-space delimiter will create separate tokens for each of them, which is not desirable.

Tokenization becomes even more challenging when you consider a string like “The New York-Los Angeles flight”. Other languages such as German or Korean have even more complex tokenization issues.

Normalization
We need a systematic way of ensuring that the index structure is suitable for general-purpose querying. For instance, "new york", "New York" and "NEW YORK" shouldn't be treated differently. There are various ways to achieve this normalization:

  • Casing (cat vs. CAT)
  • Stemming (computer, computation)
  • Soundex
  • Spell Checker or synonyms (Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi)
  • Index reduction
  • Dropping stop words (“and”, “of”, “to”); this is problematic for sentences like “to be or not to be”
Porter’s algorithm
Porter's algorithm can be used for stemming - the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form (Wikipedia).
The rules in the Porter algorithm are separated into five distinct phases, numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially, one after the other, as commands in a program.

The complete description of Porter's algorithm can be found here.

Example: the word “duplicatable” will be reduced to "duplic" by Porter's algorithm.

The Soundex algorithm (Odell and Russell)
This algorithm is used for spelling correction of names: it is a hash function that maps similar-sounding names to the same code, and the mapping is non-recoverable (the original name cannot be reconstructed from its code).

The algorithm is as follows:

1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
b, f, p, v : 1
c, g, j, k, q, s, x, z : 2
d, t : 3
l : 4
m, n : 5
r : 6

3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” (a letter followed by three digits) by adding terminal zeros or by dropping the rightmost digits
Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
These are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd.
Some problem cases: Rogers and Rodgers, Sinclair and StClair
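
Here is a minimal C# sketch of the algorithm as described above (my own illustration, not production code; it assumes the input contains only letters):

public static string Soundex(string name)
{
    // Soundex digit for each letter a-z ('0' = vowel or h, w, y, which are dropped)
    string codes = "01230120022455012623010202";

    name = name.ToLower();
    string result = name[0].ToString().ToUpper(); // retain the first letter
    char lastCode = codes[name[0] - 'a'];

    for (int i = 1; i < name.Length && result.Length < 4; i++)
    {
        char code = codes[name[i] - 'a'];

        // skip dropped letters; omit letters with the same code as the preceding letter
        if (code != '0' && code != lastCode)
            result += code;

        lastCode = code;
    }

    // convert to the form "LDDD" by adding terminal zeros
    return result.PadRight(4, '0');
}

With these rules, Soundex("Euler") gives E460 and Soundex("Gauss") gives G200, matching the examples above.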

Storage issues
As mentioned in the last post, a medium-sized collection with about 3,000,000 documents and 50,000 terms results in a very large term-document matrix, so we have to come up with a better data structure. The inverted index is the basic structure that resolves the issue.

Inverted index
Instead of an incidence vector, use a posting table. We can use a Dictionary object and represent the term-document relationship as term-value pairs: the term acts as a unique key, and the IDs of the documents in which the term appears are the values of each pair. For example,

CLEVELAND: D1, D2, D6
OHIO: D1, D5, D6, D7

We use linked lists to be able to insert new document postings in order and to remove existing postings. We need to keep everything sorted! This gives us a logarithmic improvement in access.
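
A minimal sketch of such a posting table in C# (an illustration only; the post mentions linked lists, but for brevity this sketch keeps each postings list as a sorted List<int> of document IDs):

Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();

public void AddPosting(string term, int docId)
{
    if (!index.ContainsKey(term))
        index[term] = new List<int>();

    List<int> postings = index[term];
    int pos = postings.BinarySearch(docId);
    if (pos < 0)
        postings.Insert(~pos, docId); // insert in sorted order, skipping duplicates
}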

Basic operations on inverted indexes
Conjunction (AND) – iterative merge of the two postings lists: O(x+y)
Disjunction (OR) – very similar
Negation (NOT)

Example: MICHIGAN AND NOT OHIO
Example: MICHIGAN OR NOT OHIO

For recursive operations we should start with the smallest sets for better optimization.
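
The AND operation, for example, can be implemented as a linear merge of two sorted postings lists, which is where the O(x+y) bound comes from (a minimal sketch, assuming the lists are sorted document-ID lists as above):

public static List<int> Intersect(List<int> p1, List<int> p2)
{
    List<int> answer = new List<int>();
    int i = 0, j = 0;

    // walk both sorted lists in parallel, keeping only the common document IDs
    while (i < p1.Count && j < p2.Count)
    {
        if (p1[i] == p2[j]) { answer.Add(p1[i]); i++; j++; }
        else if (p1[i] < p2[j]) i++;
        else j++;
    }
    return answer;
}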

That sums up the second post on Information Retrieval. In the next post, I will discuss word distribution issues in IR.

Sunday, January 18, 2009

Information Retrieval - Introduction

This is the first in the series of blog posts related to Information Retrieval.

Information Retrieval
So what is Information Retrieval (IR)? IR can be defined as the indexing and retrieval of textual documents. It can be defined more thoroughly as follows:

"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." [1]

What does it take to build a search engine?
The task of search engines looks trivial. All that search engines (Google, Yahoo, Live.com) have to do is crawl the web, index the web pages, read the text and make links to the web pages available upon a user's request.

But if you ponder over it, this simple task breaks down into complex subtasks, which are as follows:
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities
A lot of research has been done on each of these subtasks. But that is not enough. There are other things which need to be dealt with to create an 'efficient' search engine. Some of them are:
  • Understand the structure of the web for efficient crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance
Document representations
Search engines display relevant links to webpages (documents) based on the keywords typed by the user. The retrieved documents usually (though not necessarily) contain those keywords. In order to carry out the task of finding keywords in the documents, there must be some sort of data structure that holds the word-document relationship. One way to achieve this is to have a term-document matrix (m x n).

Another important data structure that comes in handy is a document-document matrix (n x n). However, space limitations and time complexity have to be taken into consideration before implementing these matrices.

However, these matrices don't scale very well. A typical medium-sized collection has something like 3,000,000 documents (n) and 50,000 terms (m), so you can imagine how big an (m x n) matrix can be. In the case of web documents you will typically have n = 30,000,000,000 and m = 1,000,000. Creating an (m x n) matrix and then using it could shoot the time complexity sky high.

Boolean vs. integer-valued matrices
There is also the question of whether you should use boolean values (true/false) or integer values (term frequency counts) to represent the document-word relationship in the matrix.

Major IR models
Assuming that the term-document relationship has been efficiently implemented, the next task is to come up with retrieval models that will be used to query the matrices and retrieve the most relevant documents. These models are known as IR models. Some of the most common models are as follows:
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing
Boolean queries
Boolean IR Model is perhaps one of the simplest to implement. It uses boolean operators: AND, OR, NOT, parentheses, etc. For example, typical search queries will be something like:
CLEVELAND AND NOT OHIO
(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)

Vector queries
Each document is represented as a vector. Non-binary weights allow for partial matches. These term weights are used to compute a degree of similarity between the query and each document. The resulting ranked set of documents provides better matching.

Probabilistic Model
Captures the IR problem using a probabilistic framework and improves the ranking through iteration.

The matching process
Matching is done between a document and a query (or between two documents). Distance or similarity measures are used to determine which set of documents satisfies a given query. The common measures used are Euclidean distance, Manhattan distance, word overlap, the Jaccard coefficient, etc.
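
As a small illustration, the Jaccard coefficient between two documents can be computed over their word sets in a few lines of C# (a sketch only, assuming simple whitespace tokenization; System.Linq is required):

public static double Jaccard(string doc1, string doc2)
{
    var words1 = doc1.ToLower().Split(' ').Distinct();
    var words2 = doc2.ToLower().Split(' ').Distinct();

    double common = words1.Intersect(words2).Count();
    double total = words1.Union(words2).Count();

    return common / total; // 1.0 = identical word sets, 0.0 = no overlap
}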

Well this sums up the introductory post on IR. In the next post we will look into more details of some of the components of IR mentioned in this post, especially indexing. So stay tuned.

References
[1] http://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html

Information Retrieval and Web Mining

Last semester I had to select a research topic for my Computer Seminar course. The topic I selected was "Personalized Search". I have always been interested in Information Retrieval and hence thought that it was a perfect opportunity to learn more about this field.

Even before I could start doing research on personalization, I had to do a lot of reading and research related to information retrieval and web mining. In the process, I learned a lot about IR. However, I hardly got the chance to practice what I learned. Hence, I have decided to blog about what I learned related to IR every week... well, almost every week.

So why blog about this? Well, this will help me revise what I learned and maybe learn something new.

I hope that other people will find it helpful as well.

Sunday, January 4, 2009

Simple IsNum() function in C#

I desperately needed a C# equivalent of "IsNum". I was surprised that there isn't one, so I decided to write my own. There are many ways to write such a function. However, my task required that I be able to verify any type of number (int, double, etc.) and get a True/False return value. Below is the code that I am using to solve this problem.

public static bool IsNumeric(object Expression)
{

// Variable to collect the Return value of the TryParse method.

bool isNum;


// Define variable to collect out parameter of the TryParse method. If the conversion fails, the out parameter is zero.

double retNum;


// The TryParse method converts a string in a specified style and culture-specific format to its double-precision floating point number equivalent.

// The TryParse method does not generate an exception if the conversion fails. If the conversion passes, True is returned. If it does not, False is returned. Hence no need for try..catch block

isNum = Double.TryParse(Convert.ToString(Expression), System.Globalization.NumberStyles.Any, System.Globalization.NumberFormatInfo.InvariantInfo, out retNum);

return isNum;

}
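
A quick usage example (illustrative values):

Console.WriteLine(IsNumeric("3.14"));  // True
Console.WriteLine(IsNumeric(42));      // True
Console.WriteLine(IsNumeric("42abc")); // False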