Thursday, October 15, 2009

Text Summarization using Vector Space Model


The rapid growth of World Wide Web has resulted in information overload. One cannot possibly read every news article, document, paper, etc that is out there on the web. That is why text summarization has become essential in this information age. Even the search engines show the summary of each of the search results to help users select the most appropriate link.

However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using Vector Space Model.

We will first extract the sentences from the document. The summary generated for each document will actually consist of sentences that are extracted from it. We will try to pick the sentences are more representative of the document than others using the VSM.

In the previous examples, there were two main components: list of documents and the query. We then assumed that there exist a string array "docs" which contains the query's (at zero index) and the documents' text. We are going to use the same data structure ("docs") but this time instead of the query we will have the document's text (which we are trying to summarize) at the zero index and the various sentences that we extracted from the document in the remaining indices.

We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.

The code from VSM example is almost the same, the only new function is the sentence extractor that I had to write for this example. The sentence extractor is created using the following heuristics:

1. If the character is "?" or "!" then mark this as the end of the sentence
2. If the character is "." the check that it is not preceded by "Mr", "Mrs", "Dr", etc
  • if Yes, then don't mark this as the end of the sentence
  • if No, then check that "." is not followed by a number
    • if Yes, then don't mark this as the end of the sentence
    • if No, then mark this as the end of the sentence

The program requires a valid url address from which it will read the item and display the Top 5 sentence that it thinks best summarizes the document. For this simple program, I assume that "p" tags are used to display paragraphs.

As can be seen from the image above, the summary produced by this simple method is not too bad :)

Here is the code.

Saturday, October 10, 2009

Extracting sentences from a document

Extracting words from a document is quite common and a fairly easy task. And using variations of Regex.Split(documentText, @"\W") one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted to extract the first and the last sentence but I thought it would be easier just to split the entire document into sentences at first using split function as follows:

char[] delimiter={'.','?','!'};
string[] sentences = document.Split(delimiter);

These two lines of code will work on most of the cases since '?','!' and '.' are usually the characters that represent the end of a sentence. However, using the period (.) can lead to erroneous results since dot '.' can be used to represent a decimal number ( 3.5 , 100.03, etc) or abbreviations (Mr., Mrs., Dr., etc).

Unfortunately, the corpora that I was using has a lot of instances of decimal numbers and abbreviations. So I have to come up with a solution. After unsuccessfully trying to Google (or Bing) for the some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpora but it may require some modifications to work for other types document.

The main function is 'parseSenteces()' that splits the document into sentences. It requires that the entire document text be sent as a string parameter.

The 'checkInitils()' function is called when the the program comes across a period. This function checks whether the character before the period represent any initial (like Dr., Miss., Mrs., Mr.) or not and returns a boolean value upon its finding.

The 'checkNum' function checks whether the character before the encountered period are numbers or not.

If the 'checkInitials()' and 'checkNum()' returns false then that represent end of sentence, and that sentence is added in a List. If any of them returns true, then the period is considered a part of sentence and the counter moves on to check the next character.

Here is the code:

private void parseSentences(string document)
{
string para = document;
string sentence = ""; //keeps track of current sentence
List words = new List(); //stores different sentences
int counter = 0; //keeps track of index of the character being read

CharEnumerator charEnum = para.GetEnumerator();//Enumerator to read each character
while (charEnum.MoveNext())
{
char currChar = para[counter]; //holds the current character being read

if (currChar == '.')
{
//check if the character before period are numbers
if (checkNum(para, currChar, counter))
{
//if not then period represents a number and not end of sentence
sentence += currChar.ToString();
}

//check if the character before period are initials of some sort
else if (checkInitial(para, currChar, counter))
{
sentence += currChar.ToString();
}
else
{
//add the sentence to the list
words.Add(sentence);
sentence = "";
}
}
else if (currChar == ' ')
{

sentence += currChar.ToString();

}

else if ((currChar == '!') || (currChar == '?'))
{
words.Add(sentence);
sentence = "";

}

else
{ sentence += currChar.ToString(); }
counter++;
}

if (sentence != "")
{ words.Add(sentence); }

string list = "";

foreach (string s in words)
{
Console.WriteLine(s);
}
}

private bool checkNum(string para, char currChar, int counter)
{ //space means not a decimal number

try
{
if (para[counter + 1] == ' ')
return false;
} catch(Exception)
{ return false; }

char currNext= para[counter+1];

if ((currNext >= '0' & currNext <= '9'))
{ char currPrev=para[counter-1];

if ((currPrev >= '0' & currPrev <= '9'))
return true;
}
return false;
}

private bool checkInitial(string para, char currChar, int counter)
{
string word1 = ""; string word2 = "";

for (int i = 3; i > 0; i--)
{
try
{
if (i < 3)
{
word1 += para[counter - i].ToString();
}
word2 += para[counter - i].ToString();
}
catch(Exception e)
{
Console.WriteLine(e.toString());
}

}

if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))
{
return true;
}
return false;
}

Monday, September 7, 2009

Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#

LSI is used in a variety of information retrieval and text processing applications. It uses uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

Tutorials for LSI and SVD can be found online. I will use Bluebit .Net Matrix Library to calculate SVD and other matrix and vector operations needed in LSI.

I implemented LSI in C# using the example described in Mi Ishlita. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by the similarity values. Like in the previous example, the program assumes that there is a string array "docs" which contains the query's (at zero index) and the documents' text.

To use the code (below), you need to download and install .NET Matrix Library (NML™) 4.3 and then reference .NET Matrix Library from within your project using the following instructions:
  1. Open the Visual Studio a and open a new Visual Basic or C# project.
  2. Select Project | Add Reference...
  3. Select the .NET tab and find Bluebit .NET Matrix Library listed, double click on it and then press OK.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Bluebit.MatrixLibrary;
using System.Collections;
using System.Text.RegularExpressions;

namespace LSI
{
class Program
{
static double[,] A; //term-document Array
static double[] q; //term-query Array
static List wordlist = new List(); //List of terms found in documents
static SortedDictionary sortedList = new SortedDictionary(); //Documents ranked by VSM with angle value
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
LatentSemanticIndexing();


foreach (KeyValuePair kvp in sortedList)
{

Console.WriteLine(kvp.Value + " " + kvp.Key);
}
Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
wordlist.Sort(); //sort the wordlist alphabetically
}

public static List getWordList(List wordlist, string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}


return wordlist;
}

public static void createVector()
{
double[] queryvector;
q = new double[wordlist.Count];
A = new double[wordlist.Count, docs.Length-1];
for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];

for (int i = 0; i < wordlist.Count; i++)
{
//calculate Term Frequency
double tf = getTF(docs[j], wordlist[i]);

if (j == 0) //if Query term then add it to query array
{
q[i] = tf;
}
else //if document term then add it to document array
{
A[i, j-1] = tf;
}
}

}

}

private static void LatentSemanticIndexing()
{
//Singular Value Decomposition
Matrix docMatrix = new Matrix(A);
SVD svd = new SVD(docMatrix);

//A = U S VT
Matrix U = svd.U;
Matrix S = svd.S;
Matrix V = svd.V;
Matrix VT = svd.VH;

//Dimensionality Reduction: Computing Uk, Sk, Vk and VkT
Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);
Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);
Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);
Matrix VkT = Vk.Transpose();


//q = qT Uk Sk-1
Matrix queryMatrix = new Matrix(q,q.Length,1);
queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();

//sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities
for (int i = 0; i < V.Rows; i++)
{
Vector docVector = Vk.RowVector(i);
Vector queryVector = queryMatrix.RowVector(0);
double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);

Console.WriteLine("Doc " + (i + 1).ToString() + " :" + sim);
}



}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;


foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;

}


}
}

After you have sorted the results in descending order using the similarity values, your results will be as follows:
d2>d3>d2

Monday, August 17, 2009

Classic Vector Space Model in C#

Vector Space Model (VSM) is widely used in Information Retrieval systems. The model creates a space in which both documents and queries are represented by vectors. VSM treats the query as a document as well and then it determines similarity between query and other documents by calculating the deviation of angles between each document vector and the original query vector. It then uses the deviation to rank documents.

I implemented the model in C# using the example described in Mi Islita. The program will display elements of a dictionary object that hold the name of each document and its similarity value.

You can use any sorting algorithm to sort the dictionary by the similarity values. The program also assumes that there is a string array "docs" which contains the query's (at zero index) and the documents' text.

Here is the code:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;

namespace VectorSpaceModel
{
class Program
{
static Hashtable DTVector = new Hashtable(); //Hashtable to hold Document Term Vector
static List wordlist = new List(); //List of terms found in documents
static SortedDictionary sortedList= new SortedDictionary(); //Documents ranked by VSM with angle value
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
classify();

foreach (KeyValuePair kvp in sortedList)
{

Console.WriteLine(kvp.Value + " " + kvp.Key);
}
Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist,doc);
}
}

public static List getWordList(List wordlist,string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}

return wordlist;
}

public static void createVector()
{
double[] queryvector;

for (int j = 0; j < queryvector =" new" i =" 0;" tfidf =" getTF(docs[j]," j ="="" queryterms =" Regex.Split(document," count =" 0;" t ="="" df="0.0;" i =" 1;" d =" docs.Length" idf="0.0;"> 0)
{
IDF = Math.Log(D / df);
}

return IDF;
}

public static void classify()
{
double temp = 0.0;

IDictionaryEnumerator _enumerator = DTVector.GetEnumerator();

double[] queryvector = new double[wordlist.Count];

Array.Copy((double[])DTVector["Query"], queryvector, wordlist.Count);

while (_enumerator.MoveNext())
{
if (_enumerator.Key.ToString() != "Query")
{
temp = cosinetheta(queryvector, (double[])_enumerator.Value);

sortedList.Add(temp,_enumerator.Key.ToString());

}
}
}

public static double dotproduct(double[] v1, double[] v2)
{
double product = 0.0;
if (v1.Length == v2.Length)
{
for (int i = 0; i < v1.Length; i++)
{
product += v1[i] * v2[i];
}
}
return product;
}

public static double vectorlength(double[] vector)
{
double length = 0.0;
for (int i = 0; i < vector.Length; i++)
{
length += Math.Pow(vector[i], 2);
}

return Math.Sqrt(length);
}

public static double cosinetheta(double[] v1, double[] v2)
{
double lengthV1 = vectorlength(v1);
double lengthV2 = vectorlength(v2);

double dotprod = dotproduct(v1, v2);

return dotprod / (lengthV1 * lengthV2);

}
}
}

Friday, July 3, 2009

Extract Hyperlinks Using WebBrowser Object in C#

In my previous post, I mentioned using regular expressions to extract links from a web page. While I like using regular expressions, there is a much easier way of doing this in C#.net. All that you need is a 'WebBrowser' object to read a webpage and extracting all of the hyperlinks present in it. However, you need to create Windows Forms applications to use the WebBrowser object and then add the following code (on a button click event).

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;

namespace WebbrowserTest
{
public partial class WebBrowserSample : Form
{

public WebBrowserSample()
{
InitializeComponent();
}

private void button1_Click(object sender, EventArgs e)
{
WebBrowser web = new WebBrowser();
web.NewWindow += new CancelEventHandler(web_NewWindow);
web.Navigate("http://www.cnn.com");

//wait for the browser object to load the page
while (web.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}

HtmlElementCollection ht = web.Document.Links;
List urls = new List();
foreach (HtmlElement h in ht)
{
try
{
if ((h.GetAttribute("href") != null) )
{
urls.Add(h.GetAttribute("href"));
}
}
catch
{ }
}

foreach (string url in urls)
{
Console.WriteLine(url);
}



}

void web_NewWindow(object sender, CancelEventArgs e)
{
e.Cancel = true;
}
}
}



The 'NewWindow' event is used to discard popups which may open while navigating a web page. The 'Document.Links' property of the WebBrowser is used to get the HtmlCollection of all of the links present in a webpage. Once you have the HtmlCollection of links, the links could be processes/viewed using 'href' attribute of each 'HtmlElement' in the 'HtmlCollection'.

Even though this is a pretty straight forward code, I found it a little slower. I think the reason is that you have to make sure that the webbrowser object has read the entire webpage before you can start using it (achieved by the while '(web.ReadyState != WebBrowserReadyState.Complete)' condition).

Also if you are thinking about link processing in a non-Window Forms application, then this program is not for you. For all the Window Forms application developers, enjoy the code :)

Saturday, April 18, 2009

A simple web crawler

it is becoming difficult to write more about what I have been reading/researching related to Information Retrieval but IR continues to be my area of interest. There is tons of literature out there and each research paper that I read brings with it more insight about IR. That made me want to test the various theories and algorithms mentioned in those papers. And to test them, first thing that I needed was data (web documents to be precise). Now, crawling is in itself a very important area of research. One could opt to write a very efficient crawler but I am a full-time student who works part-time to pay his bills and on top of that I have to start working on my dissertation as well. So there was no way I could allocate more time writing a crawler in order to test the various theories.

Instead I chose to write a very simple crawler. A simple crawler could just make use of the link structure of the web. And since I am not the only one who thinks that or does that, I thought it is a good step to start with. What I really needed was to extract < a > tags from a given web page. The following regular expression allowed me to do that:

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);


The main function of the crawler that I wrote loops through the list of predefined pages, calling addtoindex function on each one. The function can be used to store links and their text but for the sake of this post it just prints the URL. It then uses the regular expression that I mentioned above to get all the links on that page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.

Here is the complete code in C#:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Net;
using System.IO;

namespace Crawler
{
class Program
{
private static String strText;
static MatchCollection tagCollection;
public static HttpWebRequest req;
public static HttpWebResponse res;
static Stream resStream;
public static string baseUrl;

static void Main(string[] args)
{
//add the specific site that you want to crawl
baseUrl = "http://www.techcrunch.com/";

ArrayList pages = new ArrayList();
pages.Add(baseUrl);

//start crawling
crawl(pages, 20);

Console.WriteLine("/nIndexing Complete!!");
Console.ReadLine();
}

public static void crawl(ArrayList pages, int depth)
{
MatchCollection mc;
ArrayList links = new ArrayList();

//Breadth-first search algorithm to crawl the pages and collect links
for (int i = 0; i < depth; i++)
{
ArrayList newpages = new ArrayList();

foreach (String page in pages)
{
try
{
if (isValidUrl(page))
{
urlOpen();
}
}
catch (Exception ex)
{
System.Console.WriteLine("Couldnot open {0} because {1}", page, ex.ToString());
continue;
}

string pagecontent = read();

//adding the page in the index
addtoindex(page, pagecontent);

mc = tagList(pagecontent, "a");

links = getAttributeValue(mc, "href", baseUrl);

foreach (string link in links)
{
String url, linktext;
url = linktext = null;


if (link.Contains("#"))
{
try
{
url = link;

}
catch (Exception ex)
{
Console.WriteLine("Error in Crawl " + ex.Message + " - " + url);
}
}
else
{
url = link;
}

try
{
if ((url.Substring(0, 4) == "http") && (isindexed(url) == false))
{

newpages.Add(url);
}

}
catch (Exception ex)
{
Console.WriteLine("Couldnot add new page " + url + " b/c {0}", ex.ToString());
}
linktext = gettextonly(pagecontent);
}

}
pages = newpages;

}
}

//Returns false for now, but can be modified to query a database to check whether a page has already been indexed
public static bool isindexed(string url)
{
return false;
}

//Add page to the index, this is where a database or file system can be used
public static void addtoindex(string url, string pagecontent)
{

Console.WriteLine("Indexing : " + url);

}

//Get the collection of < a > tags in a page
public static MatchCollection tagList(String HTMLcontent, String tag)
{

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
try
{
tagCollection = extractTags.Matches(HTMLcontent);

return tagCollection;
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
return null;
}

//Gets the HREF value from each <> tag
public static ArrayList getAttributeValue(MatchCollection mc, String Attr, string url)
{
ArrayList links = new ArrayList();//{ ""};

foreach (Match match in mc)
{

string temp = match.Value;

try
{
if (temp.Contains("http"))
{
links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - temp.IndexOf("http") - 1));

}

else if (temp.Contains("://"))
{
links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7)));
}
else
{
string strTemp = temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7));
url.Replace("\n\r", "");
if (strTemp[0] != '/' && url[url.Length - 1] != '/')
{
strTemp = url + "/" + strTemp;
}
else
{
strTemp = url + strTemp;
}
links.Add(strTemp);
}

}
catch (Exception ex)
{
Console.WriteLine("Error in GetAttributes :" + ex.Message + " - " + url);
}

}
return links;

}

//reads that content of a web page
public static string gettextonly(string pagecontent)
{

string pattern = @"<(.|\n)*?>";
return Regex.Replace(pagecontent, pattern, String.Empty);

}


public static String read()
{
StreamReader sr = new StreamReader(resStream);
strText = sr.ReadToEnd();
return strText;
}

public static void urlOpen()
{
resStream = res.GetResponseStream();
}

public static bool isValidUrl(String url)
{
try
{
req = (HttpWebRequest)HttpWebRequest.Create(url);
res = (HttpWebResponse)req.GetResponse();
return (res.StatusCode == HttpStatusCode.OK);
}
catch (Exception ex)
{
Console.WriteLine("Error in ISValidURL " + ex.Message + " - " + url);
return false;
}

}
}
}


Any suggestions are welcomed.

Friday, March 6, 2009

Peter Norvig; "we don't think it's a big advance to be able to type something as a question as opposed to keywords."

Sometime ago I found myself in a healthy argument with a friend of mine about the future of search. Both of us are crazy about information retrieval and naturally our discussion more than normally diverts towards search. While usually we agree with each other's views, today I was shocked to find a major difference in our opinions.

I sincerely believe that any breakthrough in search will have something to do with Natural Language Processing. However my friend thinks that that will not be the case. This was not at all a big deal. But what shocked me was the evidence he used... Peter Norvig.

Now I am a big fan of Peter Norvig but while i am crazy about all things Peter Norvig, I was disappointed to read what he said in his interverview with Technology Review:

"TR: Companies such as Ask and Powerset are betting that the future is in natural-­language search, which lets people use real, useful sentences instead of potentially ambiguous keywords. What is Google doing with natural language?

PN: We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing "What is the capital of France?" won't get you better results than typing "capital of France." But understanding how words go together is important. To give some examples, "New York" is different from "York," but "Vegas" is the same as "Las Vegas," and "Jersey" may or may not be the same as "New Jersey." That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."

Apparently, Peter Norvig thinks that typing complete questions will just be a change in interface. I seriously doubt it. These days, I hardly use single keyword searches. I like Powerset and really wish that I could google something much more intelligent than a few keywords. But if the director of research at Google doesn't think that this will bring any big advancement to search, then maybe it is just me. I am sure Peter has substantial data to back his comments but I for one would like to type in a question ( and not something as simple as "Who is the 44th president of USA") and expect a relevant answer.

But thats just me... The interview was conducted in 2008, here is hoping that during this time Peter has given a second thought to some serious NLP usage for search.