
Tuesday, November 16, 2010

Using Language Models for Information Retrieval (Python)

Continuing my fondness for Bayes' Theorem, I decided to write a small Python program that crawls a list of URLs and then lets the user search that list. I used the query likelihood language model for information retrieval.

A good explanation of query likelihood can be found in Manning's Introduction to Information Retrieval. I used the example from the "Language models for information retrieval" chapter. Here is an excerpt from that chapter:

In the query likelihood model, we construct from each document d in the collection a language model Md. We rank documents by P(d | q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule, we have:

P(d | q) = P(q | d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored.


In this model, zero probabilities of words can be a problem. This can be solved by smoothing. Smoothing not only avoids zero probabilities but also provides a term-weighting component. More information about smoothing can be found in Manning's book.
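To make this concrete, the ranking rule this all boils down to (following the linear interpolation smoothing used in Manning's chapter) is:

P(q | d) ≈ product over all query terms t of [ λ P(t | Md) + (1 − λ) P(t | Mc) ]

where P(t | Md) is the term's frequency in d divided by the length of d, Mc is a language model estimated from the whole collection, and λ (between 0 and 1) controls the amount of smoothing. Without the collection term, a single query word that never appears in a document would zero out that document's entire score, which is exactly the problem smoothing avoids.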

I implemented the query likelihood model in Python and used BeautifulSoup to parse the crawled URLs. You can download the python code from here. Once you have downloaded the code, you can use it as follows:

1. import LangModel

2. Define a list of URLs you want to search:
u=['http://cnn.com','http://techcrunch.com']

3. Call the crawl function to crawl the list:
LangModel.crawl(u)

4. Type the query you want to search and press the return key

5. The output will show you a ranked list of URLs and the score of each according to the query likelihood model.

6. You can search the list again by using the search function:
LangModel.search('your query')


That is it. Another application of Bayes' Theorem!

Stay tuned for more python code :)

Friday, January 29, 2010

Parsing Robots.txt File

Crawling is an essential part of search engine development. The more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages of the site by identifying those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, you can see what the robots.txt file for CNN looks like here.

While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.

You can find complete details about the format and semantics of a robots.txt file at The Web Robots Pages. Here is a portion from this webpage:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

Using this description, we can easily write code to create a list of disallowed URLs. All we need to do is parse the robots.txt file, splitting it first on "User-agent" and then on "Disallow". This gives us a list that our crawler can check every time it is about to fetch a page.

Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
class Program
{
public static HttpWebRequest req;
public static HttpWebResponse res;
static Stream resStream;

static void Main(string[] args)
{
String baseUrl = "http://www.cnn.com";
baseUrl += "/robots.txt";

getDisallowedUrls(baseUrl);
}

static void getDisallowedUrls(string baseUrl)
{
if (!isValidUrl(baseUrl))
{
return; // don't try to read the robots.txt response if the request failed
}
urlOpen();

String RobotTxtContent = read();

List<String> disallowed = new List<String>(); // List that holds URLs which shouldn't be crawled
String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:");
String userAgents = "";
foreach (String agent in user_agents)
{
if (agent.Trim().StartsWith("*"))
{
userAgents = agent.Trim().Substring(1);
}
}

String[] disallow = Regex.Split(userAgents, "Disallow:");

foreach (String item in disallow)
{
if (item != "\n")
{
disallowed.Add(item.Trim());
Console.WriteLine(baseUrl + item.Trim());
}
}

Console.ReadLine();

}

public static String read()
{
StreamReader sr = new StreamReader(resStream);
String strText = sr.ReadToEnd();
return strText;
}

public static void urlOpen()
{
resStream = res.GetResponseStream();
}

public static bool isValidUrl(String url)
{
try
{
req = (HttpWebRequest)HttpWebRequest.Create(url);
res = (HttpWebResponse)req.GetResponse();
return (res.StatusCode == HttpStatusCode.OK);
}
catch (Exception ex)
{
Console.WriteLine("Not a Valid URL:" + ex.Message + " - " + url);
return false;
}
}
}
}
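The listing above only builds the list of disallowed entries. The actual check described earlier (any URL whose path starts with a Disallow value must not be fetched) is left to the crawler. Here is a minimal sketch of that check that could be dropped into the Program class above; the method name IsAllowed is my own, and it assumes the list holds path prefixes such as "/help/":

public static bool IsAllowed(string path, List<String> disallowed)
{
    // A path may be crawled only if it does not start with any Disallow value.
    // An empty entry means "allow everything", so it is skipped.
    foreach (String rule in disallowed)
    {
        if (rule.Length > 0 && path.StartsWith(rule))
        {
            return false;
        }
    }
    return true;
}

For example, if "/help/" came out of the robots.txt file, IsAllowed("/help/index.html", disallowed) returns false while IsAllowed("/help.html", disallowed) returns true.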

Monday, September 7, 2009

Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#

LSI is used in a variety of information retrieval and text processing applications. It uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

Tutorials for LSI and SVD can be found online. I will use the Bluebit .NET Matrix Library to compute the SVD and the other matrix and vector operations needed for LSI.
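In a nutshell, SVD factors the term-document matrix A as A = U S VT, where the diagonal of S holds the singular values. Keeping only the k largest singular values gives the rank-k approximation Ak = Uk Sk VkT, and the query vector q is folded into the same reduced space as qk = qT Uk Sk-1. Each document (a row of Vk) is then scored by its cosine similarity to qk, which is what the code below does with k set to one less than the full rank.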

I implemented LSI in C# using the example described in Mi Islita. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by similarity value. As in the previous example, the program assumes that there is a string array "docs" which contains the query (at index zero) followed by the documents' text.

To use the code (below), you need to download and install .NET Matrix Library (NML™) 4.3 and then reference .NET Matrix Library from within your project using the following instructions:
  1. Open Visual Studio and create a new Visual Basic or C# project.
  2. Select Project | Add Reference...
  3. Select the .NET tab and find Bluebit .NET Matrix Library listed, double click on it and then press OK.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Bluebit.MatrixLibrary;
using System.Collections;
using System.Text.RegularExpressions;

namespace LSI
{
class Program
{
static double[,] A; //term-document Array
static double[] q; //term-query Array
static List<string> wordlist = new List<string>(); //List of terms found in documents
static SortedDictionary<double, string> sortedList = new SortedDictionary<double, string>(); //Documents ranked by similarity value
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
LatentSemanticIndexing();


foreach (KeyValuePair<double, string> kvp in sortedList)
{

Console.WriteLine(kvp.Value + " " + kvp.Key);
}
Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
wordlist.Sort(); //sort the wordlist alphabetically
}

public static List<string> getWordList(List<string> wordlist, string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}


return wordlist;
}

public static void createVector()
{
double[] queryvector;
q = new double[wordlist.Count];
A = new double[wordlist.Count, docs.Length-1];
for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];

for (int i = 0; i < wordlist.Count; i++)
{
//calculate Term Frequency
double tf = getTF(docs[j], wordlist[i]);

if (j == 0) //if Query term then add it to query array
{
q[i] = tf;
}
else //if document term then add it to document array
{
A[i, j-1] = tf;
}
}

}

}

private static void LatentSemanticIndexing()
{
//Singular Value Decomposition
Matrix docMatrix = new Matrix(A);
SVD svd = new SVD(docMatrix);

//A = U S VT
Matrix U = svd.U;
Matrix S = svd.S;
Matrix V = svd.V;
Matrix VT = svd.VH;

//Dimensionality Reduction: Computing Uk, Sk, Vk and VkT
Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);
Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);
Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);
Matrix VkT = Vk.Transpose();


//q = qT Uk Sk-1
Matrix queryMatrix = new Matrix(q,q.Length,1);
queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();

//sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities
for (int i = 0; i < V.Rows; i++)
{
Vector docVector = Vk.RowVector(i);
Vector queryVector = queryMatrix.RowVector(0);
double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);

//store the result so that Main prints each document with its similarity value
sortedList.Add(sim, "Doc " + (i + 1).ToString());
}



}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;


foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;

}


}
}

After you have sorted the results in descending order using the similarity values, the ranking is as follows:
d2 > d3 > d1

Monday, August 17, 2009

Classic Vector Space Model in C#

Vector Space Model (VSM) is widely used in Information Retrieval systems. The model creates a space in which both documents and queries are represented by vectors. VSM treats the query as a document as well, determines the similarity between the query and the other documents by calculating the deviation of angles between each document vector and the original query vector, and then uses that deviation to rank the documents.
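Concretely, the query and each document are represented as vectors of tf-idf weights, w(t, d) = tf(t, d) * log(D / df(t)), where D is the number of documents and df(t) is the number of documents containing term t. Documents are then ranked by the cosine of the angle between the query vector and each document vector:

sim(q, d) = (q . d) / (|q| |d|)

This is what the cosinetheta function in the code below computes, using the dotproduct and vectorlength helpers.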

I implemented the model in C# using the example described in Mi Islita. The program will display the elements of a dictionary object that holds the name of each document and its similarity value.

You can use any sorting algorithm to sort the dictionary by the similarity values. The program also assumes that there is a string array "docs" which contains the query (at index zero) followed by the documents' text.

Here is the code:


using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Linq;
namespace VectorSpaceModel
{
class Program
{
static Hashtable DTVector = new Hashtable(); //Hashtable to hold Document Term Vector
static List<string> wordlist = new List<string>(); //List of terms found in documents
static SortedDictionary<double, string> sortedList = new SortedDictionary<double, string>(); //Documents ranked by VSM with angle value
static string[] docs ={"gold silver truck", //Query
"shipment of gold damaged in a fire",//Doc 1
"delivery of silver arrived in a silver truck", //Doc 2
"shipment of gold arrived in a truck"}; //Doc 3

static void Main(string[] args)
{
createWordList();
createVector();
classify();

var dict = sortedList;
foreach (var x in dict.Reverse())
{
Console.WriteLine("{0} -> Doc{1}", x.Key, x.Value);
}


Console.ReadLine();
}


public static void createWordList()
{
foreach (string doc in docs)
{
wordlist = getWordList(wordlist, doc);
}
}

public static List<string> getWordList(List<string> wordlist, string query)
{
Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);
MatchCollection MCollection = exp.Matches(query);

foreach (Match match in MCollection)
{
if (!wordlist.Contains(match.Value))
{
wordlist.Add(match.Value);
}
}

return wordlist;
}

public static void createVector()
{
double[] queryvector;

for (int j = 0; j < docs.Length; j++)
{
queryvector = new double[wordlist.Count];

for (int i = 0; i < wordlist.Count; i++)
{

double tfIDF = getTF(docs[j], wordlist[i]) * getIDF(wordlist[i]);
queryvector[i] = tfIDF;
}

if (j == 0) //is it a query?
{
DTVector.Add("Query", queryvector);
}
else
{
DTVector.Add(j.ToString(), queryvector);
}
}
}

public static void classify()
{
double temp = 0.0;

IDictionaryEnumerator _enumerator = DTVector.GetEnumerator();

double[] queryvector = new double[wordlist.Count];

Array.Copy((double[])DTVector["Query"], queryvector, wordlist.Count);

while (_enumerator.MoveNext())
{
if (_enumerator.Key.ToString() != "Query")
{
temp = cosinetheta(queryvector, (double[])_enumerator.Value);

sortedList.Add(temp, _enumerator.Key.ToString());

}
}
}

public static double dotproduct(double[] v1, double[] v2)
{
double product = 0.0;
if (v1.Length == v2.Length)
{
for (int i = 0; i < v1.Length; i++)
{
product += v1[i] * v2[i];
}
}
return product;
}

public static double vectorlength(double[] vector)
{
double length = 0.0;
for (int i = 0; i < vector.Length; i++)
{
length += Math.Pow(vector[i], 2);
}

return Math.Sqrt(length);
}
private static double getTF(string document, string term)
{
string[] queryTerms = Regex.Split(document, "\\s");
double count = 0;


foreach (string t in queryTerms)
{
if (t == term)
{
count++;
}
}
return count;

}

private static double getIDF(string term)
{
double df = 0.0;
//get term frequency of all of the sentences except for the query
for (int i = 1; i < docs.Length; i++)
{
if (docs[i].Contains(term))
{
df++;
}
}

//Get sentence count
double D = docs.Length - 1; //excluding the query

double IDF = 0.0;

if (df > 0)
{
IDF = Math.Log(D / df);
}

return IDF;
}

public static double cosinetheta(double[] v1, double[] v2)
{
double lengthV1 = vectorlength(v1);
double lengthV2 = vectorlength(v2);

double dotprod = dotproduct(v1, v2);

return dotprod / (lengthV1 * lengthV2);

}
}
}

Sunday, February 17, 2008

IR or Spam Filter

I haven't been updating my blog lately. The commencement of the spring 2008 semester has something to do with that, but the real reason is that I have been busy researching some interesting topics for my AI project.

Last semester I wanted to create an Image Spam Filter for my "Data Mining & Pattern Recognition" class. My plan was to apply OCR techniques to capture text from the image, after which it really becomes a text spam-filter problem. I was thinking of using neural networks for image text recognition and Bayes' Theorem for spam filtering. But my idea was unanimously rejected by my group, and instead we developed a handwriting recognition system (which was still a better project to work on than our previous plan to tell time by reading an image of an analogue clock).

So now I have a chance to have another go at my Image Spam Filter project. But I am still debating it. The reason is that I am also fond of Information Retrieval problems. I am thinking of working on a search engine and automatic indexing of a technical book/manual. Since this is an individual project, I can do whatever I want (of course, my professor has to approve it).

When I think about it, I find both IR and the spam filter interesting. So it is not a clash of interests but more a question of what I will gain from each and what I want to do in the future.

While I try to figure this out, feel free to give me your suggestions. Maybe you can offer an insight that will eventually help me reach a decision.