Document Preprocessing
The content of a webpage read by the crawler has to be converted into tokens before an index of keywords can be built. This is challenging because at this step we have to deal with various formatting and encoding issues. For example, which delimiter should we use to separate words? Tokenizing on white space works well enough for strings like "Paul's", "555-1212" or "can't". But what about "Willow Dr.", "Dr. Willow", "New York", or "ad hoc"? A white-space delimiter creates separate tokens for each of them, which is not desirable.
Tokenization becomes even more challenging when you consider a string like “The New York-Los Angeles flight”. Other languages such as German or Korean have even more complex tokenization issues.
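To make the problem concrete, here is a minimal sketch in Python (the sample phrases are invented for the illustration) of how a naive white-space tokenizer handles these cases:

```python
# Naive white-space tokenization: fine for single words, but multi-word
# names and hyphenated phrases get split in undesirable ways.
samples = ["Paul's", "555-1212", "can't",
           "Willow Dr.", "New York", "ad hoc",
           "The New York-Los Angeles flight"]

for text in samples:
    tokens = text.split()   # split on white space only
    print(f"{text!r:40} -> {tokens}")

# "New York" becomes ['New', 'York'], and in the hyphenated phrase the
# token 'York-Los' appears, so neither "New York" nor "Los Angeles"
# survives as a unit.
```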
Normalization
We need a systematic way of ensuring that the index structure is suitable for general-purpose querying. For instance, "new york", "New York" and "NEW YORK" shouldn't be treated differently. There are various ways to achieve this normalization (a short sketch follows the list):
- Casing (cat vs. CAT)
- Stemming (computer, computation)
- Soundex
- Spell checker or synonyms (labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi)
- Index reduction
- Dropping stop words (“and”, “of”, “to”); this is problematic for queries like “to be or not to be”
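As a rough illustration, a minimal sketch of case folding plus stop-word removal (the stop-word list here is a toy list invented for the example):

```python
# Case folding plus stop-word removal: a tiny, illustrative normalizer.
STOP_WORDS = {"and", "of", "to", "or", "not", "be"}   # toy list for the example

def normalize(tokens):
    lowered = [t.lower() for t in tokens]               # "New York" / "NEW YORK" -> "new york"
    return [t for t in lowered if t not in STOP_WORDS]  # drop stop words

print(normalize(["New", "York"]))                # ['new', 'york']
print(normalize("to be or not to be".split()))   # [] -- the whole query disappears!
```

The second call shows the stop-word problem mentioned above: every token of “to be or not to be” is dropped.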
Porter's algorithm can be used for stemming, the process of reducing inflected (and sometimes derived) words to their stem, base or root form, which is generally a written word form (Wikipedia).
The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially one after the other as commands in a program.
The complete description of Porter's algorithm can be found here.
Example: the word “duplicatable” will be reduced to "duplic" by Porter's algorithm.
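Rather than re-implementing the five phases here, a quick way to experiment is an existing implementation such as NLTK's PorterStemmer (assuming the nltk package is installed; exact outputs can differ slightly between variants of the algorithm):

```python
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["duplicatable", "computer", "computation", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))

# "duplicatable" should be reduced to a short stem along the lines of
# "duplic", as described above; "computer" and "computation" should
# collapse to the same stem.
```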
The Soundex algorithm (Odell and Russell)
This algorithm acts as a spelling-tolerant hash function for names and is non-recoverable: the original spelling cannot be reconstructed from the code.
The algorithm is as follows:
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
b, f, p, v : 1
c, g, j, k, q, s, x, z : 2
d, t : 3
l : 4
m, n : 5
r : 6
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
These are the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively.
Some problem cases remain: Rogers and Rodgers, or Sinclair and StClair, sound alike but receive different codes.
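Below is a minimal Python sketch of the Soundex rules exactly as listed above (not a reference implementation; in particular it does not include the special handling of 'h' and 'w' found in some variants):

```python
def soundex(name: str) -> str:
    """Encode a name as a letter followed by three digits (form 'LDDD')."""
    # Rule 2: digit codes for the consonant groups.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    name = name.lower()
    first = name[0].upper()            # Rule 1: retain the first letter
    digits = []
    prev = codes.get(name[0])          # code of the previous letter in the original name
    for ch in name[1:]:
        code = codes.get(ch)           # None for a, e, h, i, o, u, w, y (dropped by Rule 1)
        if code is not None and code != prev:
            digits.append(code)        # Rule 3: skip adjacent letters with the same code
        prev = code

    # Rule 4: pad with terminal zeros or drop rightmost digits to get 'LDDD'.
    return first + "".join(digits)[:3].ljust(3, "0")


for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
          "Ellery", "Ghosh", "Heilbronn", "Kant", "Ladd"]:
    print(n, soundex(n))   # e.g. Euler -> E460, Ellery -> E460
```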
Storage issues
As mentioned in the last post, a medium-sized collection with n = 3,000,000 (terms) and m = 50,000 (documents) results in a very large term-document matrix, so we have to come up with a better data structure. The inverted index is the basic structure that resolves this issue.
Inverted index
Instead of an incidence vector, use a postings table. We can use a dictionary that maps each term to its postings: the term (or its unique ID) is the key, and the IDs of the documents in which the term appears are the values. For example,
CLEVELAND: D1, D2, D6
OHIO: D1, D5, D6, D7
We use linked lists so that new document postings can be inserted in order and existing postings removed. We need to keep everything sorted! This gives us a logarithmic improvement in access.
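A minimal sketch of building such a postings table in Python (the toy document collection is invented for the example; a dict of sorted lists stands in for the dictionary of linked lists described above):

```python
from collections import defaultdict

# Toy collection: document ID -> text (invented for the example).
docs = {
    "D1": "cleveland ohio",
    "D2": "cleveland rocks",
    "D5": "ohio state",
    "D6": "cleveland ohio weather",
    "D7": "ohio river",
}

index = defaultdict(list)       # term -> sorted list of document IDs (postings)
for doc_id in sorted(docs):     # visiting documents in order keeps each postings list sorted
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

print(index["cleveland"])   # ['D1', 'D2', 'D6']
print(index["ohio"])        # ['D1', 'D5', 'D6', 'D7']
```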
Basic operations on inverted indexes
Conjunction (AND) – an iterative merge of the two postings lists: O(x+y) (see the sketch after this list)
Disjunction (OR) – very similar
Negation (NOT)
Example: MICHIGAN AND NOT OHIO
Example: MICHIGAN OR NOT OHIO
For recursive operations (queries combining several terms), we should start with the smallest postings lists for better optimization.
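A minimal sketch of the O(x+y) iterative merge for AND (assuming both postings lists are already sorted; the lists reuse the CLEVELAND/OHIO example above):

```python
def intersect(p1, p2):
    """AND of two sorted postings lists via an iterative merge in O(x + y)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # same document: keep it, advance both lists
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the list with the smaller ID
            i += 1
        else:
            j += 1
    return result

cleveland = ["D1", "D2", "D6"]
ohio      = ["D1", "D5", "D6", "D7"]
print(intersect(cleveland, ohio))   # ['D1', 'D6']
```

OR is a very similar merge that keeps every ID seen in either list; NOT is typically handled by skipping the excluded postings while scanning the other list rather than materializing the complement.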
That sums up the second post on Information Retrieval. In the next post, I will discuss word distribution issues in IR.