Sunday, January 18, 2009

Information Retrieval - Introduction

This is the first in the series of blog posts related to Information Retrieval.

Information Retreival
So what is Information Retrieval (IR)? IR can be defined as the indexing and retrieval of textual documents. It can be more thoroughly defined as follows:

"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." [1]

What does it take to build a search engine?
The task of search engines looks trivial. All search engines (Google, Yahoo, Live.com) had to do is crawl the web, index the web pages, read the text and make the links to the web pages available to upon user's request.

But if you ponder over it the simple task converts into complex sub tasks, which are as follows:
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities
A lot of research is done on each of these sub tasks. But that is not enough. There are other things which need to be dealt with to create an 'efficient' search engine. Some of them are:
  • Understand the structure of the web for efficient crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance
Document representations
Search engines display relevant links to webpages (documents) based on the keywords typed by the user. The documents retrieved consist of those keywords (not necessarily). In order to carry out the task of finding keywords in the documents, there must be some sort of data structure that holds the word-document relationship. One way to achieve this is to have a Term-document matrix (m x n).

Another important datastructure that comes handy is a Document-document matrix (n x n). However, space limitations and time complexity have to be taken into consideration before implementing these matrices.

However, these matrices don't scale very well. Typical example in a medium-sized collection is something like 3,000,000 documents (n) with 50,000 terms (m). So you can imagine how big a (m x n) matrix can be. In case of web documents you will typically have: n=30,000,000,000, m=1,000,000. Creating a (m x n) matrix and then using it could shoot up the time complexity sky high.

Boolean vs. integer-valued matrices
There is also the question whether you should use boolean (True or False) or inter-valued (0s and 1s) to represent the document-word relationship in the matrix.

Major IR models
Assuming that term-document relationship has been efficiently implemented. The next task is to come up with retrieval models that will be used to query the matrices and retrieve the most relevant documents. These models are know as IR Models. Some of the most common models are as follows:
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing
Boolean queries
Boolean IR Model is perhaps one of the simplest to implement. It uses boolean operators: AND, OR, NOT, parentheses, etc. For example, typical search queries will be something like:
CLEVELAND AND NOT OHIO
(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)

Vector queries
Each document is represented as a vector. Non-binary weights provide consideration for partial matches. These term weights are used to compute a degree of similarity between a query and each document. Ranked set of documents provides for better matching.

Probabilistic Model
Captures the IR problem using a probabilistic framework. Improve by iteration

The matching process
Matching is done between a document and a query (or between two documents). Distance or similarity measures are used to determined which set of documents satisfied a given query. The common methods used are: Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.

Well this sums up the introductory post on IR. In the next post we will look into more details of some of the components of IR mentioned in this post, especially indexing. So stay tuned.

References
[1] http://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html

2 comments:

Anonymous said...

Awesome post. Do you mind if I ask what your source is for this information?

sensiblo chamaeleon said...

hello.
you can find further resources about IR in http://sensiblochamaeleon.blogspot.com/2010/05/echte-wissenschaft-und-objektive.html .
will unimportant content get high priorities in SERPs ?
what about the relation of SEO and ethics,politics,morality, volume and truth :
http://sensiblochamaeleon.wordpress.com/2009/09/08/seoethik-heiligt-der-zweck-die-mittel/
greetings.