Sunday, January 18, 2009

Information Retrieval - Introduction

This is the first in the series of blog posts related to Information Retrieval.

Information Retreival
So what is Information Retrieval (IR)? IR can be defined as the indexing and retrieval of textual documents. It can be more thoroughly defined as follows:

"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." [1]

What does it take to build a search engine?
The task of search engines looks trivial. All search engines (Google, Yahoo, Live.com) had to do is crawl the web, index the web pages, read the text and make the links to the web pages available to upon user's request.

But if you ponder over it the simple task converts into complex sub tasks, which are as follows:
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities
A lot of research is done on each of these sub tasks. But that is not enough. There are other things which need to be dealt with to create an 'efficient' search engine. Some of them are:
  • Understand the structure of the web for efficient crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance
Document representations
Search engines display relevant links to webpages (documents) based on the keywords typed by the user. The documents retrieved consist of those keywords (not necessarily). In order to carry out the task of finding keywords in the documents, there must be some sort of data structure that holds the word-document relationship. One way to achieve this is to have a Term-document matrix (m x n).

Another important datastructure that comes handy is a Document-document matrix (n x n). However, space limitations and time complexity have to be taken into consideration before implementing these matrices.

However, these matrices don't scale very well. Typical example in a medium-sized collection is something like 3,000,000 documents (n) with 50,000 terms (m). So you can imagine how big a (m x n) matrix can be. In case of web documents you will typically have: n=30,000,000,000, m=1,000,000. Creating a (m x n) matrix and then using it could shoot up the time complexity sky high.

Boolean vs. integer-valued matrices
There is also the question whether you should use boolean (True or False) or inter-valued (0s and 1s) to represent the document-word relationship in the matrix.

Major IR models
Assuming that term-document relationship has been efficiently implemented. The next task is to come up with retrieval models that will be used to query the matrices and retrieve the most relevant documents. These models are know as IR Models. Some of the most common models are as follows:
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing
Boolean queries
Boolean IR Model is perhaps one of the simplest to implement. It uses boolean operators: AND, OR, NOT, parentheses, etc. For example, typical search queries will be something like:
CLEVELAND AND NOT OHIO
(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)

Vector queries
Each document is represented as a vector. Non-binary weights provide consideration for partial matches. These term weights are used to compute a degree of similarity between a query and each document. Ranked set of documents provides for better matching.

Probabilistic Model
Captures the IR problem using a probabilistic framework. Improve by iteration

The matching process
Matching is done between a document and a query (or between two documents). Distance or similarity measures are used to determined which set of documents satisfied a given query. The common methods used are: Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.

Well this sums up the introductory post on IR. In the next post we will look into more details of some of the components of IR mentioned in this post, especially indexing. So stay tuned.

References
[1] http://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html

6 comments:

Anonymous said...

Awesome post. Do you mind if I ask what your source is for this information?

sensiblo chamaeleon said...

hello.
you can find further resources about IR in http://sensiblochamaeleon.blogspot.com/2010/05/echte-wissenschaft-und-objektive.html .
will unimportant content get high priorities in SERPs ?
what about the relation of SEO and ethics,politics,morality, volume and truth :
http://sensiblochamaeleon.wordpress.com/2009/09/08/seoethik-heiligt-der-zweck-die-mittel/
greetings.

oakleyses said...

jordan shoes, ugg boots, longchamp outlet, uggs on sale, ray ban sunglasses, replica watches, ray ban sunglasses, ray ban sunglasses, christian louboutin shoes, nike free run, louboutin pas cher, michael kors pas cher, louis vuitton outlet, air max, prada outlet, tory burch outlet, christian louboutin, louis vuitton, nike air max, nike roshe, replica watches, tiffany and co, oakley sunglasses wholesale, nike free, longchamp pas cher, christian louboutin uk, louis vuitton, burberry pas cher, chanel handbags, nike outlet, nike air max, polo ralph lauren outlet online, gucci handbags, oakley sunglasses, jordan pas cher, prada handbags, oakley sunglasses, cheap oakley sunglasses, louis vuitton outlet, longchamp outlet, polo outlet, ugg boots, tiffany jewelry, longchamp outlet, christian louboutin outlet, louis vuitton outlet, sac longchamp pas cher, oakley sunglasses, polo ralph lauren

oakleyses said...

true religion jeans, ray ban pas cher, michael kors, oakley pas cher, polo lacoste, michael kors outlet online, ray ban uk, mulberry uk, michael kors outlet online, coach outlet, burberry handbags, nike air force, vans pas cher, true religion outlet, michael kors outlet, abercrombie and fitch uk, nike roshe run uk, replica handbags, burberry outlet, michael kors outlet online, kate spade, hollister pas cher, michael kors outlet online, sac hermes, nike air max uk, true religion outlet, nike air max uk, michael kors outlet, guess pas cher, north face, hollister uk, new balance, nike tn, uggs outlet, uggs outlet, ralph lauren uk, michael kors, true religion outlet, nike air max, timberland pas cher, nike blazer pas cher, sac vanessa bruno, converse pas cher, coach purses, coach outlet store online, north face uk, nike free uk, hogan outlet, michael kors outlet

oakleyses said...

wedding dresses, longchamp uk, new balance shoes, insanity workout, gucci, nike roshe run, nike huaraches, north face outlet, celine handbags, giuseppe zanotti outlet, ghd hair, ralph lauren, nike trainers uk, mac cosmetics, iphone cases, asics running shoes, nfl jerseys, mcm handbags, north face outlet, ray ban, reebok outlet, converse outlet, timberland boots, hermes belt, mont blanc pens, ferragamo shoes, hollister, hollister clothing, nike air max, abercrombie and fitch, vans outlet, vans, converse, instyler, louboutin, lululemon, p90x workout, herve leger, soccer shoes, soccer jerseys, nike air max, jimmy choo outlet, beats by dre, oakley, baseball bats, valentino shoes, babyliss, hollister, chi flat iron, bottega veneta

oakleyses said...

ugg, louis vuitton, hollister, ugg uk, links of london, canada goose outlet, toms shoes, thomas sabo, moncler outlet, doudoune moncler, coach outlet, canada goose, ugg,uggs,uggs canada, ugg,ugg australia,ugg italia, pandora charms, replica watches, louis vuitton, supra shoes, pandora jewelry, juicy couture outlet, canada goose outlet, barbour, swarovski crystal, pandora uk, montre pas cher, juicy couture outlet, lancel, canada goose, moncler, canada goose jackets, louis vuitton, canada goose uk, marc jacobs, moncler, pandora jewelry, canada goose, swarovski, canada goose outlet, moncler uk, moncler, louis vuitton, karen millen uk, ugg pas cher, louis vuitton, barbour uk, wedding dresses, moncler outlet, moncler