Friday, January 30, 2009

Information Retrieval - Document Preprocessing

In this post I will touch briefly on document preprocessing and indexing concepts related to IR.

Document Preprocessing
The content of a webpage read by the crawler has to be converted into tokens before an index can be created for the keywords. This is challenging because at this step we have to deal with various formatting and encoding issues. For example, what delimiter will we use to separate different words. Tokenizing using white space as a delimiter seems to work best for "Paul's", "555-1212" or "can't". But what about: "Willow Dr.", "Dr. Willow", "New York", "ad hoc". A white space delimiter will create separate tokens for each of them which is not desirable.

Tokenization becomes even more challenging when you consider tokenizing a string like “The New York-Los Angeles flight”. Other languages such as German or Korean have eve more complex tokenization issues.

Normalization
We need a systematic way of ensuring that the index structure is suitable for general-purpose querying. For instance, "new york" or "New York" or "NEW YORK" shouldn't be treated differently. There are various ways to achieve this normalcy:

  • Casing (cat vs. CAT)
  • Stemming (computer, computation)
  • Soundex
  • Spell Checker or synonyms (Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi)
  • Index reduction
  • Dropping stop words (“and”, “of”, “to”), it is problematic for “to be or not to be” kind of sentences
Porter’s algorithm
Porter's algorithm can be used for Stemming - the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form (Wikipedia).
The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially one after the other as commands in a program.

The complete description of Porter's algorithm can be found here.

Example: the word “duplicatable” will be changed to "duplic" using the Proter's algorithm.

The Soundex algorithm (Odell and Russell)
This algorithm uses spelling correction and hash function and is non-recoverable.

The algorithm is as follows:

1. Retain the first letter of the name, and drop all occurrences of a,e,h,I,o,u,w,y in other positions
2. Assign the following numbers to the remaining letters after the first:
b,f,p,v : 1
c,g,j,k,q,s,x,z : 2
d,t : 3
l : 4
m n : 5
r : 6

3. if two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples:
Euler: E460, Gauss: G200, H416: Hilbert, K530: Knuth, Lloyd: L300
same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
Some problems: Rogers and Rodgers, Sinclair and StClair

Storage issues
As mentioned in the last post that a medium-sized collection with n=3,000,000 (terms) and m=50,000 (documents) will result in a large term-document matrix. We have to find a way to come up with a better data structure. Inverted Index is the basic structure that resolves the issue.

Inverted index
Instead of an incidence vector, use a posting table. We can use a Dictionary object and represent the term-documents as the term-value pairs in the dictionary. The term represents a unique ID and the document (where the term is present) IDs are the values in each term-value pair of the dictionary. For example,

CLEVELAND: D1, D2, D6
OHIO: D1, D5, D6, D7

We use linked lists to be able to insert new document postings in order and to remove existing postings. We need to keep everything sorted! This gives us a logarithmic improvement in access.

Basic operations on inverted indexes
Conjunction (AND) – iterative merge of the two postings: O(x+y)
Disjunction (OR) – very similar
Negation (NOT)

Example: MICHIGAN AND NOT OHIO
Example: MICHIGAN OR NOT OHIO

For recursive operations we should start with the smallest sets for better optimization.

That sums up the second post on Information Retrieval. In the next post, I will discuss word distribution issues in the IR.

17 comments:

Anonymous said...

The author is really cool. But some of the commentators are just posting stupid words.

Anonymous said...

You really have a way of words. Great style of delivering the information and I could relate to it. Such a great information for me. Thanks for this.

Anonymous said...

Because the skin underneath your eyes is very thin and delicate it is extremely important that you treat it with much care lumigan uk So the first thing to do is to start shielding your skin from the damaging effects of free radicals
Furthermore, a compound found in green tea has been shown to inhibit the growth of cancer cells cymbalta price 96% of alcoholics commiting suicide continue their substance abuse up to the end of their lives.
Nowadays we are all drawn to the fast pace of living female pink Viagra In most circumstances the more you put on the more your skin will be irritated which can cause inflammation and discomfort
This panic attack therapy method is so successful that over 75% of patients who try this technique show an amazing decrease in panic and phobia frequencies in just 3 months from commencing treatment cheap iressa To avoid them developing health problems such as asthma if would prove worthwhile to take these precautions and move towards an allergy-free and clean home
This means increased probability of plaque being ruptured and more chances of heart attacks and strokes Chloromint In the early 1900's, an ophthalmologist called Dr

Anonymous said...

Other examples of dangerous toys can be choking, cutting and even poisoning hazards. Antiaging supplements will not only give you the confidence, but also physically increase your muscle strength and mental security. dvpnc875vs. Just before the surgical procedure, the physician is supposed to advise you about specifics of the surgical procedures such as the expected duration of the surgical procedures, range of medications to be employed for queasiness and discomfort, as well as how long you might most likely be in the clinic following the procedure. In the beginning, keep things simple for your character to make the first few levels a little easier to get through., vpn port client. As we may recall, people with well aligned teeth tend to look very good. samsung unlock code

Anonymous said...

What do you think those "smudgy squares" mean? it network security policy

Anonymous said...

A lot of people say they have totally lost their ability to bring themselves to arousal after taking in pills and repeatedly pumping away their penis viagra By burning this many calories, the body will start to burn energy from your muscle reserves as well as your fat reserves, resulting in a body which isn't toned at all 5 generic cialis It is vitally important to lose fat, not muscle when selecting any type of dietary program

Anonymous said...

This component dictates how much information you can put in and where in your brochure should you essentially place them. It is great if you have gym equipment at home, which gives you the likely chance of using it any time., vpn series concentrator. Even though most of these substances are legally available it’s always recommended to ensure that they are bought from a renowned source and consumed in moderation. rdp mac

Anonymous said...

However, opening up stores for them would not be a plausible idea per say. In several instances it has been reported by the victims that the abusers do turn up at the property, and cause further domestic abuse, or attempt to enter the property forcibly. window 98 download. Whether your goal is to lose 10, even 90 pounds, these weight loss tips will help you get there. A lot of things need to be considered while a wedding is planned., free vpn hide. Search the web for the most informative and effective sites where you can focus all your time practicing. vpn concentrator mtu size

Anonymous said...

Other factors such as nutrition and rest also play a role, but height increasing exercises are especially important later in life if you're unhappy with being short., adsl be. A website contains many things or components that comprise in website like company logo, images, text, contents and many other things depending on the nature of the website. The website templates can be customized in terms of color, images and content and leave little to imagination as regards the end product. descarga estreno pelicula emule. Here you get services for ngo registration , pan registration, tm registration, company formation in india , online vat registration, ngo registration in delhi , ngo registration in india, patent registration and more. swiss golf network

Anonymous said...

Now there are many sources available on the net that offers online vat registration to domestic as well as international clients. We never know what is going to happen the next minute and this why insurance companies are here to cushion us from the losses we may incur in mishaps., sql server 2000 images. But again you should not expect too much from a business management software system if you have not paid too much money for it. good intranet

Anonymous said...

Potassium and magnesium help to maintain the electrolyte balance in the body and also aid in the controlling and building muscles. There is increased awareness about surrogacy spreading with media's active coverage about the hope that it brings to millions of people who want the own the most rewarding experience of their lifetime., der private schlussel den sie. These types are made specifically for highly nutritious bird feed called suet. . network scanner

Anonymous said...

All functions of the site similar to depositing and withdrawal are made extraordinarily consumer pleasant. Some procrastinators may delay seeking treatment because they do not want to examine the reasons why they procrastinate., 2005 unknown mysql server host http. It is good to write out your content first so that you can just insert it into the design alter. 500 errore interno server

Anonymous said...

However, plastic companies easily engage in doing business together with you if you have been referred by a good complementary plastic company. Custom web development provides easytouse options and interface, tailor made to satisfy every demands of business, delivers variety of highgrade features & functionality, speedy processing of data and great productivity, and offers long term costsaving advantages and so on. mod ssl sign sh. One of the finest silver wedding anniversary gifts may be silver anniversary keepsake boxes. Currently, human resource managers in many organisations are making efforts to improve their process in human resource management lifecycle., vpn u-tel. The first of the misconceptions that people may have about internet marketing software is that they think it may be too expensive thus scaring them away from it. intranet bayer

Anonymous said...

This is because researchers discovered that music with the tempo of healthy resting heartbeat will synchronize your own breathing and heart beat to it thereby slowing down your racing body rhythms., t online access finder. There may be some therapists that will ask you to disrobe completely for some of the sessions. Therefore, to book cheap flights, get in touch with a travel website. proxy port herausfinden. Usually the father bears the responsibility of the children in such divorce health insurance cases. 058 client

Anonymous said...

One of the things that you have to check before buying this type of item is: what is it made of? reato on line violazione privacy

Anonymous said...

Here's a slight Blasted with gorgeous gifts! https://twitter.com/Pixocool then that same entropy is matched against morning, be sure that you get your announcements cleaned and finished as early as possible. I erotic love the screech serial publication and eatery to discuss their requirements for a Label event on Feb 13th/14th,we aim to transform their new tented domain into a beautiful Label subject. The Duck's egg has garden pink beaks and toes and the combines a bit of authoritative Valentine's Day gifting with labels dash.
decals bikini Custom Labels removal does shows consideration and forethought is a individualised shote cant. This cheeky and fun talent is certain to spicery for decades exactly to uncover the slightest macrocosm of kids.

Anonymous said...

I've been exploring for a little bit for any high quality articles or blog posts in this kind of house . Exploring in Yahoo I finally stumbled upon this web site. Studying this information So i'm happy to exhibit
that I have an incredibly excellent uncanny feeling I came upon exactly what I needed.
I most undoubtedly will make certain to do not fail to remember
this site and give it a look on a constant basis.


Here is my web page; healthy diet plans for women
Also see my webpage - diet that works