Document Preprocessing
The content of a webpage read by the crawler has to be converted into tokens before an index of keywords can be built. This is challenging because at this step we have to deal with various formatting and encoding issues. For example, which delimiter should we use to separate words? Tokenizing on white space works well enough for strings like "Paul's", "555-1212" or "can't". But what about "Willow Dr.", "Dr. Willow", "New York", or "ad hoc"? A white-space delimiter creates separate tokens for each of them, which is not desirable.
Tokenization becomes even more challenging when you consider a string like “The New York-Los Angeles flight”. Other languages such as German or Korean have even more complex tokenization issues.
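To make the problem concrete, here is a minimal sketch in Python (the sample phrases are invented for the illustration) of how a naive white-space tokenizer handles these cases:

```python
# Naive white-space tokenization: fine for single words, but multi-word
# names and hyphenated phrases get split in undesirable ways.
samples = ["Paul's", "555-1212", "can't",
           "Willow Dr.", "New York", "ad hoc",
           "The New York-Los Angeles flight"]

for text in samples:
    tokens = text.split()   # split on white space only
    print(f"{text!r:40} -> {tokens}")

# "New York" becomes ['New', 'York'], and in the hyphenated phrase the
# token 'York-Los' appears, so neither "New York" nor "Los Angeles"
# survives as a unit.
```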
Normalization
We need a systematic way of ensuring that the index structure is suitable for general-purpose querying. For instance, "new york", "New York" and "NEW YORK" shouldn't be treated differently. There are various ways to achieve this normalization (a short sketch follows the list):
- Casing (cat vs. CAT)
- Stemming (computer, computation)
- Soundex
- Spell checker or synonyms (labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi)
- Index reduction
- Dropping stop words (“and”, “of”, “to”); this is problematic for queries like “to be or not to be”
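As a rough illustration, a minimal sketch of case folding plus stop-word removal (the stop-word list here is a toy list invented for the example):

```python
# Case folding plus stop-word removal: a tiny, illustrative normalizer.
STOP_WORDS = {"and", "of", "to", "or", "not", "be"}   # toy list for the example

def normalize(tokens):
    lowered = [t.lower() for t in tokens]               # "New York" / "NEW YORK" -> "new york"
    return [t for t in lowered if t not in STOP_WORDS]  # drop stop words

print(normalize(["New", "York"]))                # ['new', 'york']
print(normalize("to be or not to be".split()))   # [] -- the whole query disappears!
```

The second call shows the stop-word problem mentioned above: every token of “to be or not to be” is dropped.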
Porter's algorithm can be used for stemming, the process of reducing inflected (and sometimes derived) words to their stem, base or root form, which is generally a written word form (Wikipedia).
The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially one after the other as commands in a program.
The complete description of Porter's algorithm can be found here.
Example: the word “duplicatable” will be reduced to "duplic" by Porter's algorithm.
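Rather than re-implementing the five phases here, a quick way to experiment is an existing implementation such as NLTK's PorterStemmer (assuming the nltk package is installed; exact outputs can differ slightly between variants of the algorithm):

```python
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["duplicatable", "computer", "computation", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))

# "duplicatable" should be reduced to a short stem along the lines of
# "duplic", as described above; "computer" and "computation" should
# collapse to the same stem.
```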
The Soundex algorithm (Odell and Russell)
This algorithm acts as a spelling-tolerant hash function for names and is non-recoverable: the original spelling cannot be reconstructed from the code.
The algorithm is as follows:
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
b, f, p, v : 1
c, g, j, k, q, s, x, z : 2
d, t : 3
l : 4
m, n : 5
r : 6
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
These are the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively.
Some problem cases remain: Rogers and Rodgers, or Sinclair and StClair, sound alike but receive different codes.
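Below is a minimal Python sketch of the Soundex rules exactly as listed above (not a reference implementation; in particular it does not include the special handling of 'h' and 'w' found in some variants):

```python
def soundex(name: str) -> str:
    """Encode a name as a letter followed by three digits (form 'LDDD')."""
    # Rule 2: digit codes for the consonant groups.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    name = name.lower()
    first = name[0].upper()            # Rule 1: retain the first letter
    digits = []
    prev = codes.get(name[0])          # code of the previous letter in the original name
    for ch in name[1:]:
        code = codes.get(ch)           # None for a, e, h, i, o, u, w, y (dropped by Rule 1)
        if code is not None and code != prev:
            digits.append(code)        # Rule 3: skip adjacent letters with the same code
        prev = code

    # Rule 4: pad with terminal zeros or drop rightmost digits to get 'LDDD'.
    return first + "".join(digits)[:3].ljust(3, "0")


for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
          "Ellery", "Ghosh", "Heilbronn", "Kant", "Ladd"]:
    print(n, soundex(n))   # e.g. Euler -> E460, Ellery -> E460
```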
Storage issues
As mentioned in the last post, a medium-sized collection with n = 3,000,000 (terms) and m = 50,000 (documents) results in a very large term-document matrix, so we have to come up with a better data structure. The inverted index is the basic structure that resolves this issue.
Inverted index
Instead of an incidence vector, use a postings table. We can use a dictionary that maps each term to its postings: the term (or its unique ID) is the key, and the IDs of the documents in which the term appears are the values. For example,
CLEVELAND: D1, D2, D6
OHIO: D1, D5, D6, D7
We use linked lists so that new document postings can be inserted in order and existing postings removed. We need to keep everything sorted! This gives us a logarithmic improvement in access.
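A minimal sketch of building such a postings table in Python (the toy document collection is invented for the example; a dict of sorted lists stands in for the dictionary of linked lists described above):

```python
from collections import defaultdict

# Toy collection: document ID -> text (invented for the example).
docs = {
    "D1": "cleveland ohio",
    "D2": "cleveland rocks",
    "D5": "ohio state",
    "D6": "cleveland ohio weather",
    "D7": "ohio river",
}

index = defaultdict(list)       # term -> sorted list of document IDs (postings)
for doc_id in sorted(docs):     # visiting documents in order keeps each postings list sorted
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

print(index["cleveland"])   # ['D1', 'D2', 'D6']
print(index["ohio"])        # ['D1', 'D5', 'D6', 'D7']
```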
Basic operations on inverted indexes
Conjunction (AND) – an iterative merge of the two postings lists: O(x+y) (see the sketch after this list)
Disjunction (OR) – very similar
Negation (NOT)
Example: MICHIGAN AND NOT OHIO
Example: MICHIGAN OR NOT OHIO
For recursive operations (queries combining several terms), we should start with the smallest postings lists for better optimization.
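A minimal sketch of the O(x+y) iterative merge for AND (assuming both postings lists are already sorted; the lists reuse the CLEVELAND/OHIO example above):

```python
def intersect(p1, p2):
    """AND of two sorted postings lists via an iterative merge in O(x + y)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # same document: keep it, advance both lists
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the list with the smaller ID
            i += 1
        else:
            j += 1
    return result

cleveland = ["D1", "D2", "D6"]
ohio      = ["D1", "D5", "D6", "D7"]
print(intersect(cleveland, ohio))   # ['D1', 'D6']
```

OR is a very similar merge that keeps every ID seen in either list; NOT is typically handled by skipping the excluded postings while scanning the other list rather than materializing the complement.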
That sums up the second post on Information Retrieval. In the next post, I will discuss word distribution issues in IR.