Monday, February 2, 2009

Information Retrieval - A word about Word Distribution

Word distributions
Words are not distributed evenly in documents. Same goes for letters of the alphabet, city sizes, wealth, etc. Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people or it takes 80% of the effort to build the easier 20% of the system). For example, the Shakespeare's famous play Romeo and Juliet we can find the following word frequencies:

Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; ........ Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63;…..................
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; ...

Stop words
There are 250-300 most common words in English that account for 50% or more of a given text.
Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.

For example, Moby Dick Ch.1 has 859 unique words (types), 2256 word occurrences (tokens). The Top 65 types cover 1132 tokens (> 50%).
Therefore, the token/type ratio: 2256/859 = 2.63

Zipf’s law

Rank x Frequency = Constant

Zipf's law is fairly general. It states that, if t1 is the most common term in the collection, t2 the next most common etc, then the collection frequency (cf) of the ith most common term is proportional to 1/i:

cf(i) prop. 1/i


Heap’s law
A better way of getting a handle on vocabulary size, M is Heaps’ law, which estimates vocabulary size as a function of collection size:

M = k(T pow(b))

where, T is the number of tokens in the collection. Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.
Regardless of the values of theparameters for a particular collection, Heaps’ law suggests that: (i) the dictionary size will continue to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary will be quite large for large collections.

In English, k is between 10 and 100, b is between 0.4 and 0.6.


IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination:

idf(k) = log2(N/d(k)) + 1 = log2N - log2d(k) + 1

N: number of documents
dk: number of documents containing term k
fik: absolute frequency of term k in document i
wik: weight of term k in document i


Vector-based matching

A popular measure of similarity for text (which normalizes the features by the covariance matrix) clustering is the cosine of the angle between two vectors. The cosine measure is given by

$\displaystyle s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b) =  \frac{\mathbf{x}_... ...ger} \mathbf{x}_b} {\Vert\mathbf{x}_a\Vert _2 \cdot  \Vert\mathbf{x}_b\Vert _2}$ (4.2)

and captures a scale invariant understanding of similarity. An even stronger property is that the cosine similarity does not depend on the length:
$ s^{(\mathrm{C})} (\alpha \mathbf{x}_a,\mathbf{x}_b) = s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b)$ for 0$" width="51" align="middle" border="0" height="33">. This allows documents with the same composition, but different totals to be treated identically which makes this the most popular measure for text documents. Also, due to this property, samples can be normalized to the unit sphere for more efficient processing.

References
Introduction to Inforamtion Retreival, Ch. 5 Index Compression

4 comments:

Anonymous said...

Consider flipping a property "as is also"? When turning or get-maintenance-market is your point, you may want to think about the advantages of turning a home as they are. Believe it or not, this technique is extremely loved by brokers who buy components once the market is favoring sellers. The benefit is that there are no from wallet maintenance fees, as well as the home could be sold considerably faster. Locations where this method is definitely the most suitable to brokers include neighborhoods that happen to be at present in cross over or where by redevelopment has developed into a concern. [url=http://www.x21w12w21.info]Eyua34222[/url]

oakleyses said...

jordan shoes, ugg boots, longchamp outlet, uggs on sale, ray ban sunglasses, replica watches, ray ban sunglasses, ray ban sunglasses, christian louboutin shoes, nike free run, louboutin pas cher, michael kors pas cher, louis vuitton outlet, air max, prada outlet, tory burch outlet, christian louboutin, louis vuitton, nike air max, nike roshe, replica watches, tiffany and co, oakley sunglasses wholesale, nike free, longchamp pas cher, christian louboutin uk, louis vuitton, burberry pas cher, chanel handbags, nike outlet, nike air max, polo ralph lauren outlet online, gucci handbags, oakley sunglasses, jordan pas cher, prada handbags, oakley sunglasses, cheap oakley sunglasses, louis vuitton outlet, longchamp outlet, polo outlet, ugg boots, tiffany jewelry, longchamp outlet, christian louboutin outlet, louis vuitton outlet, sac longchamp pas cher, oakley sunglasses, polo ralph lauren

oakleyses said...

true religion jeans, ray ban pas cher, michael kors, oakley pas cher, polo lacoste, michael kors outlet online, ray ban uk, mulberry uk, michael kors outlet online, coach outlet, burberry handbags, nike air force, vans pas cher, true religion outlet, michael kors outlet, abercrombie and fitch uk, nike roshe run uk, replica handbags, burberry outlet, michael kors outlet online, kate spade, hollister pas cher, michael kors outlet online, sac hermes, nike air max uk, true religion outlet, nike air max uk, michael kors outlet, guess pas cher, north face, hollister uk, new balance, nike tn, uggs outlet, uggs outlet, ralph lauren uk, michael kors, true religion outlet, nike air max, timberland pas cher, nike blazer pas cher, sac vanessa bruno, converse pas cher, coach purses, coach outlet store online, north face uk, nike free uk, hogan outlet, michael kors outlet

oakleyses said...

wedding dresses, longchamp uk, new balance shoes, insanity workout, gucci, nike roshe run, nike huaraches, north face outlet, celine handbags, giuseppe zanotti outlet, ghd hair, ralph lauren, nike trainers uk, mac cosmetics, iphone cases, asics running shoes, nfl jerseys, mcm handbags, north face outlet, ray ban, reebok outlet, converse outlet, timberland boots, hermes belt, mont blanc pens, ferragamo shoes, hollister, hollister clothing, nike air max, abercrombie and fitch, vans outlet, vans, converse, instyler, louboutin, lululemon, p90x workout, herve leger, soccer shoes, soccer jerseys, nike air max, jimmy choo outlet, beats by dre, oakley, baseball bats, valentino shoes, babyliss, hollister, chi flat iron, bottega veneta