Words are not distributed evenly in documents. The same goes for letters of the alphabet, city sizes, wealth, etc. Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people, or it takes 80% of the effort to build the hardest 20% of the system). For example, in Shakespeare's famous play Romeo and Juliet we find the following word frequencies:
Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; ...
Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; ...
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; ...
Stop words
The 250-300 most common words in English account for 50% or more of a given text.
Example: “the” and “of” represent about 10% of tokens; “and”, “to”, “a”, and “in” account for another 10%; the next 12 words account for another 10%.
For example, Chapter 1 of Moby Dick has 859 unique words (types) and 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (just over 50%).
The token/type ratio is therefore 2256/859 ≈ 2.63.
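The counts above can be reproduced for any text with a few lines of Python. This is a minimal sketch, not the procedure used for the Moby Dick figures: the regex tokenizer and the function name `type_token_stats` are assumptions, and a different tokenizer would give somewhat different numbers.

```python
import re
from collections import Counter


def type_token_stats(text, top_k):
    """Return (types, tokens, token/type ratio, share of tokens covered by the top_k types)."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no tokens")
    counts = Counter(tokens)
    n_tokens = len(tokens)   # word occurrences
    n_types = len(counts)    # unique words
    covered = sum(freq for _, freq in counts.most_common(top_k))
    return n_types, n_tokens, n_tokens / n_types, covered / n_tokens


# Toy input; replace with the text of a real chapter.
sample = "the cat and the dog and the bird"
n_types, n_tokens, ratio, coverage = type_token_stats(sample, top_k=2)
print(n_types, n_tokens, ratio, coverage)
```

On the toy sentence, the two most frequent types ("the" and "and") already cover 5 of the 8 tokens, which is the same skew the chapter statistics show on a larger scale.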
Zipf’s law
Rank × Frequency = Constant
Zipf's law is fairly general. It states that if $t_1$ is the most common term in the collection, $t_2$ the next most common, and so on, then the collection frequency $\mathrm{cf}_i$ of the $i$-th most common term is proportional to $1/i$:
$$ \mathrm{cf}_i \propto \frac{1}{i} $$
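One way to see how closely a word list follows Zipf's law is to compute rank × frequency for each word. The sketch below does this for the ten most frequent Romeo and Juliet words quoted earlier; the function name `rank_frequency_products` is made up for illustration.

```python
# Top ten (word, frequency) pairs quoted earlier for Romeo and Juliet.
top_words = [
    ("and", 667), ("the", 661), ("i", 570), ("to", 515), ("a", 447),
    ("of", 382), ("my", 356), ("is", 343), ("that", 343), ("in", 314),
]


def rank_frequency_products(word_freqs):
    """Pair each word with rank * frequency, ranking by descending frequency."""
    # sorted() is stable, so ties keep their input order.
    ordered = sorted(word_freqs, key=lambda wf: wf[1], reverse=True)
    return [(word, rank * freq) for rank, (word, freq) in enumerate(ordered, start=1)]


products = rank_frequency_products(top_words)
for word, product in products:
    print(f"{word:>5}  {product}")
```

Under an exact Zipf distribution every product would be the same. For these ten words the products grow from 667 to 3140, so the law is only a rough approximation at the very top of this list.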
Heaps’ law
A useful way of getting a handle on the vocabulary size $M$ is Heaps’ law, which estimates vocabulary size as a function of collection size:
$$ M = k T^{b} $$
where $T$ is the number of tokens in the collection. Typical values for the parameters $k$ and $b$ are $30 \le k \le 100$ and $b \approx 0.5$.
Regardless of the values of the parameters for a particular collection, Heaps’ law suggests that (i) the dictionary size will continue to increase with more documents in the collection, rather than reaching a maximum vocabulary size, and (ii) the size of the dictionary will be quite large for large collections.
A somewhat wider range is also quoted for English text: k between 10 and 100, and b between 0.4 and 0.6.
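To get a feel for what the formula predicts, the estimate can be evaluated for a few collection sizes. The parameter values k = 50 and b = 0.5 below are illustrative picks from inside the quoted ranges, not values fitted to any real collection, and `heaps_vocabulary_size` is a hypothetical helper name.

```python
def heaps_vocabulary_size(num_tokens, k, b):
    """Estimate vocabulary size M = k * T**b from the number of tokens T."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return k * num_tokens ** b


# Illustrative parameters only (assumed, not fitted to any collection).
K, B = 50, 0.5
for num_tokens in (10_000, 1_000_000, 100_000_000):
    estimate = heaps_vocabulary_size(num_tokens, K, B)
    print(f"T = {num_tokens:>11,}  ->  M ~ {estimate:,.0f}")
```

With b = 0.5, a collection that is 100 times larger has an estimated vocabulary 10 times larger, which illustrates both points above: the vocabulary keeps growing, and it becomes large for large collections.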
IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination:
$$ \mathrm{idf}_k = \log_2\!\left(\frac{N}{d_k}\right) + 1 = \log_2 N - \log_2 d_k + 1 $$
$N$: number of documents
$d_k$: number of documents containing term $k$
$f_{ik}$: absolute frequency of term $k$ in document $i$
$w_{ik}$: weight of term $k$ in document $i$
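The definitions above can be turned into a small computation. The text does not spell out how the weight is formed, so this sketch assumes the usual product w_ik = f_ik × idf_k suggested by the name "TF * IDF"; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    """idf_k = log2(N / d_k) + 1, following the definition in the text."""
    return math.log2(num_docs / doc_freq) + 1


def tf_idf_weights(docs):
    """Return one {term: w_ik} dict per document, with w_ik = f_ik * idf_k (assumed form)."""
    num_docs = len(docs)
    # d_k: number of documents that contain term k at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # f_ik: absolute frequency of term k in document i
        weights.append({t: f * idf(num_docs, doc_freq[t]) for t, f in term_freq.items()})
    return weights


# Toy, pre-tokenized documents (hypothetical data).
docs = [
    ["romeo", "loves", "juliet"],
    ["juliet", "loves", "the", "night"],
    ["the", "night", "the", "day"],
    ["the", "whale"],
]
weights = tf_idf_weights(docs)
print(weights[2])
```

In the third document, "the" occurs twice but appears in three of the four documents, so its weight ends up below that of "day", which occurs once and only in that document. This is the topic-discrimination effect the weighting is meant to produce.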
Vector-based matching
A popular measure of similarity for text clustering is the cosine of the angle between two vectors, which normalizes their dot product by the lengths of the vectors. The cosine measure is given by
$$ s(x_a, x_b) = \frac{x_a^{T} x_b}{\|x_a\|_2 \, \|x_b\|_2} \qquad (4.2) $$
and captures a scale-invariant understanding of similarity. An even stronger property is that the cosine similarity does not depend on the length of the vectors:
$$ s(\alpha x_a, x_b) = s(x_a, x_b) \quad \text{for } \alpha > 0. $$
This allows documents with the same composition but different totals to be treated identically, which makes this the most popular measure for text documents. Also, due to this property, samples can be normalized to the unit sphere for more efficient processing.
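The length-independence property is easy to check numerically. The following is a minimal pure-Python sketch; the vectors are made-up term counts over a shared vocabulary, and `cosine_similarity` is an illustrative helper rather than a library function.

```python
import math


def cosine_similarity(x_a, x_b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(x_a) != len(x_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x_a, x_b))
    norm_a = math.sqrt(sum(a * a for a in x_a))
    norm_b = math.sqrt(sum(b * b for b in x_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors over a shared vocabulary.
doc_a = [3, 0, 1, 2]
doc_b = [1, 1, 0, 2]
doc_a_scaled = [10 * v for v in doc_a]  # same composition, ten times the total

sim = cosine_similarity(doc_a, doc_b)
sim_scaled = cosine_similarity(doc_a_scaled, doc_b)
print(sim, sim_scaled)
```

The two printed values agree up to floating-point rounding, since scaling a vector by a positive factor scales the dot product and the norm by the same amount.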
References
Introduction to Information Retrieval, Ch. 5: Index Compression