A good explanation of query likelihood can be found in Manning's Introduction to Information Retrieval. I used the example from the "Language models for information retrieval" chapter. Here is an excerpt from that chapter:

In the query likelihood model, we construct from each document d in the collection a language model M_d. We rank documents by P(d | q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes' rule, we have:

P(d | q) = P(q | d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored.
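With P(q) and P(d) dropped, ranking comes down to P(q | d), the probability that the document's language model generates the query. Here is a minimal sketch of that score, assuming a unigram model with maximum likelihood estimates and whitespace tokenization. The function name and the two toy documents are made up for illustration and are not taken from the downloadable code.

```python
from collections import Counter

def query_likelihood(query, document):
    """P(q | d) under an unsmoothed unigram model of the document."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    prob = 1.0
    for term in query.lower().split():
        # Maximum likelihood estimate: term frequency / document length.
        prob *= counts[term] / len(tokens)
    return prob

# Toy documents, invented for this example.
docs = {
    "d1": "revenue is down this quarter",
    "d2": "revenue grew this quarter",
}

scores = {name: query_likelihood("revenue down", text) for name, text in docs.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)
```

Note that d2 gets a score of exactly zero, because "down" never occurs in it.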

In this model, zero probabilities for words can be a problem: a single query term that never occurs in a document drives P(q | d) to zero. This can be solved by smoothing, which not only avoids zero probabilities but also provides a term-weighting component. More information about smoothing can be found in Manning's book.
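To make the smoothing idea concrete, here is a small sketch that mixes each document model with a model of the whole collection (linear interpolation, also known as Jelinek-Mercer smoothing). This is one common choice, not necessarily what my downloadable code does. The function names, the toy documents and the mixing weight lam=0.5 are all assumptions for illustration.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, docs, lam=0.5):
    """Rank documents by smoothed P(q | d), highest score first."""
    doc_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_lens = {name: sum(c.values()) for name, c in doc_counts.items()}

    # Collection model: term counts pooled over all documents.
    coll_counts = Counter()
    for counts in doc_counts.values():
        coll_counts.update(counts)
    coll_len = sum(coll_counts.values())

    scores = {}
    for name in docs:
        score = 1.0
        for term in tokenize(query):
            p_doc = doc_counts[name][term] / doc_lens[name] if doc_lens[name] else 0.0
            p_coll = coll_counts[term] / coll_len if coll_len else 0.0
            # Mix the document estimate with the collection estimate.
            score *= lam * p_doc + (1 - lam) * p_coll
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy documents, invented for this example.
docs = {
    "d1": "revenue is down this quarter",
    "d2": "revenue grew this quarter",
}

ranked = rank("revenue down", docs)
for name, score in ranked:
    print(name, score)
```

The document that lacks "down" still gets a small non-zero score from the collection model, so it is ranked lower instead of being thrown out. A query term that appears nowhere in the collection would still zero out every score in this sketch.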

I implemented the query likelihood model in Python. I used BeautifulSoup to parse the crawled pages. You can download the Python code from here. Once you have downloaded the code, you can use it as follows:

1. import LangModel

2. Define a list of URLs you want to search:

u=['http://cnn.com','http://techcrunch.com']

3. Call the crawl function to crawl the list:

LangModel.crawl(u)

4. Type the query you want to search for and press the return key.

5. The output will show you a ranked list of URLs and the score of each according to the query likelihood model.

6. You can search the list again by using the search function:

LangModel.search('your query')

That is it. Another application of Bayes' Theorem!

Stay tuned for more Python code :)