Showing posts with label query likelihood. Show all posts
Showing posts with label query likelihood. Show all posts

Tuesday, November 16, 2010

Using Language Models for Information Retrieval (Python)

Continuing my regard for Bayes' Theorem, I decided to write a small python program that will crawl a list of urls and then will allow the user to search that list. I used the query likelihood Language Model for information retrieval.

A good explanation of query likelihood can be found in the Manning's Introduction to Information Retrieval book. I used the example mentioned in the "Language models for information retrieval" chapter. Here is the excerpt from that chapter:

In the query likelihood model, we construct from each document d in the collection a language model Md. We rank documents by P(d | q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule , we have:

P(d | q) = P(q | d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored.


In this model, Zero probabilities of words can be a problem. This can be solved by smoothing. Smoothing not only avoids zero probability but also it provides a term weighting component. More information about smoothing can be found in Manning's book.

I implemented the query likelihood model in python. I used BeautifulSoap to parse the crawled urls. You can download the python code from here. Once you have downloaded the code you can then use the code as follows:

1. import LangModel

2. Define a list of urls you want to search:
u=['http://cnn.com','http://techcrunch.com']

3. Call the crawl function to crawl the list:
LangModel.crawl(u)

4. Type the query you want to search and press the return key

5. The output will show you a ranked list of urls and the score of each according to query likelihood model.

6. You can search the list again by using the search function:
LangModel.search('your query')


That is it. Another application of Bayes' Theorem!

Stay tuned for more python code :)