Mining the Web: November 2010

Tuesday, November 16, 2010

Using Language Models for Information Retrieval (Python)

Continuing my regard for Bayes' Theorem, I decided to write a small python program that will crawl a list of urls and then will allow the user to search that list. I used the query likelihood Language Model for information retrieval.

A good explanation of query likelihood can be found in the Manning's Introduction to Information Retrieval book. I used the example mentioned in the "Language models for information retrieval" chapter. Here is the excerpt from that chapter:

In the query likelihood model, we construct from each document d in the collection a language model Md. We rank documents by P(d | q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule , we have:

P(d | q) = P(q | d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored.

In this model, Zero probabilities of words can be a problem. This can be solved by smoothing. Smoothing not only avoids zero probability but also it provides a term weighting component. More information about smoothing can be found in Manning's book.

I implemented the query likelihood model in python. I used BeautifulSoap to parse the crawled urls. You can download the python code from here. Once you have downloaded the code you can then use the code as follows:

1. import LangModel

2. Define a list of urls you want to search:
u=['http://cnn.com','http://techcrunch.com']

3. Call the crawl function to crawl the list:
LangModel.crawl(u)

4. Type the query you want to search and press the return key

5. The output will show you a ranked list of urls and the score of each according to query likelihood model.

6. You can search the list again by using the search function:
LangModel.search('your query')

That is it. Another application of Bayes' Theorem!

Stay tuned for more python code :)

Sunday, November 14, 2010

Bayes' theorem : A Love Story

A few days ago, I saw the video of Hilary Mason presenting the history of machine learning. As far as I know, she has covered the significant developments in the field in the most simplistic, understandable and casual way possible. She shed some light on the fundamental math and algorithmic tools employed in the field.

Hilary suggested, that after the AI winter storm, the best thing that happened in computer science was when computer scientists started invading the field of statistics. That is they starting using probability theories to built application and solve problems. She went on to give an example of how Bayes Theorem was used (and can be used) to classify text, detect spam, etc. I can very well identify with her point.

When I started my MS in computer science, I was supposed to take courses like Software Design, Advance DB, Advance Operating in my first semester. I couldn't get in the Advance Operating Sys class and so I enrolled in Data Mining class. The data mining techniques used, the algorithms that solve complex problems and simply finding interesting patterns in raw data just blew me away. Needless to say, taking DM was probably the best decision of my academic career. I changed my concentration from "Software Engineering" to "Machine Intelligence", worked on making application become smarter (rather than just building a smart application to pass a course) and read more about interesting concepts than that was covered in the class.

My first DM project was building a Spam Filter using Bayes' Theorem. Even though I built much complex DM applications after that, it remains one of my favorite things that I have done. I remember that project presentation was an option to get extra credit (project executable and report however accounted for 40% of the grade). I was doing pretty well in that class but I really wanted to present my Spam Filter. I have never felt so excited about implementing and presenting a project, so I knew that this is something special. And hence, even today when I take on projects that require use of more sophisticated algorithms like SVM, etc. I still perform a quick test using Bayes' Theorem just to get some insight. Maybe it is not a correct approach but I have felt that it has helped me fine tune the algorithms I have to use.

Being a big fan of all things Bayes, I wanted the undergrad and grad students in my college to be excited about using statistics in their projects as well. And hence I gave short tutorial about creating a spam filter using naive bayes theorem during one of the ACM open house meetings. I did that because, I realized that not every CS student enrolled in data mining. The meeting was pretty successful and I tend to believe that I at least made a handful of students excited about using Bayes' theorem.

So if you are an undergard (or even grad) and you are thinking about what project to work on, I'll suggest that you look up some machine learning techniques and see how you can use them to make applications related to your course. There is no better feeling than seeing your application run through mountains of raw(unstructured) data and then providing you with amazing insights and solution.