Showing posts with label natural language processing. Show all posts

Thursday, October 15, 2009

Text Summarization using Vector Space Model


The rapid growth of the World Wide Web has resulted in information overload. One cannot possibly read every news article, document, or paper that is out there on the web. That is why text summarization has become essential in this information age. Even search engines show a summary of each search result to help users select the most appropriate link.

However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using Vector Space Model.

We will first extract the sentences from the document. The summary generated for each document will consist of sentences extracted from it. We will use the VSM to pick the sentences that are more representative of the document than the others.

In the previous examples, there were two main components: a list of documents and the query. We assumed that there exists a string array "docs" which contains the query (at index zero) and the documents' text. We are going to use the same data structure ("docs"), but this time, instead of the query, index zero will hold the text of the document we are trying to summarize, and the remaining indices will hold the sentences extracted from it.
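The original code isn't shown here, so this is just a rough Python sketch (the names "document", "sentences", and "docs" follow the description above) of how the "docs" array is laid out:

```python
# Illustrative data; in the real program the sentences come from the extractor.
document = "The cat sat on the mat. The dog barked. Cats and dogs rarely agree."
sentences = ["The cat sat on the mat.", "The dog barked.", "Cats and dogs rarely agree."]

# Index zero holds the full document text; the remaining indices hold its sentences.
docs = [document] + sentences
```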

We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.
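The ranking step can be sketched as follows. This is not the post's original code (which isn't reproduced here); it is a minimal Python illustration of the idea, using simple TF-IDF weights and cosine similarity — the function names are mine:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build simple TF-IDF vectors (word -> weight) for a list of texts."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter()                      # document frequency of each word
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(texts)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)            # term frequency within this text
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentences(docs):
    """docs[0] is the full document; docs[1:] are its sentences.
    Returns the sentence indices sorted by similarity to the document."""
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs[0], vecs[i]), i) for i in range(1, len(docs))]
    scores.sort(reverse=True)
    return [i for _, i in scores]
```

A sentence that shares many informative words with the document scores high and is kept for the summary; an off-topic sentence scores near zero.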

The code is almost the same as in the VSM example; the only new function is the sentence extractor that I had to write for this example. The sentence extractor is based on the following heuristics:

1. If the character is "?" or "!", mark it as the end of the sentence
2. If the character is ".", check whether it is preceded by an abbreviation such as "Mr", "Mrs", or "Dr"
  • if yes, don't mark it as the end of the sentence
  • if no, check whether the "." is followed by a number
    • if yes, don't mark it as the end of the sentence
    • if no, mark it as the end of the sentence
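The heuristics above can be sketched in Python like this (again, not the original code; the abbreviation list and function name are illustrative and would need extending in practice):

```python
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "Ms", "Prof")  # extend as needed

def extract_sentences(text):
    """Split text into sentences using the heuristics listed above."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in "?!":
            # Rule 1: "?" and "!" always end a sentence.
            sentences.append(text[start:i + 1].strip())
            start = i + 1
        elif ch == ".":
            before = text[start:i]
            # Rule 2: a "." preceded by a known abbreviation is not an end.
            if any(before.endswith(abbr) for abbr in ABBREVIATIONS):
                continue
            # Rule 2b: a "." followed by a digit (e.g. "3.14") is not an end.
            if i + 1 < len(text) and text[i + 1].isdigit():
                continue
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    return [s for s in sentences if s]
```

For example, "Dr. Smith paid $3.14 for tea. He was pleased!" is split into exactly two sentences, since the periods after "Dr" and inside "3.14" are skipped.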

The program requires a valid URL from which it reads the article and displays the top five sentences that it thinks best summarize the document. For this simple program, I assume that "p" tags are used to display paragraphs.
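The paragraph-extraction step can be sketched as below. This is a hedged Python illustration, not the original code: it pulls the text out of "p" tags with a regular expression, using a hard-coded HTML string so it is self-contained (in practice the HTML would come from something like `urllib.request.urlopen(url).read().decode()`):

```python
import re

def paragraphs_from_html(html):
    """Collect the text inside <p>...</p> tags, per the assumption above."""
    parts = re.findall(r"<p[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    # Drop any tags nested inside a paragraph and normalize whitespace.
    return [" ".join(re.sub(r"<[^>]+>", " ", p).split()) for p in parts]

html = "<html><body><p>First paragraph.</p><p>Second <b>bold</b> one.</p></body></html>"
print(paragraphs_from_html(html))
```

A regex is good enough for this toy program; a real crawler would use an HTML parser instead.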

As can be seen from the image above, the summary produced by this simple method is not too bad :)

Here is the code.

Friday, March 6, 2009

Peter Norvig: "we don't think it's a big advance to be able to type something as a question as opposed to keywords."

Some time ago I found myself in a healthy argument with a friend of mine about the future of search. Both of us are crazy about information retrieval, and naturally our discussions more often than not divert towards search. While we usually agree with each other's views, this time I was shocked to find a major difference in our opinions.

I sincerely believe that any breakthrough in search will have something to do with Natural Language Processing. My friend, however, thinks that will not be the case. That by itself was not a big deal. But what shocked me was the evidence he used... Peter Norvig.

Now, I am a big fan of Peter Norvig, but crazy as I am about all things Norvig, I was disappointed to read what he said in his interview with Technology Review:

"TR: Companies such as Ask and Powerset are betting that the future is in natural-­language search, which lets people use real, useful sentences instead of potentially ambiguous keywords. What is Google doing with natural language?

PN: We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing "What is the capital of France?" won't get you better results than typing "capital of France." But understanding how words go together is important. To give some examples, "New York" is different from "York," but "Vegas" is the same as "Las Vegas," and "Jersey" may or may not be the same as "New Jersey." That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."

Apparently, Peter Norvig thinks that typing complete questions would just be a change of interface. I seriously doubt it. These days, I hardly ever use single-keyword searches. I like Powerset and really wish that I could google something much more intelligent than a few keywords. But if the director of research at Google doesn't think this will bring any big advancement to search, then maybe it is just me. I am sure Peter has substantial data to back his comments, but I for one would like to type in a question (and not something as simple as "Who is the 44th president of the USA?") and get a relevant answer.

But that's just me... The interview was conducted in 2008; here is hoping that Peter has since given a second thought to some serious NLP usage for search.