The rapid growth of World Wide Web has resulted in information overload. One cannot possibly read every news article, document, paper, etc that is out there on the web. That is why text summarization has become essential in this information age. Even the search engines show the summary of each of the search results to help users select the most appropriate link.
However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using Vector Space Model.
We will first extract the sentences from the document. The summary generated for each document will actually consist of sentences that are extracted from it. We will try to pick the sentences are more representative of the document than others using the VSM.
In the previous examples, there were two main components: list of documents and the query. We then assumed that there exist a string array "docs" which contains the query's (at zero index) and the documents' text. We are going to use the same data structure ("docs") but this time instead of the query we will have the document's text (which we are trying to summarize) at the zero index and the various sentences that we extracted from the document in the remaining indices.
We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.
The code from VSM example is almost the same, the only new function is the sentence extractor that I had to write for this example. The sentence extractor is created using the following heuristics:
1. If the character is "?" or "!" then mark this as the end of the sentence
2. If the character is "." the check that it is not preceded by "Mr", "Mrs", "Dr", etc
- if Yes, then don't mark this as the end of the sentence
- if No, then check that "." is not followed by a number
- if Yes, then don't mark this as the end of the sentence
- if No, then mark this as the end of the sentence
As can be seen from the image above, the summary produced by this simple method is not too bad :)
Here is the code.