Saturday, October 10, 2009

Extracting sentences from a document

Extracting words from a document is quite common and a fairly easy task. And using variations of Regex.Split(documentText, @"\W") one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted to extract the first and the last sentence but I thought it would be easier just to split the entire document into sentences at first using split function as follows:

char[] delimiter={'.','?','!'};
string[] sentences = document.Split(delimiter);

These two lines of code will work on most of the cases since '?','!' and '.' are usually the characters that represent the end of a sentence. However, using the period (.) can lead to erroneous results since dot '.' can be used to represent a decimal number ( 3.5 , 100.03, etc) or abbreviations (Mr., Mrs., Dr., etc).

Unfortunately, the corpora that I was using has a lot of instances of decimal numbers and abbreviations. So I have to come up with a solution. After unsuccessfully trying to Google (or Bing) for the some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpora but it may require some modifications to work for other types document.

The main function is 'parseSenteces()' that splits the document into sentences. It requires that the entire document text be sent as a string parameter.

The 'checkInitils()' function is called when the the program comes across a period. This function checks whether the character before the period represent any initial (like Dr., Miss., Mrs., Mr.) or not and returns a boolean value upon its finding.

The 'checkNum' function checks whether the character before the encountered period are numbers or not.

If the 'checkInitials()' and 'checkNum()' returns false then that represent end of sentence, and that sentence is added in a List. If any of them returns true, then the period is considered a part of sentence and the counter moves on to check the next character.

Here is the code:

private void parseSentences(string document)
{
string para = document;
string sentence = ""; //keeps track of current sentence
List words = new List(); //stores different sentences
int counter = 0; //keeps track of index of the character being read

CharEnumerator charEnum = para.GetEnumerator();//Enumerator to read each character
while (charEnum.MoveNext())
{
char currChar = para[counter]; //holds the current character being read

if (currChar == '.')
{
//check if the character before period are numbers
if (checkNum(para, currChar, counter))
{
//if not then period represents a number and not end of sentence
sentence += currChar.ToString();
}

//check if the character before period are initials of some sort
else if (checkInitial(para, currChar, counter))
{
sentence += currChar.ToString();
}
else
{
//add the sentence to the list
words.Add(sentence);
sentence = "";
}
}
else if (currChar == ' ')
{

sentence += currChar.ToString();

}

else if ((currChar == '!') || (currChar == '?'))
{
words.Add(sentence);
sentence = "";

}

else
{ sentence += currChar.ToString(); }
counter++;
}

if (sentence != "")
{ words.Add(sentence); }

string list = "";

foreach (string s in words)
{
Console.WriteLine(s);
}
}

private bool checkNum(string para, char currChar, int counter)
{ //space means not a decimal number

try
{
if (para[counter + 1] == ' ')
return false;
} catch(Exception)
{ return false; }

char currNext= para[counter+1];

if ((currNext >= '0' & currNext <= '9'))
{ char currPrev=para[counter-1];

if ((currPrev >= '0' & currPrev <= '9'))
return true;
}
return false;
}

private bool checkInitial(string para, char currChar, int counter)
{
string word1 = ""; string word2 = "";

for (int i = 3; i > 0; i--)
{
try
{
if (i < 3)
{
word1 += para[counter - i].ToString();
}
word2 += para[counter - i].ToString();
}
catch(Exception e)
{
Console.WriteLine(e.toString());
}

}

if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))
{
return true;
}
return false;
}