Saturday, October 10, 2009

Extracting sentences from a document

Extracting words from a document is quite common and a fairly easy task. And using variations of Regex.Split(documentText, @"\W") one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted to extract the first and the last sentence but I thought it would be easier just to split the entire document into sentences at first using split function as follows:

char[] delimiter={'.','?','!'};
string[] sentences = document.Split(delimiter);

These two lines of code will work on most of the cases since '?','!' and '.' are usually the characters that represent the end of a sentence. However, using the period (.) can lead to erroneous results since dot '.' can be used to represent a decimal number ( 3.5 , 100.03, etc) or abbreviations (Mr., Mrs., Dr., etc).

Unfortunately, the corpora that I was using has a lot of instances of decimal numbers and abbreviations. So I have to come up with a solution. After unsuccessfully trying to Google (or Bing) for the some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpora but it may require some modifications to work for other types document.

The main function is 'parseSenteces()' that splits the document into sentences. It requires that the entire document text be sent as a string parameter.

The 'checkInitils()' function is called when the the program comes across a period. This function checks whether the character before the period represent any initial (like Dr., Miss., Mrs., Mr.) or not and returns a boolean value upon its finding.

The 'checkNum' function checks whether the character before the encountered period are numbers or not.

If the 'checkInitials()' and 'checkNum()' returns false then that represent end of sentence, and that sentence is added in a List. If any of them returns true, then the period is considered a part of sentence and the counter moves on to check the next character.

Here is the code:

private void parseSentences(string document)
{
string para = document;
string sentence = ""; //keeps track of current sentence
List words = new List(); //stores different sentences
int counter = 0; //keeps track of index of the character being read

CharEnumerator charEnum = para.GetEnumerator();//Enumerator to read each character
while (charEnum.MoveNext())
{
char currChar = para[counter]; //holds the current character being read

if (currChar == '.')
{
//check if the character before period are numbers
if (checkNum(para, currChar, counter))
{
//if not then period represents a number and not end of sentence
sentence += currChar.ToString();
}

//check if the character before period are initials of some sort
else if (checkInitial(para, currChar, counter))
{
sentence += currChar.ToString();
}
else
{
//add the sentence to the list
words.Add(sentence);
sentence = "";
}
}
else if (currChar == ' ')
{

sentence += currChar.ToString();

}

else if ((currChar == '!') || (currChar == '?'))
{
words.Add(sentence);
sentence = "";

}

else
{ sentence += currChar.ToString(); }
counter++;
}

if (sentence != "")
{ words.Add(sentence); }

string list = "";

foreach (string s in words)
{
Console.WriteLine(s);
}
}

private bool checkNum(string para, char currChar, int counter)
{ //space means not a decimal number

try
{
if (para[counter + 1] == ' ')
return false;
} catch(Exception)
{ return false; }

char currNext= para[counter+1];

if ((currNext >= '0' & currNext <= '9'))
{ char currPrev=para[counter-1];

if ((currPrev >= '0' & currPrev <= '9'))
return true;
}
return false;
}

private bool checkInitial(string para, char currChar, int counter)
{
string word1 = ""; string word2 = "";

for (int i = 3; i > 0; i--)
{
try
{
if (i < 3)
{
word1 += para[counter - i].ToString();
}
word2 += para[counter - i].ToString();
}
catch(Exception e)
{
Console.WriteLine(e.toString());
}

}

if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))
{
return true;
}
return false;
}

5 comments:

oakleyses said...

jordan shoes, ugg boots, longchamp outlet, uggs on sale, ray ban sunglasses, replica watches, ray ban sunglasses, ray ban sunglasses, christian louboutin shoes, nike free run, louboutin pas cher, michael kors pas cher, louis vuitton outlet, air max, prada outlet, tory burch outlet, christian louboutin, louis vuitton, nike air max, nike roshe, replica watches, tiffany and co, oakley sunglasses wholesale, nike free, longchamp pas cher, christian louboutin uk, louis vuitton, burberry pas cher, chanel handbags, nike outlet, nike air max, polo ralph lauren outlet online, gucci handbags, oakley sunglasses, jordan pas cher, prada handbags, oakley sunglasses, cheap oakley sunglasses, louis vuitton outlet, longchamp outlet, polo outlet, ugg boots, tiffany jewelry, longchamp outlet, christian louboutin outlet, louis vuitton outlet, sac longchamp pas cher, oakley sunglasses, polo ralph lauren

oakleyses said...

true religion jeans, ray ban pas cher, michael kors, oakley pas cher, polo lacoste, michael kors outlet online, ray ban uk, mulberry uk, michael kors outlet online, coach outlet, burberry handbags, nike air force, vans pas cher, true religion outlet, michael kors outlet, abercrombie and fitch uk, nike roshe run uk, replica handbags, burberry outlet, michael kors outlet online, kate spade, hollister pas cher, michael kors outlet online, sac hermes, nike air max uk, true religion outlet, nike air max uk, michael kors outlet, guess pas cher, north face, hollister uk, new balance, nike tn, uggs outlet, uggs outlet, ralph lauren uk, michael kors, true religion outlet, nike air max, timberland pas cher, nike blazer pas cher, sac vanessa bruno, converse pas cher, coach purses, coach outlet store online, north face uk, nike free uk, hogan outlet, michael kors outlet

oakleyses said...

wedding dresses, longchamp uk, new balance shoes, insanity workout, gucci, nike roshe run, nike huaraches, north face outlet, celine handbags, giuseppe zanotti outlet, ghd hair, ralph lauren, nike trainers uk, mac cosmetics, iphone cases, asics running shoes, nfl jerseys, mcm handbags, north face outlet, ray ban, reebok outlet, converse outlet, timberland boots, hermes belt, mont blanc pens, ferragamo shoes, hollister, hollister clothing, nike air max, abercrombie and fitch, vans outlet, vans, converse, instyler, louboutin, lululemon, p90x workout, herve leger, soccer shoes, soccer jerseys, nike air max, jimmy choo outlet, beats by dre, oakley, baseball bats, valentino shoes, babyliss, hollister, chi flat iron, bottega veneta

oakleyses said...

ugg, louis vuitton, hollister, ugg uk, links of london, canada goose outlet, toms shoes, thomas sabo, moncler outlet, doudoune moncler, coach outlet, canada goose, ugg,uggs,uggs canada, ugg,ugg australia,ugg italia, pandora charms, replica watches, louis vuitton, supra shoes, pandora jewelry, juicy couture outlet, canada goose outlet, barbour, swarovski crystal, pandora uk, montre pas cher, juicy couture outlet, lancel, canada goose, moncler, canada goose jackets, louis vuitton, canada goose uk, marc jacobs, moncler, pandora jewelry, canada goose, swarovski, canada goose outlet, moncler uk, moncler, louis vuitton, karen millen uk, ugg pas cher, louis vuitton, barbour uk, wedding dresses, moncler outlet, moncler

soma taha said...

Hi, I do believe this is a great site.
https://3kary.com/property-status/rest-houses-for-rent/
https://3kary.com/property-status/rest-houses-for-sale/
https://3kary.com/property-type/rest-houses/
https://decor-ksa.com/iron-fences/