tag:blogger.com,1999:blog-59409214181638276912024-03-04T23:39:56.345-08:00Mining the WebSyed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.comBlogger27125tag:blogger.com,1999:blog-5940921418163827691.post-36332161536334807782010-11-16T22:54:00.000-08:002010-11-29T12:44:06.099-08:00Using Language Models for Information Retrieval (Python)Continuing my regard for <a href="http://raza-rizvi.blogspot.com/2010/11/bayes-theorem-love-story.html">Bayes' Theorem</a>, I decided to write a small python program that will crawl a list of urls and then will allow the user to search that list. I used the <a href="http://nlp.stanford.edu/IR-book/html/htmledition/using-query-likelihood-language-models-in-ir-1.html"><span style="font-style: italic;">query likelihood</span></a> Language Model for information retrieval.<br /><br />A good explanation of query likelihood can be found in the Manning's Introduction to Information Retrieval book. I used the example mentioned in the "<a href="http://nlp.stanford.edu/IR-book/html/htmledition/language-models-1.html">Language models for information retrieval</a>" chapter. Here is the excerpt from that chapter:<br /><br /><div style="text-align: center; color: rgb(0, 0, 0); font-weight: bold;">In the <span style="font-style: italic;">query likelihood model</span>, we construct from each document <span style="font-style: italic;">d</span> in the collection a language model <span style="font-style: italic;">Md</span>. We rank documents by <span style="font-style: italic;">P(d | q)</span>, where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule , we have:<br /><br /></div><div style="text-align: center;"><div style="text-align: center; color: rgb(0, 0, 0); font-weight: bold;"><span style="font-style: italic;">P(d | q) = P(q | d) P(d) / P(q)</span><br /><span style="font-style: italic;"></span></div><div style="text-align: left;"><div style="text-align: center; font-weight: bold;"><span style="font-style: italic; color: rgb(0, 0, 0);"></span><br /><span style="font-style: italic; color: rgb(0, 0, 0);">P(q) </span><span style="color: rgb(0, 0, 0);">is the same for all documents, and so can be ignored. The prior probability of a document </span><span style="font-style: italic; color: rgb(0, 0, 0);">P(d)</span><span style="color: rgb(0, 0, 0);"> is often treated as uniform across all </span><span style="font-style: italic; color: rgb(0, 0, 0);">d</span><span style="color: rgb(0, 0, 0);"> and so it can also be ignored.</span><br /></div><br /><div style="text-align: left;"><br />In this model, Zero probabilities of words can be a problem. This can be solved by <span style="font-style: italic;">smoothing</span>. Smoothing not only avoids zero probability but also it provides a term weighting component. More information about smoothing can be found in <a href="http://nlp.stanford.edu/IR-book/html/htmledition/estimating-the-query-generation-probability-1.html">Manning's book</a>.<br /><br />I implemented the query likelihood model in python. I used <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoap</a> to parse the crawled urls. You can download the python code from <a style="font-weight: bold;" href="http://ecs.fullerton.edu/%7Eraza09/software/LangModel.py">here</a>. Once you have downloaded the code you can then use the code as follows:<br /><br />1. <span style="font-style: italic;">import LangModel<br /><br /></span>2. 
Define a list of URLs you want to search:<br /><span style="font-style: italic;">u=['http://cnn.com','http://techcrunch.com']</span><br /><br />3. Call the <span style="font-style: italic;">crawl</span> function to crawl the list:<br /><span style="font-style: italic;">LangModel.crawl(u)</span><br /><br />4. Type the query you want to search for and press the return key.<br /><br />5. The output will show you a ranked list of URLs and the score of each according to the query likelihood model.<br /><br />6. You can search the list again by using the <span style="font-style: italic;">search</span> function:<br /><span style="font-style: italic;">LangModel.search('your query')</span><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgANNRQRVG4mOu5VJKZ7a_5LpBB98qOEotWCvQfT7F6BWqgWWI7pFrOrQcp4XCRi1yjkla7PJLpHHlDtU-n2IucoAWH8v1wnEBZR01fZK5WeSt0A6rN48jk1ZhMss4fk2jrBFuVbjabhrYz/s1600/LanguageModel.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 295px; height: 197px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgANNRQRVG4mOu5VJKZ7a_5LpBB98qOEotWCvQfT7F6BWqgWWI7pFrOrQcp4XCRi1yjkla7PJLpHHlDtU-n2IucoAWH8v1wnEBZR01fZK5WeSt0A6rN48jk1ZhMss4fk2jrBFuVbjabhrYz/s200/LanguageModel.jpg" alt="" id="BLOGGER_PHOTO_ID_5540804107787329266" border="0" /></a>That is it. Another application of Bayes' Theorem!<br /><br />Stay tuned for more Python code :)<br /></div><span style="font-style: italic;"></span></div></div>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com193tag:blogger.com,1999:blog-5940921418163827691.post-31630698433747192532010-11-14T12:11:00.000-08:002010-11-14T13:14:33.466-08:00Bayes' theorem : A Love StoryA few days ago, I saw the video of <a href="http://www.hilarymason.com/">Hilary Mason</a> presenting the<a href="http://www.infoq.com/presentations/Machine-Learning"> history of machine learning</a>. As far as I know, she has covered the significant developments in the field in the simplest, most understandable and casual way possible. She shed some light on the fundamental math and algorithmic tools employed in the field.<br /><br />Hilary suggested that, after the <a href="http://en.wikipedia.org/wiki/AI_winter">AI winter</a> storm, the best thing that happened in computer science was when computer scientists started invading the field of statistics. That is, they started using probability theory to build applications and solve problems. She went on to give an example of how <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> was used (and can be used) to classify text, detect spam, etc. I can very well identify with her point.<br /><br />When I started my MS in computer science, I was supposed to take courses like Software Design, Advanced DB, and Advanced Operating Systems in my first semester. I couldn't get into the Advanced Operating Systems class, so I enrolled in the Data Mining class. The data mining techniques, the algorithms that solve complex problems, and simply finding interesting patterns in raw data just blew me away. Needless to say, taking DM was probably the best decision of my academic career. 
I changed my concentration from "Software Engineering" to "Machine Intelligence", worked on making applications smarter (rather than just building a smart application to pass a course), and read about more interesting concepts than were covered in class.<br /><br />My first DM project was building a <a href="http://raza-rizvi.blogspot.com/2010/06/text-classifier-using-naive-bayes.html">Spam Filter using Bayes' Theorem</a>. Even though I built much more complex DM applications after that, it remains one of my favorite things that I have done. I remember that the project presentation was an <a href="http://rizvipost.blogspot.com/2008/05/i-dont-need-extra-credit.html">option to get extra credit</a> (the project executable and report, however, accounted for 40% of the grade). I was doing pretty well in that class, but I really wanted to present my Spam Filter. I had never felt so excited about implementing and presenting a project, so I knew that this was something special. And hence, even today, when I take on projects that require the use of more sophisticated algorithms like SVM, I still perform a quick test using Bayes' Theorem just to get some insight. Maybe it is not the correct approach, but I feel that it has helped me fine-tune the algorithms I have to use.<br /><br />Being a big fan of all things Bayes, I wanted the undergrad and grad students in my college to be excited about using statistics in their projects as well. And hence I gave a short tutorial about <a href="http://raza-rizvi.blogspot.com/2010/03/creating-spam-filter-using-naive-bayes.html">creating a spam filter using the naive Bayes theorem</a> during one of the ACM open house meetings. I did that because I realized that not every CS student enrolls in data mining. The meeting was pretty successful, and I tend to believe that I at least made a handful of students excited about using Bayes' theorem.<br /><br />So if you are an undergrad (or even a grad student) and you are thinking about what project to work on, I suggest that you look up some machine learning techniques and see how you can use them to build applications related to your course. There is no better feeling than seeing your application run through mountains of raw (unstructured) data and then provide you with amazing insights and solutions.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com13tag:blogger.com,1999:blog-5940921418163827691.post-52867953217641365582010-10-17T19:24:00.000-07:002010-10-17T20:45:58.137-07:00Text Classification using Naive Bayes ClassifierI received some emails related to my <a href="http://raza-rizvi.blogspot.com/2010/03/creating-spam-filter-using-naive-bayes.html">spam filter post</a>. Some of them asked me to share the code for it. A very simple implementation of a spam filter in Python can be found in Collective Intelligence (I highly recommend this book). However, I wrote a simple text classification application in C# which can be used to create a spam filter.<br /><br />I used the <a href="http://spamassassin.apache.org/publiccorpus/">'SpamAssassin' datasets</a> for training and then tested the same datasets using naive Bayes classification. The results are quite interesting. All of the 'ham' mails are classified as 'ham'; however, not all 'spam' emails are classified as 'spam'. The results for spam classification aren't that bad because, as mentioned in the slides, it is more acceptable to misclassify a 'spam' email than a 'ham' email.
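<br /><br />To make the scoring step concrete, here is a minimal naive Bayes sketch. This is not the code from the project; the class name, the toy training strings and the add-one (Laplace) smoothing are all illustrative assumptions:<br /><br />

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal naive Bayes text classifier (illustrative sketch only)
class NaiveBayesSketch
{
    // wordCounts[category][word] = occurrences of 'word' in training docs of 'category'
    static Dictionary<string, Dictionary<string, int>> wordCounts = new Dictionary<string, Dictionary<string, int>>();
    static Dictionary<string, int> docCounts = new Dictionary<string, int>();

    static void Train(string category, string text)
    {
        if (!wordCounts.ContainsKey(category))
        {
            wordCounts[category] = new Dictionary<string, int>();
            docCounts[category] = 0;
        }
        docCounts[category]++;
        foreach (string w in text.ToLower().Split(' '))
        {
            if (!wordCounts[category].ContainsKey(w)) wordCounts[category][w] = 0;
            wordCounts[category][w]++;
        }
    }

    // log P(category) + sum over words of log P(word | category), with add-one smoothing
    static double Score(string category, string text)
    {
        double totalDocs = docCounts.Values.Sum();
        double totalWords = wordCounts[category].Values.Sum();
        double vocabSize = wordCounts.Values.SelectMany(d => d.Keys).Distinct().Count();
        double score = Math.Log(docCounts[category] / totalDocs);
        foreach (string w in text.ToLower().Split(' '))
        {
            int count = wordCounts[category].ContainsKey(w) ? wordCounts[category][w] : 0;
            score += Math.Log((count + 1) / (totalWords + vocabSize));
        }
        return score;
    }

    static void Main()
    {
        Train("spam", "cheap pills buy now");
        Train("ham", "meeting notes attached see you tomorrow");
        Console.WriteLine("spam score: " + Score("spam", "buy cheap pills"));
        Console.WriteLine("ham score: " + Score("ham", "buy cheap pills"));
    }
}

<br /><br />Whichever class gets the higher log score wins. The C# application described in this post follows the same idea over the SpamAssassin corpus, except that, as noted below, it restricts the scoring to the most informative words.<br /><br />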
Further, the classification model can be improved by including features such as letter case, HTML, term co-occurrence, the set of terms considered for classification, etc.<br /><br />You have to first train the system before testing it, so click the 'Train' button before clicking on 'Test'. The application requires two datasets, 'easy_ham_2' and 'spam_2', both of which can be found on the SpamAssassin website. The two datasets are also used for testing. You can make changes in the code to test a different dataset.<br /><br />I only used the top 15 words for classification. You can change that in the 'docprob' function that calculates the document 'spam' and 'ham' probabilities.<br /><br />The output of testing the datasets is printed on the console.<br /><br />This is just a proof-of-concept application, so it is not perfect. I didn't pay much attention to the design (UI), but the UI doesn't have much to do with the spam filtering functionality anyway. You can use other datasets from the SpamAssassin website for testing the trained system.<br /><br />This code can also be used for text classification in general, with as many classes as one wants. Any suggestions are more than welcome.<br /><br />Here is the <a href="http://ecs.fullerton.edu/%7Eraza09/software/SpamFilter.zip">complete project file.</a>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com2tag:blogger.com,1999:blog-5940921418163827691.post-59247447196020787672010-09-26T16:06:00.000-07:002010-09-26T16:53:46.146-07:00DT-Tree: A Semantic Representation of Scientific PapersThis year I have been really busy working on some research projects, hence the delay in blog posts. Recently, one of my research works got accepted at the <span style="font-weight: bold;">2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010)</span>. The conference proceedings were published by IEEE-CS and you can find the link to the paper <a style="font-weight: bold;" href="http://www.computer.org/portal/web/csdl/doi/10.1109/CIT.2010.231">here</a>. I am really grateful to my advisor, <a style="font-weight: bold;" href="http://faculty.fullerton.edu/xwang/">Dr. Shawn X. Wang</a>, whose guidance and support made me achieve my goals. The paper was about a new indexing structure for scientific documents that we developed. Following is the abstract of the paper:<br /><br /><div style="text-align: center; font-weight: bold;"><span style="font-style: italic;">With the tremendous growth in electronic publication, locating the most relevant references is becoming a challenging task. Most effective document indexing structures represent a document as a vector of very high dimensionality. It is well known that such a representation suffers from the curse of dimensionality. In this paper, we introduce DT-Tree (Document-Term-Tree) - a new structure for the representation of scientific documents. DT-Tree represents a document using the 50 most frequent terms in that specific document. These terms are grouped into a tree structure according to where they appear in the document, such as title, abstract, or section title, etc. The distance between two documents is calculated based on their DT-Trees. Two DT-Trees are compared using the Dice coefficient between the corresponding nodes of the trees. To verify the effectiveness of our similarity measure, we conducted experiments to cluster 150 documents in three categories, namely biology [1], chemistry [2-3] and physics [3]. 
The experimental results indicated 100% accuracy.</span><br /><br /><div style="text-align: left;"><span style="font-weight: normal;">I am still working on some more research projects and will post more information about them here.</span><br /></div></div>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-86195497788850698122010-03-20T11:22:00.000-07:002010-03-20T16:36:33.943-07:00Creating Spam Filter using Naive Bayes ClassifierA few months ago I gave a lecture to CS students about data mining. I decided to show how a spam filter can be built using a simple data mining technique called the <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes classifier</a>. It was an interactive lecture and I was surprised by the students' interest in the field and the questions that they asked.<br /><br />Since the intention was to keep things simple, I used a simple example to walk them through the steps of creating a spam filter (see the slides embedded below). It was an exciting and rewarding experience. The most interesting part was when I asked for suggestions regarding the types of features which should be extracted for spam filtering. I was hoping to hear something like "individual words", "email addresses", etc., and they indeed were among the responses that I got. But I wasn't expecting them to suggest taking into account the "case sensitivity", "color scheme", "co-occurrence" and "spelling" of words in the email body. These responses were from undergrads who were mostly new to the idea of data mining. The curiosity and in-depth knowledge of these students made the lecture very enjoyable.<br /><br />Here are the slides:<br /><br /><iframe src="http://docs.google.com/present/embed?id=dmv7486_28cbpxtqxd" frameborder="0" width="410" height="342"></iframe>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com80tag:blogger.com,1999:blog-5940921418163827691.post-29747629463828633132010-01-29T15:42:00.000-08:002010-01-30T18:31:12.396-08:00Parsing Robots.txt FileCrawling is an essential part of search engine development. The more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages of the site by identifying those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, you can see what the robots.txt file for CNN looks like <a href="http://www.cnn.com/robots.txt">here</a>.<br /><br />While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.<br /><br />For complete details about the format and semantics of a robots.txt file, visit <a href="http://www.robotstxt.org/orig.html">The Web Robots Pages</a>. Here is a portion from this webpage:<br /><br /><p> <span style="font-size:85%;"><span style="font-style: italic;">The record starts with one or more </span><code style="font-style: italic;">User-agent</code><span style="font-style: italic;"> lines, followed by one or more </span><code style="font-style: italic;">Disallow</code><span style="font-style: italic;"> lines, as detailed below. 
Unrecognised headers are ignored.</span></span></p> <dl style="font-style: italic;"><dt><span style="font-size:85%;">User-agent</span></dt><dd><span style="font-size:85%;"> The value of this field is the name of the robot the record is describing access policy for. </span><p><span style="font-size:85%;"> If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.</span></p> <p><span style="font-size:85%;"> The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.</span></p> <p><span style="font-size:85%;"> If the value is '<code>*</code>', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "<code>/robots.txt</code>" file.</span></p></dd><dt><span style="font-size:85%;">Disallow</span></dt><dd><span style="font-size:85%;"> The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, <code>Disallow: /help</code> disallows both <code>/help.html</code> and <code>/help/index.html</code>, whereas <code>Disallow: /help/</code> would disallow <code>/help/index.html</code> but allow <code>/help.html</code>. </span><p><span style="font-size:85%;"> Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.</span></p></dd></dl>Using this description we can easily write the code to create a list of disallowed URLs. All we need to do is parse the robots.txt file and split the file first by using "User-agent" and then again by "Disallow". This will give us the list which we can then use in <a href="http://raza-rizvi.blogspot.com/2009/04/simple-web-crawler.html">our crawler</a>. 
We just need to check this list every time we are about to crawl a page.<br /><br />Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs<span style="font-style: italic;">:<span style="font-style: italic;"><br /><br />using System;<br />using System.Collections.Generic;<br />using System.Linq;<br />using System.Text;<br />using System.IO;<br />using System.Net;<br />using System.Text.RegularExpressions;<br /><br />namespace ParsingRobotTxt<br />{<br /> class Program<br /> {<br /> public static HttpWebRequest req;<br /> public static HttpWebResponse res;<br /> static Stream resStream;<br /><br /> static void Main(string[] args)<br /> {<br /> String baseUrl = "http://www.cnn.com/";<br /> baseUrl += "/robots.txt";<br /><br /> getDisallowedUrls(baseUrl);<br /> }<br /><br /> static void getDisallowedUrls(string baseUrl)<br /> {<br /> if (isValidUrl(baseUrl))<br /> {<br /> urlOpen();<br /> }<br /><br /> String RobotTxtContent = read();<br /><br /> List<string> disallowed = new List<string>(); // List that holds Urls which shouldn't be crawled<br /> String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:");<br /> String userAgents = "";<br /> foreach (String agent in user_agents)<br /> {<br /> if (agent.Trim().StartsWith("*"))<br /> {<br /> userAgents = agent.Trim().Substring(1); <br /> }<br /> }<br /><br /> String[] disallow = Regex.Split(userAgents, "Disallow:");<br /> <br /> foreach (String item in disallow)<br /> {<br /> if (item != "\n")<br /> {<br /> disallowed.Add(item.Trim());<br /> Console.WriteLine(baseUrl + item.Trim());<br /> }<br /> }<br /> <br /> Console.ReadLine();<br /><br /> }<br /><br /> public static String read()<br /> {<br /> StreamReader sr = new StreamReader(resStream);<br /> String strText = sr.ReadToEnd();<br /> return strText;<br /> }<br /><br /> public static void urlOpen()<br /> {<br /> resStream = res.GetResponseStream();<br /> }<br /><br /> public static bool isValidUrl(String url)<br /> {<br /> try<br /> {<br /> req = (HttpWebRequest)HttpWebRequest.Create(url);<br /> res = (HttpWebResponse)req.GetResponse();<br /> return (res.StatusCode == HttpStatusCode.OK);<br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine("Not a Valid URL:" + ex.Message + " - " + url);<br /> return false;<br /> }<br /> }<br /> }<br />}<br /><br /></string></string></span></span>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com8tag:blogger.com,1999:blog-5940921418163827691.post-22901700921964448482009-11-23T00:15:00.000-08:002010-01-07T23:24:22.111-08:00Duplicate Detection using MD5 and Jaccard Coefficient in C#Duplicate documents are significant issue in context of the Web. Detecting near duplicate documents in a large data set, like the web, is a long standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, MD5 algorithm and Jaccard Coefficient using C#.<br /><br /><span style="font-weight: bold;">Method</span><br />One of the simplest and efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as shingle) is a sequence of consecutive words of size 'n'. 
For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:<br /><br />- Are endorsements keeping<br />- endorsements keeping Slumdog<br />- keeping Slumdog kids<br />- Slumdog kids away<br />- kids away from<br />- away from school<br /><br />These shingles can then be converted to a hash number using <a href="http://en.wikipedia.org/wiki/MD5">MD5 </a>algorithm. Once we have encoded n-grams of the entire document we can then compare them with encoded n-grams of another documents using <a href="http://en.wikipedia.org/wiki/Jaccard_coefficient">Jaccard Coefficient</a>.<br /><br /><span style="font-weight: bold;">Code</span><br />Using Linq you can perform 'union' and 'intersection' functions in order to calculate Jaccard Coefficient. Here is the code for creating n-grams from 2 documents and calculating Jaccard Coefficient:<br /><br />public double Detect(string page1, string page2)<br /> {<br /> List<string> s1 = CreateShingles(page1);<br /> List<string> s2 = CreateShingles(page2);<br /><br /> var commonelements = s1.Intersect(s2);<br /> var totalElements = s1.Union(s2); <br /><br /> double jaccardCoef = (Convert.ToDouble(commonelements.Count())) / (Convert.ToDouble(totalElements.Count()));<br /><br /> return jaccardCoef;<br /> }<br /><br />You can implement your own MD5 algorithm but many languages provide built-in functionality to use MD-5. In C# you can use MD5 function using System.Security.Cryptography. Here is the code:<br /><br />public string CalculateMD5Hash(string input)<br /> {<br /> // step 1, calculate MD5 hash from input<br /> MD5 md5 = System.Security.Cryptography.MD5.Create();<br /> byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);<br /> byte[] hash = md5.ComputeHash(inputBytes);<br /><br /> // step 2, convert byte array to hex string<br /> StringBuilder sb = new StringBuilder();<br /> for (int i = 0; i < hash.Length; i++) { sb.Append(hash[i].ToString("X2")); } return sb.ToString(); } <br /><br />The 'input' here is the 3-gram phrase. Here is the <a href="http://ecs.fullerton.edu/%7Eraza09/software/DuplicateDetection.rar">complete Code</a>.<br />In case if you just like to see how it works, here is the <a href="http://ecs.fullerton.edu/%7Eraza09/software/DuplicateDetection.exe">executable file</a>.<br /><br />The program just look for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. 
Since some text ads change on every refresh the duplication percentage may differ by a few decimal points every time the two pages are compared.<br /><br />A better way is to extract the main content from the web pages and then extract individual words from it.<br /></string></string>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com6tag:blogger.com,1999:blog-5940921418163827691.post-8498069412437069932009-10-15T21:03:00.000-07:002009-10-19T00:38:02.159-07:00Text Summarization using Vector Space Model<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8D7kM6LGDTYTP6ZqlzZFTa5lr_9PEegaD3vvjVDOVEm_Kg_7n9byGEjLacb4D5BPU1Pyf9MSlMNop3aFzVfa_ZCw23y3DrrwhVRVgfcT0C5b0JAz5ngmdLX5PpUrAWoteTkFV5Fx_vFZE/s1600-h/textsummary.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 154px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8D7kM6LGDTYTP6ZqlzZFTa5lr_9PEegaD3vvjVDOVEm_Kg_7n9byGEjLacb4D5BPU1Pyf9MSlMNop3aFzVfa_ZCw23y3DrrwhVRVgfcT0C5b0JAz5ngmdLX5PpUrAWoteTkFV5Fx_vFZE/s200/textsummary.jpg" alt="" id="BLOGGER_PHOTO_ID_5394210538442975842" border="0" /></a><br />The rapid growth of World Wide Web has resulted in information overload. One cannot possibly read every news article, document, paper, etc that is out there on the web. That is why <span style="font-style: italic;">text summarization</span> has become essential in this information age. Even the search engines show the <span style="font-style: italic;">summary</span> <span style="font-style: italic;"><span style="font-style: italic;"><span style="font-style: italic;"></span></span></span> of each of the search results to help users select the most appropriate link.<br /><br />However, summarizing any document automatically is not an easy task. It is one of the challenging Natural Language Processing (NLP) problems. In this post, I will show you how to create a summary from a web document using <a href="http://raza-rizvi.blogspot.com/2009/08/classic-vector-space-model-in-c.html">Vector Space Model</a>.<br /><br />We will first extract the sentences from the document. The summary generated for each document will actually consist of sentences that are extracted from it. We will try to pick the sentences are more representative of the document than others using the VSM.<br /><br />In the <a href="http://raza-rizvi.blogspot.com/2009/08/classic-vector-space-model-in-c.html">previous</a> <a href="http://raza-rizvi.blogspot.com/2009/09/singular-value-decomposition-svd-and.html">examples</a>, there were two main components: list of documents and the query. We then assumed that there exist a string array "docs" which contains the query's (at zero index) and the documents' text. We are going to use the same data structure ("docs") but this time instead of the query we will have the document's text (which we are trying to summarize) at the zero index and the various sentences that we extracted from the document in the remaining indices.<br /><br />We will then rank the sentences using the VSM according to their similarity with the document. The similarity will decide whether we want to include the sentence in our summary or not.<br /><br />The code from VSM example is almost the same, the only new function is the sentence extractor that I had to write for this example. The sentence extractor is created using the following heuristics:<br /><br />1. 
If the character is "?" or "!" then mark this as the end of the sentence<br />2. If the character is "." then check that it is not preceded by "Mr", "Mrs", "Dr", etc.<br /><ul><li>if Yes, then don't mark this as the end of the sentence</li><li>if No, then check whether the "." is followed by a number</li><li><ul><li>if Yes, then don't mark this as the end of the sentence</li><li>if No, then mark this as the end of the sentence<br /></li></ul><br /></li></ul>The program requires a valid URL from which it will read the content and display the top 5 sentences that it thinks best summarize the document. For this simple program, I assume that "p" tags are used to display paragraphs. <p>As can be seen from the image above, the summary produced by this simple method is not too bad :)<br /><br /><a href="http://ecs.fullerton.edu/%7Eraza09/software/TextSummarization.rar">Here is the code</a>.<br /></p>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com12tag:blogger.com,1999:blog-5940921418163827691.post-62533026153150119922009-10-10T23:51:00.000-07:002009-10-28T15:04:24.279-07:00Extracting sentences from a documentExtracting words from a document is quite common and a fairly easy task, and using variations of Regex.Split(documentText, @"\W") one can easily do that. However, for a spam filtering project, I wanted to extract sentences from a document. I actually just wanted to extract the first and the last sentence, but I thought it would be easier to first split the entire document into sentences using the split function as follows:<br /><br />char[] delimiter={'.','?','!'};<br />string[] sentences = document.Split(delimiter);<br /><br />These two lines of code will work in most cases since '?', '!' and '.' are usually the characters that represent the end of a sentence. However, using the period (.) can lead to erroneous results since the dot '.' can also be used to represent a decimal number (3.5, 100.03, etc.) or an abbreviation (Mr., Mrs., Dr., etc.).<br /><br />Unfortunately, the corpus that I was using had a lot of instances of decimal numbers and abbreviations, so I had to come up with a solution. After unsuccessfully trying to Google (or Bing) for some ideas, I came up with my own functions to solve this problem. The functions worked perfectly fine on my corpus, but they may require some modifications to work for other types of documents.<br /><br />The main function is 'parseSentences()', which splits the document into sentences. It requires that the entire document text be passed as a string parameter.<br /><br />The 'checkInitial()' function is called when the program comes across a period. This function checks whether the characters before the period represent an initial (like Dr., Miss., Mrs., Mr.) or not and returns a boolean value based on its finding.<br /><br />The 'checkNum()' function checks whether the characters around the encountered period are digits or not.<br /><br />If 'checkInitial()' and 'checkNum()' both return false, then that represents the end of a sentence, and the sentence is added to a List. 
If any of them returns true, then the period is considered a part of sentence and the counter moves on to check the next character.<br /><br />Here is the code:<br /><br />private void parseSentences(string document)<br />{<br />string para = document;<br />string sentence = ""; //keeps track of current sentence<br />List words = new List(); //stores different sentences<br />int counter = 0; //keeps track of index of the character being read<br /><br />CharEnumerator charEnum = para.GetEnumerator();//Enumerator to read each character<br />while (charEnum.MoveNext())<br />{<br />char currChar = para[counter]; //holds the current character being read<br /><br />if (currChar == '.')<br />{<br />//check if the character before period are numbers<br />if (checkNum(para, currChar, counter))<br />{<br />//if not then period represents a number and not end of sentence<br />sentence += currChar.ToString();<br />}<br /><br />//check if the character before period are initials of some sort<br />else if (checkInitial(para, currChar, counter))<br />{<br />sentence += currChar.ToString();<br />}<br />else<br />{<br />//add the sentence to the list<br />words.Add(sentence);<br />sentence = "";<br />}<br />}<br />else if (currChar == ' ')<br />{<br /><br />sentence += currChar.ToString();<br /><br />}<br /><br />else if ((currChar == '!') || (currChar == '?'))<br />{<br />words.Add(sentence);<br />sentence = "";<br /><br />}<br /><br />else<br />{ sentence += currChar.ToString(); }<br />counter++;<br />}<br /><br />if (sentence != "")<br />{ words.Add(sentence); }<br /><br />string list = "";<br /><br />foreach (string s in words)<br />{<br />Console.WriteLine(s);<br />}<br />}<br /><br />private bool checkNum(string para, char currChar, int counter) <br />{ //space means not a decimal number <br /><br />try <br />{ <br />if (para[counter + 1] == ' ') <br />return false; <br />} catch(Exception)<br /> { return false; } <br /><br />char currNext= para[counter+1]; <br /><br />if ((currNext >= '0' & currNext <= '9')) <br />{ char currPrev=para[counter-1]; <br /><br />if ((currPrev >= '0' & currPrev <= '9')) <br />return true; <br />} <br />return false; <br />} <br /><br />private bool checkInitial(string para, char currChar, int counter) <br />{ <br />string word1 = ""; string word2 = ""; <br /><br />for (int i = 3; i > 0; i--)<br />{<br />try<br />{<br />if (i < 3)<br />{<br />word1 += para[counter - i].ToString();<br />}<br />word2 += para[counter - i].ToString();<br />}<br />catch(Exception e)<br />{ <br />Console.WriteLine(e.toString());<br />}<br /><br />}<br /><br />if ((word1.ToLower() == "mr") || (word1.ToLower() == "dr") || (word2.ToLower() == "miss") || (word2.ToLower() == "mrs"))<br />{<br />return true;<br />}<br />return false;<br />}Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-8637761524300606702009-09-07T15:27:00.000-07:002009-09-07T16:14:20.870-07:00Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) in C#<a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">LSI</a> is used in a variety of information retrieval and text processing applications. 
It uses uses a mathematical technique called <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition" title="Singular value decomposition">Singular Value Decomposition (SVD)</a> to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.<br /><br />Tutorials for <a href="http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf">LSI</a> and <a href="http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf">SVD</a> can be found online. I will use <a href="http://www.bluebit.gr/">Bluebit .Net Matrix Library</a> to calculate SVD and other matrix and vector operations needed in LSI.<br /><br />I implemented LSI in C# using the example described in <a href="http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html">Mi Ishlita</a>. The program will display the name of each document and its similarity value. You can use any sorting algorithm to sort the results by the similarity values. Like in the <a href="http://raza-rizvi.blogspot.com/2009/08/classic-vector-space-model-in-c.html">previous example</a>, the program assumes that there is a string array "docs" which contains the query's (at zero index) and the documents' text.<br /><br />To use the code (below), you need to download and install .NET Matrix Library (NML™) 4.3 and then reference .NET Matrix Library from within your project using the following instructions:<br /><ol><li>Open the Visual Studio a and open a new Visual Basic or C# project.</li><li> <div>Select <strong>Project</strong> | <strong>Add Reference...</strong></div> </li><li> <div>Select the <strong>.NET</strong> tab and find <strong>Bluebit .NET Matrix Library</strong> listed, double click on it and then press OK.</div></li></ol>Here is the code:<br />using System;<br />using System.Collections.Generic;<br />using System.Linq;<br />using System.Text;<br />using Bluebit.MatrixLibrary;<br />using System.Collections;<br />using System.Text.RegularExpressions;<br /><br />namespace LSI<br />{<br /> class Program<br /> { <br /> static double[,] A; //term-document Array<br /> static double[] q; //term-query Array<br /> static List<string> wordlist = new List<string>(); //List of terms found in documents<br /> static SortedDictionary<double,> sortedList = new SortedDictionary<double,>(); //Documents ranked by VSM with angle value<br /> static string[] docs ={"gold silver truck", //Query<br /> "shipment of gold damaged in a fire",//Doc 1<br /> "delivery of silver arrived in a silver truck", //Doc 2<br /> "shipment of gold arrived in a truck"}; //Doc 3<br /><br /> static void Main(string[] args)<br /> {<br /> createWordList();<br /> createVector();<br /> LatentSemanticIndexing();<br /> <br /><br /> foreach (KeyValuePair<double,> kvp in sortedList)<br /> {<br /><br /> Console.WriteLine(kvp.Value + " " + kvp.Key);<br /> }<br /> Console.ReadLine();<br /> }<br /><br /><br /> public static void createWordList()<br /> {<br /> foreach (string doc in docs)<br /> {<br /> wordlist = getWordList(wordlist, doc);<br /> }<br /> wordlist.Sort(); //sort the wordlist alphabetically<br /> }<br /><br /> public static List<string> getWordList(List<string> wordlist, string query)<br /> {<br /> Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);<br /> MatchCollection MCollection = exp.Matches(query);<br /><br /> foreach (Match match in MCollection)<br /> {<br /> if 
(!wordlist.Contains(match.Value))<br /> {<br /> wordlist.Add(match.Value);<br /> }<br /> }<br /><br /> <br /> return wordlist;<br /> }<br /><br /> public static void createVector()<br /> {<br /> double[] queryvector;<br /> q = new double[wordlist.Count];<br /> A = new double[wordlist.Count, docs.Length-1];<br /> for (int j = 0; j < docs.Length; j++)<br /> {<br /> queryvector = new double[wordlist.Count];<br /><br /> for (int i = 0; i < wordlist.Count; i++)<br /> {<br /> //calculate Term Frequency<br /> double tf = getTF(docs[j], wordlist[i]);<br /> <br /> if (j == 0) //if Query term then add it to query array<br /> {<br /> q[i] = tf;<br /> }<br /> else //if document term then add it to document array<br /> {<br /> A[i, j-1] = tf;<br /> }<br /> }<br /><br /> }<br /><br /> }<br /><br /> private static void LatentSemanticIndexing()<br /> {<br /> //Singular Value Decomposition<br /> Matrix docMatrix = new Matrix(A);<br /> SVD svd = new SVD(docMatrix);<br /><br /> //A = U S VT<br /> Matrix U = svd.U;<br /> Matrix S = svd.S;<br /> Matrix V = svd.V;<br /> Matrix VT = svd.VH;<br /><br /> //Dimensionality Reduction: Computing Uk, Sk, Vk and VkT<br /> Matrix Uk = new Matrix(U.ToArray(), U.Rows, U.Cols - 1);<br /> Matrix Sk = new Matrix(S.ToArray(), S.Rows - 1, S.Cols - 1);<br /> Matrix Vk = new Matrix(V.ToArray(), V.Rows, V.Cols - 1);<br /> Matrix VkT = Vk.Transpose();<br /><br /><br /> //q = qT Uk Sk-1<br /> Matrix queryMatrix = new Matrix(q,q.Length,1);<br /> queryMatrix = queryMatrix.Transpose() * Uk * Sk.Inverse();<br /><br /> //sim(q, d) = sim(qT Uk Sk-1, dT Uk Sk-1) using cosine similarities<br /> for (int i = 0; i < V.Rows; i++)<br /> {<br /> Vector docVector = Vk.RowVector(i);<br /> Vector queryVector = queryMatrix.RowVector(0);<br /> double sim = Vector.DotProduct(docVector, queryVector) / (docVector.Length * queryVector.Length);<br /> <br /> Console.WriteLine("Doc " + (i + 1).ToString() + " :" + sim);<br /> }<br /><br /><br /><br /> }<br /> private static double getTF(string document, string term)<br /> {<br /> string[] queryTerms = Regex.Split(document, "\\s");<br /> double count = 0;<br /><br /><br /> foreach (string t in queryTerms)<br /> {<br /> if (t == term)<br /> {<br /> count++;<br /> }<br /> }<br /> return count;<br /><br /> }<br /><br /> <br /> }<br />}<br /><br />After you have sorted the results in descending order using the similarity values, your results will be as follows:<br /><span style="font-weight: bold;">d2>d3>d2</span>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com40tag:blogger.com,1999:blog-5940921418163827691.post-67824709979397274962009-08-17T20:40:00.000-07:002009-12-09T00:00:53.625-08:00Classic Vector Space Model in C#Vector Space Model (VSM) is widely used in Information Retrieval systems. The model creates a space in which both documents and queries are represented by vectors. VSM treats the query as a document as well and then it determines similarity between query and other documents by calculating the deviation of angles between each document vector and the original query vector. It then uses the deviation to rank documents.<br /><br />I implemented the model in C# using the example described in <a href="http://www.miislita.com/term-vector/term-vector-3.html">Mi Islita.</a> The program will display elements of a dictionary object that hold the name of each document and its similarity value.<br /><br />You can use any sorting algorithm to sort the dictionary by the similarity values. 
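<br /><br />For instance, one simple way to do this with LINQ is shown below; this is a small sketch meant to replace the dict.Reverse() loop in Main (it relies only on the sortedList dictionary and the System.Linq namespace that the program already uses):<br /><br />
// Iterate the similarity dictionary in descending key order instead of relying on insertion order
foreach (var pair in sortedList.OrderByDescending(kv => kv.Key))
{
    Console.WriteLine("{0} -> Doc{1}", pair.Key, pair.Value);
}
<br /><br />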
The program also assumes that there is a string array "docs" which contains the query's (at zero index) and the documents' text.<br /><br />Here is the code:<br /><br /><br />using System;<br />using System.Collections.Generic;<br />using System.Text;<br />using System.Text.RegularExpressions;<br />using System.Collections;<br />using System.Linq;<br />namespace VectorSpaceModel<br />{<br /> class Program<br /> {<br /> static Hashtable DTVector = new Hashtable(); //Hashtable to hold Document Term Vector<br /> static List<string> wordlist = new List<string>(); //List of terms found in documents<br /> static Dictionary<double,string> sortedList = new Dictionary<double,string>(); //Documents ranked by VSM with angle value<br /> static string[] docs ={"gold silver truck", //Query<br />"shipment of gold damaged in a fire",//Doc 1<br />"delivery of silver arrived in a silver truck", //Doc 2<br />"shipment of gold arrived in a truck"}; //Doc 3<br /><br /> static void Main(string[] args)<br /> {<br /> createWordList();<br /> createVector();<br /> classify();<br /><br /> var dict = sortedList;<br /> foreach (var x in dict.Reverse())<br /> {<br /> Console.WriteLine("{0} -> Doc{1}", x.Key, x.Value);<br /> }<br /><br /> <br /> Console.ReadLine();<br /> }<br /><br /><br /> public static void createWordList()<br /> {<br /> foreach (string doc in docs)<br /> {<br /> wordlist = getWordList(wordlist, doc);<br /> }<br /> }<br /><br /> public static List<string> getWordList(List<string> wordlist, string query)<br /> {<br /> Regex exp = new Regex("\\w+", RegexOptions.IgnoreCase);<br /> MatchCollection MCollection = exp.Matches(query);<br /><br /> foreach (Match match in MCollection)<br /> {<br /> if (!wordlist.Contains(match.Value))<br /> {<br /> wordlist.Add(match.Value);<br /> }<br /> }<br /><br /> return wordlist;<br /> }<br /><br /> public static void createVector()<br /> {<br /> double[] queryvector;<br /><br /> for (int j = 0; j < docs.Length; j++)<br /> {<br /> queryvector = new double[wordlist.Count];<br /><br /> for (int i = 0; i < wordlist.Count; i++)<br /> {<br /><br /> double tfIDF = getTF(docs[j], wordlist[i]) * getIDF(wordlist[i]);<br /> queryvector[i] = tfIDF;<br /> }<br /><br /> if (j == 0) //is it a query?<br /> {<br /> DTVector.Add("Query", queryvector);<br /> }<br /> else<br /> {<br /> DTVector.Add(j.ToString(), queryvector);<br /> }<br /> }<br /> }<br /><br /> public static void classify()<br /> {<br /> double temp = 0.0;<br /><br /> IDictionaryEnumerator _enumerator = DTVector.GetEnumerator();<br /><br /> double[] queryvector = new double[wordlist.Count];<br /><br /> Array.Copy((double[])DTVector["Query"], queryvector, wordlist.Count);<br /><br /> while (_enumerator.MoveNext())<br /> {<br /> if (_enumerator.Key.ToString() != "Query")<br /> {<br /> temp = cosinetheta(queryvector, (double[])_enumerator.Value);<br /><br /> sortedList.Add(temp, _enumerator.Key.ToString());<br /><br /> }<br /> }<br /> }<br /><br /> public static double dotproduct(double[] v1, double[] v2)<br /> {<br /> double product = 0.0;<br /> if (v1.Length == v2.Length)<br /> {<br /> for (int i = 0; i < v1.Length; i++)<br /> {<br /> product += v1[i] * v2[i];<br /> }<br /> }<br /> return product;<br /> }<br /><br /> public static double vectorlength(double[] vector)<br /> {<br /> double length = 0.0;<br /> for (int i = 0; i < vector.Length; i++)<br /> {<br /> length += Math.Pow(vector[i], 2);<br /> }<br /><br /> return Math.Sqrt(length);<br /> }<br /> private static double getTF(string document, string term)<br /> {<br /> 
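// Raw term frequency: counts how often 'term' occurs in 'document'; createVector multiplies this by getIDF to form each tf-idf vector component.<br />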
string[] queryTerms = Regex.Split(document, "\\s");<br /> double count = 0;<br /><br /><br /> foreach (string t in queryTerms)<br /> {<br /> if (t == term)<br /> {<br /> count++;<br /> }<br /> }<br /> return count;<br /><br /> }<br /><br /> private static double getIDF(string term)<br /> {<br /> double df = 0.0;<br /> //get term frequency of all of the sentences except for the query<br /> for (int i = 1; i < docs.Length; i++)<br /> {<br /> if (docs[i].Contains(term))<br /> {<br /> df++;<br /> }<br /> }<br /><br /> //Get sentence count<br /> double D = docs.Length - 1; //excluding the query <br /><br /> double IDF = 0.0;<br /><br /> if (df > 0)<br /> {<br /> IDF = Math.Log(D / df);<br /> }<br /><br /> return IDF;<br /> }<br /><br /> public static double cosinetheta(double[] v1, double[] v2)<br /> {<br /> double lengthV1 = vectorlength(v1);<br /> double lengthV2 = vectorlength(v2);<br /><br /> double dotprod = dotproduct(v1, v2);<br /><br /> return dotprod / (lengthV1 * lengthV2);<br /><br /> }<br /> }<br />}Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com18tag:blogger.com,1999:blog-5940921418163827691.post-38271120564458262552009-07-03T00:45:00.001-07:002009-11-20T12:29:32.801-08:00Extract Hyperlinks Using WebBrowser Object in C#In my <a href="http://raza-rizvi.blogspot.com/2009/04/simple-web-crawler.html">previous post</a>, I mentioned using regular expressions to extract links from a web page. While I like using regular expressions, there is a much easier way of doing this in C#.net. All that you need is a 'WebBrowser' object to read a webpage and extracting all of the hyperlinks present in it. However, you need to create Windows Forms applications to use the WebBrowser object and then add the following code (on a button click event).<br /><br />using System;<br />using System.Collections.Generic;<br />using System.ComponentModel;<br />using System.Data;<br />using System.Drawing;<br />using System.Linq;<br />using System.Text;<br />using System.Windows.Forms;<br /><br />namespace WebbrowserTest<br />{<br /> public partial class WebBrowserSample : Form<br /> {<br /> <br /> public WebBrowserSample()<br /> {<br /> InitializeComponent();<br /> }<br /><br /> private void button1_Click(object sender, EventArgs e)<br /> {<br /> WebBrowser web = new WebBrowser();<br /> web.NewWindow += new CancelEventHandler(web_NewWindow);<br /> web.Navigate("http://www.cnn.com");<br /> <br /> //wait for the browser object to load the page<br /> while (web.ReadyState != WebBrowserReadyState.Complete)<br /> {<br /> Application.DoEvents();<br /> }<br /><br /> HtmlElementCollection ht = web.Document.Links;<br /> List<string> urls = new List<string>();<br /> foreach (HtmlElement h in ht)<br /> {<br /> try<br /> {<br /> if ((h.GetAttribute("href") != null) )<br /> {<br /> urls.Add(h.GetAttribute("href"));<br /> }<br /> }<br /> catch<br /> { }<br /> }<br /><br /> foreach (string url in urls)<br /> {<br /> Console.WriteLine(url);<br /> }<br /><br /> <br /><br /> }<br /><br /> void web_NewWindow(object sender, CancelEventArgs e)<br /> {<br /> e.Cancel = true;<br /> }<br /> }<br />}<br /><br /><br /><br />The 'NewWindow' event is used to discard popups which may open while navigating a web page. The 'Document.Links' property of the WebBrowser is used to get the HtmlCollection of all of the links present in a webpage. 
Once you have the HtmlCollection of links, the links could be processes/viewed using 'href' attribute of each 'HtmlElement' in the 'HtmlCollection'.<br /><br />Even though this is a pretty straight forward code, I found it a little slower. I think the reason is that you have to make sure that the webbrowser object has read the entire webpage before you can start using it (achieved by the while '(web.ReadyState != WebBrowserReadyState.Complete)' condition).<br /><br />Also if you are thinking about link processing in a non-Window Forms application, then this program is not for you. For all the Window Forms application developers, enjoy the code :)Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com14tag:blogger.com,1999:blog-5940921418163827691.post-18922831394521064122009-04-18T00:56:00.000-07:002009-04-20T01:32:58.982-07:00A simple web crawlerit is becoming difficult to write more about what I have been reading/researching related to Information Retrieval but IR continues to be my area of interest. There is tons of literature out there and each research paper that I read brings with it more insight about IR. That made me want to test the various theories and algorithms mentioned in those papers. And to test them, first thing that I needed was data (web documents to be precise). Now, crawling is in itself a very important area of research. One could opt to write a very efficient crawler but I am a full-time student who works part-time to pay his bills and on top of that I have to start working on my dissertation as well. So there was no way I could allocate more time writing a crawler in order to test the various theories.<br /><br />Instead I chose to write a very simple crawler. A simple crawler could just make use of the link structure of the web. And since I am not the only one who <a href="http://www.betaversion.org/%7Estefano/linotype/news/275/">thinks</a> that or does that, I thought it is a good step to start with. What I really needed was to extract < a > tags from a given web page. The following regular expression allowed me to do that:<br /><br /><span style="font-style: italic;">Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);</span><br /><br /><br />The main function of the crawler that I wrote loops through the list of predefined pages, calling <span style="font-size:85%;"><span style="font-style: italic;font-family:arial;" >addtoindex</span><a style="font-style: italic;" name="does nothing"></a></span> function on each one. The function can be used to store links and their text but for the sake of this post it just prints the URL. It then uses the regular expression that I mentioned above to get all the links on that page and adds their URLs to a set called <span style="font-style: italic;font-size:85%;" ><span style="font-family:arial;">newpages</span></span>. 
At the end of the loop, <span style="font-style: italic;font-family:arial;font-size:85%;" >newpages</span> becomes <span style="font-style: italic;font-size:85%;" ><span style="font-family:arial;">pages</span></span>, and the process repeats.<br /><br />Here is the complete code in C#:<br /><br /><span style="font-style: italic;font-size:85%;" ><br />using System;<br />using System.Collections.Generic;<br />using System.Linq;<br />using System.Text;<br />using System.Text.RegularExpressions;<br />using System.Collections;<br />using System.Net;<br />using System.IO;<br /><br />namespace Crawler<br />{<br /> class Program<br /> {<br /> private static String strText;<br /> static MatchCollection tagCollection;<br /> public static HttpWebRequest req;<br /> public static HttpWebResponse res;<br /> static Stream resStream;<br /> public static string baseUrl;<br /><br /> static void Main(string[] args)<br /> {<br /> //add the specific site that you want to crawl<br /> baseUrl = "http://www.techcrunch.com/";<br /><br /> ArrayList pages = new ArrayList();<br /> pages.Add(baseUrl);<br /><br /> //start crawling<br /> crawl(pages, 20);<br /><br /> Console.WriteLine("/nIndexing Complete!!");<br /> Console.ReadLine();<br /> }<br /><br /> public static void crawl(ArrayList pages, int depth)<br /> {<br /> MatchCollection mc;<br /> ArrayList links = new ArrayList();<br /><br /> //Breadth-first search algorithm to crawl the pages and collect links<br /> for (int i = 0; i < depth; i++)<br /> {<br /> ArrayList newpages = new ArrayList();<br /><br /> foreach (String page in pages)<br /> {<br /> try<br /> {<br /> if (isValidUrl(page))<br /> {<br /> urlOpen();<br /> }<br /> }<br /> catch (Exception ex)<br /> {<br /> System.Console.WriteLine("Couldnot open {0} because {1}", page, ex.ToString());<br /> continue;<br /> }<br /><br /> string pagecontent = read();<br /><br /> //adding the page in the index<br /> addtoindex(page, pagecontent);<br /><br /> mc = tagList(pagecontent, "a");<br /><br /> links = getAttributeValue(mc, "href", baseUrl);<br /><br /> foreach (string link in links)<br /> {<br /> String url, linktext;<br /> url = linktext = null;<br /><br /> <br /> if (link.Contains("#"))<br /> {<br /> try<br /> {<br /> url = link;<br /> <br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine("Error in Crawl " + ex.Message + " - " + url);<br /> }<br /> }<br /> else<br /> {<br /> url = link;<br /> }<br /><br /> try<br /> {<br /> if ((url.Substring(0, 4) == "http") && (isindexed(url) == false))<br /> {<br /> <br /> newpages.Add(url);<br /> }<br /><br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine("Couldnot add new page " + url + " b/c {0}", ex.ToString());<br /> }<br /> linktext = gettextonly(pagecontent);<br /> }<br /> <br /> }<br /> pages = newpages; <br /><br /> }<br /> }<br /><br /> //Returns false for now, but can be modified to query a database to check whether a page has already been indexed<br /> public static bool isindexed(string url)<br /> {<br /> return false;<br /> }<br /><br /> //Add page to the index, this is where a database or file system can be used<br /> public static void addtoindex(string url, string pagecontent)<br /> {<br /><br /> Console.WriteLine("Indexing : " + url);<br /><br /> }<br /><br /> //Get the collection of < a > tags in a page<br /> public static MatchCollection tagList(String HTMLcontent, String tag)<br /> {<br /><br /> Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | 
RegexOptions.Compiled);<br /> try<br /> {<br /> tagCollection = extractTags.Matches(HTMLcontent);<br /> <br /> return tagCollection;<br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine(ex.ToString());<br /> }<br /> return null;<br /> }<br /><br /> //Gets the HREF value from each <> tag<br /> public static ArrayList getAttributeValue(MatchCollection mc, String Attr, string url)<br /> {<br /> ArrayList links = new ArrayList();//{ ""};<br /> <br /> foreach (Match match in mc)<br /> {<br /> <br /> string temp = match.Value;<br /><br /> try<br /> {<br /> if (temp.Contains("http"))<br /> {<br /> links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - temp.IndexOf("http") - 1));<br /> <br /> }<br /><br /> else if (temp.Contains("://"))<br /> {<br /> links.Add(temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7)));<br /> }<br /> else<br /> {<br /> string strTemp = temp.Substring(temp.IndexOf("href") + 6, temp.LastIndexOf(">") - (temp.IndexOf("href") + 7));<br /> url.Replace("\n\r", "");<br /> if (strTemp[0] != '/' && url[url.Length - 1] != '/')<br /> {<br /> strTemp = url + "/" + strTemp;<br /> }<br /> else<br /> {<br /> strTemp = url + strTemp;<br /> }<br /> links.Add(strTemp);<br /> }<br /><br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine("Error in GetAttributes :" + ex.Message + " - " + url);<br /> }<br /> <br /> }<br /> return links;<br /><br /> }<br /><br /> //reads that content of a web page<br /> public static string gettextonly(string pagecontent)<br /> {<br /> <br /> string pattern = @"<(.|\n)*?>";<br /> return Regex.Replace(pagecontent, pattern, String.Empty);<br /><br /> }<br /><br /><br /> public static String read()<br /> {<br /> StreamReader sr = new StreamReader(resStream);<br /> strText = sr.ReadToEnd();<br /> return strText;<br /> }<br /><br /> public static void urlOpen()<br /> {<br /> resStream = res.GetResponseStream();<br /> }<br /><br /> public static bool isValidUrl(String url)<br /> {<br /> try<br /> {<br /> req = (HttpWebRequest)HttpWebRequest.Create(url);<br /> res = (HttpWebResponse)req.GetResponse();<br /> return (res.StatusCode == HttpStatusCode.OK);<br /> }<br /> catch (Exception ex)<br /> {<br /> Console.WriteLine("Error in ISValidURL " + ex.Message + " - " + url);<br /> return false;<br /> }<br /><br /> }<br /> }<br />}<br /></span><a><br /><br />Any suggestions are welcomed.<br /></a><p class="docText"><a name="link calls"></a></p>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com65tag:blogger.com,1999:blog-5940921418163827691.post-74021719454377859972009-03-06T01:21:00.000-08:002009-05-31T00:27:03.980-07:00Peter Norvig; "we don't think it's a big advance to be able to type something as a question as opposed to keywords."Sometime ago I found myself in a healthy argument with a friend of mine about the future of search. Both of us are crazy about information retrieval and naturally our discussion more than normally diverts towards search. While usually we agree with each other's views, today I was shocked to find a major difference in our opinions.<br /><br />I sincerely believe that any breakthrough in search will have something to do with Natural Language Processing. However my friend thinks that that will not be the case. This was not at all a big deal. But what shocked me was the evidence he used... 
Peter Norvig.<br /><br />Now, I am a big fan of Peter Norvig, but while I am crazy about all things Peter Norvig, I was <span style="font-style: italic;">disappointed to read what he said in his interview with </span><a style="font-style: italic;" href="http://www.technologyreview.com/web/19868/page2/">Technology Review</a><span style="font-style: italic;">:</span><br style="font-style: italic;"><p style="font-style: italic; color: rgb(102, 102, 102);"><strong><em>"TR</em>:</strong> Companies such as Ask and Powerset are betting that the future is in natural-language search, which lets people use real, useful sentences instead of potentially ambiguous keywords. What is Google doing with natural language? </p> <p style="font-style: italic; color: rgb(102, 102, 102);"><strong>PN:</strong> We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing "What is the capital of France?" won't get you better results than typing "capital of France." But<em> understanding</em> how words go together is important. To give some examples, "New York" is different from "York," but "Vegas" is the same as "Las Vegas," and "Jersey" may or may not be the same as "New Jersey." <em>That's</em> a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."</p><span style="color: rgb(0, 0, 0);">Apparently, Peter Norvig thinks that typing complete questions would just be a change in interface. I seriously doubt it. These days, I hardly use single-keyword searches. I like Powerset and really wish that I could google something much more intelligent than a few keywords. But if the director of research at Google doesn't think that this will bring any big advancement</span> to search, then maybe it is just me. I am sure Peter has substantial data to back his comments, but I for one would like to type in a question (and not something as simple as "Who is the 44th president of the USA") and expect a relevant answer.<br /><br />But that's just me... The interview was conducted in 2008; here is hoping that since then Peter has given a second thought to some serious NLP usage for search.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-14403324835879757362009-02-02T16:56:00.001-08:002009-03-04T00:44:53.097-08:00Information Retrieval - A word about Word Distribution<span style="font-weight: bold;font-size:130%;" >Word distributions</span><br />Words are not distributed evenly in documents. The same goes for letters of the alphabet, city sizes, wealth, etc. Usually, the 80/20 rule applies (80% of the wealth goes to 20% of the people, or 20% of the effort builds the easier 80% of the system). For example, in Shakespeare's famous play Romeo and Juliet we find the following word frequencies:<br /><br />Romeo and Juliet:<br />And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; ........ 
Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63;…..................<br />A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; ...<br /><br /><span style="font-weight: bold;">Stop words</span><br />The 250-300 most common words in English account for 50% or more of a given text.<br />Example: “the” and “of” represent 10% of tokens; “and”, “to”, “a”, and “in” account for another 10%; the next 12 words for another 10%.<br /><br />For example, Moby Dick Ch. 1 has 859 unique words (types) and 2256 word occurrences (tokens). The top 65 types cover 1132 tokens (> 50%).<br />The token/type ratio is therefore 2256/859 = 2.63<br /><br /><span style="font-weight: bold;">Zipf’s law<br /><br /></span><span style="font-style: italic;">Rank x Frequency = Constant</span><br /><br />Zipf's law is fairly general. It states that, if t1 is the most common term in the collection, t2 the next most common, etc., then the collection frequency (cf) of the ith most common term is proportional to 1/i:<br /><br /><div style="text-align: center;">cf(i) ∝ 1/i<br /><br /></div><br /><span style="font-weight: bold;">Heaps’ law<br /></span>A better way of getting a handle on the vocabulary size M is Heaps’ law, which estimates vocabulary size as a function of collection size:<br /><br /><div style="text-align: center;">M = kT<sup>b</sup><br /></div><br />where T is the number of tokens in the collection. Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.<br />Regardless of the values of the parameters for a particular collection, Heaps’ law suggests that: (i) the dictionary size will continue to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary will be quite large for large collections.<br /><br />In English, k is between 10 and 100, and b is between 0.4 and 0.6.<br /><br /><br /><span style="font-weight: bold;">IDF: Inverse document frequency</span><br />TF * IDF is used for automated indexing and for topic discrimination:<br /><br /><div style="text-align: center;">idf(k) = log<sub>2</sub>(N/d(k)) + 1 = log<sub>2</sub>N - log<sub>2</sub>d(k) + 1<br /></div><br />N: number of documents<br />d(k): number of documents containing term k<br />f(i,k): absolute frequency of term k in document i<br />w(i,k): weight of term k in document i<br /><br /><br /><span style="font-weight: bold;">Vector-based matching</span><br /><p> A popular measure of similarity for text clustering is the cosine of the angle between two vectors. The cosine measure is given by </p> <div align="center"><!-- MATH \begin{equation} s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{\mathbf{x}_a^{\dagger} \mathbf{x}_b} {\|\mathbf{x}_a\|_2 \cdot \|\mathbf{x}_b\|_2} \end{equation} --> <table width="100%" align="center" cellpadding="0"> <tbody><tr valign="middle"> <td align="center" nowrap="nowrap"><img src="http://www.lans.ece.utexas.edu/%7Estrehl/diss/img185.png" alt="$\displaystyle s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b) = \frac{\mathbf{x}_... ...ger} \mathbf{x}_b} {\Vert\mathbf{x}_a\Vert _2 \cdot \Vert\mathbf{x}_b\Vert _2}$" width="228" align="middle" border="0" height="65" /></td> <td width="10" align="right" nowrap="nowrap"> (4.2)</td></tr> </tbody></table></div><br />and captures a scale invariant understanding of similarity. 
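<br /><br />To make these formulas concrete, here is a small C# sketch of my own (the class and method names are purely illustrative, and the term vectors are just dictionaries of term frequencies). It computes the idf weight exactly as defined above and the cosine measure between two term vectors:<br /><br /><span style="font-style: italic;">using System;<br />using System.Collections.Generic;<br /><br />class VectorMatchingDemo<br />{<br /> // idf(k) = log2(N / d(k)) + 1, as defined above<br /> public static double Idf(int totalDocs, int docsWithTerm)<br /> {<br /> return Math.Log((double)totalDocs / docsWithTerm, 2) + 1;<br /> }<br /><br /> // Cosine measure: dot product of the two vectors divided by the product of their lengths<br /> public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)<br /> {<br /> double dot = 0, normA = 0, normB = 0;<br /> foreach (KeyValuePair<string, double> kvp in a)<br /> {<br /> normA += kvp.Value * kvp.Value;<br /> double bval;<br /> if (b.TryGetValue(kvp.Key, out bval))<br /> dot += kvp.Value * bval; // only shared terms contribute<br /> }<br /> foreach (double v in b.Values)<br /> normB += v * v;<br /> if (normA == 0 || normB == 0) return 0;<br /> return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));<br /> }<br /><br /> static void Main()<br /> {<br /> // two toy term-frequency vectors; d2 is just d1 scaled by 2<br /> Dictionary<string, double> d1 = new Dictionary<string, double>();<br /> d1.Add("capital", 2); d1.Add("france", 1); d1.Add("paris", 3);<br /> Dictionary<string, double> d2 = new Dictionary<string, double>();<br /> d2.Add("capital", 4); d2.Add("france", 2); d2.Add("paris", 6);<br /><br /> Console.WriteLine(Idf(1000, 10)); // a term found in 10 of 1000 documents -> 7.64...<br /> Console.WriteLine(Cosine(d1, d2)); // 1.0, because the cosine ignores the totals<br /> }<br />}</span><br /><br />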
An even stronger property is that the cosine similarity does not depend on vector length:<br /><!-- MATH $s^{(\mathrm{C})} (\alpha \mathbf{x}_a,\mathbf{x}_b) = s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b)$ --> <img src="http://www.lans.ece.utexas.edu/%7Estrehl/diss/img302.png" alt="$ s^{(\mathrm{C})} (\alpha \mathbf{x}_a,\mathbf{x}_b) = s^{(\mathrm{C})} (\mathbf{x}_a,\mathbf{x}_b)$" width="222" align="middle" border="0" height="41" /> for <!-- MATH $\alpha > 0$ --> <img src="http://www.lans.ece.utexas.edu/%7Estrehl/diss/img303.png" alt="$\alpha > 0$" width="51" align="middle" border="0" height="33" />. This allows documents with the same composition but different totals to be treated identically, which makes this the most popular measure for text documents. Also, due to this property, samples can be normalized to the unit sphere for more efficient processing.<br /><br /><span style="font-weight: bold;">References<br /></span><a href="http://www-csli.stanford.edu/%7Ehinrich/information-retrieval-book.html">Introduction to Information Retrieval</a><span style="font-weight: bold;">, </span>Ch. 5 Index Compression<span style="font-weight: bold;"><br /></span>Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-33943643206577285022009-01-30T23:20:00.000-08:002009-02-03T00:30:05.140-08:00Information Retrieval - Document PreprocessingIn this post I will touch briefly on document preprocessing and indexing concepts related to IR.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Document Preprocessing</span></span><br />The content of a webpage read by the crawler has to be converted into tokens before an index can be created for the keywords. This is challenging because at this step we have to deal with various formatting and encoding issues. For example, what delimiter will we use to separate different words? Tokenizing on white space works fine for tokens like "Paul's", "555-1212" or "can't". But what about "Willow Dr.", "Dr. Willow", "New York", or "ad hoc"? A white-space delimiter will split each of them into separate tokens, which is not desirable.<br /><br /><span style="font-weight: bold;">Tokenization </span>becomes even more challenging when you consider tokenizing a string like “The New York-Los Angeles flight”. Other languages such as German or Korean have even more complex tokenization issues.<br /><br /><span style="font-weight: bold;">Normalization</span><br />We need a systematic way of ensuring that the index structure is suitable for general-purpose querying. For instance, "new york", "New York" and "NEW YORK" shouldn't be treated differently. There are various ways to achieve this:<br /><br /><ul><li>Casing (cat vs. CAT)</li><li>Stemming (computer, computation)</li><li>Soundex</li><li>Spell checker or synonyms (Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi)</li><li>Index reduction</li><li>Dropping stop words (“and”, “of”, “to”) – problematic for sentences like “to be or not to be”</li></ul><span style="font-weight: bold;font-size:100%;" >Porter’s algorithm</span><br />Porter's algorithm can be used for stemming – the process of reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form (Wikipedia).<br />The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. 
They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially, one after the other, like commands in a program.<br /><br />The complete description of Porter's algorithm can be found <a href="http://tartarus.org/%7Emartin/PorterStemmer/def.txt">here</a>.<br /><br />Example: the word “duplicatable” will be reduced to "duplic" by Porter's algorithm.<br /><br /><span style="font-weight: bold;">The Soundex algorithm (Odell and Russell)</span><br />This algorithm is a form of phonetic hashing used for spelling correction of names; the encoding is non-recoverable (the original name cannot be reconstructed from its code).<br /><br />The algorithm is as follows:<br /><br />1. Retain the first letter of the name, and drop all occurrences of a,e,h,i,o,u,w,y in other positions<br />2. Assign the following numbers to the remaining letters after the first:<br />b,f,p,v : 1<br />c,g,j,k,q,s,x,z : 2<br />d,t : 3<br />l : 4<br />m,n : 5<br />r : 6<br /><br />3. If two or more letters with the same code were adjacent in the original name, omit all but the first<br />4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits<br />Examples:<br />Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300<br />(the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively)<br />Some problem cases: Rogers and Rodgers, Sinclair and StClair<br /><br /><span style="font-weight: bold;">Storage issues</span><br />As mentioned in the last post, a medium-sized collection with m=50,000 terms and n=3,000,000 documents results in a very large term-document matrix. We have to come up with a better data structure. The inverted index is the basic structure that resolves the issue.<br /><br /><span style="font-weight: bold;">Inverted index</span><br />Instead of an incidence vector, use a posting table. We can use a Dictionary object in which each term is the key and the IDs of the documents containing that term are the values. For example,<br /><br />CLEVELAND: D1, D2, D6<br />OHIO: D1, D5, D6, D7<br /><br />We use linked lists to be able to insert new document postings in order and to remove existing postings. We need to keep everything sorted! This gives us a logarithmic improvement in access.<br /><br /><span style="font-weight: bold;">Basic operations on inverted indexes</span><br />Conjunction (AND) – iterative merge of the two postings: O(x+y)<br />Disjunction (OR) – very similar<br />Negation (NOT)<br /><br />Example: MICHIGAN AND NOT OHIO<br />Example: MICHIGAN OR NOT OHIO<br /><br />When several postings have to be combined, we should start with the smallest sets for better optimization.
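<br /><br />To make the posting-list idea concrete, here is a minimal C# sketch of my own (the class and method names are purely illustrative, and a List stands in for the linked list mentioned above). It builds the toy CLEVELAND/OHIO index and intersects the two sorted posting lists with the O(x+y) merge:<br /><br /><span style="font-style: italic;">using System;<br />using System.Collections.Generic;<br /><br />class InvertedIndexDemo<br />{<br /> // term -> sorted list of document IDs (the posting list)<br /> static Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();<br /><br /> static void AddPosting(string term, int docId)<br /> {<br /> if (!index.ContainsKey(term))<br /> index[term] = new List<int>();<br /> index[term].Add(docId); // document IDs are assumed to arrive in increasing order<br /> }<br /><br /> // Conjunction (AND): iterative merge of two sorted posting lists, O(x+y)<br /> static List<int> And(List<int> a, List<int> b)<br /> {<br /> List<int> result = new List<int>();<br /> int i = 0, j = 0;<br /> while (i < a.Count && j < b.Count)<br /> {<br /> if (a[i] == b[j]) { result.Add(a[i]); i++; j++; }<br /> else if (a[i] < b[j]) i++;<br /> else j++;<br /> }<br /> return result;<br /> }<br /><br /> static void Main()<br /> {<br /> // the toy postings from above: CLEVELAND: D1, D2, D6 and OHIO: D1, D5, D6, D7<br /> foreach (int d in new int[] { 1, 2, 6 }) AddPosting("CLEVELAND", d);<br /> foreach (int d in new int[] { 1, 5, 6, 7 }) AddPosting("OHIO", d);<br /><br /> // CLEVELAND AND OHIO -> D1 D6<br /> foreach (int d in And(index["CLEVELAND"], index["OHIO"]))<br /> Console.Write("D" + d + " ");<br /> }<br />}</span><br /><br />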
That sums up the second post on Information Retrieval. In the next post, I will discuss word distribution issues in IR.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com17tag:blogger.com,1999:blog-5940921418163827691.post-88840443408186670482009-01-18T20:38:00.000-08:002009-02-03T00:28:08.410-08:00Information Retrieval - IntroductionThis is the first in the series of blog posts related to Information Retrieval.<br /><br /><span style="font-size:130%;">Information Retrieval</span><br />So what is Information Retrieval (IR)? IR can be defined as the indexing and retrieval of textual documents. It can be more thoroughly defined as follows:<br /><br />"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." [1]<br /><br /><span style="font-size:130%;">What does it take to build a search engine?</span><br />The task of search engines looks trivial. All that search engines (Google, Yahoo, Live.com) have to do is crawl the web, index the web pages, read the text and make the links to the web pages available upon the user's request.<br /><br />But if you ponder over it, this simple task breaks down into complex subtasks, which are as follows:<br /><ul><li>Decide what to index</li><li>Collect it</li><li>Index it (efficiently)</li><li>Keep the index up to date</li><li>Provide user-friendly query facilities</li></ul>A lot of research is done on each of these subtasks. But that is not enough. There are other things which need to be dealt with to create an 'efficient' search engine. Some of them are:<br /><ul><li>Understand the structure of the web for efficient crawling</li><li>Understand user information needs</li><li>Preprocess text and other unstructured data</li><li>Cluster data</li><li>Classify data</li><li>Evaluate performance</li></ul><span style="font-size:130%;">Document representations</span><br />Search engines display relevant links to webpages (documents) based on the keywords typed by the user. The documents retrieved usually (though not necessarily) contain those keywords. In order to carry out the task of finding keywords in the documents, there must be some sort of data structure that holds the word-document relationship. One way to achieve this is to have a Term-document matrix (m x n).<br /><br />Another important data structure that comes in handy is a Document-document matrix (n x n). However, space limitations and time complexity have to be taken into consideration before implementing these matrices.<br /><br />However, these matrices don't scale very well. A typical medium-sized collection has something like 3,000,000 documents (n) with 50,000 terms (m). So you can imagine how big a (m x n) matrix can be. In the case of web documents you will typically have n=30,000,000,000 and m=1,000,000. Creating a (m x n) matrix and then using it could send the time complexity sky high.<br /><br /><span style="font-size:130%;">Boolean vs. integer-valued matrices</span><br />There is also the question of whether you should use boolean entries (a term is either present or absent in a document) or integer-valued entries (term counts) to represent the document-word relationship in the matrix.<br /><br /><span style="font-size:130%;">Major IR models</span><br />Assuming that the term-document relationship has been efficiently implemented, the next task is to come up with retrieval models that will be used to query the matrices and retrieve the most relevant documents. These models are known as IR models. Some of the most common models are as follows:<br /><ul><li>Boolean</li><li>Vector</li><li>Probabilistic</li><li>Language modeling</li><li>Fuzzy retrieval</li><li>Latent semantic indexing</li></ul><span style="font-size:130%;">Boolean queries</span><br />The Boolean IR model is perhaps one of the simplest to implement. It uses boolean operators: AND, OR, NOT, parentheses, etc. 
For example, typical search queries will be something like:<br />CLEVELAND AND NOT OHIO<br />(MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)<br /><br /><span style="font-size:130%;">Vector queries</span><br />Each document is represented as a vector. Non-binary weights allow for partial matches. These term weights are used to compute a degree of similarity between a query and each document. A ranked set of documents provides for better matching.<br /><br /><span style="font-size:130%;">Probabilistic Model</span><br />Captures the IR problem using a probabilistic framework and improves its estimates by iteration.<br /><br /><span style="font-size:130%;">The matching process</span><br />Matching is done between a document and a query (or between two documents). Distance or similarity measures are used to determine which set of documents satisfies a given query. The common measures used are: Euclidean distance, Manhattan distance, word overlap, the Jaccard coefficient, etc.
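<br /><br />As a quick illustration of the set-based measures (a sketch of my own, with made-up terms), the word overlap is simply the number of terms a query and a document share, and the Jaccard coefficient divides that overlap by the size of the union of the two term sets:<br /><br /><span style="font-style: italic;">using System;<br />using System.Collections.Generic;<br /><br />class MatchingDemo<br />{<br /> // Jaccard coefficient: size of the intersection divided by size of the union<br /> static double Jaccard(HashSet<string> a, HashSet<string> b)<br /> {<br /> int overlap = 0; // the "word overlap" measure<br /> foreach (string term in a)<br /> if (b.Contains(term)) overlap++;<br /> int union = a.Count + b.Count - overlap;<br /> return union == 0 ? 0.0 : (double)overlap / union;<br /> }<br /><br /> static void Main()<br /> {<br /> HashSet<string> query = new HashSet<string>(new string[] { "capital", "of", "france" });<br /> HashSet<string> doc = new HashSet<string>(new string[] { "paris", "is", "the", "capital", "of", "france" });<br /><br /> // 3 shared terms out of 6 distinct terms -> 0.5<br /> Console.WriteLine(Jaccard(query, doc));<br /> }<br />}</span><br /><br />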
Well, this sums up the introductory post on IR. In the next post we will look into more details of some of the components of IR mentioned in this post, especially indexing. So stay tuned.<br /><br /><span style="font-size:130%;">References</span><br />[1] http://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.htmlSyed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com2tag:blogger.com,1999:blog-5940921418163827691.post-21703495566739538082009-01-18T19:20:00.000-08:002009-01-24T22:26:05.003-08:00Information Retrieval and Web MiningLast semester I had to select a research topic for my <a href="http://raza-rizvi.blogspot.com/2008/09/new-semester-new-challenges.html">Computer Seminar</a> course. The topic I selected was "Personalized Search". I have always been interested in Information Retrieval and hence thought that it was a perfect opportunity to learn more about this field.<br /><br />Even before I could start doing research on personalization, I had to do a lot of reading and research related to information retrieval and web mining. In the process, I learned a lot about IR. However, I hardly got the chance to practice what I learned. Hence, I have decided to blog about what I learned related to IR every week... well, almost every week.<br /><br />So why blog about this? Well, this will help me revise what I learned and maybe learn something new.<br /><br />I hope that other people will find it helpful as well.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com2tag:blogger.com,1999:blog-5940921418163827691.post-42718837966645629002009-01-04T14:19:00.000-08:002009-01-04T14:34:13.093-08:00Simple IsNum() function in C#I desperately needed a C# equivalent of "IsNum". I was surprised that there isn't one, so I decided to write my own. There are many ways to write this function. However, my task required that I should be able to verify any type of number (int, double, etc.) and get a True/False return value. Below is the code that I am using to solve this problem.<br /><br />public static bool IsNumeric(object Expression)<br /> {<br /><br /> // Variable to collect the return value of the TryParse method.<br /><br /> bool isNum;<br /><br /><br /> // Variable to collect the out parameter of the TryParse method. If the conversion fails, the out parameter is zero.<br /><br /> double retNum;<br /><br /><br /> // The TryParse method converts a string in a specified style and culture-specific format to its double-precision floating point number equivalent.<br /><br /> // The TryParse method does not generate an exception if the conversion fails. If the conversion passes, True is returned. If it does not, False is returned. Hence there is no need for a try..catch block.<br /><br /> isNum = Double.TryParse(Convert.ToString(Expression), System.Globalization.NumberStyles.Any, System.Globalization.NumberFormatInfo.InvariantInfo, out retNum);<br /><br /> return isNum;<br /><br /> }Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com0tag:blogger.com,1999:blog-5940921418163827691.post-75917332346960135432008-11-15T00:45:00.000-08:002008-11-27T00:21:46.047-08:00Sorting Dictionary object in .NetA few days ago I came across a dictionary-sorting problem at work. The easiest way to have a sorted dictionary in .Net is to use the SortedDictionary(TKey, TValue) class. Every time you add a KeyValuePair to a SortedDictionary, it is kept sorted automatically.<br /><br />But I had to sort an existing Dictionary(TKey, TValue) object that was provided by some other program. Initially, I was hoping that there is some Dictionary.Sort() function available that I could use, but there isn't.<br /><br />After some research I found that you can use the List.Sort() function to sort the dictionary by keys. But that is not all. You also need to pass a delegate to the List.Sort() function to carry out the comparison.<br /><br />Here is the code.<span style="font-style:italic;"><br /><br />//Takes a Dictionary<double,int> item, sorts it by key and returns a new Dictionary<double,int> item<br /><br />private static Dictionary<double,int> Sort(Dictionary<double,int> Dict)<br />{<br />//Create a List object from the Dictionary item<br />List<KeyValuePair<double,int>> result = new List<KeyValuePair<double,int>>(Dict);<br /><br /><br />//Use a delegate in the List.Sort function to compare the keys for sorting<br />result.Sort(<br /> delegate(<br /> KeyValuePair<double, int> first,<br /> KeyValuePair<double, int> second)<br /> {<br /> return first.Key.CompareTo(second.Key);<br /> }<br /> );<br /><br />//Add the List items into a new dictionary item<br />Dictionary<double,int> sortedDict = new Dictionary<double,int>(); <br /><br />foreach (KeyValuePair<double, int> kvp in result)<br /> {<br /> <br /> double key = kvp.Key;<br /> int value = kvp.Value; <br /><br /> sortedDict.Add(key, value);<br /> }<br /><br />return sortedDict;<br />}</span>
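<br /><br />For what it's worth, here is a quick usage sketch (the sample values are made up, and it assumes the Sort method above is reachable from the calling code). Strictly speaking, a plain Dictionary does not guarantee enumeration order, so when the order really matters the SortedDictionary mentioned at the top is the safer container:<br /><br /><span style="font-style:italic;">//Example usage (sample values are made up)<br />Dictionary<double, int> prices = new Dictionary<double, int>();<br />prices.Add(9.99, 3);<br />prices.Add(1.50, 7);<br />prices.Add(4.25, 1);<br /><br />Dictionary<double, int> sorted = Sort(prices);<br /><br />foreach (KeyValuePair<double, int> kvp in sorted)<br /> Console.WriteLine(kvp.Key + " : " + kvp.Value);<br />//prints 1.5 : 7, then 4.25 : 1, then 9.99 : 3<br /><br />//A shorter alternative with LINQ (needs a using System.Linq directive):<br />//var sortedDict2 = prices.OrderBy(kvp => kvp.Key).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);</span><br /><br />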
If there is a better way to achieve this, please let me know.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com17tag:blogger.com,1999:blog-5940921418163827691.post-76144103956111767472008-09-09T16:06:00.000-07:002008-11-14T23:47:38.090-08:00New Semester, New ChallengesThe <a href="http://rizvipost.blogspot.com/">stint at ESRI</a> ended this August. The Fall '08 semester has started and I have already started to wish that humans didn't have to sleep. Distributed Computing and Information Retrieval will be my focus of study this fall. While I like the Distributed Computing course and the technology that we will use, <a href="http://msdn.microsoft.com/en-us/netframework/aa663324.aspx">WCF</a>, I am more excited about IR. Hope this will be an amazing learning experience.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com4tag:blogger.com,1999:blog-5940921418163827691.post-64542059829874419002008-04-29T13:25:00.000-07:002008-04-29T13:49:07.202-07:00Internship @ ESRII have been hunting for internships for a while. Due to the course load and other mundane activities I started my hunt a little late, and hence was always under the impression that it was probably too late. But some companies contacted me, and finally I got a full-time summer internship offer from <a href="http://www.esri.com/">ESRI</a>. I am really excited about that and I hope that it will be a rewarding learning experience.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-57632110137447411782008-02-17T14:27:00.000-08:002008-05-01T12:09:15.221-07:00IR or Spam FilterI haven't been updating my blog lately. The commencement of the spring 2008 semester has something to do with that, but the real reason is that I have been busy researching some interesting topics for my AI project.<br /><br />Last semester I wanted to create an Image Spam Filter for my "Data Mining & Pattern Recognition" class. My theory was to apply OCR techniques to capture the text from the image, after which it really becomes a text spam-filter problem. I was thinking of using Neural Networks for the image text recognition and Bayes' Theorem for the spam filtering. But my idea was unanimously rejected by my group and instead we developed a Handwriting Recognition system (which was still a better project to work on than our previous plan to tell time by reading an image of an analogue clock).<br /><br />So now I have a chance to have another go at my Image Spam Filter project. But I am still debating about it. The reason is that I am also fond of Information Retrieval problems. I am thinking of working on a <a href="http://www.cs.chalmers.se/Cs/Grundutb/Kurser/ai/AIWww/project.html"><em>Search engine and automatic indexing of a technical book/manual</em>.</a> Since this is an individual project I can do whatever I want (of course my professor has to approve it).<br /><br />If I think about it, I find both IR and the Spam Filter interesting. So it is not a clash of interests but more a question of what I will gain from each and what I want to do in the future.<br /><br />While I try to analyze this, feel free to give me your suggestions. Maybe you can give me an insight which may eventually help me reach a decision.Syed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-22736090019287908162007-12-24T23:07:00.000-08:002007-12-24T23:28:51.620-08:00Single-row functions in ORACLEOracle 9i has many old and new single-row functions that are very useful. Some of them that I find very interesting are:<br /><br /><br /><br /><strong>NVL(x1, x2)</strong><br />NULLs are of great importance and I am pretty sure that I will get at least one question on NULLs in the <a href="http://raza-rizvi.blogspot.com/2007/12/oracle-9i-certification.html">test</a>. One of the ways that NULL values are handled in ORACLE is by using the function NVL(x1,x2). This function takes 2 values, x1 and x2, and returns x2 if x1 is NULL, else it returns x1. 
E.g., the following query will return the employee name and salary+bonus for each employee:<br /><br />SELECT ename, salary + NVL(bonus,0) FROM emp<br /><br /><strong>NVL2(x1,x2,x3)</strong><br />NVL2 is also used to handle NULL values. It takes 3 values (x1,x2,x3) and returns x3 if x1 is NULL and x2 if x1 is not NULL. So we can rewrite the above query as follows:<br /><br />SELECT ename, NVL2(bonus, salary+bonus, salary) FROM emp<br /><br /><strong>DECODE</strong><br />DECODE is used to incorporate IF..THEN..ELSE functionality in Oracle. E.g., the following query will display the employee name and the department name:<br /><br />SELECT ename,<br />DECODE(deptno<br />, 10,'Accounting'<br />,20,'Administration'<br />,'Other')<br />FROM emp<br /><br />- RazaSyed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1tag:blogger.com,1999:blog-5940921418163827691.post-60554411512590962972007-12-24T11:07:00.000-08:002007-12-24T11:44:15.500-08:00Datatype and OperatorsOracle has some interesting data types and operators. I don't use all of them while writing SQL queries but nevertheless they are helpful in database programming. Here are some of them:<br /><br /><strong>Number(p,s)</strong><br />'Number' is used for any numeric datatype. The <em>p</em> is for precision and <em>s</em> is for scale. <em>p</em> has a range from 1 to 38, while <em>s</em> can take values from -84 to 127. For example, Number(4,2) can take values from -99.99 to 99.99. At first this datatype looks simple, but it gets interesting when different combinations of <em>p</em> and <em>s</em> are used. Here are some of them:<br /><br />If <em>s</em> is greater than <em>p</em>, then the precision defines the maximum number of significant digits allowed to the right of the decimal point, after (s-p) leading zeros. For example, Number(3,5) can take values from -0.00999 to 0.00999. Any number out of this range will generate an error; e.g., 0.999 or 0.01 will raise an exception.<br /><br />If <em>s</em> is negative, the value is rounded off to the left of the decimal point; with s = -2 it is rounded to the nearest hundred. So if the datatype is Number(4,-2) and the value is 12345.5, it will be stored as 12300. However, a value such as 1234567.5 will raise an error because, even after rounding, it needs more than p significant digits – the largest value Number(4,-2) can store is 999900.<br /><br /><strong>Operators<br /></strong>I have used many operators in SQL queries but some of them that I don't use often are as follows:<br /><br />ANY (or SOME) is used to compare a value to each value in a list or subquery. It must always be preceded by a comparison operator:<br /><br />SELECT empno, ename FROM emp<br />WHERE deptno <=ANY (10, 20, 25);<br /><br />or,<br /><br />SELECT ename, salary FROM emp<br />WHERE salary >=ANY (SELECT salary FROM emp<br />WHERE deptno=10);<br /><br /><br />The ALL operator is used to compare a value to every value in a list or subquery. It must also be preceded by a comparison operator.<br /><br />SELECT empno FROM emp<br />WHERE salary > ALL(SELECT salary FROM emp<br />WHERE deptno=10);<br /><br />NOTE: '=ANY' is equivalent to IN and '!=ALL' is equivalent to NOT IN<br /><br />EXISTS is also an interesting operator that I don't remember using in my academic or professional life, but it can be very helpful under some circumstances. EXISTS is always followed by a subquery and evaluates to TRUE if the subquery returns at least one row.<br /><br />SELECT e.empno, e.ename FROM emp e<br />WHERE EXISTS ( SELECT 1 FROM dept d<br />WHERE d.deptno= e.deptno<br />AND d.dname='Administration');<br /><br />I hope you found this helpful. 
I'll keep adding useful things as I come across them during my study.<br /><br />- RazaSyed Rizvihttp://www.blogger.com/profile/09451934384177669115noreply@blogger.com1