Saturday, April 18, 2009

A simple web crawler

It is becoming difficult to write more about what I have been reading and researching in Information Retrieval, but IR continues to be my area of interest. There is a ton of literature out there, and each research paper I read brings more insight into IR. That made me want to test the various theories and algorithms described in those papers, and to test them the first thing I needed was data (web documents, to be precise). Crawling is in itself a very important area of research. One could opt to write a very efficient crawler, but I am a full-time student who works part-time to pay his bills, and on top of that I have to start working on my dissertation as well. So there was no way I could spend much time writing a crawler just to test those theories.

Instead, I chose to write a very simple crawler, one that just makes use of the link structure of the web. Since I am not the only one who takes that approach, I thought it was a good place to start. What I really needed was to extract the <a> tags from a given web page. The following regular expression allowed me to do that:

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
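To see what the pattern captures, here is a small sketch that runs it against a hard-coded HTML fragment. The fragment, the class name RegexDemo, and the helper ExtractHrefs are made up for illustration; only the pattern itself comes from the post, with the tag name fixed to "a". The URL lands in capture group 1, so it can be read with match.Groups[1].Value instead of slicing the matched tag by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Same pattern as above, with the tag name fixed to "a".
    private static readonly Regex AnchorHref = new Regex(
        @"<a[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the raw href values in document order (hypothetical helper).
    public static List<string> ExtractHrefs(string html)
    {
        List<string> hrefs = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            // Group 0 is the whole tag; group 1 is the attribute value.
            hrefs.Add(match.Groups[1].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        string html =
            "<p><a href=\"http://example.com/one\">One</a> " +
            "<a class=\"nav\" href='/about'>About</a> " +
            "<A HREF=page.html>Page</A> " +
            "<a href=\"/skipped\" class=\"late\">Skipped</a></p>";

        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The first three links are printed and the fourth is skipped: the pattern expects the closing > right after the href value, so links that carry further attributes after href are not matched. For a quick experiment that is acceptable, but it is worth knowing when the crawler seems to miss links.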


The main function of the crawler loops through a list of predefined pages, calling the addtoindex function on each one. That function could store links and their text, but for the sake of this post it just prints the URL. The crawler then uses the regular expression above to get all the links on that page and adds their URLs to a list called newpages. At the end of the loop, newpages becomes pages, and the process repeats until the given depth is reached.
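The loop described above is a level-by-level (breadth-first) traversal. The sketch below shows only that control flow, on a tiny in-memory link graph instead of real HTTP requests. The graph, the class name FrontierDemo, and the visited set are assumptions added for illustration; the visited set in particular is not part of the crawler in this post, but without something like it the same page is fetched again every time another page links to it.

```csharp
using System;
using System.Collections.Generic;

public static class FrontierDemo
{
    // Visits pages level by level, at most `depth` levels deep,
    // and returns them in the order they were visited.
    public static List<string> Crawl(
        IDictionary<string, string[]> linkGraph, IEnumerable<string> seeds, int depth)
    {
        List<string> visitOrder = new List<string>();
        HashSet<string> visited = new HashSet<string>();
        List<string> pages = new List<string>(seeds);

        for (int i = 0; i < depth && pages.Count > 0; i++)
        {
            List<string> newPages = new List<string>();

            foreach (string page in pages)
            {
                // HashSet.Add returns false when the page was already seen.
                if (!visited.Add(page))
                {
                    continue;
                }
                visitOrder.Add(page);

                // Stand-in for "fetch the page and extract its links".
                string[] links;
                if (!linkGraph.TryGetValue(page, out links))
                {
                    continue;
                }
                foreach (string link in links)
                {
                    if (!visited.Contains(link))
                    {
                        newPages.Add(link);
                    }
                }
            }

            // The links found on this level become the pages of the next level.
            pages = newPages;
        }
        return visitOrder;
    }

    public static void Main()
    {
        Dictionary<string, string[]> graph = new Dictionary<string, string[]>
        {
            { "A", new[] { "B", "C" } },
            { "B", new[] { "C", "D" } },
            { "C", new[] { "A" } },
            { "D", new string[0] }
        };

        foreach (string page in Crawl(graph, new[] { "A" }, 3))
        {
            Console.WriteLine(page);
        }
    }
}
```

With a depth of 3 the sample graph is visited in the order A, B, C, D; with a depth of 2 the crawl stops before reaching D.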

Here is the complete code in C#:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Net;
using System.IO;

namespace Crawler
{
    class Program
    {
        private static String strText;
        static MatchCollection tagCollection;
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;
        public static string baseUrl;

        static void Main(string[] args)
        {
            //add the specific site that you want to crawl
            baseUrl = "http://www.techcrunch.com/";

            ArrayList pages = new ArrayList();
            pages.Add(baseUrl);

            //start crawling
            crawl(pages, 20);

            Console.WriteLine("\nIndexing Complete!!");
            Console.ReadLine();
        }

        public static void crawl(ArrayList pages, int depth)
        {
            MatchCollection mc;
            ArrayList links = new ArrayList();

            //Breadth-first search algorithm to crawl the pages and collect links
            for (int i = 0; i < depth; i++)
            {
                ArrayList newpages = new ArrayList();

                foreach (String page in pages)
                {
                    string pagecontent;
                    try
                    {
                        //skip pages that cannot be fetched; isValidUrl prints the reason
                        if (!isValidUrl(page))
                        {
                            continue;
                        }
                        urlOpen();
                        pagecontent = read();
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine("Could not open {0} because {1}", page, ex.ToString());
                        continue;
                    }

                    //adding the page in the index
                    addtoindex(page, pagecontent);

                    mc = tagList(pagecontent, "a");
                    if (mc == null)
                    {
                        continue;
                    }

                    //relative links are resolved against the page they were found on
                    links = getAttributeValue(mc, "href", page);

                    foreach (string link in links)
                    {
                        //drop the #fragment so that anchors inside a page are not crawled as separate pages
                        string url = link;
                        int hash = url.IndexOf('#');
                        if (hash >= 0)
                        {
                            url = url.Substring(0, hash);
                        }

                        //only follow http(s) links that are neither indexed nor already queued
                        if (url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                            && !isindexed(url)
                            && !newpages.Contains(url))
                        {
                            newpages.Add(url);
                        }
                    }
                }
                pages = newpages;
            }
        }

        //Returns false for now, but can be modified to query a database to check whether a page has already been indexed
        public static bool isindexed(string url)
        {
            return false;
        }

        //Add page to the index, this is where a database or file system can be used
        public static void addtoindex(string url, string pagecontent)
        {
            Console.WriteLine("Indexing : " + url);
        }

        //Get the collection of <a> tags in a page
        public static MatchCollection tagList(String HTMLcontent, String tag)
        {
            Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            try
            {
                tagCollection = extractTags.Matches(HTMLcontent);
                return tagCollection;
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
            return null;
        }

        //Gets the HREF value from each <a> tag and turns it into an absolute URL.
        //Attr is not used yet: the regular expression in tagList only captures href.
        public static ArrayList getAttributeValue(MatchCollection mc, String Attr, string url)
        {
            ArrayList links = new ArrayList();
            Uri pageUri = new Uri(url);

            foreach (Match match in mc)
            {
                //capture group 1 of the regular expression holds the attribute value
                string href = match.Groups[1].Value;

                try
                {
                    //absolute links are kept as they are, relative ones are resolved against url
                    links.Add(new Uri(pageUri, href).AbsoluteUri);
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error in GetAttributes :" + ex.Message + " - " + href);
                }
            }
            return links;
        }

        //Strips the HTML tags from a page and returns the remaining text;
        //addtoindex can use it once the text of a page is actually stored
        public static string gettextonly(string pagecontent)
        {
            string pattern = @"<(.|\n)*?>";
            return Regex.Replace(pagecontent, pattern, String.Empty);
        }

        //Reads the body of the response opened by isValidUrl and urlOpen, then releases the connection
        public static String read()
        {
            using (StreamReader sr = new StreamReader(resStream))
            {
                strText = sr.ReadToEnd();
            }
            res.Close();
            return strText;
        }

        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        //Requests the page and keeps the response open for urlOpen and read when the status is 200 OK
        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)WebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                if (res.StatusCode == HttpStatusCode.OK)
                {
                    return true;
                }
                res.Close();
                return false;
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error in ISValidURL " + ex.Message + " - " + url);
                return false;
            }
        }
    }
}
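The isindexed function above is a stub that always returns false, so nothing stops the crawler from indexing the same page on every level. As a rough sketch of what a first non-database version might look like, the class below keeps the indexed URLs in an in-memory HashSet and normalizes each URL before it is stored or looked up. The class name SeenUrls and its three methods are hypothetical and would still need to be wired into isindexed and addtoindex; the Uri calls are standard .NET.

```csharp
using System;
using System.Collections.Generic;

public static class SeenUrls
{
    // URLs that have already been indexed, stored in normalized form.
    private static readonly HashSet<string> seen =
        new HashSet<string>(StringComparer.Ordinal);

    // Resolves href against the page it was found on and drops the #fragment.
    // pageUrl is assumed to be a valid absolute URL.
    // Returns null for anything that is not an http or https link.
    public static string Normalize(string pageUrl, string href)
    {
        Uri absolute;
        if (!Uri.TryCreate(new Uri(pageUrl), href, out absolute))
        {
            return null;
        }
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }
        // Scheme, host, path and query, without the fragment.
        return absolute.GetLeftPart(UriPartial.Query);
    }

    public static bool IsIndexed(string url)
    {
        return seen.Contains(url);
    }

    public static void MarkIndexed(string url)
    {
        seen.Add(url);
    }

    public static void Main()
    {
        string url = Normalize("http://example.com/a/b.html", "../c.html#top");
        Console.WriteLine(url);
        Console.WriteLine(IsIndexed(url));
        MarkIndexed(url);
        Console.WriteLine(IsIndexed(url));
    }
}
```

An in-memory set is lost when the program exits and grows with every page, which is why a database or a file is the natural next step for larger crawls.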


Any suggestions are welcome.
