Saturday, April 18, 2009

A simple web crawler

It is becoming difficult to write more about what I have been reading and researching in Information Retrieval, but IR continues to be my area of interest. There is a ton of literature out there, and each research paper that I read brings more insight into IR. That made me want to test the various theories and algorithms mentioned in those papers, and to test them, the first thing I needed was data (web documents, to be precise). Now, crawling is in itself a very important area of research. One could opt to write a very efficient crawler, but I am a full-time student who works part-time to pay his bills, and on top of that I have to start working on my dissertation as well. So there was no way I could spend much time writing a crawler just to test those theories.

Instead, I chose to write a very simple crawler. A simple crawler can just make use of the link structure of the web, and since I am not the only one who thinks or does that, I thought it was a good place to start. What I really needed was to extract the <a> tags from a given web page. The following regular expression allowed me to do that:

Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
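To see what this expression captures, here is a small sketch that runs it against a hand-written HTML snippet. The class name RegexDemo, the helper ExtractHrefs and the sample HTML are made up for illustration; only the pattern itself comes from this post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RegexDemo
{
    // Hypothetical helper: returns the captured href value (group 1) of every match
    public static List<string> ExtractHrefs(string html, string tag)
    {
        Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase);
        List<string> hrefs = new List<string>();
        foreach (Match m in extractTags.Matches(html))
        {
            hrefs.Add(m.Groups[1].Value);
        }
        return hrefs;
    }

    static void Main()
    {
        // Made-up sample: two links with href as the last attribute, one with a trailing class attribute
        string html = "<p><a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A> " +
                      "<a href=\"/c\" class=\"nav\">C</a></p>";

        foreach (string href in ExtractHrefs(html, "a"))
        {
            Console.WriteLine(href);
        }
    }
}
```

On this sample the first two links are captured and the third is skipped: the pattern expects the closing > right after the href value, so a tag that has another attribute after href does not match. That is worth keeping in mind when the crawler appears to miss links.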


The main function of the crawler that I wrote loops through the list of predefined pages, calling the addtoindex function on each one. That function can be used to store links and their text, but for the sake of this post it just prints the URL. The crawler then uses the regular expression mentioned above to get all the links on that page and adds their URLs to a list called newpages. At the end of the loop, newpages becomes pages, and the process repeats until the given depth is reached.
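The pages/newpages loop is a level-by-level breadth-first traversal. The sketch below shows just that loop, assuming a small in-memory dictionary in place of real page downloads so that it runs without a network connection. The class name BfsSketch, the linkGraph dictionary and the seen set are hypothetical and are not part of the crawler in this post; the seen set is one possible way to avoid queueing the same URL twice.

```csharp
using System;
using System.Collections.Generic;

class BfsSketch
{
    // Hypothetical stand-in for "download the page and extract its links"
    static readonly Dictionary<string, string[]> linkGraph = new Dictionary<string, string[]>
    {
        { "http://example.com/",  new[] { "http://example.com/a", "http://example.com/b" } },
        { "http://example.com/a", new[] { "http://example.com/b", "http://example.com/" } },
        { "http://example.com/b", new[] { "http://example.com/c" } },
        { "http://example.com/c", new string[0] }
    };

    // Returns the pages in the order in which they would be indexed
    public static List<string> Crawl(string start, int depth)
    {
        List<string> indexed = new List<string>();
        HashSet<string> seen = new HashSet<string> { start };
        List<string> pages = new List<string> { start };

        for (int i = 0; i < depth && pages.Count > 0; i++)
        {
            List<string> newpages = new List<string>();
            foreach (string page in pages)
            {
                indexed.Add(page); // this is where addtoindex would be called

                string[] links;
                if (!linkGraph.TryGetValue(page, out links))
                {
                    continue;
                }
                foreach (string link in links)
                {
                    // HashSet.Add returns false when the URL was already seen
                    if (seen.Add(link))
                    {
                        newpages.Add(link);
                    }
                }
            }
            // The links found in this pass become the pages of the next pass
            pages = newpages;
        }
        return indexed;
    }

    static void Main()
    {
        foreach (string url in Crawl("http://example.com/", 3))
        {
            Console.WriteLine("Indexing : " + url);
        }
    }
}
```

With a depth of 3 this visits the start page, then the two pages it links to, then the one page reachable only from those, each exactly once.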

Here is the complete code in C#:


using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace Crawler
{
    class Program
    {
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;
        public static string baseUrl;

        static void Main(string[] args)
        {
            // Add the specific site that you want to crawl
            baseUrl = "http://www.techcrunch.com/";

            ArrayList pages = new ArrayList();
            pages.Add(baseUrl);

            // Start crawling
            crawl(pages, 20);

            Console.WriteLine("\nIndexing Complete!!");
            Console.ReadLine();
        }

        public static void crawl(ArrayList pages, int depth)
        {
            // Breadth-first search: each pass visits the current pages and collects their links
            for (int i = 0; i < depth; i++)
            {
                ArrayList newpages = new ArrayList();

                foreach (String page in pages)
                {
                    // Skip pages that cannot be fetched; isValidUrl prints the reason
                    if (!isValidUrl(page))
                    {
                        continue;
                    }

                    string pagecontent;
                    try
                    {
                        urlOpen();
                        pagecontent = read();
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine("Could not open {0} because {1}", page, ex.ToString());
                        continue;
                    }

                    // Adding the page to the index
                    addtoindex(page, pagecontent);

                    MatchCollection mc = tagList(pagecontent, "a");
                    ArrayList links = getAttributeValue(mc, "href", baseUrl);

                    foreach (string url in links)
                    {
                        // Only follow http(s) links that have not been indexed yet
                        if (url.StartsWith("http") && !isindexed(url))
                        {
                            newpages.Add(url);
                        }
                    }
                }

                // The links found in this pass become the pages of the next pass
                pages = newpages;
            }
        }

        // Returns false for now, but can be modified to query a database to check whether a page has already been indexed
        public static bool isindexed(string url)
        {
            return false;
        }

        // Add page to the index, this is where a database or file system can be used
        public static void addtoindex(string url, string pagecontent)
        {
            Console.WriteLine("Indexing : " + url);
        }

        // Get the collection of <a> tags in a page
        public static MatchCollection tagList(String HTMLcontent, String tag)
        {
            Regex extractTags = new Regex(@"<" + tag + @"[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            return extractTags.Matches(HTMLcontent);
        }

        // Gets the HREF value from each matched tag and makes relative links absolute
        public static ArrayList getAttributeValue(MatchCollection mc, String Attr, string url)
        {
            ArrayList links = new ArrayList();

            // Strings are immutable, so the cleaned-up base URL has to be assigned back
            url = url.Replace("\n", "").Replace("\r", "");

            foreach (Match match in mc)
            {
                try
                {
                    // Group 1 of the regular expression in tagList holds the attribute value
                    string link = match.Groups[1].Value;

                    if (link.Contains("://"))
                    {
                        // Already an absolute link
                        links.Add(link);
                    }
                    else if (link[0] != '/' && url[url.Length - 1] != '/')
                    {
                        links.Add(url + "/" + link);
                    }
                    else
                    {
                        links.Add(url + link);
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error in GetAttributes :" + ex.Message + " - " + url);
                }
            }
            return links;
        }

        // Strips the HTML tags and returns only the text of a page
        public static string gettextonly(string pagecontent)
        {
            string pattern = @"<(.|\n)*?>";
            return Regex.Replace(pagecontent, pattern, String.Empty);
        }

        // Reads the content of the page opened by urlOpen
        public static String read()
        {
            string text;

            // Disposing the reader also closes the response stream
            using (StreamReader sr = new StreamReader(resStream))
            {
                text = sr.ReadToEnd();
            }
            res.Close();
            return text;
        }

        // Opens the response stream of the last request made by isValidUrl
        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        // Requests the URL and returns true only if the server answers with 200 OK
        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)HttpWebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                if (res.StatusCode == HttpStatusCode.OK)
                {
                    return true;
                }

                // Release the connection of a response that will not be read
                res.Close();
                return false;
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error in ISValidURL " + ex.Message + " - " + url);
                return false;
            }
        }
    }
}
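The getAttributeValue function above turns relative links into absolute ones by gluing them onto the base URL of the site. One possible alternative, sketched here as a suggestion rather than as part of the crawler, is to let System.Uri resolve each link against the URL of the page it was found on. The class name ResolveSketch, the Resolve helper and the example URLs are made up for illustration.

```csharp
using System;

class ResolveSketch
{
    // Returns the absolute form of href relative to pageUrl, or null if it cannot be parsed
    public static string Resolve(string pageUrl, string href)
    {
        Uri result;
        if (Uri.TryCreate(new Uri(pageUrl), href, out result))
        {
            return result.AbsoluteUri;
        }
        return null;
    }

    static void Main()
    {
        string page = "http://example.com/blog/post.html";

        // A link relative to the current directory, one relative to the site root,
        // and one that is already absolute
        Console.WriteLine(Resolve(page, "about.html"));
        Console.WriteLine(Resolve(page, "/contact"));
        Console.WriteLine(Resolve(page, "http://other.example.org/"));
    }
}
```

Resolving against the page URL instead of the site root matters for links such as about.html, which point into the directory of the page that contains them.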


Any suggestions are welcome.
