Friday, January 29, 2010

Parsing Robots.txt File

Crawling is an essential part of search engine development. The more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages by listing those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, the robots.txt file for CNN is served at http://www.cnn.com/robots.txt.

While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.

For complete details about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is an excerpt from that page:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
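The User-agent rules quoted above can be sketched in a few lines of C#. This is only one reading of "case insensitive substring match": it assumes the value in the record should appear somewhere inside our robot's name. The class and method names here (AgentMatcher, IsDefault, MatchesByName) and the robot name "MyCrawler" are made up for illustration.

```csharp
using System;

public static class AgentMatcher
{
    // True when the record is the default one ("*"), which applies only to
    // robots that did not match any other record.
    public static bool IsDefault(string userAgentValue)
    {
        return userAgentValue.Trim() == "*";
    }

    // True when a named record applies to our robot.
    // robotName is the crawler's name without version information.
    // Assumption: the record's value must be a substring of the robot name.
    public static bool MatchesByName(string userAgentValue, string robotName)
    {
        string value = userAgentValue.Trim();
        return value.Length > 0 &&
               robotName.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

A crawler would first look for a record where MatchesByName succeeds, for example AgentMatcher.MatchesByName("mycrawler", "MyCrawler"), and fall back to the record for which IsDefault is true only when no named record matched.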

Using this description we can easily write the code to create a list of disallowed URLs. All we need to do is parse the robots.txt file: split its contents first on "User-agent:" to find the record that applies to our crawler, then split that record on "Disallow:" to get the individual paths. This gives us the list that our crawler can use. We just need to check this list every time we are about to crawl a page.
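As a rough sketch of that last step, the check before each crawl can be a simple prefix test against the list. The names DisallowCheck and IsDisallowed, and the host example.com, are hypothetical and not part of the original program; the sketch also assumes the list holds plain path prefixes such as "/help/".

```csharp
using System;
using System.Collections.Generic;

public static class DisallowCheck
{
    // Returns true when the URL's path (plus query string) starts with any
    // of the disallowed prefixes.
    public static bool IsDisallowed(string url, IEnumerable<string> disallowed)
    {
        string path = new Uri(url).PathAndQuery;
        foreach (string prefix in disallowed)
        {
            // An empty Disallow value means that everything may be retrieved,
            // so it never blocks a URL.
            if (prefix.Length > 0 &&
                path.StartsWith(prefix, StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }
}
```

With a list containing only "/help/", the call IsDisallowed("http://example.com/help/index.html", list) reports the page as blocked, while "http://example.com/help.html" is still allowed. This mirrors the /help versus /help/ example in the excerpt.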

The following C# code parses the robots.txt file and builds the list of disallowed URLs for the default ("*") record:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
    class Program
    {
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;

        static void Main(string[] args)
        {
            // Site root without a trailing slash, so that appending
            // "/robots.txt" does not produce a double slash.
            String baseUrl = "http://www.cnn.com";

            getDisallowedUrls(baseUrl);
            Console.ReadLine();
        }

        static List<String> getDisallowedUrls(string baseUrl)
        {
            // List that holds URLs which shouldn't be crawled
            List<String> disallowed = new List<String>();
            String robotsUrl = baseUrl + "/robots.txt";

            // Without a successful response there is nothing to parse.
            if (!isValidUrl(robotsUrl))
            {
                return disallowed;
            }
            urlOpen();

            String RobotTxtContent = read();

            // Split the file into records and keep the one for all robots ("*").
            String[] user_agents = Regex.Split(RobotTxtContent, "User-agent:", RegexOptions.IgnoreCase);
            String userAgents = "";
            foreach (String agent in user_agents)
            {
                if (agent.Trim().StartsWith("*"))
                {
                    userAgents = agent.Trim().Substring(1);
                }
            }

            String[] disallow = Regex.Split(userAgents, "Disallow:", RegexOptions.IgnoreCase);

            // The first piece is whatever preceded the first "Disallow:", so skip it.
            for (int i = 1; i < disallow.Length; i++)
            {
                // Keep only the rest of the Disallow line; the following lines
                // may hold other fields (Allow, Sitemap, ...) or comments.
                String path = disallow[i].Split('\n')[0].Split('#')[0].Trim();
                if (path != "")
                {
                    disallowed.Add(path);
                    Console.WriteLine(baseUrl + path);
                }
            }

            return disallowed;
        }

        public static String read()
        {
            // Dispose the reader (and the underlying response stream) when done.
            using (StreamReader sr = new StreamReader(resStream))
            {
                return sr.ReadToEnd();
            }
        }

        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)WebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                return (res.StatusCode == HttpStatusCode.OK);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Not a Valid URL:" + ex.Message + " - " + url);
                return false;
            }
        }
    }
}
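
One limitation of splitting on "User-agent:" is a record that lists several robots, for example "User-agent: *" followed by "User-agent: SomeBot" and then the Disallow lines. The split separates the "*" line from the Disallow lines that belong to it. A line-by-line pass avoids this. The following is only a sketch under that assumption, not a complete robots.txt parser; RobotsLineParser and ParseDefaultRecord are hypothetical names, and it works on a string so it can be tried without a network request.

```csharp
using System;
using System.Collections.Generic;

public static class RobotsLineParser
{
    // Collects the Disallow values of every record that names "*".
    public static List<string> ParseDefaultRecord(string content)
    {
        var disallowed = new List<string>();
        bool inAgentLines = false;   // currently inside a run of User-agent lines
        bool recordApplies = false;  // the current record lists "*"

        foreach (string rawLine in content.Split('\n'))
        {
            // Drop comments and surrounding whitespace (including '\r').
            string line = rawLine.Split('#')[0].Trim();
            int colon = line.IndexOf(':');
            if (colon < 0)
            {
                continue; // blank or unrecognised line
            }

            string field = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();

            if (field.Equals("User-agent", StringComparison.OrdinalIgnoreCase))
            {
                // A User-agent line that follows other fields starts a new record.
                if (!inAgentLines)
                {
                    recordApplies = false;
                    inAgentLines = true;
                }
                if (value == "*")
                {
                    recordApplies = true;
                }
            }
            else
            {
                inAgentLines = false;
                if (recordApplies &&
                    field.Equals("Disallow", StringComparison.OrdinalIgnoreCase) &&
                    value.Length > 0)
                {
                    disallowed.Add(value);
                }
            }
        }
        return disallowed;
    }
}
```

The string returned by read() could be passed straight to ParseDefaultRecord in place of the two Regex.Split calls.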
