While crawling a site, it is a good idea to read its robots.txt file first so that you have a list of pages your crawler should not access. This way you not only respect the permissions the site has published but also save bandwidth and CPU cycles that can be put to better use elsewhere.
For complete details about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is a portion from that page:
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

- User-agent
The value of this field is the name of the robot the record is describing access policy for. If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

- Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
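To make these rules concrete, here is a small sample robots.txt file; the paths and the "BadBot" crawler name are invented purely for illustration:

User-agent: *
Disallow: /help/
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /

Under the first record, any robot other than BadBot may fetch /help.html but must skip everything under /help/ and /cgi-bin/, while the second record bars BadBot from the entire site.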
Following is the C# code to parse the robots.txt file and build a list of disallowed URLs:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
    class Program
    {
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;

        static void Main(string[] args)
        {
            String baseUrl = "http://www.cnn.com";
            getDisallowedUrls(baseUrl);
        }

        // Downloads robots.txt from the site root, keeps the record for
        // "User-agent: *" and collects the values of its Disallow lines.
        static void getDisallowedUrls(string baseUrl)
        {
            String robotsUrl = baseUrl + "/robots.txt";
            if (!isValidUrl(robotsUrl))
            {
                return;
            }
            urlOpen();

            String robotsTxtContent = read();
            List<String> disallowed = new List<String>();

            // Split the file into records, one per User-agent line.
            String[] userAgentRecords = Regex.Split(robotsTxtContent, "User-agent:");
            String defaultRecord = "";
            foreach (String record in userAgentRecords)
            {
                // Keep only the default record, which applies to any robot
                // that is not named explicitly.
                if (record.Trim().StartsWith("*"))
                {
                    defaultRecord = record.Trim().Substring(1);
                }
            }

            // Every fragment that follows a "Disallow:" token starts with a
            // path prefix that must not be crawled.
            String[] disallowEntries = Regex.Split(defaultRecord, "Disallow:");
            foreach (String entry in disallowEntries)
            {
                // Keep only the first line of the fragment so that other
                // directives are not glued onto the path.
                String path = entry.Trim().Split('\n')[0].Trim();
                if (!String.IsNullOrWhiteSpace(path))
                {
                    disallowed.Add(path);
                    Console.WriteLine(baseUrl + path);
                }
            }
            Console.ReadLine();
        }

        // Reads the whole response stream into a string.
        public static String read()
        {
            StreamReader sr = new StreamReader(resStream);
            return sr.ReadToEnd();
        }

        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        // Issues the request and returns true only when the server
        // answers with 200 OK.
        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)WebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                return (res.StatusCode == HttpStatusCode.OK);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Not a valid URL: " + ex.Message + " - " + url);
                return false;
            }
        }
    }
}
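Once the disallowed prefixes have been collected, the crawler only has to check every candidate path against them before fetching it, since any URL that starts with a disallowed value must not be retrieved. The helper below is a minimal sketch of that check; the class and method names (RobotsFilter, IsAllowed) and the sample prefixes are my own and not part of the program above:

using System;
using System.Collections.Generic;

class RobotsFilter
{
    // Path prefixes collected from the "User-agent: *" record of robots.txt.
    private readonly List<string> disallowed;

    public RobotsFilter(List<string> disallowedPrefixes)
    {
        disallowed = disallowedPrefixes;
    }

    // A path may be crawled only if it does not start with any disallowed
    // prefix; an empty Disallow value blocks nothing.
    public bool IsAllowed(string path)
    {
        foreach (string prefix in disallowed)
        {
            if (prefix.Length > 0 && path.StartsWith(prefix))
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        var filter = new RobotsFilter(new List<string> { "/help/", "/cgi-bin/" });
        Console.WriteLine(filter.IsAllowed("/help.html"));       // True
        Console.WriteLine(filter.IsAllowed("/help/index.html")); // False
    }
}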