
Friday, January 29, 2010

Parsing Robots.txt File

Crawling is an essential part of search engine development: the more sites a search engine crawls, the bigger its index will be. However, a site can restrict web crawlers from crawling certain pages by identifying those pages in a robots.txt file and making it available at the local URL /robots.txt. For example, you can see what CNN's robots.txt file looks like at http://www.cnn.com/robots.txt.

While crawling a site, it is a good idea to read the robots.txt file first so that you have a list of pages that your crawler shouldn't access. This way you are not only respecting the permissions provided by the site but also saving CPU cycles which can be utilized elsewhere.

For complete details about the format and semantics of a robots.txt file, visit The Web Robots Pages. Here is a portion from that page:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
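
To make the format concrete, here is a small illustrative robots.txt with a single record that applies to all robots (the paths are made up for the example):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private.html

Any robot that does not match a more specific record would skip every URL whose path starts with /cgi-bin/, /tmp/, or /private.html.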

Using this description we can easily write code to build a list of disallowed URLs. All we need to do is parse the robots.txt file, splitting it first on "User-agent" and then on "Disallow". This gives us the list, which we can then use in our crawler: we simply check it every time we are about to crawl a page.

Following is the code in C# to parse the robots.txt file and to create a list of disallowed URLs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace ParsingRobotTxt
{
    class Program
    {
        public static HttpWebRequest req;
        public static HttpWebResponse res;
        static Stream resStream;

        static void Main(string[] args)
        {
            String siteUrl = "http://www.cnn.com";

            getDisallowedUrls(siteUrl);
        }

        static void getDisallowedUrls(string siteUrl)
        {
            // robots.txt always lives at the root of the site
            String robotsUrl = siteUrl + "/robots.txt";

            if (!isValidUrl(robotsUrl))
            {
                return; // nothing to parse if robots.txt could not be fetched
            }
            urlOpen();

            String robotTxtContent = read();

            // List that holds URL paths which shouldn't be crawled
            List<string> disallowed = new List<string>();

            // Split on "User-agent:" and keep the record that applies to all robots ('*')
            String[] user_agents = Regex.Split(robotTxtContent, "User-agent:");
            String userAgents = "";
            foreach (String agent in user_agents)
            {
                if (agent.Trim().StartsWith("*"))
                {
                    userAgents = agent.Trim().Substring(1);
                }
            }

            // Split that record on "Disallow:" to collect the restricted paths
            String[] disallow = Regex.Split(userAgents, "Disallow:");

            foreach (String item in disallow)
            {
                if (item.Trim() != "")
                {
                    disallowed.Add(item.Trim());
                    Console.WriteLine(siteUrl + item.Trim());
                }
            }

            Console.ReadLine();
        }

        public static String read()
        {
            StreamReader sr = new StreamReader(resStream);
            return sr.ReadToEnd();
        }

        public static void urlOpen()
        {
            resStream = res.GetResponseStream();
        }

        public static bool isValidUrl(String url)
        {
            try
            {
                req = (HttpWebRequest)WebRequest.Create(url);
                res = (HttpWebResponse)req.GetResponse();
                return (res.StatusCode == HttpStatusCode.OK);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Not a Valid URL: " + ex.Message + " - " + url);
                return false;
            }
        }
    }
}
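
The program above only builds and prints the list; the last step mentioned earlier, checking a candidate URL against that list before crawling it, is a simple prefix match on the URL's path. Here is a minimal sketch of such a check, written as a helper that could be added to the Program class above (the isAllowed name and its parameters are mine, not part of any library):

public static bool isAllowed(List<string> disallowed, Uri candidate)
{
    // A URL should be skipped if its path starts with any disallowed entry.
    foreach (String rule in disallowed)
    {
        if (rule != "" && candidate.AbsolutePath.StartsWith(rule))
        {
            return false;
        }
    }
    return true;
}

// Example usage: with "/help" in the list, both of these are skipped,
// matching the /help.html and /help/index.html behaviour described above:
// isAllowed(disallowed, new Uri("http://www.cnn.com/help.html"))       -> false
// isAllowed(disallowed, new Uri("http://www.cnn.com/help/index.html")) -> false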

Friday, March 6, 2009

Peter Norvig: "we don't think it's a big advance to be able to type something as a question as opposed to keywords."

Some time ago I found myself in a healthy argument with a friend of mine about the future of search. Both of us are crazy about information retrieval, and naturally our discussions more often than not drift toward search. While we usually agree with each other's views, this time I was shocked to find a major difference in our opinions.

I sincerely believe that any breakthrough in search will have something to do with Natural Language Processing. My friend, however, thinks that will not be the case. That by itself was not a big deal. What shocked me was the evidence he used... Peter Norvig.

Now, I am a big fan of Peter Norvig, but as much as I admire all things Peter Norvig, I was disappointed to read what he said in his interview with Technology Review:

"TR: Companies such as Ask and Powerset are betting that the future is in natural-­language search, which lets people use real, useful sentences instead of potentially ambiguous keywords. What is Google doing with natural language?

PN: We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing "What is the capital of France?" won't get you better results than typing "capital of France." But understanding how words go together is important. To give some examples, "New York" is different from "York," but "Vegas" is the same as "Las Vegas," and "Jersey" may or may not be the same as "New Jersey." That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."

Apparently, Peter Norvig thinks that typing complete questions would just be a change in interface. I seriously doubt it. These days I hardly ever use single-keyword searches. I like Powerset and really wish I could google something much more intelligent than a few keywords. But if the director of research at Google doesn't think this will bring any big advancement to search, then maybe it is just me. I am sure Peter has substantial data to back his comments, but I for one would like to type in a question (and not something as simple as "Who is the 44th president of the USA?") and expect a relevant answer.

But that's just me... The interview was conducted in 2008, so here's hoping that since then Peter has given serious NLP usage in search a second thought.

Sunday, February 17, 2008

IR or Spam Filter

I haven't been updating my blog lately. The commencement of the spring 2008 semester has something to do with that, but the real reason is that I have been busy researching some interesting topics for my AI project.

Last semester I wanted to create an Image Spam Filter for my "Data Mining & Pattern Recognition" class. My theory was to apply OCR techniques to capture the text from the image, after which it really becomes a text spam-filter problem. I was thinking of using Neural Networks for the image text recognition and Bayes' Theorem for the spam filtering. But my idea was unanimously rejected by my group, and instead we developed a Handwriting Recognition system (which was still a better project to work on than our previous plan to tell time by reading an image of an analogue clock).

So now I have a chance to have another go at my Image Spam Filter project. But I am still debating it, because I am also fond of Information Retrieval problems. I am thinking of working on a search engine and automatic indexing of a technical book/manual. Since this is an individual project, I can do whatever I want (my professor has to approve it, of course).

When I think about it, I find both IR and the spam filter interesting. So it is not a clash of interests but more a question of what I will gain from each and what I want to do in the future.

While I try to work this out, feel free to give me your suggestions. Maybe you can offer an insight that will eventually help me reach a decision.