Friday, July 3, 2009

Extract Hyperlinks Using WebBrowser Object in C#

In my previous post, I mentioned using regular expressions to extract links from a web page. While I like using regular expressions, there is a much easier way of doing this in C#. All you need is a 'WebBrowser' object to load a web page and extract all of the hyperlinks present in it. However, you need to create a Windows Forms application to use the WebBrowser object, and then add the following code (here, in a button click event handler).

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;

namespace WebbrowserTest
{
    public partial class WebBrowserSample : Form
    {
        public WebBrowserSample()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebBrowser web = new WebBrowser();
            // Cancel any popup windows the page tries to open.
            web.NewWindow += new CancelEventHandler(web_NewWindow);
            web.Navigate("http://www.cnn.com");

            // Wait for the browser object to load the page,
            // pumping window messages so the UI does not freeze.
            while (web.ReadyState != WebBrowserReadyState.Complete)
            {
                Application.DoEvents();
            }

            HtmlElementCollection ht = web.Document.Links;
            List<string> urls = new List<string>();
            foreach (HtmlElement h in ht)
            {
                try
                {
                    // GetAttribute returns an empty string when the attribute is missing.
                    string href = h.GetAttribute("href");
                    if (!string.IsNullOrEmpty(href))
                    {
                        urls.Add(href);
                    }
                }
                catch
                {
                    // Skip any link whose attributes cannot be read.
                }
            }

            foreach (string url in urls)
            {
                Console.WriteLine(url);
            }
        }

        void web_NewWindow(object sender, CancelEventArgs e)
        {
            e.Cancel = true;
        }
    }
}



The 'NewWindow' event is used to discard popups which may open while navigating a web page. The 'Document.Links' property of the WebBrowser is used to get the 'HtmlElementCollection' of all of the links present in a web page. Once you have this collection, the links can be processed or viewed using the 'href' attribute of each 'HtmlElement' in it.

Even though this code is pretty straightforward, I found it a little slow in practice. I think the reason is that you have to make sure the WebBrowser object has read the entire web page before you can start using it (achieved by the 'while (web.ReadyState != WebBrowserReadyState.Complete)' loop).
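One way to avoid spinning in that loop is to let the WebBrowser tell you when the page is ready, through its 'DocumentCompleted' event. The following is only a minimal sketch of that idea, not a drop-in replacement: the form name 'LinkListForm' and the handler name are made up for illustration, and the 'ReadyState' check inside the handler is an assumption about how to ignore the extra events that pages with frames can raise.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical form used only for this sketch; not part of the sample above.
public class LinkListForm : Form
{
    private readonly WebBrowser web = new WebBrowser();

    public LinkListForm()
    {
        web.DocumentCompleted += Web_DocumentCompleted;
        // Start loading when the form loads; the handler below runs
        // once the browser reports that a document has finished loading.
        Load += (sender, e) => web.Navigate("http://www.cnn.com");
    }

    private void Web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted can fire once per frame, so ignore events
        // raised before the whole page has finished loading.
        if (web.ReadyState != WebBrowserReadyState.Complete)
        {
            return;
        }

        List<string> urls = new List<string>();
        foreach (HtmlElement link in web.Document.Links)
        {
            // GetAttribute returns an empty string when the attribute is missing.
            string href = link.GetAttribute("href");
            if (!string.IsNullOrEmpty(href))
            {
                urls.Add(href);
            }
        }

        foreach (string url in urls)
        {
            Console.WriteLine(url);
        }
    }

    [STAThread]
    private static void Main()
    {
        Application.Run(new LinkListForm());
    }
}
```

This does not make the page load any faster, but the UI thread is free while the page downloads instead of looping on 'Application.DoEvents()'. On some pages the handler may still run more than once, so real code would probably want to guard against processing the links twice.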

Also, if you are thinking about link processing in a non-Windows Forms application, then this program is not for you. For all the Windows Forms application developers, enjoy the code :)

14 comments:

  1. Thanks, I hadn't come across either the links collection or the new window code. Could be useful.

  2. @Jim, let me know if you find it of some use. It is always good to know how your code can be tweaked to achieve something more useful :)

  3. Thanks for the idea. I just want to add that it doesn't take much effort to utilize the WebBrowser object for this purpose in a console (non winforms) application. Simply do the following 2 things:

    1) Add a project reference to System.Windows.Forms

    2) Mark your application entry point as single threaded apartment. (put [STAThread] above your Main function)

    For a quick and dirty app I really don't see a problem with doing it this way. I imagine it's much less error-prone than using regex or other methods.

    Cheers.

  4. @Nick, wow! thanks for the suggestion. Will definitely try it out. Best.

  10. Good tip! I didn't know you could get the list of links with a method of the object.

    Off-topic: that anonymous jerk can shove his antivirus where the sun doesn't shine!
