Friday, July 3, 2009

Extract Hyperlinks Using WebBrowser Object in C#

In my previous post, I mentioned using regular expressions to extract links from a web page. While I like regular expressions, there is a much easier way to do this in C# .NET. All you need is a 'WebBrowser' object to load a web page and extract all of the hyperlinks in it. However, you need to create a Windows Forms application to use the WebBrowser object, and then add the following code (in a button click event handler).

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Windows.Forms;

namespace WebbrowserTest
{
    public partial class WebBrowserSample : Form
    {
        public WebBrowserSample()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebBrowser web = new WebBrowser();
            // Cancel any popup windows the page tries to open
            web.NewWindow += new CancelEventHandler(web_NewWindow);
            web.Navigate("http://www.cnn.com");

            // Wait for the browser object to finish loading the page
            while (web.ReadyState != WebBrowserReadyState.Complete)
            {
                Application.DoEvents();
            }

            HtmlElementCollection links = web.Document.Links;
            List<string> urls = new List<string>();
            foreach (HtmlElement link in links)
            {
                try
                {
                    // GetAttribute returns an empty string when the attribute is missing
                    string href = link.GetAttribute("href");
                    if (!string.IsNullOrEmpty(href))
                    {
                        urls.Add(href);
                    }
                }
                catch
                {
                    // Skip any element whose attribute cannot be read
                }
            }

            foreach (string url in urls)
            {
                Console.WriteLine(url);
            }
        }

        void web_NewWindow(object sender, CancelEventArgs e)
        {
            e.Cancel = true;
        }
    }
}



The 'NewWindow' event is used to discard popups that may open while navigating a web page. The 'Document.Links' property of the WebBrowser's document returns an 'HtmlElementCollection' containing all of the links present in the web page. Once you have this collection, each link can be processed or viewed by reading the 'href' attribute of the corresponding 'HtmlElement'.

Even though this is pretty straightforward code, I found it a little slow. I think the reason is that you have to make sure the WebBrowser object has loaded the entire web page before you can start using it (achieved by the 'while (web.ReadyState != WebBrowserReadyState.Complete)' loop).
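An alternative to the busy-wait loop is to let the WebBrowser notify you through its 'DocumentCompleted' event. The following is a minimal sketch of that idea, not a tested replacement for the code above. The form name 'EventDrivenLinkForm' and the handler name are made up for illustration, and the form is built in code rather than with the designer. The page still has to load completely, so this is not necessarily faster, but it avoids spinning in a loop while waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

namespace WebbrowserTest
{
    // Hypothetical form, built in code so no designer file is needed.
    public class EventDrivenLinkForm : Form
    {
        private readonly WebBrowser web = new WebBrowser();

        public EventDrivenLinkForm()
        {
            web.Dock = DockStyle.Fill;
            web.ScriptErrorsSuppressed = true;
            // Cancel any popup windows, as in the original example
            web.NewWindow += (sender, e) => e.Cancel = true;
            web.DocumentCompleted += Web_DocumentCompleted;
            Controls.Add(web);

            // Start navigating once the form has loaded
            Load += (sender, e) => web.Navigate("http://www.cnn.com");
        }

        private void Web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // The event can fire once per frame, so wait until the
            // browser as a whole reports that it is done.
            if (web.ReadyState != WebBrowserReadyState.Complete || web.Document == null)
            {
                return;
            }

            // Only handle the first complete load
            web.DocumentCompleted -= Web_DocumentCompleted;

            List<string> urls = new List<string>();
            foreach (HtmlElement link in web.Document.Links)
            {
                string href = link.GetAttribute("href");
                if (!string.IsNullOrEmpty(href))
                {
                    urls.Add(href);
                }
            }

            foreach (string url in urls)
            {
                Console.WriteLine(url);
            }
        }

        [STAThread]
        private static void Main()
        {
            Application.Run(new EventDrivenLinkForm());
        }
    }
}
```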

Also, if you are thinking about link processing in a non-Windows Forms application, then this program is not for you. For all the Windows Forms application developers, enjoy the code :)

14 comments:

Unknown said...

Thanks, I hadn't come across either the links collection or the new window code. Could be useful.

Syed Rizvi said...

@Jim, let me know if you find it of some use. It is always good to know how your code can be tweaked to achieve something more useful :)

Nick said...

Thanks for the idea. I just want to add that it doesn't take much effort to utilize the WebBrowser object for this purpose in a console (non winforms) application. Simply do the following 2 things:

1) Add a project reference to System.Windows.Forms

2) Mark your application entry point as single threaded apartment. (put [STAThread] above your Main function)

For a quick and dirty app I really don't see a problem with doing it this way. I imagine it's much less error prone than using regex or other methods.
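The two steps above might look roughly like this in a console project. This is only a sketch under those assumptions: the class name 'ConsoleLinkDump' is made up, the project is assumed to reference System.Windows.Forms, and the waiting loop is the same 'Application.DoEvents()' loop used in the post.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical console program; requires a reference to System.Windows.Forms.
internal static class ConsoleLinkDump
{
    // The WebBrowser control needs a single-threaded apartment
    [STAThread]
    private static void Main()
    {
        using (WebBrowser web = new WebBrowser())
        {
            web.ScriptErrorsSuppressed = true;
            web.Navigate("http://www.cnn.com");

            // Pump messages until the page has finished loading
            while (web.ReadyState != WebBrowserReadyState.Complete)
            {
                Application.DoEvents();
            }

            if (web.Document == null)
            {
                return;
            }

            foreach (HtmlElement link in web.Document.Links)
            {
                string href = link.GetAttribute("href");
                if (!string.IsNullOrEmpty(href))
                {
                    Console.WriteLine(href);
                }
            }
        }
    }
}
```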

Cheers.

Syed Rizvi said...

@Nick, wow! thanks for the suggestion. Will definitely try it out. Best.

Anonymous said...

Good tip! I didn't know you could get the list of links with a method of the object.

Off topic: that damned anonymous poster can shove his antivirus where the sun doesn't shine!
