Extract or scrape all web links from a web page in VB.NET

The following code snippet shows how to scrape (extract) all web links from a web page using a simple procedure.

First we download the HTML content of the given URL, and then we use a regular expression to find all links in that content.
We use the following regular expression, shown as it is written inside a VB.NET string literal, where each quotation mark is doubled:

    <a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>

The regular expression above breaks down as follows:

    <a         Start of the HTML anchor
    \s+        One or more whitespace characters
    href       Literal text "href" in the anchor
    \s*        Zero or more whitespace characters
    =          Literal equals sign in the anchor
    \s*        Zero or more whitespace characters
    ""?        Zero or one quotation mark (doubled to escape it in the VB.NET string)
    (          Start of the first group: the anchor URL
    [^"" >]+   One or more characters other than a quotation mark, a space, or >
    )          End of the first group
    ""?        Zero or one quotation mark (doubled to escape it in the VB.NET string)
    >          Literal > that closes the opening anchor tag
    (.+)       Second group, one or more of any character (greedy): the anchor text
    </a>       Literal closing tag of the anchor
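
To see the pattern in isolation, here is a minimal console sketch that applies it to a small HTML string. The module name RegexDemo, the helper ExtractLinks, and the sample markup are hypothetical and used only for this illustration; they are not part of the form example.

    Imports System
    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Module RegexDemo
        ' Hypothetical helper: returns the URL (group 1) of every anchor the pattern matches.
        Function ExtractLinks(ByVal html As String) As List(Of String)
            Dim links As New List(Of String)()
            Dim pattern As String = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"
            For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
                links.Add(m.Groups(1).Value)
            Next
            Return links
        End Function

        Sub Main()
            ' Made-up sample markup: one quoted and one unquoted href, on separate lines.
            Dim html As String = "<p><a href=""http://example.com/one"">First</a></p>" & _
                Environment.NewLine & _
                "<p><a href=http://example.com/two>Second</a></p>"

            For Each link As String In ExtractLinks(html)
                Console.WriteLine(link)
            Next
        End Sub
    End Module

Run as a console program, this should print the two URLs from the sample markup. Note that the pattern expects href to be the first attribute and to be followed directly by >, so an anchor such as <a class="x" href="..."> is not matched. Because (.+) is greedy, two anchors on the same line are reported as a single match.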

The following example requires a ListView control named lsvlinks, a TextBox control named txtURL, and a Button control named btnFind whose Click event is handled by btnFind_Click():

    ' These Imports go at the top of the form's code file.
    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions

    ' The members below go inside the form class.
    Private requestweb As HttpWebRequest
    Private responseWeb As HttpWebResponse

    Private Sub btnFind_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnFind.Click
        ' Start with an empty string so the regex never receives Nothing.
        Dim WebSource As String = String.Empty
        Dim objStreamReader As StreamReader = Nothing

        Try
            ' Build a GET request for the URL typed into txtURL.
            ' An invalid or non-HTTP URL throws here and is reported below.
            requestweb = CType(WebRequest.Create(txtURL.Text), HttpWebRequest)
            With requestweb
                .UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
                .Method = "GET"
                .Timeout = 10000 ' milliseconds
            End With

            responseWeb = CType(requestweb.GetResponse(), HttpWebResponse)
        Catch ex As Exception
            MessageBox.Show("Error retrieving the Web page " & _
                "you requested. Please check the entered URL and your Internet connection.")
            Exit Sub
        End Try

        ' Read the whole response body into WebSource.
        Try
            objStreamReader = New StreamReader(responseWeb.GetResponseStream())
            WebSource = objStreamReader.ReadToEnd()
        Catch ex As Exception
            MessageBox.Show(ex.Message)
            Exit Sub
        Finally
            ' The reader is still Nothing if its constructor failed.
            If objStreamReader IsNot Nothing Then objStreamReader.Close()
            responseWeb.Close()
        End Try

        ' Find every anchor and list its URL (group 1) in the ListView.
        lsvlinks.Items.Clear()
        Dim strReg As String
        strReg = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"
        Dim reg As New Regex(strReg, RegexOptions.IgnoreCase)
        Dim m As Match = reg.Match(WebSource)
        While m.Success
            Dim lvi As New ListViewItem()
            lvi.Text = m.Groups(1).Value
            lsvlinks.Items.Add(lvi)
            m = m.NextMatch()
        End While
    End Sub
