Extract or scrape all web links from a web page in VB.NET

The following code snippet shows how to scrape (extract) all web links from a web page. It can be done with one simple procedure.

First we download the full HTML content of the given URL, and then we use a regular expression to find all links in that content.
We use the following regular expression, written as it appears inside a VB.NET string literal, where each quotation mark is doubled:

    <a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>

The regular expression above breaks down as follows:

  1. <a        Start of the HTML anchor
  2. \s+       One or more whitespace characters
  3. href      The literal text "href"
  4. \s*       Zero or more whitespace characters
  5. =         The literal equals sign
  6. \s*       Zero or more whitespace characters
  7. ""?       Zero or one quotation mark (doubled to escape it in the VB.NET string)
  8. (         Start of the first group: the anchor URL
  9. [^"" >]+  One or more characters other than a quotation mark, a space, or >
  10. )         End of the first group
  11. ""?       Zero or one quotation mark (doubled to escape it in the VB.NET string)
  12. >         The literal > that closes the opening tag
  13. (.+)      Second group, one or more of any character except a line break: the anchor text
  14. </a>      The literal closing tag of the HTML anchor
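
To see what the two groups capture, the pattern can be tried on a small hard-coded string before it is wired to a web request. The following is only a sketch: the module name RegexDemo and the sample HTML are made up for illustration and are not part of the original example.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module RegexDemo
    Sub Main()
        ' Hypothetical sample input; each anchor sits on its own line.
        Dim html As String = "<a href=""http://example.com/"">Example</a>" & vbLf & _
                             "<a href=page2.html>Second page</a>"

        ' Same pattern as in the article.
        Dim reg As New Regex("<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>", RegexOptions.IgnoreCase)

        For Each m As Match In reg.Matches(html)
            ' Groups(1) is the URL, Groups(2) is the anchor text.
            Console.WriteLine(m.Groups(1).Value & " -> " & m.Groups(2).Value)
        Next
    End Sub
End Module
```

Note that (.+) is greedy and "." does not cross line breaks, so two anchors on the same line are reported as a single match whose text runs up to the last </a> on that line. Replacing (.+) with the lazy form (.+?) avoids this.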

The following example requires a ListView control named lsvlinks, a TextBox control named txtURL, and a Button control named btnFind whose Click event is handled by btnFind_Click():

    ' These Imports belong at the top of the form's code file.
    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions

    Private requestweb As HttpWebRequest
    Private responseWeb As HttpWebResponse

    Private Sub btnFind_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnFind.Click
        Dim WebSource As String = String.Empty
        Dim objStreamReader As StreamReader = Nothing

        ' Build and send a GET request for the URL typed into txtURL.
        ' Creating the request is inside the Try so that a malformed URL is reported too.
        Try
            requestweb = CType(WebRequest.Create(txtURL.Text), HttpWebRequest)
            With requestweb
                .UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
                .Method = "GET"
                .Timeout = 10000 ' milliseconds
            End With
            responseWeb = CType(requestweb.GetResponse(), HttpWebResponse)
        Catch ex As Exception
            MessageBox.Show("Error retrieving the Web page " & _
                "you requested. Please check the entered Url and your internet connection")
            Exit Sub
        End Try

        ' Read the whole response body into WebSource.
        Try
            objStreamReader = New StreamReader(responseWeb.GetResponseStream())
            WebSource = objStreamReader.ReadToEnd()
        Catch ex As Exception
            MessageBox.Show(ex.Message)
            Exit Sub
        Finally
            ' The reader is still Nothing if its constructor threw, so guard the Close call.
            If objStreamReader IsNot Nothing Then objStreamReader.Close()
            responseWeb.Close()
        End Try

        ' Find every anchor and list the URL captured by the first group.
        lsvlinks.Items.Clear()
        Dim strReg As String
        strReg = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"
        Dim reg As New Regex(strReg, RegexOptions.IgnoreCase)
        Dim m As Match = reg.Match(WebSource)
        While m.Success
            Dim lvi As New ListViewItem()
            lvi.Text = m.Groups(1).Value
            lsvlinks.Items.Add(lvi)
            m = m.NextMatch()
        End While
    End Sub
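
The pattern above only matches anchors where href is the sole attribute, and it returns relative links exactly as they are written in the page. A possible variation, sketched here under assumptions, moves the extraction into a function that can be tested without the form. The names LinkExtractor and ExtractLinks, the looser pattern, and the sample input are all hypothetical and are not taken from the original example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Hypothetical helper: returns the href values found in html,
    ' resolved to absolute URLs against baseUrl.
    Function ExtractLinks(ByVal html As String, ByVal baseUrl As String) As List(Of String)
        Dim links As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return links

        Dim baseUri As New Uri(baseUrl)
        ' Assumed looser pattern: allows other attributes before href
        ' and accepts single, double, or no quotes around the value.
        Dim reg As New Regex("<a\s[^>]*?href\s*=\s*[""']?([^""'\s>]+)", RegexOptions.IgnoreCase)

        For Each m As Match In reg.Matches(html)
            Dim absolute As Uri = Nothing
            ' Skip values that cannot be combined with the base URL.
            If Uri.TryCreate(baseUri, m.Groups(1).Value, absolute) Then
                links.Add(absolute.AbsoluteUri)
            End If
        Next
        Return links
    End Function

    Sub Main()
        Dim sample As String = "<a class='nav' href='/about'>About</a> <a href=""http://example.org/x"">X</a>"
        For Each link As String In ExtractLinks(sample, "http://example.com/index.html")
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

In the click handler, the While loop could then be replaced by a loop over ExtractLinks(WebSource, txtURL.Text). A regular expression is still not a full HTML parser, so unusual markup may be missed.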

  • a BIG and a HUGE thanks 🙂

  • Kev

    You may want to tell people what lsvlinks is.

    • I have corrected the control name from lvi to lsvlinks. lsvlinks is the ListView control.

  • morocco

    big thanks