Extract or scrape all web links from a web page in VB.NET

The following code snippet shows how to scrape (extract) all web links from a web page using a single, simple procedure.

First we download the full HTML content of the given URL, and then we use a regular expression to find every link in that content.
The regular expression, written as it appears inside a VB.NET string literal, is:

<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>

The regular expression above breaks down as follows:

<a        Start of the HTML anchor tag
\s+       One or more whitespace characters
href      The literal attribute name href
\s*       Zero or more whitespace characters
=         The literal equals sign
\s*       Zero or more whitespace characters
""?       Zero or one quotation mark (doubled because it is escaped inside a VB.NET string)
(         Start of the first group: the anchor URL
[^"" >]+  One or more characters other than a quotation mark, a space, or >
)         End of the first group
""?       Zero or one quotation mark (escaped)
>         The literal > that closes the opening tag
(.+)      Second group: one or more of any character (the anchor text)
</a>      The literal closing anchor tag
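
To see what the two groups capture, the pattern can be tried on a small hard-coded string before pointing it at a real page. The following is only a minimal console sketch: the module name LinkRegexDemo, the helper ExtractLinks, and the sample HTML are made up for illustration and are not part of the form example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkRegexDemo
    ' Same pattern as above; "" is a quotation mark escaped for a VB.NET string.
    Private Const LinkPattern As String = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"

    ' Hypothetical helper: returns group 1 (the URL) of every anchor the pattern matches.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        For Each m As Match In Regex.Matches(html, LinkPattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main()
        ' Sample input: one quoted URL, one unquoted URL written in upper case.
        Dim sample As String = "<p><a href=""http://example.com/a"">First</a></p>" & vbCrLf &
                               "<p><A HREF = page2.html>Second</A></p>"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Run against the sample string, this prints http://example.com/a and page2.html. Two limits of the pattern are worth knowing: because (.+) is greedy, two anchors on the same line are matched as a single link, and an anchor with any attribute before or after href (for example class or target) is not matched at all.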

The following example requires a ListView control named lsvlinks, a TextBox control named txtURL, and a Button control named btnFind whose Click event is handled by btnFind_Click():

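    ' These Imports go at the top of the form's code file.
    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions

    ' The members below go inside the form class.
    Private requestweb As HttpWebRequest
    Private responseWeb As HttpWebResponse

    Private Sub btnFind_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnFind.Click
        Dim WebSource As String = String.Empty
        Dim objStreamReader As StreamReader = Nothing

        ' Request the page. Creating the request is inside the Try block
        ' so that a malformed URL is reported instead of crashing.
        Try
            requestweb = CType(WebRequest.Create(txtURL.Text), HttpWebRequest)
            With requestweb
                .UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
                .Method = "GET"
                .Timeout = 10000
            End With
            responseWeb = CType(requestweb.GetResponse(), HttpWebResponse)
        Catch ex As Exception
            MessageBox.Show("Error retrieving the Web page " & _
                "you requested. Please check the entered Url and your internet connection")
            Exit Sub
        End Try

        ' Read the whole response body into a string.
        If Not IsNothing(responseWeb.GetResponseStream()) Then
            Try
                objStreamReader = New StreamReader(responseWeb.GetResponseStream())
                WebSource = objStreamReader.ReadToEnd()
            Catch ex As Exception
                Exit Sub
            Finally
                ' Release the reader and the connection in every case.
                If objStreamReader IsNot Nothing Then objStreamReader.Close()
                responseWeb.Close()
            End Try
        End If

        ' Find every anchor and add its URL (group 1) to the ListView.
        Dim strReg As String
        strReg = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"
        Dim reg As New Regex(strReg, RegexOptions.IgnoreCase)
        Dim m As Match = reg.Match(WebSource)
        While m.Success
            Dim lvi As New ListViewItem()
            lvi.Text = m.Groups(1).Value
            lsvlinks.Items.Add(lvi)
            m = m.NextMatch()
        End While
    End Sub
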
  • a BIG and a HUGE thanks 🙂

  • Kev

    You may want to tell people what lsvlinks is.

    • I have corrected the listbox control name lvi to lsvlinks. lsvlink is the listbox control.

  • morocco

    big thanks