Extract or scrape all web links from a web page in VB.NET
The following code snippet shows how to scrape (extract) all web links from a web page with a simple procedure.

First we download the HTML content of the given URL, and then we use a regular expression to find all the links in that content.
We use the following regular expression:

<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>

The doubled quotation marks ("") are how a single quotation mark is escaped inside a VB.NET string literal. The expression breaks down as follows:

- <a          The literal text that starts the HTML anchor
- \s+         One or more whitespace characters
- href        The literal text href
- \s*         Zero or more whitespace characters
- =           The literal equals sign
- \s*         Zero or more whitespace characters
- ""?         Zero or one quotation mark (escaped)
- (           Start of the first group, which captures the anchor URL
- [^"" >]+    One or more characters other than a quotation mark, a space, or >
- )           End of the first group
- ""?         Zero or one quotation mark (escaped)
- >           The literal > that closes the opening anchor tag
- (.+)        The second group, matching one or more of any character except a newline: the anchor text
- </a>        The literal closing tag of the HTML anchor
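To see the two capture groups in action, the pattern can be tried on a single hard-coded anchor before any page is downloaded. The following is only a minimal sketch: the module name RegexDemo, the helper FirstLink, and the sample markup are illustrative assumptions, not part of the original example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexDemo
    ' The pattern from the article; "" is one quotation mark inside a VB.NET string literal.
    Public Const LinkPattern As String = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"

    ' Hypothetical helper: returns {URL, anchor text} for the first match,
    ' or Nothing when the input contains no matching anchor.
    Public Function FirstLink(html As String) As String()
        Dim m As Match = Regex.Match(html, LinkPattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing
        Return New String() {m.Groups(1).Value, m.Groups(2).Value}
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from a real page.
        Dim sample As String = "<p>See <a href=""http://example.com/docs"">the docs</a> for details.</p>"
        Dim parts As String() = FirstLink(sample)
        If parts IsNot Nothing Then
            Console.WriteLine("URL:  " & parts(0))   ' group 1: the href value
            Console.WriteLine("Text: " & parts(1))   ' group 2: the anchor text
        End If
    End Sub
End Module
```

Note that the pattern only matches anchors where href is the first and only attribute. Because (.+) is greedy and . does not match a newline, two anchors on the same line are captured as a single match, and anchor text that spans several lines is not matched at all.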
The following example requires a ListView control named lsvlinks, a TextBox control named txtURL, and a Button control named btnFind whose Click event is handled by btnFind_Click():
' At the top of the form's code file:
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

' Inside the form class:
Private requestweb As HttpWebRequest
Private responseWeb As HttpWebResponse

Private Sub btnFind_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnFind.Click
    Dim WebSource As String = String.Empty
    Dim objStreamReader As StreamReader = Nothing

    Try
        ' Build a GET request for the URL typed into txtURL.
        ' Creating the request inside the Try also catches malformed URLs.
        requestweb = CType(WebRequest.Create(txtURL.Text), HttpWebRequest)
        With requestweb
            .UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
            .Method = "GET"
            .Timeout = 10000 ' milliseconds
        End With

        responseWeb = CType(requestweb.GetResponse(), HttpWebResponse)
    Catch ex As Exception
        MessageBox.Show("Error retrieving the web page you requested. " & _
                        "Please check the URL you entered and your internet connection.")
        Exit Sub
    End Try

    ' Read the whole response body into WebSource.
    If Not IsNothing(responseWeb.GetResponseStream()) Then
        Try
            objStreamReader = New StreamReader(responseWeb.GetResponseStream())
            WebSource = objStreamReader.ReadToEnd()
        Catch ex As Exception
            MessageBox.Show(ex.Message)
            Exit Sub
        Finally
            ' The reader is still Nothing if its constructor failed.
            If objStreamReader IsNot Nothing Then objStreamReader.Close()
            responseWeb.Close()
        End Try
    End If

    lsvlinks.Items.Clear()

    Dim strReg As String
    strReg = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+)</a>"
    Dim reg As New Regex(strReg, RegexOptions.IgnoreCase)

    ' Add the URL (group 1) of every match to the ListView.
    Dim m As Match = reg.Match(WebSource)
    While m.Success
        Dim lvi As New ListViewItem()
        lvi.Text = m.Groups(1).Value
        lsvlinks.Items.Add(lvi)
        m = m.NextMatch()
    End While
End Sub
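The matching step can also be separated from the form so that it is easier to try out on its own. The sketch below is one possible variation, not the method used above: it changes (.+) to the lazy (.+?) so that each match stops at the nearest </a>, iterates with Regex.Matches, and resolves relative links against the page address with Uri.TryCreate. The names LinkExtractor, LazyPattern, and ExtractLinks, and the sample input, are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Variant of the article's pattern: (.+?) is lazy, so a match ends at the nearest </a>.
    Private Const LazyPattern As String = "<a\s+href\s*=\s*""?([^"" >]+)""?>(.+?)</a>"

    ' Hypothetical helper: returns every captured href as an absolute URL.
    ' baseUrl is the address of the page the HTML came from.
    Public Function ExtractLinks(html As String, baseUrl As String) As List(Of String)
        Dim links As New List(Of String)()
        Dim baseUri As New Uri(baseUrl)
        For Each m As Match In Regex.Matches(html, LazyPattern, RegexOptions.IgnoreCase)
            Dim resolved As Uri = Nothing
            ' A relative href such as "/about" is combined with baseUri;
            ' an absolute href is kept as it is.
            If Uri.TryCreate(baseUri, m.Groups(1).Value, resolved) Then
                links.Add(resolved.AbsoluteUri)
            End If
        Next
        Return links
    End Function

    Sub Main()
        ' Made-up input with two anchors on one line.
        Dim html As String = "<a href=""/about"">About</a> | <a href=""http://example.org/"">Home</a>"
        For Each link As String In ExtractLinks(html, "http://example.com/index.html")
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

In the button handler, the returned list could then be used to fill lsvlinks instead of the While loop. For pages with irregular markup, a regular expression will still miss links that a real HTML parser would find.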