How to extract text from a website? VB2008

korae

Hello everyone! Is there a way to extract specific text from a website?

For example, below is the website I want to extract text from:

Singapore Pools

I want to get the numbers in the table and display them in labels in my application.

Below is the code I tried. It gets the page source, but I know it displays everything. Is there something I can add so that I get only the specific text?

VB.NET:
' Requires "Imports System.Net" at the top of the file.
Dim webclient As New WebClient
' DownloadString returns the whole page source as text.
Dim eIP As String = webclient.DownloadString("http://www.singaporepools.com.sg/en/lottery/4d_results.html")
TextBox1.Text = eIP
 
It is easier to use the WebBrowser control, which parses the HTML into a document object tree. As with plain-text parsing, you have to look for cues in the source that you can lock onto, but with the Document tree you can rely on the code structure to a much greater extent.

Your sample data has multiple nested tables with seemingly no specific ids to query, and each piece of table data is in its own cell (td). At first sight, each of these cells has a 'class' attribute (used for page formatting) that starts with "resultssection": "resultssectionheader" for info texts and "resultssectiontext" for value texts. The draw number/date is in a cell with 'class' "normal10". Let's load the page and query the Document for these values. Remember that this is a "weak" analysis that may or may not lead to a match.

For simplicity, all values are added to a ListBox. Add the WebBrowser and ListBox controls to the form. The WebBrowser's Visible property can be set to False, since it is not necessary to display it. Navigate to the page:
VB.NET:
Me.WebBrowser1.Navigate("http://www.singaporepools.com.sg/Lottery?page=four_d")
Double-click the WebBrowser in the Designer to get the DocumentCompleted event handler, then add code that checks that the Document is ready and queries for the cues mentioned above:
VB.NET:
If Me.WebBrowser1.ReadyState = WebBrowserReadyState.Complete Then
    For Each cell As HtmlElement In Me.WebBrowser1.Document.GetElementsByTagName("td")
        Dim cls As String = cell.GetAttribute("className")                
        If cls.StartsWith("normal10") Then 'draw/date
            Me.ListBox1.Items.Add(cell.InnerText.Trim)
        ElseIf cls.StartsWith("resultssectionheader") Then 'category
            Me.ListBox1.Items.Add(cell.InnerText.Trim)
        ElseIf cls.StartsWith("resultssectiontext") Then 'result
            Me.ListBox1.Items.Add(cell.InnerText.Trim)
        End If
    Next
End If
You probably also noticed that there is a lot of whitespace around the values in the source code; this is (hopefully ;)) removed by the Trim function.
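
Keep in mind that Trim only strips whitespace at the start and end of a string. If a cell's text also had line breaks or runs of spaces in the middle, a small helper could collapse them. This is only a sketch: the name CleanCellText is made up for illustration, and whether this page actually needs it is an assumption.

```vbnet
Imports System.Text.RegularExpressions

Module CellTextHelper
    ' Hypothetical helper (not part of the code above): collapses every run of
    ' whitespace (spaces, tabs, line breaks) to a single space, then trims the ends.
    Function CleanCellText(ByVal raw As String) As String
        ' InnerText can be Nothing for an empty element, so guard against it.
        If raw Is Nothing Then Return String.Empty
        Return Regex.Replace(raw, "\s+", " ").Trim()
    End Function
End Module
```

In the loop, cell.InnerText.Trim could then be swapped for CleanCellText(cell.InnerText), which also avoids an exception if InnerText happens to be Nothing.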

Run it, and after a few seconds the results for all three tables appear in the ListBox. No redundant data appears, so the analysis above was sufficient and no further analysis or code change is necessary. You may want to present the results differently, but the groundwork is done.
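
One possible way to present the results differently is to group each value under the header cell that precedes it, so a single category can be shown in its own label. The sketch below is an illustration only: GroupByHeader is a hypothetical helper, it assumes every "resultssectionheader" cell comes before its "resultssectiontext" cells in document order, and it merges values when the same header text appears in more than one table.

```vbnet
Imports System.Collections.Generic

Module ResultGrouping
    ' Hypothetical helper: takes (className, text) pairs in document order and
    ' collects the text of each "resultssectiontext" cell under the most recent
    ' "resultssectionheader" cell. Cells with any other class are ignored.
    Function GroupByHeader(ByVal cells As IEnumerable(Of KeyValuePair(Of String, String))) As Dictionary(Of String, List(Of String))
        Dim groups As New Dictionary(Of String, List(Of String))
        Dim current As List(Of String) = Nothing
        For Each cell As KeyValuePair(Of String, String) In cells
            ' Treat a missing text as empty, then trim like the ListBox code does.
            Dim text As String = If(cell.Value, String.Empty).Trim()
            If cell.Key.StartsWith("resultssectionheader") Then
                ' Reuse the list if this header text was seen before.
                If Not groups.TryGetValue(text, current) Then
                    current = New List(Of String)
                    groups(text) = current
                End If
            ElseIf cell.Key.StartsWith("resultssectiontext") AndAlso current IsNot Nothing Then
                current.Add(text)
            End If
        Next
        Return groups
    End Function
End Module
```

To use it, the DocumentCompleted loop would add a New KeyValuePair(Of String, String)(cls, cell.InnerText) to a list for each td instead of filling the ListBox, pass that list to GroupByHeader, and then look up whichever header text the page actually uses.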
 