scrape html table - question...

TeachMe (Member; joined Jun 27, 2012; 16 messages; programming experience: Beginner)

Say I have a table whose contents vary from page to page (various product titles, prices, and ratings in no particular order). I noticed that the "Price:" or "Rating:" column sometimes has no value for a row. So when I scrape each column into its own array and send the arrays to a ListView, the data no longer lines up whenever a value is missing, say in the "Price:" column.
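
To make the mismatch concrete, here is a tiny stand-alone sketch of what I think is happening. The item names, the `MisalignmentDemo` module, and the `PairByIndex` helper are made up for illustration and are not part of my real project:

```vbnet
Imports System
Imports System.Collections.Generic

Module MisalignmentDemo
    ' Pairs titles and prices purely by list position,
    ' the same way the split arrays are paired in my form code.
    Function PairByIndex(titles As List(Of String), prices As List(Of String)) As List(Of String)
        Dim result As New List(Of String)
        For i As Integer = 0 To titles.Count - 1
            ' Fall back to an empty string once the shorter list runs out.
            Dim price As String = If(i < prices.Count, prices(i), "")
            result.Add(titles(i) & " -> " & price)
        Next
        Return result
    End Function

    Sub Main()
        ' Made-up rows: "Item B" has an empty price cell, so only two prices are collected.
        Dim titles As New List(Of String) From {"Item A", "Item B", "Item C"}
        Dim prices As New List(Of String) From {"$20.00", "$13.95"}

        For Each entry As String In PairByIndex(titles, prices)
            Console.WriteLine(entry)
        Next
        ' Prints:
        '   Item A -> $20.00
        '   Item B -> $13.95   (this price really belongs to Item C)
        '   Item C ->
    End Sub
End Module
```

Every price after the missing one shifts up by one row, and the last row ends up blank.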

Here is an example of an HTML table that I'm trying to scrape data from. Notice that row #5 is missing its price. This is what throws off the alignment of the data when it's added to the ListView in VB.NET:

HTML:
<html>
<head>
<style>

table {
margin:auto;
margin-top:50px;
font-family: arial, sans-serif;
border-collapse: collapse;
width: 40%;
}

td{
border: 3px solid #000;
text-align: left;
padding: 3px;
}

th {
border: 3px solid #000;
background-color:gold;
text-align: left;
padding: 3px;
}

tr:nth-child(even) {
background-color: #dddddd;
}
</style>
</head>
<body>
    <table>
        <tr><th>#</th><th>Product Title:</th><th width="60">Price:</th><th width="60">Rating:</th></tr>
        <tr><td width="20">1</td><td class="ProductTitle">Minera Natural Dead Sea Salt, 5lbs Bulk Bag - Fine Grain</td><td class="Price">$20.00</td><td class="Rating">9/10</td></tr>
        <tr><td width="20">2</td><td class="ProductTitle">Minera Dead Sea Salt 2lb Bag Fine Grain, 100% Pure Mineral Salt Treatment</td><td class="Price">$9.99</td><td class="Rating">6/10</td></tr>
        <tr><td width="20">3</td><td class="ProductTitle">Minera Pure Dead Sea Salt 10lbs Fine Grain</td><td class="Price">$15.95</td><td class="Rating">8/10</td></tr>
        <tr><td width="20">4</td><td class="ProductTitle">Dead Sea Warehouse - Amazing Minerals Dead Sea Bath Salts, Temporary Relief from...</td><td class="Price">$16.00</td><td class="Rating">5/10</td></tr>
        <tr><td width="20">5</td><td class="ProductTitle">Natural Planet Dead Sea Salt, 5lbs Fine Grain - 100% Pure Bath Salt - For Psoriasis...</td><td></td><td class="Rating">5/10</td></tr>
        <tr><td width="20">6</td><td class="ProductTitle">Art Naturals Himalayan Salt Body Scrub 20oz -Deep Cleansing Exfoliator With Shea...</td><td class="Price">$13.95</td><td class="Rating">7/10</td></tr>
        <tr><td width="20">7</td><td class="ProductTitle">Dead Sea Salt 2.2lb try for Psoriasis, Eczema, and Dermatitis (1 x Resealable...</td><td class="Price">$9.99</td><td class="Rating">4/10</td></tr>
        <tr><td width="20">8</td><td class="ProductTitle">Premier Dead Sea Aromatherapy Mineral Body Treatment, Silver, Salt Scrub, 425...</td><td class="Price">$15.95</td><td class="Rating">8/10</td></tr>
        <tr><td width="20">9</td><td class="ProductTitle">Dead Sea Warehouse - Amazing Minerals Dead Sea Bath Salts, Temporary Relief from...</td><td class="Price">$16.00</td><td class="Rating">6/10</td></tr>
        <tr><td width="20">10</td><td class="ProductTitle">Natural Planet Dead Sea Salt, 50lbs Fine Grain - 100% Pure Bath Salt - For Psoriasis...</td><td class="Price">$90.25</td><td class="Rating">10/10</td></tr>

    </table>
</body>
</html>


Now here is an example of what I'm using in VB.NET to collect data from this table:


VB.NET:
Imports System.Text.RegularExpressions
Public Class Form1
    Dim ITEM As New ListViewItem
    Dim ProductTitle As String
    Dim ProductPrice As String
    Dim ProductRating As String

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        ListView1.Items.Clear()
        ProductTitle = ""
        ProductPrice = ""
        ProductRating = ""

        Dim keyword As String = TextBox1.Text
        keyword = keyword.Replace(" ", "+")
        Try
            'This is the HTML table that I'm talking about:
            Dim html As String = "THE HTML TABLE SPECIFIED"
  
            'Product Title:
            Dim regx1 As New Regex("td class=""ProductTitle"">.+?</td>", RegexOptions.IgnoreCase)
            Dim matches1 As MatchCollection = regx1.Matches(html)
            For Each match1 As Match In matches1
                ProductTitle += match1.Value & "^"
                ProductTitle = ProductTitle.Replace("td class=""ProductTitle"">", "").Replace("</td>", "")
            Next

            'Price (the class name in the table is "Price"):
            Dim regx2 As New Regex("td class=""Price"">.+?</td>", RegexOptions.IgnoreCase)
            Dim matches2 As MatchCollection = regx2.Matches(html)
            For Each match2 As Match In matches2
                ProductPrice += match2.Value & "^"
                ProductPrice = ProductPrice.Replace("td class=""Price"">", "").Replace("</td>", "")
            Next

            'Rating (the class name in the table is "Rating"):
            Dim regx3 As New Regex("td class=""Rating"">.+?</td>", RegexOptions.IgnoreCase)
            Dim matches3 As MatchCollection = regx3.Matches(html)
            For Each match3 As Match In matches3
                ProductRating += match3.Value & "^"
                ProductRating = ProductRating.Replace("td class=""Rating"">", "").Replace("</td>", "")
            Next

            'Create the split & add all items to listview:
            Dim split1() As String = ProductTitle.Split("^")
            Dim split2() As String = ProductPrice.Split("^")
            Dim split3() As String = ProductRating.Split("^")


            For i = 0 To split1.Count - 2
                ITEM = ListView1.Items.Add(split1(i))
                ITEM.SubItems.Add(split2(i))
                ITEM.SubItems.Add(split3(i))
            Next

        Catch ex As Exception
            'Show the error instead of silently ignoring it:
            MessageBox.Show(ex.Message)
        End Try
    End Sub
End Class



Again, the problem is that I can't know in advance which rows will have missing cells (such as in the "Price:" column), and any missing cell causes the data NOT to line up across the rows of the ListView. How could I fix this with the code that I've written above? Thanks.
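
One direction I've been wondering about is matching each table row first and then reading the cells inside that row by position, so an empty cell stays an empty string instead of vanishing from the list. This is only a rough sketch. The `RowWiseSketch` module, the `ParseRows` helper, and the shortened sample HTML are my own placeholders, and it assumes every data row has exactly four `<td>` cells in the order #, title, price, rating:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RowWiseSketch
    ' Returns one String() per data row: {title, price, rating}.
    ' Assumption: each data row holds four <td> cells (#, title, price, rating).
    Function ParseRows(html As String) As List(Of String())
        Dim rows As New List(Of String())
        Dim rowRegex As New Regex("<tr[^>]*>(.*?)</tr>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' .*? (not .+?) so that an empty <td></td> still counts as a cell.
        Dim cellRegex As New Regex("<td[^>]*>(.*?)</td>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        For Each rowMatch As Match In rowRegex.Matches(html)
            Dim cells As MatchCollection = cellRegex.Matches(rowMatch.Groups(1).Value)
            If cells.Count < 4 Then Continue For 'skips the <th> header row
            rows.Add(New String() {
                cells(1).Groups(1).Value.Trim(),
                cells(2).Groups(1).Value.Trim(),
                cells(3).Groups(1).Value.Trim()})
        Next
        Return rows
    End Function

    Sub Main()
        ' Shortened stand-in for the table above; the middle row has no price.
        Dim html As String =
            "<table>" &
            "<tr><th>#</th><th>Product Title:</th><th>Price:</th><th>Rating:</th></tr>" &
            "<tr><td>4</td><td class=""ProductTitle"">Item A</td><td class=""Price"">$16.00</td><td class=""Rating"">5/10</td></tr>" &
            "<tr><td>5</td><td class=""ProductTitle"">Item B</td><td></td><td class=""Rating"">5/10</td></tr>" &
            "<tr><td>6</td><td class=""ProductTitle"">Item C</td><td class=""Price"">$13.95</td><td class=""Rating"">7/10</td></tr>" &
            "</table>"

        For Each r As String() In ParseRows(html)
            Console.WriteLine(r(0) & " | " & r(1) & " | " & r(2))
        Next
    End Sub
End Module
```

If this is a reasonable approach, I assume each array could then go into the ListView with something like `ListView1.Items.Add(New ListViewItem(r))`, but I haven't tried that against the real pages.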