Parsing HTML Extracting Values Between 2 Tags in WebBrowser

Pjgoodi

Hi,

I am a relative newbie to VB.NET and am trying to build an app that uses a WebBrowser control within a Windows form to automatically navigate a site. However, once the app reaches its destination I need to extract certain values from the HTML, with the intent of writing these to a CSV file for comparison at a later time, and this is where I'm really struggling.

I've been Googling for hours and can't find an example that works for me. I know this is due to my lack of knowledge and I may be looking in completely the wrong places. Any help that the community can provide will be most gratefully received.

The information I would like to receive is contained within the following tags and these tags repeat several times within the page in <li> tags:
 

Attachments

  • source.txt
    13.7 KB
Hi,

You can use the GetElementsByTagName method of the WebBrowser.Document class to parse the tags that you need to extract. Have a look at this example which shows you how to get all the data in an li tag or delve even deeper and also split out all the strong tags within the li tags:-

VB.NET:
Imports System.IO
 
Public Class Form1
  Dim WithEvents myWebBrowser As New WebBrowser
 
  Private Sub GetSource()
    'you can ignore this section. I am just getting your data from a file
    'you would populate your WebBrowser from your navigation statement
    Using myReader As New StreamReader(Application.StartupPath & "\HTMLPage1.txt")
      Me.Cursor = Cursors.WaitCursor
      myWebBrowser.DocumentText = myReader.ReadToEnd
    End Using
  End Sub
 
  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    GetSource()
  End Sub
 
  Private Sub myWebBrowser_DocumentCompleted(sender As System.Object, e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles myWebBrowser.DocumentCompleted
    If myWebBrowser.ReadyState = WebBrowserReadyState.Complete Then
      Me.Cursor = Cursors.Default
      'here we can parse the tags of the HTML document using the GetElementsByTagName
      'method and then specifying the tag name to be got
 
      'this first sub just gets all the li tags and displays all the data in the tag
      GetLITags()
      'this second sub gets all the strong tags within li tags and displays all the data in the tag
      Get_LI_Plus_Strong_Tags()
    End If
  End Sub
 
  Private Sub GetLITags()
    For Each elem As HtmlElement In myWebBrowser.Document.GetElementsByTagName("li")
      If Not IsNothing(elem.InnerText) AndAlso Not elem.InnerText.Trim = String.Empty Then
        MsgBox(elem.InnerText)
      End If
    Next
  End Sub
 
  Private Sub Get_LI_Plus_Strong_Tags()
    For Each li_elem As HtmlElement In myWebBrowser.Document.GetElementsByTagName("li")
      For Each strongTag As HtmlElement In li_elem.GetElementsByTagName("strong")
        If Not IsNothing(strongTag.InnerText) AndAlso Not strongTag.InnerText.Trim = String.Empty Then
          MsgBox(strongTag.InnerText)
        End If
      Next
    Next
  End Sub
End Class

Hope that helps.

Cheers,

Ian
 
Hi Ian,

Thank you so much for your help. This sample looks almost perfect. Is it possible to extract values from specific strong tags based on their class while looping through?

<strong class="description model-name">NAME OF MODEL</strong>

Thanks again
 
Hi,

Yes, you can reference the attributes of each element to identify the one you want and then pull back the data you need using the GetAttribute method. i.e:-

VB.NET:
For Each li_elem As HtmlElement In myWebBrowser.Document.GetElementsByTagName("li")
  For Each strongTag As HtmlElement In li_elem.GetElementsByTagName("strong")
    If strongTag.GetAttribute("className") = "description model-name" Then
      If Not IsNothing(strongTag.InnerText) AndAlso Not strongTag.InnerText.Trim = String.Empty Then
        MsgBox(strongTag.InnerText)
      End If
    End If
  Next
Next

I expected this to take me about 2 minutes to write up and post, but it actually took me about 10 minutes, since nothing was returned when I tried to access the "class" attribute. After a quick search on the net, all indications are that you need to use "className" to get the class attribute within a tag and, to be honest, at this point I am not sure why.

Anyway, this does work, and at some point I will figure out why you need to use "className" to get the class attribute of an element as opposed to the actual attribute name itself.
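
One thing to watch with the comparison above: it only matches when the whole attribute string is exactly "description model-name". If the site ever reorders the classes or adds another one, the test fails. The sketch below is one possible way round that, assuming you only care about a single class such as "model-name". HasCssClass is a made-up helper name, not part of the WebBrowser API.

```vbnet
Imports System

Module ClassNameSketch
  ' Hypothetical helper (not part of the WebBrowser API):
  ' returns True when classAttribute contains wantedClass as a whole,
  ' space-separated token.
  Function HasCssClass(classAttribute As String, wantedClass As String) As Boolean
    If String.IsNullOrWhiteSpace(classAttribute) Then Return False
    Dim tokens() As String = classAttribute.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
    For Each token As String In tokens
      If String.Equals(token, wantedClass, StringComparison.Ordinal) Then Return True
    Next
    Return False
  End Function

  Sub Main()
    Console.WriteLine(HasCssClass("description model-name", "model-name")) ' prints True
    Console.WriteLine(HasCssClass("description model-name", "model"))      ' prints False
  End Sub
End Module
```

In the loop above you would then write If HasCssClass(strongTag.GetAttribute("className"), "model-name") Then in place of the exact string comparison.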

Hope that helps.

Cheers,

Ian
 
Hi Ian, you're a genius. If it took you 10 minutes it would have taken me 10 weeks, that's if I could have worked it out at all.

If you would be so kind as to offer further help along the same theme I would be most grateful. Having completed this part of the task, I am now looking to target a specific DIV tag by class and loop through the elements within that tag to extract data. With the code example you so kindly provided I can isolate the div (which resides inside another div) by class, but I cannot loop through each of the elements and subsequent divs with the same class. I feel (but may be totally wrong) that I need the target div to be part of the first line. I have tried every method I can find referenced on Google and nothing seems to work. I'm probably approaching it from completely the wrong angle. Are you able to help?


Sample html:
<div class="result">
<img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/INFINITY.png">

<span class="newpostcodetext">Enter your Postcode:</span>

<input class="newpostcode" type="text" value="" name="fpostcode">


<p class="tyreoptions">


<span class="quantitydiv">


<div class="price" title="Our FULLY FITTED Price">

<div class="season_icons_wrapper"></div>

<input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(57.58,'1620555VINF04'); if(!isset){return false;}" src="images/asda_tyre_continue.png">


<div id="label_wrap">




<div class="result">
<img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/LINGLONG.png">

<span class="newpostcodetext">Enter your Postcode:</span>

<input class="newpostcode" type="text" value="" name="fpostcode">


<p class="tyreoptions">


<span class="quantitydiv">


<div class="price" title="Our FULLY FITTED Price">

<div class="season_icons_wrapper"></div>

<input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.37,'1620555VLIGMHP010'); if(!isset){return false;}" src="images/asda_tyre_continue.png">


<div id="label_wrap">


</div>


<div class="result">
<img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/JINYU.png">

<span class="newpostcodetext">Enter your Postcode:</span>

<input class="newpostcode" type="text" value="" name="fpostcode">


<p class="tyreoptions">


<span class="quantitydiv">


<div class="price" title="Our FULLY FITTED Price">

<div class="season_icons_wrapper"></div>

<input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.43,'1620555VJIYH12'); if(!isset){return false;}" src="images/asda_tyre_continue.png">


<div id="label_wrap">


</div>


<div class="result">
<img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/INFINITY.png">

<span class="newpostcodetext">Enter your Postcode:</span>

<input class="newpostcode" type="text" value="" name="fpostcode">


<p class="tyreoptions">


<span class="quantitydiv">


<div class="price" title="Our FULLY FITTED Price">

<div class="season_icons_wrapper"></div>

<input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(57.58,'1620555VINF04'); if(!isset){return false;}" src="images/asda_tyre_continue.png">


<div id="label_wrap">


</div>
</div>


</div>


</form>
 
Hi,

Two issues here:-

1) You forgot to mention which tag you are trying to access within the HTML document. I know you said a "Div" tag but which one is unknown.

2) The sample HTML file that you have provided is an incorrectly formatted HTML document since it is missing a load of closing tags etc, etc...

That said, I have cribbed together an HTML file that I THINK matches your structure. See below:-

HTML:
<div class="result">
    <img style="float: left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/LINGLONG.png" />
    <span class="newpostcodetext">Enter your Postcode:</span>
    <input class="newpostcode" type="text" value="" name="fpostcode" />
    <p class="tyreoptions" />
    <span class="quantitydiv" />
    <div class="price" title="Our FULLY FITTED Price">
        <div class="season_icons_wrapper">
            Hello
        </div>
        <input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.37,'1620555VLIGMHP010'); if(!isset){return false;}"
            src="images/asda_tyre_continue.png" />
        <div id="label_wrap">
        </div>
    </div>
</div>
<div class="result">
    <img style="float: left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/LINGLONG.png" />
    <span class="newpostcodetext">Enter your Postcode:</span>
    <input class="newpostcode" type="text" value="" name="fpostcode" />
    <p class="tyreoptions" />
    <span class="quantitydiv" />
    <div class="price" title="Our FULLY FITTED Price">
        <div class="season_icons_wrapper">
            Hello2
        </div>
        <input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.37,'1620555VLIGMHP010'); if(!isset){return false;}"
            src="images/asda_tyre_continue.png" />
        <div id="label_wrap">
        </div>
    </div>
</div>
<div class="result">
    <img style="float: left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/LINGLONG.png" />
    <span class="newpostcodetext">Enter your Postcode:</span>
    <input class="newpostcode" type="text" value="" name="fpostcode" />
    <p class="tyreoptions" />
    <span class="quantitydiv" />
    <div class="price" title="Our FULLY FITTED Price">
        <div class="season_icons_wrapper">
            Hello3
        </div>
        <input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.37,'1620555VLIGMHP010'); if(!isset){return false;}"
            src="images/asda_tyre_continue.png" />
        <div id="label_wrap">
        </div>
    </div>
</div>

Using this, here is how you can iterate through all the DIV tags and then display the InnerText of any specific class name that you specify. Have a look here:-

VB.NET:
Imports System.IO
 
Public Class Form1
  Dim WithEvents myWebBrowser As New WebBrowser
 
  Private Sub GetSource()
    'you can ignore this section. I am just getting your data from a file
    'you would populate your WebBrowser from your navigation statement
    Using myReader As New StreamReader(Application.StartupPath & "\HTMLPage2.html")
      Me.Cursor = Cursors.WaitCursor
      myWebBrowser.DocumentText = myReader.ReadToEnd
    End Using
  End Sub
 
  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    GetSource()
  End Sub
 
  Private Sub myWebBrowser_DocumentCompleted(sender As System.Object, e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles myWebBrowser.DocumentCompleted
    If myWebBrowser.ReadyState = WebBrowserReadyState.Complete Then
      Me.Cursor = Cursors.Default
      'here we can parse the tags of the HTML document using the GetElementsByTagName
      'method and then specifying the tag to be got
 
      'this first sub just gets all the li tags and displays all the data in the tag
      'GetLITags()
      'this second sub just gets all the strong tags within li tags and displays all the data in the tag
      'Get_LI_Plus_Strong_Tags()
      IterateAllDivTags()
    End If
  End Sub
 
  Private Sub GetLITags()
    For Each elem As HtmlElement In myWebBrowser.Document.GetElementsByTagName("li")
      If Not IsNothing(elem.InnerText) AndAlso Not elem.InnerText.Trim = String.Empty Then
        MsgBox(elem.InnerText)
      End If
    Next
  End Sub
 
  Private Sub Get_LI_Plus_Strong_Tags()
    For Each li_elem As HtmlElement In myWebBrowser.Document.GetElementsByTagName("li")
      For Each strongTag As HtmlElement In li_elem.GetElementsByTagName("strong")
        If strongTag.GetAttribute("className") = "description model-name" Then
          If Not IsNothing(strongTag.InnerText) AndAlso Not strongTag.InnerText.Trim = String.Empty Then
            MsgBox(strongTag.InnerText)
          End If
        End If
      Next
    Next
  End Sub
 
  Private Sub IterateAllDivTags()
    For Each currentDiv_Element As HtmlElement In myWebBrowser.Document.GetElementsByTagName("Div")
      If currentDiv_Element.GetAttribute("className") = "season_icons_wrapper" Then
        MsgBox(currentDiv_Element.InnerText)
      End If
    Next
  End Sub
End Class

If this does not solve your issue then please post a fully qualified HTML file and highlight which DIV statement it is that you are trying to access.
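
On the point about needing the target div "to be part of the first line": GetElementsByTagName is also available on an individual HtmlElement, so a search can be scoped to one div at a time. The sketch below assumes that is what you are after. FindByClass and ShowIconText are made-up names for illustration, and the code needs a Windows Forms project (it relies on System.Windows.Forms).

```vbnet
Imports System.Collections.Generic
Imports System.Windows.Forms

Module DivSearchSketch
  ' Hypothetical helper: collects the descendants of parent that have the
  ' given tag name and whose className attribute equals className exactly.
  Function FindByClass(parent As HtmlElement, tagName As String, className As String) As List(Of HtmlElement)
    Dim matches As New List(Of HtmlElement)
    For Each candidate As HtmlElement In parent.GetElementsByTagName(tagName)
      If candidate.GetAttribute("className") = className Then
        matches.Add(candidate)
      End If
    Next
    Return matches
  End Function

  ' Example use: visit every "result" div, then search only inside that div.
  Sub ShowIconText(browser As WebBrowser)
    If browser.Document Is Nothing OrElse browser.Document.Body Is Nothing Then Return
    For Each resultDiv As HtmlElement In FindByClass(browser.Document.Body, "div", "result")
      For Each inner As HtmlElement In FindByClass(resultDiv, "div", "season_icons_wrapper")
        MessageBox.Show(inner.InnerText)
      Next
    Next
  End Sub
End Module
```

You would call ShowIconText(myWebBrowser) from the DocumentCompleted handler. Because the inner search starts from resultDiv rather than from the document, each value stays tied to the "result" div it came from.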

Hope that helps.

Cheers,

Ian
 
Hi Ian,

Once again thank you for your help. I am so sorry for the poorly formatted html. I’m trying to access each <div class="result"> in turn and retrieve values/innertext for <span class="tyre_brand">, <div class="price" title="Our FULLY FITTED Price"> from each. I’ve had very limited success with methods similar to the ones you’ve kindly provided and have written lots of code chunks and subsequently deleted them. I never get beyond the first <div class="result"> in my returned results.

HTML:
<div id="formcontent" class="tyreshere"><div class="result"><img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/JINYU.png"><span class="newpostcodetext">Enter your Postcode:</span><input class="newpostcode" type="text" value="" name="fpostcode"><p class="tyreoptions"><span class="tyre_brand">Jinyu</span><br> 205/55R16 91V YH12<br></p><span class="quantitydiv"><select class="forminput auto strong" name="number[1620555VJIYH12]"></span><div class="price" title="Our FULLY FITTED Price">£50.<span class="price_dec">43</span></div><div class="season_icons_wrapper"></div><input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(50.43,'1620555VJIYH12'); if(!isset){return false;}" src="images/asda_tyre_continue.png"><div id="label_wrap"><span class="fuel_r">E</span><span class="wet_r">B</span><span class="noise_r">70dB</span></div></div><div class="result"><img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/INFINITY.png"><span class="newpostcodetext">Enter your Postcode:</span><input class="newpostcode" type="text" value="" name="fpostcode"><p class="tyreoptions"><span class="tyre_brand">Infinity</span><br> 205/55R16 91V INF-040<br></p><span class="quantitydiv"><select class="forminput auto strong" name="number[1620555VINF04]"><option value="1">1 Tyre</option><option value="2">2 Tyres</option><option value="3">3 Tyres</option><option value="4">4 Tyres</option></select></span><div class="price" title="Our FULLY FITTED Price">£57.<span class="price_dec">58</span></div><div class="season_icons_wrapper"></div><input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(57.58,'1620555VINF04'); if(!isset){return false;}" src="images/asda_tyre_continue.png"><div id="label_wrap"><span class="fuel_r">F</span><span class="wet_r">E</span><span class="noise_r">72dB</span></div></div><div class="result"><img style="float:left; margin: -5px 0px 0px 0px;" 
src="http://www.asdatyres.co.uk/brands/NANKANG.png"><span class="newpostcodetext">Enter your Postcode:</span><input class="newpostcode" type="text" value="" name="fpostcode"><p class="tyreoptions"><span class="tyre_brand">Nankang</span><br> 205/55R16 91V AS-1<br></p><span class="quantitydiv"><select class="forminput auto strong" name="number[1620555VNAAS1]"><option value="1">1 Tyre</option><option value="2">2 Tyres</option><option value="3">3 Tyres</option><option value="4">4 Tyres</option></select></span><div class="price" title="Our FULLY FITTED Price">£62.<span class="price_dec">92</span></div><div class="season_icons_wrapper"></div><input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(62.92,'1620555VNAAS1'); if(!isset){return false;}" src="images/asda_tyre_continue.png"><div id="label_wrap"><span class="fuel_r">E</span><span class="wet_r">C</span><span class="noise_r">71dB</span></div></div><div class="result"><img style="float:left; margin: -5px 0px 0px 0px;" src="http://www.asdatyres.co.uk/brands/MARSHAL.png"><span class="newpostcodetext">Enter your Postcode:</span><input class="newpostcode" type="text" value="" name="fpostcode"><p class="tyreoptions"><span class="tyre_brand">Marshal</span><br> 205/55R16 91V Matrac XM KH35<br></p><span class="quantitydiv"><select class="forminput auto strong" name="number[1620555VMAKH35]"><option value="1">1 Tyre</option><option value="2">2 Tyres</option><option value="3">3 Tyres</option><option value="4">4 Tyres</option></select></span><div class="price" title="Our FULLY FITTED Price">£64.<span class="price_dec">79</span></div><div class="season_icons_wrapper"></div><input class="selectfittingcentre" type="image" onclick=" var isset = updatepostcodes(64.79,'1620555VMAKH35'); if(!isset){return false;}" src="images/asda_tyre_continue.png"><div id="label_wrap"><span class="fuel_r">C</span><span class="wet_r">C</span><span class="noise_r">75dB</span></div></div>
</form>
 
Hi,

What a nightmare! Before I actually post an answer to this I need to ask you a question:-

Is that last HTML file a TRUE representation of what you need to deal with or is this something you have cribbed together?

The reason that I ask this is that it took me 2 minutes to get every product and price for every "result" Div element except for the first one! It took me over an hour to figure out why the first element would NOT return the price and this is again to do with an incorrectly formatted HTML file which is missing "tons, and I mean loads" of closing tags which throws everything into disarray.

Please do answer my above question and I will post a suggested solution based on what you are dealing with.

Cheers,

Ian
 
Hi Ian,

This is a section of a much larger page extracted from Firebug. I was having a similar problem when reading it directly via the WebBrowser component, but I ended up with the first result only. I only made the transition from VBA to VB around 6 weeks ago and it's a very steep learning curve. Below is the full "source" for the page, extracted directly from IE. Again, thanks for your help, this has been driving me crazy.
 

Attachments

  • source.txt
    130 KB
Hi,

What a headache this one has been. This was a fully qualified file this time, which helped, but when you apply this HTML code to a WebBrowser control to be able to iterate the tags, the JScript in the document causes all sorts of errors and keeps throwing the JIT debugger. Not sure why, and I am not that good with JScript.

So to get round this, I have used RegEx to split the document into 3 portions, thereby eliminating the JScript from the portion of the document that we need to work with. This "clean portion" of the HTML code is then passed to a WebBrowser object, which can then be iterated successfully to get the Tyre information.

Have a look at the code below and read the comments to see what is going on:-

VB.NET:
Imports System.IO
Imports System.Text.RegularExpressions
 
Public Class Form1
  Dim WithEvents myWebBrowser As New WebBrowser
  Dim myTyreDetails As New List(Of TyreInfo)
 
  Private Sub GetSource()
    'you can ignore this section. I am just getting your data from a file
    'you would populate your WebBrowser from your navigation statement
 
    'due to the problems associated with the JScript on the page we have to strip
    'out the section that is causing the error so that we can work with the section
    'that we need. That being the body of the form
    'To do this I use RegEx to find the Form start and End point and then split those
    'I then do not want the first line since I have corrupted this with the first split
    'So I split again the portion I need, into lines this time, and then reconstitute
    'ignoring the first line and adding to the WebBrowser object
 
    Dim myRegExSplitter As New Regex("<form method=|</form>")
    Using myReader As New StreamReader(Application.StartupPath & "\HTMLSource.html")
      Dim strWebPortions() As String = myRegExSplitter.Split(myReader.ReadToEnd)
      'split on the full CR/LF pair; CChar(vbCrLf) would keep only the CR character
      Dim strWebLines() As String = strWebPortions(1).Split(New String() {vbCrLf}, StringSplitOptions.None)
      'TextBox1.Text = String.Join(vbCrLf, strWebLines, 1, strWebLines.Length - 2)
      Me.Cursor = Cursors.WaitCursor
      myWebBrowser.DocumentText = String.Join(vbCrLf, strWebLines, 1, strWebLines.Length - 2)
    End Using
  End Sub
 
  Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    GetSource()
  End Sub
 
  Private Sub myWebBrowser_DocumentCompleted(sender As System.Object, e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles myWebBrowser.DocumentCompleted
    If myWebBrowser.ReadyState = WebBrowserReadyState.Complete Then
      Me.Cursor = Cursors.Default
      GetTyreInformation()
 
      'Once everything has been done you can iterate through your list of TyreInfo with:-
      For Each currentTyreInfo As TyreInfo In myTyreDetails
        MsgBox(String.Format("The tyre name is {0} with a price of {1}", currentTyreInfo.TyreName, currentTyreInfo.TyrePrice.ToString("c")))
      Next
    End If
  End Sub
 
  Private Sub GetTyreInformation()
    For Each currentDiv_Element As HtmlElement In myWebBrowser.Document.GetElementsByTagName("div")
      If currentDiv_Element.GetAttribute("className") = "result" Then
        Dim CurrentTyreInfo As New TyreInfo
        For Each currentSpan_Element As HtmlElement In currentDiv_Element.GetElementsByTagName("span")
          If currentSpan_Element.GetAttribute("className") = "tyre_brand" Then
            CurrentTyreInfo.TyreName = currentSpan_Element.InnerText
          End If
        Next
        For Each price_Element As HtmlElement In currentDiv_Element.GetElementsByTagName("div")
          If price_Element.GetAttribute("className") = "price" Then
            CurrentTyreInfo.TyrePrice = Decimal.Parse(price_Element.InnerText, Globalization.NumberStyles.Currency)
          End If
        Next
        myTyreDetails.Add(CurrentTyreInfo)
      End If
    Next
  End Sub
End Class
 
'Create a custom class to hold the tyre information
Public Class TyreInfo
  Public Property TyreName As String
  Public Property TyrePrice As Decimal
End Class
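
Since the original aim was to write the values to a CSV file, here is a rough sketch of that last step. It makes a few assumptions that are not in the code above: the prices are always UK-style text such as "£50.43", the TyreRow class simply mirrors TyreInfo (renamed so the two can sit side by side), and "tyres.csv" is a placeholder path. Passing the en-GB culture explicitly means the parse does not depend on the regional settings of the PC, whereas the Decimal.Parse call above uses whatever culture the machine is set to.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.IO

' Same shape as the TyreInfo class above (hypothetical stand-in).
Public Class TyreRow
  Public Property TyreName As String
  Public Property TyrePrice As Decimal
End Class

Module CsvExportSketch
  ' Parse text such as "£50.43" using UK rules, whatever the PC's regional settings.
  Function ParseUkPrice(priceText As String) As Decimal
    Return Decimal.Parse(priceText.Trim(), NumberStyles.Currency, CultureInfo.GetCultureInfo("en-GB"))
  End Function

  ' Wrap a field in quotes when it contains a comma, a quote or a line break.
  Function CsvField(value As String) As String
    If value Is Nothing Then Return String.Empty
    If value.IndexOfAny(New Char() {","c, """"c, ChrW(13), ChrW(10)}) >= 0 Then
      Return """" & value.Replace("""", """""") & """"
    End If
    Return value
  End Function

  ' Build a header line followed by one line per tyre.
  Function ToCsvLines(rows As IEnumerable(Of TyreRow)) As List(Of String)
    Dim lines As New List(Of String)
    lines.Add("TyreName,TyrePrice")
    For Each row As TyreRow In rows
      lines.Add(CsvField(row.TyreName) & "," & row.TyrePrice.ToString(CultureInfo.InvariantCulture))
    Next
    Return lines
  End Function

  Sub Main()
    Dim rows As New List(Of TyreRow)
    rows.Add(New TyreRow With {.TyreName = "Jinyu", .TyrePrice = ParseUkPrice("£50.43")})
    rows.Add(New TyreRow With {.TyreName = "Infinity", .TyrePrice = ParseUkPrice("£57.58")})
    ' "tyres.csv" is a placeholder; use whatever path suits your app.
    File.WriteAllLines("tyres.csv", ToCsvLines(rows))
  End Sub
End Module
```

With the two sample rows the file contains the lines TyreName,TyrePrice then Jinyu,50.43 then Infinity,57.58. In the form above you could build the lines from myTyreDetails in the same way once GetTyreInformation has finished.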

Hope that helps.

Cheers,

Ian
 