Extract data from HTML

moonlord

Hi all,
I wanted to ask for your opinion on something.
I have a webpage, let's say serials.htm, containing a description text at the top of the page. Below this text I have 100 serials in the format a1c2-b3d4-e5f6-g7h8, one per line. Is there any way to extract only these serials from the HTML and put them in a ListBox? I have searched all the resources I could find so far and didn't find anything similar.
I am trying to make a simple app which will fetch the serials line by line, one at a time, for each user of my app. Please get back to me on this.
Thanks in advance for your help
 
Load it into a WebBrowser control and work with its Document property. Once you have read the HTML source you will know its structure; perhaps there is a table with the serials in the table cells. In that case you make two calls to the GetElementsByTagName method and iterate the results.
 
This is what I would recommend for a beginner.

But personally, I would download the HTTP stream, or load the file, and use regular expressions to extract the data you want.
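
As a rough illustration of the regular-expression route, here is a minimal sketch. It assumes every serial is four groups of four letters or digits separated by hyphens, as in the first post; the module and function names are made up for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SerialRegexSketch
    ' Assumed format: four groups of four letters or digits joined by hyphens,
    ' e.g. a1c2-b3d4-e5f6-g7h8 (taken from the first post).
    Private ReadOnly SerialPattern As New Regex("\b[0-9A-Za-z]{4}(?:-[0-9A-Za-z]{4}){3}\b")

    ' Returns every serial-shaped token found in the given HTML text,
    ' in the order in which it appears.
    Public Function ExtractSerials(ByVal html As String) As List(Of String)
        Dim serials As New List(Of String)
        For Each m As Match In SerialPattern.Matches(html)
            serials.Add(m.Value)
        Next
        Return serials
    End Function
End Module
```

The HTML string could come from System.Net.WebClient.DownloadString for a remote page or System.IO.File.ReadAllText for a local file. Each returned item can then be added to the ListBox's Items collection. If the description text at the top of the page could contain something serial-shaped, the pattern would need tightening.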
 
No, the serials are not in a table, JohnH, so it would be a little harder to get them. Cheetah, by regular expressions do you mean regex?
 
Using the browser's Document DOM is still the easiest way to get the nodes you want initially, even if you have to split the node text afterwards. There is no reason to reinvent the wheel by parsing the DOM tree yourself with regex when it is so readily available; it usually saves a lot of unnecessarily complex regex.
 
That's a fair shout, but if you do choose to do it JohnH's way, instead of using the WebBrowser to navigate to the page, I would download the stream text and load it into the WebBrowser. It will save you from having to handle events to check when the page has downloaded, etc.

(That's IIRC from when I last used the Web Browser)
 
JohnH, can you please tell me how I can use this code to extract the serials?

I tried using the <pre> and </pre> tags, but it doesn't return the keys.

webBrowser1.DocumentText = _
""

The HTML code where the serials are found looks like this:

</center>
<br>
Validated Key List: <br><br><pre>
abcd-efgh-1234-5678
defg-hijk-3456-6789
...........................
</pre>
<br>100 Valid Keys Generated<br>

Thanks.
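
For a block like the one above, one possible approach is to take the text of the <pre> element (for example its InnerText, once the document is available) and split it into lines. The sketch below shows only that splitting step. The helper name is hypothetical, and it assumes each key is four hyphen-separated groups of four characters, so filler lines such as the row of dots are dropped.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module PreBlockSketch
    ' Splits the text of a <pre> block into lines and keeps only the lines
    ' shaped like xxxx-xxxx-xxxx-xxxx (an assumption based on the sample keys).
    Public Function SerialLines(ByVal preText As String) As List(Of String)
        Dim result As New List(Of String)
        Dim separators() As Char = {ControlChars.Cr, ControlChars.Lf}
        For Each rawLine As String In preText.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim line As String = rawLine.Trim()
            Dim groups() As String = line.Split("-"c)
            ' Keep the line only if it has exactly four groups of four characters.
            If groups.Length = 4 AndAlso Array.TrueForAll(groups, Function(g As String) g.Length = 4) Then
                result.Add(line)
            End If
        Next
        Return result
    End Function
End Module
```

Each item in the returned list can then be passed to the ListBox's Items.Add method.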
 
I get an "Object reference not set to an instance of an object." error on this line: "For Each anchor As HtmlElement In webbrowser.Document.GetElementsByTagName("a")". Also, that guy's example has an href attribute; I do not have such a thing.
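
One likely cause of that error (an assumption, since the calling code is not shown) is that the Document property is still Nothing because the page has not finished loading when the loop runs. A minimal sketch of a workaround is to do the work in the DocumentCompleted event and to look for "pre" elements instead of "a". The form, the control names and the URL below are all hypothetical.

```vbnet
Imports System
Imports System.Windows.Forms
Imports Microsoft.VisualBasic

Public Class SerialForm
    Inherits Form

    Private ReadOnly browser As New WebBrowser()
    Private ReadOnly serialList As New ListBox()

    Public Sub New()
        serialList.Dock = DockStyle.Fill
        browser.Visible = False ' the browser is only used as a parser here
        Controls.Add(serialList)
        Controls.Add(browser)
        AddHandler browser.DocumentCompleted, AddressOf OnDocumentCompleted
        ' Hypothetical address; replace with the real location of serials.htm.
        browser.Navigate("http://example.com/serials.htm")
    End Sub

    ' Runs once the page has loaded, so Document should be available here.
    Private Sub OnDocumentCompleted(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs)
        Dim doc As HtmlDocument = browser.Document
        If doc Is Nothing Then Return ' defensive guard against the error above

        Dim separators() As Char = {ControlChars.Cr, ControlChars.Lf}
        For Each pre As HtmlElement In doc.GetElementsByTagName("pre")
            If pre.InnerText Is Nothing Then Continue For
            ' One key per line inside the <pre> block; filler lines are not
            ' filtered in this sketch.
            For Each line As String In pre.InnerText.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                serialList.Items.Add(line.Trim())
            Next
        Next
    End Sub
End Class
```

The form can be shown with Application.Run(New SerialForm()) from a Sub Main marked with the STAThread attribute.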
 
OK, I have tested your code to see if it works:


Dim occurences As New List(Of String)
For Each anchor As HtmlElement In webbrowser.Document.GetElementsByTagName("a")
    Dim href As String = anchor.GetAttribute("href")
    occurences.Add(href.Substring(href.IndexOf("?act=") + 5))
Next

It doesn't do anything useful: it loads the entire page, no matter what content is on it. I added some text before the a href tag and it picks up that text as well.
 
Ok, maybe you can use Regex instead.
 