Extract data from html

moonlord · May 19, 2008

Hi all,
I wanted to ask for your opinion on something.
I have a webpage, let's say serials.htm, containing a description text at the top of the page. Below this text I have 100 serials in this format: a1c2-b3d4-e5f6-g7h8. The serials are one per line. Is there any way to extract only these serials from the HTML and put them in a ListBox? I have searched all the resources I could find so far and didn't find anything similar.
I am trying to make a simple app that reads the serials one by one, line by line, for each user of my app. Please get back to me on this.
Thanks in advance for your help.

JohnH · May 19, 2008

Load it into a WebBrowser and get yourself into the Document property. When you have read the Html source code you know its structure and perhaps there is a table where the serials are in the table cells. So you make two calls to GetElementsByTagName method and iterate the results.

Cheetah · May 19, 2008

JohnH said:
Load it into a WebBrowser and get yourself into the Document property. When you have read the Html source code you know its structure and perhaps there is a table where the serials are in the table cells. So you make two calls to GetElementsByTagName method and iterate the results.

This is what I would recommend for a beginner.

But personally, I would download the HTTP stream, or load the file, and use regular expressions to extract the data you want.
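
For illustration, here is a minimal sketch of the regular expression route. It assumes the page source is already in a string and that every serial has four hyphen-separated groups of four letters or digits, like the a1c2-b3d4-e5f6-g7h8 example in the first post. The module and function names are made up for this example.

```vb
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SerialRegexSketch
    ' Assumed format: xxxx-xxxx-xxxx-xxxx, letters or digits only.
    Private Const SerialPattern As String = "\b[0-9a-z]{4}(?:-[0-9a-z]{4}){3}\b"

    ' Returns every substring of the page source that looks like a serial.
    Function ExtractSerials(html As String) As List(Of String)
        Dim serials As New List(Of String)
        For Each m As Match In Regex.Matches(html, SerialPattern, RegexOptions.IgnoreCase)
            serials.Add(m.Value)
        Next
        Return serials
    End Function
End Module
```

The result could then be pushed into the list with something like ListBox1.Items.AddRange(ExtractSerials(html).ToArray()), where ListBox1 is whatever your ListBox is called. Any other text on the page that happens to match the same shape would be picked up too.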

moonlord · May 19, 2008

No, the serials are not in a table, JohnH, so it would be a little harder to get them. Cheetah, by regular expressions do you mean regex?

JohnH · May 19, 2008

Using the browser Document DOM is still the easiest way to get the nodes you want initially, even if you have to split the node text afterwards. There is no reason to reinvent the wheel of parsing out the DOM tree yourself with regex when it's so easily available, it usually saves a lot of unnecessary complex regex.

Cheetah · May 19, 2008

JohnH said:
Using the browser Document DOM is still the easiest way to get the nodes you want initially, even if you have to split the node text afterwards. There is no reason to reinvent the wheel of parsing out the DOM tree yourself with regex when it's so easily available, it usually saves a lot of unnecessary complex regex.

That's a fair shout, but if you do choose to do it JohnH's way, instead of using the WebBrowser to navigate to the page, I would download the stream text and load it into the WebBrowser. It will save you having to handle events to check when the page has downloaded, etc.
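
A rough sketch of the download step, assuming System.Net.WebClient is used to fetch the page source as text. The function name is invented for the example, and the address is whatever URL your serials page lives at.

```vb
Imports System.Net

Module PageDownloadSketch
    ' Downloads the resource at the given address and returns it as a string.
    ' This call blocks until the download has finished.
    Function DownloadPage(address As String) As String
        Using client As New WebClient()
            Return client.DownloadString(address)
        End Using
    End Function
End Module
```

The returned string can be parsed directly, or assigned to the WebBrowser's DocumentText property. As far as I know the control still builds its Document asynchronously after DocumentText is set, so DocumentCompleted may still be needed before reading the DOM.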

(That's IIRC from when I last used the Web Browser)

moonlord · May 19, 2008

Thanks guys, I'll go to work and try your tips. Thanks a lot

moonlord · May 19, 2008

JohnH, can you please tell me how I can use this code to extract the serials?

I tried using the <pre> and </pre> tags, but it doesn't return the keys:

webBrowser1.DocumentText = _
""

The html code where the serials are found is like this:

</center>
 
Validated Key List: <pre>
abcd-efgh-1234-5678
defg-hijk-3456-6789
...........................
</pre>
 100 Valid Keys Generated 

Thanks.

JohnH · May 19, 2008

Post 3 here is a sample of how you might start.

moonlord · May 19, 2008

I get an "Object reference not set to an instance of an object." error on this line: "For Each anchor As HtmlElement In webbrowser.Document.GetElementsByTagName("a")". Also, that guy has an href tag; I do not have such a thing.
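
One likely cause of that error is that the WebBrowser's Document property is still Nothing because the page has not finished loading when the loop runs. The sketch below waits for the DocumentCompleted event and then reads the text of any pre elements, splitting it into lines. The form, the control names and the URL are placeholders, and it assumes the serials sit inside a pre element as in the markup posted above.

```vb
Imports System
Imports System.Windows.Forms
Imports Microsoft.VisualBasic

Public Class SerialForm
    Inherits Form

    Private WithEvents browser As New WebBrowser()
    Private serialList As New ListBox()

    Public Sub New()
        serialList.Dock = DockStyle.Fill
        browser.Visible = False
        Controls.Add(serialList)
        Controls.Add(browser)
        ' Placeholder address; replace with the real page.
        browser.Navigate("http://example.com/serials.htm")
    End Sub

    ' Runs once the page has loaded, so Document is available here.
    Private Sub browser_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles browser.DocumentCompleted
        If browser.Document Is Nothing Then Return
        Dim separators As String() = {vbCrLf, vbLf, vbCr}
        For Each pre As HtmlElement In browser.Document.GetElementsByTagName("pre")
            Dim text As String = pre.InnerText
            If text Is Nothing Then Continue For
            ' One serial per line; blank lines are dropped.
            For Each line As String In text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                serialList.Items.Add(line.Trim())
            Next
        Next
    End Sub
End Class
```

If the page has other pre blocks, their lines would be added as well, so some extra filtering may be needed.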

moonlord · May 19, 2008

OK, solved the error with a Try...Catch, but the web browser does not load what it should; it loads the entire page.

moonlord · May 19, 2008

OK. I have tested your code to see if it works:

Dim occurences As New List(Of String)
For Each anchor As HtmlElement In webbrowser.Document.GetElementsByTagName("a")
    Dim href As String = anchor.GetAttribute("href")
    occurences.Add(href.Substring(href.IndexOf("?act=") + 5))
Next

It doesn't do anything; it loads the entire page, no matter what content it has. I added some text before the a href tag and it gets that text as well.

JohnH · May 19, 2008

Ok, maybe you can use Regex instead.
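
As a starting point, this sketch captures whatever sits between the first <pre> and </pre> tags and splits it into lines, which matches the markup posted earlier in the thread. The module and function names are invented for the example, and it assumes the page source is already available as a string.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PreBlockSketch
    ' Returns the non-empty lines found inside the first <pre>...</pre> block.
    Function ExtractPreLines(html As String) As List(Of String)
        Dim lines As New List(Of String)
        ' Singleline lets "." match line breaks, so the block can span many lines.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim m As Match = Regex.Match(html, "<pre[^>]*>(.*?)</pre>", options)
        If Not m.Success Then Return lines

        Dim separators As String() = {vbCrLf, vbLf, vbCr}
        For Each line As String In m.Groups(1).Value.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim trimmed As String = line.Trim()
            If trimmed.Length > 0 Then lines.Add(trimmed)
        Next
        Return lines
    End Function
End Module
```

Because only the pre block is read, the description text at the top of the page and the "100 Valid Keys Generated" footer are left out.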
