mine data from HTML page

FunkiMunky

I am trying to read data from an HTML page. The section that has the data is:

<!-- begin content --> <div class="box">
<h2 class="title">Search results - [ <i>2997 businesses found </i>]</h2>
<div class="content"><ul class="search-data">
<li><a href="?q=node/593">124 Facilities</a></li>
<li><a href="?q=node/597">2-0-2 Media</a></li>
<li><a href="?q=node/199">2.35 Research PLC</a></li>
<li><a href="?q=node/598">24-6 Cine & TV Services</a></li>
<li><a href="?q=node/599">27 Records</a></li>
<li><a href="?q=node/3029">2b Media Services</a></li>
<li><a href="?q=node/600">3 Bear Animations</a></li>
<li><a href="?q=node/6420">3-D Revolution Productions</a></li>
<li><a href="?q=node/580">3-D Revolution Productions</a></li>
<li><a href="?q=node/287">365Digital</a></li>
<li><a href="?q=node/601">3D Creations</a></li>
<li><a href="?q=node/603">3D Imaging</a></li>
<li><a href="?q=node/605">3D Jamie</a></li>
<li><a href="?q=node/7571">3D Orangepanda Digital Media</a></li>
<li><a href="?q=node/607">3DD Entertainment Ltd</a></li>
<li><a href="?q=node/289">3Dlabs</a></li>
<li><a href="?q=node/5846">3DRequest™</a></li>
<li><a href="?q=node/591">3p Underground Media UK Ltd</a></li>
<li><a href="?q=node/608">3rd Eye Broadcast Group</a></li>
<li><a href="?q=node/609">3rd Wave Graphics</a></li>
<li><a href="?q=node/610">3Sixty Media</a></li>
<li><a href="?q=node/613">422 South (Bristol)</a></li>
<li><a href="?q=node/612">422 South (Manchester)</a></li>
<li><a href="?q=node/310">7 Star Web Services</a></li>
<li><a href="?q=node/614">750mph</a></li>
<li><a href="?q=node/7197">A Bright Gem</a></li>
<li><a href="?q=node/582">A Double M Productions Ltd</a></li>
<li><a href="?q=node/615">A M Visualisation Ltd</a></li>
<li><a href="?q=node/616">A Productions</a></li>
<li><a href="?q=node/618">A Works TV Ltd</a></li>
<li><a href="?q=node/619">A. J. Murray</a></li>
<li><a href="?q=node/620">A.D. Modelmaking</a></li>
<li><a href="?q=node/621">A1 Vox Ltd</a></li>
<li><a href="?q=node/622">AAA 3D Imaging</a></li>
<li><a href="?q=node/65">Aardman Animations Ltd</a></li>
<li><a href="?q=node/625">Aardvark Swift Recruitment Ltd</a></li>
<li><a href="?q=node/626">AB Facility Vehicles</a></li>
<li><a href="?q=node/627">Abacus Film Productions Ltd</a></li>
<li><a href="?q=node/628">Abbey Home Media Group</a></li>
<li><a href="?q=node/629">About-Face Media Productions</a></li>
<li><a href="?q=node/630">Absolute Post</a></li>
<li><a href="?q=node/631">Absolute Studios</a></li>
<li><a href="?q=node/632">Absolutely Productions</a></li>
<li><a href="?q=node/633">Abstract Images</a></li>
<li><a href="?q=node/634">Acacia Productions Ltd</a></li>
<li><a href="?q=node/558">Academy</a></li>
<li><a href="?q=node/635">Academy Billiards</a></li>
<li><a href="?q=node/636">AccessMocap</a></li>
<li><a href="?q=node/637">Account - 4</a></li>
<li><a href="?q=node/638">ACE Accounting Ltd</a></li>
</ul>
</div>
</div>

<div id="pager" class="container-inline"><div class="pager-first"> </div><div class="pager-previous"><div class="pager-first"> </div></div><div class="pager-list"><strong>1</strong> <div class="pager-next"><a href="?q=business/search_data&from=50">2</a></div> <div class="pager-next"><a href="?q=business/search_data&from=100">3</a></div> <div class="pager-next"><a href="?q=business/search_data&from=150">4</a></div> <div class="pager-next"><a href="?q=business/search_data&from=200">5</a></div> <div class="pager-next"><a href="?q=business/search_data&from=250">6</a></div> <div class="pager-next"><a href="?q=business/search_data&from=300">7</a></div> <div class="pager-next"><a href="?q=business/search_data&from=350">8</a></div> <div class="pager-next"><a href="?q=business/search_data&from=400">9</a></div> <div class="pager-list-dots-right">...</div></div><div class="pager-next"><a href="?q=business/search_data&from=50">next page</a></div><div class="pager-last"><a href="?q=business/search_data&from=2950">last page</a></div></div><!-- end content -->

As you can see, there are a number of div tags and, in particular, comment markers that read <!-- begin content --> and <!-- end content -->.
I want the hrefs and the href text. I thought that maybe some straight string manipulation might do the job, splitting the text into parts. I have also been thinking that there might be a way to just get the ul HTML control directly.

Any help in the right direction would be appreciated.
 
Did you post in the ASP.Net Web Forms forum because you work in an ASP.NET environment, or because you think you're mining one?
 
Should I post this question somewhere else? I just thought that this was the right place because I'm making a VB.NET web application to get the information that I need. Just let me know where it's supposed to go and I'll put a thread there.
 
Hmm, unfortunately you can't use the Windows.Forms.WebBrowser then; it is an excellent solution, but it can't be used in ASP.Net web applications. If you can't find another DOM parser library, you'll just have to use String manipulations and/or Regex. I will mention Html Agility Pack, which I know works, although it is not the best tip here at VB.Net forums because it requires you to compile this C# class library yourself; if you're resourceful, though, you can utilize this library easily. The old MSHTML library can be used too, similar to the WebBrowser control, only a bit messier, as described in the CodeProject article "HTML Parsing using .NET Framework".
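To give an idea of the String manipulation plus Regex route, here is a rough sketch. It assumes the page source is already in a string and that the markup looks like the snippet posted above (one <li><a href="...">text</a></li> per business, sitting between the two comment markers). The module name LinkMiner and the function ExtractLinks are made up for illustration, and the pattern is only tuned to that snippet, so treat it as a starting point rather than a general HTML parser.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkMiner

    ' Hypothetical helper: returns (href, text) pairs for each
    ' <li><a href="...">...</a></li> found between the two comment markers.
    Function ExtractLinks(ByVal html As String) As List(Of KeyValuePair(Of String, String))
        Dim results As New List(Of KeyValuePair(Of String, String))()

        Const startMarker As String = "<!-- begin content -->"
        Const endMarker As String = "<!-- end content -->"

        ' Plain string manipulation first: cut the page down to the content block.
        Dim startIndex As Integer = html.IndexOf(startMarker, StringComparison.Ordinal)
        If startIndex < 0 Then Return results
        Dim endIndex As Integer = html.IndexOf(endMarker, startIndex, StringComparison.Ordinal)
        If endIndex < 0 Then Return results

        Dim content As String = html.Substring(startIndex, endIndex - startIndex)

        ' Only anchors wrapped in <li> are matched, so the pager links
        ' (which sit inside <div> elements) are skipped.
        Dim pattern As String = "<li>\s*<a\s+href=""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>\s*</li>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        For Each m As Match In Regex.Matches(content, pattern, options)
            results.Add(New KeyValuePair(Of String, String)( _
                m.Groups("href").Value, m.Groups("text").Value.Trim()))
        Next

        Return results
    End Function

    Sub Main()
        ' Shortened sample in the same shape as the posted page source.
        Dim sample As String = _
            "<!-- begin content --><ul class=""search-data"">" & _
            "<li><a href=""?q=node/593"">124 Facilities</a></li>" & _
            "<li><a href=""?q=node/597"">2-0-2 Media</a></li>" & _
            "</ul><div class=""pager-next""><a href=""?q=business/search_data&from=50"">2</a></div>" & _
            "<!-- end content -->"

        For Each link As KeyValuePair(Of String, String) In ExtractLinks(sample)
            Console.WriteLine("{0} -> {1}", link.Key, link.Value)
        Next
    End Sub

End Module
```

The captured text is returned as it appears in the source, so any HTML entities in a business name stay encoded; in a web application System.Web.HttpUtility.HtmlDecode could be applied to it. If the site changes its markup the pattern will need adjusting, which is the usual trade-off of Regex compared with a real DOM parser.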
 
Thanks for the pointers. I didn't even think I could use a Windows Forms application; I guess that would be the easier of the two options, since I don't have to worry about postbacks, persistent data, etc. Anyway, thank you very much for your invaluable help.
 