Regular Expression Query Problem

hauptra

So, I'm using a regular expression to parse the table header out of an HTML page. I'm testing against the following data (excerpt):
VB.NET:
...
<th>Rank</th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=TEAMS.FULL_NAME&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Team</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=GAMES_PLAYED&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">G</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=TOTAL_POINTS_GAME_AVG&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Pts/G</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=TOTAL_POINTS_SCORED&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">TotPts</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=SCRIMMAGE_PLAYS&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Scrm Plys</a></th>
<th class="order2 sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=TOTAL_YARDS_GAME_AVG&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Yds/G</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=SCRIMMAGE_YDS_PLAY_AVG&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Yds/P</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=FIRST_DOWNS_GAME_AVG&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">1st/G</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=DOWN_3RD_FD_MADE&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">3rd Md</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=DOWN_3RD_ATTEMPTED&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">3rd Att</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=DOWN_4TH_FD_MADE&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">4th Md</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=DOWN_4TH_ATTEMPTED&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">4th Att</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=DOWN_4TH_PERCENTAGE&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">4th Pct</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=PENALTIES_TOTAL&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Pen</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=PENALTIES_YARDS_PENALIZED&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Pen Yds</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=TIME_OF_POSS_SECONDS&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">ToP</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=FUMBLES_TOTAL&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">FUM</a></th>
<th class="sortable">
<a href="/stats/categorystats;jsessionid=115CF01CEEC19AF19425FF0B60F00C7D?offensiveStatisticCategory=null&archive=false&seasonType=REG&defensiveStatisticCategory=GAME_STATS&d-447263-o=2&conference=null&d-447263-s=FUMBLES_LOST&d-447263-n=1&season=2006&Submit=Find&tabSeq=2&role=OPP&d-447263-p=1">Lost</a></th>
...

I tried using this pattern:

VB.NET:
<th.*>.*</th>

I called it with the following:

VB.NET:
Dim Matchobj As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(Match.Value, HEADER_PATTERN, System.Text.RegularExpressions.RegexOptions.Singleline)

I also tried calling the Matches function.

What I want is each header returned separately so I can parse the data further. What I get instead is the entire header row returned as a single match. I can see how this happens: the pattern finds the </th> at the very end and uses that. How can I make it stop at the first </th> and keep returning results, so that I get each table header item?

Thanks,
 
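Those stars (*) are greedy: each one matches as much text as it can, so with the Singleline option the pattern runs all the way to the last </th>. You have to make them lazy (*?) so that each match stops at the first </th> it reaches. Regular expressions are often not the best tool for handling HTML because of the node tree hierarchy; using the WebBrowser control is an option.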
VB.NET:
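' Requires "Imports System.Text.RegularExpressions" at the top of the file.
' Lazy quantifiers (*?) end each match at the first </th> instead of the last.
Dim headerpattern As String = "<th.*?>.*?</th>"
Dim input As String = My.Computer.FileSystem.ReadAllText("th.htm")
' Singleline lets "." match newlines, since each header spans two lines.
For Each m As Match In Regex.Matches(input, headerpattern, RegexOptions.Singleline)
    Dim result As String = m.Value ' one complete <th>...</th> element
Next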
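
If the goal is only the visible header text (Rank, Team, G, ...), a named capture group can pull it out during the same pass. The following is a rough sketch rather than a tested solution for the real page: the pattern, the module name HeaderTextDemo, and the shortened sample input are all made up for illustration, and it assumes every header is either plain text or text wrapped in a single <a> link, as in the excerpt above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HeaderTextDemo
    ' Hypothetical pattern (not from the thread): \b stops "<th" from also
    ' matching "<thead", [^>]* stays inside the opening tag, and the optional
    ' <a ...> wrapper is skipped so only the label lands in the "text" group.
    Private Const HeaderTextPattern As String = _
        "<th\b[^>]*>\s*(?:<a\b[^>]*>)?(?<text>.*?)(?:</a>)?\s*</th>"

    Sub Main()
        ' Shortened, made-up sample in the same shape as the excerpt.
        Dim input As String = _
            "<th>Rank</th>" & vbCrLf & _
            "<th class=""sortable"">" & vbCrLf & _
            "<a href=""/stats?x=1"">Team</a></th>"

        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        For Each m As Match In Regex.Matches(input, HeaderTextPattern, options)
            ' The named group holds just the label, without the markup.
            Console.WriteLine(m.Groups("text").Value)
        Next
    End Sub
End Module
```

With the sample input above this should print Rank and then Team. Anything more irregular than that (nested tags inside the link, a header with several links) is where a regular expression starts to struggle and a DOM-based approach such as the WebBrowser control becomes the safer choice.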
 