Question Removing invalid links from all files inside a folder

ZERO_COOL (Member · Joined: Aug 7, 2016 · Messages: 5 · Programming Experience: Beginner)
I want to create a tool that removes invalid links from all files of a given type (say, *.txt) inside a directory.
Sample file:
<sec id="sec1">
<p>"You fig. 23 did?" I <a href rid="sec12">section 12</a> asked, surprised.</p>
<p>"Cross sent it table 9 to me a few weeks ago." Stanton crossed over to my mother, taking her hand in his. "I <a href rid="sec2">section 2</a> couldn"t have argued for better terms."</p>
<p>"There are always better terms <a href rid="sec6">section 6</a>, Richard!" my mom said sharply.</p>
<p>"Of course, I <a href rid="sec2">section 2</a> didn"t know." He pulled her into his arms, crooning softly like he would wit table 9h a child. "I <a href rid="sec2">section 2</a> assumed he was looking ahead.</p>
<p>I <a href rid="sec2">section 2</a> stood. I <a href rid="sec2">section 2</a> had to hurry if I <a href rid="sec2">section 2</a> was going to get to work on time. Today of all days, I <a href rid="sec2">section 2</a> didn"t want to be late.
<fig id="fig4">
<caption><p>I'm confused</p></caption>
</fig>
</p>
<p>Turning to face her, I <a href rid="sec2">section 2</a> walked backward. "I"ve seriously got to get ready. Why don"t we get together for lunch and talk more then?"</p>
<sec id="sec2">
<p>"You fig. 23 can"t be""</p>
<p>I <a href rid="sec4">section 4</a> adored the Art Deco elegance of the Chrysler Building. I <a href rid="sec2">section 2</a> could pinpoint my place on the island in relation to the posit table 9ion of the Empire State Building. I <a href rid="sec2">section 2</a> was awed by the breathtaking height of the Freedom Tower that now dominated downtown. But the Crossfire Building was in a class by it table 9self.</p>
<p>I <a href rid="sec1">section 1</a> felt Gideon before I <a href rid="sec1">section 1</a> saw him, my entire body humming wit table 9h awareness as he stepped out of the Bentley, which had pulled up behind the Benz. The air around me charged wit table 9h electricit table 9y, the crackling energy that always heralded the approach of a storm.</p>
</sec>
</sec>

I want to remove all invalid "section" link tags from the files. For each ``rid="secDIGIT"`` in a file, the tool should check whether the same file contains a matching ``<sec id="secSAMEDIGIT">``. If it does, move on to the next link. If it does not, delete only the tags, i.e. ``<a href rid="secDIGIT">`` and ``</a>``, and keep the text between them.
The code I've written so far is incomplete and probably full of flaws. Can anyone help?
Code:
Imports System.IO
Imports System.Text.RegularExpressions
Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        If FolderBrowserDialog1.ShowDialog = DialogResult.OK Then
            TextBox1.Text = FolderBrowserDialog1.SelectedPath
        End If
    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        Dim targetDirectory As String
        targetDirectory = TextBox1.Text
        Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt")
        For Each txtFile In txtFilesArray
            Dim FileInfo As New FileInfo(txtFile)
            Dim FileLocation As String = FileInfo.FullName
            Dim input() As String = File.ReadAllLines(FileLocation)
            Dim pattern As String = "(?<=rid="sec)(\d+)(?=">)"
            Dim r As Regex = New Regex(pattern)
            Dim m As Match = r.Match(input)
            If (m.Success) Then
                Dim x As String = " id=""sec" + pattern + """"
                Dim r2 As Regex = New Regex(x)
                Dim m2 As Match = r2.Match(input)
                If (m2.Success) Then
                    Dim tgPat As String = "<a href rid=""sec + pattern +"">(\w+) (\d+)</a>"
                    Dim tgRep As String = "$1 $2"
                    Dim tgReg As New Regex(tgPat)
                    Dim result1 As String = tgReg.Replace(input, tgRep)
                Else
                End If
            End If
        Next
    End Sub
End Class
 

It would be more efficient to make one pass that collects all valid IDs (a Regex.Matches loop), then one pass that processes all links (Regex.Replace with a MatchEvaluator function that checks each ID and returns the appropriate replacement).

You need to check the whole file, so replace File.ReadAllLines with File.ReadAllText to get the input as a single string.
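
The first pass mentioned above (collecting the valid IDs with a Regex.Matches loop) could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the module name ValidIdSketch and the function name CollectSectionIds are hypothetical, and the pattern assumes the section tags look exactly like the ``<sec id="secN">`` tags in the sample file.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ValidIdSketch
    ' Hypothetical helper (not from the original post): returns every id
    ' declared by a <sec id="secN"> tag in the given text.
    Function CollectSectionIds(input As String) As HashSet(Of String)
        Dim ids As New HashSet(Of String)()
        ' Assumed tag shape: the id attribute directly follows "<sec", as in the sample file.
        For Each m As Match In Regex.Matches(input, "<sec\s+id=""(?<id>sec\d+)""")
            ids.Add(m.Groups("id").Value)
        Next
        Return ids
    End Function
End Module
```

For the sample file in the question, where the only section tags are ``<sec id="sec1">`` and ``<sec id="sec2">``, the returned set would hold sec1 and sec2. The ``<fig id="fig4">`` tag is ignored because the pattern only looks at sec tags. A HashSet is used here so that each later lookup is a cheap Contains call rather than another scan of the file.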
 
Could you please elaborate with a bit of code? I'm new to coding, so I'm not familiar with how to use a MatchEvaluator.
 
I still cannot figure out how to apply "Regex.Replace Method (String, String, MatchEvaluator)" successfully in my program.
 
In the MatchEvaluator function, check whether the link ID is among the valid ones. If it is, return the link unchanged (match.Value); otherwise return only the text part (the corresponding match.Groups value).
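
A minimal sketch of such a MatchEvaluator is shown below. It is an illustration under assumptions, not the poster's code: the module name LinkCleanerSketch, the function name RemoveInvalidLinks and the group names id and text are all hypothetical, the link pattern assumes the exact ``<a href rid="secN">…</a>`` shape from the sample file, and validIds is assumed to have been filled beforehand with the section IDs found in the same file.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkCleanerSketch
    ' Hypothetical helper: unwraps <a href rid="secN">...</a> links whose
    ' target id is not in validIds; links with a valid target are left as-is.
    Function RemoveInvalidLinks(input As String, validIds As HashSet(Of String)) As String
        ' Assumed link shape, taken from the sample file in the question.
        Dim linkPattern As String = "<a href rid=""(?<id>sec\d+)"">(?<text>[^<]*)</a>"
        ' The evaluator is called once per matched link and returns its replacement.
        Dim evaluator As MatchEvaluator =
            Function(m As Match) As String
                If validIds.Contains(m.Groups("id").Value) Then
                    Return m.Value              ' valid target: keep the whole link
                End If
                Return m.Groups("text").Value   ' invalid target: keep only the inner text
            End Function
        Return Regex.Replace(input, linkPattern, evaluator)
    End Function
End Module
```

As a usage example, with validIds containing sec1 and sec2, the input ``I <a href rid="sec12">section 12</a> asked, <a href rid="sec2">section 2</a>`` comes back as ``I section 12 asked, <a href rid="sec2">section 2</a>``: the sec12 link is unwrapped and the sec2 link is untouched.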
 
This code seems to work (the tags have changed from "<a href" to "<xref ref-type="section" rid", and from </a> to </xref>):

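Dim targetDirectory As String = TextBox1.Text
Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.xml")
For Each txtFile In txtFilesArray
    ' Read the whole file as one string so an id anywhere in the file can be found.
    Dim input As String = File.ReadAllText(txtFile)
    ' Matches an xref link to a section; captures the target id and the link text.
    Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>", RegexOptions.IgnoreCase)
    Dim result As String = xref.Replace(input, Function(xyz)
        ' Look for any element carrying the id this link points to.
        Dim sec As New Regex(" id=""" & xyz.Groups("id").Value & """")
        ' Keep the link if the id exists; otherwise keep only its text.
        Return If(sec.IsMatch(input), xyz.Value, xyz.Groups("content").Value)
    End Function)
    File.WriteAllText(txtFile, result)
Next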

Does anyone have another way of doing this, or suggestions for shortening or improving the code?
 