Resolved RegEx of .txt converted from PDF (to txt)

Innww

I need to convert a PDF into a format that can be loaded into a DataGridView.

The only solution I can come up with is to use iTextSharp to convert the PDF to a text file; for the most part, the layout is kept.


Here is the code that extracts the text.


VB.NET:
Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"

Dim mPDFreader As New PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages
Dim parser As New PdfReaderContentParser(mPDFreader)

' Create the text file and append the extracted text of each page to it.
Using fs As FileStream = File.Create(mTXT)
    For i As Integer = 1 To mPageCount
        Dim strategy As SimpleTextExtractionStrategy = parser.ProcessContent(i, New SimpleTextExtractionStrategy())
        Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
        fs.Write(info, 0, info.Length)
    Next
End Using

mPDFreader.Close()


The text output ends up looking like this. (also see attached copy of file.txt)


63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

Which is "pretty close".

The problem is the lines where EXPRESS is immediately followed by the next record's number (look at record 65, where record 66 starts on the same line). Every record should be on its own line, like this, to make adding the data to a DataGridView easier:

63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

My attempt was to use a RegEx to remove everything except this format:

"FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS"

In some cases a record may end a bit differently, like this:

FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order

The RegEx is:
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)

Question: does anyone have a better or cleaner solution? What I need is:

The PDF somehow converted to a format that can be loaded into a DataGridView in the appropriate rows and columns.

Any method that achieves this is appreciated.
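
To make the target concrete, here is a minimal sketch of the last step only, assuming the text has already been cleaned up so that every record sits on its own line. The module name, the function name BuildTable and the column names are hypothetical and not taken from the PDF; the split simply assumes the first three fields contain no spaces and that the date and time always take five tokens.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Data

Module GridLoader
    ' Builds a DataTable from clean lines such as
    ' "63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS".
    ' Column names are hypothetical placeholders.
    Public Function BuildTable(lines As IEnumerable(Of String)) As DataTable
        Dim table As New DataTable()
        For Each columnName As String In {"No", "TrackingId", "OrderId", "Date", "Type"}
            table.Columns.Add(columnName)
        Next

        For Each line As String In lines
            Dim t As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

            ' Skip anything that is too short to be a full record.
            If t.Length < 9 Then Continue For

            ' Tokens 3..7 are the date and time; everything from token 8 on is the type,
            ' which also covers "EXPRESS , Replacement Order".
            table.Rows.Add(t(0), t(1), t(2),
                           String.Join(" ", t, 3, 5),
                           String.Join(" ", t, 8, t.Length - 8))
        Next

        Return table
    End Function

    Sub Main()
        Dim sample As String() = {
            "63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS",
            "64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS"
        }
        Dim table As DataTable = BuildTable(sample)
        Console.WriteLine(table.Rows.Count)
    End Sub
End Module
```

In a WinForms project the resulting table could then be assigned to the grid, for example DataGridView1.DataSource = BuildTable(File.ReadAllLines(mTXT)), where DataGridView1 is a placeholder control name.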

Edit:

I am using RegEx at the moment. This is the sub:

VB.NET:
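Imports System.IO
Imports System.Text.RegularExpressions

Private Sub Fixtext()
    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")

    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While True
            Dim line As String = reader.ReadLine()

            ' ReadLine returns Nothing at end of file. "Is Nothing" is used because
            ' "= Nothing" is also True for an empty line and would stop reading early.
            If line Is Nothing Then
                Return
            End If

            Dim match As Match = regex.Match(line)
            If match.Success Then
                ' The pattern above has no capture group, so Groups(1) is empty here.
                Dim value As String = match.Groups(1).Value

                ' Prints the whole input line, not only the matched text.
                Console.WriteLine(line)
            End If
        End While
    End Using
End Sub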


The output still has a few problems.

490 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS 491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS 494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS 495 FMPC0847571813 OD119522919323643000 Aug 28, 2020 03:21 PM EXPRESS
That is one problem: it isn't removing the "Tracking Id Forms Required Order Id RTS done on Notes" header text, which should be removed.

A few lines are also still crammed together:

65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS

The result should be:

65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
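
One direction that might produce this result, sketched here under assumptions rather than tested against the real file: run the pattern over the whole text with Regex.Matches and keep only the matched text, so that fused records are split and anything between records (such as the header words) is dropped. The stricter pattern below, with four letters plus ten digits for the tracking id and "OD" plus eighteen digits for the order id, is inferred only from the sample rows shown here and may not hold for every record. The module, function and group names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RecordExtractor
    ' Assumed record layout, inferred from the sample rows only.
    Private ReadOnly RecordPattern As New Regex(
        "(?<no>\d+)\s+(?<tracking>[A-Z]{4}\d{10})\s+(?<order>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4}\s+\d{2}:\d{2}\s+[AP]M)\s+" &
        "(?<type>EXPRESS(?:\s*,\s*Replacement\s+Order)?)")

    ' Returns one String array (five fields) per record found anywhere in the text.
    Public Function ExtractRecords(text As String) As List(Of String())
        Dim rows As New List(Of String())
        For Each m As Match In RecordPattern.Matches(text)
            rows.Add(New String() {
                m.Groups("no").Value,
                m.Groups("tracking").Value,
                m.Groups("order").Value,
                m.Groups("date").Value,
                m.Groups("type").Value
            })
        Next
        Return rows
    End Function

    Sub Main()
        ' Two problem cases from above: fused records and stray header text.
        Dim sample As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS" &
            "66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS" &
            Environment.NewLine &
            "491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS " &
            "Tracking Id Forms Required Order Id RTS done on Notes" &
            "492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"

        ' Write each extracted record on its own line.
        For Each row As String() In ExtractRecords(sample)
            Console.WriteLine(String.Join(" ", row))
        Next
    End Sub
End Module
```

Because each record comes back as an array of fields, it could be added to a grid that already has five columns with something like DataGridView1.Rows.Add(row), where DataGridView1 is a placeholder name.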


Any help would be great! Thank you!

See the images below for that output (those two lines should be separated; which lines this happens to seems to be random).

1598817826414.png


1598817913340.png
 