Question Splitting a CSV file line array using Regex

PeterM

Member
Joined
Jan 5, 2011
Messages
8
Location
Wiltshire, UK
Programming Experience
1-3
Hi all,

I am fairly new to VB.net, I used to do a bit of coding in early versions of VB years ago but moved on to things like RPG400, CL and SQL on an AS/400 box. As I am no longer in this career, I havent touched an AS/400 for a while either.

I am currently working on a visual basic console application in Visual Studio 2005 which takes various csv files from an osCommerce web shop, processes them and writes the data to TAS Books accounts software with the aid of the Infoplex COM module.

I have 2 functions, one reads each line of a csv file into an array, and the second function uses Regex to split the lines array into the various fields before they are added to a structure.

There are something like 5000 records (lines) in the csv file, one example is as follows;

"T-GPMM1070","699","Asst. Heat Shrink Tube 3"(76mm)Pk12","","3.28","17.5000","0.57","3.85","0","Kits & Spares","Chargers & Accessories","Charging Accessories","","","Ripmax","t-gpmm1070_1.jpg","0.00","","","0","1","","5","0","0","2010-10-20 11:41:59","","","","","","","","","","","","","","","","","","","","","","","EOREOR"

This originates from a PHP script which builds the csv file with fixed columns, but these columns are not fixed width. I also added the "EOREOR" on the end of the line to ensure that I know when the end of line has been reached.

The way I read the csv file into VB.net is by using 2 functions, the first, gets the csv file and reads the whole file line by line into an array, the funciton is as follows;

VB.NET:
' Read a csv file into an array, each array element contains a line from
    ' the .csv file
    Public Function FileToArray(ByVal filePath As String) As String()
        Dim sr As System.IO.StreamReader
        Try
            sr = New System.IO.StreamReader(filePath)
            Return System.Text.RegularExpressions.Regex.Split(sr.ReadToEnd, "\r\n")
        Finally
            If Not sr Is Nothing Then sr.Close()
        End Try
    End Function

The second function uses Regex to process the array lines one by one from within a for loop, splits them into the various fields, which are then individually written to elements of a structure for processing, the code is as follows;

VB.NET:
Public Function DecodeCSV(ByVal strLine As String) As String()

        Dim strPattern As String
        Dim objMatch As Match

        ' build a pattern
        strPattern = "^" ' anchor to start of the string
        strPattern += "(?:""(?<value>(?:""""|[^""\f\r])*)""|(?<value>[^,\f\r""]*))"
        strPattern += "(?:,(?:[ \t]*""(?<value>(?:""""|[^""\f\r])*)""|(?<value>[^,\f\r""]*)))*"
        strPattern += "$" ' anchor to the end of the string

        ' get the match
        objMatch = Regex.Match(strLine, strPattern)

        ' if RegEx match was ok
        If objMatch.Success Then
            Dim objGroup As Group = objMatch.Groups("value")
            Dim intCount As Integer = objGroup.Captures.Count
            Dim arrOutput(intCount - 1) As String

            ' transfer data to array
            For i As Integer = 0 To intCount - 1
                Dim objCapture As Capture = objGroup.Captures.Item(i)
                arrOutput(i) = objCapture.Value

                ' replace double-escaped quotes
                arrOutput(i) = arrOutput(i).Replace("""""", """")
            Next

            ' return the array
            Return arrOutput
        Else
            'Throw New ApplicationException("Bad CSV line: " & strLine)
            'Throw New ApplicationException("Bad CSV line: " & strLine)
            Console.WriteLine("Bad CSV line: " & strLine)
            objWriter.Write("Bad CSV line: " & strLine & vbCrLf)

        End If

    End Function

My problem is that due to the formatting of the product description in the example csv line above, it fails the Regex match and therefore fails to read.

I know this is because my pattern is wrong, but I'm not sure how to solve it, I think all I really need is for Regex to split the line by the "," between each element, this might be a little difficult because some of the descriptions contain both , and " but never one after the other.

Any help or advice greatly appreciated.
Peter.
 
I'd use the TextFieldParser class. See post 2 here Importing dat file about inserting the 'read delimited text file' code snippet.
 
Thanks for that John, I looked into textfieldparser and the fast csv reader module from the code project web site.

The problem I had with both of these was the same as the one I had before, in other words they could not cope with double quotes in the middle of a field as in the example I previously posted.

In the end I wrote my own piece of code which searches for not only the comma field separator, but also the double quotes either side of it in order to be sure that it was actually a field separator. The code is as follows;

VB.NET:
' Split CSV formatted string
    Public Function SplitCSVString(ByVal inString As String) As String()

        ' Declare the necessary variables
        Dim outString As String()
        Dim TempDelim As String = vbTab & "`"
        Dim p As Integer = 0
        Dim m As Integer = 0
        Dim a As Integer = 0
        Dim o As Integer = 0
        Dim EOR As String = "EOREOR"

        ' Check to see if the input string has any content
        If inString IsNot Nothing Then

            ' The input string must have some content so let's process it

            ' Set the error indicator to no error
            ErrorRow = False

            ' Set the loop to begin processing the array
            Do While p < inString.Length - 3

                ' Build the 3 positions ready to search for "," in the string
                ' p is already set as the first of the 3 chracters
                m = p + 1
                a = p + 2

                ' Check to see if we are at the end of the record
                If inString.Substring(p) = EOR Then

                    ' End of record found exit loop
                    Exit Do

                Else

                    ' Look for the "," string in the array
                    If inString.Substring(p, 1) = Chr(34) _
                    And inString.Substring(m, 1) = "," _
                    And inString.Substring(a, 1) = Chr(34) Then

                        ' "," found in the string, change the , to the temp delimiter
                        inString = inString.Substring(0, m) _
                        & TempDelim & inString.Substring(a)

                    End If

                    ' Advance the string position counter by 1 and loop round again
                    p = p + 1

                End If

            Loop

            ' Now split the string based on the new temp delimiter
            outString = Split(inString, TempDelim)

        Else

            ' The input string doesn't have any content
            outString = Nothing
            ErrorRow = True

        End If

        ' Return the processed output array and finish
        Return outString

    End Function

I also found that the function that loads the csv file into the array that this function processes line by line was actually loading the array with one more line than was actually in the file, so I replaced this function with a simple read text file one as follows;

VB.NET:
Public Function ReadDataFile(ByVal inFile) As String()

        ' Delcare the variables
        Dim outString As String()

        ' Read the data file into an array
        outString = File.ReadAllLines(inFile)

        Return outString

    End Function

These two new functions combined now work fine, so problem solved:cool:
 
I missed that part:
"Asst. Heat Shrink Tube 3"(76mm)Pk12"
It certainly complicates the pattern.

You could possibly also split by the "," string (including quotes) and trim the prepending/appending quote from first/last field. Like this:
VB.NET:
fields = line.Split(New String() {""","""}, StringSplitOptions.None)
'or
fields = Regex.Split(line, """,""")
 
Back
Top