Answered looping a collection into multiple new filenames, how to speed up?

metallo

Member
Joined
Apr 6, 2010
Messages
8
Programming Experience
Beginner
EDIT: the final solution has been reached and the time required to split files has been reduced by at least a factor of 10 by using a dictionary rather than direct IO. see later posts for the solution code.

greetings,

i have the following snippet of code:

VB.NET:
        For Each tempstring In SourceDataCollection
            LineToFieldsTemp = Split(tempstring, ",")
            LineStringWithoutFirstEntry = "" 'reset the linestring
            For i = 1 To UBound(LineToFieldsTemp)
                LineStringWithoutFirstEntry = LineStringWithoutFirstEntry & LineToFieldsTemp(i) & ","
            Next
            LineStringWithoutFirstEntry = Mid(LineStringWithoutFirstEntry, 1, Len(LineStringWithoutFirstEntry) - 1) 'rmv last ,
            CurrentFileName = ResultsDirectoryName & "\" & SourceFileName & "_" & LineToFieldsTemp(0) & ".csv"
            'MsgBox(CurrentFileName)
            Using FileWriterSW As New StreamWriter(CurrentFileName, True) 'append, if the file doesn't exist it will create it
                FileWriterSW.WriteLine(LineStringWithoutFirstEntry) 'writes the dataline to file
                'FileWriterSW.Close() 'close the opened file
            End Using
        Next

essentially what i'm doing is to read through each item in a collection of comma separated variable strings. the first variable in each string is appended to a master filename and becomes the NEW filename to which the entire string will be appended. this is a way to split a CSV file into multiple files using the first variable in that string.

the trouble is that this is doing a LOT of writes to potentially a LOT of files, the IO is taking forever. is there a way to buffer this data until the end, then write it to a disk? be aware, the filenames and total number of files to be created is UNKNOWN at the begining of the loop. it just creates a new file each time a unique first column ID is discovered.

thanks in advance
 
Last edited by a moderator:
If there's a chance that multiple lines are to be written to the same file then it would be best to wait until the end to do the writing. If each line is definitely being written to a different file then waiting to the end won't make a difference.
 
multiple lines will definitely go to a single file. for example, i might run this program 3-4 times on a starter file. it would go like this:

source file: 1 million lines, breaks into 5 sub files
one of the sub files: 200,000 lines breaks into 20-50 sub files
etc....

my problem, conceptually, is how to cache this data as i run through the loop. when i start to approach the algorithm, it makes my head hurt. if i could just create an array of filenames and their associated collections, i could probably do this. i'm pretty new to coding so i just need to dig-in.

i was hoping there would be a file buffering instruction that would buffer a file in RAM and then i could send that data to the filename at the end.



thanks for the help, and any further ideas appreciated!
 
In that case you should store the data in a Dictionary. The keys would be the file names and the values would be a List containing the lines to write to that file. Each time you process a line of input you first check the Dictionary to see if the key exists. If not then you add it along with a new empty List, otherwise you get the List for that key. You then add the new line to the List. At the end you simply loop through the keys and call File.WriteAllLines or .AppendAllLines for each key, passing the lines to write. That way, each output file gets opened once and once only, no matter how many lines you need to write to it.

If it's more appropriate, you might store StringBuilders in the Dictionary instead of Lists and append to them each time, then call File.WriteAllText at the end.
 
thanks so much, this sounds very promising....i'll pursue the solution and do my best to post it back here when complete!
 
i need pointed in a direction, please!

1) i can successfully read through the sourcefile, recognize and build a list of the keys in the dictionary and then print them to a debug window.

2) i also think i've successfully created a new collection value for each new key that is recognized, at least the syntax editor isn't complaining at this point.

3) my problem is at the ????????. i don't know how to add a string to the collection that would already exist for each key. i began to see references something called an ICollection instead of just Collection and wondered if that syntax is how i add in my strings.

my current code:
VB.NET:
        For Each tempstring In SourceDataCollection

            LineToFieldsTemp = Split(tempstring, ",") 'split the CSV data line
            LineStringWithoutFirstEntry = "" 'reset the linestring
            For i = 1 To UBound(LineToFieldsTemp)
                LineStringWithoutFirstEntry = LineStringWithoutFirstEntry & LineToFieldsTemp(i) & ","
            Next
            LineStringWithoutFirstEntry = Mid(LineStringWithoutFirstEntry, 1, Len(LineStringWithoutFirstEntry) - 1) 'rmv final ","
            CurrentFileName = ResultsDirectoryName & "\" & SourceFileName & "_" & LineToFieldsTemp(0) & ".csv"

            If DataDictionary.ContainsKey(CurrentFileName) Then
                'add this linestring to the collection for this key in the dictionary
                '???????????????????????????
            Else
                DataDictionary.Add(CurrentFileName, New Collection(Of String))
            End If

        Next

        'debug dictionary
        For Each keystring In DataDictionary.Keys
            textboxWorkLog.AppendText(keystring & vbCrLf)
        Next
 
You access the dictionaries TValue by specifying DictionaryName.Item(TKey).

VB.NET:
Public Class Form1

    Dim dict As New Dictionary(Of String, List(Of String))
    Dim fileName As String
    Dim lines As List(Of String)

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        fileName = "fileA.txt"
        lines = New List(Of String)
        lines.AddRange({"a", "b", "c"})
        AddToDictionary(fileName, lines)

        fileName = "fileB.txt"
        lines = New List(Of String)
        lines.AddRange({"c", "d", "e"})
        AddToDictionary(fileName, lines)

        fileName = "fileA.txt"
        lines = New List(Of String)
        lines.AddRange({"f", "g", "h"})
        AddToDictionary(fileName, lines)
    End Sub

    Private Sub AddToDictionary(ByVal fileName As String, ByVal lines As List(Of String))
        If dict.ContainsKey(fileName) Then
            dict.Item(fileName).AddRange(lines)
        Else
            dict.Add(fileName, lines)
        End If
    End Sub
End Class
 
thanks so much to both of you, my code is now working and posted below. first a time study. it's almost fifteen times faster now.

original program:
split 300,000 line source file into two sub files: 2.0 minutes
split those two sub files further into 31 sub files: 1.6 minutes
total time: 3.6 minutes (216 seconds)

new program:
split 300,000 line source file into two sub files: 8 seconds
split those two sub files further into 31 sub files: 7 seconds
total time: 15 seconds

VB.NET:
'split sourcefiles into a dictionary containing each sub file
Dim DataDictionary As New Dictionary(Of String, Collection(Of String))
For Each tempstring In SourceDataCollection

    LineToFieldsTemp = Split(tempstring, ",") 'split the CSV data line
    LineStringWithoutFirstEntry = "" 'reset the linestring
    For i = 1 To UBound(LineToFieldsTemp)
        LineStringWithoutFirstEntry = LineStringWithoutFirstEntry & LineToFieldsTemp(i) & ","
    Next
    LineStringWithoutFirstEntry = Mid(LineStringWithoutFirstEntry, 1, Len(LineStringWithoutFirstEntry) - 1) 'rmv last ,
    CurrentFileName = ResultsDirectoryName & "\" & SourceFileName & "_" & LineToFieldsTemp(0) & ".csv"

    If DataDictionary.ContainsKey(CurrentFileName) Then
        'add this linestring to the collection for this key in the dictionary
        DataDictionary.Item(CurrentFileName).Add(LineStringWithoutFirstEntry) 'add the linestring to the collection
    Else
        DataDictionary.Add(CurrentFileName, New Collection(Of String))
    End If

Next

'write data from dictionary to new files
For Each keystring In DataDictionary.Keys
    textboxWorkLog.AppendText("Creating new sub-file: " & keystring & vbCrLf)
    textboxWorkLog.ScrollToCaret() 'scroll to end
    textboxWorkLog.Refresh() 'refresh the status while working
    Using FileWriterSW As New StreamWriter(keystring, True) 'creates new file and TRUE makes it append

        'prepend headers if preference is enabled
        If prefsKeepHeader.Checked Then 'prepend header to each file that was created
            For Each tempstring In SourceDataHeader
                FileWriterSW.WriteLine(tempstring)
            Next
        End If

        'write collection of data into the file
        For Each linestring In DataDictionary.Item(keystring) 'iterate through collection for each filename
            FileWriterSW.WriteLine(linestring) 'write the lines from this collection
        Next

        FileWriterSW.Close() 'close the opened file

    End Using
Next

srDataStreamReader.Close() 'close the opened source file file
 
Last edited:
I think your code is very 'wordy', which can be simplified for better readability IMO, I also have some other suggestions and corrections.
  • In many cases Hungarian notations where you add data types to names is considered bad practice.
  • The closing the writer in its Using block is redundant. When Using block exits it calls Dispose which is sufficient to close the StreamWriter.
  • You're using Collection(Of T) which is a base class mostly used for custom inheritance, if you're not changing it just use the default List(Of T) as suggested previously in thread.
  • The string operations can be reduced considerably from your splits&joins to simply splitting by the first comma.
  • Now you don't have many keys, but I would still just use the unique first field as key instead of adding up the whole filename, that you can do when opening the file.
  • You say you create a new file, still you specify the append Boolean, why? The append switch is used when you want to open and append content to an existing file.
  • Item is the default property of Dictionary(Tkey,Tvalue), which means it can be left out in the expression, this may be a personal preference, but to me data(key) is more readable than data.Item(key), and faster to type too.
  • Possible bug: when key don't exist, you add a new list, but do not add the line to list. Is this intentional, or just missed it?
  • There is a srDataStreamReader.Close call at the end of your code. This 'reader' is not used other places in that code, so why not close it earlier, immediately when you were done reading from that file?
Suggested reworked code:
VB.NET:
Dim data As New Dictionary(Of String, List(Of String))

For Each line As String In sourceData
    Dim kix As Integer = line.IndexOf(","c)
    Dim key As String = line.Remove(kix)
    Dim lineAfterKey As String = line.Substring(kix + 1)
    If Not data.ContainsKey(key) Then
        data(key) = New List(Of String)
    End If
    data(key).Add(lineAfterKey)
Next

For Each key As String In data.Keys
    Dim filename As String = ResultsDirectoryName & "\" & SourceFileName & "_" & key & ".csv"
    'textboxWorkLog
    Using writer As New IO.StreamWriter(filename)                
        'headers first possibly
        '...
        'lines           
        For Each line In data(key)
            writer.WriteLine(line)
        Next
    End Using
Next
You have some comments explaining the code, which is probably good, but also adds to the "mess". Personally I'm not a big fan of commenting code lines, sometimes I just add 'headers' or overall description, but try to make better comments if code is 'rocket-science' - which happens less frequently as things that once was new and difficult gets more familar.

Notice I declare three local variables in the "For Each line" loop that gets called '300.000 times'. This is 'closest scope' practice, and makes code generally more concise and more memory conservant. Doing that inside the loop is not a show-stopper, the compiler will arrange this code in order for best performance, so at runtime these allocations are only done once for the whole loop.

About UI interaction (I omitted that from sample code on purpose), you say your last code executed in 8 seconds, I wouldn't tie the UI thread up for so long. Have a look at BackgroundWorker component. It will allow you to run a worker thread to process files, and still easily report progress to UI. Apart from the benefit of not blocking UI, this will also improve performance of the worker loop.
 
i enjoy seeing the disection of my code, thank you for your efforts.

this program is my first complete self-taught effort at VB. i have to multi-task my talents between a lot of stuff such as 3D animation, graphic editing, data manipulation, video editing, etc....so without my comments (even on my own code), i'd be lost 6 months from now when i come to tweek it!

responses to querys from your reply:

1) naming convention, i can appreciate what your saying, using "DataDictionary" as a name of a Dictionary class (or is it object?) might be a bad idea. for myself, it is helping to get my mind around the code.

2) the append boolean is a leftover from the previous algorithm, it is no longer needed.

3) the possible bug: i suppose i just thought that i had to create the new list for each key that was newly discovered....so i created a blank list (well, in my case a blank collection). i did not necessarily know the dictionary would create the new list when/if it needed it.

4) yes, the srStreamReader could be closed much sooner, i completely missed that

again, i can't express how much i appreciate seeing your feedback and i know that this example is going to help a lot of people besides myself. this is exactly the reason that when i really need some indepth programming done, i'm going to contract it out.

as for your UI interaction comment about 8 seconds being "so long", i have to laugh. you're probably spot-on and i love that you are so enthusiastic about fixing my screwups. from my perspective though, this program itself, even if it took 15 minutes to execute, is doing something that would take me HOURS to do previously.

in the grand scope of my entire program, i'm interacting with it for a period of 5 to 10 minutes and doing a task that would otherwise have taken me several days. i'm at the point of diminishing return right now. the code we are discussing is only a portion of the entire program, and represents one of about six or eight tasks that i accomplish with this.

for all three of you who have helped out on this, i really admire your eagerness to help others and i hope you get your paybacks. in my own field of expertise, i try to do the same as much as possible.

cheers!
 
Last edited:
as for your UI interaction comment about 8 seconds being "so long", i have to laugh. you're probably spot-on and i love that you are so enthusiastic about fixing my screwups. from my perspective though, this program itself, even if it took 15 minutes to execute, is doing something that would take me HOURS to do previously.

in the grand scope of my entire program, i'm interacting with it for a period of 5 to 10 minutes and doing a task that would otherwise have taken me several days. i'm at the point of diminishing return right now. the code we are discussing is only a portion of the entire program, and represents one of about six or eight tasks that i accomplish with this.

Here's the article I originally used to learn about the BackgroundWorker: How To Update Controls Using BackgroundWorker in VB.NET.

I think you'll find that it's VERY easy to add this to your existing project and take the work off of your UI thread.
 
your potential bug report was indeed valid. the way i had it coded, it would discard the first line of data when it was creating a new key in the dictionary. you told me and it still took me a while to figure out it had to be fixed!

i know it isn't elegant, but i had to add a line into my code making it now look like this:

VB.NET:
            If DataDictionary.ContainsKey(CurrentFileName) Then
                'add this linestring to the collection for this key in the dictionary
                DataDictionary.Item(CurrentFileName).Add(LineStringWithoutFirstEntry) 'add the linestring to the collection
            Else
                DataDictionary.Add(CurrentFileName, New Collection(Of String))
                DataDictionary.Item(CurrentFileName).Add(LineStringWithoutFirstEntry)
            End If

i shudder to think how small (in lines and characters) my entire 1200 line program would be if an expert were to rewrite it!
 
a further question since we're talking about threads:

would it be relatively easy for me to spawn a new thread that would work on each file and do the split routine? i'm thinking along the lines of using all 8 cores of my CPU to each work on it's own file at the same time.

i would venture to guess that something like this would be a monumental task given the sloppy approach i've taken to this program in the first place.
 
It would still be recommended to use a List(Of String) rather than a Collection(Of String).

You can also initialize the List(Of String) rather than creating it on one line and then adding the string to it on the next. I've shown a couple of ways to do it here.

VB.NET:
    Private Sub AddLineToDictionary(ByVal fileName As String, ByVal line As String)
        If dict.ContainsKey(fileName) Then
            dict.Item(fileName).Add(line)
        Else
            dict.Add(fileName, {line}.ToList)
            'dict.Add(fileName, New List(Of String)(New String() {line}))
        End If
 
Back
Top