Fast method to read text files

dvj357

Member
Joined
Jan 14, 2011
Messages
7
Programming Experience
Beginner
Greetings,

I was wondering if anyone knew of a simple/easy way (emphasis on easy, I am a major beginner) to read through text files quickly, search for strings, and pull out information. I am currently using the 'StreamReader' class and the 'InStr' function, but the files I am working with are gigantic, and this takes anywhere from half an hour to a few hours to work its way through.

I again have exhausted my google skills and am unable to comprehend most of what I find.

Any links or tips on commands to learn about would be greatly appreciated!

Here is a sample of the code I am currently using. Poke fun at will, I am a major newbie.

strContentsTime1_jntfor=0
Do Until strContentsTime1_jntfor Is Nothing
strContentsTime1_jntfor = objFile1.ReadLine()
'look for string 'time'
strFindStringTime_jntfor = "time"
intPositionTime_jntfor = InStr(strContentsTime1_jntfor, strFindStringTime_jntfor)
'if string 'time' is found, InStr function will return a positive integer, if not found it returns 0
'if InStr returns a positive integer,
If intPositionTime_jntfor > 0 Then
'then take the 12 rightmost characters
strTime1_jntfor(intCounter2) = Microsoft.VisualBasic.Right(strContentsTime1_jntfor, 12)
dblTime1_jntfor(intCounter2) = Val(strTime1_jntfor(intCounter2))
intCounter2 = intCounter2 + 1
Else : Continue Do
End If
Loop
objFile1.Close()
 
How big are the files? What are you looking for?

I have a couple of different files I am working with (nodout and jntforc, if you are familiar with LS-Dyna), and they range in size from 100MB to 6GB, from 100,000 lines to 20+ million. The files are a repetitive series with the values changing.

The nodout looks like this for one time step:
VB.NET:
nodal point  x-disp     y-disp      z-disp      x-vel       y-vel       z-vel      x-accl      y-accl      [B]z-accl[/B]      x-coor      y-coor      z-coor
  2000001  2.4651E-005 0.0000E+000 8.3513E-005 1.0956E+002 0.0000E+000 3.7117E+002 0.0000E+000 0.0000E+000 0.0000E+000 4.3843E+002-5.4146E+001 1.9712E+002
  2001787  4.0887E-006 0.0000E+000 8.6979E-005 1.8172E+001 0.0000E+000 3.8657E+002 0.0000E+000 0.0000E+000 0.0000E+000 4.4756E+002-5.4146E+001 1.6289E+002
  [B]2003304[/B] -1.5120E-005 0.0000E+000 8.5752E-005-6.7202E+001 0.0000E+000 3.8112E+002[B][I]-0.0000E+000[/I][/B] 0.0000E+000 0.0000E+000 4.3934E+002-5.4146E+001 1.3224E+002
  2007501 -1.5120E-005 0.0000E+000-8.5752E-005-6.7202E+001 0.0000E+000-3.8112E+002-0.0000E+000 0.0000E+000-0.0000E+000 4.4182E+002-5.5346E+001 1.2846E+002
  2007502 -1.5120E-005 5.7457E-012-8.5752E-005-6.7202E+001 2.5537E-005-3.8112E+002-0.0000E+000 0.0000E+000-0.0000E+000 4.4182E+002-5.2945E+001 1.2846E+002
  2007503 -1.5120E-005 0.0000E+000-8.5752E-005-6.7202E+001 0.0000E+000-3.8112E+002-0.0000E+000 0.0000E+000-0.0000E+000 4.4142E+002-5.5346E+001 1.2619E+002
  2008001  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.4691E+002-5.4146E+001 1.3565E+002
  2008003  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.3843E+002-5.4146E+001 1.4095E+002
  2008005  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.4691E+002-4.4146E+001 1.3565E+002
  2008011  0.0000E+000 0.0000E+000 0.0000E+000-3.6481E-013 1.8336E-014-3.8700E+002-0.0000E+000 0.0000E+000-0.0000E+000 4.4071E+002-5.4146E+001 1.9435E+002
  2008013  0.0000E+000 0.0000E+000 0.0000E+000 6.4882E-013-4.0380E-015-3.8700E+002 0.0000E+000-0.0000E+000-0.0000E+000 4.3112E+002-5.4146E+001 1.9152E+002
  2008015  0.0000E+000 0.0000E+000 0.0000E+000-3.3253E-013 1.8336E-014-3.8700E+002-0.0000E+000 0.0000E+000-0.0000E+000 4.4071E+002-4.4145E+001 1.9435E+002
  2008021  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.5087E+002-5.4146E+001 1.7787E+002
  2008023  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.4128E+002-5.4146E+001 1.7504E+002
  2008025  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.5087E+002-4.4146E+001 1.7787E+002
  2008031  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.9337E+002-6.3785E+001 1.3271E+002
  2008033  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8389E+002-6.3785E+001 1.3589E+002
  2008035  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.9337E+002-5.3785E+001 1.3271E+002
  2008041  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8585E+002-6.3785E+001 1.1024E+002
  2008043  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.7637E+002-6.3785E+001 1.1341E+002
  2008045  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8585E+002-5.3785E+001 1.1024E+002
  2008051  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.9339E+002-4.4588E+001 1.3295E+002
  2008053  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8392E+002-4.4605E+001 1.3617E+002
  2008055  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.9337E+002-3.4588E+001 1.3295E+002
  2008061  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8575E+002-4.4596E+001 1.1052E+002
  2008063  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.7628E+002-4.4614E+001 1.1374E+002
  2008065  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 3.8573E+002-3.4596E+001 1.1052E+002
  3100025  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.6795E+002-7.1029E+001 1.4437E+002
  3100602  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.6795E+002-7.5170E+001 1.1123E+002
  3100765  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.6795E+002-3.7197E+001 1.4437E+002
  3110873  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.6761E+002-7.0704E+001 1.5000E+002
  3111746  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000-3.8700E+002 0.0000E+000 0.0000E+000-0.0000E+000 4.6761E+002-3.7522E+001 1.5000E+002

VB.NET:
 n o d a l   p r i n t   o u t   f o r   t i m e  s t e p       1                              ( at time0.00000E+000 )

 nodal point  x-rot      y-rot       z-rot       x-rot vel   y-rot vel   z-rot vel   x-rot acc   y-rot acc   z-rot acc
  2000001  2.4651E-005 0.0000E+000 8.3513E-005 8.9449E-021-2.5329E-020 1.3689E-022 0.0000E+000-0.0000E+000 0.0000E+000
  2001787  4.0887E-006 0.0000E+000 8.6979E-005-5.0679E-020 4.1357E-019-3.6312E-020-0.0000E+000 0.0000E+000-0.0000E+000
  2003304 -1.5120E-005 0.0000E+000 8.5752E-005-2.5149E-007 4.8524E-004-9.6239E-007-0.0000E+000 0.0000E+000-0.0000E+000
  2007501 -1.5120E-005 0.0000E+000-8.5752E-005-2.5149E-007-4.8524E-004 9.6239E-007-0.0000E+000-0.0000E+000 0.0000E+000
  2007502 -1.5120E-005 5.7457E-012-8.5752E-005-2.5167E-007-4.8524E-004 9.6239E-007-0.0000E+000-0.0000E+000 0.0000E+000
  2007503 -1.5120E-005 0.0000E+000-8.5752E-005-2.5149E-007-4.8524E-004 9.6239E-007-0.0000E+000-0.0000E+000 0.0000E+000
  2008001  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008003  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008005  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008011  0.0000E+000 0.0000E+000 0.0000E+000-1.8846E-014-3.5817E-013-3.2281E-015-0.0000E+000-0.0000E+000-0.0000E+000
  2008013  0.0000E+000 0.0000E+000 0.0000E+000-1.8846E-014-3.5817E-013-3.2281E-015-0.0000E+000-0.0000E+000-0.0000E+000
  2008015  0.0000E+000 0.0000E+000 0.0000E+000-1.8846E-014-3.5817E-013-3.2281E-015-0.0000E+000-0.0000E+000-0.0000E+000
  2008021  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008023  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008025  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008031  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008033  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008035  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008041  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008043  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008045  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008051  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008053  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008055  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008061  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008063  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  2008065  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  3100025  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  3100602  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  3100765  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  3110873  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000
  3111746  0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000 0.0000E+000

In this nodout file I am trying to track the z-accl of node 2003304 (in bold italics) as it changes over a few thousand time steps. I am essentially trying to do this with a bunch of different nodes. Most of the lines are read and thrown away while I am searching for the lines/values I need, and the entire process takes a really really long time.

I apologize for the huge space consumption, or if that is way too much information.

Thanks for your time!
 
If I've understood correctly, you are trying to find a value at the start of a line and then, if that value is found, pull another value (z-accl) from the same line.
I have no idea if this is any faster, but could you use the StreamReader.Read method, like this:

VB.NET:
  Dim FileObject As System.IO.StreamReader = New System.IO.StreamReader("path")
        Dim Number(7) As Char

        FileObject.Read(Number, 0, 7)

' If Number = "1234567" then read the rest of this line, split it into a separate array, and grab the element that represents the z-accl?

        FileObject.Close()


I see each line has 152 chars, so could you increment the start index by 152 each time and just keep checking those first 7 characters for a match?


As I say, I really don't know if that's any more efficient.
 
Let us know if it's any faster ;) - I would be interested to know. I don't have a 20,000,000-line text file handy and I'm not generating one either :p
 
they range in size from 100MB to 6GB, from 100,000 lines to 20+million.

I know you emphasize the 'easy' part here, but you can't always find an 'easy' way through.
That said, if it were me, considering the size of this monster, I would be tempted to import it into a SQL server and run the searches you need from there.
 
How about using a Regex to do the searching?

I don't honestly know. Maybe using SQL could be faster. I don't have a file to test the speed on. If it is not classified or proprietary you might make available or email a zip file of one that is truncated somewhat from 100MB to say 5 or 10MB. I copied and used your example and changed it so there would be a second instance of nodal point 2003304 and it worked.

VB.NET:
        ' Needs these two lines at the top of the file:
        '   Imports System.IO
        '   Imports System.Text.RegularExpressions
        Try
            Dim objReader As System.IO.StreamReader
            Dim lineNumber As Integer = 0
            objReader = File.OpenText("F:\Users\Don\Documents\Visual Studio 2010\Projects\FindingValue_x-accl\FindingValue_x-accl.txt") ' your file path here
            While objReader.Peek <> -1
                lineNumber = lineNumber + 1
                Dim strTempLineIn As String = objReader.ReadLine()
                ' Group 1 is the nodal point; groups 2 to 13 are the twelve numeric columns, in file order
                Dim MatchObj As Match = Regex.Match(strTempLineIn, "^\s+(2003304)\s(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})(.+?E[+-]\d{3})")
                ' This regex only tests for one or more leading spaces
                If MatchObj.Success Then
                    Dim strnodalPoint As String = MatchObj.Groups(1).Value ' value of nodal point
                    Dim strx_disp As String = MatchObj.Groups(2).Value ' value of x-disp
                    Dim stry_disp As String = MatchObj.Groups(3).Value ' value of y-disp
                    Dim strz_disp As String = MatchObj.Groups(4).Value ' etc... etc...
                    Dim strx_vel As String = MatchObj.Groups(5).Value
                    Dim stry_vel As String = MatchObj.Groups(6).Value
                    Dim strz_vel As String = MatchObj.Groups(7).Value
                    Dim strx_accl As String = MatchObj.Groups(8).Value ' our target
                    Dim stry_accl As String = MatchObj.Groups(9).Value
                    Dim strz_accl As String = MatchObj.Groups(10).Value
                    Dim strx_coor As String = MatchObj.Groups(11).Value
                    Dim stry_coor As String = MatchObj.Groups(12).Value
                    Dim strz_coor As String = MatchObj.Groups(13).Value
                    strx_accl = strx_accl.Trim
                    Console.WriteLine("In line number " & lineNumber & " nodal point =>  " & strnodalPoint & " Value x-accl => " & strx_accl)
                End If
            End While
            objReader.Close()
        Catch ex As Exception
            MsgBox(ex.ToString)
        End Try
        Console.ReadLine()
 
I'm not sure about the process, but depending on what you are doing, a string compressor could help. Those files would probably shrink down tons because of how repetitive they are.
 
Thanks for the thoughts, everyone. I will do my best and give each one a try.

My emphasis on easy is mostly because I am a complete novice...

I am currently trying to find a way to write out the files from the parent software (LS-Dyna) in a different manner to see if I can reduce the size of these beasts.

Thanks again for your thoughts and help!
 
Just out of curiosity, I copied and pasted to create a 100MB file of repetitive data, with the target appearing about every 30 or so lines. I ran the example that I submitted yesterday and it took just over three and a half minutes to search that file. My computer is about 4 years old, a Core 2 Duo.
 
Those files are really large, but using StreamReader and ReadLine is also really fast. I did a test reading through a 1GB file with about 9 million lines (copies of your data), just counting them, and this took 10 seconds on my machine (fairly good 3yrs old). So this part is not relevant to the time results you get, it's down to the processing of each line.
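For anyone who wants to reproduce that kind of read-only baseline on their own file, a minimal sketch might look like the following. This is not the code behind the numbers quoted above; the module name `ReadBaseline` and the helper `CountLines` are made up for illustration, and the file path is assumed to arrive as the first command-line argument.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO

Module ReadBaseline
    ' Counts the lines in a file without processing them,
    ' so the result reflects raw StreamReader/ReadLine speed only.
    Function CountLines(path As String) As Long
        Dim count As Long = 0
        Using reader As New StreamReader(path)
            While reader.ReadLine() IsNot Nothing
                count += 1
            End While
        End Using
        Return count
    End Function

    Sub Main(args As String())
        ' Assumption: args(0) is the path of the file to measure.
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Dim lines As Long = CountLines(args(0))
        watch.Stop()
        Console.WriteLine("{0} lines in {1:F1} s", lines, watch.Elapsed.TotalSeconds)
    End Sub
End Module
```

If this alone runs in seconds while the full program takes hours, the time is being lost in the per-line processing rather than in the file reading.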

I'm not exactly sure about the structure of the data from your post, but I guess the posted structure repeats throughout the large file. So you would be looking for the 'time' string and, if it is not found, looking for a node and a value. Interestingly, adding the check for, retrieval of, and parsing of the time value (as in the code in your first post) doesn't add any significant time to the total; I end up with 14 seconds. Neither does a single check for whether a line starts with a value, for example " 2003304 ", and retrieving one of the column values, which is a simple Substring at a multiple of 12 with a fixed length, and converting that value to Single/Double. Since you said a 'bunch', I expanded the test to look up 15 nodes, in my test the equivalent of 50% of the lines/nodes, and it now finished in around 50 seconds. So nothing you have described so far should take the time you say, and I don't know what might.
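The fixed-width lookup described above could be sketched roughly as follows. This is only an illustration under assumptions read off the sample rows in this thread: a 10-character node ID followed by 12-character numeric fields, with 12 fields per row in the translational table (which is how the shorter rotational rows are skipped here). The names `NodoutScan`, `TryGetNodeId`, `ParseField` and `CollectColumn` are hypothetical, and `File.ReadLines` needs .NET 4 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.IO

Module NodoutScan
    ' Assumed layout, inferred from the sample rows posted in this thread.
    Const IdWidth As Integer = 10
    Const FieldWidth As Integer = 12
    Const TranslationalFields As Integer = 12

    ' Returns the node ID of a data row, or Nothing for headers and blank lines.
    Function TryGetNodeId(line As String) As String
        If line.Length < IdWidth Then Return Nothing
        Dim id As String = line.Substring(0, IdWidth).Trim()
        If id.Length = 0 OrElse Not Char.IsDigit(id(0)) Then Return Nothing
        Return id
    End Function

    ' Returns the value in the given zero-based column of a data row.
    ' With the assumed layout, z-accl is column 8.
    Function ParseField(line As String, column As Integer) As Double
        Dim raw As String = line.Substring(IdWidth + column * FieldWidth, FieldWidth)
        Return Double.Parse(raw, NumberStyles.Float, CultureInfo.InvariantCulture)
    End Function

    ' Streams the file once and collects one column for every wanted node.
    Function CollectColumn(path As String, wanted As HashSet(Of String),
                           column As Integer) As Dictionary(Of String, List(Of Double))
        Dim result As New Dictionary(Of String, List(Of Double))
        For Each id As String In wanted
            result(id) = New List(Of Double)
        Next
        Dim fullRow As Integer = IdWidth + TranslationalFields * FieldWidth
        For Each line As String In File.ReadLines(path)
            ' Skip rows too short to belong to the 12-column table.
            If line.Length < fullRow Then Continue For
            Dim id As String = TryGetNodeId(line)
            If id IsNot Nothing AndAlso wanted.Contains(id) Then
                result(id).Add(ParseField(line, column))
            End If
        Next
        Return result
    End Function
End Module
```

The HashSet keeps the per-line cost roughly constant no matter how many nodes are tracked, so all nodes can be collected in a single pass instead of rereading the file once per node.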

By the way, I don't think you'll get any benefit from Regex here; there are no complex varying string patterns, only fixed-length substrings. The only string search is for "time", which as mentioned is not an issue, and it may as well be a fixed-location substring.
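As a rough illustration of pulling the time value without a regex, here is a small sketch. It assumes the nodout time-step header ends in a "( at time... )" stamp as in the sample posted earlier. Since the exact column of the stamp cannot be confirmed from the forum paste, it uses an ordinal IndexOf on the marker instead of a hard-coded offset. The names `TimeHeader` and `TryParseTime` are made up for the example.

```vbnet
Imports System
Imports System.Globalization

Module TimeHeader
    ' Assumed marker, taken from the sample header "( at time0.00000E+000 )".
    Const Marker As String = "at time"

    ' Returns True and sets value when the line carries a time stamp.
    Function TryParseTime(line As String, ByRef value As Double) As Boolean
        Dim start As Integer = line.IndexOf(Marker, StringComparison.Ordinal)
        If start < 0 Then Return False
        start += Marker.Length
        Dim finish As Integer = line.IndexOf(")"c, start)
        If finish < 0 Then Return False
        ' NumberStyles.Float tolerates the blank before the closing bracket.
        Dim raw As String = line.Substring(start, finish - start)
        Return Double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, value)
    End Function
End Module
```

If the stamp does turn out to sit at a fixed column in the real files, the IndexOf call could be replaced by a plain Substring at that position.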
 
I was in error in my last reply when I said that it took 3 and one half minutes. After seeing that I realize that something was terribly wrong there. I was just going by watching Windows explorer updating the properties of the file. I put a timer in the routine and it took a total of 2 seconds to run the regex on 1.2 million lines and output a text file with the results. In my created file of 1.2 million lines I put in 40,618 lines with the target 'nodal point'.
 