Question Reading Raw Text (?) as String to Manipulate...

ckoeber · Feb 29, 2012

Hello,

I am not sure if this makes sense or not, but I need to process raw text (without any encoding?) from a file that I recovered with the file recovery application PhotoRec.

With an application like Notepad++ I can see all of the text I need to manipulate but with VB I seem to only get some of the text with other stuff stripped out.

Here is what I am using to process the file:

VB.NET:

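    ' Needs Imports System.IO for FileStream.
    Dim objFileReader As FileStream = System.IO.File.OpenRead(FilePath)
    Dim currentByte(2047) As Byte ' upper bound 2047 gives a 2048-byte buffer
    Dim tempEncoding As New System.Text.ASCIIEncoding ' ASCII turns every byte above 127 into "?"
    Dim bytesRead As Integer
    If File Is Nothing Then
        File = New Collection
    End If
    Do
        bytesRead = objFileReader.Read(currentByte, 0, currentByte.Length)
        ' Decode only the bytes actually read; the rest of the buffer may hold stale data.
        If bytesRead > 0 Then File.Add(tempEncoding.GetString(currentByte, 0, bytesRead))
    Loop While bytesRead > 0
    objFileReader.Close()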

I essentially add each line of text to the collection so I can read it one line at a time (other processing, etc.).

So, as mentioned, with the above code I get text but not all of it. With different encodings I get different versions of the text decoded from "currentByte".

So, how can my .NET application read text like Notepad++ or a similar "raw" text reader app?

Thanks.

Regards,

Chris K.

jmcilhinney · Feb 29, 2012

There's no such thing as raw text with no encoding. An encoding is a set of rules that defines how binary data is converted to and from text. Because all information in a computer is stored in binary form, any text you see must have been converted from binary, so there is always an encoding involved.

Most likely the issue is that the data actually uses two bytes per character and, by using ASCII, you are interpreting every byte as a separate character. You should be using a different Encoding, and you will have to experiment to determine which one. There are ways to detect the encoding of a text file, but no single way works for all files.

Regardless, don't read Byte by Byte. Just call IO.File.ReadAllText, and specify the appropriate Encoding as an argument if the default (UTF-8) is not appropriate.
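To make the "experiment with encodings" advice concrete, here is a minimal sketch that decodes the same file with several candidate encodings so the results can be compared side by side. The module name, the helper names (ReadWith, TryEncodings), and the list of candidates are assumptions chosen for illustration, not something from the original posts.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module ReadAllTextSketch
    ' Hypothetical helper: reads the whole file as text using the given encoding.
    Function ReadWith(filePath As String, enc As Encoding) As String
        Return File.ReadAllText(filePath, enc)
    End Function

    ' Hypothetical helper: decodes the file once per candidate encoding.
    ' The candidate list is an assumption; add or remove encodings as needed.
    Function TryEncodings(filePath As String) As Dictionary(Of String, String)
        Dim candidates As Encoding() = {Encoding.ASCII, Encoding.UTF8,
                                        Encoding.Unicode, Encoding.BigEndianUnicode,
                                        Encoding.UTF32}
        Dim results As New Dictionary(Of String, String)
        For Each enc As Encoding In candidates
            ' Key each result by the encoding's registered name.
            results(enc.WebName) = ReadWith(filePath, enc)
        Next
        Return results
    End Function
End Module
```

Comparing the decoded strings (their lengths and first few characters) usually shows which candidate fits. For example, UTF-16 text read as ASCII comes back with a null character after every letter.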

ckoeber · Feb 29, 2012

Thank you for the response. I know I didn't ask this in my original post, but what happens with large files where I don't want to (or can't) read the whole file at once?

In those cases, should I specify a really large (but still manageable) Byte array, or is there a better way?

jmcilhinney · Feb 29, 2012

In such cases you would create a StreamReader and specify the Encoding as a constructor parameter. You can then call ReadToEnd to get all the remaining text as a String, ReadLine to get all text up to the next line break, or Read or ReadBlock to read a specific number of characters into a Char array. That last option would let you, for instance, read a large file in chunks of 1000 characters.
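As a rough sketch of the chunked approach, the following reads a file through a StreamReader in blocks of a fixed number of characters. The module name, the helper name ReadInChunks, and its parameters are assumptions for illustration. Collecting the chunks in a list is only done here to keep the example small; with a truly large file you would process each chunk inside the loop instead of storing it.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module ChunkedReadSketch
    ' Hypothetical helper: returns the file's text as chunks of at most chunkSize characters.
    Function ReadInChunks(filePath As String, enc As Encoding, chunkSize As Integer) As List(Of String)
        Dim chunks As New List(Of String)
        Dim charBuffer(chunkSize - 1) As Char ' upper bound is chunkSize - 1, so the length is chunkSize

        Using reader As New StreamReader(filePath, enc)
            ' ReadBlock returns the number of characters actually read, 0 at end of file.
            Dim charsRead As Integer = reader.ReadBlock(charBuffer, 0, charBuffer.Length)
            Do While charsRead > 0
                ' Copy only the characters that were read in this pass.
                chunks.Add(New String(charBuffer, 0, charsRead))
                charsRead = reader.ReadBlock(charBuffer, 0, charBuffer.Length)
            Loop
        End Using

        Return chunks
    End Function
End Module
```

Calling ReadInChunks(somePath, Encoding.Unicode, 1000) would give chunks of up to 1000 characters each, with the final chunk holding whatever is left.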

ckoeber · Mar 1, 2012

Thanks so much for your help. I tried multiple encodings (ASCII, UTF7, UTF8, UTF32, and Unicode) and loaded the entire file into a RichTextBox as a String via IO.File.ReadAllText (looping to get everything out of the array), but for some reason editors like Notepad++ (and even Notepad) still show content that the RichTextBox control doesn't. I don't get it.

ckoeber · Mar 1, 2012

An example ...

Attached is an example file that I am trying to read.

Open it in Notepad and you can see that there is readable text but when I use .NET with all sorts of different encodings I just get the first character (an '@' sign).
