IO.StreamReader problems with accents

peterk

My program is very simple. All it does is read a file that lists file names with their corresponding paths. It then checks whether each file exists; if not, it writes the path to another file.

A problem occurs when file names contain accented characters (é, for example). StreamReader does not read these characters correctly, so the file is reported as not found.

Example:
IO.StreamReader will read the following line:

S:\projacad\PROJET_2005\05020\TQC\BINDÉS\05020-30200-C0-000-2.pdf

and return

S:\projacad\PROJET_2005\05020\TQC\BINDS\05020-30200-C0-000-2.pdf
resulting in an unknown file.

Other than reading the line byte by byte, is there anything quicker I can do?

Here is my code:


Thanks for taking the time to read and hopefully help.

VB.NET:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim TextString As String
    Dim pathfile As String = "g:\xerox\general\path.txt"
    Dim SR As New IO.StreamReader(pathfile)
    Dim sw As New IO.StreamWriter("g:\xerox\general\NotFound.txt")
    ' Read each path from the list and log the ones that do not exist.
    Do While SR.Peek() <> -1
        TextString = SR.ReadLine()
        If Not IO.File.Exists(TextString) Then
            sw.WriteLine(TextString)
        End If
    Loop
    sw.Close()
    SR.Close()
    MsgBox("done")
End Sub
 
Use the StreamReader constructor overload that lets you specify an encoding, for example your system's default encoding or one that matches the specific input.
VB.NET:
Dim SR As New IO.StreamReader(pathfile, System.Text.Encoding.Default)
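If the input file is always written with one known ANSI codepage rather than whatever the local machine default happens to be, the encoding can also be named explicitly. A small sketch (1252, the Western European ANSI codepage, is only an example and is not confirmed anywhere in this thread):
VB.NET:
' Hypothetical alternative: name the codepage explicitly instead of using the machine default.
Dim SR As New IO.StreamReader(pathfile, System.Text.Encoding.GetEncoding(1252))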
 
John,

thanks for pointing me to this thread -- I had already found out about Encoding.Default and corrected the problem.

The problem I have now is this --

- why can't I use the encoding that is detected? I had wanted to code the app so it would detect the encoding and then set it properly for StreamReader.

The files that have these diacritical marks in them are simple "plain text" files. myReader.CurrentEncoding reports that they are UTF-8 encoded, but the bytes with the high bit set ("extended" ASCII or ANSI, whatever it is called) are not valid UTF-8 sequences, so those characters are lost when the file is decoded.

Below is the test Console app I used to detect the encoding of the plain text files.

VB.NET:
' Option Strict On
Imports System
Imports System.IO
Imports System.Text

Module modMain
    Sub Main()
        Dim path As String = "textfile.txt"
        Try
            Dim myReader As StreamReader = New StreamReader(path, True)
            Do While myReader.Peek() >= 0
                Console.Write(Convert.ToChar(myReader.Read()))
            Loop
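            ' CurrentEncoding only reflects the detected encoding after the first Read.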
            Console.WriteLine("The encoding used was {0}.", myReader.CurrentEncoding)
            Console.WriteLine()
            myReader.Close()
        Catch e As Exception
            Console.WriteLine("The process failed: {0}", e.ToString())
        End Try
    End Sub
End Module
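One possible workaround for BOM-less files like these (a sketch only, not code from this thread, and it assumes that anything which is not valid UTF-8 is ANSI text in the system codepage): decode with a strict UTF-8 decoder first and fall back to Encoding.Default when that fails.
VB.NET:
Imports System.IO
Imports System.Text

Module modDetectSketch
    ' Sketch: try strict UTF-8 first; if the bytes are not valid UTF-8
    ' (typical for ANSI text containing accented characters), fall back
    ' to the system ANSI codepage. Does not strip a UTF-8 BOM if present.
    Function ReadAllTextWithFallback(ByVal path As String) As String
        Dim bytes As Byte() = File.ReadAllBytes(path)
        Dim strictUtf8 As New UTF8Encoding(False, True) ' True = throw on invalid bytes
        Try
            Return strictUtf8.GetString(bytes)
        Catch ex As DecoderFallbackException
            Return Encoding.Default.GetString(bytes)
        End Try
    End Function
End Module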
 
I perhaps haven't read this properly, so apologies if I haven't, as I'm just off out to the bakers for lunch. But you can set the encoding for the StreamReader in its constructor via one of its overloads.
 
This is the most significant info in the StreamReader documentation about encoding detection:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
The link to the Encoding.GetPreamble page further explains the different encodings' byte order marks, but it is a little cryptic about which common encodings and character sets are supported. (For this I think you have to research how Unicode maps other character sets, perhaps here: http://www.unicode.org/) When I test with Norwegian characters in Notepad and save with different encodings, everything saved as UTF/Unicode is detected and displays fine; ANSI is also detected as UTF-8 but does not display these characters (the same problem you asked about).
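For reference, the byte order marks that the auto-detection looks for can be printed straight from the Encoding classes. A small sketch (output: UTF-8 = EF BB BF, UTF-16 little-endian = FF FE, UTF-16 big-endian = FE FF):
VB.NET:
Imports System
Imports System.Text

Module modPreambleSketch
    Sub Main()
        ' Print the preamble (BOM) bytes each encoding writes at the start of a file.
        For Each enc As Encoding In New Encoding() {Encoding.UTF8, Encoding.Unicode, Encoding.BigEndianUnicode}
            Console.Write(enc.EncodingName & ": ")
            For Each b As Byte In enc.GetPreamble()
                Console.Write(b.ToString("X2") & " ")
            Next
            Console.WriteLine()
        Next
    End Sub
End Module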

Also, the console output encoding can be set with the Console.OutputEncoding property. When I did these tests I did not change it; the characters displayed with the Western European (DOS) encoding, codepage IBM850.
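For completeness, a one-line sketch of changing it (850 is just the OEM codepage mentioned above; any installed codepage number could be used):
VB.NET:
' Sketch: switch console output to a specific codepage before writing accented characters.
Console.OutputEncoding = System.Text.Encoding.GetEncoding(850)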
 