Question: Down-Convert a UTF String into ASCII Using StringBuilder

Zabalba

New member
Joined
Sep 25, 2013
Messages
2
Programming Experience
1-3
I'm having a hard time extracting the ASCII character conversion from a StringBuilder. I can see that the Normalize call worked, but when I try to set the variable to the contents of the StringBuilder, I get back the original input. I've tried to translate some C# solutions I found to VB.NET, but to no avail.
VB.NET:
Public Sub ASCIIConverter()

    Dim TestString As String = "Caf?"
    Dim ASCIIConverter As New StringBuilder

    ASCIIConverter.Clear()
    ASCIIConverter.Append(TestString.Normalize(NormalizationForm.FormKD)).ToString()
    TestString = ASCIIConverter.ToString

End Sub
 
What you're asking for doesn't really make sense. Strings are just Strings; they are not inherently UTF or ASCII. If a String contains Unicode characters then I guess you could call it a Unicode String, but if a String contains just ASCII characters then it's still a Unicode String, because all ASCII characters are also Unicode characters. The "Caf?" example that you've provided contains nothing but ASCII characters, so what exactly are you expecting to happen?

The Normalize method that you're using is about changing the binary representation of certain Unicode characters, NOT changing the characters themselves. Whenever you call Normalize on a String, the result will look exactly like the original, although the binary representation may be different.
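
To see the difference between the characters and their binary representation, here is a minimal sketch (assuming "Café" is typed with the precomposed é, U+00E9):

VB.NET:
' Requires: Imports System.Text (for NormalizationForm)
' "Café" here is spelled with the precomposed é (U+00E9).
Dim original As String = "Café"
' FormD splits é into "e" plus a combining acute accent (U+0301).
Dim decomposed As String = original.Normalize(NormalizationForm.FormD)
Console.WriteLine(original.Length)        ' 4
Console.WriteLine(decomposed.Length)      ' 5
Console.WriteLine(original = decomposed)  ' False under the default binary comparison
Console.WriteLine(decomposed)             ' still displays as "Café"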
 


I'm sorry if I was not clear enough in my original post. In fact, the "Caf?" example I was trying to use seems to have been down-converted to ASCII by this board; I was trying to use the word "Café", with the accent on top of the "e". I learned a lot about encoding while researching my issue. You are correct that Strings are just one kind of datatype. I would have to assume that the String datatype uses Unicode in order to hold all possible valid characters in the first place.
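
As a quick sketch of that point (a .NET String is a sequence of 16-bit UTF-16 code units, so it can hold any Unicode character):

VB.NET:
' Each Char in a .NET String is a 16-bit UTF-16 code unit.
Dim s As String = "Café"
Console.WriteLine(AscW(s(3)))  ' 233, i.e. U+00E9, the code point of é
Console.WriteLine(s.Length)    ' 4 Chars, regardless of any byte encoding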


That sounds a bit more likely but, even then, it doesn't seem to match the request.

@Zabalba, you seem to be expecting the characters of the String to change, but that wouldn't make sense. No matter what encoding you use in the binary representation, it's still going to produce the same characters in the String; that's the whole point. An encoding is a way of mapping characters to bytes and vice versa. The same characters will map to a different set of bytes depending on what encoding you use, but the characters themselves are still the same.
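
A short sketch of that mapping, using the "Café" example from this thread:

VB.NET:
' Requires: Imports System.Text
Dim text As String = "Café"
Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(text)      ' 5 bytes: é becomes C3 A9
Dim utf16Bytes As Byte() = Encoding.Unicode.GetBytes(text)  ' 8 bytes: 2 per Char
' Different byte counts, but decoding each with its own encoding
' recovers the identical String.
Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes) = Encoding.Unicode.GetString(utf16Bytes)) ' True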

My original request was to strip the accents from characters, reducing them to their base form. From my limited understanding of how these encodings were created in the first place, they were built on top of each other. From what I understand, Latin accented characters all have a base character, so, going back to my failed example, the accented "é"'s base character is the regular "e". I understand full well that this strips information from the original meaning of a word or phrase; I wanted to have an option in my text parser to strip it if need be.

After researching the topic some more I found my solution, but I had to adapt it from C#. This will down-convert the characters in the string to their ASCII form.

VB.NET:
Imports System.Text
Imports System.Globalization

Public Function ASCIIConverter(ByVal strInput As String) As String
    Dim sbASCIIConverter As New StringBuilder

    ' FormKD decomposes accented characters into a base character
    ' followed by one or more combining marks.
    For Each c As Char In strInput.Normalize(NormalizationForm.FormKD)
        ' Skip the combining marks (the accents) and keep everything else.
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            sbASCIIConverter.Append(c)
        End If
    Next

    Return sbASCIIConverter.ToString().Normalize(NormalizationForm.FormKD)
End Function
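
For example, a quick usage sketch with the word from my earlier example:

VB.NET:
Dim stripped As String = ASCIIConverter("Café")
Console.WriteLine(stripped) ' prints "Cafe"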

I hope that anyone who needs it can use it.
 