Case-Fixing text with Exceptions

speshulk926

First off, this is my first post! =) And now for my problem...


I have a Letter Data Import program that imports data from a text file into a SQL database. All of the text is in uppercase and needs to be converted to mixed case. That part works fine, but there are "exceptions": O'Henry, for example, needs to be spelled like that and not O'henry. I have an exceptions table set up with a bunch of different names in it, but it runs very slowly: 134 records took nearly 5 minutes. Could someone look over what I have and suggest improvements? Maybe there's a faster way to search that I just don't know about. Any suggestions or help would be great!

There are quite a few letters per letter file with about 18-40 columns per letter. This is a daily job, so speed is a must.

Here's my CaseFix function, which searches through an ArrayList of about 850 exceptions, named Exceptions. Without the exceptions part it runs at lightning speed; with it, at a snail's pace.

VB.NET:
    Public Function CaseFix(ByVal TextToFix As String, ByVal TextFieldName As String) As String
        'variable used to split text if there is more than 1 word
        Dim aText() As String
        'variable used to split the fields that have a special case and should ignore an exception
        Dim aFieldCase As String() = Nothing
        Dim NewText As String = ""
        Dim MultiText As String = ""
        Dim exception As Boolean = False

        'Loop through the Fields that have special field requirements like upper and lower case
        For i As Integer = 0 To FieldCase.Count - 1
            'split the fields leaving 0 = FieldName and 1 = "U" for Uppercase and "L" for lowercase
            aFieldCase = Split(FieldCase(i), "/")
            'if the Field Name matches what field name we are currently on, then Change to Upper or Lowercase
            If LCase(aFieldCase(0)) = LCase(TextFieldName) Then
                If aFieldCase(1) = "U" Then
                    Return UCase(TextToFix)
                ElseIf aFieldCase(1) = "L" Then
                    Return LCase(TextToFix)
                End If
            End If
        Next

        'split the text if there is more than 1 word
        aText = Split(TextToFix, " ")
        'if there is only 1 word then UBound(aText) will = 0, otherwise, we want to parse each word
        If UBound(aText) > 0 Then
            'iterate through each word
            For iSplit As Integer = 0 To UBound(aText)
                'if it's a number, then skip it... we never need to MixCase a number
                If IsNumeric(aText(iSplit)) = False Then
                    'iterate through the exceptions and see if 1 matches
                    For iEx As Integer = 0 To Exceptions.Count - 1
                        'if the exception matches, then add the exception to the NewText variable and exit the Loop
                        If LCase(aText(iSplit)) = LCase(Exceptions(iEx)) Then
                            NewText = Exceptions(iEx)
                            exception = True
                            Exit For
                        Else
                            exception = False
                        End If
                    Next
                    'if there were no exceptions then we can just Capitalize the first letter
                    If exception = False Then
                        For i As Integer = 1 To aText(iSplit).Length
                            If NewText = "" Then
                                NewText = NewText & UCase(Mid(aText(iSplit), i, 1))
                            Else
                                NewText = NewText & LCase(Mid(aText(iSplit), i, 1))
                            End If
                        Next
                    End If
                    'append the word whether it came from the exceptions list or was re-cased
                    MultiText = MultiText & NewText & " "
                    NewText = ""
                Else
                    'numbers are copied through unchanged
                    MultiText = MultiText & aText(iSplit) & " "
                End If
            Next
            'after all words have been scanned, return all words
            Return MultiText
        Else
            'only 1 word, so do the same as above
            For iEx As Integer = 0 To Exceptions.Count - 1
                If LCase(TextToFix) = LCase(Exceptions(iEx)) Then
                    NewText = Exceptions(iEx)
                    exception = True
                    Exit For
                Else
                    exception = False
                End If
            Next
            If exception = False Then
                For i As Integer = 1 To TextToFix.Length
                    If NewText = "" Then
                        NewText = NewText & UCase(Mid(TextToFix, i, 1))
                    Else
                        NewText = NewText & LCase(Mid(TextToFix, i, 1))
                    End If
                Next
            End If
        End If
        'after the word has been scanned, it will be returned.
        Return NewText
    End Function
 
There are a huge number of string operations in there that involve splitting, replacing, concatenating and so on. You really should be using a StringBuilder for the output and a Dictionary(Of String, String) for your exceptions. You may not know that strings are immutable: once created, they cannot be changed, so every concatenation builds a whole new string.
If you had a string that started off as one character and then, in a million-iteration loop, grew it to a megabyte's worth of characters with str = str & newStr, that string would be copied a million times. Each copy is, on average, half a megabyte, so roughly half a terabyte of memory traffic in total is burned by that concatenation. No wonder it's slow! Even RAM takes some time to copy 0.5 terabyte!
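
To make the arithmetic above concrete, here is a minimal sketch. The module and function names are made up for illustration, and it assumes one character is appended per pass. It contrasts the two ways of building a string and counts how many characters the concatenation approach has to copy.

```vbnet
Imports System
Imports System.Text

Module ConcatSketch
    ' Builds a string with &; every pass allocates a new string and copies
    ' everything accumulated so far.
    Function BuildWithConcat(ByVal count As Integer) As String
        Dim s As String = ""
        For i As Integer = 1 To count
            s = s & "x"
        Next
        Return s
    End Function

    ' Builds the same string with a StringBuilder, which appends into a
    ' growable buffer instead of copying the whole string each time.
    Function BuildWithStringBuilder(ByVal count As Integer) As String
        Dim sb As New StringBuilder(count)
        For i As Integer = 1 To count
            sb.Append("x"c)
        Next
        Return sb.ToString()
    End Function

    ' Characters copied by the & version: 1 + 2 + ... + count.
    Function CharsCopiedByConcat(ByVal count As Long) As Long
        Return count * (count + 1) \ 2
    End Function
End Module
```

For a million one-character appends, CharsCopiedByConcat(1000000) comes to 500,000,500,000 characters, which is where the "half a terabyte" figure comes from if you count one byte per character.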

Drop all the old VB6isms too:

Dim s As String = "HelO wORlD"
Dim lowered As String = s.ToLower()

more:

s.ToUpper() not UCase(s)
s.Split(" "c) not Split(s, " ")
s.Length not Len(s), and arr.Length - 1 (or arr.GetUpperBound(0)) not UBound(arr)

Another thing I noticed is that the code is so complex and hard to read that I didn't really want to get too deep into it. When a human gets confused by it, that's usually a sign that something is ripe for paring down and cleaning up.
 
Thank you for your honesty. I did come from a VB6 background and was thrown into .NET. I did a LOT of reading on this subject; let me know if this looks any better. I am still using the ArrayList since all the code is already in place for that, BUT I am now doing a search instead of iterating through each entry, and it is running much faster.

I added the StringBuilder calls and now I keep getting an "Object reference not set to an instance of an object" error. I am getting it at the "If NewText.ToString = "" Then" line. How can I tell if the StringBuilder is empty when nothing has been written to it? And on top of that, how do I empty it out when I am done?

VB.NET:
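'only 1 word, so do the same as above
'BinarySearch needs Exceptions to be sorted with the same comparer; it returns
'the index (0 or greater) when found and a negative number when not found
idx = Exceptions.BinarySearch(TextToFix, New CaseInsensitiveComparer())
If idx >= 0 Then
  'NewText is now a StringBuilder, so append rather than assign
  NewText.Append(CStr(Exceptions(idx)))
  exception = True
Else
  exception = False
End If
If exception = False Then
  For i As Integer = 1 To TextToFix.Length
    If NewText.ToString = "" Then
      NewText.Append(Mid(TextToFix, i, 1).ToUpper)
    Else
      NewText.Append(Mid(TextToFix, i, 1).ToLower)
    End If
  Next
End If
End If 'closes the outer "If UBound(aText) > 0" block, not shown here
'after the word has been scanned, it will be returned.
Return NewText.ToString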
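
On the StringBuilder question: the declaration is not shown in the post, so this is an assumption, but that error usually means the variable was declared without New and is still Nothing. A minimal sketch (module and function names are hypothetical) showing one way to create the builder, test whether it is empty, and empty it for reuse:

```vbnet
Imports System
Imports System.Text

Module StringBuilderSketch
    Function Demo() As String
        ' "Dim sb As StringBuilder" alone leaves sb = Nothing;
        ' "As New" actually creates the instance.
        Dim sb As New StringBuilder()

        ' Length = 0 tests for emptiness without building a string first.
        Dim wasEmpty As Boolean = (sb.Length = 0)

        sb.Append("Henry")

        ' Setting Length to 0 empties the builder so it can be reused.
        sb.Length = 0
        sb.Append("O'Henry")

        Return wasEmpty.ToString() & ":" & sb.ToString()
    End Function
End Module
```

Demo() returns "True:O'Henry": the builder starts out empty, and the first append is discarded by resetting Length.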
 
I had in mind:

Get your block of text
Proper case it
Split it
For each word, look up the word in a dictionary
If found, add the dictionary word to the stringbuilder
Else, add the original word
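
The steps above could be sketched roughly as follows. This is only one possible reading: the names CaseFixSketch, FixCase and ExceptionLookup and the sample exception words are invented for illustration, the proper-casing is folded into the per-word loop instead of being a separate pass, and the per-field upper/lower rules from the original function are left out.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module CaseFixSketch
    ' Hypothetical sample data; in the thread the exceptions come from a table.
    Private ReadOnly ExceptionLookup As Dictionary(Of String, String) = _
        BuildLookup(New String() {"O'Henry", "McDonald", "USA"})

    ' Case-insensitive keys, so "O'HENRY" and "o'henry" both find "O'Henry".
    Private Function BuildLookup(ByVal words As IEnumerable(Of String)) As Dictionary(Of String, String)
        Dim lookup As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
        For Each word As String In words
            lookup(word) = word
        Next
        Return lookup
    End Function

    Function FixCase(ByVal textToFix As String) As String
        Dim words() As String = textToFix.Split(" "c)
        Dim result As New StringBuilder(textToFix.Length)
        For i As Integer = 0 To words.Length - 1
            If i > 0 Then result.Append(" "c)
            Dim word As String = words(i)
            Dim replacement As String = Nothing
            If ExceptionLookup.TryGetValue(word, replacement) Then
                ' One hash lookup replaces the loop over ~850 exceptions.
                result.Append(replacement)
            ElseIf word.Length > 0 Then
                ' Upper-case the first character, lower-case the rest.
                ' Digits pass through both calls unchanged.
                result.Append(Char.ToUpperInvariant(word(0)))
                result.Append(word.Substring(1).ToLowerInvariant())
            End If
        Next
        Return result.ToString()
    End Function
End Module
```

With the sample data, FixCase("JOHN O'HENRY 123 MAIN ST") gives "John O'Henry 123 Main St". The lookup is built once, so each word costs a single dictionary probe no matter how long the exceptions list grows.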
 
Sorry, I think I added this after you had started responding...

"I added the StringBuilder calls and now I keep getting an "Object reference not set to an instance of an object" error. I am getting it at the "If NewText.ToString = "" Then" line. How can I tell if the StringBuilder is empty when nothing has been written to it? And on top of that, how do I empty it out when I am done?"
 
Here is a RegularExpressions version:
VB.NET:
'Imports System.Text.RegularExpressions

Function CaseText(ByVal input As String) As String
    Dim newText As New System.Text.StringBuilder(input.ToLower)
    Dim specialCases As New List(Of String)
    specialCases.Add(".NET")
    specialCases.Add("Framework")
    For Each sc As String In specialCases
        newText = newText.Replace(sc.ToLower, sc)
    Next
    Dim regSentence As String = "\w.*?[\.\!\?]"
    For Each m As Match In Regex.Matches(input, regSentence, RegexOptions.Singleline)
        'upper-case the first character of each sentence in place
        newText(m.Index) = Char.ToUpper(newText(m.Index))
    Next
    Return newText.ToString
End Function
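Here a sentence is defined as a sequence starting with a 'word' character (\w: a letter, digit or underscore), followed by the shortest run of any characters (including line breaks, because of RegexOptions.Singleline) that ends with one of the punctuation characters [.?!]. Text with no closing punctuation is not matched, so it is left in lower case.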
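
To see what that pattern actually matches, here is a small sketch (the module and function names are hypothetical) that just collects the start index of each sentence it finds:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SentenceRegexSketch
    ' Returns the index of the first character of every sentence matched by
    ' the same pattern used in CaseText.
    Function SentenceStarts(ByVal input As String) As List(Of Integer)
        Dim starts As New List(Of Integer)
        For Each m As Match In Regex.Matches(input, "\w.*?[\.\!\?]", RegexOptions.Singleline)
            starts.Add(m.Index)
        Next
        Return starts
    End Function
End Module
```

For "hello world. how are you? fine" this yields the indexes 0 and 13; the trailing "fine" has no closing punctuation, so it produces no match.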
 