Please help me optimize this string -> double parsing method I'm writing

lordofduct

I'm writing my own parsing method to parse a String to a Double. I have a lot of requirements for this method, so none of the existing parsing methods actually work.


Here are the requirements:

Like TryParse, on most failures we should return 0. Things like stack overflow and the sort still bubble out.

Unlike TryParse, I don't want to pass the return value as a reference. Instead it just returns the value... like say Convert.ToDouble(...) does.

Accept optional NumberStyles in case the user wants to modify that.

must be able to support leading/trailing whitespace (the strings come from xml so there is no guarantee there isn't leading/trailing whitespace)

must be able to support hex values dynamically... meaning if the numeric value in the string starts with 0x, #, or &H... it automatically drops into hex mode.

Is numeric greedy, meaning that it will read through the string up until it hits the last numeric character and convert that... just cutting off any dangling non-numeric characters. For instance " 12zzz" returns 12 because it just slices off the zzz


Anyway, this works... it's about 20-25 times slower than regular old Double.TryParse... as is expected (I use Regex and the like). I wonder if anyone has any suggestions for speeding it up. I don't need it to be on par with Double.TryParse or even near it... that'd be ridiculous as I'm expecting it to do a lot more than TryParse does. But I'd like to try and speed it up just a little bit more.


VB.NET:
        Public Shared Function ToDouble(ByVal value As String, Optional ByVal style As System.Globalization.NumberStyles = 111) As Double

            If value Is Nothing Then Return 0

            Try
                ''\s* (not \s+) so the prefix is also found when there is no leading whitespace
                Dim m As Match = Regex.Match(value, "^\s*(0x|#|&H)")

                If m.Value <> "" Then
                    value = value.Substring(m.Length)

                    m = Regex.Match(value, "^[-+]?[0-9a-fA-F]*")

                    If m.Value <> "" Then
                        Dim lng As Long

                        ''strip off any style bits that aren't compatible with HexNumber
                        style = style And System.Globalization.NumberStyles.HexNumber
                        ''make sure we allow hex values
                        style = style Or System.Globalization.NumberStyles.AllowHexSpecifier

                        Long.TryParse(m.Value, style, Nothing, lng)
                        Return CDbl(lng)
                    End If

                Else

                    ''\s* (not \s+) so input without leading whitespace still matches
                    m = Regex.Match(value, "^\s*[-+]?[0-9]*\.?[0-9]+")

                    If m.Value <> "" Then
                        Dim dbl As Double
                        Double.TryParse(m.Value, style, Nothing, dbl)
                        Return dbl
                    End If

                End If

            Catch ex As OverflowException
                ''If it was an OverFlowException, bubble it out
                ''a bare Throw rethrows without resetting the stack trace
                Throw

            Catch ex As Exception
                ''if it was any other exception, just return 0
                Return 0

            End Try
        End Function
 
The reason this method is slow is that every value goes through two regex searches and then the numeric parse, though I can't think of an optimization of this approach right now. That accounts for pretty much all of it, as I see it.

Other comments:
must be able to support leading/trailing whitespace
NumberStyles.AllowLeadingWhite and .AllowTrailingWhite are both included in NumberStyles.Number and in .HexNumber.
Optional ByVal style As System.Globalization.NumberStyles = 111
You should turn on Option Strict. The NumberStyles enumeration type doesn't have a 111 member; that should be the value NumberStyles.Number.

I would change all If m.Value <> "" Then to If m.Success Then, I guess there would be a very small performance gain to this since Boolean comparisons are faster.
Like TryParse, on most failures we should return 0.
Actually these functions return a Boolean value, True if parsing is successful, and in that case the parsed value is returned through the ByRef parameter; the value returned there is not relevant for failures. One alternative approach would be to return a Nullable value type, since this is the only way to distinguish a value from the absence of one. 0 is not a failure, it is a valid value for the numeric data types. The regular Parse methods just throw an exception. Though you may have your reasons to convert invalid input to the value 0.
[0-9a-fA-F]
I think calling ToUpper/ToLower on the string first and then letting the regex search a narrower range of chars would improve performance just a bit.

You have everything in a Try-Catch block, which makes code perform (slightly) slower, but as I see it there are no calls in that code that would throw an exception, so it is not necessary. If it was possible that CDbl(lng) could throw then I would only Try that one (for the purpose of handling), but it doesn't. Regex.Match(String, String) only throws an exception if input is a null reference, which you already validated. The Substring call will logically not throw.
If value Is Nothing
What if it is an empty string, or only contains whitespace - is it necessary to continue? There are String.IsNullOrEmpty and String.IsNullOrWhiteSpace methods that can be used; the latter is probably best since your input data could contain only whitespace.
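
A minimal sketch of the Nullable return idea mentioned in this reply, not code from the thread. The name ToNullableDouble is hypothetical, and the invariant culture is an assumption made so the example behaves the same on every machine. Nothing means "could not parse", so a genuine 0 stays distinguishable from a failure.

```vbnet
Imports System
Imports System.Globalization

Module NullableParseSketch

    ' Hypothetical helper (not from the thread): Nothing signals a failed parse.
    Public Function ToNullableDouble(ByVal value As String, Optional ByVal style As NumberStyles = NumberStyles.Number) As Nullable(Of Double)
        Dim result As Double
        If Double.TryParse(value, style, CultureInfo.InvariantCulture, result) Then
            Return result
        End If
        Return Nothing
    End Function

    Sub Main()
        ' A valid zero still has a value.
        Console.WriteLine(ToNullableDouble(" 0 ").HasValue)
        ' Unparseable input has no value.
        Console.WriteLine(ToNullableDouble("abc").HasValue)
        ' Callers that want the "0 on failure" behaviour can collapse the result.
        Console.WriteLine(ToNullableDouble("abc").GetValueOrDefault())
    End Sub

End Module
```

Whether the extra HasValue check is worth it depends on whether callers ever need to tell "0" apart from "not a number".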
 
Thanks, some of your suggestions were helpful.

The reason this method is slow is that every value goes through two regex searches and then the numeric parse, though I can't think of an optimization of this approach right now. That accounts for pretty much all of it, as I see it.

Of course, that's why I don't expect it to get close to TryParse or anything.

Other comments:

NumberStyles.AllowLeadingWhite and .AllowTrailingWhite are both included in NumberStyles.Number and in .HexNumber.

You should turn on Option Strict. The NumberStyles enumeration type doesn't have a 111 member; that should be the value NumberStyles.Number.
Supporting whitespace was just one of the requirements I listed when designing it. The reason I pass 111 in is that it is NumberStyles.Number; that's the value my compiler shows for it:

Globalization.NumberStyles.Number == 111

I would change all If m.Value <> "" Then to If m.Success Then, I guess there would be a very small performance gain to this since Boolean comparisons are faster.

doh...

Actually these functions return a Boolean value, True if parsing is successful, and in that case the parsed value is returned through the ByRef parameter; the value returned there is not relevant for failures. One alternative approach would be to return a Nullable value type, since this is the only way to distinguish a value from the absence of one. 0 is not a failure, it is a valid value for the numeric data types. The regular Parse methods just throw an exception. Though you may have your reasons to convert invalid input to the value 0.

From the MSDN documentation for Integer.TryParse(...)
"When this method returns, contains the 32-bit signed integer value equivalent to the number contained in s, if the conversion succeeded, or zero if the conversion failed."

This is what I meant when I said returns 0 if failed.

I think calling ToUpper/ToLower on the string first and then letting the regex search a narrower range of chars would improve performance just a bit.

You really think performing a tolower or toupper would be faster... I was going to just convert it to a case-insensitive regex search, just haven't put it in because well... I'm not all that great at writing regex statements (kind of been teaching myself slowly).

You have everything in a Try-Catch block, which makes code perform (slightly) slower, but as I see it there are no calls in that code that would throw an exception, so it is not necessary. If it was possible that CDbl(lng) could throw then I would only Try that one (for the purpose of handling), but it doesn't. Regex.Match(String, String) only throws an exception if input is a null reference, which you already validated. The Substring call will logically not throw.

It wasn't in a Try catch block until just before I posted here. I thought maybe it would be slightly slower, but I had no definitive proof of it.

What if it is an empty string, or only contains whitespace - is it necessary to continue? There are String.IsNullOrEmpty and String.IsNullOrWhiteSpace methods that can be used; the latter is probably best since your input data could contain only whitespace.

Now that's useful! (though I didn't find a String.IsNullOrWhiteSpace method cause I'm using .Net 3.5 right now, and I can't move up as we aren't moving up at work).




Anyway, here's a slight update to it... if I don't get it any faster, I'm not too worried about it. It's faster than I expected with all that regex in it; it just would be nice.


VB.NET:
    Public Shared Function ToDouble(ByVal value As String, Optional ByVal style As System.Globalization.NumberStyles = Globalization.NumberStyles.Number, Optional ByVal provider As IFormatProvider = Nothing) As Double

        If String.IsNullOrEmpty(value) Then Return 0

        ''\s* (not \s+) so the prefix is also found when there is no leading whitespace
        Dim m As Match = Regex.Match(value, "^\s*(0x|#|&H)")

        If m.Success Then
            value = value.Substring(m.Length)

            m = Regex.Match(value, "^[-+]?[0-9A-F]*", RegexOptions.IgnoreCase)

            If m.Success Then
                Dim lng As Long
                style = (style And System.Globalization.NumberStyles.HexNumber) Or System.Globalization.NumberStyles.AllowHexSpecifier
                Long.TryParse(m.Value, style, provider, lng)
                Return CDbl(lng)
            End If

        Else

            ''\s* (not \s+) so input without leading whitespace still matches
            m = Regex.Match(value, "^\s*[-+]?[0-9]*\.?[0-9]+")

            If m.Success Then
                Dim dbl As Double
                Double.TryParse(m.Value, style, provider, dbl)
                Return dbl
            End If

        End If

        Return 0

    End Function

Oh, I used "IgnoreCase" in the regex instead because it turned out faster than ToUpper or ToLower.


In any case though, like you said JohnH, I only saved a minor amount of speed with these adjustments. It's about 18->21 times slower now... still better than 20->25.
 
I'm really noticing that the biggest cost is the ability to slice off any trailing characters. If I dropped that requirement I could remove both of the Regex.Match calls that look for numeric values, and I'd be about 9 times slower than regular old TryParse.

but alas, that and the 0x/#/&H are the main driving force behind this.
 
though I didn't find a String.IsNullOrWhiteSpace method cause I'm using .Net 3.5 right now
True, didn't notice that was new.
The reason I pass 111 in is because it is NumberStyles.Number, that's what my compiler has it set to:

Globalization.NumberStyles.Number == 111
Option Strict forces you to write type-safe code, preventing accidental type conversion errors. You could have accidentally put 1111 there and wouldn't know until runtime when it crashed. The underlying primitive values of enumerations have no significant meaning when writing code, and IntelliSense offers the valid options in its drop-down suggestion list, so there is no extra typing involved in handling enumeration values correctly. To anyone reading the code (even yourself) the Integer value 111 has no meaning; it would be a tedious job going through all the NumberStyles enumeration members to find out which member or combination it could possibly refer to. With the value NumberStyles.Number you have it right there.

From the MSDN documentation for Integer.TryParse(...)
A return value indicates whether the operation succeeded.
Return Value
Type: System.Boolean

true if s was converted successfully; otherwise, false.
If the conversion fails (False) the ByRef value is simply reset to its default value and does not reflect an actually converted value. This is the whole point of the TryParse methods, since they, as opposed to the Parse methods, don't express failures by throwing exceptions.

About performance again, you can use a single regex that captures either a hex or a regular number (regex this | that). Using named groups you can identify the hex qualifier and get the captured value for either, for example:
VB.NET:
Dim m = Regex.Match("  #af12.0abc", "^\s+((?<hex>0x|#|&H)(?<value>[0-9A-F]*))|(?<value>[-+]?[0-9]*\.?[0-9]+)", RegexOptions.IgnoreCase)
If Not m.Groups("value").Success Then Return
Dim value As String = m.Groups("value").Value
If m.Groups("hex").Success Then
    'handle value as hex
    '(captured "af12")
Else
    'handle value as numeric
    '(without # in input captured "12.0")
End If
I left out the sign for hex; parsing as HexNumber will not allow it. If you need it, include it as an optional named group and handle it afterwards.
 
You may get better performance if you do most of the checks yourself, i.e. turn your string into an array and start skipping over it..

Skip spaces, if the first chars are 0x or &H or # then set hex mode, keep skipping and find the start of the number, keep skipping to find the end of the number, then do your parse having set all the options..

As in:
VB.NET:
Dim s As Char() = myString.ToCharArray() 'I'm a C# guy, so the VB syntax here is a best effort

'Valid indexes run from 0 to Length - 1
For i As Integer = 0 To s.Length - 1
    If s(i) = " "c Then Continue For

    '..
Next
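
To make the skipping idea above concrete, here is a rough sketch rather than a tuned implementation. ScanToDouble and IsAsciiDigit are hypothetical names. It assumes the invariant culture, no exponent or thousands separators, and hex prefixes #, 0x and &H matched case-insensitively. It indexes the string directly instead of copying it to a Char array, and only hands the final slice to TryParse.

```vbnet
Imports System
Imports System.Globalization

Module ManualScanSketch

    Private Function IsAsciiDigit(ByVal c As Char) As Boolean
        Return c >= "0"c AndAlso c <= "9"c
    End Function

    ' Hypothetical helper: returns 0 when no number is found.
    Public Function ScanToDouble(ByVal value As String) As Double
        If value Is Nothing Then Return 0

        Dim n As Integer = value.Length
        Dim i As Integer = 0

        ' Skip leading whitespace (spaces, tabs, newlines).
        While i < n AndAlso Char.IsWhiteSpace(value(i))
            i += 1
        End While

        ' Detect a hex prefix: #, 0x or &H (case-insensitive here, an assumption).
        Dim prefixLength As Integer = 0
        If i < n AndAlso value(i) = "#"c Then
            prefixLength = 1
        ElseIf i + 1 < n AndAlso value(i) = "0"c AndAlso Char.ToUpperInvariant(value(i + 1)) = "X"c Then
            prefixLength = 2
        ElseIf i + 1 < n AndAlso value(i) = "&"c AndAlso Char.ToUpperInvariant(value(i + 1)) = "H"c Then
            prefixLength = 2
        End If

        Dim start As Integer
        If prefixLength > 0 Then
            ' Hex mode: collect hex digits and stop at the first other character.
            start = i + prefixLength
            i = start
            While i < n AndAlso Uri.IsHexDigit(value(i))
                i += 1
            End While
            Dim hexValue As Long
            If i > start Then
                Long.TryParse(value.Substring(start, i - start), NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, hexValue)
            End If
            Return CDbl(hexValue)
        End If

        ' Decimal mode: optional sign, digits, optional fraction.
        start = i
        If i < n AndAlso (value(i) = "+"c OrElse value(i) = "-"c) Then i += 1
        Dim digitCount As Integer = 0
        While i < n AndAlso IsAsciiDigit(value(i))
            i += 1
            digitCount += 1
        End While
        ' Take the decimal point only when a digit follows it.
        If i + 1 < n AndAlso value(i) = "."c AndAlso IsAsciiDigit(value(i + 1)) Then
            i += 1
            While i < n AndAlso IsAsciiDigit(value(i))
                i += 1
                digitCount += 1
            End While
        End If
        If digitCount = 0 Then Return 0

        Dim result As Double
        Double.TryParse(value.Substring(start, i - start), NumberStyles.AllowLeadingSign Or NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, result)
        Return result
    End Function

    Sub Main()
        Console.WriteLine(ScanToDouble(" 12zzz"))
        Console.WriteLine(ScanToDouble("#FF"))
        Console.WriteLine(ScanToDouble("  -3.5abc"))
    End Sub

End Module
```

Whether this beats the regex version would need measuring on the target machine; the sketch only shows the shape of the approach.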
 
It seems to me that you are trying to emulate the legacy Val() function.

Val() will return a type Double number using a string argument.
If the entire string is non-numeric or blank, it will return 0.
It will convert the left part of the string which is numeric and ignore any non-numeric characters after that.
It will ignore any leading, trailing, or embedded spaces or tabs.
It will convert a Hexadecimal value which begins with &H.
It will also convert scientific notation that includes E or D.

Try entering a variety of inputs in this sample Console application:

VB.NET:
Module Module1

    Sub Main()
        Dim snum As String, mynum As Double
        Console.Write("Enter a number:  ")
        snum = Console.ReadLine()
        mynum = Val(snum)
        Console.WriteLine(mynum)
        Console.ReadLine()
    End Sub

End Module
 
@JohnH - turns out that one long regex ends up extremely slow at times, especially if I pass in just a regular double with fractional values... e.g. " 2.001 ". This can be up to 3 times slower than the previous version, which works out to about 60 times slower than TryParse...

BUT, it's one damn elegant regex IMO. Great learning experience, I had no idea one could group results like that!



Oh but I did save just a few more cycles by using Trim(...) instead of \s in the regex.


VB.NET:
    Public Shared Function ToDouble_03(ByVal value As String, Optional ByVal style As System.Globalization.NumberStyles = Globalization.NumberStyles.Number, Optional ByVal provider As IFormatProvider = Nothing) As Double

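        ''String.Trim() strips all whitespace (tabs, newlines); VB's Trim() function only strips spaces
        If value IsNot Nothing Then value = value.Trim()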

        If String.IsNullOrEmpty(value) Then Return 0

        Dim m As Match = Regex.Match(value, "^(0x|#|&H)")

        If m.Success Then
            value = value.Substring(m.Length)

            m = Regex.Match(value, "^[-+]?[0-9A-F]*", RegexOptions.IgnoreCase)

            If m.Success Then
                Dim lng As Long
                style = (style And System.Globalization.NumberStyles.HexNumber) Or System.Globalization.NumberStyles.AllowHexSpecifier
                Long.TryParse(m.Value, style, provider, lng)
                Return CDbl(lng)
            End If

        Else

            m = Regex.Match(value, "^[-+]?[0-9]*\.?[0-9]+")

            If m.Success Then
                Dim dbl As Double
                Double.TryParse(m.Value, style, provider, dbl)
                Return dbl
            End If

        End If

        Return 0

    End Function


@other guys - I played around with a lot of different things... I kept coming back to regex though because it tended to be speedier than me manually doing what regex would have been doing.




I ran a loop test of this versus TryParse and on my machine (4-core i7 chip at 3 GHz with 12 GB RAM) it took approximately 2689 milliseconds to run 1 million times, versus 160 milliseconds to do it with TryParse. That's just shy of 17 times slower... I'm happy with that for now.
 
I got a 1:10 ratio with that comparison. If using this function in such a high-volume loop is relevant, you can add RegexOptions.Compiled; this got me a 1:7.5 ratio. In that case you will also get better performance from single Regex instances declared outside the loop; I got about a 1:3 ratio when testing that.
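
A sketch of what "single regex instances declared outside the loop" could look like, based on the ToDouble_03 version posted earlier. The class name ConvertUtilSketch and the field names are hypothetical. One deliberate deviation: the hex pattern drops the optional sign and requires at least one digit, because parsing with AllowHexSpecifier does not accept a sign. The ratios quoted above are not reproduced or verified here.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Public Class ConvertUtilSketch

    ' Built once and reused on every call, instead of being looked up or rebuilt per call.
    Private Shared ReadOnly HexPrefix As New Regex("^(0x|#|&H)", RegexOptions.Compiled)
    Private Shared ReadOnly HexDigits As New Regex("^[0-9A-F]+", RegexOptions.Compiled Or RegexOptions.IgnoreCase)
    Private Shared ReadOnly DecimalNumber As New Regex("^[-+]?[0-9]*\.?[0-9]+", RegexOptions.Compiled)

    Public Shared Function ToDouble(ByVal value As String, Optional ByVal style As NumberStyles = NumberStyles.Number, Optional ByVal provider As IFormatProvider = Nothing) As Double
        If value Is Nothing Then Return 0
        value = value.Trim()
        If value.Length = 0 Then Return 0

        Dim m As Match = HexPrefix.Match(value)
        If m.Success Then
            ' Hex mode: parse the digits that follow the prefix; 0 if there are none.
            m = HexDigits.Match(value.Substring(m.Length))
            Dim lng As Long
            If m.Success Then Long.TryParse(m.Value, NumberStyles.AllowHexSpecifier, provider, lng)
            Return CDbl(lng)
        End If

        ' Decimal mode: parse only the leading numeric part, ignoring trailing characters.
        m = DecimalNumber.Match(value)
        Dim dbl As Double
        If m.Success Then Double.TryParse(m.Value, style, provider, dbl)
        Return dbl
    End Function

End Class
```

The trade-off is that the three compiled regexes stay in memory for the lifetime of the AppDomain, which may not pay off if the function is only called occasionally.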
 
Oh, it's very speedy if I'm passing in a hex value; it ends up MUCH faster than my example. But when passing in a string double, it slows down badly because the regex has to move further through the pattern (attempting each hex prefix, failing all 3 times, then moving on to the double).


Setting RegexOptions.Compiled sped it up slightly... not as much of a gain as you saw...

your ratios... 1:10, what's the comparison? To Double.TryParse?

I'd like to know your setup because I certainly get nowhere near 1:3.

Note I'm only attempting it in a long loop just for basic analysis. It's not the best real world analysis, but it's simple enough to give a rough idea of speed comparison. The use of the function in the end is any normal application when wanting to convert string data. Could be once in a blue moon randomly, or it could be a few times in a row when reading in some data from a server. Just general purpose, I'm not attempting to design for use in JUST loops.


Essentially I'm expecting it to act a lot like how string conversion acts in other places I have to write code as well. For instance in ECMAScript/javascript and AS3 this is exactly how it acts (just add on the # and &H availability). So does it work this way on the proprietary database server I use at work. I'm just trying to allow simple transfer of data to and from that acts and feels the same way in all places. .Net is the only odd man out, so I wrote a ConvertUtil class that compensates for that... it automatically converts all prim types the way I expect them to in our environment.
 
No, I was talking about your method, and only that, compared to plain TryParse.

My combined regex turned out comparable or a little slower, so it didn't show any noticeable performance improvement when I tested.
 
Oh, I get what you're saying now... you got 1:3 by having basically static Regexes for the different match calls that persist between calls to the function, so a new one isn't constructed every time it's called... yeah, that probably would speed it up some. I don't know if I'd want those hanging around in memory like that though, as it won't be used in a loop like this in a real-world application.
 
The cost of first creating and using a regex is significant; the same goes for the Shared calls, whose expressions the regex engine caches. Using three regexes instead of one here could easily cost as much as 500-1000 calls over an application run, more if an expression is pushed out of the cache. At this scale measurement is also difficult: in total this is a 1-2 ms operation and is very exposed to the CPU doing something more important for a cycle or two. Comparisons must be done in ticks, but the results may vary a lot.
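
One way to compare by ticks is System.Diagnostics.Stopwatch, sketched below with Double.TryParse as the call under test. The iteration count and the warm-up call are illustrative choices, not settings taken from the thread; swapping in another function at the marked line gives a like-for-like comparison.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Globalization

Module TimingSketch

    Sub Main()
        Const Iterations As Integer = 1000000
        Dim result As Double

        ' Warm-up call so one-time costs (JIT, first-use setup) are not measured.
        Double.TryParse(" 2.001 ", NumberStyles.Number, CultureInfo.InvariantCulture, result)

        Dim sw As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 1 To Iterations
            ' Call under test: replace with the function being compared.
            Double.TryParse(" 2.001 ", NumberStyles.Number, CultureInfo.InvariantCulture, result)
        Next
        sw.Stop()

        Console.WriteLine("Total ticks: {0}", sw.ElapsedTicks)
        Console.WriteLine("Ticks per call: {0:F3}", sw.ElapsedTicks / Iterations)
        Console.WriteLine("Stopwatch frequency (ticks per second): {0}", Stopwatch.Frequency)
    End Sub

End Module
```

Running it several times and comparing the spread gives a feel for how noisy the numbers are on a given machine.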

i7 cpus also auto-adjust clock speed according to work load, so I guess this will affect your results.
 