Good String Matching Algorithm

Cheetah

Well-known member
Joined
Oct 12, 2006
Messages
232
Programming Experience
Beginner
Hi there,

I need a good string matching algorithm, so it finds the closest match out of a list of strings.

Does anyone know of one?

Thanks.

EDIT: Basically, I am looking for a search algorithm/function
 
Last edited:
Sorry, actual characters/words

Eg.

So If i had a list like this:

myStrings[0] = "The Next Door"
myStrings[1] = "The Next Time"
myStrings[2] = "Next Friday / Friday After Next"
myStrings[3] = "Next (r)"
myStrings[4] = "Advance to the Next Level"

and I was searching for "Next"....it would return myStrings[3].
 
I'm sure there are people out there who know of better search algorithms then me. But, my approach would be to parse out each word that you are searching for and search through the strings. When a match is found then figure out the percentage of the found string that matches the word.

So for example searching for the Word 'search' against the following

myStrings[0] = "search"
myStrings[1] = "searching"

would have myStrings[0] = 100% and myStrings[1] = 66%

Then I would just return the value with the greatest percentage.



Again, I'm sure there is a better approach, I am just taking a stab at it.
 
Thanks for your reply again.

To be honest, I wouldn't really know where to begin and how to progress from there.

I am sort of asking for already existing algorithms where I could pass in the array and search term and it return an index - something that would be more advance than anything i could create.

Basically, I am looking for a search algorithm/function
 
ok. So here is what I could come up with.

VB.NET:
Public Class FuzzySearch
    'Searches with the options passed to it.  Returns the results
    Public Shared Function Search(ByVal Options As SearchOptions) As SearchResults
        'Creates an array to store the results as they come in.  As percentages
        Dim Percentages(Options.List.Count) As Double

        'Two loops.  First we loop through each word (Separated by a space).  Next we search each item with the
        'search term.  If we have a match then we add to the percentage for that item.
        For Each SearchTerm As String In Options.SearchString.Split(" "c)
            For i As Integer = 0 To Options.List.Count - 1
                Dim Item As String = CStr(Options.List(i))
                If Item.IndexOf(SearchTerm) <> -1 Then
                    Percentages(i) += SearchTerm.Length / Item.Length
                End If
            Next
        Next


        'All we are doing here is finding the index of the highest percentage.
        Dim HighestPercentageIndex As Integer, HighestPercentageValue As Double
        HighestPercentageIndex = -1
        HighestPercentageValue = -1
        For i As Integer = 0 To Percentages.GetUpperBound(0)
            If Percentages(i) > HighestPercentageValue Then
                HighestPercentageIndex = i
                HighestPercentageValue = Percentages(i)
            End If
        Next

        'Finally we determine whether the highest percentage meets the thresold.
        If HighestPercentageValue >= Options.SearchThresold Then
            Return New SearchResults(HighestPercentageIndex, CStr(Options.List(HighestPercentageIndex)), HighestPercentageValue, Options)
        Else
            Return New SearchResults(-1, "Not Found", -1, Options)
        End If
    End Function
End Class


'Search options structure.  Passed to search function
Public Structure SearchOptions
    Private _SearchString As String
    Private _List As IList
    Private _SearchThresold As Double

    'The SearchString must contain an actual search (No Null or empty values)
    'The List must be valid (Not Null or empty)
    'The SearchThresold must be between 0 and 1 (Percentage)
    Public Sub New(ByVal SearchString As String, ByVal List As IList, ByVal SearchThresold As Double)
        If SearchString Is Nothing OrElse SearchString.Trim.Length = 0 Then Throw New ArgumentException("Search String cannot be null or empty", "SearchString")
        If List Is Nothing OrElse List.Count = 0 Then Throw New ArgumentException("List cannot be null or empty", "List")
        If SearchThresold < 0 Or SearchThresold > 1 Then Throw New ArgumentException("Search Thresold must be between 0 and 1", "SearchThresold")

        _SearchString = SearchString
        _List = List
        _SearchThresold = SearchThresold
    End Sub

    'Readonly properties

    Public ReadOnly Property SearchString() As String
        Get
            Return _SearchString
        End Get
    End Property

    Public ReadOnly Property List() As IList
        Get
            Return _List
        End Get
    End Property

    Public ReadOnly Property SearchThresold() As Double
        Get
            Return _SearchThresold
        End Get
    End Property
End Structure

'Results Structure.  Returned by the Search function
Public Structure SearchResults
    Private _Index As Integer
    Private _Value As String
    Private _Relevance As Double
    Private _SearchOptions As SearchOptions

    Public Sub New(ByVal Index As Integer, ByVal Value As String, ByVal Relevance As Double, ByVal SearchOptions As SearchOptions)
        _Index = Index
        _Value = Value
        _Relevance = Relevance
        _SearchOptions = SearchOptions
    End Sub

    'Readonly properties


    Public ReadOnly Property Index() As Integer
        Get
            Return _Index
        End Get
    End Property

    Public ReadOnly Property Value() As String
        Get
            Return _Value
        End Get
    End Property

    Public ReadOnly Property Relevance() As Double
        Get
            Return _Relevance
        End Get
    End Property

    Public ReadOnly Property SearchOptions() As SearchOptions
        Get
            Return _SearchOptions
        End Get
    End Property

End Structure

To use this code you do the following

VB.NET:
        Dim myStrings As New ArrayList

        myStrings.Add("The Next Door")
        myStrings.Add("The Next Time")
        myStrings.Add("Next Friday / Friday After Next")
        myStrings.Add("Next (r)")
        myStrings.Add("Advance to the Next Level")

        Dim options As New SearchOptions("Next", myStrings, 0.5)
        Dim results As SearchResults

        results = FuzzySearch.Search(options)

        MsgBox("Index:     " & results.Index & vbCrLf & _
               "Value:     " & results.Value & vbCrLf & _
               "Relevance: " & results.Relevance)

This code produces the following message box.

Index: 3
Value: Next (r)
Relevance: 0.5


I know that this isn't perfect, but it was a quick approach to the problem.

One thing I don't like is the IList variable. It can be modified since it is an object being passed by reference. I chose to use this over the ICollection, so that I could use an index instead of a For Each loop. There are a few other things that I would probably change in hindsight as well.

Please give me some comments on improvements.
 
Thanks - wasn't expecting anyone to make one for me :)

Looks like it works ok to me - I haven't gone through it fully myself though.

Not quite sure how the SearchThreshold works and what its used for.

Also, how comes when I search for "American Idiot" in a list that contains "American Idiot" in exactly the same case it comes with a relevance of 0.9 - I am guessing it's because there are alot of other results it could have been.

At the moment (and its not a problem) i have to convert my list to lowercase and the search term to lower case and input it as if I search for "american idiot" in a list containing "American Idiot" it will come back with "Not Found". Fixed by changing these two lines:

VB.NET:
                Dim Item As String = LCase(CStr(Options.List(i)))
                If Item.IndexOf(LCase(SearchTerm)) <> -1 Then
 
Last edited:
All I was doing with the Search Threshold was giving a percentage that the first match must be in order to return a successful match. So if your best match was only 20% relevant and your Threshold is at 50% then it won't return that result.

You may want to check about leading or trailing spaces. For example "American Idiot " wouldn't return 100% because of the trailing spaces.

The case stuff can be modified. You can call a ToLower on the strings or we can change the compare method from Binary to Text. I think that will also fix that problem.
 
All I was doing with the Search Threshold was giving a percentage that the first match must be in order to return a successful match. So if your best match was only 20% relevant and your Threshold is at 50% then it won't return that result.

Ah I see - that's good functionality.

You may want to check about leading or trailing spaces. For example "American Idiot " wouldn't return 100% because of the trailing spaces.

Good point, thanks for pointing that out.

The case stuff can be modified. You can call a ToLower on the strings or we can change the compare method from Binary to Text. I think that will also fix that problem.

Done (see above) :)


==========

Thanks very much for this algorithm, at the moment I can't seem to fault it. Thankyou.
 
I have found a flaw (well I think its a flaw). Basically if you have this list

VB.NET:
The Next Day Out
The Next
Next

and you are searching for "The Next", it will return "Next" as the best match.

To get around it, I have changed a section of the code to as follows:

VB.NET:
        'Creates an array to store the results as they come in.  As percentages
        Dim Percentages(Options.List.Count) As Double
        Dim matchedWords(Options.List.Count) As Integer

        'Two loops.  First we loop through each word (Separated by a space).  Next we search each item with the
        'search term.  If we have a match then we add to the percentage for that item.
        For Each SearchTerm As String In Options.SearchString.Split(" "c)
            For i As Integer = 0 To Options.List.Count - 1
                Dim Item As String = LCase(CStr(Options.List(i)))
                If Item.IndexOf(LCase(SearchTerm)) <> -1 Then
                    Percentages(i) += SearchTerm.Length / Item.Length
                    matchedWords(i) += 1
                Else
                    matchedWords(i) -= 1
                End If
                For Each item2 As String In Item.Split(" "c)
                    If LCase(Options.SearchString).Contains(LCase(item2)) = False Then
                        matchedWords(i) -= 1
                    End If
                Next
            Next
        Next

        ''Check if there are any extra words on the results returned by the search and take away matched words.
        'For Each item As String In Options.List
        '    For Each item2 As String In item.Split(" "c)
        '        If LCase(Options.SearchString).Contains(LCase(item2)) = False Then
        '            matchedWords(i) -= 1
        '        End If
        '    Next
        'Next

        'All we are doing here is finding the index of the highest percentage.
        Dim HighestPercentageIndex, HighestMatchedWords As Integer, HighestPercentageValue As Double
        HighestPercentageIndex = -1
        HighestPercentageValue = -1
        HighestMatchedWords = -1
        For i As Integer = 0 To Percentages.GetUpperBound(0)
            If (Percentages(i) > HighestPercentageValue) And (matchedWords(i) > HighestMatchedWords) Then
                HighestPercentageIndex = i
                HighestMatchedWords = matchedWords(i)
                HighestPercentageValue = Percentages(i)
            End If
        Next

Thanks.
 
I can see how that could happen now. Since Next is the only thing for that results then searching on Next would give it 100%.

Adding the number of Matched words into it sounds like a good idea.

I don't have a better solution than the one that you have in front of you as of right now.

I'll post back if I think of something.
 
New problem has arisen with the new matched-words fix

If I have:

VB.NET:
Family Guy - Season II - Stewie Rules
Family Guy
Family Guy III

...and I am searching for "Family Guy II" in the attempt to find the top one but will return "Family Guy" instead due to the number of "matched-words" being higher.

I suppose I could take the bottom section of the matched words out and it would work... This section here:

VB.NET:
                For Each item2 As String In Item.Split(" "c)
                    If LCase(Options.SearchString).Contains(LCase(item2)) = False Then
                        matchedWords(i) -= 1
                    End If
                Next

Any ideas on how to fix this? - I'm totally stuck on this one.
 
Last edited:
Thanks.

Don't feel you have to on my account - its just I am a pretty novice coder and wondered what a more professional coder would do - so i can gain from the experience.

and take the number of characters into account rather than percentage.

Just to make sure, its full words that have to be matched for it to find a match, rather than the characters....or where you thinking of something else?
 
Ok. So I rewrote portions of it and added tests as well.

With the new code I also wrote some tests. These test the basic functionality and the three bugs that were reported earlier.

To summarize the changes. Instead of picking the best match by percentage, it is chosen by total character count with percentage as a tie-breaker. Also because of this I felt it unnecessary to keep the relevance and threshold variables.

Basically this should return the best item it has unless all the results have 0% matching.

Let me know of more bugs.
 

Attachments

  • Code.zip
    2.3 KB · Views: 46
Back
Top