String fuzzy matching in VBA and VB.Net

This describes how I speeded up a slow VBA string similarity function by first optimising the VBA and then migrating to VB.Net with Excel-Dna. If you’re a follower of Charles Williams’ FastExcel blog, none of this will be new to you.

To measure how alike two strings are, I used the Levenshtein Distance algorithm. I took the VBA versions from Wikibooks and StackOverflow. It returns the number of operations it would take to transform one string into the other, so 0 means a perfect case-sensitive match. My rule of thumb in the project I was using it for was that a mismatch of less than 5% of the length was close enough.

However, I found that it took 8300 milliseconds to do 100 comparisons of two strings of 100 characters each. This was not acceptable for a large sheet with 174 strings to compare with 174, ie 30,000 string comparisons.

The code uses WorksheetFunction.MIN() to get the minimum of three values.

   distance(i, j) = Application.WorksheetFunction.Min( _
     distance(i - 1, j) + 1, 
     distance(i, j - 1) + 1, 
     distance(i - 1, j - 1) + 1)

Let’s replace that by inline logic:

    min1 = distance(i - 1, j) + 1
    min2 = distance(i, j - 1) + 1
    min3 = distance(i - 1, j - 1) + 1
    If min1 <= min2 And min1 <= min3 Then
        distance(i, j) = min1
    ElseIf min2 <= min1 And min2 <= min3 Then
        distance(i, j) = min2
    Else
        distance(i, j) = min3
    End If

Now it only takes 1100 ms; almost eight times faster.

Another bottleneck in the code is a double loop that compares every character with every other character; so two 100 character strings cause 10,000 character comparisons. The function uses Mid$() like this:

If Mid$(string1, i, 1) = Mid$(string2, j, 1) Then

String manipulation is slow in VBA. So, let’s use a Byte array instead of Mid$() operations.  The code is then

Dim ByteArray1() As Byte, ByteArray2() As Byte
ByteArray1 = string1
ByteArray2 = string2

This creates two arrays of integers corresponding to the byte values. The wrinkle is that this is done in Unicode, so we get two bytes for each character, so the array length is twice the string length. Joel Spolsky has a lot to say about Unicode  and text encoding.

        If ByteArray1((i - 1) * 2) = ByteArray2((j - 1) * 2) And _
            ByteArray1((i - 1) * 2 + 1) = ByteArray2((j - 1) * 2 + 1) Then

This version now takes 620 ms – thirteen times faster than the original!

How about trying it in VB.Net using ExcelDna?

Using Worksheetfunction.MIN () is very slow because of the trip out from VB.Net to Excel and back – it takes 6209 ms to do just ONE comparison of two 100-character strings, 100 times longer.  Using the inline code it takes 266 ms for the MID$ version so string processing is a lot more efficient in VB.Net.

It takes 156 ms for the Byte array. We can improve it a bit more by using VB.Net’s native Math.Min() function. It only takes two arguments so we can use it twice:

    distance(i, j)=Math.Min( distance(i - 1, j) + 1, _
        Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + 1 ) )

This now takes 140 ms to do 100 string comparisons, not much improvement but still sixty times faster than the Wikibook code in VBA.

So in summary, VB.Net should give good speed improvements for string manipulation. Sticking to VBA, inline code can be faster, but on other occasions it is faster to use WorksheetFunction for functions that process large ranges than to loop through the range in code. As always, test, as your mileage may vary.

VBA  Code: modLevenshtein.bas

ExcelDna VB.Net code: LevenshteinB.Dna

All I need now is for someone to point out another algorithm that is much quicker than the Levenshtein Distance 🙂

Advertisements

About Patrick O'Beirne, spreadsheet auditor

Patrick provides consultancy and training in spreadsheet development, auditing / testing and model review; and the Excel addin XLtest
This entry was posted in Excel/VBA, ExcelDna, Productivity and tagged , , , , , , , , , , . Bookmark the permalink.

5 Responses to String fuzzy matching in VBA and VB.Net

  1. dermotb says:

    You said “a mismatch of less than 5% of the length was close enough”
    Wikipedia says
    “If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string”

  2. Azim says:

    Thanks for the superb article!

  3. kannan says:

    VBA code link is not working 😦

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s