This describes how I sped up a slow VBA string similarity function, first by optimising the VBA and then by migrating it to VB.Net with Excel-Dna. If you’re a follower of Charles Williams’ FastExcel blog, none of this will be new to you.
To measure how alike two strings are, I used the Levenshtein Distance algorithm, taking the VBA versions from Wikibooks and StackOverflow. It returns the minimum number of single-character edits (insertions, deletions and substitutions) needed to transform one string into the other, so 0 means a perfect case-sensitive match. My rule of thumb in the project I was using it for was that a distance of less than 5% of the string length was close enough.
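As a worked example (the function name LevenshteinDistance is assumed here, following the Wikibooks version; your copy may differ): transforming “kitten” into “sitting” needs three edits – substitute k→s, substitute e→i and insert g – and under the 5% rule two 100-character strings would count as a match at a distance of 4 or less.

Debug.Print LevenshteinDistance("kitten", "sitting")   ' prints 3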
However, I found that it took 8300 milliseconds to do 100 comparisons of two strings of 100 characters each. That was not acceptable for a large sheet where 174 strings had to be compared against another 174, i.e. about 30,000 string comparisons.
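For reference, a timing loop along these lines could reproduce the kind of figures quoted here (a sketch only: the test strings, variable names and the LevenshteinDistance name are my assumptions, and VBA’s Timer has limited resolution, so expect some variation):

Dim s1 As String, s2 As String, k As Long, d As Long, t As Double
s1 = String$(100, "a")   ' hypothetical 100-character test strings
s2 = String$(100, "b")
t = Timer
For k = 1 To 100
    d = LevenshteinDistance(s1, s2)
Next k
Debug.Print Format$((Timer - t) * 1000, "0") & " ms for 100 comparisons"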
The code uses WorksheetFunction.Min() to get the minimum of three values. Each of those calls goes out to Excel’s object model, which carries a lot of overhead inside a tight loop.
distance(i, j) = Application.WorksheetFunction.Min( _
    distance(i - 1, j) + 1, _
    distance(i, j - 1) + 1, _
    distance(i - 1, j - 1) + 1)
Let’s replace that with inline logic:
min1 = distance(i - 1, j) + 1
min2 = distance(i, j - 1) + 1
min3 = distance(i - 1, j - 1) + 1
If min1 <= min2 And min1 <= min3 Then
    distance(i, j) = min1
ElseIf min2 <= min1 And min2 <= min3 Then
    distance(i, j) = min2
Else
    distance(i, j) = min3
End If
Now it only takes 1100 ms – almost eight times faster.
Another bottleneck in the code is the double loop that compares every character of one string with every character of the other, so two 100-character strings cause 10,000 character comparisons. The function uses Mid$() like this:
If Mid$(string1, i, 1) = Mid$(string2, j, 1) Then
String manipulation is slow in VBA, so let’s read each string into a Byte array instead of calling Mid$() repeatedly. The code is then
Dim ByteArray1() As Byte, ByteArray2() As Byte
ByteArray1 = string1
ByteArray2 = string2
Assigning a string to a Byte array gives an array of the underlying byte values. The wrinkle is that VBA stores strings as Unicode (UTF-16), so there are two bytes for each character and the array is twice the length of the string. Joel Spolsky has a lot to say about Unicode and text encoding.
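A quick illustration of that layout (a hypothetical Immediate-window check, not part of the original code):

Dim b() As Byte
b = "AB"
Debug.Print LBound(b), UBound(b)     ' 0  3 – four bytes for two characters
Debug.Print b(0), b(1), b(2), b(3)   ' 65  0  66  0 – low byte then high byte of each character

That is why the character comparison below looks at bytes (i - 1) * 2 and (i - 1) * 2 + 1 for the i-th character: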
If ByteArray1((i - 1) * 2) = ByteArray2((j - 1) * 2) And _
   ByteArray1((i - 1) * 2 + 1) = ByteArray2((j - 1) * 2 + 1) Then
This version now takes 620 ms – thirteen times faster than the original!
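Putting the two changes together, the optimised VBA routine looks roughly like this (a sketch only – the name LevenshteinFast and the initialisation scaffolding are my own; the actual code is in the modLevenshtein.bas download at the end of the post):

Function LevenshteinFast(string1 As String, string2 As String) As Long
    Dim i As Long, j As Long, len1 As Long, len2 As Long
    Dim min1 As Long, min2 As Long, min3 As Long
    Dim ByteArray1() As Byte, ByteArray2() As Byte
    Dim distance() As Long

    len1 = Len(string1)
    len2 = Len(string2)
    ReDim distance(len1, len2)
    ByteArray1 = string1              ' two bytes per character (UTF-16)
    ByteArray2 = string2

    For i = 0 To len1: distance(i, 0) = i: Next i
    For j = 0 To len2: distance(0, j) = j: Next j

    For i = 1 To len1
        For j = 1 To len2
            ' Byte comparison instead of Mid$()
            If ByteArray1((i - 1) * 2) = ByteArray2((j - 1) * 2) And _
               ByteArray1((i - 1) * 2 + 1) = ByteArray2((j - 1) * 2 + 1) Then
                distance(i, j) = distance(i - 1, j - 1)
            Else
                ' Inline minimum of delete, insert and substitute
                min1 = distance(i - 1, j) + 1
                min2 = distance(i, j - 1) + 1
                min3 = distance(i - 1, j - 1) + 1
                If min1 <= min2 And min1 <= min3 Then
                    distance(i, j) = min1
                ElseIf min2 <= min1 And min2 <= min3 Then
                    distance(i, j) = min2
                Else
                    distance(i, j) = min3
                End If
            End If
        Next j
    Next i

    LevenshteinFast = distance(len1, len2)
End Function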
How about trying it in VB.Net using ExcelDna?
Using WorksheetFunction.Min() from VB.Net is very slow because of the round trip out from VB.Net to Excel and back – it takes 6209 ms to do just ONE comparison of two 100-character strings, roughly 100 times longer per comparison than even the original VBA. With the inline min logic, the Mid$() version takes 266 ms for the 100 comparisons, so string processing is a lot more efficient in VB.Net.
The Byte array version takes 156 ms. We can improve it a bit more by using VB.Net’s native Math.Min() function. It only takes two arguments, so we nest one call inside another:
distance(i, j) = Math.Min(distance(i - 1, j) + 1, _
                 Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + 1))
This now takes 140 ms to do 100 string comparisons – not much further improvement, but still sixty times faster than the Wikibooks code in VBA.
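For completeness, the shape of the VB.Net function registered through Excel-Dna is roughly as follows (a sketch under assumptions: the module and function names are mine, character indexing is used instead of the Byte array for brevity, and the actual working code is in the LevenshteinB.Dna download below):

Imports ExcelDna.Integration

Public Module LevenshteinFunctions

    ' Exposed to Excel as a worksheet function by Excel-Dna
    <ExcelFunction(Description:="Levenshtein distance between two strings")> _
    Public Function LEVENSHTEIN(string1 As String, string2 As String) As Integer
        Dim len1 As Integer = string1.Length
        Dim len2 As Integer = string2.Length
        Dim distance(len1, len2) As Integer

        For i As Integer = 0 To len1 : distance(i, 0) = i : Next
        For j As Integer = 0 To len2 : distance(0, j) = j : Next

        For i As Integer = 1 To len1
            For j As Integer = 1 To len2
                If string1(i - 1) = string2(j - 1) Then
                    distance(i, j) = distance(i - 1, j - 1)
                Else
                    ' Nested Math.Min instead of WorksheetFunction.Min
                    distance(i, j) = Math.Min(distance(i - 1, j) + 1, _
                        Math.Min(distance(i, j - 1) + 1, distance(i - 1, j - 1) + 1))
                End If
            Next j
        Next i

        Return distance(len1, len2)
    End Function

End Module

Once the add-in is loaded, the function can be called from a worksheet like any built-in one, e.g. =LEVENSHTEIN(A1, B1).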
So in summary, VB.Net should give good speed improvements for string manipulation. Sticking to VBA, inline code can be faster than repeated WorksheetFunction calls, but on other occasions it is faster to hand a whole range to a WorksheetFunction than to loop through the range in code. As always, test – your mileage may vary.
VBA Code: modLevenshtein.bas
ExcelDna VB.Net code: LevenshteinB.Dna
All I need now is for someone to point out another algorithm that is much quicker than the Levenshtein Distance 🙂