Levenshtein algorithm is one of possible fuzzy strings matching algorithm. Levenshtein algorithm calculates Levenshtein distance which is a metric for measuring a difference between two strings. The Levenshtein distance is also called an edit distance and it defines minimum single character edits (insert/updates/deletes) needed to transform one string to another. Details on the algorithm itself can be found on Wikipedia.
When you need to use it in queries, functions or stored procedures you have two possibilities – T-SQL implementation and CLR implementation.
I will show both solutions here and also compare the speed of both solutions.
T-SQL implementation of Levenshtein algorithm
For T-SQL I will took the one I have found on the SqlTeam.com forums a which was originally developed by Joseph Gama as mentioned in the SqlTeam post.
CREATE FUNCTION [dbo].[edit_distance]( @s1 nvarchar(3999), @s2 nvarchar(3999)) RETURNS int AS BEGIN DECLARE @s1_len int, @s2_len int DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000) SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0 WHILE @j <= @s2_len SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1 WHILE @i <= @s1_len BEGIN SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @s2_len BEGIN SET @c = @c + 1 SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END IF @c > @c_temp SET @c = @c_temp SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1 IF @c > @c_temp SET @c = @c_temp SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1 END SELECT @cv1 = @cv0, @i = @i + 1 END RETURN @c END
CLR implementation of Levenshtein algorithm
public class FuzzyStrings{ /// <summary> /// Calculates the Levenshtein Distance between two strings. /// It is minimum of single character insert/delete/update operations needed to transfrom /// first string into the second string /// </summary> /// <param name="firstString">First string to calculate the distance</param> /// <param name="secondString">Second string to calculate the distance</param> /// <param name="ignoreCase">Specifies whether to ignore case in comparison</param> /// <returns>int represending the Levenshtein Distance</returns> public static int LevenshteinDistance(SqlString firstString, SqlString secondString, SqlBoolean ignoreCase) { string strF = ignoreCase ? firstString.Value.ToLower() : firstString.Value; string strS = ignoreCase ? secondString.Value.ToLower() : secondString.Value; int lenF = strF.Length; int lenS = strS.Length; int[,] d = new int[lenF + 1, lenS + 1]; for (int i = 0; i <= lenF; i++) d[i, 0] = i; for (int j = 0; j <= lenS; j++) d[0, j] = j; for (int j = 1; j <= lenS; j++) { for (int i = 1; i <= lenF; i++) { if (strF[i - 1] == strS[j - 1]) d[i, j] = d[i - 1, j - 1]; else d[i, j] = Math.Min(Math.Min( d[i - 1, j] + 1, // a deletion d[i, j - 1] + 1), //an Insertion d[i - 1, j - 1] + 1); // a substitution } } return d[lenF, lenS]; } }
You need to compile the code using e.g. C# Express into a an assembly and create the assembly in DB. In my Case the assembly is named [SQLCLR].
CREATE FUNCTION [dbo].[fn_LevenshteinDistance]( @firstString [nvarchar](4000), @secondString [nvarchar](4000), @ingoreCase [bit] = 1 ) RETURNS [int] WITH EXECUTE AS CALLER AS EXTERNAL NAME [CLRSQL].[FuzzyStrings].[LevenshteinDistance] GO
Testing of the functions
Once we have properly created both functions we can start testing it. Here is a script for test of both functions on several strings.
SELECT dbo.edit_distance('Sunday', 'Monday') AS TSQLDistance, ClrSafe.fn_LevenshteinDistance('Sunday', 'Monday', 1) AS CLRDistance UNION ALL SELECT dbo.edit_distance('Sunday', 'Sunday') AS TSQLDistance, ClrSafe.fn_LevenshteinDistance('Sunday', 'Sunday', 0) AS CLRDistance UNION ALL SELECT dbo.edit_distance('Sunday', 'sunday') AS TSQLDistance, ClrSafe.fn_LevenshteinDistance('Sunday', 'sunday', 0) AS CLRDistance UNION ALL SELECT dbo.edit_distance('Saturday', 'Monday') AS TSQLDistance, ClrSafe.fn_LevenshteinDistance('Saturday', 'Monday', 1) AS CLRDistance UNION ALL SELECT dbo.edit_distance('This is a first string to Compare', 'This is a second string to Compare') AS TSQLDistance, ClrSafe.fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to Compare', 1) AS CLRDistance
And here are the results of the tests:
TSQLDistance CLRDistance
------------ -----------
2 2
0 0
0 1
5 5
6 6
As we can see, the functions return the same results except the second select. As we have selected not to ignore the case in the CLR version. For T-SQL strings are identical as the function uses a default collation which I have Case Insensitive. If I had a default collation Case Sensitive, then the results will be the same. It could be also possible to modify the T-SQL function to accept parameter for Case Sensitive/Insensitive comparison and then use different collations for that, but it’s not what we want to do here.
Speed comparison
As we saw in previous paragraphs here, both T-SQL and CLR version of the algorithm woks correctly. Now take a look on the calculation speed of the Levenshtein distance by both version.
For the test we can use a simple script, which will calculate the Levenshtein distance in cycle 10 000 times
DECLARE @TSQLStartTime datetime, @TSQLEndTime datetime, @CLRStartTime datetime, @CLREndTime datetime, @distance int, @i int SELECT @i = 0, @TSQLStartTime = GETDATE(); WHILE (@i < 10000) BEGIN SELECT @distance = dbo.edit_distance('This is a first string to Compare', 'This is a second string to compare'), @i = @i + 1 END SELECT @TSQLEndTime = GETDATE(), @i = 0, @CLRStartTime = GETDATE() WHILE (@i < 10000) BEGIN SELECT @distance = [ClrSafe].fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to compare', 1), @i = @i + 1 END SELECT @CLREndTime = GETDATE() SELECT DATEDIFF(millisecond, @TSQLStartTime, @TSQLEndTime) AS TSQLDuration, DATEDIFF(millisecond, @CLRStartTime, @CLREndTime) AS CLRDuration
Here are the results
TSQLDuration CLRDuration
------------ -----------
22993 763
As we can see the 10 000 times calculation using the T-SQL version took 22993 milliseconds which is in average circa 2.3 millisecond for calculating the distance for the strings in our test query.
On the other side the 10 000 times calculation using CLR took only 763 milliseconds which is in average circa 0.08 milliseconds for calculating the distance for the same strings as in T-SQL version.
Conclusion
From the results we can see that the CLR is about 30 times faster on the same machine than the T-SQL version of the same algorithm. Even the T-SQL version took only 2.3 milliseconds per calculation of sample texts and it’s quite good to use in normal usage, the use or CLR can enormously decrease the processing time when processing higher volume of records.
Also it is another example where CLR beats the T-SQL solution and where it has sense to use the CLR instead of pure T-SQL.