Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)

Levenshtein algorithm is one of possible fuzzy strings matching algorithm. Levenshtein algorithm calculates Levenshtein distance which is a metric for measuring a difference between two strings. The Levenshtein distance is also called an edit distance and it defines minimum single character edits (insert/updates/deletes) needed to transform one string to another. Details on the algorithm itself can be found on Wikipedia.
When you need to use it in queries, functions or stored procedures you have two possibilities – T-SQL implementation and CLR implementation.
I will show both solutions here and also compare the speed of both solutions.

T-SQL implementation of Levenshtein algorithm

For T-SQL I will took the one I have found on the SqlTeam.com forums a which was originally developed by Joseph Gama as mentioned in the SqlTeam post.

CREATE FUNCTION [dbo].[edit_distance](
  @s1 nvarchar(3999),
  @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int
  DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
  DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT
    @s1_len = LEN(@s1),
    @s2_len = LEN(@s2),
    @cv1 = 0x0000,
    @j = 1, @i = 1, @c = 0

  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len

  BEGIN
    SELECT
      @s1_char = SUBSTRING(@s1, @i, 1),
      @c = @i,
      @cv0 = CAST(@i AS binary(2)),
      @j = 1

    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT
      @cv1 = @cv0,
      @i = @i + 1
  END

  RETURN @c
END

CLR implementation of Levenshtein algorithm

public class FuzzyStrings{
  /// <summary>
  /// Calculates the Levenshtein Distance between two strings.
  /// It is minimum of single character insert/delete/update operations needed to transfrom
  /// first string into the second string
  /// </summary>
  /// <param name="firstString">First string to calculate the distance</param>
  /// <param name="secondString">Second string to calculate the distance</param>
  /// <param name="ignoreCase">Specifies whether to ignore case in comparison</param>
  /// <returns>int represending the Levenshtein Distance</returns>
  public static int LevenshteinDistance(SqlString firstString, SqlString secondString, SqlBoolean ignoreCase)
  {
    string strF = ignoreCase ? firstString.Value.ToLower() : firstString.Value;
    string strS = ignoreCase ? secondString.Value.ToLower() : secondString.Value;
    int lenF = strF.Length;
    int lenS = strS.Length;
    int[,] d = new int[lenF + 1, lenS + 1];

    for (int i = 0; i <= lenF; i++)
      d[i, 0] = i;
    for (int j = 0; j <= lenS; j++)
      d[0, j] = j;

    for (int j = 1; j <= lenS; j++)
    {
      for (int i = 1; i <= lenF; i++)
      {
        if (strF[i - 1] == strS[j - 1])
          d[i, j] = d[i - 1, j - 1];
        else
          d[i, j] = Math.Min(Math.Min(
            d[i - 1, j] + 1,        // a deletion
            d[i, j - 1] + 1),       //an Insertion
            d[i - 1, j - 1] + 1);   // a substitution
      }
    }

    return d[lenF, lenS];
  }
}

You need to compile the code using e.g. C# Express into a an assembly and create the assembly in DB. In my Case the assembly is named [SQLCLR].

CREATE FUNCTION [dbo].[fn_LevenshteinDistance](
  @firstString [nvarchar](4000),
  @secondString [nvarchar](4000),
  @ingoreCase [bit] = 1
)
RETURNS [int]
WITH EXECUTE AS CALLER
AS EXTERNAL NAME [CLRSQL].[FuzzyStrings].[LevenshteinDistance]
GO

Testing of the functions

Once we have properly created both functions we can start testing it. Here is a script for test of both functions on several strings.

SELECT
  dbo.edit_distance('Sunday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Monday', 1) AS CLRDistance

UNION ALL 

SELECT
  dbo.edit_distance('Sunday', 'Sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'Sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Sunday', 'sunday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Sunday', 'sunday', 0) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('Saturday', 'Monday') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('Saturday', 'Monday', 1) AS CLRDistance

UNION ALL

SELECT
  dbo.edit_distance('This is a first string to Compare', 'This is a second string to Compare') AS TSQLDistance,
  ClrSafe.fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to Compare', 1) AS CLRDistance

And here are the results of the tests:

TSQLDistance CLRDistance
------------ -----------
2            2
0            0
0            1
5            5
6            6

As we can see, the functions return the same results except the second select. As we have selected not to ignore the case in the CLR version. For T-SQL strings are identical as the function uses a default collation which I have Case Insensitive. If I had a default collation Case Sensitive, then the results will be the same. It could be also possible to modify the T-SQL function to accept parameter for Case Sensitive/Insensitive comparison and then use different collations for that, but it’s not what we want to do here.

Speed comparison

As we saw in previous paragraphs here, both T-SQL and CLR version of the algorithm woks correctly. Now take a look on the calculation speed of the Levenshtein distance by both version.

For the test we can use a simple script, which will calculate the Levenshtein distance in cycle 10 000 times

DECLARE
   @TSQLStartTime datetime,
  @TSQLEndTime datetime,
  @CLRStartTime datetime,
  @CLREndTime datetime,
  @distance int,
  @i int

SELECT
  @i = 0,
  @TSQLStartTime = GETDATE();

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = dbo.edit_distance('This is a first string to Compare', 'This is a second string to compare'),
    @i = @i + 1
END

SELECT
  @TSQLEndTime = GETDATE(),
  @i = 0,
  @CLRStartTime = GETDATE()

WHILE (@i < 10000)
BEGIN
  SELECT
    @distance = [ClrSafe].fn_LevenshteinDistance('This is a first string to Compare', 'This is a second string to compare', 1),
    @i = @i + 1
END

SELECT @CLREndTime = GETDATE()

SELECT
  DATEDIFF(millisecond, @TSQLStartTime, @TSQLEndTime) AS TSQLDuration,
  DATEDIFF(millisecond, @CLRStartTime, @CLREndTime) AS CLRDuration

Here are the results

TSQLDuration CLRDuration
------------ -----------
22993        763

As we can see the 10 000 times calculation using the T-SQL version took 22993 milliseconds which is in average circa 2.3 millisecond for calculating the distance for the strings in our test query.

On the other side the 10 000 times calculation using CLR took only 763 milliseconds which is in average circa 0.08 milliseconds for calculating the distance for the same strings as in T-SQL version.

Conclusion

From the results we can see that the CLR is about 30 times faster on the same machine than the T-SQL version of the same algorithm. Even the T-SQL version took only 2.3 milliseconds per calculation of sample texts and it’s quite good to use in normal usage, the use or CLR can enormously decrease the processing time when processing higher volume of records.

Also it is another example where CLR beats the T-SQL solution and where it has sense to use the CLR instead of pure T-SQL.

Fuzzy strings matching using Levenshtein algorithm on SQL Server (T-SQL vs CLR)

T-SQL implementation of Levenshtein algorithm

CLR implementation of Levenshtein algorithm

Testing of the functions

Speed comparison

Conclusion

Trending Articles

Bath man appears in court charged with attempted murder of a man...

MACLEAN, Allan

Black Angus Grilled Artichokes

Practice Sheet of Right form of verbs for HSC Students

Police blotter for Jan. 12

99 God Status for Whatsapp, Facebook

Rajasthan Board 12th Science Result 2018 name wise- RBSE 12th commerce result...

Notorious Naushad of Ippa gang nabbed

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

Sonible Smartlimit v1.1.5-R2R

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Arrow Flash 2 – Sinhala Dubbed – Episode 23 – 20th March 2016

[GET] AI Traffic Goldmine

[E² Plugin] HDF-Radio

Universal Multi-Patch v1.3 By RADIXX11

IWAN – Thanks and Praise ( Throw Back Thursday )

RONALD P SONDERGAARD Arrested by Miami-Dade County Corrections on Mar 03, 2017

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

HSSC Excise & Taxation Inspector Result 2017 Scorecard/ Category Wise Merit List