Fastest CSV string splitting using CLR (T-SQL vs. CLR revisited)

In one of my previous blog posts, SQL Server – String splitting (T-SQL vs. CLR), I compared the speed of a T-SQL solution for string splitting with a CLR RegEx solution. Although the CLR RegEx is fast enough, it isn’t the fastest solution for simple CSV string splitting. I will also compare it to the T-SQL solution again.

In that post the T-SQL solution wasn’t usable for larger numbers of CSV values. However, after some investigation, Jeff Moden found a mistake we both made in the T-SQL testing (see his post http://www.sqlservercentral.com/Forums/FindPost997236.aspx), so I will also post an updated comparison with the T-SQL version here.

Fastest CLR version

Probably the fastest CLR version for splitting a string is a CLR table-valued function that processes the whole string character by character and emits a result row each time it reaches a delimiter.

Here is one possible solution.

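using System;
using System.Collections;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public class StringSplit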
{
    private struct StrRow
    {
        public StrRow(int rowId, SqlChars value)
        {
            RowId = rowId;
            Value = value;
        }

        public int RowId;
        public SqlChars Value;

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString3(SqlString sourceString, string delimiter, int maxLen)
    {
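        // Reusable buffer for the current item; an item longer than maxLen overflows it.
        char[] buffer = new char[maxLen];
        char delim = delimiter[0];   // only the first character of the delimiter is used
        int rowNumber = 0;
        int chars = 0;               // number of characters currently held in the buffer
        char[] finalString;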

        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                finalString = new char[chars];
                Array.Copy(buffer, finalString, chars);
                yield return new StrRow(++rowNumber, new SqlChars(finalString));
                chars = 0;
            }
            else
            {
                buffer[chars++] = chr;
            }
        }
        if (chars > 0)
        {
            finalString = new char[chars];
            Array.Copy(buffer, finalString, chars);
            yield return new StrRow(++rowNumber, new SqlChars(finalString));
        }

    }

    [SqlFunction(FillRowMethodName = "FillSplitString3")]
    public static IEnumerable SplitString4(SqlString sourceString, string delimiter)
    {
        StringBuilder sb = new StringBuilder();
        char delim = delimiter[0];
        int rowNumber = 0;
        foreach (char chr in sourceString.Value)
        {
            if (chr == delim)
            {
                yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
                sb = new StringBuilder(sb.Capacity);
            }
            else
            {
                sb.Append(chr);
            }
        }
        if (sb.Length > 0)
        {
            yield return new StrRow(++rowNumber, new SqlChars(sb.ToString()));
        }

    }

    public static void FillSplitString3(object obj, out int rowId, out SqlChars value)
    {
        StrRow r = (StrRow)obj;
        rowId = r.RowId;
        value = r.Value;
    }
}

And the T-SQL definition of the function:

CREATE FUNCTION dbo.fn_SplitString3(
  @sourceString nvarchar(max),
  @delimiter nchar(1),
  @maxLen int
)
RETURNS  TABLE (
    RowID int NULL,
    Value nvarchar(10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME SQLRegEx.StringSplit.SplitString3
GO

This function takes three parameters: the source string to be split, the delimiter, and maxLen, which is the maximum length of an item in the CSV list. maxLen is used to allocate the buffer; for integer values, for example, it will be 10, as a positive integer has at most 10 digits. An item longer than maxLen overflows the buffer and raises an exception. It is possible to write this function without this parameter (SplitString4 in the code above does so using a StringBuilder), but I’ve added it for speed, as it avoids buffer reallocations.
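
As a minimal sketch of how the two CLR variants could be registered and called: the block below assumes the SQLRegEx assembly referenced in the definition above is already cataloged in the database. The name dbo.fn_SplitString4 and its nvarchar(10) column width are assumptions made for this illustration; only fn_SplitString3 is defined in this post.

```sql
-- Hypothetical registration of SplitString4 (the variant without maxLen).
-- The function name and the nvarchar(10) column width are assumptions.
CREATE FUNCTION dbo.fn_SplitString4(
  @sourceString nvarchar(max),
  @delimiter nchar(1)
)
RETURNS TABLE (
    RowID int NULL,
    Value nvarchar(10) NULL
) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME SQLRegEx.StringSplit.SplitString4
GO

-- Both functions are called the same way, apart from the extra maxLen argument.
SELECT V.RowID, V.Value
FROM dbo.fn_SplitString3(N'10,200,3000', N',', 10) AS V
ORDER BY V.RowID;

SELECT V.RowID, V.Value
FROM dbo.fn_SplitString4(N'10,200,3000', N',') AS V
ORDER BY V.RowID;
GO
```

With the C# code as listed, both queries should return three rows: (1, 10), (2, 200) and (3, 3000).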

I will compare the speed also to the CLR RegEx version. I will use the function mentioned in my previous post.

For the CLR RegEx we will use a simple regular expression, '\d+', as it is enough for integer values delimited by commas.

As the T-SQL candidate for the speed comparison I will use the latest optimized version of the Tally table splitter by Jeff Moden.

CREATE FUNCTION dbo.Split8KTallyM (
    @Parameter VARCHAR(8000),
    @Delimiter VARCHAR(1)
)
RETURNS @Result TABLE (ItemNumber INT, ItemValue INT) AS
  BEGIN
 INSERT INTO @Result
        (ItemNumber, ItemValue)
 SELECT CAST(ROW_NUMBER() OVER (ORDER BY N) AS INT) AS ItemNumber,
        SUBSTRING(@Parameter,N,CHARINDEX(@Delimiter,@Parameter+@Delimiter,N)-N) AS ItemValue
   FROM dbo.Tally
  WHERE N BETWEEN 1 AND LEN(@Parameter)+1
    AND SUBSTRING(@Delimiter+@Parameter,N,1) = @Delimiter; --Notice how we find the comma
 RETURN
    END;
GO

Test data preparation

As test data I will use the same tables as in the previous tests: three tables with 10 000 rows each, with a different CSV string length in each table (16 items, 100 items and 1333 items). The only change to the table definition is that the string is stored not as nvarchar(max) but as varchar(max). The nvarchar in the previous test completely degraded the T-SQL solution, making it unusable for 1333 items in a CSV string.

SELECT TOP 11000
    IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM sys.all_objects o1, sys.all_objects
GO

--Add Clustered Index on Tally table
ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
GO

--Create and populate CsvTest table (doesn't matter whether the table has Clustered index or it is simply heap)
SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (16) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up to 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up to 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (100) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up to 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest2
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up to 121 million.
GO

SELECT TOP (10000) --Controls the number of rows in the test table
    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
    (
        SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
                ( --=== This builds CSV row with a leading comma
                SELECT TOP (1333) --Controls the number of CSV elements in each row
                    ','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
                FROM dbo.Tally t3      --Classic cross join pseudo-cursor
                CROSS JOIN dbo.Tally t4 --can produce row sets up to 121 million.
                WHERE t1.N <> t3.N --Without this line, all rows would be the same
                FOR XML PATH('')
                )
                ,1,1,'') AS VARCHAR(8000))
                ) AS CsvParameter
INTO CsvTest3
FROM dbo.Tally t1        --Classic cross join pseudo-cursor
CROSS JOIN dbo.Tally t2;  --can produce row sets up to 121 million.
GO
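
Before running the timings it may be worth checking that the three test tables contain what is expected. The query below is only a sketch of such a check: it counts the items in each CSV string as the number of commas plus one. The column aliases (TableName, RowsInTable, MinItems, MaxItems) are introduced here for illustration.

```sql
-- Count the items in each CSV string as (number of commas + 1)
-- and summarize the result per test table.
SELECT 'CsvTest' AS TableName,
       COUNT(*) AS RowsInTable,
       MIN(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1) AS MinItems,
       MAX(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1) AS MaxItems
FROM dbo.CsvTest
UNION ALL
SELECT 'CsvTest2',
       COUNT(*),
       MIN(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1),
       MAX(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1)
FROM dbo.CsvTest2
UNION ALL
SELECT 'CsvTest3',
       COUNT(*),
       MIN(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1),
       MAX(LEN(CsvParameter) - LEN(REPLACE(CsvParameter, ',', '')) + 1)
FROM dbo.CsvTest3;
GO
```

If the population scripts ran as intended, each table should report 10000 rows, with MinItems and MaxItems both equal to 16, 100 and 1333 respectively.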

Speed comparison

Here is a script I will use to compare the speed:

--================= 16 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 100 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest2 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest2 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO
--================= 1333 items ==========
GO
--CLR fn_SplitString3
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V
GO
--CLR RegEx
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest3 D
CROSS APPLY dbo.fn_RegExMatches2(D.CsvParameter, '\d+') V
GO
--T-SQL Split8KTallyM
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.ItemNumber,
    @ItemValue = V.ItemValue
FROM dbo.CsvTest3 D
CROSS APPLY dbo.Split8KTallyM(D.CsvParameter, ',') V
GO

And here are the results from profiler:

[Image: Profiler results]
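
The timings in this post come from Profiler. If Profiler is not available, elapsed time can be approximated directly in the batch. The following is only a sketch of that alternative, not the method used for the results reported here; the @Start variable and the ElapsedMs column name are introduced for illustration.

```sql
DECLARE @RowNum INT, @ItemNumber INT, @ItemValue INT;
DECLARE @Start DATETIME;
SET @Start = GETDATE();

-- Same test query as in the comparison script (16 items, CLR fn_SplitString3)
SELECT
    @RowNum = D.RowNum,
    @ItemNumber = V.RowID,
    @ItemValue = V.Value
FROM dbo.CsvTest D
CROSS APPLY dbo.fn_SplitString3(D.CsvParameter, ',', 10) V;

-- Elapsed wall-clock time in milliseconds (DATETIME is accurate to roughly 3 ms)
SELECT DATEDIFF(ms, @Start, GETDATE()) AS ElapsedMs;
GO
```

The same wrapper can be placed around each of the other test queries. It measures duration only, whereas Profiler also reports CPU time and reads.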

Results of comparison and conclusion

As we can see in the Profiler output, the new fn_SplitString3 function is the fastest in all scenarios. While the T-SQL solution took 3.5 seconds for 16 items, the new CLR split function took only 253 milliseconds. As mentioned in the previous post, the CLR RegEx benefits from higher item counts: above 100 items it beats the T-SQL solution. The new fn_SplitString3 took only 8.2 seconds even with 1333 items.

