
Selecting random records from a table

Sometimes you may run into a situation where you have a table and need to get only a few random records from it, and wonder how to achieve this in SQL Server.

You might think of using the RAND() function, but as I mentioned in my previous post Random numbers in query, RAND() is a constant function evaluated only once per query, so it does not do the job here.
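To illustrate the problem, here is a minimal sketch (run against the Sales.SalesOrderDetail table used throughout this post) showing that every row receives the identical value:

SELECT TOP 5
    SalesOrderID,
    RAND() AS same_value_for_every_row  -- RAND() is evaluated once per query, not per row
FROM Sales.SalesOrderDetail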

One possible solution is the TABLESAMPLE clause. All the samples here use the AdventureWorks2008R2 sample database.

SELECT
    *
FROM Sales.SalesOrderDetail
TABLESAMPLE (100 ROWS);

Although this can work for some tables, it will probably not give you the results you expect, because the current implementation of the TABLESAMPLE clause was built for statistics gathering: it samples whole pages rather than individual rows, so the number of returned rows varies and the sample is not truly random at the row level.

Selecting random rows using NEWID()

As a working alternative for selecting random records we can again use the NEWID() function mentioned in my previous post. Each call to the function creates a new unique identifier, and unlike RAND() it is evaluated once per row. There are several possibilities for returning random rows using the function.

  1. Select an approximate percentage sample of random rows
  2. Select an exact number of random rows
  3. Select an exact percentage sample of random rows
  4. Select a random number of random rows
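Before going through these options, a quick sanity check shows that, unlike RAND(), NEWID() really is evaluated once per row:

SELECT TOP 3
    SalesOrderID,
    NEWID() AS per_row_guid  -- a different GUID for each row
FROM Sales.SalesOrderDetail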

Selecting an approximate percentage sample of random rows

To select an approximate percentage sample of random rows from a table you can use the query below. It will select around 10 % of the records in the table.

SELECT
    *
FROM Sales.SalesOrderDetail
WHERE 0.1 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)

Each execution of the above query returns a different number of records, which will be close to 10 percent of the total rows in the table.

To select a different percentage, simply change the 0.1 to the appropriate fraction. The SalesOrderID field is added to the calculation in the condition to force generation of a NEWID() for each row; otherwise NEWID() would be generated only once as a constant for the whole query. The 0x7fffffff represents the maximum positive integer value, so using it in a bitwise AND operation clears the sign bit and limits the original number to that value.
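To see what the filter actually compares against, the sketch below exposes the per-row pseudo-random value on its own:

--CHECKSUM(NEWID(), SalesOrderID) yields a different int for each row,
--& 0x7fffffff clears the sign bit, and the division by the maximum
--positive int (2147483647) normalizes the result into the range [0, 1)
SELECT TOP 5
    SalesOrderID,
    CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST(0x7fffffff AS int) AS random_value
FROM Sales.SalesOrderDetail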

The above query produces the following plan:

[Figure: Approximate percentage sample of random rows query plan]

And the trace from the profiler:

[Figure: Approximate percentage sample of random rows profiler trace]

The table Sales.SalesOrderDetail has 121,317 rows, and from the profiler results we can see that the query returned almost exactly 10 % of the records (off the exact 10 % by just 1.7 rows). With multiple runs the row count produced will differ, but stay very close to 10 %.

Selecting an exact number of random rows

Selecting an exact number of random rows is very easy: it is only necessary to sort the query by the NEWID() result and limit the number of returned rows. To return 100 random rows, the query below can be used:

SELECT TOP 100
    *
FROM Sales.SalesOrderDetail
ORDER BY NEWID()

In this example it is not necessary to include any random number generation and checksum, as NEWID() appears in the ORDER BY clause and is therefore evaluated once for each row, so SQL Server can sort the records on its value.

Here is a plan for the query:

[Figure: Exact number of random rows query plan]

And the trace from the profiler:

[Figure: Exact number of random rows profiler trace]

As we can see, even though the query looks much simpler than the previous one for an approximate percentage, it produces a much more complicated parallel plan, and from the profiler we can see that despite returning fewer rows, it produces more reads and has a much higher CPU cost. This is caused by the ORDER BY operator, as the database engine has to sort all the records before returning the TOP 100 rows. The duration of the first query is longer, but that is caused by it returning 12 times more rows to the client.

On small tables this doesn't matter, but on large ones it can have a high impact. To solve the issue, we can use the approximate-percentage query and limit the number of records it returns; we only need to choose an appropriate fraction in the WHERE clause. Since we know the table has more than 120,000 records, we can first limit the results to about 0.15 % (roughly 180 rows), so we can be reasonably sure the base query returns more than 100 rows, and then simply limit them to 100.

SELECT TOP 100
    *
FROM Sales.SalesOrderDetail
WHERE 0.0015 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)

This query produces the plan and profiler trace below.

[Figure: Query plan for exact number of random rows with random generator in WHERE condition]

[Figure: Profiler trace for exact number of random rows with random generator in WHERE condition]

As we can see, the query plan is again much simpler, the read count drops to less than half of the previous one, and the CPU cost is much lower. The duration is only a little lower, roughly equal overall, because of the time spent returning rows to the client.

Selecting an exact percentage sample of random rows

Selecting an exact percentage sample of random rows is very simple and consists of a small modification of the TOP clause to a percentage value. So to select 10 percent of random records we can use the query below.

SELECT TOP 10 PERCENT
    *
FROM Sales.SalesOrderDetail
ORDER BY NEWID()

The query produces the plan below, again more complex than the solution with the WHERE clause. It's even more complicated than the one for TOP without PERCENT, as a Table Spool operator is added to the plan.

[Figure: Query plan for exact percentage sample of random rows]

[Figure: Profiler trace for exact percentage sample of random rows]

From the profiler we again see that it produces an enormous number of reads compared to the approximate percentage sample, even though the number of returned records is similar. The duration and CPU cost are incomparably higher, and it even produces writes because of the Table Spool.

To optimize this, we cannot simply add the approximate-percentage WHERE condition and keep TOP 10 PERCENT, as the TOP 10 PERCENT clause would return only 10 percent of the records already limited by the WHERE clause.
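To make the pitfall concrete, the naive combination below returns roughly 10 percent of 10 percent, i.e. only about 1 percent of the table:

--WRONG: TOP 10 PERCENT applies only to the ~10 % of rows that survive
--the WHERE filter, so only about 1 % of the table comes back
SELECT TOP 10 PERCENT
    *
FROM Sales.SalesOrderDetail
WHERE 0.1 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)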

Instead, we can optimize the reads, execution time and resource cost with the query below.

SELECT TOP ((SELECT COUNT(*) FROM Sales.SalesOrderDetail) / 10)
    *
FROM Sales.SalesOrderDetail
WHERE 0.11 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)

The (SELECT COUNT(*) FROM Sales.SalesOrderDetail) / 10 subquery in the TOP clause calculates 10 percent of the total row count, and the whole query produces the plan and profiler trace below. Note that the WHERE condition uses 0.11 rather than 0.10 to leave a small safety margin, so the filter almost always passes at least as many rows as the TOP clause requires.

[Figure: Query plan for exact percentage sample of random rows with random generator in WHERE clause]

[Figure: Profiler trace for exact percentage sample of random rows with random generator in WHERE clause]

Even though the plan may look more complex than the previous one, from the profiler trace we can see that for the same row count this plan produces 256 times fewer reads, no writes, 3 times lower duration and 6.5 times lower CPU cost.
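The same pattern can also be parameterized. Below is a minimal sketch, where @pct is a hypothetical percentage variable that is not part of the original queries:

--@pct is a hypothetical parameter for this sketch; the WHERE filter passes
--roughly (@pct + 1) percent of rows as a safety margin, and TOP then cuts
--the result to exactly @pct percent of the table
DECLARE @pct int = 10

SELECT TOP ((SELECT COUNT(*) FROM Sales.SalesOrderDetail) * @pct / 100)
    *
FROM Sales.SalesOrderDetail
WHERE (@pct + 1) / 100.0 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)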

Selecting a random number of random records

To select a random number of random records we can again use the above queries and slightly modify the TOP clause by adding a random generator into it. To select a random number of records (or a random percentage) in a particular interval, in this example 10-20 records or 10-20 percent, the queries below can be used:

--Select 10 - 20 random records
SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10)
    *
FROM Sales.SalesOrderDetail
ORDER BY NEWID()

--Select 10 - 20 random records
SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10)
    *
FROM Sales.SalesOrderDetail
WHERE 0.0022 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)

[Figure: Profiler trace for random number of random records]

The plans for these queries are the same as the plans shown above, and from the profiler trace we can again see the incomparably lower cost of the second query compared to the first one.

--Select 10 - 20 % of random records
SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10) PERCENT
    *
FROM Sales.SalesOrderDetail
ORDER BY NEWID()

--Select 10 - 20 % of random records
SELECT TOP ((SELECT COUNT(*) FROM Sales.SalesOrderDetail) * (ABS(CHECKSUM(NEWID())) % 10 + 10) / 100)
    *
FROM Sales.SalesOrderDetail
WHERE 0.21 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
    / CAST (0x7fffffff AS int)

[Figure: Profiler trace for random percentage of random records]

Again the plans are the same as the plans for each version shown above, and the profiler trace again shows that the performance of the second query is incomparably better than the first one.

Complete comparison

To make a complete comparison of the performance of all the queries mentioned in this article we can use the query batch below. All the queries are written as subqueries of SELECT COUNT(*) to avoid skewing the duration measurement by the time spent transferring records from server to client.

--Approximate 10 percent =======================
SELECT COUNT(*) FROM (
    SELECT
        *
    FROM Sales.SalesOrderDetail
    WHERE 0.1 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST (0x7fffffff AS int)
) q
GO
--TOP 100, ORDER BY =============================
SELECT COUNT(*) FROM (
    SELECT TOP 100
        *
    FROM Sales.SalesOrderDetail
    ORDER BY NEWID()
) q
GO
--TOP 100, Approximate =============================
SELECT COUNT(*) FROM (
    SELECT TOP 100
        *
    FROM Sales.SalesOrderDetail
    WHERE 0.0015 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST (0x7fffffff AS int)
) q
GO
--TOP 10 PERCENT, ORDER BY =============================
SELECT COUNT(*) FROM (
    SELECT TOP 10 PERCENT
        *
    FROM Sales.SalesOrderDetail
    ORDER BY NEWID()
) q
GO
--TOP 10 PERCENT, Approximate =============================
SELECT COUNT(*) FROM (
    SELECT TOP ((SELECT COUNT(*) FROM Sales.SalesOrderDetail) / 10)
        *
    FROM Sales.SalesOrderDetail
    WHERE 0.11 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST (0x7fffffff AS int)
) q
GO
--Random 10 - 20 Records, ORDER BY  =============================
SELECT COUNT(*) FROM (
    SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10)
        *
    FROM Sales.SalesOrderDetail
    ORDER BY NEWID()
) q
GO
--Random 10 - 20 Records, Approximate =============================
SELECT COUNT(*) FROM (
    SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10)
        *
    FROM Sales.SalesOrderDetail
    WHERE 0.0022 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST (0x7fffffff AS int)
) q
GO
--Random 10 - 20 PERCENT, ORDER BY =============================
SELECT COUNT(*) FROM (
    SELECT TOP (ABS(CHECKSUM(NEWID())) % 10 + 10) PERCENT
        *
    FROM Sales.SalesOrderDetail
    ORDER BY NEWID()
) q
GO
--Random 10 - 20 PERCENT, Approximate =============================
SELECT COUNT(*) FROM (
    SELECT TOP ((SELECT COUNT(*) FROM Sales.SalesOrderDetail) * (ABS(CHECKSUM(NEWID())) % 10 + 10) / 100)
        *
    FROM Sales.SalesOrderDetail
    WHERE 0.21 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
        / CAST (0x7fffffff AS int)
) q
GO

And here is the profiler output for the above queries.

[Figure: Queries comparison profiler trace output]

Conclusion

As we can see from the queries, their query plans and profiler traces, we have several possibilities for selecting random records from a table. Some of them use a very simple query, but produce a more complex plan with higher cost, and so are suitable only for relatively small tables.

Other versions producing the same results are more complex to write, but produce a much simpler plan and have much better performance compared to the simple-looking ones. Such queries are then suitable even for very large tables, producing results in acceptable time with low performance impact on the server.

