vote buttons
12
2
intermediate
0
beginner
0
advanced
12-Nov-2014 02:09 UTC
Super Human
1164

1 Answers

vote buttons
11

We can select random rows using NewID(), but the query can become expensive as the table becomes very large. Let's first see how to select random rows and then we'll improve the query for performance.

Selecting 10 Random Rows using NewID()

SELECT TOP 10 * FROM myTable ORDER BY NEWID()

In-Depth

The NewID() function is simple. It generates a pseudo-random uniqueidentifier every time it is run. If you execute the following statement, you will get a different result each time.

SELECT NEWID()

The uniqueidentifier's binary representation is 128 bits, which means that there can be in total 3.4 * 10^38 unique identifiers, which is a huge number and hence the probability of getting the same number twice is minuscule. So if we order the rows of a table by NewID() we would get random rows.

Performance Issues

If you look at the query carefully you'll soon realise that the query has to assign a NewID() for each row. Then it has to sort by the uniqueidentifier (Sql Server would optimize this by stopping when it has found the top 10 rows, as it doesn't need to sort the rest). For a table with billions of rows it can get very slow.

Optimizing Performance

Wouldn't it be nice if we could ask SQL Server for a random set of data rows, and then use NewID() method to sort the subset, which would be so much faster? Well we surely can ask Sql Server to do exactly that (SqlServer 2005 and above). Here's how we ask for a random page of 1000 rows.

SELECT * FROM myTable TABLESAMPLE (1000 ROWS)

Important: The above query wont exactly return 1000 Rows, but close to 1000.

The above query returns a random set from anywhere in the table, the rows inside the set still maintain the default order and are not random, so we will still have to use NewId() to randomize the above set. As NewId() will now work on a subset of table it would be much more efficient than the original query.

SELECT TOP 10 * FROM myTable TABLESAMPLE (1000 ROWS) ORDER BY NEWID()

One important point to keep in mind is that your results would be more localized in the second case as compared to the first non-optimized query as all rows though random would at max be 1000 rows away from each other. So select a sufficiently large value for TABLESAMPLE parameter to get more non-localized result. Also TABLESAMPLE can take as a parameter percentages. e.g. TABLESAMPLE(10 PERCENT)

12-Nov-2014 02:11 UTC
Super Human
1164