SAMPLE Clause in SQL


Oracle Database provides the SAMPLE clause that can be issued with a SELECT statement over a table. In the following query, we’re randomly selecting records from the CUSTOMERS table with a 20% probability.

SELECT *
FROM customers
SAMPLE (20);

This means that each record has a 20% chance of being selected. Perhaps contrary to expectations, this will normally not produce a result with 20% of the records from the CUSTOMERS table. This approximate sample size is quite adequate for most situations.

A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of records has the same chance of being selected, 20% in our example. Since records are selected at the block level, this offers a performance improvement for large tables and should not adversely impact the randomness of the sample.

SAMPLE and SAMPLE BLOCK allow the sample percent to range from .000001 to, but not including, 100. An optional second parameter is the seed value, used to help ensure repeatability between executions. The seed value can range from 0 to 4294967295.

SELECT *
FROM customers
SAMPLE BLOCK (20, 8621);

Repeatability is important if, e.g., you want to compare models or test results across model builds, where the only variation is model settings.

In 10g the syntax for the sample clause is:
#1. sample [block] ( N [,M] ) [seed(S)]
You can sample rows or blocks (optional keyword [block])
You must specify the percentage sample size (N) – recognizing that the result will only approximate that size
You can specify that the sample should consist of groups of ([,M]) consecutive rows or blocks
You can supply a seed value ([seed (S)]) to the random number generator that drives the sample so that the sample is repeatable.
The ‘seed’ option appeared in 10g, and is documented there.
The ‘consecutive rows/blocks’ option doesn’t seem to be documented but appears in 9i.

While SAMPLE is built into the SQL syntax, it has limitations when applied to views or complex joins. SAMPLE relies on the existence of primary keys. If the underlying tables have the necessary primary keys, you may be able to use a view, but if the primary keys are absent, the SAMPLE clause may not work.

To address this limitation, we can use a technique that uses ORA_HASH.

ORA_HASH

The ORA_HASH technique can be used with both tables and views, whether or not primary keys are specified. The Oracle Data Miner user interface uses this technique to support sampling of both tables and views.

reference:   https://jonathanlewis.wordpress.com

https://blogs.oracle.com

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s