In this chapter
Calculating Running Statistics
Finding Sequences, Runs, and Regions
Limiting the Number of Rows Returned
This chapter describes how to solve common problems with SQL programs that
Contain subtle or clever combinations of standard SQL elements
or
A running statistic, or cumulative statistic, is a row-by-row calculation that uses progressively more data values, starting with a single value (the first value), continuing with more values in the order in which they’re supplied, and ending with all the values. The running sum (total) and running average (arithmetic mean) are the most common running statistics.
Listing 15.1 calculates the running sum and running average of book sales, along with a cumulative count of data items. The query cross-joins two instances of the table titles, grouping the result by the first-table (t1) title IDs and limiting the second-table (t2) rows to ID values smaller than or equal to the t1 row to which they’re joined. The intermediate cross-joined table, to which SUM(), AVG(), and COUNT() are applied, looks like this:
t1.id t1.sales t2.id t2.sales
----- -------- ----- --------
T01 566 T01 566
T02 9566 T01 566
T02 9566 T02 9566
T03 25667 T01 566
T03 25667 T02 9566
T03 25667 T03 25667
T04 13001 T01 566
T04 13001 T02 9566
T04 13001 T03 25667
T04 13001 T04 13001
T05 201440 T01 566
...
Note that the running statistics don’t change for title T10 because its sales value is null. The ORDER BY clause is necessary because GROUP BY doesn’t sort the result implicitly. See Figure 15.1 for the result.
Listing 15.1Calculate the running sum, average, and count of book sales. See Figure 15.1 for the result.
SELECT
t1.title_id,
SUM(t2.sales) AS RunSum,
AVG(t2.sales) AS RunAvg,
COUNT(t2.sales) AS RunCount
FROM titles t1, titles t2
WHERE t1.title_id >= t2.title_id
GROUP BY t1.title_id
ORDER BY t1.title_id;
Figure 15.1Result of Listing 15.1.
title_id RunSum RunAvg RunCount
-------- ------- ------ --------
T01 566 566 1
T02 10132 5066 2
T03 35799 11933 3
T04 48800 12200 4
T05 250240 50048 5
T06 261560 43593 6
T07 1761760 251680 7
T08 1765855 220731 8
T09 1770855 196761 9
T10 1770855 196761 9
T11 1864978 186497 10
T12 1964979 178634 11
T13 1975446 164620 12
A moving average is a way of smoothing a time series (such as a list of prices changing over time) by replacing each value by an average of that value and its nearest neighbors. Calculating a moving average is easy if you have a column that contains a sequence of integers or dates, such as in this table, named time_series:
seq price
--- -----
1 10.0
2 10.5
3 11.0
4 11.0
5 10.5
6 11.5
7 12.0
8 13.0
9 15.0
10 13.5
11 13.0
12 12.5
13 12.0
14 12.5
15 11.0
Listing 15.2 calculates the moving average of price. See Figure 15.2 for the result. Each value in the result’s moving-average column is the average of five values: the price in the current row and the prices in the four preceding rows (as ordered by seq). The first four rows are omitted because they don’t have the required number of preceding values. You can adjust the values in the WHERE clause to cover any size averaging window. To make Listing 15.2 calculate a five-point moving average that averages each price with the two prices before it and the two prices after it, for example, change the WHERE clause to:
WHERE t1.seq >= 3
AND t1.seq <= 13
AND t1.seq BETWEEN t2.seq - 2 AND t2.seq + 2
Listing 15.2Calculate a moving average with a five-point window. See Figure 15.2 for the result.
SELECT t1.seq, AVG(t2.price) AS MovingAvg
FROM time_series t1, time_series t2
WHERE t1.seq >= 5
AND t1.seq BETWEEN t2.seq AND t2.seq + 4
GROUP BY t1.seq
ORDER BY t1.seq;
Figure 15.2Result of Listing 15.2.
seq MovingAvg
--- ---------
5 10.6
6 10.9
7 11.2
8 11.6
9 12.4
10 13.0
11 13.3
12 13.4
13 13.2
14 12.7
15 12.2
If you have a table that already has running totals, then you can calculate the differences between pairs of successive rows. Listing 15.3 backs out the intercity distances from the following table, named roadtrip, which contains the cumulative distances for each leg of a trip from Seattle, Washington, to San Diego, California. See Figure 15.3 for the result.
seq city miles
--- ----------------- -----
1 Seattle, WA 0
2 Portland, OR 174
3 San Francisco, CA 808
4 Monterey, CA 926
5 Los Angeles, CA 1251
6 San Diego, CA 1372
Listing 15.3Calculate intercity distances from cumulative distances. See Figure 15.3 for the result.
SELECT
t1.seq AS seq1,
t2.seq AS seq2,
t1.city AS city1,
t2.city AS city2,
t1.miles AS miles1,
t2.miles AS miles2,
t2.miles - t1.miles AS dist
FROM roadtrip t1, roadtrip t2
WHERE t1.seq + 1 = t2.seq
ORDER BY t1.seq;
Figure 15.3Result of Listing 15.3.
seq1 seq2 city1 city2 miles1 miles2 dist
---- ---- ----------------- ----------------- ------ ------ ----
1 2 Seattle, WA Portland, OR 0 174 174
2 3 Portland, OR San Francisco, CA 174 808 634
3 4 San Francisco, CA Monterey, CA 808 926 118
4 5 Monterey, CA Los Angeles, CA 926 1251 325
5 6 Los Angeles, CA San Diego, CA 1251 1372 121
In Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL, you can use window functions to calculate running statistics; for example:
SELECT title_id, sales,
SUM(sales) OVER (ORDER BY title_id)
AS RunSum
FROM titles
ORDER BY title_id;
Recall from “Unique Identifiers” in Chapter 3 that you can use sequences of autogenerated integers to create identity columns (typically for primary keys). The SQL standard provides sequence generators to create them.
To define a sequence generator:
Type:
CREATE SEQUENCE seq_name
[INCREMENT [BY] increment]
[MINVALUE min | NO MINVALUE]
[MAXVALUE max | NO MAXVALUE]
[START [WITH] start]
[[NO] CYCLE];
seq_name is the name (a unique identifier) of the sequence to create.
increment specifies which value is added to the current sequence value to create a new value. A positive value will make an ascending sequence; a negative one, a descending sequence. The value of increment can’t be zero. If the clause INCREMENT BY is omitted, then the default increment is 1.
min specifies the minimum value that a sequence can generate. If the clause MINVALUE is omitted or NO MINVALUE is specified, then a default minimum is used. The defaults vary by DBMS, but they’re typically 1 for an ascending sequence or a very large number for a descending one.
max (> min) specifies the maximum value that a sequence can generate. If the clause MAXVALUE is omitted or NO MAXVALUE is specified, then a default maximum is used. The defaults vary by DBMS, but they’re typically a very large number for an ascending sequence or −1 for a descending one.
start specifies the first value of the sequence. If the clause START WITH is omitted, then the default starting value is min for an ascending sequence or max for a descending one.
CYCLE indicates that the sequence continues to generate values after reaching either its min or max. After an ascending sequence reaches its maximum value, it then generates its minimum value. After a descending sequence reaches its minimum value, it then generates its maximum value. NO CYCLE (the default) indicates that the sequence can’t generate more values after reaching its maximum or minimum value.
Listing 15.4 defines the sequence shown in Figure 15.4.
Listing 15.4Create a sequence generator for the consecutive integers 1 to 10000. See Figure 15.4 for the result.
CREATE SEQUENCE part_seq
INCREMENT BY 1
MINVALUE 1
MAXVALUE 10000
START WITH 1
NO CYCLE;
Figure 15.4The sequence that Listing 15.4 generates.
1
2
3
...
9998
9999
10000
You can use a sequence generator in a few ways. Standard SQL provides the built-in function NEXT VALUE FOR to increment a sequence value, as in:
INSERT INTO shipment(part_num, desc, quantity)
VALUES(NEXT VALUE FOR part_seq, 'motherboard', 5);
If you’re creating a column of unique values, then you can use the keyword IDENTITY to define a sequence right in the CREATE TABLE statement:
CREATE TABLE parts (
part_num INTEGER AS
IDENTITY(
INCREMENT BY 1
MINVALUE 1
MAXVALUE 10000
START WITH 1
NO CYCLE),
desc AS VARCHAR(100),
quantity INTEGER;
This table definition lets you omit NEXT VALUE FOR when you insert a row:
INSERT INTO shipment(
desc,
quantity)
VALUES(
'motherboard',
5);
SQL also provides ALTER SEQUENCE and DROP SEQUENCE to change and remove sequence generators.
Microsoft SQL Server, Oracle, Db2, and PostgreSQL support CREATE SEQUENCE, ALTER SEQUENCE, and DROP SEQUENCE. In Oracle, use NOCYCLE instead of NO CYCLE. See your DBMS documentation to see how sequences are used in your system.
Most DBMSs don’t support IDENTITY columns because they have other (pre-standard) ways that define columns with unique values; see “Unique Identifiers” in Chapter 3. PostgreSQL’s generate_series() function offers a quick way to generate numbered rows.
A one-column table containing a sequence of consecutive integers makes it easy to solve problems that would otherwise be difficult with SQL’s limited computational power. Sequence tables aren’t actually part of the data model—they’re auxiliary tables that are adjuncts to queries and other “real” tables.
You can create a sequence table by using one of the methods just described. Alternatively, you can create one by using Listing 15.5, which creates the sequence table seq by cross-joining the intermediate table temp09 with itself. The CAST expression concatenates digit characters into sequential numbers and then casts them as integers. You can drop the table temp09 after seq is created. Figure 15.5 shows the result. The table seq contains the integer sequence 0, 1, 2,..., 9999. You can shrink or grow this sequence by changing the SELECT and FROM expressions in the INSERT INTO seq statement.
Listing 15.5Create a one-column table that contains consecutive integers. See Figure 15.5 for the result.
CREATE TABLE temp09 (
i CHAR(1) NOT NULL PRIMARY KEY
);
INSERT INTO temp09 VALUES('0');
INSERT INTO temp09 VALUES('1');
INSERT INTO temp09 VALUES('2');
INSERT INTO temp09 VALUES('3');
INSERT INTO temp09 VALUES('4');
INSERT INTO temp09 VALUES('5');
INSERT INTO temp09 VALUES('6');
INSERT INTO temp09 VALUES('7');
INSERT INTO temp09 VALUES('8');
INSERT INTO temp09 VALUES('9');
CREATE TABLE seq (
i INTEGER NOT NULL PRIMARY KEY
);
INSERT INTO seq
SELECT CAST(t1.i || t2.i ||
t3.i || t4.i AS INTEGER)
FROM temp09 t1, temp09 t2,
temp09 t3, temp09 t4;
DROP TABLE temp09;
Figure 15.5Result of Listing 15.5.
i
-----
0
1
2
3
4
...
9996
9997
9998
9999
A sequence table is especially useful for enumerative and datetime functions. Listing 15.6 lists the 95 printable characters in the ASCII character set (if that’s the character set in use). See Figure 15.6 for the result.
Listing 15.6List the characters associated with a set of character codes. See Figure 15.6 for the result.
SELECT
i AS CharCode,
CHR(i) AS Ch
FROM seq
WHERE i BETWEEN 32 AND 126;
Figure 15.6Result of Listing 15.6.
CharCode Ch
-------- --
32
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
...
Listing 15.7 adds monthly intervals to today’s date (7-March-2005) for the next six months. See Figure 15.7 for the result. This example works on Microsoft SQL Server; other DBMSs have similar functions that increment dates.
Listing 15.7Increment today’s date to six months hence, in one-month intervals. See Figure 15.7 for the result.
SELECT
i AS MonthsAhead,
DATEADD("m", i, CURRENT_TIMESTAMP)
AS FutureDate
FROM seq
WHERE i BETWEEN 1 AND 6;
Figure 15.7Result of Listing 15.7.
MonthsAhead FutureDate
----------- ----------
1 2005-04-07
2 2005-05-07
3 2005-06-07
4 2005-07-07
5 2005-08-07
6 2005-09-07
Sequence tables are handy for normalizing data that you’ve imported from a nonrelational source such as a spreadsheet or accounting package. Suppose that you have the following non-normalized table, named au_orders, showing the order of the authors’ names on each book’s cover:
title_id author1 author2 author3
-------- ------- ------- -------
T01 A01 NULL NULL
T02 A01 NULL NULL
T03 A05 NULL NULL
T04 A03 A04 NULL
T05 A04 NULL NULL
T06 A02 NULL NULL
T07 A02 A04 NULL
T08 A06 NULL NULL
T09 A06 NULL NULL
T10 A02 NULL NULL
T11 A06 A03 A04
T12 A02 NULL NULL
T13 A01 NULL NULL
Listing 15.8 cross-joins au_orders with seq to produce Figure 15.8. You can DELETE the result rows with nulls in the column au_id, leaving the result set looking like the table title_authors in the sample database. Note that Listing 15.8 does the reverse of Listing 8.18 in Chapter 8.
Listing 15.8Normalize the table au_orders. See Figure 15.8 for the result.
SELECT title_id,
(CASE WHEN i=1 THEN '1'
WHEN i=2 THEN '2'
WHEN i=3 THEN '3'
END) AS au_order,
(CASE WHEN i=1 THEN author1
WHEN i=2 THEN author2
WHEN i=3 THEN author3
END) AS au_id
FROM au_orders, seq
WHERE i BETWEEN 1 AND 3
ORDER BY title_id, i;
Figure 15.8Result of Listing 15.8.
title_id au_order au_id
-------- -------- -----
T01 1 A01
T01 2 NULL
T01 3 NULL
T02 1 A01
T02 2 NULL
T02 3 NULL
T03 1 A05
T03 2 NULL
T03 3 NULL
T04 1 A03
T04 2 A04
T04 3 NULL
T05 1 A04
T05 2 NULL
T05 3 NULL
T06 1 A02
T06 2 NULL
T06 3 NULL
T07 1 A02
T07 2 A04
T07 3 NULL
T08 1 A06
T08 2 NULL
T08 3 NULL
T09 1 A06
T09 2 NULL
T09 3 NULL
T10 1 A02
T10 2 NULL
T10 3 NULL
T11 1 A06
T11 2 A03
T11 3 A04
T12 1 A02
T12 2 NULL
T12 3 NULL
T13 1 A01
T13 2 NULL
T13 3 NULL
To run Listing 15.5 in Microsoft Access and Microsoft SQL Server, change the CAST expression to:
t1.i + t2.i + t3.i + t4.i
To run Listing 15.5 in MySQL, change the CAST expression to:
CONCAT(t1.i, t2.i, t3.i, t4.i)
To run Listing 15.6 in Microsoft SQL Server and MySQL, change CHR(i)
to CHAR(i)
.
To run Listing 15.8 in Microsoft Access, change the CASE expressions to Switch() function calls:
(Switch(i=1, '1', i=2, '2', i=3, '3')) AS au_order,
(Switch(i=1, author1, i=2, author2, i=3, author3)) AS au_id
Calendar Tables
Another useful auxiliary table is a calendar table. One type of calendar table has a primary-key column that contains a row for each calendar date (past and future) and other columns that indicate the date’s attributes: business day, holiday, international holiday, fiscal-month end, fiscal-year end, Julian date, business-day offsets, and so on. Another type of calendar table stores the starting and ending dates of events (in the columns event_id, start_date, and end_date, for example). Spreadsheets have more date-arithmetic functions than DBMSs, so it might be easier to build a calendar table in a spreadsheet and then import it as a database table.
Even if your DBMS has plenty of date-arithmetic functions, it might be faster to look up data in a calendar table than to call these functions in a query.
A sequence is a series of consecutive values without gaps. A run is like a sequence, but the values don’t have to be consecutive, just increasing (that is, gaps are allowed). A region is an unbroken series of values that all are equal.
Finding these series requires a table that has at least two columns: a primary-key column that holds a sequence of consecutive integers and a column that holds the values of interest. The table temps (Listing 15.9 and Figure 15.9) shows a series of high temperatures over 15 days.
As a set-oriented language, SQL isn’t a good choice for finding series of values. The following queries won’t run very fast, so if you have a lot of data to analyze, then consider exporting it to a statistical package or using a procedural host language.
Listing 15.9List all the columns in the table temps. See Figure 15.9 for the result.
SELECT *
FROM temps;
Figure 15.9Result of Listing 15.9.
id hi_temp
-- -------
1 49
2 46
3 48
4 50
5 50
6 50
7 51
8 52
9 53
10 50
11 50
12 47
13 50
14 51
15 52
Listing 15.10 finds all the sequences in temps and lists each sequence’s start position, end position, and length. See Figure 15.10 for the result.
Listing 15.10List the starting point, ending point, and length of each sequence in the table temps. See Figure 15.10 for the result.
SELECT
t1.id AS StartSeq,
t2.id AS EndSeq,
t2.id - t1.id + 1 AS SeqLen
FROM temps t1, temps t2
WHERE (t1.id < t2.id)
AND NOT EXISTS(
SELECT *
FROM temps t3
WHERE (t3.hi_temp - t3.id <>
t1.hi_temp - t1.id
AND t3.id BETWEEN
t1.id AND t2.id)
OR (t3.id = t1.id - 1
AND t3.hi_temp - t3.id =
t1.hi_temp - t1.id)
OR (t3.id = t2.id + 1
AND t3.hi_temp - t3.id =
t1.hi_temp - t1.id)
);
Figure 15.10Result of Listing 15.10.
StartSeq EndSeq SeqSize
-------- ------ -------
6 9 4
13 15 3
Listing 15.10 is a lot to take in at first glance, but it’s easier to understand it if you look at it piecemeal. Then you’ll be able to understand the rest of the queries in this section.
The subquery’s WHERE clause subtracts id from hi_temp, yielding (internally):
id hi_temp diff
-- ------- ----
1 49 48
2 46 44
3 48 45
4 50 46
5 50 45
6 50 44
7 51 44
8 52 44
9 53 44
10 50 40
11 50 39
12 47 35
13 50 37
14 51 37
15 52 37
In the column diff, note that successive differences are constant for sequences (50 − 6 = 44, 51 − 7 = 44, and so on). To find neighboring rows, the outer query cross-joins two instances of the same table (t1 and t2), as described in “Calculating Running Statistics” earlier in this chapter. The condition
WHERE (t1.id < t2.id)
guarantees that any t1 row represents an element with an index (id) lower than the corresponding t2 row.
The subquery detects sequence breaks with the condition
t3.hi_temp - t3.id <> t1.hi_temp - t1.id
The third instance of temps (t3) in the subquery is used to determine whether any row in a candidate sequence (t3) has the same difference as the sequence’s first row (t1). If so, then it’s a sequence member. If not, then the candidate pair (t1 and t2) is rejected.
The last two OR conditions determine whether the candidate sequence’s borders can expand. A row that satisfies these conditions means the current candidate sequence can be extended and is rejected in favor of a longer one.
To find only sequences larger than n rows, add the WHERE condition
AND (t2.id - t1.id) >= n - 1
To change Listing 15.10 to find all sequences of four or more rows, for example, replace
WHERE (t1.id < t2.id)
with
WHERE (t1.id < t2.id)
AND (t2.id - t1.id) >= 3
The result is:
StartSeq EndSeq SeqSize
-------- ------ -------
6 9 4
Listing 15.11 finds all the runs in temps and lists each run’s start position, end position, and length. See Figure 15.11 for the result.
The logic of this query is similar to that of the preceding one but accounts for run values needing only to increase, not (necessarily) be consecutive. The difference between id and hi_temp values doesn’t have to be constant, so a fourth instance of temps (t4) is needed. The subquery cross-joins t3 and t4 to check rows in the middle of a candidate run, whose borders are t1 and t2. For every element between t1 and t2 (limited by BETWEEN), t3 and its predecessor t4 are compared to see whether their values are increasing.
Listing 15.11List the starting point, ending point, and length of each run in the table temps. See Figure 15.11 for the result.
SELECT
t1.id AS StartRun,
t2.id AS EndRun,
t2.id - t1.id + 1 AS RunLen
FROM temps t1, temps t2
WHERE (t1.id < t2.id)
AND NOT EXISTS(
SELECT *
FROM temps t3, temps t4
WHERE (t3.hi_temp <= t4.hi_temp
AND t4.id = t3.id - 1
AND t3.id BETWEEN
t1.id + 1 AND t2.id)
OR (t3.id = t1.id - 1
AND t3.hi_temp < t1.hi_temp)
OR (t3.id = t2.id + 1
AND t3.hi_temp > t2.hi_temp)
);
Figure 15.11Result of Listing 15.11.
StartRun EndRun RunLen
-------- ------ ------
2 4 3
6 9 4
12 15 4
Listing 15.12 finds all regions in temps with a high temperature of 50 and lists each region’s start position, end position, and length. See Figure 15.12 for the result.
Listing 15.12List the starting point, ending point, and length of each region (with value 50) in the table temps. See Figure 15.12 for the result.
SELECT
t1.id AS StartReg,
t2.id AS EndReg,
t2.id - t1.id + 1 AS RegLen
FROM temps t1, temps t2
WHERE (t1.id < t2.id)
AND NOT EXISTS(
SELECT *
FROM temps t3
WHERE (t3.hi_temp <> 50
AND t3.id BETWEEN
t1.id AND t2.id)
OR (t3.id = t1.id - 1
AND t3.hi_temp = 50)
OR (t3.id = t2.id + 1
AND t3.hi_temp = 50)
);
Figure 15.12Result of Listing 15.12.
StartReg EndReg RegLen
-------- ------ ------
4 6 3
10 11 2
Listing 15.13 is a variation of Listing 15.12 that finds all regions of length 2. See Figure 15.13 for the result. Note that overlapping subregions are listed. To return regions of length n, change the WHERE clause’s second condition to:
AND t2.id - t1.id = n - 1
Listing 15.13List all regions of length 2. See Figure 15.13 for the result.
SELECT
t1.id AS StartReg,
t2.id AS EndReg,
t2.id - t1.id + 1 AS RegLen
FROM temps t1, temps t2
WHERE (t1.id < t2.id)
AND t2.id - t1.id = 1
AND NOT EXISTS(
SELECT *
FROM temps t3
WHERE (t3.hi_temp <> 50
AND t3.id BETWEEN t1.id AND t2.id)
);
Figure 15.13Result of Listing 15.13.
StartReg EndReg RegLen
-------- ------ ------
4 5 2
5 6 2
10 11 2
To rank regions by length, add an ORDER BY clause to the outer query:
ORDER BY t2.id - t1.id DESC
To list the individual ids that fall in a region (with value 50), type:
SELECT DISTINCT t1.id
FROM temps t1, temps t2
WHERE t1.hi_temp = 50
AND t2.hi_temp = 50
AND ABS(t1.id - t2.id) = 1;
The standard function ABS(), which all DBMSs support, returns the absolute value of its argument. The result is:
id
--
4
5
6
10
11
In practice it’s common to use queries that return a certain number (n) of rows that fall at the top or the bottom of a range specified by an ORDER BY clause. SQL doesn’t require an ORDER BY clause, but if you omit it, then the query will return an unsorted set of rows (because SQL doesn’t promise to deliver query results in any particular order without an ORDER BY clause).
The examples in this section use the table empsales (Listing 15.14 and Figure 15.14), which lists sales figures by employee. Note that some employees have the same sales amounts. A correct query for the top three salespeople in empsales actually will return four rows: employees E09, E02, E10, and E05. Ties shouldn’t force the query to choose arbitrarily between equal values (E10 and E05 in this case). No standard terminology exists, but queries that return at most n rows (regardless of ties) sometimes are called limit queries. Queries that include ties and return possibly more than n rows are top-n queries or quota queries.
You can also use limit and quota queries to limit the number of rows affected by an UPDATE or DELETE statement. Some limit and quota queries might be invalid in some contexts (such as in subqueries or views); see your DBMS documentation.
Listing 15.14List employees by descending sales. See Figure 15.14 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC;
Figure 15.14Result of Listing 15.14.
emp_id sales
------ -----
E09 900
E02 800
E10 700
E05 700
E01 600
E04 500
E03 500
E06 500
E08 400
E07 300
The SQL:2003 standard introduced the functions ROW_NUMBER() and RANK() to use in limit and top-n queries. Microsoft SQL Server, Oracle, and Db2 support both functions. Queries that use pre-2003 SQL are complex, unintuitive, and run slowly (see the tips at the end of this section for an SQL-92 example). The SQL standard has lagged DBMSs, which for years have offered nonstandard extensions to create these types of queries. Some DBMSs also let you return a percentage of rows (rather than a fixed n) or return offsets by skipping a specified number of initial rows (returning rows 3–8 instead of 1–5, for example). This section covers the DBMS extensions individually.
Listing 15.15 lists the top three salespeople, including ties. See Figure 15.15 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
The TOP clause always includes ties. Its syntax is:
TOP n [PERCENT]
Listing 15.15List the top three salespeople, with ties. See Figure 15.15 for the result.
SELECT TOP 3 emp_id, sales
FROM empsales
ORDER BY sales DESC;
Figure 15.15Result of Listing 15.15.
emp_id sales
------ -----
E09 900
E02 800
E10 700
E05 700
Listing 15.16 lists the bottom 40 percent of salespeople, including ties. See Figure 15.16 for the result. This query orders lowest to highest; to reverse the order, change ASC to DESC in the ORDER BY clause.
Listing 15.16List the bottom 40 percent of salespeople, with ties. See Figure 15.16 for the result.
SELECT TOP 40 PERCENT emp_id, sales
FROM empsales
ORDER BY sales ASC;
Figure 15.16Result of Listing 15.16.
emp_id sales
------ -----
E07 300
E08 400
E06 500
E04 500
E03 500
The following offset query returns n rows but excludes the topmost skip rows from the result. This query orders highest to lowest; to reverse the order, change ASC to DESC and DESC to ASC in each ORDER BY clause.
SELECT *
FROM (
SELECT TOP n *
FROM (
SELECT TOP n + skip *
FROM table
ORDER BY sort_col DESC)
ORDER BY sort_col ASC)
ORDER BY sort_col DESC;
Listing 15.17 lists the top three salespeople, not including ties. See Figure 15.17 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
The TOP clause’s syntax is:
TOP n [PERCENT] [WITH TIES]
Listing 15.17List the top three salespeople, without ties. See Figure 15.17 for the result.
SELECT TOP 3 emp_id, sales
FROM empsales
ORDER BY sales DESC;
Figure 15.17Result of Listing 15.17.
emp_id sales
------ -----
E09 900
E02 800
E05 700
Listing 15.18 lists the top three salespeople, including ties. See Figure 15.18 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.18List the top three salespeople, with ties. See Figure 15.18 for the result.
SELECT TOP 3 WITH TIES emp_id, sales
FROM empsales
ORDER BY sales DESC;
Figure 15.18Result of Listing 15.18.
emp_id sales
------ -----
E09 900
E02 800
E05 700
E10 700
Listing 15.19 lists the bottom 40 percent of salespeople, including ties. See Figure 15.19 for the result. This query orders lowest to highest; to reverse the order, change ASC to DESC in the ORDER BY clause.
Listing 15.19List the bottom 40 percent of salespeople, with ties. See Figure 15.19 for the result.
SELECT TOP 40 PERCENT WITH TIES
emp_id, sales
FROM empsales
ORDER BY sales ASC;
Figure 15.19Result of Listing 15.19.
emp_id sales
------ -----
E07 300
E08 400
E06 500
E03 500
E04 500
SET
ROWCOUNT
n
provides an alternative method returning n rows.The following offset query returns n rows but excludes the topmost skip rows from the result. This query orders highest to lowest; to reverse the order, change ASC to DESC and DESC to ASC in each ORDER BY clause.
SELECT *
FROM (
SELECT TOP n *
FROM (
SELECT TOP n + skip *
FROM table
ORDER BY sort_col DESC)
AS any_name1
ORDER BY sort_col ASC)
AS any_name2
ORDER BY sort_col DESC;
Use the built-in ROWNUM pseudocolumn to limit the number of rows returned. The first row selected has a ROWNUM of 1, the second has 2, and so on. Use the window function RANK() to include ties.
Listing 15.20 lists the top three salespeople, not including ties. See Figure 15.20 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.20List the top three salespeople, without ties. See Figure 15.20 for the result.
SELECT emp_id, sales
FROM (
SELECT *
FROM empsales
ORDER BY sales DESC)
WHERE ROWNUM <= 3;
Figure 15.20Result of Listing 15.20.
emp_id sales
------ -----
E09 900
E02 800
E05 700
Listing 15.21 lists the top three salespeople, including ties. See Figure 15.21 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.21List the top three salespeople, with ties. See Figure 15.21 for the result.
SELECT emp_id, sales
FROM (
SELECT
RANK() OVER
(ORDER BY sales DESC)
AS sales_rank,
emp_id,
sales
FROM empsales)
WHERE sales_rank <= 3;
Figure 15.21Result of Listing 15.21.
emp_id sales
------ -----
E09 900
E02 800
E05 700
E10 700
The following offset query returns n rows but excludes the topmost skip rows from the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER
(ORDER BY sort_col DESC)
AS rnum,
columns
FROM table)
WHERE rnum > skip
AND rnum <= (n + skip);
Listing 15.22 lists the top three salespeople, not including ties. See Figure 15.22 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
The FETCH clause’s syntax is:
FETCH FIRST n ROW[S] ONLY
Listing 15.22List the top three salespeople, without ties. See Figure 15.22 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC
FETCH FIRST 3 ROWS ONLY;
Figure 15.22Result of Listing 15.22.
emp_id sales
------ -----
E09 900
E02 800
E05 700
Listing 15.23 lists the top three salespeople, including ties. See Figure 15.23 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.23List the top three salespeople, with ties. See Figure 15.23 for the result.
SELECT emp_id, sales
FROM (
SELECT
RANK() OVER
(ORDER BY sales DESC)
AS sales_rank,
emp_id,
sales
FROM empsales)
AS any_name
WHERE sales_rank <= 3;
Figure 15.23Result of Listing 15.23.
emp_id sales
------ -----
E09 900
E02 800
E05 700
E10 700
The following offset query returns n rows but excludes the topmost skip rows from the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER
(ORDER BY sort_col DESC)
AS rnum,
columns
FROM table)
AS any_name
WHERE rnum > skip
AND rnum <= n + skip;
Listing 15.24 lists the top three salespeople, not including ties. See Figure 15.24 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
The LIMIT clause’s syntax is:
LIMIT n [OFFSET skip]
or
LIMIT [skip,] n
The offset of the initial row is 0 (not 1).
Listing 15.24List the top three salespeople, without ties. See Figure 15.24 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC
LIMIT 3;
Figure 15.24Result of Listing 15.24.
emp_id sales
------ -----
E09 900
E02 800
E10 700
Listing 15.25 lists the top three salespeople, including ties. The OFFSET value is n − 1 = 2. COALESCE()’s second argument lets the query work in case the table has fewer than n rows; see “Checking for Nulls with COALESCE()” in Chapter 5. See Figure 15.25 for the result. This query orders highest to lowest; to reverse the order, change >= to <= in the comparison, change MIN() to MAX() in the second subquery, and change DESC to ASC in each ORDER BY clause.
Listing 15.25List the top three salespeople, with ties. See Figure 15.25 for the result.
SELECT emp_id, sales
FROM empsales
WHERE sales >= COALESCE(
(SELECT sales
FROM empsales
ORDER BY sales DESC
LIMIT 1 OFFSET 2),
(SELECT MIN(sales)
FROM empsales))
ORDER BY sales DESC;
Figure 15.25Result of Listing 15.25.
emp_id sales
------ -----
E09 900
E02 800
E05 700
E10 700
Listing 15.26 lists the top three salespeople, skipping the initial four rows. See Figure 15.26 for the result. Note that this query is inconsistent when ties exist. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.26List the top three salespeople, skipping the initial four rows. See Figure 15.26 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC
LIMIT 3 OFFSET 4;
Figure 15.26Result of Listing 15.26.
emp_id sales
------ -----
E01 600
E04 500
E03 500
Listing 15.27 lists the top three salespeople, not including ties. See Figure 15.27 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
The LIMIT clause’s syntax is:
LIMIT n [OFFSET skip]
The offset of the initial row is 0 (not 1).
PostgreSQL 13 and later versions support the OFFSET and FETCH clauses:
OFFSET skip [ROW | ROWS]
FETCH [FIRST | NEXT] [n] [ROW | ROWS]
[ONLY | WITH TIES]
If n is omitted in a FETCH clause, then it defaults to 1. For example,
FETCH FIRST WITH TIES
returns any additional rows that match the last row.
The queries in this section use the LIMIT clause to ensure backward compatibility.
Listing 15.27List the top three salespeople, without ties. See Figure 15.27 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC
LIMIT 3;
Figure 15.27Result of Listing 15.27.
emp_id sales
------ -----
E09 900
E02 800
E05 700
Listing 15.28 lists the top three salespeople, including ties. The OFFSET value is n − 1 = 2. See Figure 15.28 for the result. This query orders highest to lowest; to reverse the order, change >= to <= in the comparison and change DESC to ASC in each ORDER BY clause.
Listing 15.28List the top three salespeople, with ties. See Figure 15.28 for the result.
SELECT emp_id, sales
FROM empsales
WHERE (
sales >= (
SELECT sales
FROM empsales
ORDER BY sales DESC
LIMIT 1 OFFSET 2)
) IS NOT FALSE
ORDER BY sales DESC;
Figure 15.28Result of Listing 15.28.
emp_id sales
------ -----
E09 900
E02 800
E10 700
E05 700
Listing 15.29 lists the top three salespeople, skipping the initial four rows. See Figure 15.29 for the result. Note that this query is inconsistent when ties exist. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.
Listing 15.29List the top three salespeople, skipping the initial four rows. See Figure 15.29 for the result.
SELECT emp_id, sales
FROM empsales
ORDER BY sales DESC
LIMIT 3 OFFSET 4;
Figure 15.29Result of Listing 15.29.
emp_id sales
------ -----
E01 600
E06 500
E03 500
Fabian Pascal’s Practical Issues in Database Management discusses quota queries. His SQL-92 solution (which is too slow for practical use) to list the top three salespeople, including ties, is:
SELECT emp_id, sales
FROM empsales e1
WHERE (
SELECT COUNT(*)
FROM empsales e2
WHERE e2.sales > e1.sales
) < 3;
This query orders highest to lowest; to reverse the order, change > to < in the innermost WHERE clause.
Ranking, which allocates the numbers 1, 2, 3,... to sorted values, is related to top-n queries and shares the problem of interpreting ties. The following queries calculate ranks for sales values in the table empsales from “Limiting the Number of Rows Returned”.
Listings 15.30a to 15.30e rank employees by sales. The first two queries show the most commonly accepted ways to rank values. The other queries show variations on them. Figure 15.30 shows the result of each ranking method, a to e, combined for brevity and ease of comparison. These queries rank highest to lowest; to reverse the order, change > (or >=) to < (or <=) in the WHERE comparisons. You can add the clause ORDER BY ranking ASC
to a query’s outer SELECT to sort the results by rank.
Listing 15.30aRank employees by sales (method a). See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
(SELECT COUNT(sales)
FROM empsales e2
WHERE e2.sales >= e1.sales)
AS ranking
FROM empsales e1;
Listing 15.30bRank employees by sales (method b). See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
(SELECT COUNT(sales)
FROM empsales e2
WHERE e2.sales > e1.sales) + 1
AS ranking
FROM empsales e1;
Listing 15.30cRank employees by sales (method c). See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
(SELECT COUNT(sales)
FROM empsales e2
WHERE e2.sales > e1.sales)
AS ranking
FROM empsales e1;
Listing 15.30dRank employees by sales (method d). See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
(SELECT COUNT(DISTINCT sales)
FROM empsales e2
WHERE e2.sales >= e1.sales)
AS ranking
FROM empsales e1;
Listing 15.30eRank employees by sales (method e). See Figure 15.30 for the result.
SELECT e1.emp_id, e1.sales,
(SELECT COUNT(DISTINCT sales)
FROM empsales e2
WHERE e2.sales > e1.sales)
AS ranking
FROM empsales e1;
Figure 15.30Compilation of results of Listings 15.30a to 15.30e.
emp_id sales a b c d e
------ ----- -- -- -- -- --
E09 900 1 1 0 1 0
E02 800 2 2 1 2 1
E10 700 4 3 2 3 2
E05 700 4 3 2 3 2
E01 600 5 5 4 4 3
E04 500 8 6 5 5 4
E03 500 8 6 5 5 4
E06 500 8 6 5 5 4
E08 400 9 9 8 6 5
E07 300 10 10 9 7 6
These ranking queries use correlated subqueries and so run slowly. If you’re ranking a large number of items, then you should use a built-in rank function, if available. The SQL:2003 standard introduced the functions RANK() and DENSE_RANK(), which Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL support. Alternatively, you can use your DBMS’s SQL extensions to calculate ranks efficiently. The following MySQL script, for example, is equivalent to Listing 15.30b:
SET @rownum = 0;
SET @rank = 0;
SET @prev_val = NULL;
SELECT
@rownum := @rownum + 1 AS row,
@rank := IF(@prev_val <> sales,
@rownum, @rank) AS rank,
@prev_val := sales AS sales
FROM empsales
ORDER BY sales DESC;
Microsoft Access doesn’t support COUNT(DISTINCT) and won’t run Listing 15.30d and Listing 15.30e. For a workaround, see “Aggregating Distinct Values with DISTINCT” in Chapter 6.
The trimmed mean is a robust order statistic that is the arithmetic mean (average) of the data if the k smallest values and k largest values are discarded. The goal is to avoid the influence of extreme observations.
Listing 15.31 calculates the trimmed mean of book sales in the sample database by omitting the top three and bottom three sales figures. See Figure 15.31 for the result. For reference, the 12 sorted sales values are 566, 4095, 5000, 9566, 10467, 11320, 13001, 25667, 94123, 100001, 201440, and 1500200. This query discards 566, 4095, 5000, 100001, 201440, and 1500200 and then calculates the mean in the usual way by using the remaining six middle values. Nulls are ignored. Duplicate values are either all removed or all retained. (If all sales are the same, for example, then none of them will be trimmed no matter what k is.)
Listing 15.31Calculate the trimmed mean for k = 3. See Figure 15.31 for the result.
SELECT AVG(sales) AS TrimmedMean
FROM titles t1
WHERE
(SELECT COUNT(*)
FROM titles t2
WHERE t2.sales <= t1.sales) > 3
AND
(SELECT COUNT(*)
FROM titles t3
WHERE t3.sales >= t1.sales) > 3;
Figure 15.31Result of Listing 15.31.
TrimmedMean
-----------
27357.3333
Listing 15.32 is similar to Listing 15.31 but trims a fixed percentage of the extreme values rather than a fixed number. Trimming by 0.25 (25%), for example, discards the sales in the top and bottom quartiles and averages what’s left. See Figure 15.32 for the result.
Listing 15.32Calculate the trimmed mean by discarding the lower and upper 25% of values. See Figure 15.32 for the result.
SELECT AVG(sales) AS TrimmedMean
FROM titles t1
WHERE
(SELECT COUNT(*)
FROM titles t2
WHERE t2.sales <= t1.sales) >=
(SELECT 0.25 * COUNT(*)
FROM titles)
AND
(SELECT COUNT(*)
FROM titles t3
WHERE t3.sales >= t1.sales) >=
(SELECT 0.25 * COUNT(*)
FROM titles);
Figure 15.32Result of Listing 15.32.
TrimmedMean
-----------
27357.3333
Microsoft SQL Server and Db2 return an integer for the trimmed mean because the column sales is defined as an INTEGER. To get a floating-point value, change AVG(sales) to AVG(CAST(sales AS FLOAT)). For more information, see “Converting Data Types with CAST()” in Chapter 5.
Some databases are so large, and queries on them so complex, that often it’s impractical (and unnecessary) to retrieve all the data relevant to a query. If you’re interested in finding an overall trend or pattern, for example, then an approximate answer within some margin of error will usually suffice. One way to speed such queries is to select a random sample of rows. An efficient sample can improve performance by orders of magnitude yet still yield accurate results.
Standard SQL’s TABLESAMPLE clause returns a random subset of rows. Microsoft SQL Server, Db2, and PostgreSQL support TABLESAMPLE, and Oracle has a similar feature. For the other DBMSs, use a (nonstandard) function that returns a uniform random number between 0 and 1 (Table 15.1).
DBMS | Clause or Function |
---|---|
Access | RND() function |
SQL Server | TABLESAMPLE clause |
Oracle | SAMPLE clause or DBMS_RANDOM package |
Db2 | TABLESAMPLE clause |
MySQL | RAND() function |
PostgreSQL | TABLESAMPLE clause |
Listing 15.33a randomly picks about 25% (0.25) of the rows from the sample-database table titles. If necessary, change RAND() to the function that appears in Table 15.1 for your DBMS. For Oracle, use Listing 15.33b. For Microsoft SQL Server, Db2, and PostgreSQL, use Listing 15.33c.
Listing 15.33aSelect about 25% percent of the rows in the table titles at random. See Figure 15.33 for a possible result.
SELECT title_id, type, sales
FROM titles
WHERE RAND() < 0.25;
Listing 15.33bSelect about 25% percent of the rows in the table titles at random (Oracle only). See Figure 15.33 for a possible result.
SELECT title_id, type, sales
FROM titles
SAMPLE (25);
Listing 15.33cSelect about 25% percent of the rows in the table titles at random (Microsoft SQL Server, Db2, and PostgreSQL only). See Figure 15.33 for a possible result.
SELECT title_id, type, sales
FROM titles
TABLESAMPLE SYSTEM (25);
Figure 15.33 shows one possible result of a random selection. The rows and the number of rows returned will change every time that you run the query. If you need an exact number of random rows, then increase the sampling percentage and use one of the techniques described in “Limiting the Number of Rows Returned” earlier in this chapter.
Figure 15.33One possible result of Listing 15.33a/b/c.
title_id type sales
-------- ---------- -----
T03 computer 25667
T04 psychology 13001
T11 psychology 94123
Listing 15.33a won’t run correctly in Microsoft Access because the random-number function returns the same “random” number for every selected row. In Access, use Visual Basic or C# to pick random rows.
To use the NEWID() function to pick n random rows in Microsoft SQL Server, type:
SELECT TOP n title_id, type, sales
FROM titles
ORDER BY NEWID();
To use the VALUE() function in the DBMS_RANDOM package to pick n random rows in Oracle, type:
SELECT * FROM
(SELECT title_id, type, sales
FROM titles
ORDER BY DBMS_RANDOM.VALUE())
WHERE ROWNUM <= n;
Selecting Every nth Row
Instead of picking random rows, you can pick every nth row by using a modulo expression:
This expression returns the remainder of m divided by n. For example, MOD(20, 6) is 2 because 20 equals (3 × 6) + 2. MOD(a, 2) is 0 if a is an even number.
The condition MOD(rownumber, n) = 0 picks every nth row, where rownumber is a column of consecutive integers or row identifiers. This Oracle query, for example, picks every third row in a table:
SELECT *
FROM table
WHERE (ROWID,0) IN
(SELECT ROWID, MOD(ROWNUM,3)
FROM table);
Note that rownumber imposes a row order that doesn’t exist implicitly in a relational-database table.
Normally you use SQL’s PRIMARY KEY or UNIQUE constraints to prevent duplicate rows from appearing in production tables. But you need to know how to handle duplicates that appear when you accidentally import the same data twice or import data from a nonrelational source such as a spreadsheet or accounting package, where redundant information is rampant. This section describes how to detect, count, and remove duplicates.
Suppose that you import rows into a staging table to detect and eliminate any duplicates before inserting the data into a production table (Listing 15.34 and Figure 15.34). The column id is a unique row identifier that lets you identify and select rows that otherwise would be duplicates. If your imported rows don’t already have an identity column, then you can add one yourself; see “Unique Identifiers” and “Generating Sequences”. It’s a good practice to add an identity column to even short-lived working tables, but in this case it also makes deleting duplicates easy. The imported data might include other columns too, but you’ve decided that the combination of only book title, book type, and price determines whether a row is a duplicate, regardless of the values in any other columns. Before you identify or delete duplicates, you must define exactly what it means for two rows to be considered “duplicates” of each other.
Listing 15.34List the imported rows. See Figure 15.34 for the result.
SELECT id, title_name, type, price
FROM dups;
Figure 15.34Result of Listing 15.34.
id title_name type price
-- ------------ --------- -----
1 Book Title 5 children 15.00
2 Book Title 3 biography 7.00
3 Book Title 1 history 10.00
4 Book Title 2 children 20.00
5 Book Title 4 history 15.00
6 Book Title 1 history 10.00
7 Book Title 3 biography 7.00
8 Book Title 1 history 10.00
Listing 15.35 lists only the duplicates by counting the number of occurrences of each unique combination of title_name, type, and price. See Figure 15.35 for the result. If this query returns an empty result, then the table contains no duplicates. To list only the nonduplicates, change COUNT(*) > 1 to COUNT(*) = 1.
Listing 15.35List only duplicates. See Figure 15.35 for the result.
SELECT title_name, type, price
FROM dups
GROUP BY title_name, type, price
HAVING COUNT(*) > 1;
Figure 15.35Result of Listing 15.35.
title_name type price
------------ --------- -----
Book Title 1 history 10.00
Book Title 3 biography 7.00
Listing 15.36 uses a similar technique to list each row and its duplicate count. See Figure 15.36 for the result. To list only the duplicates, change COUNT(*) >= 1 to COUNT(*) > 1.
Listing 15.36List each row and its number of repetitions. See Figure 15.36 for the result.
SELECT title_name, type, price,
COUNT(*) AS NumDups
FROM dups
GROUP BY title_name, type, price
HAVING COUNT(*) >= 1
ORDER BY COUNT(*) DESC;
Figure 15.36Result of Listing 15.36.
title_name type price NumDups
------------ --------- ----- -------
Book Title 1 history 10.00 3
Book Title 3 biography 7.00 2
Book Title 4 history 15.00 1
Book Title 2 children 20.00 1
Book Title 5 children 15.00 1
Listing 15.37 deletes duplicate rows from dups in place. This statement uses the column id to leave exactly one occurrence (the one with the highest ID) of each duplicate. Figure 15.37 shows the table dups after running this statement. See also “Deleting Rows with DELETE” in Chapter 10.
Listing 15.37Remove the redundant duplicates in place. See Figure 15.37 for the result.
DELETE FROM dups
WHERE id < (
SELECT MAX(d.id)
FROM dups d
WHERE dups.title_name = d.title_name
AND dups.type = d.type
AND dups.price = d.price);
Figure 15.37Result of Listing 15.37.
id title_name type price
-- ------------ --------- -----
1 Book Title 5 children 15.00
4 Book Title 2 children 20.00
5 Book Title 4 history 15.00
7 Book Title 3 biography 7.00
8 Book Title 1 history 10.00
SELECT DISTINCT * FROM table
to delete duplicates. See “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.If your DBMS offers a built-in unique row identifier, then you can drop the column id and still delete duplicates in place. In Oracle, for example, you can replace id with the ROWID pseudocolumn in Listing 15.37; change the outer WHERE clause to:
WHERE ROWID < (SELECT
MAX(d.ROWID)...
To run Listing 15.36 in MySQL, change ORDER BY COUNT(*) DESC
to ORDER BY NumDups DESC
. You can’t use Listing 15.37 to do an in-place deletion because MySQL won’t let you use same table for both the subquery’s FROM clause and the DELETE target.
Messy Data
Deleting duplicates gets harder as data get messier. It’s not unusual to buy a mailing list containing entries that look like this:
name address1
---------- ------------------
John Smith 123 Main St
John Smith 123 Main St, Apt 1
Jack Smiht 121 Main Rd
John Symthe 123 Main St.
Jon Smith 123 Mian Street
DBMSs offer nonstandard tools such as Soundex (phonetic) functions to suppress spelling variations, but creating an automated deletion program that works over thousands or millions of rows is a major project.
You can use the function COALESCE() with a left outer join to create a convenient telephone listing from a normalized table of telephone numbers. Suppose that the sample database has an extra table named telephones that stores the authors’ work and home telephone numbers:
au_id tel_type tel_no
----- -------- ------------
A01 H 111-111-1111
A01 W 222-222-2222
A02 W 333-333-3333
A04 H 444-444-4444
A04 W 555-555-5555
A05 H 666-666-6666
The table’s composite primary key is (au_id, tel_type), where tel_type indicates whether tel_no is a work (W) or home (H) number. Listing 15.38 lists the authors’ names and numbers. If an author has only one number, then that number is listed. If an author has both home and work numbers, then only the work number is listed. Authors with no numbers aren’t listed. See Figure 15.38 for the result.
The first left join picks out the work numbers, and the second picks out the home numbers. The WHERE clause filters out authors with no numbers. (You can extend this query to add mobile and other numbers.)
Listing 15.38Lists the authors’ names and telephone numbers, favoring work numbers over home numbers. See Figure 15.38 for the result.
SELECT
a.au_id AS "ID",
a.au_fname AS "FirstName",
a.au_lname AS "LastName",
COALESCE(twork.tel_no, thome.tel_no)
AS "TelNo",
COALESCE(twork.tel_type, thome.tel_type)
AS "TelType"
FROM authors a
LEFT OUTER JOIN telephones twork
ON a.au_id = twork.au_id
AND twork.tel_type = 'W'
LEFT OUTER JOIN telephones thome
ON a.au_id = thome.au_id
AND thome.tel_type = 'H'
WHERE COALESCE(twork.tel_no, thome.tel_no)
IS NOT NULL
ORDER BY a.au_fname ASC, a.au_lname ASC;
Figure 15.38Result of Listing 15.38.
ID FirstName LastName TelNo TelType
--- --------- --------- ------------ -------
A05 Christian Kells 666-666-6666 H
A04 Klee Hull 555-555-5555 W
A01 Sarah Buchman 222-222-2222 W
A02 Wendy Heydemark 333-333-3333 W
Microsoft Access won’t run Listing 15.38 because of the restrictions Access puts on join expressions.
Metadata are data about data. In DBMSs, metadata include information about schemas, databases, users, tables, columns, and so on. When meeting a new database, inspect its metadata: What’s in the database? How big is it? How are the tables organized?
Metadata, like other data, are stored in tables and so can be accessed via SELECT queries. Metadata also can be accessed, often more conveniently, by using command-line and graphical tools. The following listings show DBMS-specific examples for viewing metadata. The DBMS itself maintains metadata—look, but don’t touch.
The SQL standard calls a set of metadata a catalog and specifies that it be accessed through the schema INFORMATION_SCHEMA. Not all DBMSs implement this schema or use the same terms. In Microsoft SQL Server, for example, the equivalent term for a catalog is a database and for a schema, an owner. In Oracle, the repository of metadata is the data dictionary.
Microsoft Access metadata are available graphically through the Design View of each database object and programmatically through the Visual Basic for Applications (VBA) or C# language. Access also creates and maintains hidden system tables in each database.
To show system tables in Microsoft Access:
The system tables begin with MSys and are commingled with the database’s other tables. You can open and query them as you would ordinary tables. The most interesting system table is MSysObjects, which catalogs all the objects in the database. Listing 15.39 lists all the tables in the current database. Note that system tables don’t have to be visible to be used in queries.
Listing 15.39List the tables in the current Microsoft Access database. To list queries instead, change Type = 1 to Type = 5.
SELECT Name
FROM MSysObjects
WHERE Type = 1;
Microsoft SQL Server metadata are available through the schema INFORMATION_SCHEMA and via system stored procedures (Listing 15.40).
Listing 15.40Metadata statements and commands for Microsoft SQL Server.
-- List the databases.
exec sp_helpdb;
-- List the schemas.
SELECT schema_name
FROM information_schema.schemata;
-- List the tables (Method 1).
SELECT *
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema = 'schema_name';
-- List the tables (Method 2).
exec sp_tables;
-- Describe a table (Method 1).
SELECT *
FROM information_schema.columns
WHERE table_catalog = 'db_name'
AND table_schema = 'schema_name'
AND table_name = 'table_name';
-- Describe a table (Method 2).
exec sp_help table_name;
Oracle metadata are available through data dictionary views and via the sqlplus command-line tool (Listing 15.41). To list data dictionary views, run this query in sqlplus:
SELECT table_name, comments
FROM dictionary
ORDER BY table_name;
For a list of Oracle databases (instances) in Unix or Linux, look in the file oratab located in the directory /etc or /var/opt/oracle. In Windows, run this command at a command prompt:
net start | find /i "OracleService"
Or choose Start > Run (Windows logo key + R), type services.msc
, press Enter, and then inspect the Services list for entries that begin with OracleService.
Listing 15.41Metadata statements and commands for Oracle Database.
-- List the schemas (users).
SELECT *
FROM all_users;
-- List the tables.
SELECT table_name
FROM all_tables
WHERE owner = 'user_name';
-- Describe a table (Method 1).
SELECT *
FROM all_tab_columns
WHERE owner = 'user_name'
AND table_name = 'table_name';
-- Describe a table (Method 2, in sqlplus).
DESCRIBE table_name;
Db2 metadata are available through the system catalog SYSCAT and via the db2 command-line tool (Listing 15.42).
Listing 15.42Metadata statements and commands for IBM Db2 Database.
-- List the databases (in db2).
LIST DATABASE DIRECTORY;
-- List the schemas.
SELECT schemaname
FROM syscat.schemata;
-- List the tables (Method 1).
SELECT tabname
FROM syscat.tables
WHERE tabschema = 'schema_name';
-- List the tables (Method 2, in db2).
LIST TABLES;
-- List the tables (Method 3, in db2).
LIST TABLES FOR SCHEMA schema_name;
-- Describe a table (Method 1).
SELECT *
FROM syscat.columns
WHERE tabname = 'table_name'
AND tabschema = 'schema_name';
-- Describe a table (Method 2, in db2).
DESCRIBE TABLE table_name SHOW DETAIL;
MySQL metadata are available through the schema INFORMATION_SCHEMA and via the mysql command-line tool (Listing 15.43).
Listing 15.43Metadata statements and commands for MySQL.
-- List the databases (Method 1).
SELECT schema_name
FROM information_schema.schemata;
-- List the databases (Method 2, in mysql).
SHOW DATABASES;
-- List the tables (Method 1).
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'db_name';
-- List the tables (Method 2, in mysql).
SHOW TABLES;
-- Describe a table (Method 1).
SELECT *
FROM information_schema.columns
WHERE table_schema = 'db_name'
AND table_name = 'table_name';
-- Describe a table (Method 2, in mysql).
DESCRIBE table_name;
PostgreSQL metadata are available through the schema INFORMATION_SCHEMA and via the psql command-line tool (Listing 15.44).
Listing 15.44Metadata statements and commands for PostgreSQL.
-- List the databases (Method 1).
psql --list
-- List the databases (Method 2, in psql).
\l
-- List the schemas.
SELECT schema_name
FROM information_schema.schemata;
-- List the tables (Method 1).
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'schema_name';
-- List the tables (Method 2, in psql).
\dt
-- Describe a table (Method 1).
SELECT *
FROM information_schema.columns
WHERE table_schema = 'schema_name'
AND table_name = 'table_name';
-- Describe a table (Method 2, in psql).
\d table_name;
As pointed out in “Performing Datetime and Interval Arithmetic” and “Getting the Current Date and Time” in Chapter 5, DBMSs provide their own extended (nonstandard) functions for manipulating dates and times. This section explains how to use these built-in functions to do simple date arithmetic. The queries in each listing:
The function datepart() extracts the specified part of a datetime. now() returns the current (system) date and time. dateadd() adds a specified time interval to a date. datediff() returns the number of specified time intervals between two dates (Listing 15.45). Alternatives to datepart() are the extraction functions second(), day(), month(), and so on.
Listing 15.45Working with dates in Microsoft Access.
-- Extract parts of the current datetime.
SELECT
datepart("s", now()) AS sec_pt,
datepart("n", now()) AS min_pt,
datepart("h", now()) AS hr_pt,
datepart("d", now()) AS day_pt,
datepart("m", now()) AS mon_pt,
datepart("yyyy",now()) AS yr_pt;
-- Add or subtract days, months, and years.
SELECT
dateadd("d", 2,pubdate) AS p2d,
dateadd("d", -2,pubdate) AS m2d,
dateadd("m", 2,pubdate) AS p2m,
dateadd("m", -2,pubdate) AS m2m,
dateadd("yyyy", 2,pubdate) AS p2y,
dateadd("yyyy",-2,pubdate) AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT datediff("d",date1,date2) AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT datediff("m",date1,date2) AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The function datepart() extracts the specified part of a datetime. getdate() returns the current (system) date and time. dateadd() adds a specified time interval to a date. datediff() returns the number of specified time intervals between two dates (Listing 15.46). Alternatives to datepart() are the extraction functions day(), month(), and year().
Listing 15.46Working with dates in Microsoft SQL Server.
-- Extract parts of the current datetime.
SELECT
datepart("s", getdate()) AS sec_pt,
datepart("n", getdate()) AS min_pt,
datepart("hh", getdate()) AS hr_pt,
datepart("d", getdate()) AS day_pt,
datepart("m", getdate()) AS mon_pt,
datepart("yyyy",getdate()) AS yr_pt;
-- Add or subtract days, months, and years.
SELECT
dateadd("d", 2,pubdate) AS p2d,
dateadd("d", -2,pubdate) AS m2d,
dateadd("m", 2,pubdate) AS p2m,
dateadd("m", -2,pubdate) AS m2m,
dateadd("yyyy", 2,pubdate) AS p2y,
dateadd("yyyy",-2,pubdate) AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT datediff("d",date1,date2) AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT datediff("m",date1,date2) AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The function to_char() converts a datetime to a character value in the given format. to_number() converts its argument to a number. sysdate
returns the current (system) date and time. The standard addition and subtraction operators add and subtract days from a date. add_months() adds a specified number of months to a date. Subtracting one date from another yields the number of days between them. months_between() returns the number of months between two dates (Listing 15.47).
Listing 15.47Working with dates in Oracle Database.
-- Extract parts of the current datetime.
SELECT
to_number(to_char(sysdate,'ss')) AS sec_pt,
to_number(to_char(sysdate,'mi')) AS min_pt,
to_number(to_char(sysdate,'hh24')) AS hr_pt,
to_number(to_char(sysdate,'dd')) AS day_pt,
to_number(to_char(sysdate,'mm')) AS mon_pt,
to_number(to_char(sysdate,'yyyy')) AS yr_pt
FROM dual;
-- Add or subtract days, months, and years.
SELECT
pubdate+2 AS p2d,
pubdate-2 AS m2d,
add_months(pubdate, +2) AS p2m,
add_months(pubdate, -2) AS m2m,
add_months(pubdate,+24) AS p2y,
add_months(pubdate,-24) AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT date2 - date1 AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT months_between(date2,date1) AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The functions second(), day(), month(), and so on, extract part of a datetime. current_timestamp
returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. days() converts a date to an integer serial number (Listing 15.48).
Listing 15.48Working with dates in IBM Db2 Database.
-- Extract parts of the current datetime.
SELECT
second(current_timestamp) AS sec_pt,
minute(current_timestamp) AS min_pt,
hour(current_timestamp) AS hr_pt,
day(current_timestamp) AS day_pt,
month(current_timestamp) AS mon_pt,
year(current_timestamp) AS yr_pt
FROM SYSIBM.SYSDUMMY1;
-- Add or subtract days, months, and years.
SELECT
pubdate + 2 DAY AS p2d,
pubdate - 2 DAY AS m2d,
pubdate + 2 MONTH AS p2m,
pubdate - 2 MONTH AS m2m,
pubdate + 2 YEAR AS p2y,
pubdate - 2 YEAR AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT days(date2) - days(date1) AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT
(year(date2) * 12 + month(date2)) -
(year(date1) * 12 + month(date1))
AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The function date_format() formats a datetime according to the specified format. current_timestamp
returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. datediff() returns the number of days between two dates (Listing 15.49). Alternatives to date_format() are the extraction functions extract(), second(), day(), month(), and so on.
Listing 15.49Working with dates in MySQL.
-- Extract parts of the current datetime.
SELECT
date_format(current_timestamp,'%s') AS sec_pt,
date_format(current_timestamp,'%i') AS min_pt,
date_format(current_timestamp,'%k') AS hr_pt,
date_format(current_timestamp,'%d') AS day_pt,
date_format(current_timestamp,'%m') AS mon_pt,
date_format(current_timestamp,'%Y') AS yr_pt;
-- Add or subtract days, months, and years.
SELECT
pubdate + INTERVAL 2 DAY AS p2d,
pubdate - INTERVAL 2 DAY AS m2d,
pubdate + INTERVAL 2 MONTH AS p2m,
pubdate - INTERVAL 2 MONTH AS m2m,
pubdate + INTERVAL 2 YEAR AS p2y,
pubdate - INTERVAL 2 YEAR AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT datediff(date2,date1) AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT
(year(date2) * 12 + month(date2)) -
(year(date1) * 12 + month(date1))
AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The function date_part() extracts the specified part of a datetime. current_timestamp
returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. Subtracting one date from another yields the number of days between them (Listing 15.50). An alternative to date_part() is extract().
Listing 15.50Working with dates in PostgreSQL.
-- Extract parts of the current datetime.
SELECT
date_part('second',current_timestamp) AS sec_pt,
date_part('minute',current_timestamp) AS min_pt,
date_part('hour',current_timestamp) AS hr_pt,
date_part('day',current_timestamp) AS day_pt,
date_part('month',current_timestamp) AS mon_pt,
date_part('year',current_timestamp) AS yr_pt;
-- Add or subtract days, months, and years.
SELECT
pubdate + INTERVAL '2 DAY' AS p2d,
pubdate - INTERVAL '2 DAY' AS m2d,
pubdate + INTERVAL '2 MONTH' AS p2m,
pubdate - INTERVAL '2 MONTH' AS m2m,
pubdate + INTERVAL '2 YEAR' AS p2y,
pubdate - INTERVAL '2 YEAR' AS m2y
FROM titles
WHERE title_id = 'T05';
-- Count the days between two dates.
SELECT date2 - date1 AS days
FROM
(SELECT pubdate as date1
FROM titles
WHERE title_id = 'T05') t1,
(SELECT pubdate as date2
FROM titles
WHERE title_id = 'T06') t2;
-- Count the months between two dates.
SELECT
(date_part('year', date2) * 12 +
date_part('month',date2)) -
(date_part('year', date1) * 12 +
date_part('month',date1))
AS months
FROM
(SELECT
MIN(pubdate) AS date1,
MAX(pubdate) AS date2
FROM titles) t1;
The median describes the center of the data as the middle point of n (sorted) values. If n is odd, then the median is the observation number (n + 1)/2. If n is even, then the median is the midpoint (average) of observations n/2 and (n/2) + 1.
The examples in this section calculate the median of the column sales in the table empsales (Figure 15.39). The median is 550—the average of the middle two numbers, 500 and 600, in the sorted list.
Figure 15.39The table empsales, sorted by ascending sales.
emp_id sales
------ -----
E07 300
E08 400
E03 500
E04 500
E06 500
E01 600
E05 700
E10 700
E02 800
E09 900
Search the web and you’ll find many standard and DBMS-specific ways to calculate the median. Listing 15.51 shows one way—it uses a self-join and GROUP BY to create a Cartesian product (e1 and e2) without duplicates and then uses HAVING and SUM to find the row (containing the median) where the number of times e1.sales = e2.sales equals (or exceeds) the number of times e1.sales > e2.sales. Like all methods that use standard (or near-standard) SQL, it’s cumbersome, it’s hard to understand, and it runs slowly because it’s difficult to pick the middle value of an ordered set when SQL is about unordered sets.
To run Listing 15.51 in Microsoft Access, change the CASE expression to iif(e1.sales = e2.sales, 1, 0)
and change SIGN to SGN.
Listing 15.51Calculate the median of sales in standard SQL.
SELECT AVG(sales) AS median
FROM
(SELECT e1.sales
FROM empsales e1, empsales e2
GROUP BY e1.sales
HAVING
SUM(CASE WHEN e1.sales = e2.sales
THEN 1 ELSE 0 END) >=
ABS(SUM(SIGN(e1.sales -
e2.sales)))) t1;
It’s faster and more efficient to calculate the median by using DBMS-specific functions. In Microsoft Access, use Visual Basic or C# to call Excel’s MEDIAN() function from within Access. Listing 15.52 calculates the median in Microsoft SQL Server in two ways. Listing 15.53 calculates it in Oracle in two ways. The second query in Listing 15.52 and Listing 15.53 also works in Db2.
Listing 15.52Here are two ways to calculate the median in Microsoft SQL Server. The second way, which also works in Db2, is faster than the first.
-- Method 1.
-- Works in Microsoft SQL Server.
SELECT
(
(SELECT MAX(sales) FROM
(SELECT TOP 50 PERCENT sales
FROM empsales
ORDER BY sales ASC) AS t1)
+
(SELECT MIN(sales) FROM
(SELECT TOP 50 PERCENT sales
FROM empsales
ORDER BY sales DESC) AS t2)
)/2 AS median;
-- Method 2.
-- Works in Microsoft SQL Server and Db2.
SELECT AVG(sales) AS median
FROM
(SELECT
sales,
ROW_NUMBER() OVER (ORDER BY sales)
AS rownum,
COUNT(*) OVER () AS cnt
FROM empsales) t1
WHERE rownum IN ((cnt+1)/2, (cnt+2)/2);
Listing 15.53Here are two ways to calculate the median in Oracle. The second way, which also works in Db2, is faster than the first.
-- Method 1.
-- Works in Oracle.
SELECT
percentile_cont(0.5)
WITHIN GROUP (ORDER BY sales)
AS median
FROM empsales;
-- Method 2.
-- Works in Oracle and Db2.
SELECT median(sales) AS median
FROM empsales;
If you use an alternate method to compute the median, then make sure that it doesn’t eliminate duplicate values during calculations and averages the two middle observations for an even n (rather than just lazily choosing one of them as the median).
See also “Statistics in SQL” in Chapter 6.
Median vs. Mean
The median is a popular statistic because it’s robust, meaning it’s not affected seriously by extreme high or low values, either legitimate or caused by errors. The arithmetic mean (average), on the other hand, is so sensitive that it can swing wildly with the addition or removal of even a single extreme value. That’s why you see the median applied to skewed (lopsided) distributions such as wealth, house prices, military budgets, and gene expression. The median is also known as the 50th percentile or the second quartile. See also “Finding Extreme Values” later in this chapter.
You can use aggregate functions in a subquery to find the highest and lowest values in a column.
Listing 15.54 finds the rows with the highest and lowest values (ties included) of the column advance in the table royalties. Figure 15.40 shows the result.
You also can use the queries in “Limiting the Number of Rows Returned” to find extremes, although not both highs and lows in the same query.
In Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL you can replicate Listing 15.54 by using the window functions MIN OVER and MAX OVER (Listing 15.55).
Listing 15.54List the books with the highest and lowest advances. See Figure 15.40 for the result.
SELECT title_id, advance
FROM royalties
WHERE advance IN (
(SELECT MIN(advance) FROM royalties),
(SELECT MAX(advance) FROM royalties));
Figure 15.40Result of Listing 15.54.
title_id advance
-------- ----------
T07 1000000.00
T08 0.00
T09 0.00
Listing 15.55List the books with the highest and lowest advances, using window functions.
SELECT title_id, advance
FROM
(SELECT title_id, advance,
MIN(advance) OVER () min_adv,
MAX(advance) OVER () max_adv
FROM royalties) t1
WHERE advance IN (min_adv, max_adv);
You can modify values of an in-progress running statistic depending on values in another column. First, review Listing 15.1 in “Calculating Running Statistics” earlier in this chapter.
Listing 15.56 calculates the running sum of book sales, ignoring biographies. The scalar subquery computes the running sum, and the inner CASE expression identifies biographies and changes their sales value to NULL, which is ignored by the aggregate function SUM(). The outer CASE expression merely creates a label column in the result; it’s not part of the running-sum logic. Figure 15.41 shows the result.
In the inner CASE expression, you can set the value being summed to any number, not only NULL. If you were summing bank transactions, for example, then you could make the deposits positive and withdrawals negative.
Listing 15.56Calculate the running sum of book sales, ignoring biographies. See Figure 15.41 for the result.
SELECT
t1.title_id,
CASE WHEN t1.type = 'biography'
THEN '*IGNORED*'
ELSE t1.type END
AS title_type,
t1.sales,
(SELECT
SUM(CASE WHEN t2.type = 'biography'
THEN NULL
ELSE t2.sales END)
FROM titles t2
WHERE t1.title_id >= t2.title_id)
AS RunSum
FROM titles t1;
Figure 15.41Result of Listing 15.56.
title_id title_type sales RunSum
-------- ---------- ------- ------
T01 history 566 566
T02 history 9566 10132
T03 computer 25667 35799
T04 psychology 13001 48800
T05 psychology 201440 250240
T06 *IGNORED* 11320 250240
T07 *IGNORED* 1500200 250240
T08 children 4095 254335
T09 children 5000 259335
T10 *IGNORED* NULL 259335
T11 psychology 94123 353458
T12 *IGNORED* 100001 353458
T13 history 10467 363925
To run Listing 15.56 in Microsoft Access, change the two CASE expressions to
iif(t1.type = 'biography', '*IGNORED*', t1.type)
and
iif(t2.type = 'biography', NULL, t2.sales)
.
In Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL, you can replicate Listing 15.56 by using the window function SUM OVER (Listing 15.57).
Listing 15.57Calculate the running sum of book sales, ignoring biographies and using window functions.
SELECT
title_id,
CASE WHEN type = 'biography'
THEN '*IGNORED*'
ELSE type END
AS title_type,
sales,
SUM(CASE WHEN type = 'biography'
THEN NULL
ELSE sales END)
OVER (ORDER BY title_id, sales)
AS RunSum
FROM titles;
Pivoting a table swaps its columns and rows, typically to display data in a compact format on a report.
Listing 15.58 uses SUM functions and CASE expressions to list the number of books each author wrote (or cowrote). But instead of displaying the result in the usual way (see Listing 6.9, for example), like this:
au_id num_books
----- ---------
A01 3
A02 4
A03 2
A04 4
A05 1
A06 3
A07 0
Listing 15.58 produces a pivoted result:
A01 A02 A03 A04 A05 A06 A07
--- --- --- --- --- --- ---
3 4 2 4 1 3 0
Listing 15.58List the number of books each author wrote (or cowrote), pivoting the result.
SELECT
SUM(CASE WHEN au_id='A01'
THEN 1 ELSE 0 END) AS A01,
SUM(CASE WHEN au_id='A02'
THEN 1 ELSE 0 END) AS A02,
SUM(CASE WHEN au_id='A03'
THEN 1 ELSE 0 END) AS A03,
SUM(CASE WHEN au_id='A04'
THEN 1 ELSE 0 END) AS A04,
SUM(CASE WHEN au_id='A05'
THEN 1 ELSE 0 END) AS A05,
SUM(CASE WHEN au_id='A06'
THEN 1 ELSE 0 END) AS A06,
SUM(CASE WHEN au_id='A07'
THEN 1 ELSE 0 END) AS A07
FROM title_authors;
Listing 15.59 reverses the pivot. The first subquery in the FROM clause returns the unique authors’ IDs. The second subquery reproduces the result of Listing 15.58.
Listing 15.59List the number of books each author wrote (or cowrote), reverse-pivoting the result.
SELECT
au_ids.au_id,
CASE au_ids.au_id
WHEN 'A01' THEN num_books.A01
WHEN 'A02' THEN num_books.A02
WHEN 'A03' THEN num_books.A03
WHEN 'A04' THEN num_books.A04
WHEN 'A05' THEN num_books.A05
WHEN 'A06' THEN num_books.A06
WHEN 'A07' THEN num_books.A07
END
AS num_books
FROM
(SELECT au_id FROM authors) au_ids,
(SELECT
SUM(CASE WHEN au_id='A01'
THEN 1 ELSE 0 END) AS A01,
SUM(CASE WHEN au_id='A02'
THEN 1 ELSE 0 END) AS A02,
SUM(CASE WHEN au_id='A03'
THEN 1 ELSE 0 END) AS A03,
SUM(CASE WHEN au_id='A04'
THEN 1 ELSE 0 END) AS A04,
SUM(CASE WHEN au_id='A05'
THEN 1 ELSE 0 END) AS A05,
SUM(CASE WHEN au_id='A06'
THEN 1 ELSE 0 END) AS A06,
SUM(CASE WHEN au_id='A07'
THEN 1 ELSE 0 END) AS A07
FROM title_authors) num_books;
To run Listing 15.58 and Listing 15.59 in Microsoft Access, change the simple CASE expressions to iif functions. For example, change the first CASE expression in Listing 15.58 to
iif(au_id = 'A01', 1, 0)
Also, change the searched CASE expression to a switch() function (see the DBMS tip in “Evaluating Conditional Values with CASE” in Chapter 5).
A hierarchy ranks and organizes people or things within a system. Each element (except the top one) is a subordinate to a single other element. Figure 15.42 is a tree diagram of a corporate pecking order, with the chief executive officer (CEO) at top, above vice presidents (VP), directors (DIR), and wage slaves (WS).
Hierarchical trees come with their own vocabulary. Each element in the tree is a node. Nodes are connected by branches. Two connected nodes form a parent–child relationship (three connected nodes form a grandparent–parent–child relationship, and so on). At the top of the pyramid is the root node (CEO, in this example). Nodes without children are end nodes or leaf nodes (DIR2 and all the WSs). Branch nodes connect to leaf nodes or other branch nodes (VP1, VP2, DIR1, and DIR3—middle management).
The table hier (Figure 15.43) represents the tree in Figure 15.42. The table hier has the same structure as the table employees in “Creating a Self-Join” in Chapter 7. Review that section for the basics of using self-joins with hierarchies.
Figure 15.43The result of the query SELECT * FROM hier;
. The table hier represents the organization chart in Figure 15.42.
emp_id emp_title boss_id
------ --------- -------
E01 CEO NULL
E02 VP1 E01
E03 VP2 E01
E04 DIR1 E02
E05 DIR2 E02
E06 DIR3 E03
E07 WS1 E04
E08 WS2 E04
E09 WS3 E04
E10 WS4 E06
E11 WS5 E06
Listing 15.60 uses a self-join to list who works for whom. See Figure 15.44 for the result.
Listing 15.60List the parent–child relationships. See Figure 15.44 for the result.
SELECT
h1.emp_title ||
' obeys ' ||
h2.emp_title
AS power_structure
FROM hier h1, hier h2
WHERE h1.boss_id = h2.emp_id;
Figure 15.44Result of Listing 15.60.
power_structure
---------------
VP1 obeys CEO
VP2 obeys CEO
DIR1 obeys VP1
DIR2 obeys VP1
DIR3 obeys VP2
WS1 obeys DIR1
WS2 obeys DIR1
WS3 obeys DIR1
WS4 obeys DIR3
WS5 obeys DIR3
To run Listing 15.60 in Microsoft Access and Microsoft SQL Server, change each || to +. In MySQL, use CONCAT() to concatenate strings. See “Concatenating Strings with ||” in Chapter 5.
Listing 15.61 traverses the hierarchy by using multiple self-joins to trace the chain of command from employee WS3 to the top of the tree. See Figure 15.45 for the result. Unfortunately, you must know the depth of the hierarchy before you write this query; use one of the alternatives given next, if possible.
Listing 15.61Show the full hierarchical relationship of employee WS3. See Figure 15.45 for the result.
SELECT
h1.emp_title || ' < ' ||
h2.emp_title || ' < ' ||
h3.emp_title || ' < ' ||
h4.emp_title
AS chain_of_command
FROM hier h1, hier h2, hier h3, hier h4
WHERE h1.emp_title = 'WS3'
AND h1.boss_id = h2.emp_id
AND h2.boss_id = h3.emp_id
AND h3.boss_id = h4.emp_id;
Figure 15.45Result of Listing 15.61.
chain_of_command
----------------------
WS3 < DIR1 < VP1 < CEO
To run Listing 15.61 in Microsoft Access and Microsoft SQL Server, change each || to +. In MySQL, use CONCAT() to concatenate strings. See “Concatenating Strings with ||” in Chapter 5.
In Microsoft SQL Server and Db2, use the (standard) recursive WITH clause to traverse a hierarchy. The following query is equivalent to Listing 15.61 (in Microsoft SQL Server, change each || to +):
WITH recurse (chain, emp_level, boss_id) AS
(SELECT
CAST(emp_title AS VARCHAR(50)),
0,
boss_id
FROM hier
WHERE emp_title = 'WS3'
UNION ALL
SELECT
CAST(recurse.chain || ' < ' ||
hier.emp_title AS VARCHAR(50)),
recurse.emp_level + 1,
hier.boss_id
FROM hier, recurse
WHERE recurse.boss_id = hier.emp_id
)
SELECT chain AS chain_of_command
FROM recurse
WHERE emp_level = 3;
In Microsoft SQL Server and Db2, to list everyone who reports to a particular employee (VP1, in this example), either directly or indirectly (through a boss’s boss), use this query:
WITH recurse (emp_title, emp_id) AS
(SELECT emp_title,emp_id
FROM hier
WHERE emp_title = 'VP1'
UNION ALL
SELECT hier.emp_title, hier.emp_id
FROM hier, recurse
WHERE recurse.emp_id = hier.boss_id
)
SELECT emp_title AS "Works for VP1"
FROM recurse
WHERE emp_title <> 'VP1';
In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.61:
SELECT LTRIM(SYS_CONNECT_BY_PATH(
emp_title, ' < '), ' < ')
AS chain_of_command
FROM hier
WHERE LEVEL = 4
START WITH emp_title = 'WS3'
CONNECT BY PRIOR boss_id = emp_id;
In Oracle, to list everyone who reports to a particular employee (VP1, in this example), either directly or indirectly (through a boss’s boss), use this query:
SELECT emp_title AS "Works for VP1"
FROM hier
WHERE emp_title <> 'VP1'
START WITH emp_title = 'VP1'
CONNECT BY PRIOR emp_id = boss_id;
Listing 15.62 traverses the hierarchy by using multiple UNIONs and self-joins to trace the chain of command for every employee. See Figure 15.46 for the result. Unfortunately, you must know the maximum depth of the hierarchy before you write this query; use one of the alternatives given next, if possible.
Listing 15.62Show the full hierarchal relationship of every employee. See Figure 15.46 for the result.
SELECT chain AS chains_of_command
FROM
(SELECT emp_title as chain
FROM hier
WHERE boss_id IS NULL
UNION
SELECT
h1.emp_title || ' > ' ||
h2.emp_title
FROM hier h1
INNER JOIN hier h2
ON (h1.emp_id = h2.boss_id)
WHERE h1.boss_id IS NULL
UNION
SELECT
h1.emp_title || ' > ' ||
h2.emp_title || ' > ' ||
h3.emp_title
FROM hier h1
INNER JOIN hier h2
ON (h1.emp_id = h2.boss_id)
LEFT OUTER JOIN hier h3
ON (h2.emp_id = h3.boss_id)
WHERE h1.emp_title = 'CEO'
UNION
SELECT
h1.emp_title || ' > ' ||
h2.emp_title || ' > ' ||
h3.emp_title || ' > ' ||
h4.emp_title
FROM hier h1
INNER JOIN hier h2
ON (h1.emp_id = h2.boss_id)
INNER JOIN hier h3
ON (h2.emp_id = h3.boss_id)
LEFT OUTER JOIN hier h4
ON (h3.emp_id = h4.boss_id)
WHERE h1.emp_title = 'CEO'
) chains
WHERE chain IS NOT NULL
ORDER BY chain;
Figure 15.46Result of Listing 15.62.
chains_of_command
----------------------
CEO
CEO > VP1
CEO > VP1 > DIR1
CEO > VP1 > DIR1 > WS1
CEO > VP1 > DIR1 > WS2
CEO > VP1 > DIR1 > WS3
CEO > VP1 > DIR2
CEO > VP2
CEO > VP2 > DIR3
CEO > VP2 > DIR3 > WS4
CEO > VP2 > DIR3 > WS5
Microsoft Access won’t run Listing 15.62 because of the restrictions Access puts on join expressions.
To run Listing 15.62 in Microsoft SQL Server, change each || to +.
To run Listing 15.62 in MySQL, use CONCAT() instead of || to concatenate strings.
In Microsoft SQL Server and Db2, use the (standard) recursive WITH clause to traverse a hierarchy. The following query is equivalent to Listing 15.62 (in Microsoft SQL Server, change each || to +):
WITH recurse (emp_title, emp_id) AS
(SELECT
CAST(emp_title AS VARCHAR(50)),
emp_id
FROM hier
WHERE boss_id IS NULL
UNION ALL
SELECT
CAST(recurse.emp_title || ' > ' ||
h1.emp_title AS VARCHAR(50)),
h1.emp_id
FROM hier h1, recurse
WHERE h1.boss_id = recurse.emp_id
)
SELECT emp_title emp_tree
FROM recurse;
In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.62:
SELECT ltrim(SYS_CONNECT_BY_PATH(
emp_title, ' > '),' > ')
AS chains_of_command
FROM hier
START WITH boss_id IS NULL
CONNECT BY PRIOR emp_id = boss_id;
Listing 15.63 uses scalar subqueries to determine whether each node in the hierarchy is a root, branch, or leaf node. See Figure 15.47 for the result. A zero in the result denotes True; nonzero, False.
Listing 15.63Determine whether each node is a root, branch, or leaf node. See Figure 15.47 for the result.
SELECT h1.emp_title,
(SELECT SIGN(COUNT(*))
FROM hier h2
WHERE h1.emp_id = h2.emp_id
AND h2.boss_id IS NULL)
AS root_node,
(SELECT SIGN(COUNT(*))
FROM hier h2
WHERE h1.emp_id = h2.boss_id
AND h1.boss_id IS NOT NULL)
AS branch_node,
(SELECT SIGN(COUNT(*))
FROM hier h2
WHERE 0 =
(SELECT COUNT(*)
FROM hier h3
WHERE h1.emp_id = h3.boss_id))
AS leaf_node
FROM hier h1;
Figure 15.47Result of Listing 15.63.
emp_title root_node branch_node leaf_node
--------- --------- ----------- ---------
CEO 1 0 0
VP1 0 1 0
VP2 0 1 0
DIR1 0 1 0
DIR2 0 0 1
DIR3 0 1 0
WS1 0 0 1
WS2 0 0 1
WS3 0 0 1
WS4 0 0 1
WS5 0 0 1
To run Listing 15.63 in Microsoft Access, change each SIGN to SGN.
In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.63:
SELECT
emp_title,
(CASE CONNECT_BY_ROOT(emp_title)
WHEN emp_title THEN 1
ELSE 0 END)
AS root_node,
(SELECT COUNT(*)
FROM hier h1
WHERE h1.boss_id = hier.emp_id
AND hier.boss_id IS NOT NULL
AND rownum = 1)
AS branch_node,
CONNECT_BY_ISLEAF AS leaf_node
FROM hier
START WITH boss_id IS NULL
CONNECT BY PRIOR emp_id = boss_id
ORDER BY root_node DESC, branch_node DESC;