SQL Run

Database Programming

15. Advanced SQL

In this chapter

Calculating Running Statistics

Generating Sequences

Finding Sequences, Runs, and Regions

Limiting the Number of Rows Returned

Assigning Ranks

Calculating a Trimmed Mean

Picking Random Rows

Handling Duplicates

Creating a Telephone List

Retrieving Metadata

Working with Dates

Calculating a Median

Finding Extreme Values

Changing Running Statistics Midstream

Pivoting Results

Working with Hierarchies

This chapter describes how to solve common problems with SQL programs that

Calculating Running Statistics

A running statistic, or cumulative statistic, is a row-by-row calculation that uses progressively more data values, starting with a single value (the first value), continuing with more values in the order in which they’re supplied, and ending with all the values. The running sum (total) and running average (arithmetic mean) are the most common running statistics.

Listing 15.1 calculates the running sum and running average of book sales, along with a cumulative count of data items. The query cross-joins two instances of the table titles, grouping the result by the first-table (t1) title IDs and limiting the second-table (t2) rows to ID values smaller than or equal to the t1 row to which they’re joined. The intermediate cross-joined table, to which SUM(), AVG(), and COUNT() are applied, looks like this:

t1.id t1.sales t2.id t2.sales
----- -------- ----- --------
T01        566 T01        566
T02       9566 T01        566
T02       9566 T02       9566
T03      25667 T01        566
T03      25667 T02       9566
T03      25667 T03      25667
T04      13001 T01        566
T04      13001 T02       9566
T04      13001 T03      25667
T04      13001 T04      13001
T05     201440 T01        566
...

Note that the running statistics don’t change for title T10 because its sales value is null. The ORDER BY clause is necessary because GROUP BY doesn’t sort the result implicitly. See Figure 15.1 for the result.

Listing 15.1Calculate the running sum, average, and count of book sales. See Figure 15.1 for the result.

SELECT
    t1.title_id,
    SUM(t2.sales) AS RunSum,
    AVG(t2.sales) AS RunAvg,
    COUNT(t2.sales) AS RunCount
  FROM titles t1, titles t2
  WHERE t1.title_id >= t2.title_id
  GROUP BY t1.title_id
  ORDER BY t1.title_id;

Figure 15.1Result of Listing 15.1.

title_id RunSum  RunAvg RunCount
-------- ------- ------ --------
T01          566    566        1
T02        10132   5066        2
T03        35799  11933        3
T04        48800  12200        4
T05       250240  50048        5
T06       261560  43593        6
T07      1761760 251680        7
T08      1765855 220731        8
T09      1770855 196761        9
T10      1770855 196761        9
T11      1864978 186497       10
T12      1964979 178634       11
T13      1975446 164620       12

A moving average is a way of smoothing a time series (such as a list of prices changing over time) by replacing each value by an average of that value and its nearest neighbors. Calculating a moving average is easy if you have a column that contains a sequence of integers or dates, such as in this table, named time_series:

seq price
--- -----
  1  10.0
  2  10.5
  3  11.0
  4  11.0
  5  10.5
  6  11.5
  7  12.0
  8  13.0
  9  15.0
 10  13.5
 11  13.0
 12  12.5
 13  12.0
 14  12.5
 15  11.0

Listing 15.2 calculates the moving average of price. See Figure 15.2 for the result. Each value in the result’s moving-average column is the average of five values: the price in the current row and the prices in the four preceding rows (as ordered by seq). The first four rows are omitted because they don’t have the required number of preceding values. You can adjust the values in the WHERE clause to cover any size averaging window. To make Listing 15.2 calculate a five-point moving average that averages each price with the two prices before it and the two prices after it, for example, change the WHERE clause to:

WHERE t1.seq >= 3
  AND t1.seq <= 13
  AND t1.seq BETWEEN t2.seq - 2 AND t2.seq + 2

Listing 15.2Calculate a moving average with a five-point window. See Figure 15.2 for the result.

SELECT t1.seq, AVG(t2.price) AS MovingAvg
  FROM time_series t1, time_series t2
  WHERE t1.seq >= 5
    AND t1.seq BETWEEN t2.seq AND t2.seq + 4
  GROUP BY t1.seq
  ORDER BY t1.seq;

Figure 15.2Result of Listing 15.2.

seq MovingAvg
--- ---------
  5      10.6
  6      10.9
  7      11.2
  8      11.6
  9      12.4
 10      13.0
 11      13.3
 12      13.4
 13      13.2
 14      12.7
 15      12.2

If you have a table that already has running totals, then you can calculate the differences between pairs of successive rows. Listing 15.3 backs out the intercity distances from the following table, named roadtrip, which contains the cumulative distances for each leg of a trip from Seattle, Washington, to San Diego, California. See Figure 15.3 for the result.

seq city              miles
--- ----------------- -----
  1 Seattle, WA           0
  2 Portland, OR        174
  3 San Francisco, CA   808
  4 Monterey, CA        926
  5 Los Angeles, CA    1251
  6 San Diego, CA      1372

Listing 15.3Calculate intercity distances from cumulative distances. See Figure 15.3 for the result.

SELECT
    t1.seq AS seq1,
    t2.seq AS seq2,
    t1.city AS city1,
    t2.city AS city2,
    t1.miles AS miles1,
    t2.miles AS miles2,
    t2.miles - t1.miles AS dist
  FROM roadtrip t1, roadtrip t2
  WHERE t1.seq + 1 = t2.seq
  ORDER BY t1.seq;

Figure 15.3Result of Listing 15.3.

seq1 seq2 city1             city2             miles1 miles2 dist
---- ---- ----------------- ----------------- ------ ------ ----
   1    2 Seattle, WA       Portland, OR           0    174  174
   2    3 Portland, OR      San Francisco, CA    174    808  634
   3    4 San Francisco, CA Monterey, CA         808    926  118
   4    5 Monterey, CA      Los Angeles, CA      926   1251  325
   5    6 Los Angeles, CA   San Diego, CA       1251   1372  121

Tips for Running Statistics

Generating Sequences

Recall from “Unique Identifiers” in Chapter 3 that you can use sequences of autogenerated integers to create identity columns (typically for primary keys). The SQL standard provides sequence generators to create them.

To define a sequence generator:

Listing 15.4 defines the sequence shown in Figure 15.4.

Listing 15.4Create a sequence generator for the consecutive integers 1 to 10000. See Figure 15.4 for the result.

CREATE SEQUENCE part_seq
  INCREMENT BY 1
  MINVALUE 1
  MAXVALUE 10000
  START WITH 1
  NO CYCLE;

Figure 15.4The sequence that Listing 15.4 generates.

1
2
3
...
9998
9999
10000

You can use a sequence generator in a few ways. Standard SQL provides the built-in function NEXT VALUE FOR to increment a sequence value, as in:

INSERT INTO shipment(part_num, desc, quantity)
  VALUES(NEXT VALUE FOR part_seq, 'motherboard', 5);

If you’re creating a column of unique values, then you can use the keyword IDENTITY to define a sequence right in the CREATE TABLE statement:

CREATE TABLE parts (
  part_num INTEGER AS
    IDENTITY(
      INCREMENT BY 1
      MINVALUE 1
      MAXVALUE 10000
      START WITH 1
      NO CYCLE),
  desc AS VARCHAR(100),
  quantity INTEGER;

This table definition lets you omit NEXT VALUE FOR when you insert a row:

INSERT INTO shipment(
    desc,
    quantity)
  VALUES(
    'motherboard',
    5);

SQL also provides ALTER SEQUENCE and DROP SEQUENCE to change and remove sequence generators.

Microsoft SQL Server, Oracle, Db2, and PostgreSQL support CREATE SEQUENCE, ALTER SEQUENCE, and DROP SEQUENCE. In Oracle, use NOCYCLE instead of NO CYCLE. See your DBMS documentation to see how sequences are used in your system.

Most DBMSs don’t support IDENTITY columns because they have other (pre-standard) ways that define columns with unique values; see “Unique Identifiers” in Chapter 3. PostgreSQL’s generate_series() function offers a quick way to generate numbered rows.

A one-column table containing a sequence of consecutive integers makes it easy to solve problems that would otherwise be difficult with SQL’s limited computational power. Sequence tables aren’t actually part of the data model—they’re auxiliary tables that are adjuncts to queries and other “real” tables.

You can create a sequence table by using one of the methods just described. Alternatively, you can create one by using Listing 15.5, which creates the sequence table seq by cross-joining the intermediate table temp09 with itself. The CAST expression concatenates digit characters into sequential numbers and then casts them as integers. You can drop the table temp09 after seq is created. Figure 15.5 shows the result. The table seq contains the integer sequence 0, 1, 2,..., 9999. You can shrink or grow this sequence by changing the SELECT and FROM expressions in the INSERT INTO seq statement.

Listing 15.5Create a one-column table that contains consecutive integers. See Figure 15.5 for the result.

CREATE TABLE temp09 (
  i CHAR(1) NOT NULL PRIMARY KEY
  );

INSERT INTO temp09 VALUES('0');
INSERT INTO temp09 VALUES('1');
INSERT INTO temp09 VALUES('2');
INSERT INTO temp09 VALUES('3');
INSERT INTO temp09 VALUES('4');
INSERT INTO temp09 VALUES('5');
INSERT INTO temp09 VALUES('6');
INSERT INTO temp09 VALUES('7');
INSERT INTO temp09 VALUES('8');
INSERT INTO temp09 VALUES('9');

CREATE TABLE seq (
  i INTEGER NOT NULL PRIMARY KEY
  );

INSERT INTO seq
  SELECT CAST(t1.i || t2.i ||
      t3.i || t4.i AS INTEGER)
    FROM temp09 t1, temp09 t2,
      temp09 t3, temp09 t4;

DROP TABLE temp09;

Figure 15.5Result of Listing 15.5.

i
-----
0
1
2
3
4
...
9996
9997
9998
9999

A sequence table is especially useful for enumerative and datetime functions. Listing 15.6 lists the 95 printable characters in the ASCII character set (if that’s the character set in use). See Figure 15.6 for the result.

Listing 15.6List the characters associated with a set of character codes. See Figure 15.6 for the result.

SELECT
    i AS CharCode,
    CHR(i) AS Ch
  FROM seq
  WHERE i BETWEEN 32 AND 126;

Figure 15.6Result of Listing 15.6.

CharCode Ch
-------- --
      32
      33  !
      34  "
      35  #
      36  $
      37  %
      38  &
      39  '
      40  (
      41  )
      42  *
      43  +
      44  ,
      45  -
      46  .
      47  /
      48  0
      49  1
      50  2
      51  3
      52  4
...

Listing 15.7 adds monthly intervals to today’s date (7-March-2005) for the next six months. See Figure 15.7 for the result. This example works on Microsoft SQL Server; other DBMSs have similar functions that increment dates.

Listing 15.7Increment today’s date to six months hence, in one-month intervals. See Figure 15.7 for the result.

SELECT
    i AS MonthsAhead,
    DATEADD("m", i, CURRENT_TIMESTAMP)
      AS FutureDate
  FROM seq
  WHERE i BETWEEN 1 AND 6;

Figure 15.7Result of Listing 15.7.

MonthsAhead FutureDate
----------- ----------
          1 2005-04-07
          2 2005-05-07
          3 2005-06-07
          4 2005-07-07
          5 2005-08-07
          6 2005-09-07

Sequence tables are handy for normalizing data that you’ve imported from a nonrelational source such as a spreadsheet or accounting package. Suppose that you have the following non-normalized table, named au_orders, showing the order of the authors’ names on each book’s cover:

title_id author1 author2 author3
-------- ------- ------- -------
T01      A01     NULL    NULL
T02      A01     NULL    NULL
T03      A05     NULL    NULL
T04      A03     A04     NULL
T05      A04     NULL    NULL
T06      A02     NULL    NULL
T07      A02     A04     NULL
T08      A06     NULL    NULL
T09      A06     NULL    NULL
T10      A02     NULL    NULL
T11      A06     A03     A04
T12      A02     NULL    NULL
T13      A01     NULL    NULL

Listing 15.8 cross-joins au_orders with seq to produce Figure 15.8. You can DELETE the result rows with nulls in the column au_id, leaving the result set looking like the table title_authors in the sample database. Note that Listing 15.8 does the reverse of Listing 8.18 in Chapter 8.

Listing 15.8Normalize the table au_orders. See Figure 15.8 for the result.

SELECT title_id,
    (CASE WHEN i=1 THEN '1'
          WHEN i=2 THEN '2'
          WHEN i=3 THEN '3'
    END) AS au_order,
    (CASE WHEN i=1 THEN author1
          WHEN i=2 THEN author2
          WHEN i=3 THEN author3
    END) AS au_id
  FROM au_orders, seq
  WHERE i BETWEEN 1 AND 3
  ORDER BY title_id, i;

Figure 15.8Result of Listing 15.8.

title_id au_order au_id
-------- -------- -----
T01      1        A01
T01      2        NULL
T01      3        NULL
T02      1        A01
T02      2        NULL
T02      3        NULL
T03      1        A05
T03      2        NULL
T03      3        NULL
T04      1        A03
T04      2        A04
T04      3        NULL
T05      1        A04
T05      2        NULL
T05      3        NULL
T06      1        A02
T06      2        NULL
T06      3        NULL
T07      1        A02
T07      2        A04
T07      3        NULL
T08      1        A06
T08      2        NULL
T08      3        NULL
T09      1        A06
T09      2        NULL
T09      3        NULL
T10      1        A02
T10      2        NULL
T10      3        NULL
T11      1        A06
T11      2        A03
T11      3        A04
T12      1        A02
T12      2        NULL
T12      3        NULL
T13      1        A01
T13      2        NULL
T13      3        NULL

Tips for Generating Sequences

Calendar Tables

Another useful auxiliary table is a calendar table. One type of calendar table has a primary-key column that contains a row for each calendar date (past and future) and other columns that indicate the date’s attributes: business day, holiday, international holiday, fiscal-month end, fiscal-year end, Julian date, business-day offsets, and so on. Another type of calendar table stores the starting and ending dates of events (in the columns event_id, start_date, and end_date, for example). Spreadsheets have more date-arithmetic functions than DBMSs, so it might be easier to build a calendar table in a spreadsheet and then import it as a database table.

Even if your DBMS has plenty of date-arithmetic functions, it might be faster to look up data in a calendar table than to call these functions in a query.

Finding Sequences, Runs, and Regions

A sequence is a series of consecutive values without gaps. A run is like a sequence, but the values don’t have to be consecutive, just increasing (that is, gaps are allowed). A region is an unbroken series of values that all are equal.

Finding these series requires a table that has at least two columns: a primary-key column that holds a sequence of consecutive integers and a column that holds the values of interest. The table temps (Listing 15.9 and Figure 15.9) shows a series of high temperatures over 15 days.

As a set-oriented language, SQL isn’t a good choice for finding series of values. The following queries won’t run very fast, so if you have a lot of data to analyze, then consider exporting it to a statistical package or using a procedural host language.

Listing 15.9List all the columns in the table temps. See Figure 15.9 for the result.

SELECT *
  FROM temps;

Figure 15.9Result of Listing 15.9.

id hi_temp
-- -------
 1      49
 2      46
 3      48
 4      50
 5      50
 6      50
 7      51
 8      52
 9      53
10      50
11      50
12      47
13      50
14      51
15      52

Listing 15.10 finds all the sequences in temps and lists each sequence’s start position, end position, and length. See Figure 15.10 for the result.

Listing 15.10List the starting point, ending point, and length of each sequence in the table temps. See Figure 15.10 for the result.

SELECT
    t1.id AS StartSeq,
    t2.id AS EndSeq,
    t2.id - t1.id + 1 AS SeqLen
  FROM temps t1, temps t2
  WHERE (t1.id < t2.id)
    AND NOT EXISTS(
      SELECT *
        FROM temps t3
        WHERE (t3.hi_temp - t3.id <>
               t1.hi_temp - t1.id
               AND t3.id BETWEEN
                   t1.id AND t2.id)
           OR (t3.id = t1.id - 1
               AND t3.hi_temp - t3.id =
                   t1.hi_temp - t1.id)
           OR (t3.id = t2.id + 1
               AND t3.hi_temp - t3.id =
                   t1.hi_temp - t1.id)
    );

Figure 15.10Result of Listing 15.10.

StartSeq EndSeq SeqSize
-------- ------ -------
       6      9       4
      13     15       3

Listing 15.10 is a lot to take in at first glance, but it’s easier to understand it if you look at it piecemeal. Then you’ll be able to understand the rest of the queries in this section.

The subquery’s WHERE clause subtracts id from hi_temp, yielding (internally):

id hi_temp diff
-- ------- ----
 1      49   48
 2      46   44
 3      48   45
 4      50   46
 5      50   45
 6      50   44
 7      51   44
 8      52   44
 9      53   44
10      50   40
11      50   39
12      47   35
13      50   37
14      51   37
15      52   37

In the column diff, note that successive differences are constant for sequences (50 − 6 = 44, 51 − 7 = 44, and so on). To find neighboring rows, the outer query cross-joins two instances of the same table (t1 and t2), as described in “Calculating Running Statistics” earlier in this chapter. The condition

WHERE (t1.id < t2.id)

guarantees that any t1 row represents an element with an index (id) lower than the corresponding t2 row.

The subquery detects sequence breaks with the condition

t3.hi_temp - t3.id <> t1.hi_temp - t1.id

The third instance of temps (t3) in the subquery is used to determine whether any row in a candidate sequence (t3) has the same difference as the sequence’s first row (t1). If so, then it’s a sequence member. If not, then the candidate pair (t1 and t2) is rejected.

The last two OR conditions determine whether the candidate sequence’s borders can expand. A row that satisfies these conditions means the current candidate sequence can be extended and is rejected in favor of a longer one.

To find only sequences larger than n rows, add the WHERE condition

AND (t2.id - t1.id) >= n - 1

To change Listing 15.10 to find all sequences of four or more rows, for example, replace

WHERE (t1.id < t2.id)

with

WHERE (t1.id < t2.id)
  AND (t2.id - t1.id) >= 3

The result is:

StartSeq EndSeq SeqSize
-------- ------ -------
       6      9       4

Listing 15.11 finds all the runs in temps and lists each run’s start position, end position, and length. See Figure 15.11 for the result.

The logic of this query is similar to that of the preceding one but accounts for run values needing only to increase, not (necessarily) be consecutive. The difference between id and hi_temp values doesn’t have to be constant, so a fourth instance of temps (t4) is needed. The subquery cross-joins t3 and t4 to check rows in the middle of a candidate run, whose borders are t1 and t2. For every element between t1 and t2 (limited by BETWEEN), t3 and its predecessor t4 are compared to see whether their values are increasing.

Listing 15.11List the starting point, ending point, and length of each run in the table temps. See Figure 15.11 for the result.

SELECT
    t1.id AS StartRun,
    t2.id AS EndRun,
    t2.id - t1.id + 1 AS RunLen
  FROM temps t1, temps t2
  WHERE (t1.id < t2.id)
    AND NOT EXISTS(
      SELECT *
        FROM temps t3, temps t4
        WHERE (t3.hi_temp <= t4.hi_temp
               AND t4.id = t3.id - 1
               AND t3.id BETWEEN
                   t1.id + 1 AND t2.id)
           OR (t3.id = t1.id - 1
               AND t3.hi_temp < t1.hi_temp)
           OR (t3.id = t2.id + 1
               AND t3.hi_temp > t2.hi_temp)
    );

Figure 15.11Result of Listing 15.11.

StartRun EndRun RunLen
-------- ------ ------
       2      4      3
       6      9      4
      12     15      4

Listing 15.12 finds all regions in temps with a high temperature of 50 and lists each region’s start position, end position, and length. See Figure 15.12 for the result.

Listing 15.12List the starting point, ending point, and length of each region (with value 50) in the table temps. See Figure 15.12 for the result.

SELECT
    t1.id AS StartReg,
    t2.id AS EndReg,
    t2.id - t1.id + 1 AS RegLen
  FROM temps t1, temps t2
  WHERE (t1.id < t2.id)
    AND NOT EXISTS(
      SELECT *
        FROM temps t3
        WHERE (t3.hi_temp <> 50
               AND t3.id BETWEEN
                   t1.id AND t2.id)
           OR (t3.id = t1.id - 1
               AND t3.hi_temp = 50)
           OR (t3.id = t2.id + 1
               AND t3.hi_temp = 50)
    );

Figure 15.12Result of Listing 15.12.

StartReg EndReg RegLen
-------- ------ ------
       4      6      3
      10     11      2

Listing 15.13 is a variation of Listing 15.12 that finds all regions of length 2. See Figure 15.13 for the result. Note that overlapping subregions are listed. To return regions of length n, change the WHERE clause’s second condition to:

AND t2.id - t1.id = n - 1

Listing 15.13List all regions of length 2. See Figure 15.13 for the result.

SELECT
    t1.id AS StartReg,
    t2.id AS EndReg,
    t2.id - t1.id + 1 AS RegLen
  FROM temps t1, temps t2
  WHERE (t1.id < t2.id)
    AND t2.id - t1.id = 1
    AND NOT EXISTS(
      SELECT *
        FROM temps t3
        WHERE (t3.hi_temp <> 50
          AND t3.id BETWEEN t1.id AND t2.id)
    );

Figure 15.13Result of Listing 15.13.

StartReg EndReg RegLen
-------- ------ ------
       4      5      2
       5      6      2
      10     11      2

Tips for Sequences, Runs, and Regions

Limiting the Number of Rows Returned

In practice it’s common to use queries that return a certain number (n) of rows that fall at the top or the bottom of a range specified by an ORDER BY clause. SQL doesn’t require an ORDER BY clause, but if you omit it, then the query will return an unsorted set of rows (because SQL doesn’t promise to deliver query results in any particular order without an ORDER BY clause).

The examples in this section use the table empsales (Listing 15.14 and Figure 15.14), which lists sales figures by employee. Note that some employees have the same sales amounts. A correct query for the top three salespeople in empsales actually will return four rows: employees E09, E02, E10, and E05. Ties shouldn’t force the query to choose arbitrarily between equal values (E10 and E05 in this case). No standard terminology exists, but queries that return at most n rows (regardless of ties) sometimes are called limit queries. Queries that include ties and return possibly more than n rows are top-n queries or quota queries.

You can also use limit and quota queries to limit the number of rows affected by an UPDATE or DELETE statement. Some limit and quota queries might be invalid in some contexts (such as in subqueries or views); see your DBMS documentation.

Listing 15.14List employees by descending sales. See Figure 15.14 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC;

Figure 15.14Result of Listing 15.14.

emp_id sales
------ -----
E09      900
E02      800
E10      700
E05      700
E01      600
E04      500
E03      500
E06      500
E08      400
E07      300

The SQL:2003 standard introduced the functions ROW_NUMBER() and RANK() to use in limit and top-n queries. Microsoft SQL Server, Oracle, and Db2 support both functions. Queries that use pre-2003 SQL are complex, unintuitive, and run slowly (see the tips at the end of this section for an SQL-92 example). The SQL standard has lagged DBMSs, which for years have offered nonstandard extensions to create these types of queries. Some DBMSs also let you return a percentage of rows (rather than a fixed n) or return offsets by skipping a specified number of initial rows (returning rows 3–8 instead of 1–5, for example). This section covers the DBMS extensions individually.

Microsoft Access

Listing 15.15 lists the top three salespeople, including ties. See Figure 15.15 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

The TOP clause always includes ties. Its syntax is:

TOP n [PERCENT]

Listing 15.15List the top three salespeople, with ties. See Figure 15.15 for the result.

SELECT TOP 3 emp_id, sales
  FROM empsales
  ORDER BY sales DESC;

Figure 15.15Result of Listing 15.15.

emp_id sales
------ -----
E09      900
E02      800
E10      700
E05      700

Listing 15.16 lists the bottom 40 percent of salespeople, including ties. See Figure 15.16 for the result. This query orders lowest to highest; to reverse the order, change ASC to DESC in the ORDER BY clause.

Listing 15.16List the bottom 40 percent of salespeople, with ties. See Figure 15.16 for the result.

SELECT TOP 40 PERCENT emp_id, sales
  FROM empsales
  ORDER BY sales ASC;

Figure 15.16Result of Listing 15.16.

emp_id sales
------ -----
E07      300
E08      400
E06      500
E04      500
E03      500

Tips for Microsoft Access

Microsoft SQL Server

Listing 15.17 lists the top three salespeople, not including ties. See Figure 15.17 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

The TOP clause’s syntax is:

TOP n [PERCENT] [WITH TIES]

Listing 15.17List the top three salespeople, without ties. See Figure 15.17 for the result.

SELECT TOP 3 emp_id, sales
  FROM empsales
  ORDER BY sales DESC;

Figure 15.17Result of Listing 15.17.

emp_id sales
------ -----
E09      900
E02      800
E05      700

Listing 15.18 lists the top three salespeople, including ties. See Figure 15.18 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.18List the top three salespeople, with ties. See Figure 15.18 for the result.

SELECT TOP 3 WITH TIES emp_id, sales
  FROM empsales
  ORDER BY sales DESC;

Figure 15.18Result of Listing 15.18.

emp_id sales
------ -----
E09      900
E02      800
E05      700
E10      700

Listing 15.19 lists the bottom 40 percent of salespeople, including ties. See Figure 15.19 for the result. This query orders lowest to highest; to reverse the order, change ASC to DESC in the ORDER BY clause.

Listing 15.19List the bottom 40 percent of salespeople, with ties. See Figure 15.19 for the result.

SELECT TOP 40 PERCENT WITH TIES
    emp_id, sales
  FROM empsales
  ORDER BY sales ASC;

Figure 15.19Result of Listing 15.19.

emp_id sales
------ -----
E07      300
E08      400
E06      500
E03      500
E04      500

Tips for Microsoft SQL Server

Oracle Database

Use the built-in ROWNUM pseudocolumn to limit the number of rows returned. The first row selected has a ROWNUM of 1, the second has 2, and so on. Use the window function RANK() to include ties.

Listing 15.20 lists the top three salespeople, not including ties. See Figure 15.20 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.20List the top three salespeople, without ties. See Figure 15.20 for the result.

SELECT emp_id, sales
  FROM (
    SELECT *
      FROM empsales
      ORDER BY sales DESC)
  WHERE ROWNUM <= 3;

Figure 15.20Result of Listing 15.20.

emp_id sales
------ -----
E09      900
E02      800
E05      700

Listing 15.21 lists the top three salespeople, including ties. See Figure 15.21 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.21List the top three salespeople, with ties. See Figure 15.21 for the result.

SELECT emp_id, sales
  FROM (
    SELECT
      RANK() OVER
        (ORDER BY sales DESC)
          AS sales_rank,
      emp_id,
      sales
    FROM empsales)
  WHERE sales_rank <= 3;

Figure 15.21Result of Listing 15.21.

emp_id sales
------ -----
E09      900
E02      800
E05      700
E10      700

Tips for Oracle Database

IBM Db2 Database

Listing 15.22 lists the top three salespeople, not including ties. See Figure 15.22 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

The FETCH clause’s syntax is:

FETCH FIRST n ROW[S] ONLY

Listing 15.22List the top three salespeople, without ties. See Figure 15.22 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC
  FETCH FIRST 3 ROWS ONLY;

Figure 15.22Result of Listing 15.22.

emp_id sales
------ -----
E09      900
E02      800
E05      700

Listing 15.23 lists the top three salespeople, including ties. See Figure 15.23 for the result. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.23List the top three salespeople, with ties. See Figure 15.23 for the result.

SELECT emp_id, sales
  FROM (
    SELECT
      RANK() OVER
        (ORDER BY sales DESC)
          AS sales_rank,
      emp_id,
      sales
    FROM empsales)
      AS any_name
  WHERE sales_rank <= 3;

Figure 15.23Result of Listing 15.23.

emp_id sales
------ -----
E09      900
E02      800
E05      700
E10      700

Tips for IBM Db2 Database

MySQL

Listing 15.24 lists the top three salespeople, not including ties. See Figure 15.24 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

The LIMIT clause’s syntax is:

LIMIT n [OFFSET skip]

or

LIMIT [skip,] n

The offset of the initial row is 0 (not 1).

Listing 15.24List the top three salespeople, without ties. See Figure 15.24 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC
  LIMIT 3;

Figure 15.24Result of Listing 15.24.

emp_id sales
------ -----
E09      900
E02      800
E10      700

Listing 15.25 lists the top three salespeople, including ties. The OFFSET value is n − 1 = 2. COALESCE()’s second argument lets the query work in case the table has fewer than n rows; see “Checking for Nulls with COALESCE()” in Chapter 5. See Figure 15.25 for the result. This query orders highest to lowest; to reverse the order, change >= to <= in the comparison, change MIN() to MAX() in the second subquery, and change DESC to ASC in each ORDER BY clause.

Listing 15.25List the top three salespeople, with ties. See Figure 15.25 for the result.

SELECT emp_id, sales
  FROM empsales
  WHERE sales >= COALESCE(
    (SELECT sales
       FROM empsales
       ORDER BY sales DESC
       LIMIT 1 OFFSET 2),
    (SELECT MIN(sales)
       FROM empsales))
  ORDER BY sales DESC;

Figure 15.25Result of Listing 15.25.

emp_id sales
------ -----
E09      900
E02      800
E05      700
E10      700

Listing 15.26 lists the top three salespeople, skipping the initial four rows. See Figure 15.26 for the result. Note that this query is inconsistent when ties exist. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.26List the top three salespeople, skipping the initial four rows. See Figure 15.26 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC
  LIMIT 3 OFFSET 4;

Figure 15.26Result of Listing 15.26.

emp_id sales
------ -----
E01      600
E04      500
E03      500

PostgreSQL

Listing 15.27 lists the top three salespeople, not including ties. See Figure 15.27 for the result. Note that this query is inconsistent when ties exist; rerunning it can return either E10 or E05, depending on how ORDER BY sorts the table. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

The LIMIT clause’s syntax is:

LIMIT n [OFFSET skip]

The offset of the initial row is 0 (not 1).

PostgreSQL 13 and later versions support the OFFSET and FETCH clauses:

OFFSET skip [ROW | ROWS]

FETCH [FIRST | NEXT] [n] [ROW | ROWS]
  [ONLY | WITH TIES]

If n is omitted in a FETCH clause, then it defaults to 1. For example,

FETCH FIRST WITH TIES

returns any additional rows that match the last row.

The queries in this section use the LIMIT clause to ensure backward compatibility.

Listing 15.27List the top three salespeople, without ties. See Figure 15.27 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC
  LIMIT 3;

Figure 15.27Result of Listing 15.27.

emp_id sales
------ -----
E09      900
E02      800
E05      700

Listing 15.28 lists the top three salespeople, including ties. The OFFSET value is n − 1 = 2. See Figure 15.28 for the result. This query orders highest to lowest; to reverse the order, change >= to <= in the comparison and change DESC to ASC in each ORDER BY clause.

Listing 15.28List the top three salespeople, with ties. See Figure 15.28 for the result.

SELECT emp_id, sales
  FROM empsales
  WHERE (
    sales >= (
      SELECT sales
        FROM empsales
        ORDER BY sales DESC
        LIMIT 1 OFFSET 2)
  ) IS NOT FALSE
  ORDER BY sales DESC;

Figure 15.28Result of Listing 15.28.

emp_id sales
------ -----
E09      900
E02      800
E10      700
E05      700

Listing 15.29 lists the top three salespeople, skipping the initial four rows. See Figure 15.29 for the result. Note that this query is inconsistent when ties exist. This query orders highest to lowest; to reverse the order, change DESC to ASC in the ORDER BY clause.

Listing 15.29List the top three salespeople, skipping the initial four rows. See Figure 15.29 for the result.

SELECT emp_id, sales
  FROM empsales
  ORDER BY sales DESC
  LIMIT 3 OFFSET 4;

Figure 15.29Result of Listing 15.29.

emp_id sales
------ -----
E01      600
E06      500
E03      500

Tips for Limiting the Number of Rows Returned

Assigning Ranks

Ranking, which allocates the numbers 1, 2, 3,... to sorted values, is related to top-n queries and shares the problem of interpreting ties. The following queries calculate ranks for sales values in the table empsales from “Limiting the Number of Rows Returned”.

Listings 15.30a to 15.30e rank employees by sales. The first two queries show the most commonly accepted ways to rank values. The other queries show variations on them. Figure 15.30 shows the result of each ranking method, a to e, combined for brevity and ease of comparison. These queries rank highest to lowest; to reverse the order, change > (or >=) to < (or <=) in the WHERE comparisons. You can add the clause ORDER BY ranking ASC to a query’s outer SELECT to sort the results by rank.

Listing 15.30aRank employees by sales (method a). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales >= e1.sales)
         AS ranking
  FROM empsales e1;

Listing 15.30bRank employees by sales (method b). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales) + 1
         AS ranking
  FROM empsales e1;

Listing 15.30cRank employees by sales (method c). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales)
         AS ranking
  FROM empsales e1;

Listing 15.30dRank employees by sales (method d). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(DISTINCT sales)
       FROM empsales e2
       WHERE e2.sales >= e1.sales)
         AS ranking
  FROM empsales e1;

Listing 15.30eRank employees by sales (method e). See Figure 15.30 for the result.

SELECT e1.emp_id, e1.sales,
    (SELECT COUNT(DISTINCT sales)
       FROM empsales e2
       WHERE e2.sales > e1.sales)
         AS ranking
  FROM empsales e1;

Figure 15.30Compilation of results of Listings 15.30a to 15.30e.

emp_id sales a  b  c  d  e
------ ----- -- -- -- -- --
E09      900  1  1  0  1  0
E02      800  2  2  1  2  1
E10      700  4  3  2  3  2
E05      700  4  3  2  3  2
E01      600  5  5  4  4  3
E04      500  8  6  5  5  4
E03      500  8  6  5  5  4
E06      500  8  6  5  5  4
E08      400  9  9  8  6  5
E07      300 10 10  9  7  6

These ranking queries use correlated subqueries and so run slowly. If you’re ranking a large number of items, then you should use a built-in rank function, if available. The SQL:2003 standard introduced the functions RANK() and DENSE_RANK(), which Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL support. Alternatively, you can use your DBMS’s SQL extensions to calculate ranks efficiently. The following MySQL script, for example, is equivalent to Listing 15.30b:

SET @rownum = 0;
SET @rank = 0;
SET @prev_val = NULL;
SELECT
    @rownum := @rownum + 1 AS row,
    @rank := IF(@prev_val <> sales,
      @rownum, @rank) AS rank,
    @prev_val := sales AS sales
  FROM empsales
  ORDER BY sales DESC;

Microsoft Access doesn’t support COUNT(DISTINCT) and won’t run Listing 15.30d and Listing 15.30e. For a workaround, see “Aggregating Distinct Values with DISTINCT” in Chapter 6.

Calculating a Trimmed Mean

The trimmed mean is a robust order statistic that is the arithmetic mean (average) of the data if the k smallest values and k largest values are discarded. The goal is to avoid the influence of extreme observations.

Listing 15.31 calculates the trimmed mean of book sales in the sample database by omitting the top three and bottom three sales figures. See Figure 15.31 for the result. For reference, the 12 sorted sales values are 566, 4095, 5000, 9566, 10467, 11320, 13001, 25667, 94123, 100001, 201440, and 1500200. This query discards 566, 4095, 5000, 100001, 201440, and 1500200 and then calculates the mean in the usual way by using the remaining six middle values. Nulls are ignored. Duplicate values are either all removed or all retained. (If all sales are the same, for example, then none of them will be trimmed no matter what k is.)

Listing 15.31Calculate the trimmed mean for k = 3. See Figure 15.31 for the result.

SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE
    (SELECT COUNT(*)
       FROM titles t2
       WHERE t2.sales <= t1.sales) > 3
    AND
    (SELECT COUNT(*)
       FROM titles t3
       WHERE t3.sales >= t1.sales) > 3;

Figure 15.31Result of Listing 15.31.

TrimmedMean
-----------
 27357.3333

Listing 15.32 is similar to Listing 15.31 but trims a fixed percentage of the extreme values rather than a fixed number. Trimming by 0.25 (25%), for example, discards the sales in the top and bottom quartiles and averages what’s left. See Figure 15.32 for the result.

Listing 15.32Calculate the trimmed mean by discarding the lower and upper 25% of values. See Figure 15.32 for the result.

SELECT AVG(sales) AS TrimmedMean
  FROM titles t1
  WHERE
    (SELECT COUNT(*)
       FROM titles t2
       WHERE t2.sales <= t1.sales) >=
         (SELECT 0.25 * COUNT(*)
           FROM titles)
    AND
    (SELECT COUNT(*)
       FROM titles t3
       WHERE t3.sales >= t1.sales) >=
         (SELECT 0.25 * COUNT(*)
           FROM titles);

Figure 15.32Result of Listing 15.32.

TrimmedMean
-----------
 27357.3333

Microsoft SQL Server and Db2 return an integer for the trimmed mean because the column sales is defined as an INTEGER. To get a floating-point value, change AVG(sales) to AVG(CAST(sales AS FLOAT)). For more information, see “Converting Data Types with CAST()” in Chapter 5.

Picking Random Rows

Some databases are so large, and queries on them so complex, that often it’s impractical (and unnecessary) to retrieve all the data relevant to a query. If you’re interested in finding an overall trend or pattern, for example, then an approximate answer within some margin of error will usually suffice. One way to speed such queries is to select a random sample of rows. An efficient sample can improve performance by orders of magnitude yet still yield accurate results.

Standard SQL’s TABLESAMPLE clause returns a random subset of rows. Microsoft SQL Server, Db2, and PostgreSQL support TABLESAMPLE, and Oracle has a similar feature. For the other DBMSs, use a (nonstandard) function that returns a uniform random number between 0 and 1 (Table 15.1).

Table 15.1Randomization Features
DBMS Clause or Function
Access RND() function
SQL Server TABLESAMPLE clause
Oracle SAMPLE clause or DBMS_RANDOM package
Db2 TABLESAMPLE clause
MySQL RAND() function
PostgreSQL TABLESAMPLE clause

Listing 15.33a randomly picks about 25% (0.25) of the rows from the sample-database table titles. If necessary, change RAND() to the function that appears in Table 15.1 for your DBMS. For Oracle, use Listing 15.33b. For Microsoft SQL Server, Db2, and PostgreSQL, use Listing 15.33c.

Listing 15.33aSelect about 25% percent of the rows in the table titles at random. See Figure 15.33 for a possible result.

SELECT title_id, type, sales
  FROM titles
  WHERE RAND() < 0.25;

Listing 15.33bSelect about 25% percent of the rows in the table titles at random (Oracle only). See Figure 15.33 for a possible result.

SELECT title_id, type, sales
  FROM titles
  SAMPLE (25);

Listing 15.33cSelect about 25% percent of the rows in the table titles at random (Microsoft SQL Server, Db2, and PostgreSQL only). See Figure 15.33 for a possible result.

SELECT title_id, type, sales
  FROM titles
  TABLESAMPLE SYSTEM (25);

Figure 15.33 shows one possible result of a random selection. The rows and the number of rows returned will change every time that you run the query. If you need an exact number of random rows, then increase the sampling percentage and use one of the techniques described in “Limiting the Number of Rows Returned” earlier in this chapter.

Figure 15.33One possible result of Listing 15.33a/b/c.

title_id type       sales
-------- ---------- -----
T03      computer   25667
T04      psychology 13001
T11      psychology 94123

Tips for Picking Random Rows

Selecting Every nth Row

Instead of picking random rows, you can pick every nth row by using a modulo expression:

This expression returns the remainder of m divided by n. For example, MOD(20, 6) is 2 because 20 equals (3 × 6) + 2. MOD(a, 2) is 0 if a is an even number.

The condition MOD(rownumber, n) = 0 picks every nth row, where rownumber is a column of consecutive integers or row identifiers. This Oracle query, for example, picks every third row in a table:

SELECT *
  FROM table
  WHERE (ROWID,0) IN
    (SELECT ROWID, MOD(ROWNUM,3)
       FROM table);

Note that rownumber imposes a row order that doesn’t exist implicitly in a relational-database table.

Handling Duplicates

Normally you use SQL’s PRIMARY KEY or UNIQUE constraints to prevent duplicate rows from appearing in production tables. But you need to know how to handle duplicates that appear when you accidentally import the same data twice or import data from a nonrelational source such as a spreadsheet or accounting package, where redundant information is rampant. This section describes how to detect, count, and remove duplicates.

Suppose that you import rows into a staging table to detect and eliminate any duplicates before inserting the data into a production table (Listing 15.34 and Figure 15.34). The column id is a unique row identifier that lets you identify and select rows that otherwise would be duplicates. If your imported rows don’t already have an identity column, then you can add one yourself; see “Unique Identifiers” and “Generating Sequences”. It’s a good practice to add an identity column to even short-lived working tables, but in this case it also makes deleting duplicates easy. The imported data might include other columns too, but you’ve decided that the combination of only book title, book type, and price determines whether a row is a duplicate, regardless of the values in any other columns. Before you identify or delete duplicates, you must define exactly what it means for two rows to be considered “duplicates” of each other.

Listing 15.34List the imported rows. See Figure 15.34 for the result.

SELECT id, title_name, type, price
  FROM dups;

Figure 15.34Result of Listing 15.34.

id title_name   type      price
-- ------------ --------- -----
 1 Book Title 5 children  15.00
 2 Book Title 3 biography  7.00
 3 Book Title 1 history   10.00
 4 Book Title 2 children  20.00
 5 Book Title 4 history   15.00
 6 Book Title 1 history   10.00
 7 Book Title 3 biography  7.00
 8 Book Title 1 history   10.00

Listing 15.35 lists only the duplicates by counting the number of occurrences of each unique combination of title_name, type, and price. See Figure 15.35 for the result. If this query returns an empty result, then the table contains no duplicates. To list only the nonduplicates, change COUNT(*) > 1 to COUNT(*) = 1.

Listing 15.35List only duplicates. See Figure 15.35 for the result.

SELECT title_name, type, price
  FROM dups
  GROUP BY title_name, type, price
  HAVING COUNT(*) > 1;

Figure 15.35Result of Listing 15.35.

title_name   type      price
------------ --------- -----
Book Title 1 history   10.00
Book Title 3 biography  7.00

Listing 15.36 uses a similar technique to list each row and its duplicate count. See Figure 15.36 for the result. To list only the duplicates, change COUNT(*) >= 1 to COUNT(*) > 1.

Listing 15.36List each row and its number of repetitions. See Figure 15.36 for the result.

SELECT title_name, type, price,
    COUNT(*) AS NumDups
  FROM dups
  GROUP BY title_name, type, price
  HAVING COUNT(*) >= 1
  ORDER BY COUNT(*) DESC;

Figure 15.36Result of Listing 15.36.

title_name   type      price NumDups
------------ --------- ----- -------
Book Title 1 history   10.00       3
Book Title 3 biography  7.00       2
Book Title 4 history   15.00       1
Book Title 2 children  20.00       1
Book Title 5 children  15.00       1

Listing 15.37 deletes duplicate rows from dups in place. This statement uses the column id to leave exactly one occurrence (the one with the highest ID) of each duplicate. Figure 15.37 shows the table dups after running this statement. See also “Deleting Rows with DELETE” in Chapter 10.

Listing 15.37Remove the redundant duplicates in place. See Figure 15.37 for the result.

DELETE FROM dups
  WHERE id < (
    SELECT MAX(d.id)
      FROM dups d
      WHERE dups.title_name = d.title_name
        AND dups.type = d.type
        AND dups.price = d.price);

Figure 15.37Result of Listing 15.37.

id  title_name  type      price
-- ------------ --------- -----
 1 Book Title 5 children  15.00
 4 Book Title 2 children  20.00
 5 Book Title 4 history   15.00
 7 Book Title 3 biography  7.00
 8 Book Title 1 history   10.00

Tips for Handling Duplicates

Messy Data

Deleting duplicates gets harder as data get messier. It’s not unusual to buy a mailing list containing entries that look like this:

name          address1
----------    ------------------
John Smith    123 Main St
John Smith    123 Main St, Apt 1
Jack Smiht    121 Main Rd
John Symthe   123 Main St.
Jon Smith     123 Mian Street

DBMSs offer nonstandard tools such as Soundex (phonetic) functions to suppress spelling variations, but creating an automated deletion program that works over thousands or millions of rows is a major project.

Creating a Telephone List

You can use the function COALESCE() with a left outer join to create a convenient telephone listing from a normalized table of telephone numbers. Suppose that the sample database has an extra table named telephones that stores the authors’ work and home telephone numbers:

au_id tel_type tel_no
----- -------- ------------
A01   H        111-111-1111
A01   W        222-222-2222
A02   W        333-333-3333
A04   H        444-444-4444
A04   W        555-555-5555
A05   H        666-666-6666

The table’s composite primary key is (au_id, tel_type), where tel_type indicates whether tel_no is a work (W) or home (H) number. Listing 15.38 lists the authors’ names and numbers. If an author has only one number, then that number is listed. If an author has both home and work numbers, then only the work number is listed. Authors with no numbers aren’t listed. See Figure 15.38 for the result.

The first left join picks out the work numbers, and the second picks out the home numbers. The WHERE clause filters out authors with no numbers. (You can extend this query to add mobile and other numbers.)

Listing 15.38Lists the authors’ names and telephone numbers, favoring work numbers over home numbers. See Figure 15.38 for the result.

SELECT
    a.au_id AS "ID",
    a.au_fname AS "FirstName",
    a.au_lname AS "LastName",
    COALESCE(twork.tel_no, thome.tel_no)
      AS "TelNo",
    COALESCE(twork.tel_type, thome.tel_type)
      AS "TelType"
  FROM authors a
  LEFT OUTER JOIN telephones twork
    ON a.au_id = twork.au_id
    AND twork.tel_type = 'W'
  LEFT OUTER JOIN telephones thome
    ON a.au_id = thome.au_id
    AND thome.tel_type = 'H'
  WHERE COALESCE(twork.tel_no, thome.tel_no)
    IS NOT NULL
  ORDER BY a.au_fname ASC, a.au_lname ASC;

Figure 15.38Result of Listing 15.38.

ID  FirstName LastName  TelNo        TelType
--- --------- --------- ------------ -------
A05 Christian Kells     666-666-6666 H
A04 Klee      Hull      555-555-5555 W
A01 Sarah     Buchman   222-222-2222 W
A02 Wendy     Heydemark 333-333-3333 W

Microsoft Access won’t run Listing 15.38 because of the restrictions Access puts on join expressions.

Retrieving Metadata

Metadata are data about data. In DBMSs, metadata include information about schemas, databases, users, tables, columns, and so on. When meeting a new database, inspect its metadata: What’s in the database? How big is it? How are the tables organized?

Metadata, like other data, are stored in tables and so can be accessed via SELECT queries. Metadata also can be accessed, often more conveniently, by using command-line and graphical tools. The following listings show DBMS-specific examples for viewing metadata. The DBMS itself maintains metadata—look, but don’t touch.

The SQL standard calls a set of metadata a catalog and specifies that it be accessed through the schema INFORMATION_SCHEMA. Not all DBMSs implement this schema or use the same terms. In Microsoft SQL Server, for example, the equivalent term for a catalog is a database and for a schema, an owner. In Oracle, the repository of metadata is the data dictionary.

Microsoft Access

Microsoft Access metadata are available graphically through the Design View of each database object and programmatically through the Visual Basic for Applications (VBA) or C# language. Access also creates and maintains hidden system tables in each database.

To show system tables in Microsoft Access:

The system tables begin with MSys and are commingled with the database’s other tables. You can open and query them as you would ordinary tables. The most interesting system table is MSysObjects, which catalogs all the objects in the database. Listing 15.39 lists all the tables in the current database. Note that system tables don’t have to be visible to be used in queries.

Listing 15.39List the tables in the current Microsoft Access database. To list queries instead, change Type = 1 to Type = 5.

SELECT Name
  FROM MSysObjects
  WHERE Type = 1;

Microsoft SQL Server

Microsoft SQL Server metadata are available through the schema INFORMATION_SCHEMA and via system stored procedures (Listing 15.40).

Listing 15.40Metadata statements and commands for Microsoft SQL Server.

-- List the databases.
exec sp_helpdb;

-- List the schemas.
SELECT schema_name
  FROM information_schema.schemata;

-- List the tables (Method 1).
SELECT *
  FROM information_schema.tables
  WHERE table_type = 'BASE TABLE'
    AND table_schema = 'schema_name';

-- List the tables (Method 2).
exec sp_tables;

-- Describe a table (Method 1).
SELECT *
  FROM information_schema.columns
  WHERE table_catalog = 'db_name'
    AND table_schema = 'schema_name'
    AND table_name = 'table_name';

-- Describe a table (Method 2).
exec sp_help table_name;

Oracle Database

Oracle metadata are available through data dictionary views and via the sqlplus command-line tool (Listing 15.41). To list data dictionary views, run this query in sqlplus:

SELECT table_name, comments
  FROM dictionary
  ORDER BY table_name;

For a list of Oracle databases (instances) in Unix or Linux, look in the file oratab located in the directory /etc or /var/opt/oracle. In Windows, run this command at a command prompt:

net start | find /i "OracleService"

Or choose Start > Run (Windows logo key + R), type services.msc, press Enter, and then inspect the Services list for entries that begin with OracleService.

Listing 15.41Metadata statements and commands for Oracle Database.

-- List the schemas (users).
SELECT *
  FROM all_users;

-- List the tables.
SELECT table_name
  FROM all_tables
  WHERE owner = 'user_name';

-- Describe a table (Method 1).
SELECT *
  FROM all_tab_columns
  WHERE owner = 'user_name'
    AND table_name = 'table_name';

-- Describe a table (Method 2, in sqlplus).
DESCRIBE table_name;

IBM Db2 Database

Db2 metadata are available through the system catalog SYSCAT and via the db2 command-line tool (Listing 15.42).

Listing 15.42Metadata statements and commands for IBM Db2 Database.

-- List the databases (in db2).
LIST DATABASE DIRECTORY;

-- List the schemas.
SELECT schemaname
  FROM syscat.schemata;

-- List the tables (Method 1).
SELECT tabname
  FROM syscat.tables
  WHERE tabschema = 'schema_name';

-- List the tables (Method 2, in db2).
LIST TABLES;

-- List the tables (Method 3, in db2).
LIST TABLES FOR SCHEMA schema_name;

-- Describe a table (Method 1).
SELECT *
  FROM syscat.columns
  WHERE tabname = 'table_name'
    AND tabschema = 'schema_name';

-- Describe a table (Method 2, in db2).
DESCRIBE TABLE table_name SHOW DETAIL;

MySQL

MySQL metadata are available through the schema INFORMATION_SCHEMA and via the mysql command-line tool (Listing 15.43).

Listing 15.43Metadata statements and commands for MySQL.

-- List the databases (Method 1).
SELECT schema_name
  FROM information_schema.schemata;

-- List the databases (Method 2, in mysql).
SHOW DATABASES;

-- List the tables (Method 1).
SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'db_name';

-- List the tables (Method 2, in mysql).
SHOW TABLES;

-- Describe a table (Method 1).
SELECT *
  FROM information_schema.columns
  WHERE table_schema = 'db_name'
    AND table_name = 'table_name';

-- Describe a table (Method 2, in mysql).
DESCRIBE table_name;

PostgreSQL

PostgreSQL metadata are available through the schema INFORMATION_SCHEMA and via the psql command-line tool (Listing 15.44).

Listing 15.44Metadata statements and commands for PostgreSQL.

-- List the databases (Method 1).
psql --list

-- List the databases (Method 2, in psql).
\l

-- List the schemas.
SELECT schema_name
  FROM information_schema.schemata;

-- List the tables (Method 1).
SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'schema_name';

-- List the tables (Method 2, in psql).
\dt

-- Describe a table (Method 1).
SELECT *
  FROM information_schema.columns
  WHERE table_schema = 'schema_name'
    AND table_name = 'table_name';

-- Describe a table (Method 2, in psql).
\d table_name;

Working with Dates

As pointed out in “Performing Datetime and Interval Arithmetic” and “Getting the Current Date and Time” in Chapter 5, DBMSs provide their own extended (nonstandard) functions for manipulating dates and times. This section explains how to use these built-in functions to do simple date arithmetic. The queries in each listing:

Microsoft Access

The function datepart() extracts the specified part of a datetime. now() returns the current (system) date and time. dateadd() adds a specified time interval to a date. datediff() returns the number of specified time intervals between two dates (Listing 15.45). Alternatives to datepart() are the extraction functions second(), day(), month(), and so on.

Listing 15.45Working with dates in Microsoft Access.

-- Extract parts of the current datetime.
SELECT
  datepart("s",   now()) AS sec_pt,
  datepart("n",   now()) AS min_pt,
  datepart("h",   now()) AS hr_pt,
  datepart("d",   now()) AS day_pt,
  datepart("m",   now()) AS mon_pt,
  datepart("yyyy",now()) AS yr_pt;

-- Add or subtract days, months, and years.
SELECT
    dateadd("d",    2,pubdate) AS p2d,
    dateadd("d",   -2,pubdate) AS m2d,
    dateadd("m",    2,pubdate) AS p2m,
    dateadd("m",   -2,pubdate) AS m2m,
    dateadd("yyyy", 2,pubdate) AS p2y,
    dateadd("yyyy",-2,pubdate) AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT datediff("d",date1,date2) AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT datediff("m",date1,date2) AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

Microsoft SQL Server

The function datepart() extracts the specified part of a datetime. getdate() returns the current (system) date and time. dateadd() adds a specified time interval to a date. datediff() returns the number of specified time intervals between two dates (Listing 15.46). Alternatives to datepart() are the extraction functions day(), month(), and year().

Listing 15.46Working with dates in Microsoft SQL Server.

-- Extract parts of the current datetime.
SELECT
  datepart("s",   getdate()) AS sec_pt,
  datepart("n",   getdate()) AS min_pt,
  datepart("hh",  getdate()) AS hr_pt,
  datepart("d",   getdate()) AS day_pt,
  datepart("m",   getdate()) AS mon_pt,
  datepart("yyyy",getdate()) AS yr_pt;

-- Add or subtract days, months, and years.
SELECT
    dateadd("d",    2,pubdate) AS p2d,
    dateadd("d",   -2,pubdate) AS m2d,
    dateadd("m",    2,pubdate) AS p2m,
    dateadd("m",   -2,pubdate) AS m2m,
    dateadd("yyyy", 2,pubdate) AS p2y,
    dateadd("yyyy",-2,pubdate) AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT datediff("d",date1,date2) AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT datediff("m",date1,date2) AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

Oracle Database

The function to_char() converts a datetime to a character value in the given format. to_number() converts its argument to a number. sysdate returns the current (system) date and time. The standard addition and subtraction operators add and subtract days from a date. add_months() adds a specified number of months to a date. Subtracting one date from another yields the number of days between them. months_between() returns the number of months between two dates (Listing 15.47).

Listing 15.47Working with dates in Oracle Database.

-- Extract parts of the current datetime.
SELECT
    to_number(to_char(sysdate,'ss'))   AS sec_pt,
    to_number(to_char(sysdate,'mi'))   AS min_pt,
    to_number(to_char(sysdate,'hh24')) AS hr_pt,
    to_number(to_char(sysdate,'dd'))   AS day_pt,
    to_number(to_char(sysdate,'mm'))   AS mon_pt,
    to_number(to_char(sysdate,'yyyy')) AS yr_pt
  FROM dual;

-- Add or subtract days, months, and years.
SELECT
    pubdate+2               AS p2d,
    pubdate-2               AS m2d,
    add_months(pubdate, +2) AS p2m,
    add_months(pubdate, -2) AS m2m,
    add_months(pubdate,+24) AS p2y,
    add_months(pubdate,-24) AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT date2 - date1 AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT months_between(date2,date1) AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

IBM Db2 Database

The functions second(), day(), month(), and so on, extract part of a datetime. current_timestamp returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. days() converts a date to an integer serial number (Listing 15.48).

Listing 15.48Working with dates in IBM Db2 Database.

-- Extract parts of the current datetime.
SELECT
    second(current_timestamp) AS sec_pt,
    minute(current_timestamp) AS min_pt,
    hour(current_timestamp)   AS hr_pt,
    day(current_timestamp)    AS day_pt,
    month(current_timestamp)  AS mon_pt,
    year(current_timestamp)   AS yr_pt
  FROM SYSIBM.SYSDUMMY1;

-- Add or subtract days, months, and years.
SELECT
    pubdate + 2 DAY   AS p2d,
    pubdate - 2 DAY   AS m2d,
    pubdate + 2 MONTH AS p2m,
    pubdate - 2 MONTH AS m2m,
    pubdate + 2 YEAR  AS p2y,
    pubdate - 2 YEAR  AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT days(date2) - days(date1) AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT
    (year(date2) * 12 + month(date2)) -
    (year(date1) * 12 + month(date1))
      AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

MySQL

The function date_format() formats a datetime according to the specified format. current_timestamp returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. datediff() returns the number of days between two dates (Listing 15.49). Alternatives to date_format() are the extraction functions extract(), second(), day(), month(), and so on.

Listing 15.49Working with dates in MySQL.

-- Extract parts of the current datetime.
SELECT
  date_format(current_timestamp,'%s') AS sec_pt,
  date_format(current_timestamp,'%i') AS min_pt,
  date_format(current_timestamp,'%k') AS hr_pt,
  date_format(current_timestamp,'%d') AS day_pt,
  date_format(current_timestamp,'%m') AS mon_pt,
  date_format(current_timestamp,'%Y') AS yr_pt;

-- Add or subtract days, months, and years.
SELECT
    pubdate + INTERVAL 2 DAY   AS p2d,
    pubdate - INTERVAL 2 DAY   AS m2d,
    pubdate + INTERVAL 2 MONTH AS p2m,
    pubdate - INTERVAL 2 MONTH AS m2m,
    pubdate + INTERVAL 2 YEAR  AS p2y,
    pubdate - INTERVAL 2 YEAR  AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT datediff(date2,date1) AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT
    (year(date2) * 12 + month(date2)) -
    (year(date1) * 12 + month(date1))
      AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

PostgreSQL

The function date_part() extracts the specified part of a datetime. current_timestamp returns the current (system) date and time. The standard addition and subtraction operators add and subtract time intervals from a date. Subtracting one date from another yields the number of days between them (Listing 15.50). An alternative to date_part() is extract().

Listing 15.50Working with dates in PostgreSQL.

-- Extract parts of the current datetime.
SELECT
  date_part('second',current_timestamp) AS sec_pt,
  date_part('minute',current_timestamp) AS min_pt,
  date_part('hour',current_timestamp)   AS hr_pt,
  date_part('day',current_timestamp)    AS day_pt,
  date_part('month',current_timestamp)  AS mon_pt,
  date_part('year',current_timestamp)   AS yr_pt;

-- Add or subtract days, months, and years.
SELECT
    pubdate + INTERVAL '2 DAY'   AS p2d,
    pubdate - INTERVAL '2 DAY'   AS m2d,
    pubdate + INTERVAL '2 MONTH' AS p2m,
    pubdate - INTERVAL '2 MONTH' AS m2m,
    pubdate + INTERVAL '2 YEAR'  AS p2y,
    pubdate - INTERVAL '2 YEAR'  AS m2y
  FROM titles
  WHERE title_id = 'T05';

-- Count the days between two dates.
SELECT date2 - date1 AS days
  FROM
    (SELECT pubdate as date1
       FROM titles
       WHERE title_id = 'T05') t1,
    (SELECT pubdate as date2
       FROM titles
       WHERE title_id = 'T06') t2;

-- Count the months between two dates.
SELECT
    (date_part('year', date2) * 12 +
     date_part('month',date2)) -
    (date_part('year', date1) * 12 +
     date_part('month',date1))
       AS months
  FROM
    (SELECT
         MIN(pubdate) AS date1,
         MAX(pubdate) AS date2
       FROM titles) t1;

Calculating a Median

The median describes the center of the data as the middle point of n (sorted) values. If n is odd, then the median is the observation number (n + 1)/2. If n is even, then the median is the midpoint (average) of observations n/2 and (n/2) + 1.

The examples in this section calculate the median of the column sales in the table empsales (Figure 15.39). The median is 550—the average of the middle two numbers, 500 and 600, in the sorted list.

Figure 15.39The table empsales, sorted by ascending sales.

emp_id sales
------ -----
E07      300
E08      400
E03      500
E04      500
E06      500
E01      600
E05      700
E10      700
E02      800
E09      900

Search the web and you’ll find many standard and DBMS-specific ways to calculate the median. Listing 15.51 shows one way—it uses a self-join and GROUP BY to create a Cartesian product (e1 and e2) without duplicates and then uses HAVING and SUM to find the row (containing the median) where the number of times e1.sales = e2.sales equals (or exceeds) the number of times e1.sales > e2.sales. Like all methods that use standard (or near-standard) SQL, it’s cumbersome, it’s hard to understand, and it runs slowly because it’s difficult to pick the middle value of an ordered set when SQL is about unordered sets.

To run Listing 15.51 in Microsoft Access, change the CASE expression to iif(e1.sales = e2.sales, 1, 0) and change SIGN to SGN.

Listing 15.51Calculate the median of sales in standard SQL.

SELECT AVG(sales) AS median
  FROM
    (SELECT e1.sales
       FROM empsales e1, empsales e2
       GROUP BY e1.sales
       HAVING
         SUM(CASE WHEN e1.sales = e2.sales
             THEN 1 ELSE 0 END) >=
             ABS(SUM(SIGN(e1.sales -
                          e2.sales)))) t1;

It’s faster and more efficient to calculate the median by using DBMS-specific functions. In Microsoft Access, use Visual Basic or C# to call Excel’s MEDIAN() function from within Access. Listing 15.52 calculates the median in Microsoft SQL Server in two ways. Listing 15.53 calculates it in Oracle in two ways. The second query in Listing 15.52 and Listing 15.53 also works in Db2.

Listing 15.52Here are two ways to calculate the median in Microsoft SQL Server. The second way, which also works in Db2, is faster than the first.

-- Method 1.
-- Works in Microsoft SQL Server.
SELECT
  (
    (SELECT MAX(sales) FROM
      (SELECT TOP 50 PERCENT sales
        FROM empsales
        ORDER BY sales ASC) AS t1)
    +
    (SELECT MIN(sales) FROM
      (SELECT TOP 50 PERCENT sales
        FROM empsales
        ORDER BY sales DESC) AS t2)
  )/2 AS median;

-- Method 2.
-- Works in Microsoft SQL Server and Db2.
SELECT AVG(sales) AS median
  FROM
    (SELECT
      sales,
      ROW_NUMBER() OVER (ORDER BY sales)
        AS rownum,
      COUNT(*) OVER () AS cnt
    FROM empsales) t1
  WHERE rownum IN ((cnt+1)/2, (cnt+2)/2);

Listing 15.53Here are two ways to calculate the median in Oracle. The second way, which also works in Db2, is faster than the first.

-- Method 1.
-- Works in Oracle.
SELECT
    percentile_cont(0.5)
      WITHIN GROUP (ORDER BY sales)
      AS median
  FROM empsales;

-- Method 2.
-- Works in Oracle and Db2.
SELECT median(sales) AS median
  FROM empsales;

If you use an alternate method to compute the median, then make sure that it doesn’t eliminate duplicate values during calculations and averages the two middle observations for an even n (rather than just lazily choosing one of them as the median).

See also “Statistics in SQL” in Chapter 6.

Median vs. Mean

The median is a popular statistic because it’s robust, meaning it’s not affected seriously by extreme high or low values, either legitimate or caused by errors. The arithmetic mean (average), on the other hand, is so sensitive that it can swing wildly with the addition or removal of even a single extreme value. That’s why you see the median applied to skewed (lopsided) distributions such as wealth, house prices, military budgets, and gene expression. The median is also known as the 50th percentile or the second quartile. See also “Finding Extreme Values” later in this chapter.

Finding Extreme Values

You can use aggregate functions in a subquery to find the highest and lowest values in a column.

Listing 15.54 finds the rows with the highest and lowest values (ties included) of the column advance in the table royalties. Figure 15.40 shows the result.

You also can use the queries in “Limiting the Number of Rows Returned” to find extremes, although not both highs and lows in the same query.

In Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL you can replicate Listing 15.54 by using the window functions MIN OVER and MAX OVER (Listing 15.55).

Listing 15.54List the books with the highest and lowest advances. See Figure 15.40 for the result.

SELECT title_id, advance
  FROM royalties
  WHERE advance IN (
    (SELECT MIN(advance) FROM royalties),
    (SELECT MAX(advance) FROM royalties));

Figure 15.40Result of Listing 15.54.

title_id advance
-------- ----------
T07      1000000.00
T08            0.00
T09            0.00

Listing 15.55List the books with the highest and lowest advances, using window functions.

SELECT title_id, advance
  FROM
    (SELECT title_id, advance,
         MIN(advance) OVER () min_adv,
         MAX(advance) OVER () max_adv
       FROM royalties) t1
  WHERE advance IN (min_adv, max_adv);

Changing Running Statistics Midstream

You can modify values of an in-progress running statistic depending on values in another column. First, review Listing 15.1 in “Calculating Running Statistics” earlier in this chapter.

Listing 15.56 calculates the running sum of book sales, ignoring biographies. The scalar subquery computes the running sum, and the inner CASE expression identifies biographies and changes their sales value to NULL, which is ignored by the aggregate function SUM(). The outer CASE expression merely creates a label column in the result; it’s not part of the running-sum logic. Figure 15.41 shows the result.

In the inner CASE expression, you can set the value being summed to any number, not only NULL. If you were summing bank transactions, for example, then you could make the deposits positive and withdrawals negative.

Listing 15.56Calculate the running sum of book sales, ignoring biographies. See Figure 15.41 for the result.

SELECT
    t1.title_id,
    CASE WHEN t1.type = 'biography'
      THEN '*IGNORED*'
      ELSE t1.type END
        AS title_type,
    t1.sales,
    (SELECT
         SUM(CASE WHEN t2.type = 'biography'
           THEN NULL
           ELSE t2.sales END)
       FROM titles t2
       WHERE t1.title_id >= t2.title_id)
         AS RunSum
  FROM titles t1;

Figure 15.41Result of Listing 15.56.

title_id title_type sales   RunSum
-------- ---------- ------- ------
T01      history        566    566
T02      history       9566  10132
T03      computer     25667  35799
T04      psychology   13001  48800
T05      psychology  201440 250240
T06      *IGNORED*    11320 250240
T07      *IGNORED*  1500200 250240
T08      children      4095 254335
T09      children      5000 259335
T10      *IGNORED*     NULL 259335
T11      psychology   94123 353458
T12      *IGNORED*   100001 353458
T13      history      10467 363925

To run Listing 15.56 in Microsoft Access, change the two CASE expressions to

iif(t1.type = 'biography', '*IGNORED*', t1.type)

and

iif(t2.type = 'biography', NULL, t2.sales).

In Microsoft SQL Server, Oracle, Db2, MySQL, and PostgreSQL, you can replicate Listing 15.56 by using the window function SUM OVER (Listing 15.57).

Listing 15.57Calculate the running sum of book sales, ignoring biographies and using window functions.

SELECT
    title_id,
    CASE WHEN type = 'biography'
      THEN '*IGNORED*'
      ELSE type END
        AS title_type,
    sales,
    SUM(CASE WHEN type = 'biography'
          THEN NULL
          ELSE sales END)
          OVER (ORDER BY title_id, sales)
        AS RunSum
  FROM titles;

Pivoting Results

Pivoting a table swaps its columns and rows, typically to display data in a compact format on a report.

Listing 15.58 uses SUM functions and CASE expressions to list the number of books each author wrote (or cowrote). But instead of displaying the result in the usual way (see Listing 6.9, for example), like this:

au_id num_books
----- ---------
A01   3
A02   4
A03   2
A04   4
A05   1
A06   3
A07   0

Listing 15.58 produces a pivoted result:

A01 A02 A03 A04 A05 A06 A07
--- --- --- --- --- --- ---
  3   4   2   4   1   3   0

Listing 15.58List the number of books each author wrote (or cowrote), pivoting the result.

SELECT
    SUM(CASE WHEN au_id='A01'
      THEN 1 ELSE 0 END) AS A01,
    SUM(CASE WHEN au_id='A02'
      THEN 1 ELSE 0 END) AS A02,
    SUM(CASE WHEN au_id='A03'
      THEN 1 ELSE 0 END) AS A03,
    SUM(CASE WHEN au_id='A04'
      THEN 1 ELSE 0 END) AS A04,
    SUM(CASE WHEN au_id='A05'
      THEN 1 ELSE 0 END) AS A05,
    SUM(CASE WHEN au_id='A06'
      THEN 1 ELSE 0 END) AS A06,
    SUM(CASE WHEN au_id='A07'
      THEN 1 ELSE 0 END) AS A07
  FROM title_authors;

Listing 15.59 reverses the pivot. The first subquery in the FROM clause returns the unique authors’ IDs. The second subquery reproduces the result of Listing 15.58.

Listing 15.59List the number of books each author wrote (or cowrote), reverse-pivoting the result.

SELECT
    au_ids.au_id,
    CASE au_ids.au_id
      WHEN 'A01' THEN num_books.A01
      WHEN 'A02' THEN num_books.A02
      WHEN 'A03' THEN num_books.A03
      WHEN 'A04' THEN num_books.A04
      WHEN 'A05' THEN num_books.A05
      WHEN 'A06' THEN num_books.A06
      WHEN 'A07' THEN num_books.A07
    END
      AS num_books
  FROM
    (SELECT au_id FROM authors) au_ids,
    (SELECT
      SUM(CASE WHEN au_id='A01'
        THEN 1 ELSE 0 END) AS A01,
      SUM(CASE WHEN au_id='A02'
        THEN 1 ELSE 0 END) AS A02,
      SUM(CASE WHEN au_id='A03'
        THEN 1 ELSE 0 END) AS A03,
      SUM(CASE WHEN au_id='A04'
        THEN 1 ELSE 0 END) AS A04,
      SUM(CASE WHEN au_id='A05'
        THEN 1 ELSE 0 END) AS A05,
      SUM(CASE WHEN au_id='A06'
        THEN 1 ELSE 0 END) AS A06,
      SUM(CASE WHEN au_id='A07'
        THEN 1 ELSE 0 END) AS A07
    FROM title_authors) num_books;

To run Listing 15.58 and Listing 15.59 in Microsoft Access, change the simple CASE expressions to iif functions. For example, change the first CASE expression in Listing 15.58 to

iif(au_id = 'A01', 1, 0)

Also, change the searched CASE expression to a switch() function (see the DBMS tip in “Evaluating Conditional Values with CASE” in Chapter 5).

Working with Hierarchies

A hierarchy ranks and organizes people or things within a system. Each element (except the top one) is a subordinate to a single other element. Figure 15.42 is a tree diagram of a corporate pecking order, with the chief executive officer (CEO) at top, above vice presidents (VP), directors (DIR), and wage slaves (WS).

Figure 15.42An organization chart showing a simple company hierarchy.

Diagram: An organization chart showing a company hierarchy with the CEO at the top and wage slaves at the bottom

Hierarchical trees come with their own vocabulary. Each element in the tree is a node. Nodes are connected by branches. Two connected nodes form a parent–child relationship (three connected nodes form a grandparent–parent–child relationship, and so on). At the top of the pyramid is the root node (CEO, in this example). Nodes without children are end nodes or leaf nodes (DIR2 and all the WSs). Branch nodes connect to leaf nodes or other branch nodes (VP1, VP2, DIR1, and DIR3—middle management).

The table hier (Figure 15.43) represents the tree in Figure 15.42. The table hier has the same structure as the table employees in “Creating a Self-Join” in Chapter 7. Review that section for the basics of using self-joins with hierarchies.

Figure 15.43The result of the query SELECT * FROM hier;. The table hier represents the organization chart in Figure 15.42.

emp_id emp_title boss_id
------ --------- -------
E01    CEO       NULL
E02    VP1       E01
E03    VP2       E01
E04    DIR1      E02
E05    DIR2      E02
E06    DIR3      E03
E07    WS1       E04
E08    WS2       E04
E09    WS3       E04
E10    WS4       E06
E11    WS5       E06

Listing 15.60 uses a self-join to list who works for whom. See Figure 15.44 for the result.

Listing 15.60List the parent–child relationships. See Figure 15.44 for the result.

SELECT
    h1.emp_title ||
    ' obeys '    ||
    h2.emp_title
      AS power_structure
  FROM hier h1, hier h2
  WHERE h1.boss_id = h2.emp_id;

Figure 15.44Result of Listing 15.60.

power_structure
---------------
VP1 obeys CEO
VP2 obeys CEO
DIR1 obeys VP1
DIR2 obeys VP1
DIR3 obeys VP2
WS1 obeys DIR1
WS2 obeys DIR1
WS3 obeys DIR1
WS4 obeys DIR3
WS5 obeys DIR3

To run Listing 15.60 in Microsoft Access and Microsoft SQL Server, change each || to +. In MySQL, use CONCAT() to concatenate strings. See “Concatenating Strings with ||” in Chapter 5.

Listing 15.61 traverses the hierarchy by using multiple self-joins to trace the chain of command from employee WS3 to the top of the tree. See Figure 15.45 for the result. Unfortunately, you must know the depth of the hierarchy before you write this query; use one of the alternatives given next, if possible.

Listing 15.61Show the full hierarchical relationship of employee WS3. See Figure 15.45 for the result.

SELECT
    h1.emp_title || ' < ' ||
    h2.emp_title || ' < ' ||
    h3.emp_title || ' < ' ||
    h4.emp_title
      AS chain_of_command
  FROM hier h1, hier h2, hier h3, hier h4
  WHERE h1.emp_title = 'WS3'
    AND h1.boss_id = h2.emp_id
    AND h2.boss_id = h3.emp_id
    AND h3.boss_id = h4.emp_id;

Figure 15.45Result of Listing 15.61.

chain_of_command
----------------------
WS3 < DIR1 < VP1 < CEO

To run Listing 15.61 in Microsoft Access and Microsoft SQL Server, change each || to +. In MySQL, use CONCAT() to concatenate strings. See “Concatenating Strings with ||” in Chapter 5.

In Microsoft SQL Server and Db2, use the (standard) recursive WITH clause to traverse a hierarchy. The following query is equivalent to Listing 15.61 (in Microsoft SQL Server, change each || to +):

WITH recurse (chain, emp_level, boss_id) AS
  (SELECT
      CAST(emp_title AS VARCHAR(50)),
      0,
      boss_id
    FROM hier
    WHERE emp_title = 'WS3'
  UNION ALL
  SELECT
      CAST(recurse.chain || ' < ' ||
        hier.emp_title AS VARCHAR(50)),
      recurse.emp_level + 1,
      hier.boss_id
    FROM hier, recurse
    WHERE recurse.boss_id = hier.emp_id
  )
SELECT chain AS chain_of_command
  FROM recurse
  WHERE emp_level = 3;

In Microsoft SQL Server and Db2, to list everyone who reports to a particular employee (VP1, in this example), either directly or indirectly (through a boss’s boss), use this query:

WITH recurse (emp_title, emp_id) AS
  (SELECT emp_title,emp_id
    FROM hier
    WHERE emp_title = 'VP1'
  UNION ALL
  SELECT hier.emp_title, hier.emp_id
    FROM hier, recurse
    WHERE recurse.emp_id = hier.boss_id
  )
SELECT emp_title AS "Works for VP1"
  FROM recurse
  WHERE emp_title <> 'VP1';

In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.61:

SELECT LTRIM(SYS_CONNECT_BY_PATH(
  emp_title, ' < '), ' < ')
    AS chain_of_command
  FROM hier
  WHERE LEVEL = 4
  START WITH emp_title = 'WS3'
  CONNECT BY PRIOR boss_id = emp_id;

In Oracle, to list everyone who reports to a particular employee (VP1, in this example), either directly or indirectly (through a boss’s boss), use this query:

SELECT emp_title AS "Works for VP1"
  FROM hier
  WHERE emp_title <> 'VP1'
  START WITH emp_title = 'VP1'
  CONNECT BY PRIOR emp_id = boss_id;

Listing 15.62 traverses the hierarchy by using multiple UNIONs and self-joins to trace the chain of command for every employee. See Figure 15.46 for the result. Unfortunately, you must know the maximum depth of the hierarchy before you write this query; use one of the alternatives given next, if possible.

Listing 15.62Show the full hierarchal relationship of every employee. See Figure 15.46 for the result.

SELECT chain AS chains_of_command
  FROM
    (SELECT emp_title as chain
      FROM hier
      WHERE boss_id IS NULL
    UNION
    SELECT
        h1.emp_title || ' > ' ||
        h2.emp_title
      FROM hier h1
      INNER JOIN hier h2
        ON (h1.emp_id = h2.boss_id)
      WHERE h1.boss_id IS NULL
    UNION
    SELECT
        h1.emp_title || ' > ' ||
        h2.emp_title || ' > ' ||
        h3.emp_title
      FROM hier h1
      INNER JOIN hier h2
        ON (h1.emp_id = h2.boss_id)
      LEFT OUTER JOIN hier h3
        ON (h2.emp_id = h3.boss_id)
      WHERE h1.emp_title = 'CEO'
    UNION
    SELECT
        h1.emp_title || ' > ' ||
        h2.emp_title || ' > ' ||
        h3.emp_title || ' > ' ||
        h4.emp_title
      FROM hier h1
      INNER JOIN hier h2
        ON (h1.emp_id = h2.boss_id)
      INNER JOIN hier h3
        ON (h2.emp_id = h3.boss_id)
      LEFT OUTER JOIN hier h4
        ON (h3.emp_id = h4.boss_id)
      WHERE h1.emp_title = 'CEO'
    ) chains
  WHERE chain IS NOT NULL
  ORDER BY chain;

Figure 15.46Result of Listing 15.62.

chains_of_command
----------------------
CEO
CEO > VP1
CEO > VP1 > DIR1
CEO > VP1 > DIR1 > WS1
CEO > VP1 > DIR1 > WS2
CEO > VP1 > DIR1 > WS3
CEO > VP1 > DIR2
CEO > VP2
CEO > VP2 > DIR3
CEO > VP2 > DIR3 > WS4
CEO > VP2 > DIR3 > WS5

Microsoft Access won’t run Listing 15.62 because of the restrictions Access puts on join expressions.

To run Listing 15.62 in Microsoft SQL Server, change each || to +.

To run Listing 15.62 in MySQL, use CONCAT() instead of || to concatenate strings.

In Microsoft SQL Server and Db2, use the (standard) recursive WITH clause to traverse a hierarchy. The following query is equivalent to Listing 15.62 (in Microsoft SQL Server, change each || to +):

WITH recurse (emp_title, emp_id) AS
  (SELECT
      CAST(emp_title AS VARCHAR(50)),
      emp_id
    FROM hier
    WHERE boss_id IS NULL
  UNION ALL
  SELECT
      CAST(recurse.emp_title || ' > ' ||
        h1.emp_title AS VARCHAR(50)),
      h1.emp_id
    FROM hier h1, recurse
    WHERE h1.boss_id = recurse.emp_id
  )
SELECT emp_title emp_tree
  FROM recurse;

In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.62:

SELECT ltrim(SYS_CONNECT_BY_PATH(
    emp_title, ' > '),' > ')
      AS chains_of_command
  FROM hier
  START WITH boss_id IS NULL
  CONNECT BY PRIOR emp_id = boss_id;

Listing 15.63 uses scalar subqueries to determine whether each node in the hierarchy is a root, branch, or leaf node. See Figure 15.47 for the result. A zero in the result denotes True; nonzero, False.

Listing 15.63Determine whether each node is a root, branch, or leaf node. See Figure 15.47 for the result.

SELECT h1.emp_title,
  (SELECT SIGN(COUNT(*))
    FROM hier h2
    WHERE h1.emp_id = h2.emp_id
      AND h2.boss_id IS NULL)
        AS root_node,
  (SELECT SIGN(COUNT(*))
    FROM hier h2
    WHERE h1.emp_id = h2.boss_id
      AND h1.boss_id IS NOT NULL)
        AS branch_node,
  (SELECT SIGN(COUNT(*))
    FROM hier h2
    WHERE 0 =
      (SELECT COUNT(*)
        FROM hier h3
        WHERE h1.emp_id = h3.boss_id))
          AS leaf_node
  FROM hier h1;

Figure 15.47Result of Listing 15.63.

emp_title root_node branch_node leaf_node
--------- --------- ----------- ---------
CEO       1         0           0
VP1       0         1           0
VP2       0         1           0
DIR1      0         1           0
DIR2      0         0           1
DIR3      0         1           0
WS1       0         0           1
WS2       0         0           1
WS3       0         0           1
WS4       0         0           1
WS5       0         0           1

To run Listing 15.63 in Microsoft Access, change each SIGN to SGN.

In Oracle, use the (nonstandard) CONNECT BY syntax to traverse a hierarchy. The following query is equivalent to Listing 15.63:

SELECT
    emp_title,
    (CASE CONNECT_BY_ROOT(emp_title)
    WHEN emp_title THEN 1
    ELSE 0 END)
      AS root_node,
    (SELECT COUNT(*)
      FROM hier h1
      WHERE h1.boss_id = hier.emp_id
        AND hier.boss_id IS NOT NULL
        AND rownum = 1)
          AS branch_node,
      CONNECT_BY_ISLEAF AS leaf_node
  FROM hier
  START WITH boss_id IS NULL
  CONNECT BY PRIOR emp_id = boss_id
  ORDER BY root_node DESC, branch_node DESC;