SQL Run

Database Programming

6. Summarizing and Grouping Data

In this chapter

Using Aggregate Functions

Creating Aggregate Expressions

Finding a Minimum with MIN()

Finding a Maximum with MAX()

Calculating a Sum with SUM()

Calculating an Average with AVG()

Counting Rows with COUNT()

Aggregating Distinct Values with DISTINCT

Grouping Rows with GROUP BY

Filtering Groups with HAVING

The preceding chapter described scalar functions, which operate on individual row values. This chapter introduces SQL’s aggregate functions, or set functions, which operate on a group of values to produce a single, summarizing value. You apply an aggregate to a set of rows, which can be:

A GROUP BY clause, which groups rows, often is used with a HAVING clause, which filters groups. No matter how many rows the input set contains, an aggregate function returns a single statistic: a sum, minimum, or average, for example.

The main difference between queries with and without aggregate functions is that nonaggregate queries process the rows one by one. Each row is processed independently and put into the result. (ORDER BY and DISTINCT make the DBMS look at all the rows, but they’re essentially postprocessing operations.) Aggregate queries do something completely different: they take a table as a whole and construct new rows from it.

Using Aggregate Functions

Table 6.1 lists SQL’s standard aggregate functions.

Table 6.1Aggregate Functions
Function Returns
MIN(expr) Minimum value in expr
MAX(expr) Maximum value in expr
SUM(expr) Sum of the values in expr
AVG(expr) Average (arithmetic mean) of the values in expr
COUNT(expr) The number of non-null values in expr
COUNT(*) The number of rows in a table or set

The important characteristics of the aggregate functions are:

Tips for Aggregate Functions

Creating Aggregate Expressions

Aggregate functions can be tricky to use. This section explains what’s legal and what’s not.

Tips for Aggregate Expressions

Finding a Minimum with MIN()

Use the aggregate function MIN() to find the minimum of a set of values.

To find the minimum of a set of values:

Listing 6.1 and Figure 6.1 show some queries that involve MIN(). The first query returns the price of the lowest-priced book. The second query returns the earliest publication date. The third query returns the number of pages in the shortest history book.

Listing 6.1Some MIN() queries. See Figure 6.1 for the result.

SELECT MIN(price) AS "Min price"
  FROM titles;

SELECT MIN(pubdate) AS "Earliest pubdate"
  FROM titles;

SELECT MIN(pages) AS "Min history pages"
  FROM titles
  WHERE type = 'history';

Figure 6.1Result of Listing 6.1.

Min price
---------
     6.95

Earliest pubdate
----------------
1998-04-01

Min history pages
-----------------
               14

Tips for MIN

Finding a Maximum with MAX()

Use the aggregate function MAX() to find the maximum of a set of values.

To find the maximum of a set of values:

Listing 6.2 and Figure 6.2 show some queries that involve MAX(). The first query returns the author’s last name that is last alphabetically. The second query returns the prices of the cheapest and most expensive books, as well as the price range. The third query returns the highest revenue (= price × sales) among the history books.

Listing 6.2Some MAX() queries. See Figure 6.2 for the result.

SELECT MAX(au_lname) AS "Max last name"
  FROM authors;

SELECT
    MIN(price) AS "Min price",
    MAX(price) AS "Max price",
    MAX(price) - MIN(price) AS "Range"
  FROM titles;

SELECT MAX(price * sales)
         AS "Max history revenue"
  FROM titles
  WHERE type = 'history';

Figure 6.2Result of Listing 6.2.

Max last name
-------------
O'Furniture

Min price Max price Range
--------- --------- -----
     6.95     39.95 33.00

Max history revenue
-------------------
          313905.33

Tips for MAX

Calculating a Sum with SUM()

Use the aggregate function SUM() to find the sum (total) of a set of values.

To calculate the sum of a set of values:

Listing 6.3 and Figure 6.3 show some queries that involve SUM(). The first query returns the total advances paid to all authors. The second query returns the total sales of books published in 2000. The third query returns the total price, sales, and revenue (= price × sales) of all books. Note a mathematical rule in action here: “The sum of the products doesn’t (necessarily) equal the product of the sums.”

Listing 6.3Some SUM() queries. See Figure 6.3 for the result.

SELECT
    SUM(advance) AS "Total advances"
  FROM royalties;

SELECT
    SUM(sales)
      AS "Total sales (2000 books)"
  FROM titles
  WHERE pubdate
    BETWEEN DATE '2000-01-01'
        AND DATE '2000-12-31';

SELECT
    SUM(price) AS "Total price",
    SUM(sales) AS "Total sales",
    SUM(price * sales) AS "Total revenue"
  FROM titles;

Figure 6.3Result of Listing 6.3.

Total advances
--------------
    1336000.00

Total sales (2000 books)
------------------------
                  231677

Total price Total sales Total revenue
----------- ----------- -------------
     220.65     1975446   41428860.77

Tips for SUM

Calculating an Average with AVG()

Use the aggregate function AVG() to find the average, or arithmetic mean, of a set of values. The arithmetic mean is the sum of a set of quantities divided by the number of quantities in the set.

To calculate the average of a set of values:

Listing 6.4 and Figure 6.4 shows some queries that involve AVG(). The first query returns the average price of all books if prices were doubled. The second query returns the average and total sales for business books; both calculations are null (not zero) because the table contains no business books. The third query uses a subquery (see Chapter 8) to list the books with above-average sales.

Listing 6.4Some AVG() queries. See Figure 6.4 for the result.

SELECT
    AVG(price * 2) AS "AVG(price * 2)"
  FROM titles;

SELECT
    AVG(sales) AS "AVG(sales)",
    SUM(sales) AS "SUM(sales)"
  FROM titles
  WHERE type = 'business';

SELECT title_id, sales
  FROM titles
  WHERE sales >
        (SELECT AVG(sales) FROM titles)
  ORDER BY sales DESC;

Figure 6.4Result of Listing 6.4.

AVG(price * 2)
--------------
     36.775000

AVG(sales) SUM(sales)
---------- ----------
NULL       NULL

title_id sales
-------- -------
T07      1500200
T05       201440

Tips for AVG

Aggregating and Nulls

Aggregate functions (except COUNT(*)) ignore nulls. If an aggregation requires that you account for nulls, then you can replace each null with a specified value by using COALESCE() (see “Checking for Nulls with COALESCE()” in Chapter 5). For example, the following query returns the average sales of biographies by including nulls (replaced by zeroes) in the calculation:

SELECT AVG(COALESCE(sales,0))
    AS AvgSales
  FROM titles
  WHERE type = 'biography';

Statistics in SQL

SQL isn’t a statistical programming language, but you can use built-in functions and a few tricks to calculate simple descriptive statistics such as the sum, mean, and standard deviation. For more-sophisticated analyses you should use your DBMS’s statistics, data science, or machine learning components or export your data to a dedicated statistical environment such as Excel, R, Python, SAS, or SPSS.

What you should not do is write statistical routines yourself in SQL or a host language. Implementing statistical algorithms correctly—even simple ones—means understanding trade-offs in efficiency (the space needed for arithmetic operations), stability (cancellation of significant digits), and accuracy (handling pathologic sets of values). See, for example, Ronald Thisted’s Elements of Statistical Computing or John Monahan’s Numerical Methods of Statistics.

You can get away with using small combinations of built-in SQL functions, such as STDEV()/(SQRT(COUNT()) for the standard error of the mean, but don’t use complex SQL expressions for correlations, regression, ANOVA (analysis of variance), or matrix arithmetic, for example. Check your DBMS’s documentation to see which functions and packages it offers. Built-in functions aren’t portable, but they run far faster and more accurately than equivalent query expressions.

The functions MIN() and MAX() calculate order statistics, which are values derived from a dataset that’s been sorted (ordered) by size. Well-known order statistics include the trimmed mean, rank, range, mode, and median. ChapterĀ 15 covers the trimmed mean, rank, and median. The range is the difference between the largest and smallest values: MAX(expr) – MIN(expr). The mode is the value that appears most frequently. A dataset can have more than one mode. The mode is a weak descriptive statistic because it’s not robust, meaning that it can be affected by adding a small number or unusual or incorrect values to the dataset. This query finds the mode of book prices in the sample database:

SELECT price, COUNT(*) AS frequency
  FROM titles
  GROUP BY price
  HAVING COUNT(*) >=
    ALL(SELECT COUNT(*)
          FROM titles
          GROUP BY price);

price has two modes:

price   frequency
-----   ---------
12.99   2
19.95   2

Counting Rows with COUNT()

Use the aggregate function COUNT() to count the number of rows in a set of values. COUNT() has two forms:

To count non-null rows:

To count all rows, including nulls:

Listing 6.5 and Figure 6.5 show some queries that involve COUNT(expr) and COUNT(*). The three queries count rows in the table titles and are identical except for the WHERE clause. The row counts in the first query differ because the column price contains a null. In the second query, the row counts are identical because the WHERE clause eliminates the row with the null price before the count. The third query shows the row-count differences between the results of the first two queries.

Listing 6.5Some COUNT() queries. See Figure 6.5 for the result.

SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS "COUNT(*)"
  FROM titles;

SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS "COUNT(*)"
  FROM titles
  WHERE price IS NOT NULL;

SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS "COUNT(*)"
  FROM titles
  WHERE price IS NULL;

Figure 6.5Result of Listing 6.5.

COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
             13           12       13

COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
             12           12       12

COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
              1            0        1

Tips for COUNT

Aggregating Distinct Values with DISTINCT

You can use DISTINCT to eliminate duplicate values in aggregate function calculations; see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4. The general syntax of an aggregate function is:

agg_func([ALL | DISTINCT] expr)

agg_func is MIN, MAX, SUM, AVG, or COUNT. expr is a column name, literal, or expression. ALL applies the aggregate function to all values, and DISTINCT specifies that each unique value is considered. ALL is the default and is usually omitted in practice.

With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isn’t meaningful with MIN() and MAX(); you can use it, but it won’t change the result. You can’t use DISTINCT with COUNT(*).

To calculate the sum of a set of distinct values:

To calculate the average of a set of distinct values:

To count distinct non-null rows:

The queries in Listing 6.6 return the count, sum, and average of book prices. The non-DISTINCT and DISTINCT results in Figure 6.6 differ because the DISTINCT results eliminate the duplicates of prices $12.99 and $19.95 from calculations.

Listing 6.6Some DISTINCT aggregate queries. See Figure 6.6 for the result.

SELECT
    COUNT(*) AS "COUNT(*)"
  FROM titles;

SELECT
    COUNT(price) AS "COUNT(price)",
    SUM(price)   AS "SUM(price)",
    AVG(price)   AS "AVG(price)"
  FROM titles;

SELECT
    COUNT(DISTINCT price)
      AS "COUNT(DISTINCT)",
    SUM(DISTINCT price)
      AS "SUM(DISTINCT)",
    AVG(DISTINCT price)
      AS "AVG(DISTINCT)"
  FROM titles;

Figure 6.6Result of Listing 6.6.

COUNT(*)
--------
      13

COUNT(price) SUM(price) AVG(price)
------------ ---------- ----------
          12     220.65    18.3875

COUNT(DISTINCT) SUM(DISTINCT) AVG(DISTINCT)
--------------- ------------- -------------
             10        187.71       18.7710

DISTINCT in a SELECT clause (Chapter 4) and DISTINCT in an aggregate function don’t return the same result.

The three queries in Listing 6.7 count the author IDs in the table title_authors. Figure 6.7 shows the results. The first query counts all the author IDs in the table. The second query returns the same result as the first query because COUNT() already has done its work and returned a value in a single row before DISTINCT is applied. In the third query, DISTINCT is applied to the author IDs before COUNT() starts counting.

Listing 6.7DISTINCT in a SELECT clause and DISTINCT in an aggregate function differ in meaning. See Figure 6.7 for the result.

SELECT
    COUNT(au_id)
      AS "COUNT(au_id)"
  FROM title_authors;

SELECT
    DISTINCT COUNT(au_id)
      AS "DISTINCT COUNT(au_id)"
  FROM title_authors;

SELECT
    COUNT(DISTINCT au_id)
      AS "COUNT(DISTINCT au_id)"
  FROM title_authors;

Figure 6.7Result of Listing 6.7.

COUNT(au_id)
------------
          17

DISTINCT COUNT(au_id)
---------------------
                   17

COUNT(DISTINCT au_id)
---------------------
                   6

Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results.

The four queries in Listing 6.8 show the four combinations of non-DISTINCT and DISTINCT sums and counts. Of the four results in Figure 6.8, only the first result (no DISTINCTs) and final result (all DISTINCTs) are consistent mathematically, which you can verify with AVG(price) and AVG(DISTINCT price). In the second and third queries (mixed non-DISTINCTs and DISTINCTs), you can’t calculate a valid average by dividing the sum by the count.

Listing 6.8Combining non-DISTINCT and DISTINCT aggregates gives inconsistent results. See Figure 6.8 for the result.

SELECT
    COUNT(price)
      AS "COUNT(price)",
    SUM(price)
      AS "SUM(price)"
  FROM titles;

SELECT
    COUNT(price)
      AS "COUNT(price)",
    SUM(DISTINCT price)
      AS "SUM(DISTINCT price)"
  FROM titles;

SELECT
    COUNT(DISTINCT price)
      AS "COUNT(DISTINCT price)",
    SUM(price)
      AS "SUM(price)"
  FROM titles;

SELECT
    COUNT(DISTINCT price)
      AS "COUNT(DISTINCT price)",
    SUM(DISTINCT price)
      AS "SUM(DISTINCT price)"
  FROM titles;

Figure 6.8Result of Listing 6.8. The differences in the counts and sums indicate duplicate prices. Averages (sum/count) obtained from the second (187.71/12) or third query (220.65/10) are incorrect. The first (220.65/12) and fourth (187.71/10) queries produce consistent averages.

COUNT(price) SUM(price)
------------ ----------
          12     220.65

COUNT(price) SUM(DISTINCT price)
------------ -------------------
          12              187.71

COUNT(DISTINCT price) SUM(price)
--------------------- ----------
                   10     220.65

COUNT(DISTINCT price) SUM(DISTINCT price)
--------------------- -------------------
                   10              187.71

Tips for DISTINCT

Grouping Rows with GROUP BY

To this point, I’ve used aggregate functions to summarize all the values in a column or just those values that matched a WHERE search condition. You can use the GROUP BY clause to divide a table into logical groups (categories) and calculate aggregate statistics for each group.

An example will clarify the concept. Listing 6.9 uses GROUP BY to count the number of books that each author wrote (or cowrote). In the SELECT clause, the column au_id identifies each author, and the derived column num_books counts each author’s books. The GROUP BY clause causes num_books to be calculated for every unique au_id instead of only once for the entire table. Figure 6.9 shows the result. In this example, au_id is called the grouping column.

Listing 6.9List the number of books each author wrote (or cowrote). See Figure 6.9 for the result.

SELECT
    au_id,
    COUNT(*) AS "num_books"
  FROM title_authors
  GROUP BY au_id;

Figure 6.9Result of Listing 6.9.

au_id num_books
----- ---------
A01           3
A02           4
A03           2
A04           4
A05           1
A06           3

The GROUP BY clause’s important characteristics are:

To group rows:

Listing 6.10 and Figure 6.10 show the difference between COUNT(expr) and COUNT(*) in a query that contains GROUP BY. The table publishers contains one null in the column state (for publisher P03 in Germany). Recall from “Counting Rows with COUNT()” earlier in this chapter that COUNT(expr) counts non-null values and COUNT(*) counts all values, including nulls. In the result, GROUP BY recognizes the null and creates a null group for it. COUNT(*) finds (and counts) the one null in the column state. But COUNT(state) contains a zero for the null group because COUNT(state) finds only a null in the null group, which it excludes from the count—that’s why you have the zero.

Listing 6.10This query illustrates the difference between COUNT(expr) and COUNT(*) in a GROUP BY query. See Figure 6.10 for the result.

SELECT
    state,
    COUNT(state) AS "COUNT(state)",
    COUNT(*)     AS "COUNT(*)"
  FROM publishers
  GROUP BY state;

Figure 6.10Result of Listing 6.10.

state COUNT(state) COUNT(*)
----- ------------ --------
NULL             0        1
CA               2        2
NY               1        1

If a nonaggregate column contains nulls, then using COUNT(*) rather than COUNT(expr) can produce misleading results. Listing 6.11 and Figure 6.11 show summary sales statistics for each type of book. The sales value for one of the biographies is null, so COUNT(sales) and COUNT(*) differ by 1. The average calculation in the fifth column, SUM/COUNT(sales), is consistent mathematically, whereas the sixth-column average, SUM/COUNT(*), is not. I’ve verified the inconsistency with AVG(sales) in the final column. (Recall a similar situation in Listing 6.8 in “Aggregating Distinct Values with DISTINCT” earlier in this chapter.)

Listing 6.11For mathematically consistent results, use COUNT(expr), rather than COUNT(*), if expr contains nulls. See Figure 6.11 for the result.

SELECT
    type,
    SUM(sales)   AS "SUM(sales)",
    COUNT(sales) AS "COUNT(sales)",
    COUNT(*)     AS "COUNT(*)",
    SUM(sales)/COUNT(sales)
      AS "SUM/COUNT(sales)",
    SUM(sales)/COUNT(*)
      AS "SUM/COUNT(*)",
    AVG(sales)   AS "AVG(sales)"
  FROM titles
  GROUP BY type;

Figure 6.11Result of Listing 6.11.

type       SUM(sales) COUNT(sales) COUNT(*) SUM/COUNT(sales) SUM/COUNT(*) AVG(sales)
---------- ---------- ------------ -------- ---------------- ------------ ----------
biography     1611521            3        4        537173.67    402880.25  537173.67
children         9095            2        2          4547.50      4547.50    4547.50
computer        25667            1        1         25667.00     25667.00   25667.00
history         20599            3        3          6866.33      6866.33    6866.33
psychology     308564            3        3        102854.67    102854.67  102854.67

Listing 6.12 and Figure 6.12 show a simple GROUP BY query that calculates the total sales, average sales, and number of titles for each type of book. In Listing 6.13 and Figure 6.13, I’ve added a WHERE clause to eliminate books priced less than $13 before grouping. I’ve also added an ORDER BY clause to sort the result by descending total sales of each book type.

Listing 6.12This simple GROUP BY query calculates a few summary statistics for each type of book. See Figure 6.12 for the result.

SELECT
    type,
    SUM(sales)   AS "SUM(sales)",
    AVG(sales)   AS "AVG(sales)",
    COUNT(sales) AS "COUNT(sales)"
  FROM titles
  GROUP BY type;

Figure 6.12Result of Listing 6.12.

TYPE       SUM(sales) AVG(sales) COUNT(sales)
---------- ---------- ---------- ------------
biography     1611521  537173.67            3
children         9095    4547.50            2
computer        25667   25667.00            1
history         20599    6866.33            3
psychology     308564  102854.67            3

Listing 6.13Here, I’ve added WHERE and ORDER BY clauses to Listing 6.12 to cull books priced less than $13 and sort the result by descending total sales. See Figure 6.13 for the result.

SELECT
    type,
    SUM(sales)   AS "SUM(sales)",
    AVG(sales)   AS "AVG(sales)",
    COUNT(sales) AS "COUNT(sales)"
  FROM titles
  WHERE price >= 13
  GROUP BY type
  ORDER BY "SUM(sales)" DESC;

Figure 6.13Result of Listing 6.13.

type       SUM(sales) AVG(sales) COUNT(sales)
---------- ---------- ---------- ------------
biography     1511520  755760.00            2
computer        25667   25667.00            1
history         20599    6866.33            3
children         5000    5000.00            1

Listing 6.14 and Figure 6.14 use multiple grouping columns to count the number of titles of each type that each publisher publishes.

Listing 6.14List the number of books of each type for each publisher, sorted by descending count within ascending publisher ID. See Figure 6.14 for the result.

SELECT
    pub_id,
    type,
    COUNT(*) AS "COUNT(*)"
  FROM titles
  GROUP BY pub_id, type
  ORDER BY pub_id ASC, "COUNT(*)" DESC;

Figure 6.14Result of Listing 6.14.

pub_id type       COUNT(*)
------ ---------- --------
P01    biography         3
P01    history           1
P02    computer          1
P03    history           2
P03    biography         1
P04    psychology        3
P04    children          2

In Listing 6.15 and Figure 6.15, I revisit Listing 5.31 in “Evaluating Conditional Values with CASE” in Chapter 5. But instead of listing each book categorized by its sales range, I use GROUP BY to list the number of books in each sales range.

Listing 6.15List the number of books in each calculated sales range, sorted by ascending sales. See Figure 6.15 for the result.

SELECT
    CASE
      WHEN sales IS NULL
        THEN 'Unknown'
      WHEN sales <= 1000
        THEN 'Not more than 1,000'
      WHEN sales <= 10000
        THEN 'Between 1,001 and 10,000'
      WHEN sales <= 100000
        THEN 'Between 10,001 and 100,000'
      WHEN sales <= 1000000
        THEN 'Between 100,001 and 1,000,000'
      ELSE 'Over 1,000,000'
    END
      AS "Sales category",
    COUNT(*) AS "Num titles"
  FROM titles
  GROUP BY
    CASE
      WHEN sales IS NULL
        THEN 'Unknown'
      WHEN sales <= 1000
        THEN 'Not more than 1,000'
      WHEN sales <= 10000
        THEN 'Between 1,001 and 10,000'
      WHEN sales <= 100000
        THEN 'Between 10,001 and 100,000'
      WHEN sales <= 1000000
        THEN 'Between 100,001 and 1,000,000'
      ELSE 'Over 1,000,000'
    END
  ORDER BY MIN(sales) ASC;

Figure 6.15Result of Listing 6.15.

Sales category                Num titles
----------------------------- ----------
Unknown                                1
Not more than 1,000                    1
Between 1,001 and 10,000               3
Between 10,001 and 100,000             5
Between 100,001 and 1,000,000          2
Over 1,000,000                         1

When used without an aggregate function, GROUP BY acts like DISTINCT (Listing 6.16 and Figure 6.16). For information about DISTINCT, see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.

Listing 6.16Both of these queries return the same result. The bottom form is preferred. See Figure 6.16 for the result.

SELECT type
  FROM titles
  GROUP BY type;

SELECT DISTINCT type
  FROM titles;

Figure 6.16Either statement in Listing 6.16 returns this result.

type
----------
biography
children
computer
history
psychology

You can use GROUP BY to look for patterns in your data. In Listing 6.17 and Figure 6.17, I’m looking for a relationship between price categories and average sales.

Listing 6.17List the average sales for each price, sorted by ascending price. See Figure 6.17 for the result.

SELECT price, AVG(sales) AS "AVG(sales)"
  FROM titles
  WHERE price IS NOT NULL
  GROUP BY price
  ORDER BY price ASC;

Figure 6.17Result of Listing 6.17. Ignoring the statistical outlier at $23.95, a weak inverse relationship between price and sales is apparent.

price   AVG(sales)
------- ----------
   6.95   201440.0
   7.99    94123.0
  10.00     4095.0
  12.99    56501.0
  13.95     5000.0
  19.95    10443.0
  21.99      566.0
  23.95  1500200.0
  29.99    10467.0
  39.95    25667.0

Tips for GROUP BY

Categorizing Numeric Values

You can use the function FLOOR(x) to categorize numeric values. FLOOR(x) returns the greatest integer that is lower than x. This query groups books in $10 price intervals:

SELECT
    FLOOR(price/10)*10 AS "Category",
    COUNT(*) AS "Count"
  FROM titles
  GROUP BY FLOOR(price/10)*10;

The result is:

Category Count
-------- -----
0        2
10       6
20       3
30       1
NULL     1

Category 0 counts prices between $0.00 and $9.99; category 10 counts prices between $10.00 and $19.99; and so on. (The analogous function CEILING(x) returns the smallest integer that is higher than x.)

Filtering Groups with HAVING

The HAVING clause sets conditions on the GROUP BY clause similar to the way that WHERE interacts with SELECT. The HAVING clause’s important characteristics are:

The sequence in which the WHERE, GROUP BY, and HAVING clauses are applied is:

  1. The WHERE clause filters the rows that result from the operations specified in the FROM and JOIN clauses.
  2. The GROUP BY clause groups the output of the WHERE clause.
  3. The HAVING clause filters rows from the grouped result.

To filter groups:

In Listing 6.18 and Figure 6.18, I revisit Listing 6.9 earlier in this chapter, but instead of listing the number of books that each author wrote (or cowrote), I use HAVING to list only the authors who have written three or more books.

Listing 6.18List the number of books written (or cowritten) by each author who has written three or more books. See Figure 6.18 for the result.

SELECT
    au_id,
    COUNT(*) AS "num_books"
  FROM title_authors
  GROUP BY au_id
  HAVING COUNT(*) >= 3;

Figure 6.18Result of Listing 6.18.

au_id num_books
----- ---------
A01           3
A02           4
A04           4
A06           3

In Listing 6.19 and Figure 6.19, the HAVING condition also is an aggregate expression in the SELECT clause. This query still works if you remove the AVG() expression from the SELECT list (Listing 6.20 and Figure 6.20).

Listing 6.19List the number of titles and average revenue for the types with average revenue more than $1 million. See Figure 6.19 for the result.

SELECT
    type,
    COUNT(price) AS "COUNT(price)",
    AVG(price * sales) AS "AVG revenue"
  FROM titles
  GROUP BY type
  HAVING AVG(price * sales) > 1000000;

Figure 6.19Result of Listing 6.19.

type       COUNT(price) AVG revenue
---------- ------------ -----------
biography             3 12484878.00
computer              1  1025396.65

Listing 6.20Listing 6.19 still works without AVG(price * sales) in the SELECT list. See Figure 6.20 for the result.

SELECT
    type,
    COUNT(price) AS "COUNT(price)"
  FROM titles
  GROUP BY type
  HAVING AVG(price * sales) > 1000000;

Figure 6.20Result of Listing 6.20.

type       COUNT(price)
---------- ------------
biography             3
computer              1

In Listing 6.21 and Figure 6.21, multiple grouping columns count the number of titles of each type that each publisher publishes. The HAVING condition removes groups in which the publisher has one or fewer titles of a particular type. This query retrieves a subset of the result of Listing 6.14 earlier in this chapter.

Listing 6.21List the number of books of each type for each publisher, for publishers with more than one title of a type. See Figure 6.21 for the result.

SELECT
    pub_id,
    type,
    COUNT(*) AS "COUNT(*)"
  FROM titles
  GROUP BY pub_id, type
  HAVING COUNT(*) > 1
  ORDER BY pub_id ASC, "COUNT(*)" DESC;

Figure 6.21Result of Listing 6.21.

pub_id type       COUNT(*)
------ ---------- --------
P01    biography         3
P03    history           2
P04    psychology        3
P04    children          2

In Listing 6.22 and Figure 6.22, the WHERE clause first removes all rows except for books from only publishers P03 and P04. Then the GROUP BY clause groups the output of the WHERE clause by type. Finally, the HAVING clause filters rows from the grouped result.

Listing 6.22For books from only publishers P03 and P04, list the total sales and average price by type, for types with more than $10000 total sales and less than $20 average price. See Figure 6.22 for the result.

SELECT
    type,
    SUM(sales) AS "SUM(sales)",
    AVG(price) AS "AVG(price)"
  FROM titles
  WHERE pub_id IN ('P03', 'P04')
  GROUP BY type
  HAVING SUM(sales) > 10000
     AND AVG(price) < 20;

Figure 6.22Result of Listing 6.22.

type       SUM(sales) AVG(price)
---------- ---------- ----------
psychology     308564       9.31

Tips for HAVING