In this chapter
Creating Aggregate Expressions
Calculating an Average with AVG()
The preceding chapter described scalar functions, which operate on individual row values. This chapter introduces SQL’s aggregate functions, or set functions, which operate on a group of values to produce a single, summarizing value. You apply an aggregate to a set of rows, which can be:
A GROUP BY clause, which groups rows, often is used with a HAVING clause, which filters groups. No matter how many rows the input set contains, an aggregate function returns a single statistic: a sum, minimum, or average, for example.
The main difference between queries with and without aggregate functions is that nonaggregate queries process the rows one by one. Each row is processed independently and put into the result. (ORDER BY and DISTINCT make the DBMS look at all the rows, but they’re essentially postprocessing operations.) Aggregate queries do something completely different: they take a table as a whole and construct new rows from it.
Table 6.1 lists SQL’s standard aggregate functions.
Function | Returns |
---|---|
MIN(expr) | Minimum value in expr |
MAX(expr) | Maximum value in expr |
SUM(expr) | Sum of the values in expr |
AVG(expr) | Average (arithmetic mean) of the values in expr |
COUNT(expr) | The number of non-null values in expr |
COUNT(*) | The number of rows in a table or set |
The important characteristics of the aggregate functions are:
Aggregate functions can be tricky to use. This section explains what’s legal and what’s not.
An aggregate expression can’t appear in a WHERE clause. If you want to find the title of the book with the highest sales, then you can’t use:
SELECT title_id
FROM titles
WHERE sales = MAX(sales); --Illegal
You can’t mix nonaggregate (row-by-row) and aggregate expressions in a SELECT clause. A SELECT clause must contain either all nonaggregate expressions or all aggregate expressions. If you want to find the title of the book with the highest sales, then you can’t use:
SELECT title_id, MAX(sales)
FROM titles; --Illegal
The one exception to this rule is that you can mix nonaggregate and aggregate expressions for grouping columns (see “Grouping Rows with GROUP BY” later in this chapter):
SELECT type, SUM(sales)
FROM titles
GROUP BY type; --Legal
You can use more than one aggregate expression in a SELECT clause:
SELECT MIN(sales), MAX(sales)
FROM titles; --Legal
You can’t nest aggregate functions:
SELECT SUM(AVG(sales))
FROM titles; --Illegal
You can use aggregate expressions in subqueries (see Chapter 8). This statement finds the title of the book with the highest sales:
SELECT title_id, price --Legal
FROM titles
WHERE sales = (SELECT MAX(sales) FROM titles);
You can’t use subqueries in aggregate expressions:
AVG(SELECT price FROM titles) --Illegal
Oracle lets you nest aggregate expressions in GROUP BY queries. The following example calculates the average of the maximum sales of all book types. Oracle evaluates the inner aggregate MAX(sales) for the grouping column type and then aggregates the results again:
SELECT AVG(MAX(sales))
FROM titles
GROUP BY type; --Legal in Oracle
To replicate this query in standard SQL, use a subquery in the FROM clause:
SELECT AVG(s.max_sales)
FROM (SELECT MAX(sales) AS max_sales
FROM titles
GROUP BY type) s;
Use the aggregate function MIN() to find the minimum of a set of values.
To find the minimum of a set of values:
Type:
MIN(expr)
expr is a column name, literal, or expression. The result has the same data type as expr.
Listing 6.1 and Figure 6.1 show some queries that involve MIN(). The first query returns the price of the lowest-priced book. The second query returns the earliest publication date. The third query returns the number of pages in the shortest history book.
SELECT MIN(price) AS "Min price"
FROM titles;
SELECT MIN(pubdate) AS "Earliest pubdate"
FROM titles;
SELECT MIN(pages) AS "Min history pages"
FROM titles
WHERE type = 'history';
Min price
---------
6.95
Earliest pubdate
----------------
1998-04-01
Min history pages
-----------------
14
String comparisons are case insensitive or case sensitive, depending on your DBMS; see the DBMS tip in “Tips for WHERE” in Chapter 4.
When comparing two VARCHAR strings for equality, your DBMS might right-pad the shorter string with spaces and compare the strings position by position. In this case, the strings 'John' and 'John ' are equal. Refer to your DBMS documentation (or experiment) to determine which string MIN() returns.
Use the aggregate function MAX() to find the maximum of a set of values.
To find the maximum of a set of values:
Type:
MAX(expr)
expr is a column name, literal, or expression. The result has the same data type as expr.
Listing 6.2 and Figure 6.2 show some queries that involve MAX(). The first query returns the author’s last name that is last alphabetically. The second query returns the prices of the cheapest and most expensive books, as well as the price range. The third query returns the highest revenue (= price × sales) among the history books.
SELECT MAX(au_lname) AS "Max last name"
FROM authors;
SELECT
MIN(price) AS "Min price",
MAX(price) AS "Max price",
MAX(price) - MIN(price) AS "Range"
FROM titles;
SELECT MAX(price * sales)
AS "Max history revenue"
FROM titles
WHERE type = 'history';
Max last name
-------------
O'Furniture
Min price Max price Range
--------- --------- -----
6.95 39.95 33.00
Max history revenue
-------------------
313905.33
String comparisons are case insensitive or case sensitive, depending on your DBMS; see the DBMS tip in “Tips for WHERE” in Chapter 4.
When comparing two VARCHAR strings for equality, your DBMS might right-pad the shorter string with spaces and compare the strings position by position. In this case, the strings 'John' and 'John ' are equal. Refer to your DBMS documentation (or experiment) to determine which string MAX() returns.
Use the aggregate function SUM() to find the sum (total) of a set of values.
To calculate the sum of a set of values:
Type:
SUM(expr)
expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.
Listing 6.3 and Figure 6.3 show some queries that involve SUM(). The first query returns the total advances paid to all authors. The second query returns the total sales of books published in 2000. The third query returns the total price, sales, and revenue (= price × sales) of all books. Note a mathematical rule in action here: “The sum of the products doesn’t (necessarily) equal the product of the sums.”
SELECT
SUM(advance) AS "Total advances"
FROM royalties;
SELECT
SUM(sales)
AS "Total sales (2000 books)"
FROM titles
WHERE pubdate
BETWEEN DATE '2000-01-01'
AND DATE '2000-12-31';
SELECT
SUM(price) AS "Total price",
SUM(sales) AS "Total sales",
SUM(price * sales) AS "Total revenue"
FROM titles;
Total advances
--------------
1336000.00
Total sales (2000 books)
------------------------
231677
Total price Total sales Total revenue
----------- ----------- -------------
220.65 1975446 41428860.77
In Microsoft Access date literals, omit the DATE keyword and surround the literal with # characters instead of quotes. To run Listing 6.3, change the date literals in the second query to #2000-01-01#
and #2000-12-31#
.
In Microsoft SQL Server and Db2 date literals, omit the DATE keyword. To run Listing 6.3, change the date literals to '2000-01-01'
and '2000-12-31'
.
Use the aggregate function AVG() to find the average, or arithmetic mean, of a set of values. The arithmetic mean is the sum of a set of quantities divided by the number of quantities in the set.
To calculate the average of a set of values:
Type:
AVG(expr)
expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.
Listing 6.4 and Figure 6.4 shows some queries that involve AVG(). The first query returns the average price of all books if prices were doubled. The second query returns the average and total sales for business books; both calculations are null (not zero) because the table contains no business books. The third query uses a subquery (see Chapter 8) to list the books with above-average sales.
SELECT
AVG(price * 2) AS "AVG(price * 2)"
FROM titles;
SELECT
AVG(sales) AS "AVG(sales)",
SUM(sales) AS "SUM(sales)"
FROM titles
WHERE type = 'business';
SELECT title_id, sales
FROM titles
WHERE sales >
(SELECT AVG(sales) FROM titles)
ORDER BY sales DESC;
AVG(price * 2)
--------------
36.775000
AVG(sales) SUM(sales)
---------- ----------
NULL NULL
title_id sales
-------- -------
T07 1500200
T05 201440
Aggregating and Nulls
Aggregate functions (except COUNT(*)) ignore nulls. If an aggregation requires that you account for nulls, then you can replace each null with a specified value by using COALESCE() (see “Checking for Nulls with COALESCE()” in Chapter 5). For example, the following query returns the average sales of biographies by including nulls (replaced by zeroes) in the calculation:
SELECT AVG(COALESCE(sales,0))
AS AvgSales
FROM titles
WHERE type = 'biography';
Statistics in SQL
SQL isn’t a statistical programming language, but you can use built-in functions and a few tricks to calculate simple descriptive statistics such as the sum, mean, and standard deviation. For more-sophisticated analyses you should use your DBMS’s statistics, data science, or machine learning components or export your data to a dedicated statistical environment such as Excel, R, Python, SAS, or SPSS.
What you should not do is write statistical routines yourself in SQL or a host language. Implementing statistical algorithms correctly—even simple ones—means understanding trade-offs in efficiency (the space needed for arithmetic operations), stability (cancellation of significant digits), and accuracy (handling pathologic sets of values). See, for example, Ronald Thisted’s Elements of Statistical Computing or John Monahan’s Numerical Methods of Statistics.
You can get away with using small combinations of built-in SQL functions, such as STDEV()/(SQRT(COUNT()) for the standard error of the mean, but don’t use complex SQL expressions for correlations, regression, ANOVA (analysis of variance), or matrix arithmetic, for example. Check your DBMS’s documentation to see which functions and packages it offers. Built-in functions aren’t portable, but they run far faster and more accurately than equivalent query expressions.
The functions MIN() and MAX() calculate order statistics, which are values derived from a dataset that’s been sorted (ordered) by size. Well-known order statistics include the trimmed mean, rank, range, mode, and median. ChapterĀ 15 covers the trimmed mean, rank, and median. The range is the difference between the largest and smallest values: MAX(expr) – MIN(expr). The mode is the value that appears most frequently. A dataset can have more than one mode. The mode is a weak descriptive statistic because it’s not robust, meaning that it can be affected by adding a small number or unusual or incorrect values to the dataset. This query finds the mode of book prices in the sample database:
SELECT price, COUNT(*) AS frequency
FROM titles
GROUP BY price
HAVING COUNT(*) >=
ALL(SELECT COUNT(*)
FROM titles
GROUP BY price);
price has two modes:
price frequency
----- ---------
12.99 2
19.95 2
Use the aggregate function COUNT() to count the number of rows in a set of values. COUNT() has two forms:
To count non-null rows:
Type:
COUNT(expr)
expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.
To count all rows, including nulls:
Type:
COUNT(*)
The result is an integer greater than or equal to zero.
Listing 6.5 and Figure 6.5 show some queries that involve COUNT(expr) and COUNT(*). The three queries count rows in the table titles and are identical except for the WHERE clause. The row counts in the first query differ because the column price contains a null. In the second query, the row counts are identical because the WHERE clause eliminates the row with the null price before the count. The third query shows the row-count differences between the results of the first two queries.
SELECT
COUNT(title_id) AS "COUNT(title_id)",
COUNT(price) AS "COUNT(price)",
COUNT(*) AS "COUNT(*)"
FROM titles;
SELECT
COUNT(title_id) AS "COUNT(title_id)",
COUNT(price) AS "COUNT(price)",
COUNT(*) AS "COUNT(*)"
FROM titles
WHERE price IS NOT NULL;
SELECT
COUNT(title_id) AS "COUNT(title_id)",
COUNT(price) AS "COUNT(price)",
COUNT(*) AS "COUNT(*)"
FROM titles
WHERE price IS NULL;
COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
13 12 13
COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
12 12 12
COUNT(title_id) COUNT(price) COUNT(*)
--------------- ------------ --------
1 0 1
You can use DISTINCT to eliminate duplicate values in aggregate function calculations; see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4. The general syntax of an aggregate function is:
agg_func([ALL | DISTINCT] expr)
agg_func is MIN, MAX, SUM, AVG, or COUNT. expr is a column name, literal, or expression. ALL applies the aggregate function to all values, and DISTINCT specifies that each unique value is considered. ALL is the default and is usually omitted in practice.
With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isn’t meaningful with MIN() and MAX(); you can use it, but it won’t change the result. You can’t use DISTINCT with COUNT(*).
To calculate the sum of a set of distinct values:
Type:
SUM(DISTINCT expr)
expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.
To calculate the average of a set of distinct values:
Type:
AVG(DISTINCT expr)
expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.
To count distinct non-null rows:
Type:
COUNT(DISTINCT expr)
expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.
The queries in Listing 6.6 return the count, sum, and average of book prices. The non-DISTINCT and DISTINCT results in Figure 6.6 differ because the DISTINCT results eliminate the duplicates of prices $12.99 and $19.95 from calculations.
SELECT
COUNT(*) AS "COUNT(*)"
FROM titles;
SELECT
COUNT(price) AS "COUNT(price)",
SUM(price) AS "SUM(price)",
AVG(price) AS "AVG(price)"
FROM titles;
SELECT
COUNT(DISTINCT price)
AS "COUNT(DISTINCT)",
SUM(DISTINCT price)
AS "SUM(DISTINCT)",
AVG(DISTINCT price)
AS "AVG(DISTINCT)"
FROM titles;
COUNT(*)
--------
13
COUNT(price) SUM(price) AVG(price)
------------ ---------- ----------
12 220.65 18.3875
COUNT(DISTINCT) SUM(DISTINCT) AVG(DISTINCT)
--------------- ------------- -------------
10 187.71 18.7710
DISTINCT in a SELECT clause (Chapter 4) and DISTINCT in an aggregate function don’t return the same result.
The three queries in Listing 6.7 count the author IDs in the table title_authors. Figure 6.7 shows the results. The first query counts all the author IDs in the table. The second query returns the same result as the first query because COUNT() already has done its work and returned a value in a single row before DISTINCT is applied. In the third query, DISTINCT is applied to the author IDs before COUNT() starts counting.
SELECT
COUNT(au_id)
AS "COUNT(au_id)"
FROM title_authors;
SELECT
DISTINCT COUNT(au_id)
AS "DISTINCT COUNT(au_id)"
FROM title_authors;
SELECT
COUNT(DISTINCT au_id)
AS "COUNT(DISTINCT au_id)"
FROM title_authors;
COUNT(au_id)
------------
17
DISTINCT COUNT(au_id)
---------------------
17
COUNT(DISTINCT au_id)
---------------------
6
Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results.
The four queries in Listing 6.8 show the four combinations of non-DISTINCT and DISTINCT sums and counts. Of the four results in Figure 6.8, only the first result (no DISTINCTs) and final result (all DISTINCTs) are consistent mathematically, which you can verify with AVG(price) and AVG(DISTINCT price). In the second and third queries (mixed non-DISTINCTs and DISTINCTs), you can’t calculate a valid average by dividing the sum by the count.
SELECT
COUNT(price)
AS "COUNT(price)",
SUM(price)
AS "SUM(price)"
FROM titles;
SELECT
COUNT(price)
AS "COUNT(price)",
SUM(DISTINCT price)
AS "SUM(DISTINCT price)"
FROM titles;
SELECT
COUNT(DISTINCT price)
AS "COUNT(DISTINCT price)",
SUM(price)
AS "SUM(price)"
FROM titles;
SELECT
COUNT(DISTINCT price)
AS "COUNT(DISTINCT price)",
SUM(DISTINCT price)
AS "SUM(DISTINCT price)"
FROM titles;
COUNT(price) SUM(price)
------------ ----------
12 220.65
COUNT(price) SUM(DISTINCT price)
------------ -------------------
12 187.71
COUNT(DISTINCT price) SUM(price)
--------------------- ----------
10 220.65
COUNT(DISTINCT price) SUM(DISTINCT price)
--------------------- -------------------
10 187.71
Microsoft Access doesn’t support DISTINCT aggregate functions. This statement, for example, is illegal in Access:
SELECT SUM(DISTINCT price)
FROM titles; --Illegal in Access
But you can replicate it with this subquery (see “Using Subqueries as Column Expressions” in Chapter 8):
SELECT SUM(price)
FROM (SELECT DISTINCT price
FROM titles);
This Access workaround won’t let you mix non-DISTINCT and DISTINCT aggregates, however, as in the second and third queries in Listing 6.8.
MySQL 4.1 and earlier support COUNT(DISTINCT expr) but not SUM(DISTINCT expr) and AVG(DISTINCT expr) and so won’t run Listings 6.6 and 6.8. MySQL 5.0 and later support all DISTINCT aggregates.
To this point, I’ve used aggregate functions to summarize all the values in a column or just those values that matched a WHERE search condition. You can use the GROUP BY clause to divide a table into logical groups (categories) and calculate aggregate statistics for each group.
An example will clarify the concept. Listing 6.9 uses GROUP BY to count the number of books that each author wrote (or cowrote). In the SELECT clause, the column au_id identifies each author, and the derived column num_books counts each author’s books. The GROUP BY clause causes num_books to be calculated for every unique au_id instead of only once for the entire table. Figure 6.9 shows the result. In this example, au_id is called the grouping column.
SELECT
au_id,
COUNT(*) AS "num_books"
FROM title_authors
GROUP BY au_id;
au_id num_books
----- ---------
A01 3
A02 4
A03 2
A04 4
A05 1
A06 3
The GROUP BY clause’s important characteristics are:
No columns from the input table can appear in an aggregate query’s SELECT clause unless they’re also included in the GROUP BY clause. A column has (or can have) different values in different rows, so there’s no way to decide which of these values to include in the result if you’re generating a single new row from the table as a whole. The following statement is illegal because GROUP BY returns only one row for each value of type; the query can’t return the multiple values of pub_id that are associated with each value of type:
SELECT type, pub_id, COUNT(*)
FROM titles
GROUP BY type; --Illegal
ORDER BY "num_books" DESC
.To group rows:
Type:
SELECT columns
FROM table
[WHERE search_condition]
GROUP BY grouping_columns
[HAVING search_condition]
[ORDER BY sort_columns];
columns and grouping_columns are one or more comma-separated column names, and table is the name of the table that contains columns and grouping_columns. The nonaggregate columns that appear in columns also must appear in grouping_columns. The order of the column names in grouping_columns determines the grouping levels, from the highest to the lowest level of grouping.
The GROUP BY clause restricts the rows of the result; only one row appears for each distinct value in the grouping column or columns. Each row in the result contains summary data related to the specific value in its grouping columns.
If the statement includes a WHERE clause, then the DBMS groups values after it applies search_condition to the rows in table. If the statement includes an ORDER BY clause, then the columns in sort_columns must be drawn from those in columns. The WHERE and ORDER BY clauses are covered in “Filtering Rows with WHERE” and “Sorting Rows with ORDER BY” in Chapter 4. HAVING, which filters grouped rows, is covered in the next section.
Listing 6.10 and Figure 6.10 show the difference between COUNT(expr) and COUNT(*) in a query that contains GROUP BY. The table publishers contains one null in the column state (for publisher P03 in Germany). Recall from “Counting Rows with COUNT()” earlier in this chapter that COUNT(expr) counts non-null values and COUNT(*) counts all values, including nulls. In the result, GROUP BY recognizes the null and creates a null group for it. COUNT(*) finds (and counts) the one null in the column state. But COUNT(state) contains a zero for the null group because COUNT(state) finds only a null in the null group, which it excludes from the count—that’s why you have the zero.
SELECT
state,
COUNT(state) AS "COUNT(state)",
COUNT(*) AS "COUNT(*)"
FROM publishers
GROUP BY state;
state COUNT(state) COUNT(*)
----- ------------ --------
NULL 0 1
CA 2 2
NY 1 1
If a nonaggregate column contains nulls, then using COUNT(*) rather than COUNT(expr) can produce misleading results. Listing 6.11 and Figure 6.11 show summary sales statistics for each type of book. The sales value for one of the biographies is null, so COUNT(sales) and COUNT(*) differ by 1. The average calculation in the fifth column, SUM/COUNT(sales), is consistent mathematically, whereas the sixth-column average, SUM/COUNT(*), is not. I’ve verified the inconsistency with AVG(sales) in the final column. (Recall a similar situation in Listing 6.8 in “Aggregating Distinct Values with DISTINCT” earlier in this chapter.)
SELECT
type,
SUM(sales) AS "SUM(sales)",
COUNT(sales) AS "COUNT(sales)",
COUNT(*) AS "COUNT(*)",
SUM(sales)/COUNT(sales)
AS "SUM/COUNT(sales)",
SUM(sales)/COUNT(*)
AS "SUM/COUNT(*)",
AVG(sales) AS "AVG(sales)"
FROM titles
GROUP BY type;
type SUM(sales) COUNT(sales) COUNT(*) SUM/COUNT(sales) SUM/COUNT(*) AVG(sales)
---------- ---------- ------------ -------- ---------------- ------------ ----------
biography 1611521 3 4 537173.67 402880.25 537173.67
children 9095 2 2 4547.50 4547.50 4547.50
computer 25667 1 1 25667.00 25667.00 25667.00
history 20599 3 3 6866.33 6866.33 6866.33
psychology 308564 3 3 102854.67 102854.67 102854.67
Listing 6.12 and Figure 6.12 show a simple GROUP BY query that calculates the total sales, average sales, and number of titles for each type of book. In Listing 6.13 and Figure 6.13, I’ve added a WHERE clause to eliminate books priced less than $13 before grouping. I’ve also added an ORDER BY clause to sort the result by descending total sales of each book type.
SELECT
type,
SUM(sales) AS "SUM(sales)",
AVG(sales) AS "AVG(sales)",
COUNT(sales) AS "COUNT(sales)"
FROM titles
GROUP BY type;
TYPE SUM(sales) AVG(sales) COUNT(sales)
---------- ---------- ---------- ------------
biography 1611521 537173.67 3
children 9095 4547.50 2
computer 25667 25667.00 1
history 20599 6866.33 3
psychology 308564 102854.67 3
SELECT
type,
SUM(sales) AS "SUM(sales)",
AVG(sales) AS "AVG(sales)",
COUNT(sales) AS "COUNT(sales)"
FROM titles
WHERE price >= 13
GROUP BY type
ORDER BY "SUM(sales)" DESC;
type SUM(sales) AVG(sales) COUNT(sales)
---------- ---------- ---------- ------------
biography 1511520 755760.00 2
computer 25667 25667.00 1
history 20599 6866.33 3
children 5000 5000.00 1
Listing 6.14 and Figure 6.14 use multiple grouping columns to count the number of titles of each type that each publisher publishes.
SELECT
pub_id,
type,
COUNT(*) AS "COUNT(*)"
FROM titles
GROUP BY pub_id, type
ORDER BY pub_id ASC, "COUNT(*)" DESC;
pub_id type COUNT(*)
------ ---------- --------
P01 biography 3
P01 history 1
P02 computer 1
P03 history 2
P03 biography 1
P04 psychology 3
P04 children 2
In Listing 6.15 and Figure 6.15, I revisit Listing 5.31 in “Evaluating Conditional Values with CASE” in Chapter 5. But instead of listing each book categorized by its sales range, I use GROUP BY to list the number of books in each sales range.
SELECT
CASE
WHEN sales IS NULL
THEN 'Unknown'
WHEN sales <= 1000
THEN 'Not more than 1,000'
WHEN sales <= 10000
THEN 'Between 1,001 and 10,000'
WHEN sales <= 100000
THEN 'Between 10,001 and 100,000'
WHEN sales <= 1000000
THEN 'Between 100,001 and 1,000,000'
ELSE 'Over 1,000,000'
END
AS "Sales category",
COUNT(*) AS "Num titles"
FROM titles
GROUP BY
CASE
WHEN sales IS NULL
THEN 'Unknown'
WHEN sales <= 1000
THEN 'Not more than 1,000'
WHEN sales <= 10000
THEN 'Between 1,001 and 10,000'
WHEN sales <= 100000
THEN 'Between 10,001 and 100,000'
WHEN sales <= 1000000
THEN 'Between 100,001 and 1,000,000'
ELSE 'Over 1,000,000'
END
ORDER BY MIN(sales) ASC;
Sales category Num titles
----------------------------- ----------
Unknown 1
Not more than 1,000 1
Between 1,001 and 10,000 3
Between 10,001 and 100,000 5
Between 100,001 and 1,000,000 2
Over 1,000,000 1
When used without an aggregate function, GROUP BY acts like DISTINCT (Listing 6.16 and Figure 6.16). For information about DISTINCT, see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.
SELECT type
FROM titles
GROUP BY type;
SELECT DISTINCT type
FROM titles;
type
----------
biography
children
computer
history
psychology
You can use GROUP BY to look for patterns in your data. In Listing 6.17 and Figure 6.17, I’m looking for a relationship between price categories and average sales.
SELECT price, AVG(sales) AS "AVG(sales)"
FROM titles
WHERE price IS NOT NULL
GROUP BY price
ORDER BY price ASC;
price AVG(sales)
------- ----------
6.95 201440.0
7.99 94123.0
10.00 4095.0
12.99 56501.0
13.95 5000.0
19.95 10443.0
21.99 566.0
23.95 1500200.0
29.99 10467.0
39.95 25667.0
In Microsoft Access, use the Switch() function instead of the CASE expression in Listing 6.15. See the DBMS tip in “Tips for CASE” in Chapter 5.
MySQL 4.1 and earlier don’t allow CASE in a GROUP BY clause and so won’t run Listing 6.15. MySQL 5.0 and later will run it.
Categorizing Numeric Values
You can use the function FLOOR(x) to categorize numeric values. FLOOR(x) returns the greatest integer that is lower than x. This query groups books in $10 price intervals:
SELECT
FLOOR(price/10)*10 AS "Category",
COUNT(*) AS "Count"
FROM titles
GROUP BY FLOOR(price/10)*10;
The result is:
Category Count
-------- -----
0 2
10 6
20 3
30 1
NULL 1
Category 0 counts prices between $0.00 and $9.99; category 10 counts prices between $10.00 and $19.99; and so on. (The analogous function CEILING(x) returns the smallest integer that is higher than x.)
The HAVING clause sets conditions on the GROUP BY clause similar to the way that WHERE interacts with SELECT. The HAVING clause’s important characteristics are:
The sequence in which the WHERE, GROUP BY, and HAVING clauses are applied is:
To filter groups:
Following the GROUP BY clause, type:
HAVING search_condition
search_condition is a search condition used to filter groups. search_condition can contain aggregate functions but otherwise is identical to the WHERE search condition, described in “Filtering Rows with WHERE” and subsequent sections in Chapter 4. You can combine and negate multiple HAVING conditions with the logical operators AND, OR, and NOT.
The HAVING search condition is applied to the rows in the output produced by grouping. Only the groups that meet the search condition appear in the result. You can apply a HAVING clause only to columns that appear in the GROUP BY clause or in an aggregate function.
In Listing 6.18 and Figure 6.18, I revisit Listing 6.9 earlier in this chapter, but instead of listing the number of books that each author wrote (or cowrote), I use HAVING to list only the authors who have written three or more books.
SELECT
au_id,
COUNT(*) AS "num_books"
FROM title_authors
GROUP BY au_id
HAVING COUNT(*) >= 3;
au_id num_books
----- ---------
A01 3
A02 4
A04 4
A06 3
In Listing 6.19 and Figure 6.19, the HAVING condition also is an aggregate expression in the SELECT clause. This query still works if you remove the AVG() expression from the SELECT list (Listing 6.20 and Figure 6.20).
SELECT
type,
COUNT(price) AS "COUNT(price)",
AVG(price * sales) AS "AVG revenue"
FROM titles
GROUP BY type
HAVING AVG(price * sales) > 1000000;
type COUNT(price) AVG revenue
---------- ------------ -----------
biography 3 12484878.00
computer 1 1025396.65
SELECT
type,
COUNT(price) AS "COUNT(price)"
FROM titles
GROUP BY type
HAVING AVG(price * sales) > 1000000;
type COUNT(price)
---------- ------------
biography 3
computer 1
In Listing 6.21 and Figure 6.21, multiple grouping columns count the number of titles of each type that each publisher publishes. The HAVING condition removes groups in which the publisher has one or fewer titles of a particular type. This query retrieves a subset of the result of Listing 6.14 earlier in this chapter.
SELECT
pub_id,
type,
COUNT(*) AS "COUNT(*)"
FROM titles
GROUP BY pub_id, type
HAVING COUNT(*) > 1
ORDER BY pub_id ASC, "COUNT(*)" DESC;
pub_id type COUNT(*)
------ ---------- --------
P01 biography 3
P03 history 2
P04 psychology 3
P04 children 2
In Listing 6.22 and Figure 6.22, the WHERE clause first removes all rows except for books from only publishers P03 and P04. Then the GROUP BY clause groups the output of the WHERE clause by type. Finally, the HAVING clause filters rows from the grouped result.
SELECT
type,
SUM(sales) AS "SUM(sales)",
AVG(price) AS "AVG(price)"
FROM titles
WHERE pub_id IN ('P03', 'P04')
GROUP BY type
HAVING SUM(sales) > 10000
AND AVG(price) < 20;
type SUM(sales) AVG(price)
---------- ---------- ----------
psychology 308564 9.31
Generally, a HAVING clause should involve only aggregates. The only conditions that you specify in the HAVING clause are those conditions that must be applied after the grouping operation has been performed. It’s more efficient to specify conditions that can be applied before the grouping operation in the WHERE clause. The following statements, for example, are equivalent, but the first statement is preferable because it reduces the number of rows that have to be grouped:
--Faster
SELECT pub_id, SUM(sales)
FROM titles
WHERE pub_id IN ('P03', 'P04')
GROUP BY pub_id
HAVING SUM(sales) > 10000;
--Slower
SELECT pub_id, SUM(sales)
FROM titles
GROUP BY pub_id
HAVING SUM(sales) > 10000
AND pub_id IN ('P03', 'P04');