SQL Programming the 'Group By' Statement
Topic Category
SQL Database programmingImportant Terminology
grouping queryrow group
The Single-Value Rule
ambiguous columns
a
functional dependency
Reading Sections
Ambiguous Groups
Intellect distinguishes between the possible and the impossible;
Reason distinguishes between the sensible and the senseless.
Even the possible can be senseless.
Reason distinguishes between the sensible and the senseless.
Even the possible can be senseless.
Suppose your boss needs to know which projects in the bugs database are still active and which projects have been abandoned. One report he asks you to generate is the latest bug reported per product. You write a query using the MySQL database to calculate the greatest value in the date_reported column per group of bugs sharing a given product_id.
The report looks like this:
product_name latest bug_id
Open RoundFile 2010-06-01 1234
Visual TurboBuilder 2010-02-16 3456
ReConsider 2010-01-01 5678
Open RoundFile 2010-06-01 1234
Visual TurboBuilder 2010-02-16 3456
ReConsider 2010-01-01 5678
Your boss is a detail-oriented person, and he spends some time looking up each bug listed in the report.
He notices that the row listed as the most recent for “Open RoundFile” shows a bug_id that isn’t the latest bug.
The full data shows the discrepancy:
product_name date_reported bug_id
Open RoundFile 2009-12-19 1234 This bug_id. . .
Open RoundFile 2010-06-01 2248 doesn’t match this date
Visual TurboBuilder 2010-02-16 3456
Visual TurboBuilder 2010-02-10 4077
Visual TurboBuilder 2010-02-16 5150
ReConsider 2010-01-01 5678
ReConsider 2009-11-09 8063
Open RoundFile 2009-12-19 1234 This bug_id. . .
Open RoundFile 2010-06-01 2248 doesn’t match this date
Visual TurboBuilder 2010-02-16 3456
Visual TurboBuilder 2010-02-10 4077
Visual TurboBuilder 2010-02-16 5150
ReConsider 2010-01-01 5678
ReConsider 2009-11-09 8063
How can you explain this problem? Why does it affect one product but not the others? How can you get the desired report?
Objective: Get Row with Greatest Value per Group
Most programmers who learn SQL get to the stage of using GROUP BY in a query, applying some aggregate function to groups of rows, and getting a result with one row per group. This is a powerful feature that makes it easy to get a wide variety of complex reports using relatively little code.
For example, a query to get the latest bug reported for each product in the bugs database looks like this:
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
A natural extension to this query is to request the ID of the specific bug with the latest date reported:
SELECT product_id, MAX(date_reported) AS latest, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
However, this query results in either an error or an unreliable answer.
This is a common source of confusion for programmers using SQL.
The objective is to run a query that not only reports the greatest value in a group (or the least value or the average value) but also includes other attributes of the row where that value is found.
Antipattern: Reference Nongrouped Columns
The root cause of this antipattern is simple, and it reveals a common misconception that many programmers have about how grouping queries work in SQL.
The Single-Value Rule
The rows in each group are those rows with the same value in the column or columns you name after GROUP BY.
For example, in the following query, there is one row group for each distinct value in product_id.
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
Every column in the select-list of a query must have a single value row per row group.
This is called the Single-Value Rule.
Columns named in the GROUP BY clause are guaranteed to be exactly one value per group, no matter how many rows the group matches.
The MAX() expression is also guaranteed to result in a single value for each group: the highest value found in the argument of MAX() over all the rows in the group.
However, the database server can’t be so sure about any other column named in the select-list.
It can’t always guarantee that the same value occurs on every row in a group for those other columns.
SELECT product_id, MAX(date_reported) AS latest, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
In this example, there are many distinct values for bug_id for a given product_id, because the BugsProducts table associates multiple bugs to a given product. In a grouping query that reduces to a single row per product, there’s no way to represent all the values of bug_id.
Since there is no guarantee of a single value per group in the “extra” columns, the database assumes that they violate the Single-Value Rule. Most brands of database report an error if you try to run any query that tries to return a column other than those columns named in the GROUP BY clause or as arguments to aggregate functions.
MySQL and SQLite have different behavior from other brands of database, which we’ll explore in
Section 4, Legitimate Uses of the Antipattern.
Do-What-I-Mean Queries
The common misconception that programmers have is that SQL can guess which bug_id you want in the report, based on the fact that MAX() is used in another column. Most people assume that if the query fetches the greatest value, then other columns named will naturally
take their value from the same row where that greatest value occurs
.Unfortunately, SQL can’t make this inference in several cases:
• If two bugs have the exact same value for date_reported and that is the greatest value in the group, which value of bug_id should the query report?
• If you query for two different aggregate functions, for example MAX() and MIN(), these probably correspond to two different rows in the group. Which bug_id should the query return for this group?
SELECT
product_id,
MAX(date_reported) AS latest,
MIN(date_reported) AS earliest,
bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
• If none of the rows in the table matches the value returned by the aggregate function, what is the value of bug_id?
This is commonly true for the functions AVG(), COUNT(), and SUM().
SELECT
product_id,
SUM(hours) AS total_project_estimate,
bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
These are examples of why the Single-Value Rule is important.
Not every query that fails to follow this rule would produce an ambiguous result, but many do.
It would be clever if the database could tell an ambiguous query from an unambiguous one and produce an error only when the data contains ambiguity. But that would not be good for application reliability; it would mean that the same query might be valid or invalid, depending on the state of data.
How to Recognize the Antipattern
In most brands of database, writing a query that violates the Single-Value Rule should elicit an error immediately as you prepare the query. The following are examples of error messages given by some brands of database:
• Firebird 2.1:
Invalid expression in the select list (not contained in either an aggregate function or the GROUP BY clause)
• IBM DB2 9.5:
An expression starting with "BUG_ID" specified in a SELECT clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified.
• Microsoft SQL Server 2008:
Column 'Bugsfibug_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
• MySQL 5.1, after setting the ONLY_FULL_GROUP SQL mode to disallow ambiguous queries.
'bugsfib.bug_id' isn't in GROUP BY
• Oracle 10.2:
not a GROUP BY expression
• PostgreSQL 8.3:
column "bpfibug_id" must appear in the GROUP BY clause or be used in an aggregate function
In SQLite and in MySQL, ambiguous columns may contain unexpected and unreliable values. In MySQL, the value returned is from the first row in the group, where first corresponds to physical storage. SQLite gives the opposite result: the value is from the last row in the group. In both cases, the behavior is not documented, and these databases aren’t obligated to work the same in future versions. It’s your responsibility to notice these cases and to design your queries to avoid ambiguity.
Legitimate Uses of the Antipattern
As we’ve seen, MySQL and SQLite can’t guarantee a reliable result for a column that doesn’t fit the Single-Value Rule.
There are cases when you can take advantage of the fact that these databases enforce the rule less strictly than other brands.
SELECT bfireported_by, afiaccount_name
FROM Bugs b JOIN Accounts a ON (bfireported_by = afiaccount_id)
GROUP BY bfireported_by;
In the previous query, the account_name column technically violates the Single-Value Rule, since it’s named neither in the GROUP BY clause nor in an aggregate function.
Nevertheless, there is only one value possible for account_name in each group; the groups are based on Bugsfireported_by, which is a foreign key to the Accounts table. Therefore, the groups correspond one-to-one with rows in the Accounts table.
In other words, if you know the value of reported_by, then you know the value of account_name unambiguously, like if you had queried by the primary key of the Accounts table.
This kind of unambiguous relationship is called a functional dependency.
The most common example of this is between the primary key of a table and the table’s attributes: account_name is a functional dependency of its primary key, account_id. If you group a query by a table’s primary key column(s), then the groups correspond to a single row of that table, and therefore all other columns of the same table must have a single value per group.
Bugsfireported_by has a similar relationship with the dependent attributes of the Accounts table, because it references the primary key of the Accounts table. When the query groups by the reported_by column, which is a foreign key, the attributes of the Accounts table are functionally dependent, and the query result contains no ambiguity.
However, most brands of database still return an error. Not only is this the behavior required by the SQL standard, but it’s not too expensive to figure out functional dependencies on the fiy.1 But if you use MySQL or SQLite and you’re careful to query only functionally dependent columns, you can use this kind of grouping query and still avoid problems of ambiguity.
1. The example queries in this chapter are simple. Figuring out functional dependencies for any arbitrary SQL query is harder.
Solution: Use Columns Unambiguously
The sections that follow describe several ways you can resolve this antipattern and write unambiguous queries.
Query Only Functionally Dependent Columns
The most straightforward solution is to eliminate ambiguous columns from the query.
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
The query reveals the date of the latest bug per product, even though it doesn’t report the bug_id corresponding to that latest bug. Sometimes this is enough, so don’t overlook a simple solution.
Using a Correlated Subquery
A correlated subquery contains a reference to the outer query and so produces different results for each row of the outer query. We can use this to find the latest bug per product by running a subquery to search for bugs with the same product and a greater date. When the subquery finds none, the bug in the outer query is the latest.
SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id)
WHERE NOT EXISTS
*
(SELECT FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
WHERE bp1.product_id = bp2.product_id
AND b1.date_reported < b2.date_reported);
Use this solution as a simple solution that is readable and easy to code. However, keep in mind that this solution isn’t likely to be the best for performance, because correlated subqueries are executed once


