How SQL DISTINCT and ORDER BY are Related

One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query.

The Basics

Running some queries against the Sakila database, most people quickly understand:

SELECT DISTINCT length FROM film

This returns results in an arbitrary order, because the database can (and might) apply hashing rather than ordering to remove duplicates:

length |
-------|
129    |
106    |
120    |
171    |
138    |
80     |
...

Most people also understand:

SELECT length FROM film ORDER BY length

This will give us duplicates, but in order:

length |
-------|
46     |
46     |
46     |
46     |
46     |
47     |
47     |
47     |
47     |
47     |
47     |
47     |
48     |
...

And, of course, we can combine the two:

SELECT DISTINCT length FROM film ORDER BY length

Resulting in…

length |
-------|
46     |
47     |
48     |
49     |
50     |
51     |
52     |
53     |
54     |
55     |
56     |
...

Then why doesn’t this work?

Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:

SELECT DISTINCT length FROM film ORDER BY title

Most databases fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight, this seems funny, because this works after all:

SELECT length FROM film ORDER BY title

Yielding:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...

We could add the title to illustrate the ordering:

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |

So, how are these different?

We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.

When we write something like this:

SELECT DISTINCT length FROM film ORDER BY length

The logical order of operations is:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column
  • DISTINCT clause, removing duplicate tuples (with projected LENGTH columns)
  • ORDER BY clause, ordering by the LENGTH column

If we look at this step by step, we have:

Step 1: SELECT * FROM film

The intermediary data set is something like:

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length …

The intermediary data set is something like:

length |
-------|
86     |
48     |
50     |
117    |
130    |
...
86     | <-- duplicate

Step 3: SELECT DISTINCT length …

Now we’re getting a new random order (due to hashing) and no duplicates anymore:

length |
-------|
129    |
106    |
120    |
171    |
138    |
...

Step 4: … ORDER BY length

And we’re getting:

length |
-------|
46     |
47     |
48     |
49     |
50     |
...

It seems obvious.
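
The four logical steps can be mimicked outside a database. The following is only a sketch in plain Python, assuming a handful of made-up rows (the last title is hypothetical and exists only to create a duplicate length); it is not how any optimiser actually executes the query:

```python
# Hypothetical sample rows (film_id, title, length); not the full Sakila table.
film = [
    (1, "ACADEMY DINOSAUR", 86),
    (2, "ACE GOLDFINGER", 48),
    (3, "ADAPTATION HOLES", 50),
    (4, "AFFAIR PREJUDICE", 117),
    (5, "AFRICAN EGG", 130),
    (6, "HYPOTHETICAL DUPLICATE", 86),  # made up, to get a duplicate length
]

# Step 1: FROM film
rows = list(film)

# Step 2: SELECT length (projection, duplicates still present)
projected = [(length,) for (_film_id, _title, length) in rows]

# Step 3: DISTINCT (a set, like hashing, promises no particular order)
distinct = set(projected)

# Step 4: ORDER BY length
result = sorted(distinct, key=lambda row: row[0])

print(result)  # [(48,), (50,), (86,), (117,), (130,)]
```

Only the final sort establishes an order; everything before it is free to shuffle rows.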

So why did this work?

Remember, this query worked:

SELECT length FROM film ORDER BY title

Even if, after projecting the LENGTH column, it seems as though the TITLE column is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT operation):

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

Again, this is what happens logically. Database optimisers may choose other ways to implement this. For example:

Step 1: SELECT * FROM film

Same as before:

film_id |title                       |length | ...
--------|----------------------------|-------| ...
1       |ACADEMY DINOSAUR            |86     | ...
2       |ACE GOLDFINGER              |48     | ...
3       |ADAPTATION HOLES            |50     | ...
4       |AFFAIR PREJUDICE            |117    | ...
5       |AFRICAN EGG                 |130    | ...
...     |...                         |...    | ...

Step 2: SELECT length, title…

We get that synthetic extended sort key column TITLE along with the LENGTH column that we requested:

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
114    |ALABAMA DEVIL               |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
168    |ANTITRUST TOMATOES          |
...

Step 3: … ORDER BY title

… we can now order by that column:

length |title                       |
-------|----------------------------|
86     |ACADEMY DINOSAUR            |
48     |ACE GOLDFINGER              |
50     |ADAPTATION HOLES            |
117    |AFFAIR PREJUDICE            |
130    |AFRICAN EGG                 |
...

Step 4: SELECT length

… and finally discard it, because we never wanted it:

length |
-------|
86     |
48     |
50     |
117    |
130    |
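
As a rough illustration of the extended sort key idea, here is a plain Python sketch over the five sample rows shown above, fed in a deliberately shuffled order. The variable names are made up for this example, and the steps are the logical ones, not an actual execution plan:

```python
# Sample rows (film_id, title, length) from the tables above, shuffled on purpose.
film = [
    (3, "ADAPTATION HOLES", 50),
    (1, "ACADEMY DINOSAUR", 86),
    (5, "AFRICAN EGG", 130),
    (2, "ACE GOLDFINGER", 48),
    (4, "AFFAIR PREJUDICE", 117),
]

# Step 2: SELECT length, plus the synthetic extended sort key column title
projected = [(length, title) for (_film_id, title, length) in film]

# Step 3: ORDER BY title (possible only because title was carried along)
ordered = sorted(projected, key=lambda row: row[1])

# Step 4: implicit final projection, discarding the extended sort key column
result = [(length,) for (length, _title) in ordered]

print(result)  # [(86,), (48,), (50,), (117,), (130,)]
```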

So why can’t we use DISTINCT?

If we try to run this:

SELECT DISTINCT length FROM film ORDER BY title

We would get an additional DISTINCT operation in our logical set of operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list and the TITLE from the extended sort key columns
  • DISTINCT clause, removing duplicate (LENGTH, TITLE) values… Oops
  • ORDER BY clause, ordering by the TITLE column
  • SELECT clause (implicit), projecting only the LENGTH column, discarding the TITLE column

The problem is, since we have synthetically added the extended sort key column TITLE to the projection in order to be able to ORDER BY it, DISTINCT wouldn’t have the same semantics anymore, as can be seen here:

SELECT count(*)
FROM (
  SELECT DISTINCT length FROM film
) t;
 
SELECT count(*)
FROM (
  SELECT DISTINCT length, title FROM film
) t;

Yielding:

140
1000

All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT doesn’t work (because the added extended sort key column changes its semantics), or ORDER BY doesn’t work (because after DISTINCT we can no longer access the extended sort key column).
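
The same effect can be reproduced without Sakila. This sketch assumes an in-memory SQLite database and a tiny stand-in FILM table; the two "HYPOTHETICAL" titles are invented here to create duplicate lengths:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, length INT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [
        ("ACADEMY DINOSAUR", 86),
        ("ACE GOLDFINGER", 48),
        ("ADAPTATION HOLES", 50),
        ("HYPOTHETICAL TITLE A", 86),  # made up: duplicates length 86
        ("HYPOTHETICAL TITLE B", 48),  # made up: duplicates length 48
    ],
)

# DISTINCT over the select list only
n_length = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT length FROM film) t"
).fetchone()[0]

# DISTINCT over the select list plus the extended sort key column
n_pairs = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT length, title FROM film) t"
).fetchone()[0]

print(n_length, n_pairs)  # 3 5
```

Adding TITLE to the projection makes every row unique again, so DISTINCT no longer removes anything.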

Here is a more contrived example. Table T contains this data:

CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);

A   B
-----
1   1
1   2
2   3
1   4
2   5

What would this query produce?

SELECT DISTINCT a FROM t ORDER BY b;

Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a:

A 
--
1
2

Now, how do we order these by B? There are 3 values of B associated with A = 1 and 2 values of B associated with A = 2:

A   B
------------------
1   Any of 1, 2, 4
2   Any of 3, 5

Should we get 1, 2 or 2, 1 as a result? Impossible to tell.
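
To make the ambiguity concrete, the following plain Python sketch enumerates every way of picking one representative B per distinct A and records the order of A that each pick would produce. It is an illustration only; no database works this way:

```python
from itertools import product

# The contents of table T as (a, b) pairs
t = [(1, 1), (1, 2), (2, 3), (1, 4), (2, 5)]

# Collect the B values associated with each distinct A
candidates = {}
for a, b in t:
    candidates.setdefault(a, []).append(b)

keys = sorted(candidates)  # the distinct A values: [1, 2]

# Pick one B per A in every possible way, then sort the A values by that B
orderings = set()
for picks in product(*(candidates[a] for a in keys)):
    ranked = sorted(zip(picks, keys))
    orderings.add(tuple(a for _b, a in ranked))

print(sorted(orderings))  # [(1, 2), (2, 1)]
```

Both orders are reachable, depending on which B happens to represent each A.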

But there are some exceptions

The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the “Left normal form derivation”. But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:

https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com

In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:

SELECT DISTINCT length
FROM film
ORDER BY mod(length, 10), length;

It will produce:

length |
-------|
50     |
60     |
70     |
80     |
90     |
100    |
110    |
120    |
130    |
140    |
150    |
160    |
170    |
180    |
51     |
61     |
71     |

PostgreSQL doesn’t allow this because the expression MOD(LENGTH, 10) is not in the select list. How to interpret this? We’re looking again at the order of SQL operations:

  • FROM clause, loading the FILM table
  • SELECT clause, projecting the LENGTH column from the select list. MOD(LENGTH, 10) does not have to be put in the extended sort key columns, because it can be fully derived from the select list.
  • DISTINCT clause, removing duplicate LENGTH values … all fine, because we don’t have the verboten extended sort key columns
  • ORDER BY clause, ordering by the MOD(LENGTH, 10) and LENGTH expressions. Totally fine, because we can derive all of these ORDER BY expressions from expressions in the select list

Makes sense, right?
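
A small way to try this out, assuming an in-memory SQLite database with a few made-up lengths. Two caveats: SQLite is lenient about ORDER BY with DISTINCT in general, so it is not evidence of what stricter databases accept, and the sketch uses the % operator because mod() is not available in every SQLite build:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (length INT)")
# Made-up lengths, including one duplicate (50)
conn.executemany(
    "INSERT INTO film VALUES (?)",
    [(50,), (61,), (60,), (50,), (51,), (70,)],
)

# The sort expression is derived entirely from the selected LENGTH column
rows = conn.execute(
    "SELECT DISTINCT length FROM film ORDER BY length % 10, length"
).fetchall()

print(rows)  # [(50,), (60,), (70,), (51,), (61,)]
```

Each distinct LENGTH has exactly one value of the sort expression, so the ordering is well defined.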

Back to our constructed table T:

A   B
-----
1   1
1   2
2   3
1   4
2   5

We are allowed to write:

SELECT DISTINCT a, b FROM t ORDER BY a - b;

We would get:

A   B
-----
1   4
2   5
2   3
1   2
1   1

Again, the ORDER BY expressions can be derived completely from the select list. This also works in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);

The select list contains a column A - B, so we can derive any ORDER BY expression from it. But these don’t work:

SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;

It is easy to build an intuition for why these don’t work. Clearly, the data set we want is:

A - B  A             B             B - A
------------------------------------------
-3     Any of 1, 2   Any of 4, 5   3
-1     Any of 2, 1   Any of 3, 2   1
 0     Any of 1      Any of 1      0

Now, how are we supposed to order these by A, B or B - A? It looks as though we should be able to sort by B - A in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A), but this simply isn’t practical. The expression in the projection is A - B, and that’s the only expression we can re-use in the ORDER BY. For example, we could even do this in Oracle:

SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));

Or start using aliases:

SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);
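
Both spellings can be compared on table T. This is a sketch using an in-memory SQLite database, which happens to accept the alias inside the ORDER BY expression; other databases may be stricter about that:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 3), (1, 4), (2, 5)],
)

# Re-using the projected expression A - B verbatim in ORDER BY
by_expression = conn.execute(
    "SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b))"
).fetchall()

# The same query, referring to the projected expression through its alias
by_alias = conn.execute(
    "SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x)"
).fetchall()

print(by_expression)  # [(0,), (-1,), (-3,)]
print(by_alias)       # [(0,), (-1,), (-3,)]
```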

Conclusion

The SQL language is quirky. This is mostly because the syntactical order of operations doesn’t match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.

In this article, we haven’t even touched the implications of adding

  • GROUP BY
  • TOP / LIMIT / FETCH
  • UNION

These add more fun rules to what’s possible and what isn’t. Our previous article on the true logical order of SQL operations explains this completely.


7 thoughts on “How SQL DISTINCT and ORDER BY are Related”

  1. Actually, I would use the following explanation.

    SELECT DISTINCT length FROM film
    

    …is exactly the same as

    SELECT length FROM film GROUP BY length
    

    So it’s obvious that you can write

    SELECT length FROM film GROUP BY length ORDER BY length
    

    …or just refer to the field in the SELECT clause by number:

    SELECT length FROM film GROUP BY length ORDER BY 1
    

    However, the below query is invalid because “title” is neither a part of grouping nor a result of aggregate function (like MAX or MIN) :

    SELECT length FROM film GROUP BY length ORDER BY title
    

    …because you can’t even write such an “unordered” query:

    SELECT length, title FROM film GROUP BY length
     -- it's ambiguous which row would yield "title"; it's not part of the grouping.
    

    However, if we apply aggregation to “title” (if it ever makes sense) we will get the following perfectly valid query:

    SELECT length, MAX(title) FROM film GROUP BY length
    

    …so it’s valid to write

    SELECT length, MAX(title) FROM film GROUP BY length ORDER BY MAX(title)
    

    …or refer to it by column number:

    SELECT length, MAX(title) FROM film GROUP BY length ORDER BY 2
    

    …or even exclude MAX(title) from the result set

    SELECT length FROM film GROUP BY length ORDER BY MAX(title)
    
    1. Actually, I would use the following explanation.

      SELECT DISTINCT length FROM film
      …is exactly the same as

      SELECT length FROM film GROUP BY length

      That’s like saying WHERE is the same thing as HAVING. While you can often achieve the same things, they’re not really exactly the same thing. For example, the GROUP BY clause is located before the WINDOW clause in the logical order of SQL operations, which means you cannot use any window functions in GROUP BY / HAVING. But you can use them in SELECT and thus filter on them using DISTINCT. See: https://blog.jooq.org/2016/12/09/a-beginners-guide-to-the-true-order-of-sql-operations

      So, while it is useful to know that DISTINCT and GROUP BY work in a similar way, it is misleading (especially for beginners) to claim that they’re exactly the same.

      1. That’s like saying WHERE is the same thing as HAVING

        You are exaggerating. In the context of the article’s example filtering out non-unique rows (DISTINCT) is the same as “reducing” original rowset to unique rows via GROUP BY. Ok, I agree that “almost the same” here is a better statement than “exactly the same”, and, again, in the context given.

  2. Thanks, this article is really helpful!

    Can you tell me (or point me to somewhere that I can read about) what the purpose of “t” is in this query?

    SELECT length
    FROM (
      SELECT length, MIN(title) title
      FROM film
      GROUP BY length
    ) t
    ORDER BY title
    
    1. Thanks for your message. Some RDBMS (but not all, e.g. not Oracle) require that derived tables (subqueries in the FROM clause) have an alias. To play it safe, even if it is not really useful, I just always add it. This way, this query will also work in MySQL, PostgreSQL, SQL Server, and other RDBMS.

      Hope this helps

  3. Thanks for such a detailed explanation @lukaseder especially the detailed step by step explanation. Helped a lot!

    Though for PostgreSQL, I don’t agree with the explanation of “distinct”: the three SQL statements are not equivalent:

    --SELECT DISTINCT length FROM film ORDER BY title;
    --SELECT DISTINCT ON (title) length FROM film ORDER BY title;
    --SELECT length FROM (   SELECT length, MIN(title) title   FROM film   GROUP BY length ) t ORDER BY title;
    

    Had doubt for the third query and did the following operations on a postgreSQL (EnterpriseDB) database:

    -------------------------------------------------Output Text-Start---------------------------------------------------------------
    SQL> SELECT * FROM SOAPS;
    
    LENGTH               TITLE
    
    
    
    10                   TITLE A
    20                   TITLE B
    30                   TITLE C
    40                   TITLE D
    50                   TITLE E
    10                   TITLE F
    20                   TITLE G
    
    7 rows retrieved.
    
    SQL> SELECT DISTINCT LENGTH FROM SOAPS ORDER BY TITLE;
    
    LENGTH
    
    10
    20
    30
    40
    50
    10
    20
    
    7 rows retrieved.
    
    SQL> SELECT DISTINCT ON (TITLE) LENGTH FROM SOAPS ORDER BY TITLE;
    
    LENGTH
    
    10
    20
    30
    40
    50
    10
    20
    
    7 rows retrieved.
    
    SQL> SELECT LENGTH FROM (SELECT LENGTH, MIN(title) title FROM SOAPS GROUP BY LENGTH) t ORDER BY title;
    
    LENGTH
    
    10
    20
    30
    40
    50
    
    -------------------------------------------------Output Text- Ends---------------------------------------------------------------
    

    As you can see, the third query yields a distinct set of results, as opposed to the first two. If I’m not missing any environment-level properties, I suppose it is not wrong to conclude that the three queries are not equivalent.

    Would love to hear from you on this.

    Thanks, again!

    1. Thanks for your comment. Indeed, the examples were wrong. I think I meant to inverse length and title somehow. Unfortunately, I don’t have time in the near future to fix this, so I’ll just remove that section of the article, which doesn’t really add too much value anyway.

      Thanks again for your thorough review!
