这是用户在 2025-1-18 10:42 为 https://planetscale.com/blog/how-do-database-indexes-work?__readwiseLocation= 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
Navigation

Blog|Engineering  博客|工程

How do Database Indexes Work?
数据库索引如何工作?

By Justin Gage |
作者:贾斯汀·盖奇 | 2022 年 7 月 14 日

If you’ve queried a database, you’ve probably used an index, even if you didn’t know it at the time. Database indexes help speed up read queries by creating ancillary data structures that make your scans faster. This post will walk through how database indexes work, with a particular focus on MySQL, everyone’s (well, many people’s) favorite homegrown organic database.
如果您查询过数据库,您可能使用过索引,即使您当时不知道。数据库索引通过创建使扫描速度更快的辅助数据结构来帮助加快读取查询速度。这篇文章将介绍数据库索引的工作原理,特别关注 MySQL,这是每个人(好吧,很多人)最喜欢的本土有机数据库。

After you read through how databases indexes work, make sure you check out our video on how to speed things up using composite indexes with hashed functions:
阅读完数据库索引的工作原理后,请务必观看我们的视频,了解如何使用带有散列函数的复合索引来加快速度:

Indexes speed up your read queries
索引可加快您的读取查询速度

Indexes are basically a way to speed up your read queries, particularly ones with filters (WHERE) on them. They’re data structures that exist in your database engine, outside of whatever table they work with, that point to data you’re trying to query.
索引基本上是一种加快读取查询速度的方法,特别是带有过滤器( WHERE )的查询。它们是存在于数据库引擎中的数据结构,位于它们使用的任何表之外,指向您尝试查询的数据。

To avoid the all-too-common librarian metaphor, imagine – perhaps a far out scenario – you’ve got all of your users in a table in your MySQL database (running on PlanetScale, of course). You’re building some functionality into your social app that allows users to search and filter for other users, which means there will be a query running in production1 that runs through your entire users table. Even worse, your app is quite successful, so there are hundreds of thousands of these “users.” Imagine that!
为了避免使用太常见的图书馆员比喻,想象一下(也许是一个遥远的场景)您的所有用户都位于 MySQL 数据库(当然在 PlanetScale 上运行)的一个表中。您正在社交应用程序中构建一些功能,允许用户搜索和过滤其他用户,这意味着生产 1 中将运行一个查询,该查询会遍历整个 users 表。更糟糕的是,您的应用程序非常成功,因此有数十万这样的“用户”。想象一下!

Unfortunately, SELECT-ing from that users table isn’t very performant right now. The filters you’re applying based on inputs in the UI – user location, account type, most recent activity, and other columns in your database – require scans of the entire users table, which is roughly O(N/2) on average. Your query is taking 5-6 seconds to run, which would be fine as a data analyst doing internal work, but isn’t up to snuff for a smooth user experience.
不幸的是,来自该用户表的 SELECT - 目前性能不是很好。您根据 UI 中的输入应用的过滤器(用户位置、帐户类型、最近的活动以及数据库中的其他列)需要扫描整个用户表,平均约为 O(N/2)。您的查询需要 5-6 秒才能运行,对于进行内部工作的数据分析师来说这没什么问题,但不足以提供流畅的用户体验。

To remedy this, you create a database index on the “most recent activity” column in your users table with CREATE INDEX. Behind the scenes, MySQL goes and creates a new pseudo-table in the database with two columns: a value for most recent activity, and a pointer to the record in the users table. The trick here, though, is that the table gets ordered and stored as a binary tree, with ordered values for that most recent activity column. As a result, your query is O Log(n) efficient, and only takes a second or less to run.
为了解决这个问题,您可以使用 CREATE INDEX 在用户表中的“最近活动”列上创建数据库索引。在幕后,MySQL 在数据库中创建一个新的伪表,其中包含两列:最近活动的值和指向 users 表中记录的指针。不过,这里的技巧是,表被排序并存储为二叉树,其中包含最近活动列的有序值。因此,您的查询效率为 O Log(n),并且只需要一秒或更短的时间即可运行。

This is the basic gist of indexes. If you know you’re going to run a specific query repeatedly and worry about read performance, creating an index (or several) can help speed that query up.
这就是索引的基本要点。如果您知道要重复运行特定查询并担心读取性能,那么创建一个(或多个)索引可以帮助加快该查询的速度。

That’s the simple version though; there’s a lot more going on under the hood, and too many indexes can even degrade that performance you sought out to improve.
但这是简单的版本;幕后还有更多的事情发生,太多的索引甚至会降低您想要提高的性能。

How database indexes work under the hood
数据库索引如何在幕后工作

An index is not magic – it’s a database structure that contains pointers to specific database records. Without an index, data in a database usually gets stored as a heap, basically a pile of unordered2, unsorted rows. In fact, this is actually a setting you can toggle in Microsoft SQL Server and Azure SQL Database.
索引并不神奇——它是一个包含指向特定数据库记录的指针的数据库结构。如果没有索引,数据库中的数据通常存储为堆,基本上是一堆无序的、未排序的行。事实上,这实际上是您可以在 Microsoft SQL Server 和 Azure SQL 数据库中切换的设置。

In practice, data is rarely stored completely unsorted. Instead, you’ll usually use some sort of primary key – which in MySQL can be identical to an index – that could be something like an auto-incrementing integer. But data can, of course, only be sorted by one column, which limits the “binary” efficiency of sorting (with unique values) to a query that filters on that one, ordered column. An index is basically a way of letting your table be sorted by multiple columns, which lets you get binary search efficiency on multiple filter columns.
在实践中,数据很少以完全未排序的方式存储。相反,您通常会使用某种主键(在 MySQL 中可以与索引相同),它可能类似于自动递增整数。但是,数据当然只能按一列进行排序,这将排序(具有唯一值)的“二进制”效率限制为针对该一有序列进行筛选的查询。索引基本上是一种让表按多列排序的方法,它可以让您在多个过滤列上获得二分搜索效率。

When you create an index on a column, you’re creating a new table with two columns: one for the column you indexed on, and one that contains a pointer to where the record in question is stored. Though the index will be of an identical length to the original table, the width will likely be a good degree shorter, and as such will take fewer disc blocks to store and traverse. Pointers in MySQL are pretty small, usually fewer than 5 bytes. The afore-linked now “legendary” Stack Overflow post runs through the math around the number of blocks required for storage, for those interested in a deeper look.
当您在列上创建索引时,您将创建一个包含两列的新表:一列用于您索引的列,另一列包含指向相关记录存储位置的指针。尽管索引的长度与原始表相同,但宽度可能会短很多,因此需要更少的磁盘块来存储和遍历。 MySQL 中的指针非常小,通常小于 5 个字节。前面链接的现在“传奇”的 Stack Overflow 帖子围绕存储所需的块数量进行了数学计算,对于那些有兴趣更深入了解的人来说。

Unless you set it up yourself, there are likely already several indexes created in whatever database you’re using right now. You can see any indexes that exist on a particular table with:
除非您自己设置,否则您现在使用的任何数据库中可能已经创建了多个索引。您可以使用以下命令查看特定表上存在的任何索引:

SHOW INDEX FROM table_name FROM db_name;

If you run an EXPLAIN statement on your query, you should also see information about which indexes the query plans to use. Here’s the table of possible EXPLAIN output values from the MySQL docs; note the possible_keys and key values, both relating to index selection.
如果您对查询运行 EXPLAIN 语句,您还应该看到有关查询计划使用哪些索引的信息。这是 MySQL 文档中可能的 EXPLAIN 输出值表;请注意 possible_keyskey 值,它们都与索引选择相关。

Column  柱子JSON name  JSON 名称Definition  定义
idselect_idThe SELECT identifier   SELECT 标识符
select_typeNone  没有任何The SELECT type   SELECT 类型
tabletable_nameThe table for the output row
输出行的表
partitionspartitionsThe matching partitions  匹配的分区
typeaccess_typeThe join type  连接类型
possible_keyspossible_keysThe possible indexes to choose
可以选择的索引
keykeyThe index actually chosen
实际选择的索引
key_lenkey_lengthThe length of the chosen key
所选密钥的长度
refrefThe columns compared to the index
列与索引的比较
rowsrowsEstimate of rows to be examined
估计要检查的行数
filteredfilteredPercentage of rows filtered by table condition
按表条件过滤的行的百分比
ExtraNone  没有任何Additional information  附加信息

This way of using EXPLAIN can also be useful for debugging when creating an index, i.e. verifying that a new one is working as intended.
这种使用 EXPLAIN 的方式对于创建索引时的调试也很有用,即验证新索引是否按预期工作。

While indexes always store pointers to the record any index points to, the database doesn’t always need to use those pointers. An index only scan (covering index in MySQL) refers to when the value your query is looking for is contained solely in the index, and thus doesn’t require a table lookup. In PostgreSQL, not all index types support index-only scans.
虽然索引始终存储指向任何索引所指向的记录的指针,但数据库并不总是需要使用这些指针。仅索引扫描(MySQL 中的覆盖索引)是指查询正在查找的值仅包含在索引中,因此不需要表查找。在 PostgreSQL 中,并非所有索引类型都支持仅索引扫描。

One thing to remember about indexes is that while they’re beneficial for read queries (including JOINs and aggregates), they can negatively impact your database performance as a whole if you’re not careful. Indexes take up a lot of space, even if that space is smaller than the initial table, and that space is precious; you’ll need to keep an eye on how close you’re inching towards your filesystem’s limit. They also make INSERT queries take longer and force the query engine to consider more options before choosing how to execute a query.
关于索引需要记住的一件事是,虽然它们有利于读取查询(包括 JOIN 和聚合),但如果您不小心,它们可能会对整个数据库性能产生负面影响。索引会占用大量空间,即使该空间比初始表小,而且该空间是宝贵的;您需要密切关注您距离文件系统限制的程度。它们还使 INSERT 查询花费更长的时间,并强制查询引擎在选择如何执行查询之前考虑更多选项。

Different types of database indexes
不同类型的数据库索引

We’ve been mostly referencing database indexes that are unique, but that turns out to be only one of several types. Keep in mind that any of these indexes can have more than one column (besides the pointer) in them. MySQL actually supports up to 16 columns in an index. But we digress.
我们主要引用唯一的数据库索引,但事实证明这只是几种类型中的一种。请记住,这些索引中的任何一个都可以包含多个列(除了指针之外)。 MySQL 实际上支持索引中最多 16 列。但我们离题了。

Keys   按键

You may see “key” used interchangeably with “index.” A KEY in MySQL is a type of non-unique index. Because the index isn’t unique, scanning through it won’t be as efficient as a binary tree, but it will likely be more efficient than linear search. Many columns in your tables will likely fit this bill, e.g. something like “first name” or “location.”
您可能会看到“key”与“index”互换使用。 MySQL中的 KEY 是一种非唯一索引。由于索引不是唯一的,因此扫描它不会像二叉树那样有效,但它可能比线性搜索更有效。表中的许多列可能符合此要求,例如例如“名字”或“位置”。

MySQL and Postgres have a pretty major difference between them in how they handle primary key storage. In MySQL, primary keys are stored with their data, which means they effectively take up no extra storage space. In Postgres, primary keys are treated like other indexes and stored in separate data structures.
MySQL 和 Postgres 在处理主键存储的方式上有很大的区别。在 MySQL 中,主键与其数据一起存储,这意味着它们实际上不会占用额外的存储空间。在 Postgres 中,主键像其他索引一样被处理并存储在单独的数据结构中。

Unique indexes  唯一索引

A unique index conforms to the traditional unique definition – no two identical non-NULL values – with the caveat that the values are sorted. When data fits those two definitions – unique and sorted – you can use binary search and get that sweet, sweet O log(N). Note that a primary key is a specific type of unique index, where there can only be one per table and the value cannot be NULL.
唯一索引符合传统的唯一定义——没有两个相同的非 NULL 值——但需要注意的是这些值是排序的。当数据符合这两个定义(唯一和排序)时,您可以使用二分搜索并获得非常好的 O log(N)。请注意,主键是一种特定类型的唯一索引,每个表只能有一个,并且值不能为 NULL

As an aside, UNIQUE indexes can be used as a way to enforce uniqueness on a particular table column. Since indexes are updated on every insert, trying to insert a non-unique value into a column with a unique index on it will throw an error. The Postgres docs call this out explicitly.
顺便说一句, UNIQUE 索引可以用来强制特定表列的唯一性。由于索引在每次插入时都会更新,因此尝试将非唯一值插入到具有唯一索引的列中将引发错误。 Postgres 文档明确指出了这一点。

Text indexes  文本索引

Using the FULLTEXT qualifier will create a text index in most popular relational databases. It can only be applied to text-type columns (CHAR, VARCHAR, or TEXT) in MySQL. Where things get interesting is using these indexes: MySQL provides a lot of functionality out of the box that starts to resemble what you’d expect from a modern text parsing / NLP library.
使用 FULLTEXT 限定符将在最流行的关系数据库中创建文本索引。它只能应用于 MySQL 中的文本类型列( CHARVARCHARTEXT )。事情变得有趣的是使用这些索引:MySQL 提供了许多开箱即用的功能,这些功能开始类似于您对现代文本解析/NLP 库的期望。

The main text searching keyword set in MySQL is MATCH()... AGAINST(). You can choose from a few different search methods, including:
MySQL中设置的主要文本搜索关键字是 MATCH()... AGAINST() 。您可以选择几种不同的搜索方法,包括:

  • Natural language search – no special operators, uses the built-in stopwords that you can customize. This is the default search type.
    自然语言搜索 - 没有特殊运算符,使用您可以自定义的内置停用词。这是默认的搜索类型。
  • Boolean search – uses a special query language that’s roughly analogous to RegEx (but very different).
    布尔搜索——使用一种特殊的查询语言,大致类似于正则表达式(但非常不同)。
  • Query expansion search – runs a regular natural language search, then another one using words from the returned rows, hence the “expansion.”
    查询扩展搜索 - 运行常规自然语言搜索,然后使用返回行中的单词进行另一次搜索,因此称为“扩展”。

Each of these methods has tradeoffs, none are one-size-fits-all.
这些方法中的每一种都需要权衡,没有一种方法是放之四海而皆准的。

There are other types of indexes in MySQL, like prefix indexes or descending indexes. For more complete info, check out the MySQL docs on indexes.
MySQL 中还有其他类型的索引,例如前缀索引或降序索引。有关更完整的信息,请查看有关索引的 MySQL 文档。

Creating indexes in MySQL
在 MySQL 中创建索引

MySQL uses pretty standard syntax for creating indexes:
MySQL 使用非常标准的语法来创建索引:

CREATE [type] INDEX index_name ON table_name (column_name)

Though the command can be simple, the amount of customization available is pretty impressive. Here’s the full roster of available options from the MySQL docs:
尽管命令可能很简单,但可用的定制量却相当令人印象深刻。以下是 MySQL 文档中可用选项的完整列表:

CREATE [UNIQUE | FULLTEXT | SPATIAL] INDEX index_name

    [index_type]

    ON tbl_name (key_part,...)

    [index_option]

    [algorithm_option | lock_option] ...

key_part: {col_name [(length)] | (expr)} [ASC | DESC]

index_option: {

    KEY_BLOCK_SIZE [=] value

  | index_type

  | WITH PARSER parser_name

  | COMMENT 'string'

  | {VISIBLE | INVISIBLE}

  | ENGINE_ATTRIBUTE [=] 'string'

  | SECONDARY_ENGINE_ATTRIBUTE [=] 'string'

}

index_type:

    USING {BTREE | HASH}

algorithm_option:

    ALGORITHM [=] {DEFAULT | INPLACE | COPY}

lock_option:

    LOCK [=] {DEFAULT | NONE | SHARED | EXCLUSIVE}

For example, if we wanted to create an index on the “most recent activity” column in our users table, we’d use the following options. Keep in mind that this column contains non-unique values, so we’ll skip the UNIQUE keyword when creating the index.
例如,如果我们想在用户表中的“最近活动”列上创建索引,我们将使用以下选项。请记住,该列包含非唯一值,因此我们在创建索引时将跳过 UNIQUE 关键字。

CREATE INDEX users_most_recent_activity

ON users ({most_recent_activity} DESC)

COMMENT ‘for querying by most recent activity’

This is on the simpler side. You can customize any of the following:
这是比较简单的一面。您可以自定义以下任意一项:

  • Type of index – as discussed above. You can select non-unique (no keyword), unique, fulltext, etc.
    索引类型——如上所述。您可以选择非唯一(无关键字)、唯一、全文等。
  • Index type – stored as a binary tree or hash.
    索引类型——存储为二叉树或散列。
  • Misc. options – key block size, special parsers, comments, etc.
    杂项。选项——关键块大小、特殊解析器、注释等。
  • Algorithms and locks used
    使用的算法和锁

To go deeper, check out our videos on indexes in MySQL:
要深入了解,请观看有关 MySQL 索引的视频:

Footnotes  脚注

  • 1 — Large read queries in production aren’t always common, but you can imagine that this comes up a lot in analytics.
    1 — 生产中的大型读取查询并不总是常见,但您可以想象这在分析中经常出现。

  • 2 — Well, technically ordered in the order in which you inserted them into the table.
    2 — 嗯,从技术上讲,按照您将它们插入表中的顺序进行排序。