Everything I know about good system design

I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are new to the industry. Another is the Twitter-optimized “you’re a terrible engineer if you ever store booleans in a database” clever trick¹. Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.

What is system design? In my view, if software design is how you assemble lines of code, system design is how you assemble services. The primitives of software design are variables, functions, classes, and so on. The primitives of system design are app servers, databases, caches, queues, event buses, proxies, and so on.

This post is my attempt to write down, in broad strokes, everything I know about good system design. A lot of the concrete judgment calls do come down to experience, which I can’t convey in this post. But I’m trying to write down what I can.

Recognizing good design

What does good system design look like? I’ve written before that it looks underwhelming. In practice, it looks like nothing going wrong for a long time. You can tell that you’re in the presence of good design if you have thoughts like “huh, this ended up being easier than I expected”, or “I never have to think about this part of the system, it’s fine”. Paradoxically, good design is self-effacing: bad design is often more impressive than good. I’m always suspicious of impressive-looking systems. If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).

I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design. I say “usually” because sometimes you do need complex systems. I’ve worked on many systems that earned their complexity. However, a complex system that works always evolves from a simple system that works. Beginning from scratch with a complex system is a really bad idea.

State and statelessness

The hard part about software design is state. If you’re storing any kind of information for any amount of time, you have a lot of tricky decisions to make about how you save, store and serve it. If you’re not storing information², your app is “stateless”. As a non-trivial example, GitHub has an internal API that takes a PDF file and returns an HTML rendering of it. That’s a real stateless service. Anything that writes to a database is stateful.

You should try to minimize the number of stateful components in any system. (In a sense this is trivially true, because you should try to minimize the number of all components in a system, but stateful components are particularly dangerous.) The reason you should do this is that stateful components can get into a bad state. Our stateless PDF-rendering service will safely run forever, as long as you’re doing broadly sensible things: e.g. running it in a restartable container so that if anything goes wrong it can be automatically killed and restored to working order. A stateful service can’t be automatically repaired like this. If your database gets a bad entry in it (for instance, an entry with a format that triggers a crash in your application), you have to manually go in and fix it up. If your database runs out of room, you have to figure out some way to prune unneeded data or expand it.

What this means in practice is having one service that knows about the state - i.e. it talks to a database - and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service. If you can, it’s worth doing this for the read logic as well, although I’m less absolutist about this. It’s sometimes better for services to do a quick read of the user_sessions table than to make a 2x slower HTTP request to an internal sessions service.

Databases

Since managing state is the most important part of system design, the most important component is usually where that state lives: the database. I’ve spent most of my time working with SQL databases (MySQL and PostgreSQL), so that’s what I’m going to talk about.

Schemas and indexes

If you need to store something in a database, the first thing to do is define a table with the schema you need. Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.

If you expect your table to ever be more than a few rows, you should put indexes on it. Try to make your indexes match the most common queries you’re sending (e.g. if you query by email and type, create an index with those two fields). Don’t index on every single thing you can think of, since each index adds write overhead.
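
As a rough illustration of matching an index to a query, here is a minimal sketch using Python’s built-in sqlite3 module. The users table, its columns, and the index name are all made up for the example, and SQLite’s query planner stands in for whatever MySQL or PostgreSQL would do.

```python
import sqlite3

# Hypothetical schema: the `users` table and its columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, type TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, type, name) VALUES (?, ?, ?)",
    [
        (f"user{i}@example.com", "admin" if i % 10 == 0 else "member", f"User {i}")
        for i in range(1000)
    ],
)

# The common query we want the index to match.
query = "SELECT id, name FROM users WHERE email = ? AND type = ?"
args = ("user42@example.com", "member")

def plan(sql, params):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

plan_before = plan(query, args)  # without an index: a scan of the whole table

# One composite index covering both columns the query filters on.
conn.execute("CREATE INDEX idx_users_email_type ON users (email, type)")

plan_after = plan(query, args)   # with the index: a search that names the index
```

Printing plan_before and plan_after is a cheap way to check that the index you added is the one the database actually uses.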

Bottlenecks

Accessing the database is often the bottleneck in high-traffic applications. This is true even when the compute side of things is relatively inefficient (e.g. Ruby on Rails running on a preforking server like Unicorn). That’s because complex applications need to make a lot of database calls - hundreds and hundreds for every single request, often sequentially (because you don’t know if you need to check whether a user is part of an organization until after you’ve confirmed they’re not abusive, and so on). How can you avoid getting bottlenecked?

When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory. Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn a single select id, name from table into one select id from table followed by a hundred select name from table where id = ? queries.
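
Here is a small sketch of the inner-loop trap next to the JOIN version, again using sqlite3 so it runs anywhere. The authors and posts tables are hypothetical, and the executed list exists only to count how many statements each approach sends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts (id, author_id, title) VALUES
        (1, 1, 'Queues'), (2, 2, 'Caches'), (3, 1, 'Indexes');
""")

executed = []  # every SQL statement sent, so the two approaches can be compared

def run(sql, params=()):
    """Execute a query and record that it was sent."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()

def titles_with_authors_n_plus_one():
    # One query for the posts, then one more query per post: the inner-loop trap.
    result = []
    for _post_id, author_id, title in run("SELECT id, author_id, title FROM posts ORDER BY id"):
        (name,) = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
        result.append((title, name))
    return result

def titles_with_authors_join():
    # A single query: the database does the stitching.
    return run(
        "SELECT posts.title, authors.name FROM posts "
        "JOIN authors ON authors.id = posts.author_id ORDER BY posts.id"
    )

slow = titles_with_authors_n_plus_one()
n_plus_one_queries = len(executed)  # one for the posts, plus one per post

executed.clear()
fast = titles_with_authors_join()
join_queries = len(executed)        # a single statement
```

Both functions return the same rows. With three posts the loop version sends four statements and the JOIN sends one, and the gap grows with every row.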

Every so often you do want to break queries apart. It doesn’t happen often, but I’ve run into queries that were ugly enough that it was easier on the database to split them up than to try to run them as a single query. I’m sure it’s always possible to construct indexes and hints such that the database can do it better, but the occasional tactical query-split is a tool worth having in your toolbox.

Send as many read queries as you can to database replicas. A typical database setup will have one write node and a bunch of read-replicas. The more you can avoid reading from the write node, the better - that write node is already busy enough doing all the writes. The exception is when you really, really can’t tolerate any replication lag (since read-replicas are always running at least a handful of ms behind the write node). But in most cases replication lag can be worked around with simple tricks: for instance, when you update a record but need to use it right after, you can fill in the updated details in-memory instead of immediately re-reading after a write.
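
The “fill in the updated details in-memory” trick might look something like the sketch below. The two dictionaries are stand-ins for a write node and a lagging read-replica, and read_user and update_user are hypothetical helpers, not any real database client.

```python
# Stand-ins for a write node and a read-replica that has not caught up yet.
primary = {1: {"id": 1, "name": "Ada", "plan": "free"}}
replica = {1: {"id": 1, "name": "Ada", "plan": "free"}}

def read_user(user_id):
    """Reads go to the replica, which may be slightly behind the primary."""
    return dict(replica[user_id])

def update_user(user_id, **changes):
    """Write to the primary and return the updated record built in memory."""
    current = read_user(user_id)
    primary[user_id] = {**primary[user_id], **changes}
    # Instead of re-reading (and possibly hitting a stale replica),
    # apply the same changes to the copy we already hold.
    return {**current, **changes}

updated = update_user(1, plan="pro")  # what the caller keeps using
stale = read_user(1)                  # the replica still shows the old plan
```

The caller carries on with updated and never notices the lag, while a naive re-read would have returned the old plan.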

Beware spikes of queries (particularly write queries, and particularly transactions). Once a database gets overloaded, it gets slow, which makes it more overloaded. Transactions and writes are good at overloading databases, because they require a lot of database work for each query. If you’re designing a service that might generate massive query spikes (e.g. some kind of bulk-import API), consider throttling your queries.
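
One simple form of throttling is to write in fixed-size batches with a pause between them. This is only a sketch: write_batch is a hypothetical callable that performs one bulk write, and the batch size and interval are numbers you would tune against your own database.

```python
import time

def import_rows(rows, write_batch, batch_size=100, min_interval=0.5, sleep=time.sleep):
    """Write rows in fixed-size batches, pausing between batches.

    `write_batch` is a hypothetical callable that performs one bulk write.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    batches = 0
    for start in range(0, len(rows), batch_size):
        if batches > 0:
            sleep(min_interval)  # give the database room between bursts
        write_batch(rows[start:start + batch_size])
        batches += 1
    return batches
```

A bulk-import endpoint would typically call this from a background job, so a slow, paced import does not hold a user request open.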

Slow operations, fast operations

A service has to do some things fast. If a user is interacting with something (say, an API or a web page), they should see a response within a few hundred ms³. But a service has to do other things that are slow. Some operations just take a long time (converting a very large PDF to HTML, for instance). The general pattern for this is splitting out the minimum amount of work needed to do something useful for the user and doing the rest of the work in the background. In the PDF-to-HTML example, you might render the first page to HTML immediately and queue up the rest in a background job.

What’s a background job? It’s worth answering this in detail, because “background jobs” are a core system design primitive. Every tech company will have some kind of system for running background jobs. There will be two main components: a collection of queues, e.g. in Redis, and a job runner service that will pick up items from the queues and execute them. You enqueue a background job by putting an item like {job_name, params} on the queue. It’s also possible to schedule background jobs to run at a set time (which is useful for periodic cleanups or summary rollups). Background jobs should be your first choice for slow operations, because they’re typically such a well-trodden path.
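
To make the two components concrete, here is a toy version of the queue and the job runner. An in-memory deque stands in for the Redis queue, and the job name, parameters, and helper functions are all invented for the example. Real job systems add retries, priorities, and multiple workers on top of this shape.

```python
import json
from collections import deque

queue = deque()    # in-memory stand-in for a Redis list
JOB_HANDLERS = {}  # job_name -> function
rendered = []      # records work done, for the demo only

def job(fn):
    """Register a function as a runnable background job."""
    JOB_HANDLERS[fn.__name__] = fn
    return fn

def enqueue(job_name, **params):
    # Jobs are serialized, as they would be on a real queue.
    queue.append(json.dumps({"job_name": job_name, "params": params}))

@job
def render_remaining_pages(document_id, first_page):
    rendered.append((document_id, first_page))

def run_pending_jobs():
    """What a job runner does in a loop: pop an item, look up the handler, run it."""
    count = 0
    while queue:
        item = json.loads(queue.popleft())
        JOB_HANDLERS[item["job_name"]](**item["params"])
        count += 1
    return count

# The request handler does the minimum, then hands off the slow part.
enqueue("render_remaining_pages", document_id=7, first_page=2)
jobs_run = run_pending_jobs()
```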

Sometimes you want to roll your own queue system. For instance, if you want to enqueue a job to run in a month, you probably shouldn’t put an item on the Redis queue. Redis persistence is typically not guaranteed over that period of time (and even if it is, you likely want to be able to query for those far-future enqueued jobs in a way that would be tricky with the Redis job queue). In this case, I typically create a database table for the pending operation with columns for each param plus a scheduled_at column. I then use a daily job to find the rows with scheduled_at <= today, run them, and either delete them or mark them as complete once the job has finished.
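
A sketch of that table and its daily job, using sqlite3 for portability. The pending operation here (a delayed account deletion) and every table and column name other than scheduled_at are assumptions for the sake of the example. This version marks rows complete rather than deleting them.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_account_deletions (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        scheduled_at TEXT NOT NULL,  -- ISO date string: compares correctly as text
        completed_at TEXT
    )
""")
conn.execute(
    "CREATE INDEX idx_pending_scheduled_at ON pending_account_deletions (scheduled_at)"
)

def schedule(account_id, run_on):
    conn.execute(
        "INSERT INTO pending_account_deletions (account_id, scheduled_at) VALUES (?, ?)",
        (account_id, run_on.isoformat()),
    )

def run_due(today, handler):
    """The daily job: run everything that is due and mark each row complete."""
    due = conn.execute(
        "SELECT id, account_id FROM pending_account_deletions "
        "WHERE scheduled_at <= ? AND completed_at IS NULL ORDER BY id",
        (today.isoformat(),),
    ).fetchall()
    for row_id, account_id in due:
        handler(account_id)
        conn.execute(
            "UPDATE pending_account_deletions SET completed_at = ? WHERE id = ?",
            (today.isoformat(), row_id),
        )
    return [account_id for _, account_id in due]

schedule(1, date(2025, 7, 1))
schedule(2, date(2025, 8, 1))
handled = []
ran_july = run_due(date(2025, 7, 15), handled.append)   # only account 1 is due
ran_again = run_due(date(2025, 7, 16), handled.append)  # nothing new is due
```

Because the pending work is ordinary rows, you can also answer questions like “what is scheduled for next month?” with a plain query.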

Caching

Sometimes an operation is slow because it needs to do an expensive (i.e. slow) task that’s the same between users. For instance, if you’re calculating how much to charge a user in a billing service, you might need to do an API call to look up the current prices. If you’re charging users per-use (like OpenAI does per-token), that could (a) be unacceptably slow and (b) cause a lot of traffic for whatever service is serving the prices. The classic solution here is caching: only looking up the prices every five minutes, and storing the value in the meantime. It’s easiest to cache in-memory, but using some fast external key-value store like Redis or Memcached is also popular (since it means you can share one cache across a bunch of app servers).
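
The in-memory version of the five-minute price cache can be very small. In this sketch, TTLCache is a made-up helper and fetch stands for whatever slow call looks up the current prices. The clock is injectable only so the expiry logic can be tested without waiting.

```python
import time

class TTLCache:
    """Cache one value in memory and refresh it after `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch       # the slow lookup, e.g. a pricing API call
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None   # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Usage would be along the lines of prices = TTLCache(fetch_current_prices) at startup and prices.get() on each billing calculation, where fetch_current_prices is your own function. Note that each app server holds its own copy, which is exactly the limitation that pushes people towards a shared store.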

The typical pattern is that junior engineers learn about caching and want to cache everything, while senior engineers want to cache as little as possible. Why is that? It comes down to the first point I made about the danger of statefulness. A cache is a source of state. It can get weird data in it, or get out-of-sync with the actual truth, or cause mysterious bugs by serving stale data, and so on. You should never cache something without first making a serious effort to speed it up. For instance, it’s silly to cache an expensive SQL query that isn’t covered by a database index. You should just add the database index!

I use caching a lot. One useful caching trick to have in the toolbox is using a scheduled job and a document storage like S3 or Azure Blob Storage as a large-scale persistent cache. If you need to cache the result of a really expensive operation (say, a weekly usage report for a large customer), you might not be able to fit the result in Redis or Memcached. Instead, stick a timestamped blob of the results in your document storage and serve the file directly from there. Like the database-backed long-term queue I mentioned above, this is an example of using the caching idea without using a specific cache technology.

Events

As well as some kind of caching infrastructure and background job system, tech companies will typically have an event hub. The most common implementation of this is Kafka. An event hub is just a queue - like the one for background jobs - but instead of putting “run this job with these params” on the queue, you put “this thing happened” on the queue. One classic example is firing off a “new account created” event for each new account, and then having multiple services consume that event and take some action: a “send a welcome email” service, a “scan for abuse” service, a “set up per-account infrastructure” service, and so on.
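
The shape of the “new account created” example, reduced to a toy: one producer publishes a fact, and several consumers each react in their own way. This is an in-process, synchronous sketch with invented names. A real event hub like Kafka is durable and asynchronous, and its consumers run as separate services.

```python
from collections import defaultdict

subscribers = defaultdict(list)  # event name -> list of consumer callbacks
actions = []                     # records what consumers did, for the demo only

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    """The producer states what happened; it does not know who is listening."""
    for handler in subscribers[event_name]:
        handler(payload)
    return len(subscribers[event_name])

# Three independent consumers of the same event.
subscribe("account.created", lambda e: actions.append(("welcome_email", e["account_id"])))
subscribe("account.created", lambda e: actions.append(("abuse_scan", e["account_id"])))
subscribe("account.created", lambda e: actions.append(("provision_infra", e["account_id"])))

delivered = publish("account.created", {"account_id": 42})
```

Adding a fourth consumer means one more subscribe call and no change to the code that creates accounts, which is the main appeal.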

You shouldn’t overuse events. Much of the time it’s better to just have one service make an API request to another service: all the logs are in the same place, it’s easier to reason about, and you can immediately see what the other service responded with. Events are good for when the code sending the event doesn’t necessarily care what the consumers do with the event, or when the events are high-volume and not particularly time-sensitive (e.g. abuse scanning on each new Twitter post).

Pushing and pulling

When you need data to flow from one place to a lot of other places, there are two options. The simplest is to pull. This is how most websites work: you have a server that owns some data, and when a user wants it they make a request (via their browser) to the server to pull that data down to them. The problem here is that users might do a lot of pulling down the same data - e.g. refreshing their email inbox to see if they have any new emails, which will pull down and reload the entire web application instead of just the data about the emails.

The alternative is to push. Instead of allowing users to ask for the data, you allow them to register as clients, and then when the data changes, the server pushes the data down to each client. This is how GMail works: you don’t have to refresh the page to get new emails, because they’ll just appear when they arrive.

If we’re talking about background services instead of users with web browsers, it’s easy to see why pushing can be a good idea. Even in a very large system, you might only have a hundred or so services that need the same data. For data that doesn’t change much, it’s much easier to make a hundred HTTP requests (or RPC, or whatever) whenever the data changes than to serve up the same data a thousand times a second.

Suppose you did need to serve up-to-date data to a million clients (like GMail does). Should those clients be pushing or pulling? It depends. Either way, you won’t be able to run it all from a single server, so you’ll need to farm it out to other components of the system. If you’re pushing, that will likely mean sticking each push on an event queue and having a horde of event processors each pulling from the queue and sending out your pushes. If you’re pulling, that will mean standing up a bunch (say, a hundred) of fast⁴ read-replica cache servers that will sit in front of your main application and handle all the read traffic⁵.

Hot paths

When you’re designing a system, there are lots of different ways users can interact with it or data can flow through it. It can get a bit overwhelming. The trick is to mainly focus on the “hot paths”: the part of the system that is most critically important, and the part of the system that is going to handle the most data. For instance, in a metered billing system, those pieces might be the part that decides whether or not a customer gets charged, and the part that needs to hook into all user actions on the platform to identify how much to charge.

Hot paths are important because they have fewer possible solutions than other design areas. There are a thousand ways you can build a billing settings page and they’ll all mainly work. But there might be only a handful of ways that you can sensibly consume the firehose of user actions. Hot paths also go wrong more spectacularly. You have to really screw up a settings page to take down the entire product, but any code you write that’s triggered on all user actions can easily cause huge problems.

Logging and metrics

How do you know if you’ve got problems? One thing I’ve learned from my most paranoid colleagues is to log aggressively during unhappy paths. If you’re writing a function that checks a bunch of conditions to see if a user-facing endpoint should respond 422, you should log out the condition that was hit. If you’re writing billing code, you should log every decision made (e.g. “we’re not billing for this event because of X”). Many engineers don’t do this because it adds a bunch of logging boilerplate and makes it hard to write beautifully elegant code, but you should do it anyway. You’ll be happy you did when an important customer is complaining that they’re getting a 422 - even if that customer did something wrong, you still need to figure out what they did wrong for them.
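
Here is roughly what “log the condition that was hit” looks like with Python’s standard logging module. The request fields, the reason strings, and the supported currencies are made up for the example.

```python
import logging

logger = logging.getLogger("billing")

def validate_charge_request(request):
    """Return (status_code, reason). Every rejection logs which condition was hit."""
    checks = [
        ("missing_customer_id", lambda r: not r.get("customer_id")),
        (
            "non_positive_amount",
            lambda r: not isinstance(r.get("amount_cents"), int) or r["amount_cents"] <= 0,
        ),
        ("unsupported_currency", lambda r: r.get("currency") not in {"usd", "eur"}),
    ]
    for reason, failed in checks:
        if failed(request):
            # The unhappy path is where the log line earns its keep.
            logger.warning(
                "rejecting charge request: reason=%s customer_id=%s",
                reason,
                request.get("customer_id"),
            )
            return 422, reason
    return 200, None
```

It is not elegant, but when a customer reports a 422 you can search the logs for their customer_id and read off the reason.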

You should also have basic observability into the operational parts of the system. That means CPU/memory on the hosts or containers, queue sizes, average time per-request or per-job, and so on. For user-facing metrics like time per-request, you also need to watch the p95 and p99 (i.e. how slow your slowest requests are). Even one or two very slow requests are scary, because they’re disproportionately from your largest and most important users. If you’re just looking at averages, it’s easy to miss the fact that some users are finding your service unusable.
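
A quick worked example of why the average hides slow requests. The timings are invented (97 requests at 100ms, 3 at 5000ms), and percentile uses the simple nearest-rank definition. Your metrics system may interpolate differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request timings in milliseconds: 97 fast requests and 3 very slow ones.
latencies_ms = [100] * 97 + [5000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

With these numbers the mean is 247ms and both the p50 and p95 are 100ms, all of which look healthy. Only the p99, at 5000ms, shows that some users are waiting five seconds.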

Killswitches, retries, and failing gracefully

I wrote a whole post about killswitches that I won’t repeat here, but the gist is that you should think carefully about what happens when the system fails badly.

Retries are not a magic bullet. You need to make sure you’re not putting extra load on other services by blindly retrying failed requests. If you can, put high-volume API calls inside a “circuit breaker”: if you get too many 5xx responses in a row, stop sending requests for a while to let the service recover. You also need to make sure you’re not retrying write events that may or may not have succeeded (for instance, if you send a “bill this user” request and get back a 5xx, you don’t know if the user has been billed or not). The classic solution to this is to use an “idempotency key”, which is a special UUID in the request that the other service uses to avoid re-running old requests: every time they do something, they save the idempotency key, and if they get another request with the same key, they silently ignore it.
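
Both ideas fit in a short sketch. The CircuitBreaker below trips after a run of consecutive failures and refuses calls until a cooldown has passed. BillingService is a hypothetical receiver that remembers idempotency keys. All names, thresholds, and the choice to return the stored result on a repeat (rather than ignoring it outright) are assumptions for illustration.

```python
import time
import uuid

class CircuitOpen(Exception):
    """Raised instead of calling a service that is currently failing."""

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # when the breaker last tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpen("skipping call while the service recovers")
            self.opened_at = None  # cooldown over: allow a trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result

class BillingService:
    """Hypothetical receiver that remembers idempotency keys it has already handled."""

    def __init__(self):
        self.charges = []
        self.seen = {}  # idempotency key -> stored result

    def bill_user(self, user_id, amount_cents, idempotency_key):
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]  # repeat request: do not charge again
        self.charges.append((user_id, amount_cents))
        result = {"charged": amount_cents}
        self.seen[idempotency_key] = result
        return result

billing = BillingService()
key = str(uuid.uuid4())  # generated once per logical operation, reused on every retry
first = billing.bill_user(1, 500, key)
retry = billing.bill_user(1, 500, key)  # safe: the user is still charged only once
```

The important detail on the caller’s side is that the key is created before the first attempt and reused on every retry. Generating a fresh key per attempt defeats the whole mechanism.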

It’s also important to decide what happens when part of your system fails. For instance, say you have some rate limiting code that checks a Redis bucket to see if a user has made too many requests in the current window. What happens when that Redis bucket is unavailable? You have two options: fail open and let the request through, or fail closed and block the request with a 429.

Whether you should fail open or closed depends on the specific feature. In my view, a rate limiting system should almost always fail open. That means that a problem with the rate limiting code isn’t necessarily a big user-facing incident. However, auth should (obviously) always fail closed: it’s better to deny a user access to their own data than to give a user access to some other user’s data. There are a lot of cases where it’s not clear what the right behavior is. It’s often a difficult tradeoff.
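
In code, the difference between the two policies is usually one line in an exception handler. In this sketch, store is a hypothetical client for whatever holds the counters or sessions (Redis, say), and StoreUnavailable is an invented exception it raises when it can’t be reached.

```python
class StoreUnavailable(Exception):
    """Raised by the (hypothetical) backing store when it cannot be reached."""

def allow_request(user_id, store, limit=100):
    """Rate-limit check that fails open when the counter store is down."""
    try:
        count = store.increment(user_id)
    except StoreUnavailable:
        return True   # fail open: a limiter outage should not block users
    return count <= limit

def is_authorized(token, store):
    """Auth check that fails closed when the session store is down."""
    try:
        return store.lookup(token) is not None
    except StoreUnavailable:
        return False  # fail closed: never grant access on uncertainty
```

Writing the except branch explicitly forces the decision to be made on purpose, instead of being whatever an unhandled exception happens to do.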

Final thoughts

There are some topics I’m deliberately not covering here. For instance, whether or when to split your monolith out into different services, when to use containers or VMs, tracing, good API design. Partly this is because I don’t think it matters that much (in my experience, monoliths are fine), or because I think it’s too obvious to talk about (you should use tracing), or because I just don’t have the time (API design is complicated).

The main point I’m trying to make is what I said at the start of this post: good system design is not about clever tricks, it’s about knowing how to use boring, well-tested components in the right place. I’m not a plumber, but I imagine good plumbing is similar: if you’re doing something too exciting, you’re probably going to end up with crap all over yourself.

Especially at large tech companies, where these components already exist off the shelf (i.e. your company already has some kind of event bus, caching service, etc), good system design is going to look like nothing. There are very, very few areas where you want to do the kind of system design you could talk about at a conference. They do exist! I have seen hand-rolled data structures make features possible that wouldn’t have been possible otherwise. But I’ve only seen that happen once or twice in ten years. I see boring system design every single day.

¹ You’re supposed to store timestamps instead, and treat the presence of a timestamp as true. I do this sometimes but not always - in my view there’s some value in keeping a database schema immediately-readable.

² Technically any service stores information of some kind for some duration, at least in-memory. Typically what’s meant here is storing information outside of the request-response lifecycle (e.g. persistently on-disk somewhere, such as in a database). If you can stand up a new version of the app by simply spinning up the application server, that’s a stateless app.

³ Gamedevs on Twitter will say that anything slower than 10ms is unacceptable. Whether that ought to be the case, it’s just factually not true about successful tech products - users will accept slower responses if the app is doing something that’s useful to them.

⁴ They’re fast because they don’t have to talk to a database in the way the main server does. In theory, this could just be a static file on-disk that they serve up when asked, or even data held in-memory.

⁵ Incidentally, those cache servers will either poll your main server (i.e. pulling) or your main server will send the new data to them (i.e. pushing). I don’t think it matters too much which you do. Pushing will give you more up-to-date data but pulling is simpler.


June 21, 2025 │ Tags: good engineers, software design