This article presents detailed first-principles reasoning about large language model inference performance, with no experiments or difficult math. The amount of understanding that can be acquired this way is really impressive and practical! A very simple model of latency for inference turns out to be a good fit for empirical results. It's helped me make better predictions and form better explanations about transformer inference.

This post assumes some prior knowledge about transformers, say having understood most of The Illustrated Transformer but not having internalised all of it. Familiarity with this parameter counting post, which I developed along with this one, may also be useful.

Table of Contents

  • kv cache explains the performance improvement of caching self-attention vectors as a part of inferencing, as well as the possible tradeoffs and capacity costs
  • capacity takes the storage cost of kv cache and connects it to the storage cost of model weights and what capacity means for performance.
  • model parallelism builds up an understanding specifically of tensor parallelism to clearly identify the cost of communication
  • latency calculations pulls understanding from other concepts to create equations that serve as floorlines for inference speed.
  • batch sizes discusses what impact batch size has on performance and what sizes may be optimal.
  • flops counting steps through the transformer blocks and identifies which operations meaningfully contribute to flops speed.
  • intermediate memory costs covers how the activations take additional memory and what those memory bandwidth costs look like from some real benchmarks.
  • comparing against real benchmarks compares what we can calculate to what Nvidia FasterTransformer benchmarks report and identifies the discrepancies.

kv cache

For sampling, transformer inference consists of processing a provided prompt/context (which can happen in parallel), and then sampling additional tokens one by one (this is where the autoregressiveness surfaces). In the sampling, the transformer performs self-attention, which requires the kv values for each item currently in the sequence (whether it was prompt/context or a generated token). These vectors are provided as a matrix known as the kv cache, aka past cache (the open source GPT-2 implementation called it past). The past cache would be shaped like [batch, 2, num_heads, seq_len, features].

The purpose of this is to avoid recalculations of those vectors every time we sample a token. With the computed k, v values, we can save quite a bit of computation at the cost of some storage. Per token, the number of bytes we store is

2 \cdot 2 \cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head}

The first factor of 2 is to account for the two vectors, k and v. We store that per layer, and each of those values is a n_\text{heads} \times d_\text{head} matrix. Then multiply by 2 again for the number of bytes (we'll assume 16-bit formats throughout the post).
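
As a quick sanity check, here's a minimal Python sketch of that per-token byte count (the function name and the exact 64-heads-of-128 split are illustrative assumptions; only n_heads · d_head = 8192 matters):

```python
def kv_cache_bytes_per_token(n_layers, n_heads, d_head, bytes_per_value=2):
    # two vectors (k and v) per layer, each with n_heads * d_head values,
    # at 2 bytes per value for 16-bit formats
    return 2 * n_layers * n_heads * d_head * bytes_per_value

# 52B-style dimensions: n_layers = 64 and n_heads * d_head = d_model = 8192
print(kv_cache_bytes_per_token(n_layers=64, n_heads=64, d_head=128))  # 2,097,152 bytes, ~0.002 GB
```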

The weights that we multiply by the token embeddings are W_\text{k}, W_\text{v} \in \mathbb{R}^{d_\text{model}\times d_\text{model}} and then each token embedding is t_\text{e} \in \mathbb{R}^{1\times d_\text{model}}. So then the flops to compute k and v for all our layers is

2 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2

We multiply t_\text{e} by W_\text{k}, which takes 2 \cdot {d_\text{model}}^2 flops. We have another factor of 2 as we do that twice, once each for k and v, and then repeat for n_\text{layers}.

How many flops in a matmul?

The computation for a matrix-vector multiplication is 2mn for A \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n}. A matrix-matrix is 2mnp for A \in \mathbb{R}^{m\times n}, B \in \mathbb{R}^{n \times p}. The mn factor makes a lot of sense, and the 2 comes from the fact that matmuls are composed of multiply(1)-add(2) operations. More in these lecture notes.
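
A tiny sketch of those rules, reused to reproduce the kv flops number that follows (helper names are made up for illustration):

```python
def matvec_flops(m, n):
    # A (m x n) times b (n,): one multiply and one add per element of A
    return 2 * m * n

def matmul_flops(m, n, p):
    # A (m x n) times B (n x p)
    return 2 * m * n * p

# k and v across all 64 layers of the 52B example that follows: 2 * 2 * 64 * 8192^2
print(2 * 64 * matvec_flops(8192, 8192))  # 17,179,869,184
```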

This means for a 52B parameter model (taking Anthropic's, where d_\text{model} = 8192 and n_\text{layers} = 64), the flops are

2 \cdot 2 \cdot 64 \cdot 8192^2 = 17,179,869,184

Say we have an A100 GPU, which does 312\text{e}12 flops per second and 1.5\text{e}12 bytes per second of memory bandwidth. The following are numbers for just the kv weights and computations.

\text{memory} = 2 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 1.5\text{e}12 \\
\text{compute} = 2 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12

Flops vs Memory Boundedness

Flops vs memory boundedness is something we deal with a lot for transformer inference, but also in deep learning optimisation in general. To do the computations we do, we need to load weights, which costs memory bandwidth. We assume (correctly, this has been very well optimised) that we can start the computations while we load the weights. Flops bound would then mean that there is time when nothing is being passed through memory, and memory bound would mean that no floperations are occurring. Nvidia uses the term math bandwidth which I find really cute. Technically, this delineation exists per kernel but can be abstracted to exist for groups of operations.

None of the model architecture matters anymore: we get a distinct ratio here of 208 given this hardware specification. This means that if we're going to compute kv for one token, it'll take the same amount of time to compute for up to 208 tokens! Anything below, we're memory bandwidth bound. Above, flops bound. If we used the rest of our weights to do a full forwards pass (run the rest of the transformer) on our context, it's also 208 (both the numerator and denominator get a factor of 6). This will be reasoned through thoroughly in future sections. The intersection of the below diagram is at 208, though in reality the memory line does have a slight slope due to memory cost of intermediate calculations (discussed in the last section).
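
Here's a minimal sketch of that crossover using the A100 numbers above (variable names are just for illustration):

```python
A_f = 312e12   # A100 flops per second
A_bm = 1.5e12  # A100 memory bandwidth, bytes per second

n_layers, d_model = 64, 8192  # 52B-style dimensions

kv_bytes = 2 * 2 * n_layers * d_model**2   # weight bytes we have to load either way
kv_flops = 2 * 2 * n_layers * d_model**2   # flops per token to compute k and v

def memory_time_s():
    return kv_bytes / A_bm

def compute_time_s(n_tokens):
    return n_tokens * kv_flops / A_f

print(A_f / A_bm)                            # 208, the crossover in tokens
print(memory_time_s(), compute_time_s(208))  # roughly equal at 208 tokens
```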

For a 52B model full forwards pass, that's 12 \cdot 2 \cdot n_\text{layers} \cdot {d_\text{model}}^2 / 1.5\text{e}12 \approx 69 milliseconds for up to 208 tokens (in practice, we'd use four GPUs in parallel so it would actually be ~17 milliseconds, more in following sections). If we had 416 (double) tokens in the context, then it would take twice as long, and 312 tokens would take 1.5 times as long.

Calculating for a kv cache token is exactly 1/6th of the compute of passing the token through the model. In general, these forwards passes (what we experience in getting logits, embeddings and training) are very cheap because of the parallelism that is possible as opposed to sampling where we're forced to read through all the weights for each token and do the autoregression.

This doesn't mean that 1/6th of the time is saved! Let's assume we are flops bound. Then at each sample step, we save 2 \cdot 2 \cdot n_\text{tokens} \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12 flops while the decoding step costs 2 \cdot 12 \cdot n_\text{layers} \cdot {d_\text{model}}^2 \div 312\text{e}12. Thus at each step we save 1/6 of the flops time multiplied by the number of tokens in our sequence (big!), which increases as we sample tokens. It is the case that without a kv cache, sampling would be quadratic in time complexity as we increase the number of tokens.
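
A rough sketch of that scaling argument, counting only the kv-related flops and ignoring everything else (a simplification, not the full cost model):

```python
n_layers, d_model = 64, 8192

kv_flops_per_token = 2 * 2 * n_layers * d_model**2  # compute k and v for one token, all layers

def total_kv_flops(n_steps, use_cache=True):
    if use_cache:
        # with the cache we only compute k and v for the newly sampled token
        return n_steps * kv_flops_per_token
    # without it we recompute k and v for every token in the sequence, every step
    return sum(t * kv_flops_per_token for t in range(1, n_steps + 1))

print(total_kv_flops(2048, use_cache=True))   # linear in the number of steps
print(total_kv_flops(2048, use_cache=False))  # roughly quadratic in the number of steps
```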

This is not the whole story (given overheads and tradeoffs associated with storing this cache). If we're serving small batches we may be memory bandwidth bound rather than flops, in which case we won't even want to use the past cache and will instead happily do recomputations, spending the flops (we'll already be paying the memory cost to do our sampling).

capacity

We have a solid idea of the two things we store in our GPUs — kv cache and weights. GPU capacity does come into play for transformer inferencing performance and we have all the understanding we need to evaluate that now!

Nvidia A100 GPUs (which are, generally speaking, the best GPUs we can get for inference) have a standard of 40GB of capacity. There are ones with 80GB and higher memory bandwidth (2e12 instead of 1.5e12) but they aren't available with any large cloud providers yet, which means they aren't real to me!

Given the parameter count, we can multiply by two to get bytes. So to calculate the size of the weights for a 52B model;

52\text{e}9 \cdot 2 = 104\text{e}9 \text{ bytes} \approx 104\text{GB}

Oh no! This doesn't fit in one GPU! We'd need at least three GPUs just to have all the weights loaded in (will discuss how to do that sharding later). That leaves us 120 - 104 = 16GB for our kv cache. Is that enough? Back to our equation for kv cache memory per token, again with a 52B model;

2 \cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head} \cdot 2 \\
= 4 \cdot n_\text{layers} \cdot n_\text{heads} \cdot d_\text{head} \\
= 4 \cdot 64 \cdot 8192 \\
= 2,097,152 \text{ bytes} \approx 0.002 \text{GB}
And then 16/0.002 \approx 8000 tokens can fit into our kv cache with this GPU setup, or we could do up to a batch size of 4 where each request has up to 2048 tokens (and higher batch sizes for fewer tokens).
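
A small sketch of that capacity accounting (function name and defaults are illustrative; 40GB per GPU as above):

```python
param_count = 52e9
bytes_per_param = 2               # 16-bit weights
kv_bytes_per_token = 2_097_152    # from the kv cache section

def kv_tokens_that_fit(n_gpus, gpu_capacity_bytes=40e9):
    weight_bytes = param_count * bytes_per_param            # ~104GB
    leftover = n_gpus * gpu_capacity_bytes - weight_bytes
    return leftover / kv_bytes_per_token

print(kv_tokens_that_fit(3))  # ~7,600 tokens, i.e. roughly batch size 4 at 2048 tokens each
print(kv_tokens_that_fit(4))  # ~26,700 tokens
```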

This sucks because we would like to be able to do higher batch sizes, but are capacity limited! Higher batch sizes are more efficient in terms of how much GPU time it takes to process the same request. On the other hand, at batch sizes this low we're bound to be memory bound, and should forego the kv cache and just pay the flops cost instead.

For four GPUs, we'd get 56/0.002 \approx 28000. We definitely want to go for the four GPUs since we'll want to be able to do higher batch sizes, and it's silly to divide powers of two over three GPUs. But it's not just batch size! If we have high volume, then we'd have multiple instances of our models. We approximately want each instance to be able to do as large a batch size as possible, as we pay the cost of storing the weights anyway.

There's some extra space used by intermediate calculation steps, but they're negligible.

model parallelism

I'm not going to build up a full understanding of model parallelism and all the implementation details, because many have done so. But we will build out the parts of the understanding that are useful for making performance decisions and calculating communication costs!

The outcome of model parallelism is that the cost of passing all the weights through memory and the flops are all divided over the degree (number of accelerators we use).

We will assume tensor parallel (model parallel) where we will split down the middle of the model. Each accelerator will execute as much as it can with its shards of the weights and will communicate whenever synchronisation is required. A more naive way is pipeline parallel, where each GPU will hold onto a fraction of the layers. This does successfully even out the weight loading cost, but has the obvious silliness that all but one GPU will be idling! In training you could pipeline through it (as the first batch moves onto the next GPU, start on a new batch on the first GPU) but it doesn't work out for a single sample request (though you could still do it for multiple requests). Pipeline also doesn't exhaust the memory bandwidth, which is actually ok if you're flops bound anyway. The only place where pipeline parallel does better is communications. A pipeline parallel model would communicate d_\text{model} per accelerator, while a model parallel one does N \cdot d_\text{model} per layer where N is the number of accelerators.

Here we introduce the last constant for our A100 GPUs, which is a communication bandwidth of 300GB/s. The doc marks it as 600GB/s because Nvidia is adding up 300GB/s into each chip and 300GB/s out simultaneously, rather than using a bidirectional number (which is more intuitive for our calculations).

In this diagram, we start by following the yellow brick road where we insert our token embeddings into the bottom of the model. The purple boxes outline how our weights would be split across the accelerators, and we work with an extremely tiny model so we can draw everything to scale. A general idea is that if we have two matrices X and Y we can shard both of them and multiply the shards. This doesn't actually complete the matmul of X \cdot Y, and an easy way to tell (other than our ability to multiply matrices) is that if we concatenated the result of multiplying the shards, we get too big of a matrix. Instead, we would want to communicate, compute a shard sum, communicate that sum back out and then concatenate for the output of X \cdot Y.

For attention the parallelism is intuitive from the fact that we have multiple heads. We go through most of the attention layer without communication because our attention heads are concatenated to multiply by W_o. After we multiply by v, we multiply the result by our shard of W_o to get a shard of o_s \in \mathbb{R}^{d_\text{model}\times n_\text{heads}/N}. Then each accelerator will communicate its own shard to all the others, and all the others will communicate their shards back. This is (N-1)d_\text{model}/N of comms cost. Each accelerator will do an even share of the addition to get the output projection, then do the same communication they did last time and the individual hosts will do the concatenation (approximately instant).

The MLP layer is by nature very similar! Just like we have W_o to project our multi-headed attention results back down to a vector of length d_\text{model}, we have W_1 \in \mathbb{R}^{4\times d_\text{model}} and W_2 \in \mathbb{R}^{d_\text{model}\times 4} to make a dimension 4 times larger and then project it back down. The same two communications are done at the end of the MLP.

Ultimately we do 4 \cdot (N - 1) d_\text{model}/N bytes of communication. The kv cache is split across heads by GPU.

latency calculations

We've discussed the capacity fairly thoroughly, mapped out comms in the model parallelism section and discussed general compute steps. Now we'll build it into equations that estimate latency!

Our latency calculations are mostly about the flops vs memory boundedness. If we have a small number of multiplies to do per parameter, then maybe we'll be throttled by memory bandwidth. Flops are increased by both batch size and number of parameters, while memory is only increased by number of parameters.

For comms, it's not about boundedness, but rather about adding a latency term and a throughput term (the 300GB/s). Something tricky about the latency side of this figure is that it's not reported, so the best I can do is guess "approximately small", which is approximately 8 microseconds per message sent as found in this Citadel paper, but that's for V100 NVLink.

Because of the compute factors, to calculate the latency of a single token decoding step we'd have two formulas - one for memory bandwidth bound (small batch) and another for flops bound (large batch). For large batch, we'll drop the latency factor for communications.

Equations for a small batch (say 1, so we can drop the batch factor) would be; (where N is the number of accelerators and P is the number of parameters and b is "byte" as a unit)

\text{compute} = \frac{2 \cdot P \cdot \text{b}}{N \cdot A_\text{bm} \text{b}\ \text{s}^{-1}} \\
\text{comms} = 4 \cdot n_\text{layers} \cdot 8\mu\text{s}
There is 2 \cdot P because we need to pass all the parameters through the memory, and each parameter is two bytes. A_\text{bm} is the accelerator memory bandwidth, and this cost is split across accelerators. For comms, we have 4 \cdot n_\text{layers} communications (four per layer), each paying the per-message latency. Comms will usually come out to be relatively small so for the compute bound case we won't need to pay attention to it anyway. There's also a throughput cost in comms which also rounds away.
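
Here's a sketch of those small-batch formulas in Python (function names are made up; the 52B-on-4-GPUs numbers come from earlier in the post):

```python
A_bm = 1.5e12                 # accelerator memory bandwidth, bytes per second
per_message_latency = 8e-6    # the rough NVLink latency guess from above

def small_batch_compute_s(P, n_gpus):
    # load all 2 * P bytes of weights, split across the accelerators
    return 2 * P / (n_gpus * A_bm)

def small_batch_comms_s(n_layers):
    # four communications per layer, each paying the per-message latency
    return 4 * n_layers * per_message_latency

print(small_batch_compute_s(52e9, 4))  # ~0.017 s, the ~17ms figure from earlier
print(small_batch_comms_s(64))         # ~0.002 s
```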

There's another sometimes-significant factor here, which is the read time for the kv cache, which I'll leave out of the equation now since it depends on the number of context tokens (which can even vary within a batch) and the total number of tokens we want to sample. This would be calculated as memory bandwidth time. Another missing memory bandwidth time is the read of the unembeddings to calculate logits at each sampling step, which is \in \mathbb{R}^{d_\text{model}\times n_\text{vocab}}.

As previously mentioned, the memory does not actually stay constant, rather some additional memory is used per batch for intermediate activations. The reason we don't factor this in is simply because it's hard to count as it varies a lot by the software stack, compiler optimisations, etc.

For large batches (say 512), where B is the batch size;

\text{compute} = B \cdot \frac{2 \cdot P \cdot \text{FLOP}}{N \cdot A_f \text{FLOP}\ \text{s}^{-1}} \\
\text{comms} = B \cdot \frac{2 \cdot n_\text{layers} \cdot 4 \cdot d_\text{model} \cdot \text{b}}{A_c \text{b}\ \text{s}^{-1}}

Where A_f is the flops of the accelerator and A_c is the comms bandwidth. We do 2 \cdot P flops of operations, which can be intuited by the fact that we matmul through all the parameters, and as mentioned earlier, a matrix-vector multiplication is 2mn given A \in \mathbb{R}^{m\times n}, b \in \mathbb{R}^{n}.

For comms, we see the four (I'll round that N-1 factor to N) communications, each of a d_\text{model} size vector, per layer as explained in the model parallelism section. We swapped out the latency calculation for a throughput one. Then it's all divided by the comms bandwidth.

Let's play with a larger model, a Gopher sized 260B model on 16 GPUs. For a small batch, it's 22 ms per token generated. The throughput cost for the comms which we can calculate with the equation for large batch is approximately 35 microseconds, assuring us that it was safe to drop.

\text{compute} = \frac{2 \cdot P}{N \cdot A_\text{bm}} = \frac{2 \cdot 260\text{e}9}{16 \cdot 1.5\text{e}12} \approx 0.0217 \approx 22\text{ms} \\
\text{comms} = 4 \cdot n_\text{layers} \cdot 8\mu\text{s} = 4 \cdot 80 \cdot 8\mu\text{s} = 2560\mu\text{s} \approx 3\text{ms}

For a large batch of 512, it's a total of 53 ms per token generated (per batch, so in the 53ms, 512 tokens are generated). The latency cost on comms here would've also been 3ms (latency is not multiplied by batch as the message can be prepared together), which is somewhat significant to drop, but it's fine if we assume parallel comms and compute.

\text{compute} = B \cdot \frac{2 \cdot P}{N \cdot A_f} = 512 \cdot \frac{2 \cdot 260\text{e}9}{16 \cdot 312\text{e}12} \approx 0.053 \approx 53\text{ms} \\
\text{comms} = B \cdot \frac{2 \cdot 4 \cdot n_\text{layers} \cdot d_\text{model}}{A_c} = 512 \cdot \frac{8 \cdot 80 \cdot 16384}{300\text{e}9} \approx 18\text{ms}
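
The same large-batch formulas as a sketch, reproducing the numbers above (260e9 parameters, 80 layers, a d_model of 16384 and 16 GPUs are the values used in the text; function names are illustrative):

```python
A_f = 312e12   # A100 flops per second
A_c = 300e9    # comms bandwidth, bytes per second

def large_batch_compute_s(B, P, n_gpus):
    # 2 * P flops per token, split across accelerators, times the batch
    return B * 2 * P / (n_gpus * A_f)

def large_batch_comms_s(B, n_layers, d_model):
    # four communications per layer, each d_model elements at 2 bytes each
    return B * 2 * 4 * n_layers * d_model / A_c

print(large_batch_compute_s(512, 260e9, 16))  # ~0.053 s
print(large_batch_comms_s(512, 80, 16384))    # ~0.018 s
```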

The higher value between the comms and compute is taken as we're assuming that it's parallel. Thus, we would want to avoid having comms being greater than compute (this is the mechanism that prevents us from approaching latency zero as we insert more chips, eventually the comms will start taking more and more time). It's not guaranteed that all systems will do this in parallel, and certainly not perfectly in parallel.

These numbers are definitely much lower than what we can get with real sand, as it assumes optimal hardware usage, doesn't factor in softmaxes, assumes zero comms latency and ignores many other smaller factors. Nonetheless, all the reasoning behind this math is useful for thinking about where to go to optimise performance and what deltas incoming optimisations will cause.

batch sizes

Batch size is an important factor of our performance, especially towards understanding performance for specific usages.

In the previous section, we have two calculations for when something is memory bandwidth bound versus flops bound. To figure out which is at play we can compare these numbers;

\text{mem bandwidth time} = \frac{2 \cdot P}{N \cdot A_\text{bm}} \\
\text{flops time} = B \cdot \frac{2 \cdot P}{N \cdot A_f}

We're dealing with the same ratio we found in the kv cache section. The minimum batch size to be flops bound is A_f/A_\text{bm} = 208. This is a handy ratio! If we have the load to do it, we prefer flops bound as it's more compute efficient. Though it's also the case that if we're flops bound, making the batch size larger doesn't mean anything is getting faster.

To calculate when the capacity goes from mostly kv cache to mostly weights is trivial, and also isn't a binary in the same way (nothing special happens when your kv cache starts taking up more memory than your weights). Nothing special really happens with comms either. At some point in increasing the batch size, the throughput starts dwarfing the latency so we dropped that factor. As observed previously, the latency becomes insignificant much later (our 512 batch on 52B communication cost was still 11% latency).

Something oversimplified about comms is that it happens at four different steps, which means we don't just want our compute time to be longer than our comms time, we want it to be the case at each step (if we can parallelise the compute and comms). For that, we have a weirder ratio: flops per byte of comms. Here's a nice chart of our computations, which will also be useful in the section below.

|                | q, k, v                 | o                      | w_1                     | w_2                     |
|----------------|-------------------------|------------------------|-------------------------|-------------------------|
| flops          | 3B({d_\text{model}}^2)  | B({d_\text{model}}^2)  | 4B({d_\text{model}}^2)  | 4B({d_\text{model}}^2)  |
| bytes of comms | B(d_\text{model})       | B(d_\text{model})      | B(d_\text{model})       | B(d_\text{model})       |
| flops/byte     | 3(d_\text{model})       | d_\text{model}         | 4(d_\text{model})       | 4(d_\text{model})       |

312\text{e}12 \div 300\text{e}9 = 1040, which is our flops per byte of comms for our A100s. We want the values in the last row to be larger than our hardware flops per byte so that we stay flops bound (assuming we are not memory bound here). For any model with an embedding dimension over 1024 (per chip), we're safe! For 512, it's a little awkward.
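
A small sketch of that comparison, using the last row of the table (treating d_model as the per-chip embedding dimension, per the text; names are illustrative):

```python
hardware_flops_per_byte = 312e12 / 300e9  # ~1040 for an A100

def comm_flops_per_byte(d_model):
    # last row of the table: q,k,v then o then w_1 then w_2
    return {"qkv": 3 * d_model, "o": d_model, "w1": 4 * d_model, "w2": 4 * d_model}

for d in (512, 2048, 8192):
    ratios = comm_flops_per_byte(d)
    print(d, {name: value > hardware_flops_per_byte for name, value in ratios.items()})
# the o projection (just d_model) is the tightest: it wants d_model above ~1040 per chip
```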

A low-load API may result in smaller batch sizes, leading to reasonable decisions like dropping the kv cache. If an API had the load for large batches it would probably want to serve the lowest batch size that gets flops bound, even if there is capacity left, so that it could optimise for per-request latency. In, say, mass-inferencing jobs like AlphaCode we might want to insert as many chips as we can and then do the largest batch we can do with that capacity. I say "may" a lot here, but I actually think these are absolute and that all three kinds of cases occur.

flops counting

Previously;

We do 2 \cdot P flops of operations, which can be intuited by the fact that we matmul through all the parameters.

This is correct reasoning, but we can break it down by walking through all the transformer steps and checking that we get 2P.

The following calculations are per token, per layer. I describe W_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}} where it's more accurate to say we have W_q^i, W_k^i, W_v^i \in \mathbb{R}^{d_\text{model}\times d_\text{head}}, where i goes up to n_\text{heads}. But for the sake of calculating latency, I simplify W_q, W_k, W_v to include all the heads.

  • Computing qkv
    • Multiply t_e \in \mathbb{R}^{1\times d_\text{model}} by W_q, W_k, W_v \in \mathbb{R}^{d_\text{model}\times d_\text{model}}
    • Flop count: 2 \cdot 3 \cdot {d_\text{model}}^2
  • Calculate z
    • This is \text{softmax}((q\cdot k)\div\sqrt{d_\text{head}}) \cdot v = z
    • No matrices are multiplied, the number of flops is some factor of d_\text{model}.
  • Multiply by the output projection matrix
    • Multiply W_o \in \mathbb{R}^{d_\text{model}\times d_\text{model}} by z \in \mathbb{R}^{d_\text{model}\times 1}
    • Flop count: 2 \cdot {d_\text{model}}^2
  • Feed-forward
    • We have our MLP weights W_1 \in \mathbb{R}^{4\times d_\text{model}}, W_2 \in \mathbb{R}^{d_\text{model}\times 4} for two linear transformations (there's a ReLU in the middle, which is small).
    • Flop count: 2 \cdot 8 \cdot {d_\text{model}}^2
  • Some other things
    • There are typically layernorms that happen after each attention, where the weights are a vector of length d_\text{model}.
    • There's another linear layer and then a softmax that sits on top, which is our output (token) embedding or unembedding or de-embedding or embedding^{-1}.
    • The original transformer has a cosine absolute positional encoding scheme, which is an addition operation on the token embedding.

Adding up all the flops!

F = n_\text{layers} \cdot (2 \cdot 3 \cdot {d_\text{model}}^2 + 2 \cdot {d_\text{model}}^2 + 16 \cdot {d_\text{model}}^2) \\
= n_\text{layers} \cdot 24 \cdot {d_\text{model}}^2
Subbing in our 8192 model, we should get about 100B flops;

F = 64 \cdot 24 \cdot 8192^2 = 103079215104 \text{ flops}

103079215104 over two is about 51.5B. We're a lil under (we get 51.5B instead of 52B) but that's because token (un)embeddings are nearly a billion parameters. It would be reasonable to do the latency calculations with 2 \cdot 12 \cdot n_\text{layers} \cdot {d_\text{model}}^2 instead of 2 \cdot P, but it's less than a 2% difference.
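
A quick sketch of that tally for the 8192-dimension, 64-layer model:

```python
n_layers, d_model = 64, 8192
P = 52e9

per_layer_flops = (2 * 3 + 2 + 2 * 8) * d_model**2  # qkv + output projection + MLP
F = n_layers * per_layer_flops                       # = n_layers * 24 * d_model^2

print(F)       # ~1.03e11 flops per token
print(F / 2)   # ~5.15e10, a little under the 52e9 parameter count
print(2 * P)   # ~1.04e11 for comparison
```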

What about the calculation of z and all the other steps I didn't count? Those are all vector-vector (or even vector-scalar) operations, so they are built around a factor of d_\text{model} rather than {d_\text{model}}^2. Even if we had 100 of these operations per layer, it would come out to a hundred million flops, which is 0.1% of the number of flops we counted.

intermediate memory costs

Data Movement Is All You Need (which is mostly about optimising low level data movement for transformers, and isn't a particularly relevant read) has a nice way of classifying operations. We have tensor contractions, which are the big matmuls we've mostly cared about (including the linear layers). Then there are statistical normalisations, the softmax and layernorm. Finally, there are element-wise operators, which this post has completely ignored till now: things like biases, dropouts and activations.

So how do we calculate the latency of those matmuls, layernorms, etc? The reported flops on our hardware is specifically for the multiply-add operations, so it would not be right to count them in there even if we could count the flops. Surprise! We'll only count the memory cost of the softmax read/writes, as that's what the bandwidth to flops ratio favours. This is the latency factor that has been alluded to!

I'm going to break character on the first-principles aspect of this and discuss Table A.1 from the Data Movement Is All You Need paper. Here we see that the latency for softmax is actually slightly higher than the calculations for qkv (which are a 1/3 of the time). This is a little concerning!

For the same reason that the softmax is memory bound, the multiplication of qk, the ReLU and dropout are also quite expensive.

GPU Kernel Fusion

GPUs execute operations in units of "kernels". Kernel fusion means that something that was usually 2 kernels can become one, and the primary use is to reuse loads from memory and reduce redundant loads and stores. For example, a multiply-add is one kernel. If it were two, then one would load+add+store and the second would load+multiply+store. We could save a lot of trips by doing load+add+multiply+store.

We can also tell the softmax here is not perfectly fused by counting the number of read-writes we should need. In theory it can just be one read and one write (the standard is uh, four, so I'm bullying a bit). For qk, it would be two reads and one write (the two reads can probably be saved). The three-to-one ratio, then, indicates that the softmax is doing more memory passes than is optimal. I say this because it expresses how much this counting is software dependent and needs experiments to estimate, since in theory the cost could be zero.

It's also worth noting that the percentage of time these operations take gets smaller quickly as model size increases, as the memory will increase on the order of d_\text{model} while the flops increase on the order of {d_\text{model}}^2, per layer. The paper uses a 336M param model, d_\text{model} = 1024, n_\text{layers} = 24.

I added up the latency of all the values in the "Ours" column that were memory bound, including the element-wise operations. The result is that these intermediate steps take 43% of the time. In a model of size 52B (where d_\text{model} is 8 times larger), we see these operations become less significant.

These memory bound intermediate operations will take 8 times longer, as the operations are on vectors of length d_\text{model}. However, the number of flops will increase by 64 times, which means the flop time increases by 64 times.

0.43 \cdot 8 \div (1 \cdot 64) = 0.05375
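
The same arithmetic as a sketch (the 43% figure is the memory bound share measured on the 336M model above):

```python
small_model_share = 0.43     # memory bound share of time at d_model = 1024
d_model_scale = 8192 / 1024  # 52B-scale d_model over the paper's d_model

memory_scale = d_model_scale       # the memory bound ops scale with d_model
flops_scale = d_model_scale ** 2   # the matmuls scale with d_model^2

print(small_model_share * memory_scale / flops_scale)  # ~0.054, about 5% at 52B scale
```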

So using the optimisations in that paper, for a 52B model these intermediate calculations we didn't factor into latency would be about 5% of the inference latency.

comparing against real benchmarks
与真实基准进行比较

I work at a language modelling company that has its own infrastructure and existing benchmarks but uh, IP is hard. There is a sadly small number of public benchmarks available for model parallel inferencing. It seems like the only public engines for this are Nvidia FasterTransformer and Microsoft Deepspeed, with other benchmarks probably scattered in papers I don't know exist. Anywho, we can verify our calculations against some real benchmarks!

Because I only want to use 2 GPUs, I've run a 13B parameter model with FasterTransformer, which does a bunch of good kernel fusing and provides us with tensor parallelism. 13B is 40 layers, 40 heads, each of dim 128 for a dim size of 5120. I have screenshots of the profiles in here and there are a bunch of interesting things in there that might make another post.

We'll start with a 512 context length, batch size 1 and 10 tokens outputted. For a small batch for one token on 2 GPUs we expect 8.4ms, and about 1ms of comms. For 1 GPU, that would be 16.8ms and 0 comms. (2x40x12x5120^2/1.5e12)
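
Here's a sketch of the expectation in that parenthetical (using 12 · n_layers · d_model^2 as the approximate parameter count, as in the flops counting section; function name is illustrative):

```python
A_bm = 1.5e12
n_layers, d_model = 40, 5120
approx_params = 12 * n_layers * d_model**2  # ~12.6e9, close to the 13B parameter count

def decode_step_ms(n_gpus):
    # memory bandwidth bound: load 2 bytes per parameter, split across GPUs
    return 2 * approx_params / (n_gpus * A_bm) * 1e3

print(decode_step_ms(1))          # ~16.8 ms
print(decode_step_ms(2))          # ~8.4 ms
print(4 * n_layers * 8e-6 * 1e3)  # ~1.3 ms of comms latency for the 2 GPU case
```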

Excuse my mangled significant figures, I probably should've kept the mem bandwidth to 1.555 instead of 1.5.

Our empirical result for 1 GPU is 22.0ms, meaning our guess was 76% there. We can actually safely account for all of this, where we know some percentage will go to intermediate activations, and that we don't actually get 100% of our theoretical memory bandwidth. For these dimensions, a profile tells us we get up to about 90% of our full memory bandwidth (where I compare the expected cost of a matmul to the duration of a single matmul kernel, and rounded up as the bandwidth usage varies quite a bit depending on the tensors being loaded). Counting that in, we expect to take 18.5ms. Adding up the cost of intermediate activations (which we can do from a profile) we get another 2.2ms, getting us to 20.7 ms! To account for the last 1.4 ms there are some other sub-millisecond operations like token embeddings, doing top-(k|p), less net bandwidth than 90% (I couldn't be bothered to actually average everything, so I took the highest bw usage I could find) or even kernel launch times.

Our empirical result for 2 GPUs is 13.5ms. We're farther off this time, at only 62% of the way there. We would check the profile again to see the memory bandwidth (which we expect to be slightly worse, as smaller tensors tend to be able to get less of the bandwidth). This time, it doesn't quite get to 90, more like 87, getting us to 9.5ms. The intermediate activations take a similar amount of time (2ms), getting us to 11.7ms. With the remaining 1.5 ms then, we're searching for comms! This is easily covered by our calculated 1ms of comms not being parallelised. From the profile, our comms take 40-50 microseconds per layer, for a total of 1.7ish ms of comms time, which accounts for everything pretty well!

I think for both of those operations, the counting of intermediate activations was a bit higher than it should be, because the profile gave consistently slightly-higher latencies than the raw benchmarking run. The output of the benchmark run was 180.86 ms (context time: 45.45 ms) and 283.60 ms (context time: 63.17 ms).

But what about the forwards pass? I expect the forwards pass to take num_tokens/flops_to_bw_ratio times as long as a decoding step. This is because we have to send all the tokens to all the GPUs, and each GPU will do its heads of attention on them and store kv. Let's use the updated memory bandwidth, 312e12/(1.5e12x0.9)=231. Looking at the 1 GPU setup, where 22 is our expected decoding step, we see that 22*(512/231) = 48 which is not quite the claimed 63. For 2 GPUs we get 13.5*(512/231) = 30ms, even worse!
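
A sketch of that estimate (the 22.0 ms and 13.5 ms decode steps and the ~90% achieved bandwidth come from the runs above; the function name is illustrative):

```python
A_f, A_bm = 312e12, 1.5e12
effective_ratio = A_f / (A_bm * 0.9)  # ~231 once we only get ~90% of memory bandwidth

def expected_context_ms(decode_step_ms, n_context_tokens=512):
    # the forwards pass should be flops bound: decode time scaled by tokens over the ratio
    return decode_step_ms * n_context_tokens / effective_ratio

print(expected_context_ms(22.0))  # ~49 ms, vs the reported ~63 ms context time on 1 GPU
print(expected_context_ms(13.5))  # ~30 ms, vs the reported ~45 ms on 2 GPUs
```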

For the one GPU, some of the missing time should just be kv storing. Looking at the profiles, this is 18 microseconds per layer, 0.7ms. There are some Memsets for 0.2ms. We expect the flop time (this is flops bound!) for one of our MLP multiplies to be 512x4x5120^2x2/312e12 = 344 microseconds. In practice, this is 476 at the lowest, which means we get 72% of the flops we expect? For the projection in the attention we expect 512x5120^2x2/312e12 = 86 microseconds. In profiles we find this to be 159 at the lowest, which is 54%. Yikes! I panicked for a bit, but uh this is apparently just the flops we expect? See Figure 14 in this paper where a 512x4000x4000 ends up getting less than 150TFLOPs/s.

exercises

  1. Given batch size, context length and next_n, how can we calculate the savings of using kv cache?

  2. What overheads does the kv cache add in memory time?

  3. Can we be memory bound on our forwards pass but flops bound at each sampling step?

  4. What tradeoffs and calculations should we consider for using more GPUs than is necessary for capacity? Say for example, a 52B model on 8 or 16 GPUs instead of 4.

  5. We came up with formulas to calculate time to predict one token. How would we calculate the time to do an entire sample, from doing the forwards pass on the context to predicting all the tokens requested?

  6. In the capacity section, I say the memory of intermediate calculations is negligible. How small are they exactly?

  7. In the batch sizes section, we went a bit off topic and talked about the flops per byte of communication. What are the tradeoffs if we had an embedding dimension size of 512?

  8. We assume GPUs attached to the same host here, but we could communicate between GPUs on different hosts like we do in training. AWS has 400gb/s. What about it!

  9. In model parallelism, we could in practice communicate all the shards and then have each accelerator do all the addition, instead of just a share of their addition. What are the latency implications there?

  10. Try calculating the large batch speed for a 52B on 4xGPUs at batch size 256. The compute should be about 21ms and comms should be about 4ms.

  11. Consider the operation of taking the vector out of the last layer and multiplying it by the unembedding matrix, storing the logits and then doing top-k or top-p sampling (which requires a sort). How long should this take for a 52B model, and what can we parallelise here?

  12. How can we shard the token embeddings? Would we shard the input token embeddings differently from the unembeddings? Layernorms? What extra communication does this incur?

acknowledgements

Would like to extend credit and thanks to people who made a positive impact on this post in varying capacities: James Bradbury, Eric Zhang, Taylor Rogalski, Horace He, Julian Schrittwieser, Reiner Pope, Jim Wu, Mohammad Bavarian, Tudor Brindus and Adrien Morisot, with James leading by a long shot.

citation

Please cite as:

Chen, Carol. "Transformer Inference Arithmetic", https://kipp.ly/blog/transformer-inference-arithmetic/, 2022.

hey kipply you should better understand our big model inferencing latency

yes that's a great idea i'll look into it!

cool i'd love to see the profile

if i sit in a dark room by myself long enough i think i can explain all the milliseconds

😳


The architectures and latencies expressed in this post are those of publicly known or theoretical models and benchmarks and do not necessarily reflect the architectures or latencies of my employer's models.