
Introduction to Transformers: an NLP Perspective

 

Tong Xiao (xiaotong@mail.neu.edu.cn)
NLP Lab., Northeastern University, Shenyang, China
NiuTrans Research, Shenyang, China

Jingbo Zhu (zhujingbo@mail.neu.edu.cn)
NLP Lab., Northeastern University, Shenyang, China
NiuTrans Research, Shenyang, China
Abstract

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.

1 Background

Transformers are a type of neural network (Vaswani et al., 2017). They were originally known for their strong performance in machine translation, and are now a de facto standard for building large-scale self-supervised learning systems (Devlin et al., 2019; Brown et al., 2020). The past few years have seen the rise of Transformers not only in natural language processing (NLP) but also in several other fields, such as computer vision and multi-modal processing. As Transformers continue to mature, these models are playing an increasingly important role in the research and application of artificial intelligence (AI).

Looking back at the history of neural networks, Transformers have not been around for a long time. While Transformers are “newcomers” in NLP, they were developed on top of several ideas, the origins of which can be traced back to earlier work, such as word embedding (Bengio et al., 2003; Mikolov et al., 2013) and attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015). As a result, Transformers can benefit from the advancements of different sub-fields of deep learning, and provide an elegant way to combine these neural models. On the other hand, Transformers are unique, and differ from previous models in several ways. First, they do not depend on recurrent or convolutional neural networks for modeling sequences of words, but use only attention mechanisms and feed-forward neural networks. Second, the use of self-attention in Transformers makes it easier to deal with global contexts and dependencies among words. Third, Transformers are very flexible architectures and can be easily modified to accommodate different tasks.

The widespread use of Transformers motivates the development of cutting-edge techniques in deep learning. For example, there are significant refinements in self-attention mechanisms, which have been incorporated into many state-of-the-art NLP systems. The resulting techniques, together with the progress in self-supervised learning, have led us to a new era of AI: we are beginning to obtain models of universal language understanding, generation and reasoning. This has been evidenced by recent Transformer-based large language models (LLMs) which demonstrate amazing performance across a broad variety of tasks (Bubeck et al., 2023).

This paper provides an introduction to Transformers while reflecting the recent developments in applying these models to different problems. However, Transformers are so successful that there have been numerous related studies and we cannot give a full description of them. Therefore, we focus this work on the core ideas of Transformers, and present a basic description of the common techniques. We also discuss some recent advances in Transformers, such as model improvements for efficiency and accuracy considerations. Because the field is very active and new techniques are coming out every day, it is impossible to survey all the latest literature and we are not attempting to do so. Instead, we focus on just those concepts and algorithms most relevant to Transformers, aimed at the people who wish to get a general understanding of these models.

2 The Basic Model

Here we consider the model presented in Vaswani et al. (2017)’s work. We start by considering the Transformer architecture and discuss the details of the sub-models subsequently.

2.1 The Transformer Architecture

Figure 1 shows the standard Transformer model which follows the general encoder-decoder framework. A Transformer encoder comprises a number of stacked encoding layers (or encoding blocks). Each encoding layer has two different sub-layers (or sub-blocks), called the self-attention sub-layer and the feed-forward neural network (FFN) sub-layer. Suppose we have a source-side sequence $\mathbf{x}=x_{1}...x_{m}$ and a target-side sequence $\mathbf{y}=y_{1}...y_{n}$. The input of an encoding layer is a sequence of $m$ vectors $\mathbf{h}_{1}...\mathbf{h}_{m}$, each having $d_{\mathrm{model}}$ dimensions (or $d$ dimensions for simplicity). We follow the notation adopted in the previous chapters, using $\mathbf{H}\in\mathbb{R}^{m\times d}$ to denote these input vectors (provided each $\mathbf{h}_{j}\in\mathbb{R}^{d}$ is a row vector, $\mathbf{H}$ stacks them as $\mathbf{H}=\begin{bmatrix}\mathbf{h}_{1}\\ \vdots\\ \mathbf{h}_{m}\end{bmatrix}$). The self-attention sub-layer first performs a self-attention operation $\mathrm{Att}_{\mathrm{self}}(\cdot)$ on $\mathbf{H}$ to generate an output $\mathbf{C}$:

$$\mathbf{C} \;=\; \mathrm{Att}_{\mathrm{self}}(\mathbf{H}) \qquad (1)$$
Figure 1: The Transformer architecture (Vaswani et al., 2017). There are $L$ stacked layers on each of the encoder and decoder sides. An encoding layer comprises a self-attention sub-layer and an FFN sub-layer. Both of these sub-layers share the same structure, which involves a core function (either $\mathrm{Layer}_{\mathrm{self}}(\cdot)$ or $\mathrm{Layer}_{\mathrm{ffn}}(\cdot)$), followed by a residual connection and a layer normalization unit. Each decoding layer has a similar architecture to the encoding layers, but with an additional encoder-decoder attention sub-layer sandwiched between the self-attention and FFN sub-layers. As with most sequence-to-sequence models, Transformer takes $x_{1}...x_{m}$ and $y_{0}...y_{i-1}$ for predicting $y_{i}$. The representation of an input word comprises the sum of a word embedding and a positional embedding. The distributions $\{\operatorname{Pr}(\cdot|y_{0}...y_{i-1},x_{1}...x_{m})\}$ are generated in sequence by a Softmax layer, which operates on a linear transformation of the output from the last decoding layer.

Here $\mathbf{C}$ is of the same size as $\mathbf{H}$, and can thus be viewed as a new representation of the inputs. Then, a residual connection and a layer normalization unit are added to the output so that the resulting model is easier to optimize.

The original Transformer model employs the post-norm structure where a residual connection is created before layer normalization is performed, like this

$$\mathbf{H}_{\mathrm{self}} \;=\; \mathrm{LNorm}(\mathbf{C}+\mathbf{H}) \qquad (2)$$

where the addition of $\mathbf{H}$ denotes the residual connection (He et al., 2016a), and $\mathrm{LNorm}(\cdot)$ denotes the layer normalization function (Ba et al., 2016). Substituting Eq. (1) into Eq. (2), we obtain the form of the self-attention sub-layer

$$\mathrm{Layer}_{\mathrm{self}}(\mathbf{H}) \;=\; \mathbf{H}_{\mathrm{self}} \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{H})+\mathbf{H}) \qquad (3)$$

The definitions of $\mathrm{LNorm}(\cdot)$ and $\mathrm{Att}_{\mathrm{self}}(\cdot)$ will be given later in this section.

The FFN sub-layer takes $\mathbf{H}_{\mathrm{self}}$ and outputs a new representation $\mathbf{H}_{\mathrm{ffn}}\in\mathbb{R}^{m\times d}$. It has the same form as the self-attention sub-layer, with the attention function replaced by the FFN function, given by

$$\mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}_{\mathrm{self}}) \;=\; \mathbf{H}_{\mathrm{ffn}} \;=\; \mathrm{LNorm}(\mathrm{FFN}(\mathbf{H}_{\mathrm{self}})+\mathbf{H}_{\mathrm{self}}) \qquad (4)$$

Here $\mathrm{FFN}(\cdot)$ could be any feed-forward neural network with non-linear activation functions. The most common structure of $\mathrm{FFN}(\cdot)$ is a two-layer network involving two linear transformations and a ReLU activation function between them.

For deep models, we can stack the above neural networks. Let $\mathbf{H}^{l}$ be the output of layer $l$. Then, we can express $\mathbf{H}^{l}$ as a function of $\mathbf{H}^{l-1}$. We write this as a composition of two sub-layers

$$\mathbf{H}^{l} \;=\; \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}_{\mathrm{self}}^{l}) \qquad (5)$$
$$\mathbf{H}_{\mathrm{self}}^{l} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{H}^{l-1}) \qquad (6)$$

If there are $L$ encoding layers, then $\mathbf{H}^{L}$ will be the output of the encoder. In this case, $\mathbf{H}^{L}$ can be viewed as a representation of the input sequence that is learned by the Transformer encoder. $\mathbf{H}^{0}$ denotes the input of the encoder. In recurrent and convolutional models, $\mathbf{H}^{0}$ can simply be the word embeddings of the input sequence. Transformer takes a different way of representing the input words, and encodes the positional information explicitly. In Section 2.2 we will discuss the embedding model used in Transformers.

The Transformer decoder has a similar structure to the Transformer encoder. It comprises $L$ stacked decoding layers (or decoding blocks). Let $\mathbf{S}^{l}$ be the output of the $l$-th decoding layer. We can formulate a decoding layer by using the following equations

$$\mathbf{S}^{l} \;=\; \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{S}_{\mathrm{cross}}^{l}) \qquad (7)$$
$$\mathbf{S}_{\mathrm{cross}}^{l} \;=\; \mathrm{Layer}_{\mathrm{cross}}(\mathbf{H}^{L},\mathbf{S}_{\mathrm{self}}^{l}) \qquad (8)$$
$$\mathbf{S}_{\mathrm{self}}^{l} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{S}^{l-1}) \qquad (9)$$

Here there are three decoder sub-layers. The self-attention and FFN sub-layers are the same as those used in the encoder. $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ denotes a cross-attention sub-layer (or encoder-decoder sub-layer) which models the transformation from the source-side to the target-side. In Section 2.6 we will see that $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ can be implemented using the same function as $\mathrm{Layer}_{\mathrm{self}}(\cdot)$.

The Transformer decoder outputs a distribution over a vocabulary $V_{\mathrm{y}}$ at each target-side position. This is achieved by using a softmax layer that normalizes a linear transformation of $\mathbf{S}^{L}$ to distributions of target-side words. To do this, we map $\mathbf{S}^{L}$ to an $n\times|V_{\mathrm{y}}|$ matrix $\mathbf{O}$ by

$$\mathbf{O} \;=\; \mathbf{S}^{L}\cdot\mathbf{W}_{\mathrm{o}} \qquad (10)$$

where $\mathbf{W}_{\mathrm{o}}\in\mathbb{R}^{d\times|V_{\mathrm{y}}|}$ is the parameter matrix of the linear transformation.

Then, the output of the Transformer decoder is given in the form

$$\begin{bmatrix}\operatorname{Pr}(\cdot|y_{0},\mathbf{x})\\ \vdots\\ \operatorname{Pr}(\cdot|y_{0}...y_{n-1},\mathbf{x})\end{bmatrix} \;=\; \mathrm{Softmax}(\mathbf{O}) \;=\; \begin{bmatrix}\mathrm{Softmax}(\mathbf{o}_{1})\\ \vdots\\ \mathrm{Softmax}(\mathbf{o}_{n})\end{bmatrix} \qquad (11)$$

where $\mathbf{o}_{i}$ denotes the $i$-th row vector of $\mathbf{O}$, and $y_{0}$ denotes the start symbol $\langle\mathrm{SOS}\rangle$. Under this model, the probability of $\mathbf{y}$ given $\mathbf{x}$ can be defined as usual,

$$\log\operatorname{Pr}(\mathbf{y}|\mathbf{x}) \;=\; \sum_{i=1}^{n}\log\operatorname{Pr}(y_{i}|y_{0}...y_{i-1},\mathbf{x}) \qquad (12)$$

This equation resembles the general form of language modeling: we predict the word at time $i$ given all of the words up to time $i-1$. Therefore, the input of the Transformer decoder is shifted one word left, that is, the input is $y_{0}...y_{n-1}$ and the output is $y_{1}...y_{n}$.
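To make Eqs. (10)-(12) concrete, the following NumPy sketch maps a matrix of decoder output states to per-position word distributions and accumulates the log-probability of a given target sequence. The function names, the random parameters, and the toy sizes are our own choices for illustration only, not part of the original presentation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def output_distributions(S_L, W_o):
    """Eqs. (10)-(11): row i of the result is Pr(. | y_0 ... y_{i-1}, x)."""
    O = S_L @ W_o                               # n x |Vy| matrix of logits
    return softmax(O, axis=-1)

def sequence_log_prob(S_L, W_o, y):
    """Eq. (12): log Pr(y|x) for target word indices y = (y_1, ..., y_n)."""
    P = output_distributions(S_L, W_o)
    return float(np.sum(np.log(P[np.arange(len(y)), y])))

n, d, vocab = 4, 8, 10                          # toy sizes
S_L = np.random.randn(n, d)                     # stand-in for the last decoding layer's output
W_o = np.random.randn(d, vocab)
print(sequence_log_prob(S_L, W_o, y=[3, 1, 7, 2]))
```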

The Transformer architecture discussed above has several variants which have been successfully used in different fields of NLP. For example, we can use a Transformer encoder to represent texts (call it the encoder-only architecture), can use a Transformer decoder to generate texts (call it the decoder-only architecture), and can use a standard encoder-decoder Transformer model to transform an input sequence to an output sequence. In the rest of this chapter, most of the discussion is independent of the particular choice of application, and will be mostly focused on the encoder-decoder architecture. In Section 6, we will see applications of the encoder-only and decoder-only architectures.

2.2 Positional Encoding

In their original form, both FFNs and attention models used in Transformer ignore an important property of sequence modeling, which is that the order of the words plays a crucial role in expressing the meaning of a sequence. This means that the encoder and decoder are insensitive to the positional information of the input words. A simple approach to overcoming this problem is to add positional encoding to the representation of each word of the sequence. More formally, a word $x_{j}$ can be represented as a $d$-dimensional vector

$$\mathbf{xp}_{j} \;=\; \mathbf{x}_{j}+\mathrm{PE}(j) \qquad (13)$$

Here $\mathbf{x}_{j}\in\mathbb{R}^{d}$ is the embedding of the word, which can be obtained by using word embedding models. $\mathrm{PE}(j)\in\mathbb{R}^{d}$ is the representation of the position $j$. The vanilla Transformer employs the sinusoidal positional encoding model, which we write in the form

$$\mathrm{PE}(i,2k) \;=\; \sin\Big(i\cdot\frac{1}{10000^{2k/d}}\Big) \qquad (14)$$
$$\mathrm{PE}(i,2k+1) \;=\; \cos\Big(i\cdot\frac{1}{10000^{2k/d}}\Big) \qquad (15)$$

where $\mathrm{PE}(i,k)$ denotes the $k$-th entry of $\mathrm{PE}(i)$. The idea of positional encoding is to distinguish different positions using continuous systems. Here we use the sine and cosine functions with different frequencies. The interested reader can refer to Appendix A to see that such a method can be interpreted as a carrying system. Because the encoding is based on individual positions, it is also called absolute positional encoding. In Section 4.1 we will see an improvement to this method.
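As a concrete illustration of Eqs. (13)-(15), here is a minimal NumPy sketch that builds the sinusoidal encodings and adds them to a matrix of word embeddings. The function name and the toy sizes are our own choices (and the sketch assumes an even $d$); it is not part of the original presentation.

```python
import numpy as np

def sinusoidal_positional_encoding(m, d):
    """Return an m x d matrix whose j-th row is PE(j), Eqs. (14)-(15)."""
    pe = np.zeros((m, d))
    pos = np.arange(m)[:, None]                    # positions i = 0 ... m-1
    two_k = np.arange(0, d, 2)[None, :]            # even indices 2k = 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_k / d)    # i * 1 / 10000^(2k/d)
    pe[:, 0::2] = np.sin(angles)                   # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)                   # PE(i, 2k+1)
    return pe

# Eq. (13): the encoder input is word embeddings plus positional encodings.
m, d = 6, 8                                        # toy sizes
word_emb = np.random.randn(m, d)                   # stand-in for learned word embeddings
H0 = word_emb + sinusoidal_positional_encoding(m, d)
```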

Once we have the above embedding result, $\mathbf{xp}_{1}...\mathbf{xp}_{m}$ is taken as the input to the Transformer encoder, that is,

$$\mathbf{H}^{0} \;=\; \begin{bmatrix}\mathbf{xp}_{1}\\ \vdots\\ \mathbf{xp}_{m}\end{bmatrix} \qquad (16)$$

Similarly, we can also define the input on the decoder side.

2.3 Multi-head Self-attention

The use of self-attention is perhaps one of the most significant advances in sequence-to-sequence models. It attempts to learn and make use of direct interactions between each pair of inputs. From a representation learning perspective, self-attention models assume that the learned representation at position $i$ (denoted by $\mathbf{c}_{i}$) is a weighted sum of the inputs over the sequence. The output $\mathbf{c}_{i}$ is thus given by

$$\mathbf{c}_{i} \;=\; \sum_{j=1}^{m}\alpha_{i,j}\mathbf{h}_{j} \qquad (17)$$

where $\alpha_{i,j}$ indicates how strongly the input $\mathbf{h}_{i}$ is correlated with the input $\mathbf{h}_{j}$. We can thus view $\mathbf{c}_{i}$ as a representation of the global context at position $i$. $\alpha_{i,j}$ can be defined in different ways if one considers different attention models. Here we use the scaled dot-product attention function to compute $\alpha_{i,j}$, as follows

$$\alpha_{i,j} \;=\; \mathrm{Softmax}(\mathbf{h}_{i}\mathbf{h}_{j}^{\mathrm{T}}/\beta) \;=\; \frac{\exp(\mathbf{h}_{i}\mathbf{h}_{j}^{\mathrm{T}}/\beta)}{\sum_{k=1}^{m}\exp(\mathbf{h}_{i}\mathbf{h}_{k}^{\mathrm{T}}/\beta)} \qquad (18)$$

where $\beta$ is a scaling factor and is set to $\sqrt{d}$.

Compared with conventional recurrent and convolutional models, an advantage of self-attention models is that they shorten the computational “distance” between two inputs. Figure 2 illustrates the information flow in these models. We see that, given the input at position $i$, self-attention models can directly access any other input. By contrast, recurrent and convolutional models might need two or more jumps to see the whole sequence.

Figure 2: Information flows in recurrent, convolutional and self-attention models, shown as arrow lines between positions.

We can have a more general view of self-attention by using the QKV attention model. Suppose we have a sequence of $\kappa$ queries $\mathbf{Q}=\begin{bmatrix}\mathbf{q}_{1}\\ \vdots\\ \mathbf{q}_{\kappa}\end{bmatrix}$, and a sequence of $\psi$ key-value pairs $\Big(\mathbf{K}=\begin{bmatrix}\mathbf{k}_{1}\\ \vdots\\ \mathbf{k}_{\psi}\end{bmatrix},\ \mathbf{V}=\begin{bmatrix}\mathbf{v}_{1}\\ \vdots\\ \mathbf{v}_{\psi}\end{bmatrix}\Big)$. The output of the model is a sequence of vectors, each corresponding to a query. The form of the QKV attention is given by

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{V} \qquad (19)$$

We can write the output of the QKV attention model as a sequence of row vectors

$$\mathbf{C} \;=\; \begin{bmatrix}\mathbf{c}_{1}\\ \vdots\\ \mathbf{c}_{\kappa}\end{bmatrix} \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) \qquad (20)$$

To apply this equation to self-attention, we simply have

$$\mathbf{H}^{q} \;=\; \mathbf{H}\mathbf{W}^{q} \qquad (21)$$
$$\mathbf{H}^{k} \;=\; \mathbf{H}\mathbf{W}^{k} \qquad (22)$$
$$\mathbf{H}^{v} \;=\; \mathbf{H}\mathbf{W}^{v} \qquad (23)$$

where $\mathbf{W}^{q},\mathbf{W}^{k},\mathbf{W}^{v}\in\mathbb{R}^{d\times d}$ represent linear transformations of $\mathbf{H}$.

By considering Eq. (1), we then obtain

$$\mathbf{C} \;=\; \mathrm{Att}_{\mathrm{self}}(\mathbf{H}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{H}^{q},\mathbf{H}^{k},\mathbf{H}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}^{v} \qquad (24)$$

Here $\mathrm{Softmax}\big(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\big)$ is an $m\times m$ matrix in which each row represents a distribution over $\{\mathbf{h}_{1},...,\mathbf{h}_{m}\}$, that is

$$\textrm{row}\ i \;=\; \begin{bmatrix}\alpha_{i,1}&...&\alpha_{i,m}\end{bmatrix} \qquad (25)$$
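The following NumPy sketch implements the scaled dot-product attention of Eq. (19) and the single-head self-attention of Eqs. (21)-(24). The randomly initialized parameter matrices and the toy sizes are only for illustration and are not taken from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    """Eq. (19): Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # kappa x psi alignment scores
    return softmax(scores, axis=-1) @ V

def self_attention(H, W_q, W_k, W_v):
    """Eqs. (21)-(24): single-head self-attention over H."""
    return qkv_attention(H @ W_q, H @ W_k, H @ W_v)

m, d = 5, 16                                       # toy sizes
H = np.random.randn(m, d)
W_q, W_k, W_v = [np.random.randn(d, d) for _ in range(3)]
C = self_attention(H, W_q, W_k, W_v)               # same shape as H: m x d
```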

We can improve the above self-attention model by using a technique called multi-head attention. This method can be motivated from the perspective of learning from multiple lower-dimensional feature sub-spaces: we project a feature vector onto multiple sub-spaces and learn feature mappings on the individual sub-spaces. Specifically, we project the whole input space into $\tau$ sub-spaces (call them heads); for example, we transform $\mathbf{H}\in\mathbb{R}^{m\times d}$ into $\tau$ matrices of size $m\times\frac{d}{\tau}$, denoted by $\{\mathbf{H}_{1}^{\mathrm{head}},...,\mathbf{H}_{\tau}^{\mathrm{head}}\}$. The attention model is then run $\tau$ times, each time on a head. Finally, the outputs of these model runs are concatenated, and transformed by a linear projection. This procedure can be expressed by

$$\mathbf{C} \;=\; \mathrm{Merge}(\mathbf{C}_{1}^{\mathrm{head}},...,\mathbf{C}_{\tau}^{\mathrm{head}})\,\mathbf{W}_{c} \qquad (26)$$

For each head $h$,

$$\mathbf{C}_{h}^{\mathrm{head}} \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{H}_{h}^{q}[\mathbf{H}_{h}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}_{h}^{v} \qquad (28)$$
$$\mathbf{H}_{h}^{q} \;=\; \mathbf{H}\mathbf{W}_{h}^{q} \qquad (29)$$
$$\mathbf{H}_{h}^{k} \;=\; \mathbf{H}\mathbf{W}_{h}^{k} \qquad (30)$$
$$\mathbf{H}_{h}^{v} \;=\; \mathbf{H}\mathbf{W}_{h}^{v} \qquad (31)$$

Here $\mathrm{Merge}(\cdot)$ is the concatenation function, and $\mathrm{Att}_{\mathrm{qkv}}(\cdot)$ is the attention function described in Eq. (20). $\mathbf{W}_{h}^{q},\mathbf{W}_{h}^{k},\mathbf{W}_{h}^{v}\in\mathbb{R}^{d\times\frac{d}{\tau}}$ are the parameters of the projections from a $d$-dimensional space to a $\frac{d}{\tau}$-dimensional space for the queries, keys, and values. Thus, $\mathbf{H}_{h}^{q}$, $\mathbf{H}_{h}^{k}$, $\mathbf{H}_{h}^{v}$, and $\mathbf{C}_{h}^{\mathrm{head}}$ are all $m\times\frac{d}{\tau}$ matrices. $\mathrm{Merge}(\mathbf{C}_{1}^{\mathrm{head}},...,\mathbf{C}_{\tau}^{\mathrm{head}})$ produces an $m\times d$ matrix. It is then transformed by a linear mapping $\mathbf{W}_{c}\in\mathbb{R}^{d\times d}$, leading to the final result $\mathbf{C}\in\mathbb{R}^{m\times d}$.

While the notation here seems somewhat tedious, it is convenient to implement multi-head models using various deep learning toolkits. A common method in Transformer-based systems is to store inputs from all the heads in data structures called tensors, so that we can make use of parallel computing resources to have efficient systems.
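Continuing the sketch above, a head-by-head (loop-based) version of Eqs. (26)-(31) might look as follows. The helper qkv_attention, the arrays H, W_q, W_k, W_v, and the size d are the ones defined in the previous sketch; in practice the heads are computed in parallel with tensor operations, and the per-head scores are usually scaled by $\sqrt{d/\tau}$ rather than the $\sqrt{d}$ written in Eq. (28), which is what the helper below does.

```python
def multi_head_self_attention(H, W_q, W_k, W_v, W_c, tau):
    """Eqs. (26)-(31), written as an explicit loop over heads.

    W_q, W_k, W_v are d x d matrices; taking column block h of each is
    equivalent to using tau separate d x (d/tau) projections W_h^q, W_h^k, W_h^v.
    """
    m, d = H.shape
    dk = d // tau                                   # per-head dimensionality d/tau
    heads = []
    for h in range(tau):
        cols = slice(h * dk, (h + 1) * dk)          # columns belonging to head h
        heads.append(qkv_attention(H @ W_q[:, cols],
                                   H @ W_k[:, cols],
                                   H @ W_v[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_c     # Merge(...) followed by W_c

W_c = np.random.randn(d, d)
C_multi = multi_head_self_attention(H, W_q, W_k, W_v, W_c, tau=4)   # m x d
```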

2.4 Layer Normalization

Layer normalization provides a simple and effective means to make the training of neural networks more stable by standardizing the activations of the hidden layers in a layer-wise manner. As introduced in Ba et al. (2016)’s work, given a layer’s output $\mathbf{h}\in\mathbb{R}^{d}$, the layer normalization method computes a standardized output $\mathrm{LNorm}(\mathbf{h})\in\mathbb{R}^{d}$ by

$$\mathrm{LNorm}(\mathbf{h}) \;=\; \mathbf{g}\odot\frac{\mathbf{h}-\mathbf{\mu}}{\sigma+\epsilon}+\mathbf{b} \qquad (32)$$

Here $\mathbf{\mu}\in\mathbb{R}^{d}$ and $\sigma\in\mathbb{R}^{d}$ are the mean and standard deviation of the activations. Let $h_{k}$ be the $k$-th dimension of $\mathbf{h}$. $\mathbf{\mu}$ and $\sigma$ are given by

$$\mu \;=\; \frac{1}{d}\cdot\sum_{k=1}^{d}h_{k} \qquad (33)$$
$$\sigma \;=\; \sqrt{\frac{1}{d}\cdot\sum_{k=1}^{d}(h_{k}-\mu)^{2}} \qquad (34)$$

Here $\mathbf{g}\in\mathbb{R}^{d}$ and $\mathbf{b}\in\mathbb{R}^{d}$ are the rescaling and bias terms. They can be treated as parameters of layer normalization, whose values are to be learned together with other parameters of the Transformer model. The addition of $\epsilon$ to $\sigma$ is used for the purpose of numerical stability. In general, $\epsilon$ is chosen to be a small number.

We illustrate the layer normalization method for the hidden states of an encoder in the following example (assume that $m=4$, $d=3$, $\mathbf{g}=\mathbf{1}$, $\mathbf{b}=\mathbf{0}$, and $\epsilon=0.1$).

$$\begin{matrix}\mathbf{h}_{1}\\ \mathbf{h}_{2}\\ \mathbf{h}_{3}\\ \mathbf{h}_{4}\end{matrix}\begin{bmatrix}1&1&2\\ 0.9&0.9&0\\ 0.7&0.8&0\\ 3&1&7\end{bmatrix}\quad\begin{matrix}\mu=1.3,\ \sigma=0.5\\ \mu=0.6,\ \sigma=0.4\\ \mu=0.5,\ \sigma=0.4\\ \mu=3.7,\ \sigma=2.5\end{matrix}\quad\implies\quad\begin{bmatrix}\frac{1-1.3}{0.5+0.1}&\frac{1-1.3}{0.5+0.1}&\frac{2-1.3}{0.5+0.1}\\ \frac{0.9-0.6}{0.4+0.1}&\frac{0.9-0.6}{0.4+0.1}&\frac{0-0.6}{0.4+0.1}\\ \frac{0.7-0.5}{0.4+0.1}&\frac{0.8-0.5}{0.4+0.1}&\frac{0-0.5}{0.4+0.1}\\ \frac{3-3.7}{2.5+0.1}&\frac{1-3.7}{2.5+0.1}&\frac{7-3.7}{2.5+0.1}\end{bmatrix}$$
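A NumPy sketch of Eqs. (32)-(34), applied row by row to the hidden states, reproduces the example above (up to the rounding of $\mu$ and $\sigma$ to one decimal place used in the illustration). The function name is our own, and the unusually large $\epsilon=0.1$ simply follows the example.

```python
import numpy as np

def layer_norm(H, g, b, eps):
    """Eqs. (32)-(34): standardize each row of H, then rescale by g and shift by b."""
    mu = H.mean(axis=-1, keepdims=True)        # per-position mean, Eq. (33)
    sigma = H.std(axis=-1, keepdims=True)      # per-position standard deviation, Eq. (34)
    return g * (H - mu) / (sigma + eps) + b

H = np.array([[1.0, 1.0, 2.0],
              [0.9, 0.9, 0.0],
              [0.7, 0.8, 0.0],
              [3.0, 1.0, 7.0]])
print(layer_norm(H, g=np.ones(3), b=np.zeros(3), eps=0.1))
```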

As discussed in Section 2.1, the layer normalization unit in each sub-layer is used to standardize the output of a residual block. Here we describe a more general formulation for this structure. Suppose that $F(\cdot)$ is a neural network we want to run. Then, the post-norm structure of $F(\cdot)$ is given by

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathrm{LNorm}(F(\mathbf{H}_{\mathrm{in}})+\mathbf{H}_{\mathrm{in}}) \qquad (35)$$

where $\mathbf{H}_{\mathrm{in}}$ and $\mathbf{H}_{\mathrm{out}}$ are the input and output of this model. Clearly, Eq. (4) is an instance of this equation.

An alternative approach to introducing layer normalization and residual connections into modeling is to execute the $\mathrm{LNorm}(\cdot)$ function right after the $F(\cdot)$ function, and to establish an identity mapping from the input to the output of the entire sub-layer. This structure, known as the pre-norm structure, can be expressed in the form

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathrm{LNorm}(F(\mathbf{H}_{\mathrm{in}}))+\mathbf{H}_{\mathrm{in}} \qquad (36)$$

Both post-norm and pre-norm Transformer models are widely used in NLP systems. See Figure 3 for a comparison of these two structures. In general, residual connections are considered an effective means to make the training of multi-layer neural networks easier. In this sense, pre-norm Transformer seems promising because it follows the convention that a residual connection is created to bypass the whole network and that the identity mapping from the input to the output leads to easier optimization of deep models. However, by considering the expressive power of a model, there may be modeling advantages in using post-norm Transformer because it does not so much rely on residual connections and enforces more sophisticated modeling for representation learning. In Section 4.2, we will see a discussion on this issue.
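The two residual structures of Eqs. (35)-(36) can be written side by side as a small sketch, reusing the layer_norm helper from the sketch above; here F stands for any core function such as self-attention or the FFN, and the default $\epsilon$ is our own choice.

```python
def post_norm_sublayer(H_in, F, g, b, eps=1e-6):
    """Eq. (35): H_out = LNorm(F(H_in) + H_in)."""
    return layer_norm(F(H_in) + H_in, g, b, eps)

def pre_norm_sublayer(H_in, F, g, b, eps=1e-6):
    """Eq. (36): H_out = LNorm(F(H_in)) + H_in."""
    return layer_norm(F(H_in), g, b, eps) + H_in
```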

Figure 3: The post-norm and pre-norm structures. $F(\cdot)=$ core function, $\mathrm{LNorm}(\cdot)=$ layer normalization, and $\oplus=$ residual connection.

2.5 Feed-forward Neural Networks

The use of FFNs in Transformer is inspired in part by the fact that complex outputs can be formed by transforming the inputs through nonlinearities. While the self-attention model itself has some nonlinearity (in $\mathrm{Softmax}(\cdot)$), a more common way to do this is to consider additional layers with non-linear activation functions and linear transformations. Given an input $\mathbf{H}_{\mathrm{in}}\in\mathbb{R}^{m\times d}$ and an output $\mathbf{H}_{\mathrm{out}}\in\mathbb{R}^{m\times d}$, the $\mathbf{H}_{\mathrm{out}}=\mathrm{FFN}(\mathbf{H}_{\mathrm{in}})$ function in Transformer has the following form

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathbf{H}_{\mathrm{hidden}}\mathbf{W}_{f}+\mathbf{b}_{f} \qquad (37)$$
$$\mathbf{H}_{\mathrm{hidden}} \;=\; \mathrm{ReLU}(\mathbf{H}_{\mathrm{in}}\mathbf{W}_{h}+\mathbf{b}_{h}) \qquad (38)$$

where $\mathbf{H}_{\mathrm{hidden}}\in\mathbb{R}^{m\times d_{\mathrm{ffn}}}$ is the matrix of hidden states, and $\mathbf{W}_{h}\in\mathbb{R}^{d\times d_{\mathrm{ffn}}}$, $\mathbf{b}_{h}\in\mathbb{R}^{d_{\mathrm{ffn}}}$, $\mathbf{W}_{f}\in\mathbb{R}^{d_{\mathrm{ffn}}\times d}$ and $\mathbf{b}_{f}\in\mathbb{R}^{d}$ are the parameters. This is a two-layer FFN in which the first layer (or hidden layer) introduces a nonlinearity through $\mathrm{ReLU}(\cdot)$ (where $\mathrm{ReLU}(x)=\max\{0,x\}$), and the second layer involves only a linear transformation. It is common practice in Transformer to use a larger size for the hidden layer. For example, a common choice is $d_{\mathrm{ffn}}=4d$, that is, the size of each hidden representation is 4 times as large as the input.
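A direct NumPy transcription of Eqs. (37)-(38) is given below; the randomly initialized parameters and the sizes $d=16$, $d_{\mathrm{ffn}}=64$ are illustrative assumptions, not values from the text.

```python
import numpy as np

def ffn(H_in, W_h, b_h, W_f, b_f):
    """Eqs. (37)-(38): a two-layer FFN with a ReLU between the linear maps."""
    H_hidden = np.maximum(0.0, H_in @ W_h + b_h)   # ReLU(H_in W_h + b_h), shape m x d_ffn
    return H_hidden @ W_f + b_f                     # back to shape m x d

m, d, d_ffn = 5, 16, 64                             # d_ffn = 4d, a common choice
W_h, b_h = np.random.randn(d, d_ffn), np.zeros(d_ffn)
W_f, b_f = np.random.randn(d_ffn, d), np.zeros(d)
H_out = ffn(np.random.randn(m, d), W_h, b_h, W_f, b_f)
```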

Note that using a wide FFN sub-layer has proven to be of great practical value in many state-of-the-art systems. However, a consequence of this is that the model size is dominated by the parameters of the FFN. Table 1 shows parameter counts and time complexities for different modules of a standard Transformer system. We see that FFNs dominate the model size when $d_{\mathrm{ffn}}$ is large, though they are not the most time-consuming components. In the case of very big Transformer models, we therefore wish to address this problem when building efficient systems.

| Sub-model | # of Parameters | Time Complexity | $\times$ |
|---|---|---|---|
| Encoder: Multi-head Self-attention | $4d^{2}$ | $O(m^{2}\cdot d)$ | $L$ |
| Encoder: Feed-forward Network | $2d\cdot d_{\mathrm{ffn}}+d+d_{\mathrm{ffn}}$ | $O(m\cdot d\cdot d_{\mathrm{ffn}})$ | $L$ |
| Encoder: Layer Normalization | $2d$ | $O(d)$ | $2L$ |
| Decoder: Multi-head Self-attention | $4d^{2}$ | $O(n^{2}\cdot d)$ | $L$ |
| Decoder: Multi-head Cross-attention | $4d^{2}$ | $O(m\cdot n\cdot d)$ | $L$ |
| Decoder: Feed-forward Network | $2d\cdot d_{\mathrm{ffn}}+d+d_{\mathrm{ffn}}$ | $O(n\cdot d\cdot d_{\mathrm{ffn}})$ | $L$ |
| Decoder: Layer Normalization | $2d$ | $O(d)$ | $3L$ |

Table 1: Numbers of parameters and time complexities of different Transformer modules under different setups. $m=$ source-sequence length, $n=$ target-sequence length, $d=$ default number of dimensions of a hidden layer, $d_{\mathrm{ffn}}=$ number of dimensions of the FFN hidden layer, $\tau=$ number of heads in the attention models, and $L=$ number of encoding or decoding layers. The column $\times$ gives the number of times a sub-model is applied on the encoder or decoder side. The time complexities are estimated by counting the number of multiplications of floating-point numbers.

2.6 Attention Models on the Decoder Side

A decoder layer involves two attention sub-layers, the first of which is a self-attention sub-layer, and the second is a cross-attention sub-layer. These sub-layers are based on either the post-norm or the pre-norm structure, but differ in the design of the attention functions. Consider, for example, the post-norm structure, described in Eq. (35). We can define the cross-attention and self-attention sub-layers for a decoding layer to be

$$\mathbf{S}_{\mathrm{cross}} \;=\; \mathrm{Layer}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}}) \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}})+\mathbf{S}_{\mathrm{self}}) \qquad (39)$$
$$\mathbf{S}_{\mathrm{self}} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{S}) \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{S})+\mathbf{S}) \qquad (40)$$

where $\mathbf{S}\in\mathbb{R}^{n\times d}$ is the input of the self-attention sub-layer, $\mathbf{S}_{\mathrm{cross}}\in\mathbb{R}^{n\times d}$ and $\mathbf{S}_{\mathrm{self}}\in\mathbb{R}^{n\times d}$ are the outputs of the sub-layers, and $\mathbf{H}_{\mathrm{enc}}\in\mathbb{R}^{m\times d}$ is the output of the encoder (for an encoder having $L$ encoder layers, $\mathbf{H}_{\mathrm{enc}}=\mathbf{H}^{L}$).

As with conventional attention models, cross-attention is primarily used to model the correspondence between the source-side and target-side sequences. The $\mathrm{Att}_{\mathrm{cross}}(\cdot)$ function is based on the QKV attention model which generates the result of querying a collection of key-value pairs. More specifically, we define the queries, keys and values as linear mappings of $\mathbf{S}_{\mathrm{self}}$ and $\mathbf{H}_{\mathrm{enc}}$, as follows

$$\mathbf{S}_{\mathrm{self}}^{q} \;=\; \mathbf{S}_{\mathrm{self}}\mathbf{W}_{\mathrm{cross}}^{q} \qquad (41)$$
$$\mathbf{H}_{\mathrm{enc}}^{k} \;=\; \mathbf{H}_{\mathrm{enc}}\mathbf{W}_{\mathrm{enc}}^{k} \qquad (42)$$
$$\mathbf{H}_{\mathrm{enc}}^{v} \;=\; \mathbf{H}_{\mathrm{enc}}\mathbf{W}_{\mathrm{enc}}^{v} \qquad (43)$$

where $\mathbf{W}_{\mathrm{cross}}^{q},\mathbf{W}_{\mathrm{enc}}^{k},\mathbf{W}_{\mathrm{enc}}^{v}\in\mathbb{R}^{d\times d}$ are the parameters of the mappings. In other words, the queries are defined based on $\mathbf{S}_{\mathrm{self}}$, and the keys and values are defined based on $\mathbf{H}_{\mathrm{enc}}$.

$\mathrm{Att}_{\mathrm{cross}}(\cdot)$ is then defined as

$$\mathrm{Att}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{S}_{\mathrm{self}}^{q},\mathbf{H}_{\mathrm{enc}}^{k},\mathbf{H}_{\mathrm{enc}}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{S}_{\mathrm{self}}^{q}[\mathbf{H}_{\mathrm{enc}}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}_{\mathrm{enc}}^{v} \qquad (44)$$

The $\mathrm{Att}_{\mathrm{self}}(\cdot)$ function has a similar form to $\mathrm{Att}_{\mathrm{cross}}(\cdot)$, with linear mappings of $\mathbf{S}$ taken as the queries, keys, and values, like this

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{S}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{S}^{q},\mathbf{S}^{k},\mathbf{S}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}\Big)\mathbf{S}^{v} \qquad (45)$$

where $\mathbf{S}^{q}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{q}$, $\mathbf{S}^{k}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{k}$, and $\mathbf{S}^{v}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{v}$ are linear mappings of $\mathbf{S}$ with parameters $\mathbf{W}_{\mathrm{dec}}^{q},\mathbf{W}_{\mathrm{dec}}^{k},\mathbf{W}_{\mathrm{dec}}^{v}\in\mathbb{R}^{d\times d}$.

This form is similar to that of Eq. (20). A difference compared to self-attention on the encoder side, however, is that the model here needs to follow the rule of left-to-right generation (see Figure 2). That is, given a target-side word at position $i$, we can see only the target-side words in the left context $y_{1}...y_{i-1}$. To do this, we add a masking variable $\mathbf{M}$ to the unnormalized weight matrix, giving $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$. Both $\mathbf{M}$ and $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$ are of size $n\times n$, and a lower value of an entry of $\mathbf{M}$ means a larger bias towards a lower alignment score for the corresponding entry of $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$. In order to avoid access to the right context at position $i$, $\mathbf{M}$ is defined to be

$$M(i,k) \;=\; \begin{cases}0 & k\leq i\\ -\infty & k>i\end{cases} \qquad (46)$$
Table 2: Self-attention on the encoder and decoder sides. Each line connects an input and an output of the self-attention model, indicating a dependency of an output state on an input state. For encoder self-attention, the output at any position is computed by having access to the entire sequence. By contrast, for decoder self-attention, the output at position $i$ is computed by seeing only the inputs at positions up to $i$.

where $M(i,k)$ indicates a bias term for the alignment score between positions $i$ and $k$. Below we show an example of how the masking variable is applied (assume $n=4$).

$$\mathrm{Softmax}\Big(\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}\Big) \;=\; \mathrm{Softmax}\Big(\begin{bmatrix}2&0.1&1&1\\ 0&0.9&0.9&0.9\\ 0.2&0.8&0.7&2\\ 0.3&1&0.3&3\end{bmatrix}+\begin{bmatrix}0&-\infty&-\infty&-\infty\\ 0&0&-\infty&-\infty\\ 0&0&0&-\infty\\ 0&0&0&0\end{bmatrix}\Big)$$
$$=\; \mathrm{Softmax}\Big(\begin{bmatrix}2&-\infty&-\infty&-\infty\\ 0&0.9&-\infty&-\infty\\ 0.2&0.8&0.7&-\infty\\ 0.3&1&0.3&3\end{bmatrix}\Big) \;=\; \begin{bmatrix}1&0&0&0\\ 0.3&0.7&0&0\\ 0.2&0.4&0.4&0\\ 0.05&0.1&0.05&0.8\end{bmatrix} \qquad (47)$$
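The masking variable of Eq. (46) and the masked decoder self-attention of Eq. (45) can be sketched in NumPy as follows (single head, randomly initialized parameters of our own choosing). A softmax over rows containing $-\infty$ simply assigns zero weight to the masked positions, as in the example above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    """Eq. (46): 0 where k <= i, -inf where k > i (the strict upper triangle)."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def masked_self_attention(S, W_q, W_k, W_v):
    """Eq. (45): decoder self-attention with the causal mask added to the scores."""
    d = S.shape[-1]
    Q, K, V = S @ W_q, S @ W_k, S @ W_v
    scores = Q @ K.T / np.sqrt(d) + causal_mask(S.shape[0])
    return softmax(scores, axis=-1) @ V

n, d = 4, 8
S = np.random.randn(n, d)
W_q, W_k, W_v = [np.random.randn(d, d) for _ in range(3)]
print(masked_self_attention(S, W_q, W_k, W_v))
```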

As noted in Section 2.3, it is easy to improve these models by using the multi-head attention mechanism. Also, since decoders are typically the most time-consuming part of practical systems, the bulk of the computational effort in running these systems is very much concerned with the efficiency of the attention modules on the decoder side.

2.7 Training and Inference

Transformers can be trained and used in a regular way. For example, we can train a Transformer model by performing gradient descent to minimize some loss function on the training data, and test the trained model by performing beam search on the unseen data. Below we present some of the techniques that are typically used in the training and inference of Transformer models.

  • Learning Rate Scheduling. As with standard neural networks, Transformers can be directly trained using back-propagation. The training process is generally iterated many times to make the models fit the training data well. In each training step, we update the weights of the neural networks by moving them a small step in the direction of the negative gradients of the errors. There are many ways to design the update rule of training. A popular choice is to use the Adam optimization method (Kingma and Ba, 2014). To adjust the learning rate during training, Vaswani et al. (2017) present a learning rate scheduling strategy which increases the learning rate linearly for a number of steps and then decays it gradually (a code sketch of this schedule is given at the end of this section). They design a learning rate of the form

    $$lr \;=\; lr_{0}\cdot\min\left\{n_{\mathrm{step}}^{-0.5},\ n_{\mathrm{step}}\cdot(n_{\mathrm{warmup}})^{-1.5}\right\} \qquad (48)$$


    where $lr_{0}$ denotes the initial learning rate, $n_{\mathrm{step}}$ denotes the number of training steps we have executed, and $n_{\mathrm{warmup}}$ denotes the number of warmup steps. In the first $n_{\mathrm{warmup}}$ steps, the learning rate $lr$ grows larger as training proceeds. It reaches its highest value at the point $n_{\mathrm{step}}=n_{\mathrm{warmup}}$, and then decreases as an inverse square root function (i.e., $lr_{0}\cdot n_{\mathrm{step}}^{-0.5}$).

  • Batching and Padding. To make a trade-off between global optimization and training convergence, it is common to update the weights each time on a relatively small collection of samples, called a minibatch of samples. Therefore, we can consider a batch version of the forward and backward computation processes in which the whole minibatch is used together to obtain the gradient information. One advantage of batching is that it allows the system to make use of efficient tensor operations to deal with multiple sequences in a single run. This requires that all the input sequences in a minibatch are stored in a single memory block, so that they can be read in and processed together. To illustrate this idea, consider a minibatch containing four samples whose source-sides are


    A B C D E F
    M N
    R S T
    W X Y Z

    We can store these sequences in a 4\times 6 contiguous block where each “row” represents a sequence, like this

    A B C D E F
    M N \square \square \square \square
    R S T \square \square \square
    W X Y Z \square \square

    Here padding words \square are inserted between sequences, so that these sequences are aligned in the memory. Typically, we do not want padding to affect the operation of the system, and so we can simply define \square as a zero vector (call it zero padding). On the other hand, in some cases we are interested in using padding to describe something that is not covered by the input sequences. For example, we can replace padding words with the words in the left (or right) context of a sequence, though this may require modifications to the system to ensure that the newly added context words do not cause additional content to appear in the output. A minimal padding sketch is also given after this list.

  • Search and Caching. At test time, we need to search the space of candidate hypotheses (or candidate target-side sequences) to identify the hypothesis (or target-side sequence) with the highest score.

    \hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\ \mathrm{score}(\mathbf{x},\mathbf{y}) (49)


    where \mathrm{score}(\mathbf{x},\mathbf{y}) is the model score of the target-side sequence \mathbf{y} given the source-side sequence \mathbf{x}. While there are many search algorithms to achieve this, most of them share a similar structure: the search program operates by extending the candidate target-side sequences in a pool, one position at a time. The resulting algorithm can thus be viewed as a left-to-right generation procedure. Note that all designs of \mathrm{score}(\mathbf{x},\mathbf{y}), no matter how complex, are based on computing \mathrm{Pr}(\mathbf{y}|\mathbf{x}). Because the attention models used in Transformers require computing the dot-product of each pair of input vectors of a layer, the time complexity of the search algorithm is a quadratic function of the length of \mathbf{y}. It is therefore inefficient to repeatedly compute the outputs of the attention models for positions that have already been processed. This problem can be addressed by caching the states of each layer for the words we have seen. Figure 4 illustrates the use of the caching mechanism in a search step. All the states for positions <i are maintained and easily accessed in a cache. At position i, all we need is to compute the states for the newly added word, and then to update the cache. A minimal sketch of such a cache is given after this list.

    Figure 4: Illustration of the caching mechanism in Transformer decoders. Rectangles indicate the states of decoding layers or sub-layers. At step i, all the states at previous steps are stored in a cache (see dotted boxes), and we only need to compute the states for this step (see blue rectangles and arrows). Then, we add the newly generated states to the cache, and move on to step i+1.
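To make these techniques more concrete, below are three minimal Python sketches: one for the learning rate schedule of Eq. (48), one for minibatch padding, and one for the per-layer caching used in left-to-right decoding. They are illustrative only; all names (transformer_lr, pad_batch, LayerCache, and so on) are our own, and practical systems implement these ideas with batched tensor operations.

A sketch of the warmup-then-decay schedule of Eq. (48):

    def transformer_lr(n_step, lr0=1.0, n_warmup=4000):
        # Eq. (48): the learning rate grows linearly for n_warmup steps and then
        # decays as an inverse square root of the step number.
        n_step = max(n_step, 1)  # avoid a zero division at step 0
        return lr0 * min(n_step ** -0.5, n_step * n_warmup ** -1.5)

A sketch of right-padding a minibatch of token-id sequences into a rectangular block, together with a 0-1 mask that marks the real (non-padding) positions:

    def pad_batch(batch, pad_id=0):
        # Pad every sequence to the length of the longest sequence in the minibatch,
        # so that the whole minibatch can be stored in a single memory block.
        max_len = max(len(seq) for seq in batch)
        padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
        mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
        return padded, mask

A sketch of the caching mechanism of Figure 4 for a single decoding layer: the states of all positions <i are kept, and at step i we only append the states of the newly added word:

    class LayerCache:
        # Keeps the per-position states (e.g., keys and values) of one decoding layer.
        def __init__(self):
            self.keys, self.values = [], []

        def append(self, k, v):
            # Called once per step, with the states of the newly added word only.
            self.keys.append(k)
            self.values.append(v)

        def context(self):
            # All states attended to at the current step: positions < i plus i itself.
            return self.keys, self.values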

3 Syntax-aware Models

Although Transformer is simply a deep learning model that does not make use of any linguistic structure or assumption, it may be necessary to incorporate our prior knowledge into such systems. This is in part because NLP researchers have long believed that a higher level of abstraction of data is needed to develop ideal NLP systems, and there have been many systems that use structure as priors. However, structure is a wide-ranging topic, and there are several types of structure one may refer to (See, 2018). For example, the inductive biases used in our model design can be thought of as structural priors, while NLP models can also learn the underlying structure of problems by themselves. In this section we will discuss some of these issues, focusing on methods of introducing linguistic structure into Transformer models. As Transformers can be applied to many NLP tasks, which differ greatly in their input and output formats, we will primarily discuss modifications to Transformer encoders (call them syntax-aware Transformer encoders). Our discussion, however, is general, and the methods can be easily extended to Transformer decoders.

3.1 Syntax-aware Input and Output

One of the simplest methods of incorporating structure into NLP systems is to modify the input sequence, leaving the system unchanged. As a simple example, consider a sentence where each word x_{j} is assigned a set of \kappa syntactic labels \{\mathrm{tag}_{j}^{1},...,\mathrm{tag}_{j}^{\kappa}\} (e.g., POS labels and dependency labels). We can write these symbols together to define a new “word”

x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}

Then, the embedding of this word is given by

\mathbf{xp}_{j} = e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa})+\mathrm{PE}(j) (50)

where e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa})\in\mathbb{R}^{d} is the embedding of x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}. Since x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa} is a complex symbol, we decompose the learning problem of e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) into easier problems. For example, we can develop \kappa embedding models, each producing an embedding given a tag. Then, we write e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) as a sum of the word embedding and tag embeddings

e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) = \mathbf{x}_{j}+e(\mathrm{tag}_{j}^{1})+...+e(\mathrm{tag}_{j}^{\kappa}) (51)

where \{e(\mathrm{tag}_{j}^{1}),...,e(\mathrm{tag}_{j}^{\kappa})\} are the embeddings of the tags. Alternatively, we can combine these embeddings via a neural network in the form

e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) = \mathrm{FFN}_{\mathrm{embed}}(\mathbf{x}_{j},e(\mathrm{tag}_{j}^{1}),...,e(\mathrm{tag}_{j}^{\kappa})) (52)

where \mathrm{FFN}_{\mathrm{embed}}(\cdot) is a feed-forward neural network with one or two layers.
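As a rough illustration of Eqs. (50)-(51), the following sketch (with made-up table sizes and names, and random tables standing in for learned embeddings) sums the word embedding, the \kappa tag embeddings, and the positional encoding; the FFN-based combination of Eq. (52) would simply replace the sum.

    import numpy as np

    d, vocab_size, tag_vocab, kappa, max_len = 8, 100, 20, 2, 64
    rng = np.random.default_rng(0)
    word_emb = rng.normal(size=(vocab_size, d))        # e(x_j)
    tag_emb = rng.normal(size=(kappa, tag_vocab, d))   # e(tag_j^k), one table per tag type
    pos_enc = rng.normal(size=(max_len, d))            # PE(j), any positional encoding

    def syntax_augmented_input(word_id, tag_ids, j):
        # Eq. (51): word embedding plus the kappa tag embeddings ...
        e = word_emb[word_id] + sum(tag_emb[k][t] for k, t in enumerate(tag_ids))
        # ... and Eq. (50): add the positional encoding of position j.
        return e + pos_enc[j]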

We can do the same thing for sentences on the decoder side as well, and treat y_{i}/\mathrm{tag}_{i}^{1}/.../\mathrm{tag}_{i}^{\kappa} as a syntax-augmented word. However, this may lead to a much larger target-side vocabulary and poses a computational challenge for training and inference.

Another form that is commonly used to represent a sentence is the syntax tree. In linguistics, the syntax of a sentence can be interpreted in many different ways, resulting in various grammars and the corresponding tree (or graph)-based representations. While these representations differ in their syntactic forms, a general approach to using them in sequence modeling is tree linearization. Consider the following sentence annotated with a constituency-based parse tree

We can write this tree structure as a sequence of words, syntactic labels and brackets via a tree traversal algorithm, as follows

(S (NP (PRP It )_{\textrm{PRP}} )_{\textrm{NP}} (VP (VBZ 's )_{\textrm{VBZ}} (ADJP (JJ interesting )_{\textrm{JJ}} )_{\textrm{ADJP}} )_{\textrm{VP}} (. ! )_{\textrm{.}} )_{\textrm{S}}

This sequence of syntactic tokens can be used as an input to the system, that is, each token is represented by word and positional embeddings, and then the sum of these embeddings is treated as a regular input of the encoder. An example of the use of linearized trees is tree-to-string machine translation in which a syntax tree in one language is translated into a string in another language (Li et al., 2017; Currey and Heafield, 2018). Linearized trees can also be used for tree generation. For example, we can frame parsing tasks as sequence-to-sequence problems to map an input text to a sequential representation of its corresponding syntax tree (Vinyals et al., 2015; Choe and Charniak, 2016). See Figure 5 for illustrations of these models. It should be noted that the methods described here are not specific to Transformer but could be applied to many models, such as RNN-based models.

Figure 5: Illustration of tree linearization on either the encoder or decoder side. For tree-to-string machine translation, the encoder takes sequential representation of an input parse tree, and the decoder outputs the corresponding translation. For parsing, the encoder takes a sentence, and the decoder outputs the corresponding syntax tree.
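As a rough sketch of how such a linearization can be produced, the following recursive traversal (the nested-tuple tree format and the function name are our own choices, not from the paper) emits the bracketed token sequence shown above:

    def linearize(tree):
        # tree = (label, children); children is either a word (leaf) or a list of subtrees.
        label, children = tree
        if isinstance(children, str):                  # leaf: a single word
            return ["(" + label, children, ")" + label]
        tokens = ["(" + label]
        for child in children:
            tokens += linearize(child)
        tokens.append(")" + label)
        return tokens

    tree = ("S", [("NP", [("PRP", "It")]),
                  ("VP", [("VBZ", "'s"), ("ADJP", [("JJ", "interesting")])]),
                  (".", "!")])
    print(" ".join(linearize(tree)))
    # (S (NP (PRP It )PRP )NP (VP (VBZ 's )VBZ (ADJP (JJ interesting )JJ )ADJP )VP (. ! ). )S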

3.2 Syntax-aware Attention Models

For Transformer models, it also makes sense to make use of syntax trees to guide the process of learning sequence representations. In the previous section we saw how representations of a sequence can be computed by relating different positions within that sequence. This allows us to impose some structure on these relations, which are represented by distributions of attention weights over all the positions. To do this, we use the encoder self-attention with an additive mask

\mathrm{AttSyn}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M})\mathbf{H}^{v} (53)

or alternatively with a multiplicative mask

\mathrm{AttSyn}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\odot\mathbf{M})\mathbf{H}^{v} (54)

where \mathbf{M}\in\mathbb{R}^{m\times m} is a matrix of masking variables in which a larger value of M(i,j) indicates a stronger syntactic correlation between positions i and j. In the following description we choose Eq. (54) as the basic form.
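A minimal NumPy sketch of Eq. (54) follows; all names are illustrative, and the query/key/value matrices \mathbf{H}^{q}, \mathbf{H}^{k}, \mathbf{H}^{v} and the mask \mathbf{M} are assumed to be given as arrays.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def syntax_masked_attention(Hq, Hk, Hv, M):
        # Eq. (54): scale the m x m attention scores elementwise by the syntactic mask M.
        d = Hq.shape[-1]
        scores = (Hq @ Hk.T) / np.sqrt(d)
        return softmax(scores * M) @ Hv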

One common way to design \mathbf{M} is to project syntactic relations of the input tree structure into constraints over the sequence. Here we consider constituency parse trees and dependency parse trees for illustration. Generally, two types of masking methods are employed.

  • 0-1 Masking. This method assigns M(i,j) a value of 1 if the words at positions i and j are considered syntactically correlated and a value of 0 otherwise (Zhang et al., 2020; Bai et al., 2021). To model the relation between two words in a syntax tree, we can consider the distance between their corresponding nodes. One of the simplest forms is given by

    M(i,j) = \begin{cases}1&\omega(i,j)\leq\omega_{\mathrm{max}}\\ 0&\textrm{otherwise}\end{cases} (55)


    where \omega(i,j) is the length of the shortest path between the nodes of the words at positions i and j. For example, given a dependency parse tree, \omega(i,j) is the number of dependency edges in the path between the two words. For a constituency parse tree, all the words are leaf nodes, and so \omega(i,j) gives a tree distance between the two leaves in the same branch of the tree. \omega_{\mathrm{max}} is a parameter used to control the maximum distance between two nodes that can be considered syntactically correlated. For example, assuming that there is a dependency parse tree and \omega_{\mathrm{max}}=1, Eq. (55) enforces a constraint that the attention score between positions i and j is computed only if they have a parent-dependent relation.[4]
    [4] For multiplicative masks, M(i,j)=0 does not mean that the attention weight between j and i is zero, because the Softmax function does not give a zero output for a dimension whose corresponding input is of a zero value. A method to “mask” an entry of \mathrm{Softmax}(\frac{\mathbf{H}\mathbf{H}^{\mathrm{T}}}{\sqrt{d}}) is to use an additive mask and set M(i,j)=-\infty if \omega(i,j)>\omega_{\mathrm{max}}.

  • Soft Masking. Instead of treating \mathbf{M} as a hard constraint, we can use it as a soft constraint that scales the attention weight between positions i and j in terms of the degree to which the corresponding words are correlated. An idea is to reduce the attention weight as \omega(i,j) becomes larger. A very simple method to do this is to transform \omega(i,j) in some way that M(i,j) holds a negative correlation relationship with \omega(i,j) and its value falls into the interval [0,1]

    M(i,j) = \mathrm{DNorm}(\omega(i,j)) (56)


    There are several alternative designs for \mathrm{DNorm}(\cdot). For example, one can compute a standardized score of -\omega(i,j) by subtracting its mean and dividing by its standard deviation (Chen et al., 2018a), or can normalize 1/\omega(i,j) over all possible j in the sequence (Xu et al., 2021b). In cases where parsers can output a score between positions i and j, it is also possible to use this score to compute M(i,j). For example, a dependency parser can produce the probability of the word at position i being the parent of the word at position j (Strubell et al., 2018). We can then write M(i,j) as