
Introduction to Transformers: an NLP Perspective

 

Tong Xiao (xiaotong@mail.neu.edu.cn)
NLP Lab., Northeastern University, Shenyang, China
NiuTrans Research, Shenyang, China

Jingbo Zhu (zhujingbo@mail.neu.edu.cn)
NLP Lab., Northeastern University, Shenyang, China
NiuTrans Research, Shenyang, China
Abstract

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.

1 Background

Transformers are a type of neural network (Vaswani et al., 2017). They were originally known for their strong performance in machine translation, and are now a de facto standard for building large-scale self-supervised learning systems (Devlin et al., 2019; Brown et al., 2020). The past few years have seen the rise of Transformers not only in natural language processing (NLP) but also in several other fields, such as computer vision and multi-modal processing. As Transformers continue to mature, these models are playing an increasingly important role in the research and application of artificial intelligence (AI).

Looking back at the history of neural networks, Transformers have not been around for a long time. While Transformers are “newcomers” in NLP, they were developed on top of several ideas, the origins of which can be traced back to earlier work, such as word embedding (Bengio et al., 2003; Mikolov et al., 2013) and attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015). As a result, Transformers can benefit from the advancements of different sub-fields of deep learning, and provide an elegant way to combine these neural models. On the other hand, Transformers are unique, and differ from previous models in several ways. First, they do not depend on recurrent or convolutional neural networks for modeling sequences of words, but use only attention mechanisms and feed-forward neural networks. Second, the use of self-attention in Transformers makes it easier to deal with global contexts and dependencies among words. Third, Transformers are very flexible architectures and can be easily modified to accommodate different tasks.

The widespread use of Transformers motivates the development of cutting-edge techniques in deep learning. For example, there are significant refinements in self-attention mechanisms, which have been incorporated into many state-of-the-art NLP systems. The resulting techniques, together with the progress in self-supervised learning, have led us to a new era of AI: we are beginning to obtain models of universal language understanding, generation and reasoning. This has been evidenced by recent Transformer-based large language models (LLMs) which demonstrate amazing performance across a broad variety of tasks (Bubeck et al., 2023).

This paper provides an introduction to Transformers while reflecting the recent developments in applying these models to different problems. However, Transformers are so successful that there have been numerous related studies and we cannot give a full description of them. Therefore, we focus this work on the core ideas of Transformers, and present a basic description of the common techniques. We also discuss some recent advances in Transformers, such as model improvements for efficiency and accuracy considerations. Because the field is very active and new techniques are coming out every day, it is impossible to survey all the latest literature and we are not attempting to do so. Instead, we focus on just those concepts and algorithms most relevant to Transformers, aimed at the people who wish to get a general understanding of these models.

2 The Basic Model

Here we consider the model presented in Vaswani et al. (2017)’s work. We start by considering the Transformer architecture and discuss the details of the sub-models subsequently.

2.1 The Transformer Architecture

Figure 1 shows the standard Transformer model which follows the general encoder-decoder framework. A Transformer encoder comprises a number of stacked encoding layers (or encoding blocks). Each encoding layer has two different sub-layers (or sub-blocks), called the self-attention sub-layer and the feed-forward neural network (FFN) sub-layer. Suppose we have a source-side sequence $\mathbf{x}=x_{1}...x_{m}$ and a target-side sequence $\mathbf{y}=y_{1}...y_{n}$. The input of an encoding layer is a sequence of $m$ vectors $\mathbf{h}_{1}...\mathbf{h}_{m}$, each having $d_{\mathrm{model}}$ dimensions (or $d$ dimensions for simplicity). We follow the notation adopted in the previous chapters, using $\mathbf{H}\in\mathbb{R}^{m\times d}$ to denote these input vectors (provided each $\mathbf{h}_{j}\in\mathbb{R}^{d}$ is a row vector, $\mathbf{H}$ stacks them as $\mathbf{H}=\begin{bmatrix}\mathbf{h}_{1}\\ \vdots\\ \mathbf{h}_{m}\end{bmatrix}$). The self-attention sub-layer first performs a self-attention operation $\mathrm{Att}_{\mathrm{self}}(\cdot)$ on $\mathbf{H}$ to generate an output $\mathbf{C}$:

$$\mathbf{C} \;=\; \mathrm{Att}_{\mathrm{self}}(\mathbf{H}) \qquad (1)$$
Figure 1: The Transformer architecture (Vaswani et al., 2017). There are $L$ stacked layers on each of the encoder and decoder sides. An encoding layer comprises a self-attention sub-layer and an FFN sub-layer. Both of these sub-layers share the same structure, which involves a core function (either $\mathrm{Layer}_{\mathrm{self}}(\cdot)$ or $\mathrm{Layer}_{\mathrm{ffn}}(\cdot)$), followed by a residual connection and a layer normalization unit. Each decoding layer has a similar architecture to the encoding layers, but with an additional encoder-decoder attention sub-layer sandwiched between the self-attention and FFN sub-layers. As with most sequence-to-sequence models, Transformer takes $x_{1}...x_{m}$ and $y_{0}...y_{i-1}$ for predicting $y_{i}$. The representation of an input word comprises the sum of a word embedding and a positional embedding. The distributions $\{\operatorname{Pr}(\cdot|y_{0}...y_{i-1},x_{1}...x_{m})\}$ are generated in sequence by a Softmax layer, which operates on a linear transformation of the output from the last decoding layer.

Here $\mathbf{C}$ is of the same size as $\mathbf{H}$, and can thus be viewed as a new representation of the inputs. Then, a residual connection and a layer normalization unit are added to the output so that the resulting model is easier to optimize.

The original Transformer model employs the post-norm structure where a residual connection is created before layer normalization is performed, like this

$$\mathbf{H}_{\mathrm{self}} \;=\; \mathrm{LNorm}(\mathbf{C}+\mathbf{H}) \qquad (2)$$

where the addition of $\mathbf{H}$ denotes the residual connection (He et al., 2016a), and $\mathrm{LNorm}(\cdot)$ denotes the layer normalization function (Ba et al., 2016). Substituting Eq. (1) into Eq. (2), we obtain the form of the self-attention sub-layer

$$\mathrm{Layer}_{\mathrm{self}}(\mathbf{H}) \;=\; \mathbf{H}_{\mathrm{self}} \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{H})+\mathbf{H}) \qquad (3)$$

The definitions of $\mathrm{LNorm}(\cdot)$ and $\mathrm{Att}_{\mathrm{self}}(\cdot)$ will be given later in this section.

The FFN sub-layer takes $\mathbf{H}_{\mathrm{self}}$ and outputs a new representation $\mathbf{H}_{\mathrm{ffn}}\in\mathbb{R}^{m\times d}$. It has the same form as the self-attention sub-layer, with the attention function replaced by the FFN function, given by

$$\mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}_{\mathrm{self}}) \;=\; \mathbf{H}_{\mathrm{ffn}} \;=\; \mathrm{LNorm}(\mathrm{FFN}(\mathbf{H}_{\mathrm{self}})+\mathbf{H}_{\mathrm{self}}) \qquad (4)$$

Here $\mathrm{FFN}(\cdot)$ could be any feed-forward neural network with non-linear activation functions. The most common structure of $\mathrm{FFN}(\cdot)$ is a two-layer network involving two linear transformations and a ReLU activation function between them.

For deep models, we can stack the above neural networks. Let $\mathbf{H}^{l}$ be the output of layer $l$. Then, we can express $\mathbf{H}^{l}$ as a function of $\mathbf{H}^{l-1}$. We write this as a composition of two sub-layers

$$\mathbf{H}^{l} \;=\; \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}_{\mathrm{self}}^{l}) \qquad (5)$$
$$\mathbf{H}_{\mathrm{self}}^{l} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{H}^{l-1}) \qquad (6)$$

If there are $L$ encoding layers, then $\mathbf{H}^{L}$ will be the output of the encoder. In this case, $\mathbf{H}^{L}$ can be viewed as a representation of the input sequence that is learned by the Transformer encoder. $\mathbf{H}^{0}$ denotes the input of the encoder. In recurrent and convolutional models, $\mathbf{H}^{0}$ can simply be the word embeddings of the input sequence. Transformer takes a different way of representing the input words, and encodes the positional information explicitly. In Section 2.2 we will discuss the embedding model used in Transformers.

The Transformer decoder has a similar structure to the Transformer encoder. It comprises $L$ stacked decoding layers (or decoding blocks). Let $\mathbf{S}^{l}$ be the output of the $l$-th decoding layer. We can formulate a decoding layer by using the following equations

$$\mathbf{S}^{l} \;=\; \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{S}_{\mathrm{cross}}^{l}) \qquad (7)$$
$$\mathbf{S}_{\mathrm{cross}}^{l} \;=\; \mathrm{Layer}_{\mathrm{cross}}(\mathbf{H}^{L},\mathbf{S}_{\mathrm{self}}^{l}) \qquad (8)$$
$$\mathbf{S}_{\mathrm{self}}^{l} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{S}^{l-1}) \qquad (9)$$

Here there are three decoder sub-layers. The self-attention and FFN sub-layers are the same as those used in the encoder. $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ denotes a cross-attention sub-layer (or encoder-decoder sub-layer) which models the transformation from the source-side to the target-side. In Section 2.6 we will see that $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ can be implemented using the same function as $\mathrm{Layer}_{\mathrm{self}}(\cdot)$.

The Transformer decoder outputs a distribution over a vocabulary $V_{\mathrm{y}}$ at each target-side position. This is achieved by using a softmax layer that normalizes a linear transformation of $\mathbf{S}^{L}$ to distributions of target-side words. To do this, we map $\mathbf{S}^{L}$ to an $n\times|V_{\mathrm{y}}|$ matrix $\mathbf{O}$ by

$$\mathbf{O} \;=\; \mathbf{S}^{L}\cdot\mathbf{W}_{\mathrm{o}} \qquad (10)$$

where $\mathbf{W}_{\mathrm{o}}\in\mathbb{R}^{d\times|V_{\mathrm{y}}|}$ is the parameter matrix of the linear transformation.

Then, the output of the Transformer decoder is given in the form

$$\begin{bmatrix}\operatorname{Pr}(\cdot|y_{0},\mathbf{x})\\ \vdots\\ \operatorname{Pr}(\cdot|y_{0}...y_{n-1},\mathbf{x})\end{bmatrix} \;=\; \mathrm{Softmax}(\mathbf{O}) \;=\; \begin{bmatrix}\mathrm{Softmax}(\mathbf{o}_{1})\\ \vdots\\ \mathrm{Softmax}(\mathbf{o}_{n})\end{bmatrix} \qquad (11)$$

where $\mathbf{o}_{i}$ denotes the $i$-th row vector of $\mathbf{O}$, and $y_{0}$ denotes the start symbol $\langle\mathrm{SOS}\rangle$. Under this model, the probability of $\mathbf{y}$ given $\mathbf{x}$ can be defined as usual,

$$\log\operatorname{Pr}(\mathbf{y}|\mathbf{x}) \;=\; \sum_{i=1}^{n}\log\operatorname{Pr}(y_{i}|y_{0}...y_{i-1},\mathbf{x}) \qquad (12)$$

This equation resembles the general form of language modeling: we predict the word at time $i$ given all of the words up to time $i-1$. Therefore, the input of the Transformer decoder is shifted one word left, that is, the input is $y_{0}...y_{n-1}$ and the output is $y_{1}...y_{n}$.
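To make Eqs. (10)-(12) concrete, the following NumPy sketch maps a matrix of decoder output states to per-position word distributions and accumulates the log-probability of a given target sequence. The function names, the random parameters, and the toy sizes are our own choices for illustration only, not part of the original presentation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def output_distributions(S_L, W_o):
    """Eqs. (10)-(11): row i of the result is Pr(. | y_0 ... y_{i-1}, x)."""
    O = S_L @ W_o                               # n x |Vy| matrix of logits
    return softmax(O, axis=-1)

def sequence_log_prob(S_L, W_o, y):
    """Eq. (12): log Pr(y|x) for target word indices y = (y_1, ..., y_n)."""
    P = output_distributions(S_L, W_o)
    return float(np.sum(np.log(P[np.arange(len(y)), y])))

n, d, vocab = 4, 8, 10                          # toy sizes
S_L = np.random.randn(n, d)                     # stand-in for the last decoding layer's output
W_o = np.random.randn(d, vocab)
print(sequence_log_prob(S_L, W_o, y=[3, 1, 7, 2]))
```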

The Transformer architecture discussed above has several variants which have been successfully used in different fields of NLP. For example, we can use a Transformer encoder to represent texts (call it the encoder-only architecture), can use a Transformer decoder to generate texts (call it the decoder-only architecture), and can use a standard encoder-decoder Transformer model to transform an input sequence to an output sequence. In the rest of this chapter, most of the discussion is independent of the particular choice of application, and will be mostly focused on the encoder-decoder architecture. In Section 6, we will see applications of the encoder-only and decoder-only architectures.

2.2 Positional Encoding

In their original form, both FFNs and attention models used in Transformer ignore an important property of sequence modeling, which is that the order of the words plays a crucial role in expressing the meaning of a sequence. This means that the encoder and decoder are insensitive to the positional information of the input words. A simple approach to overcoming this problem is to add positional encoding to the representation of each word of the sequence. More formally, a word $x_{j}$ can be represented as a $d$-dimensional vector

$$\mathbf{xp}_{j} \;=\; \mathbf{x}_{j}+\mathrm{PE}(j) \qquad (13)$$

Here $\mathbf{x}_{j}\in\mathbb{R}^{d}$ is the embedding of the word, which can be obtained by using word embedding models. $\mathrm{PE}(j)\in\mathbb{R}^{d}$ is the representation of the position $j$. The vanilla Transformer employs the sinusoidal positional encoding model, which we write in the form

$$\mathrm{PE}(i,2k) \;=\; \sin\Big(i\cdot\frac{1}{10000^{2k/d}}\Big) \qquad (14)$$
$$\mathrm{PE}(i,2k+1) \;=\; \cos\Big(i\cdot\frac{1}{10000^{2k/d}}\Big) \qquad (15)$$

where $\mathrm{PE}(i,k)$ denotes the $k$-th entry of $\mathrm{PE}(i)$. The idea of positional encoding is to distinguish different positions using continuous systems. Here we use the sine and cosine functions with different frequencies. The interested reader can refer to Appendix A to see that such a method can be interpreted as a carrying system. Because the encoding is based on individual positions, it is also called absolute positional encoding. In Section 4.1 we will see an improvement to this method.
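As a concrete illustration of Eqs. (13)-(15), here is a minimal NumPy sketch that builds the sinusoidal encodings and adds them to a matrix of word embeddings. The function name and the toy sizes are our own choices (and the sketch assumes an even $d$); it is not part of the original presentation.

```python
import numpy as np

def sinusoidal_positional_encoding(m, d):
    """Return an m x d matrix whose j-th row is PE(j), Eqs. (14)-(15)."""
    pe = np.zeros((m, d))
    pos = np.arange(m)[:, None]                    # positions i = 0 ... m-1
    two_k = np.arange(0, d, 2)[None, :]            # even indices 2k = 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_k / d)    # i * 1 / 10000^(2k/d)
    pe[:, 0::2] = np.sin(angles)                   # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)                   # PE(i, 2k+1)
    return pe

# Eq. (13): the encoder input is word embeddings plus positional encodings.
m, d = 6, 8                                        # toy sizes
word_emb = np.random.randn(m, d)                   # stand-in for learned word embeddings
H0 = word_emb + sinusoidal_positional_encoding(m, d)
```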

Once we have the above embedding result, $\mathbf{xp}_{1}...\mathbf{xp}_{m}$ is taken as the input to the Transformer encoder, that is,

$$\mathbf{H}^{0} \;=\; \begin{bmatrix}\mathbf{xp}_{1}\\ \vdots\\ \mathbf{xp}_{m}\end{bmatrix} \qquad (16)$$

Similarly, we can also define the input on the decoder side.

2.3 Multi-head Self-attention

The use of self-attention is perhaps one of the most significant advances in sequence-to-sequence models. It attempts to learn and make use of direct interactions between each pair of inputs. From a representation learning perspective, self-attention models assume that the learned representation at position $i$ (denoted by $\mathbf{c}_{i}$) is a weighted sum of the inputs over the sequence. The output $\mathbf{c}_{i}$ is thus given by

$$\mathbf{c}_{i} \;=\; \sum_{j=1}^{m}\alpha_{i,j}\mathbf{h}_{j} \qquad (17)$$

where $\alpha_{i,j}$ indicates how strongly the input $\mathbf{h}_{i}$ is correlated with the input $\mathbf{h}_{j}$. We can thus view $\mathbf{c}_{i}$ as a representation of the global context at position $i$. $\alpha_{i,j}$ can be defined in different ways if one considers different attention models. Here we use the scaled dot-product attention function to compute $\alpha_{i,j}$, as follows

$$\alpha_{i,j} \;=\; \mathrm{Softmax}(\mathbf{h}_{i}\mathbf{h}_{j}^{\mathrm{T}}/\beta) \;=\; \frac{\exp(\mathbf{h}_{i}\mathbf{h}_{j}^{\mathrm{T}}/\beta)}{\sum_{k=1}^{m}\exp(\mathbf{h}_{i}\mathbf{h}_{k}^{\mathrm{T}}/\beta)} \qquad (18)$$

where $\beta$ is a scaling factor and is set to $\sqrt{d}$.

Compared with conventional recurrent and convolutional models, an advantage of self-attention models is that they shorten the computational “distance” between two inputs. Figure 2 illustrates the information flow in these models. We see that, given the input at position $i$, self-attention models can directly access any other input. By contrast, recurrent and convolutional models might need two or more jumps to see the whole sequence.

Figure 2: Information flows in recurrent, convolutional and self-attention models, shown as arrow lines between positions.

We can have a more general view of self-attention by using the QKV attention model. Suppose we have a sequence of $\kappa$ queries $\mathbf{Q}=\begin{bmatrix}\mathbf{q}_{1}\\ \vdots\\ \mathbf{q}_{\kappa}\end{bmatrix}$, and a sequence of $\psi$ key-value pairs $\Big(\mathbf{K}=\begin{bmatrix}\mathbf{k}_{1}\\ \vdots\\ \mathbf{k}_{\psi}\end{bmatrix},\ \mathbf{V}=\begin{bmatrix}\mathbf{v}_{1}\\ \vdots\\ \mathbf{v}_{\psi}\end{bmatrix}\Big)$. The output of the model is a sequence of vectors, each corresponding to a query. The form of the QKV attention is given by

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{V} \qquad (19)$$

We can write the output of the QKV attention model as a sequence of row vectors

$$\mathbf{C} \;=\; \begin{bmatrix}\mathbf{c}_{1}\\ \vdots\\ \mathbf{c}_{\kappa}\end{bmatrix} \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) \qquad (20)$$

To apply this equation to self-attention, we simply have

$$\mathbf{H}^{q} \;=\; \mathbf{H}\mathbf{W}^{q} \qquad (21)$$
$$\mathbf{H}^{k} \;=\; \mathbf{H}\mathbf{W}^{k} \qquad (22)$$
$$\mathbf{H}^{v} \;=\; \mathbf{H}\mathbf{W}^{v} \qquad (23)$$

where $\mathbf{W}^{q},\mathbf{W}^{k},\mathbf{W}^{v}\in\mathbb{R}^{d\times d}$ represent linear transformations of $\mathbf{H}$.

By considering Eq. (1), we then obtain

$$\mathbf{C} \;=\; \mathrm{Att}_{\mathrm{self}}(\mathbf{H}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{H}^{q},\mathbf{H}^{k},\mathbf{H}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}^{v} \qquad (24)$$

Here $\mathrm{Softmax}\big(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\big)$ is an $m\times m$ matrix in which each row represents a distribution over $\{\mathbf{h}_{1},...,\mathbf{h}_{m}\}$, that is

$$\textrm{row}\ i \;=\; \begin{bmatrix}\alpha_{i,1}&...&\alpha_{i,m}\end{bmatrix} \qquad (25)$$
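The following NumPy sketch implements the scaled dot-product attention of Eq. (19) and the single-head self-attention of Eqs. (21)-(24). The randomly initialized parameter matrices and the toy sizes are only for illustration and are not taken from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    """Eq. (19): Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # kappa x psi alignment scores
    return softmax(scores, axis=-1) @ V

def self_attention(H, W_q, W_k, W_v):
    """Eqs. (21)-(24): single-head self-attention over H."""
    return qkv_attention(H @ W_q, H @ W_k, H @ W_v)

m, d = 5, 16                                       # toy sizes
H = np.random.randn(m, d)
W_q, W_k, W_v = [np.random.randn(d, d) for _ in range(3)]
C = self_attention(H, W_q, W_k, W_v)               # same shape as H: m x d
```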

We can improve the above self-attention model by using a technique called multi-head attention. This method can be motivated from the perspective of learning from multiple lower-dimensional feature sub-spaces: we project a feature vector onto multiple sub-spaces and learn feature mappings on the individual sub-spaces. Specifically, we project the whole input space into $\tau$ sub-spaces (call them heads); for example, we transform $\mathbf{H}\in\mathbb{R}^{m\times d}$ into $\tau$ matrices of size $m\times\frac{d}{\tau}$, denoted by $\{\mathbf{H}_{1}^{\mathrm{head}},...,\mathbf{H}_{\tau}^{\mathrm{head}}\}$. The attention model is then run $\tau$ times, each time on a head. Finally, the outputs of these model runs are concatenated, and transformed by a linear projection. This procedure can be expressed by

$$\mathbf{C} \;=\; \mathrm{Merge}(\mathbf{C}_{1}^{\mathrm{head}},...,\mathbf{C}_{\tau}^{\mathrm{head}})\,\mathbf{W}_{c} \qquad (26)$$

For each head $h$,

$$\mathbf{C}_{h}^{\mathrm{head}} \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{H}_{h}^{q}[\mathbf{H}_{h}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}_{h}^{v} \qquad (28)$$
$$\mathbf{H}_{h}^{q} \;=\; \mathbf{H}\mathbf{W}_{h}^{q} \qquad (29)$$
$$\mathbf{H}_{h}^{k} \;=\; \mathbf{H}\mathbf{W}_{h}^{k} \qquad (30)$$
$$\mathbf{H}_{h}^{v} \;=\; \mathbf{H}\mathbf{W}_{h}^{v} \qquad (31)$$

Here $\mathrm{Merge}(\cdot)$ is the concatenation function, and $\mathrm{Att}_{\mathrm{qkv}}(\cdot)$ is the attention function described in Eq. (20). $\mathbf{W}_{h}^{q},\mathbf{W}_{h}^{k},\mathbf{W}_{h}^{v}\in\mathbb{R}^{d\times\frac{d}{\tau}}$ are the parameters of the projections from a $d$-dimensional space to a $\frac{d}{\tau}$-dimensional space for the queries, keys, and values. Thus, $\mathbf{H}_{h}^{q}$, $\mathbf{H}_{h}^{k}$, $\mathbf{H}_{h}^{v}$, and $\mathbf{C}_{h}^{\mathrm{head}}$ are all $m\times\frac{d}{\tau}$ matrices. $\mathrm{Merge}(\mathbf{C}_{1}^{\mathrm{head}},...,\mathbf{C}_{\tau}^{\mathrm{head}})$ produces an $m\times d$ matrix. It is then transformed by a linear mapping $\mathbf{W}_{c}\in\mathbb{R}^{d\times d}$, leading to the final result $\mathbf{C}\in\mathbb{R}^{m\times d}$.

While the notation here seems somewhat tedious, it is convenient to implement multi-head models using various deep learning toolkits. A common method in Transformer-based systems is to store inputs from all the heads in data structures called tensors, so that we can make use of parallel computing resources to have efficient systems.
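Continuing the sketch above, a head-by-head (loop-based) version of Eqs. (26)-(31) might look as follows. The helper qkv_attention, the arrays H, W_q, W_k, W_v, and the size d are the ones defined in the previous sketch; in practice the heads are computed in parallel with tensor operations, and the per-head scores are usually scaled by $\sqrt{d/\tau}$ rather than the $\sqrt{d}$ written in Eq. (28), which is what the helper below does.

```python
def multi_head_self_attention(H, W_q, W_k, W_v, W_c, tau):
    """Eqs. (26)-(31), written as an explicit loop over heads.

    W_q, W_k, W_v are d x d matrices; taking column block h of each is
    equivalent to using tau separate d x (d/tau) projections W_h^q, W_h^k, W_h^v.
    """
    m, d = H.shape
    dk = d // tau                                   # per-head dimensionality d/tau
    heads = []
    for h in range(tau):
        cols = slice(h * dk, (h + 1) * dk)          # columns belonging to head h
        heads.append(qkv_attention(H @ W_q[:, cols],
                                   H @ W_k[:, cols],
                                   H @ W_v[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_c     # Merge(...) followed by W_c

W_c = np.random.randn(d, d)
C_multi = multi_head_self_attention(H, W_q, W_k, W_v, W_c, tau=4)   # m x d
```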

2.4 Layer Normalization

Layer normalization provides a simple and effective means to make the training of neural networks more stable by standardizing the activations of the hidden layers in a layer-wise manner. As introduced in Ba et al. (2016)’s work, given a layer’s output $\mathbf{h}\in\mathbb{R}^{d}$, the layer normalization method computes a standardized output $\mathrm{LNorm}(\mathbf{h})\in\mathbb{R}^{d}$ by

$$\mathrm{LNorm}(\mathbf{h}) \;=\; \mathbf{g}\odot\frac{\mathbf{h}-\mathbf{\mu}}{\sigma+\epsilon}+\mathbf{b} \qquad (32)$$

Here $\mathbf{\mu}\in\mathbb{R}^{d}$ and $\sigma\in\mathbb{R}^{d}$ are the mean and standard deviation of the activations. Let $h_{k}$ be the $k$-th dimension of $\mathbf{h}$. $\mathbf{\mu}$ and $\sigma$ are given by

$$\mu \;=\; \frac{1}{d}\cdot\sum_{k=1}^{d}h_{k} \qquad (33)$$
$$\sigma \;=\; \sqrt{\frac{1}{d}\cdot\sum_{k=1}^{d}(h_{k}-\mu)^{2}} \qquad (34)$$

Here $\mathbf{g}\in\mathbb{R}^{d}$ and $\mathbf{b}\in\mathbb{R}^{d}$ are the rescaling and bias terms. They can be treated as parameters of layer normalization, whose values are to be learned together with other parameters of the Transformer model. The addition of $\epsilon$ to $\sigma$ is used for the purpose of numerical stability. In general, $\epsilon$ is chosen to be a small number.

We illustrate the layer normalization method for the hidden states of an encoder in the following example (assume that $m=4$, $d=3$, $\mathbf{g}=\mathbf{1}$, $\mathbf{b}=\mathbf{0}$, and $\epsilon=0.1$).

$$\begin{matrix}\mathbf{h}_{1}\\ \mathbf{h}_{2}\\ \mathbf{h}_{3}\\ \mathbf{h}_{4}\end{matrix}\begin{bmatrix}1&1&2\\ 0.9&0.9&0\\ 0.7&0.8&0\\ 3&1&7\end{bmatrix}\quad\begin{matrix}\mu=1.3,\ \sigma=0.5\\ \mu=0.6,\ \sigma=0.4\\ \mu=0.5,\ \sigma=0.4\\ \mu=3.7,\ \sigma=2.5\end{matrix}\quad\implies\quad\begin{bmatrix}\frac{1-1.3}{0.5+0.1}&\frac{1-1.3}{0.5+0.1}&\frac{2-1.3}{0.5+0.1}\\ \frac{0.9-0.6}{0.4+0.1}&\frac{0.9-0.6}{0.4+0.1}&\frac{0-0.6}{0.4+0.1}\\ \frac{0.7-0.5}{0.4+0.1}&\frac{0.8-0.5}{0.4+0.1}&\frac{0-0.5}{0.4+0.1}\\ \frac{3-3.7}{2.5+0.1}&\frac{1-3.7}{2.5+0.1}&\frac{7-3.7}{2.5+0.1}\end{bmatrix}$$
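A NumPy sketch of Eqs. (32)-(34), applied row by row to the hidden states, reproduces the example above (up to the rounding of $\mu$ and $\sigma$ to one decimal place used in the illustration). The function name is our own, and the unusually large $\epsilon=0.1$ simply follows the example.

```python
import numpy as np

def layer_norm(H, g, b, eps):
    """Eqs. (32)-(34): standardize each row of H, then rescale by g and shift by b."""
    mu = H.mean(axis=-1, keepdims=True)        # per-position mean, Eq. (33)
    sigma = H.std(axis=-1, keepdims=True)      # per-position standard deviation, Eq. (34)
    return g * (H - mu) / (sigma + eps) + b

H = np.array([[1.0, 1.0, 2.0],
              [0.9, 0.9, 0.0],
              [0.7, 0.8, 0.0],
              [3.0, 1.0, 7.0]])
print(layer_norm(H, g=np.ones(3), b=np.zeros(3), eps=0.1))
```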

As discussed in Section 2.1, the layer normalization unit in each sub-layer is used to standardize the output of a residual block. Here we describe a more general formulation for this structure. Suppose that $F(\cdot)$ is a neural network we want to run. Then, the post-norm structure of $F(\cdot)$ is given by

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathrm{LNorm}(F(\mathbf{H}_{\mathrm{in}})+\mathbf{H}_{\mathrm{in}}) \qquad (35)$$

where $\mathbf{H}_{\mathrm{in}}$ and $\mathbf{H}_{\mathrm{out}}$ are the input and output of this model. Clearly, Eq. (4) is an instance of this equation.

An alternative approach to introducing layer normalization and residual connections into modeling is to execute the $\mathrm{LNorm}(\cdot)$ function right after the $F(\cdot)$ function, and to establish an identity mapping from the input to the output of the entire sub-layer. This structure, known as the pre-norm structure, can be expressed in the form

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathrm{LNorm}(F(\mathbf{H}_{\mathrm{in}}))+\mathbf{H}_{\mathrm{in}} \qquad (36)$$

Both post-norm and pre-norm Transformer models are widely used in NLP systems. See Figure 3 for a comparison of these two structures. In general, residual connections are considered an effective means to make the training of multi-layer neural networks easier. In this sense, pre-norm Transformer seems promising because it follows the convention that a residual connection is created to bypass the whole network and that the identity mapping from the input to the output leads to easier optimization of deep models. However, by considering the expressive power of a model, there may be modeling advantages in using post-norm Transformer because it does not so much rely on residual connections and enforces more sophisticated modeling for representation learning. In Section 4.2, we will see a discussion on this issue.
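The two residual structures of Eqs. (35)-(36) can be written side by side as a small sketch, reusing the layer_norm helper from the sketch above; here F stands for any core function such as self-attention or the FFN, and the default $\epsilon$ is our own choice.

```python
def post_norm_sublayer(H_in, F, g, b, eps=1e-6):
    """Eq. (35): H_out = LNorm(F(H_in) + H_in)."""
    return layer_norm(F(H_in) + H_in, g, b, eps)

def pre_norm_sublayer(H_in, F, g, b, eps=1e-6):
    """Eq. (36): H_out = LNorm(F(H_in)) + H_in."""
    return layer_norm(F(H_in), g, b, eps) + H_in
```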

Figure 3: The post-norm and pre-norm structures. $F(\cdot)=$ core function, $\mathrm{LNorm}(\cdot)=$ layer normalization, and $\oplus=$ residual connection.

2.5 Feed-forward Neural Networks

The use of FFNs in Transformer is inspired in part by the fact that complex outputs can be formed by transforming the inputs through nonlinearities. While the self-attention model itself has some nonlinearity (in $\mathrm{Softmax}(\cdot)$), a more common way to do this is to consider additional layers with non-linear activation functions and linear transformations. Given an input $\mathbf{H}_{\mathrm{in}}\in\mathbb{R}^{m\times d}$ and an output $\mathbf{H}_{\mathrm{out}}\in\mathbb{R}^{m\times d}$, the $\mathbf{H}_{\mathrm{out}}=\mathrm{FFN}(\mathbf{H}_{\mathrm{in}})$ function in Transformer has the following form

$$\mathbf{H}_{\mathrm{out}} \;=\; \mathbf{H}_{\mathrm{hidden}}\mathbf{W}_{f}+\mathbf{b}_{f} \qquad (37)$$
$$\mathbf{H}_{\mathrm{hidden}} \;=\; \mathrm{ReLU}(\mathbf{H}_{\mathrm{in}}\mathbf{W}_{h}+\mathbf{b}_{h}) \qquad (38)$$

where $\mathbf{H}_{\mathrm{hidden}}\in\mathbb{R}^{m\times d_{\mathrm{ffn}}}$ is the matrix of hidden states, and $\mathbf{W}_{h}\in\mathbb{R}^{d\times d_{\mathrm{ffn}}}$, $\mathbf{b}_{h}\in\mathbb{R}^{d_{\mathrm{ffn}}}$, $\mathbf{W}_{f}\in\mathbb{R}^{d_{\mathrm{ffn}}\times d}$ and $\mathbf{b}_{f}\in\mathbb{R}^{d}$ are the parameters. This is a two-layer FFN in which the first layer (or hidden layer) introduces a nonlinearity through $\mathrm{ReLU}(\cdot)$ (where $\mathrm{ReLU}(x)=\max\{0,x\}$), and the second layer involves only a linear transformation. It is common practice in Transformer to use a larger size for the hidden layer. For example, a common choice is $d_{\mathrm{ffn}}=4d$, that is, the size of each hidden representation is 4 times as large as the input.
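A direct NumPy transcription of Eqs. (37)-(38) is given below; the randomly initialized parameters and the sizes $d=16$, $d_{\mathrm{ffn}}=64$ are illustrative assumptions, not values from the text.

```python
import numpy as np

def ffn(H_in, W_h, b_h, W_f, b_f):
    """Eqs. (37)-(38): a two-layer FFN with a ReLU between the linear maps."""
    H_hidden = np.maximum(0.0, H_in @ W_h + b_h)   # ReLU(H_in W_h + b_h), shape m x d_ffn
    return H_hidden @ W_f + b_f                     # back to shape m x d

m, d, d_ffn = 5, 16, 64                             # d_ffn = 4d, a common choice
W_h, b_h = np.random.randn(d, d_ffn), np.zeros(d_ffn)
W_f, b_f = np.random.randn(d_ffn, d), np.zeros(d)
H_out = ffn(np.random.randn(m, d), W_h, b_h, W_f, b_f)
```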

Note that using a wide FFN sub-layer has proven to be of great practical value in many state-of-the-art systems. However, a consequence of this is that the model size is dominated by the parameters of the FFN. Table 1 shows parameter counts and time complexities for different modules of a standard Transformer system. We see that FFNs dominate the model size when $d_{\mathrm{ffn}}$ is large, though they are not the most time-consuming components. In the case of very big Transformer models, we therefore wish to address this problem when building efficient systems.

| Sub-model | # of Parameters | Time Complexity | $\times$ |
|---|---|---|---|
| Encoder: Multi-head Self-attention | $4d^{2}$ | $O(m^{2}\cdot d)$ | $L$ |
| Encoder: Feed-forward Network | $2d\cdot d_{\mathrm{ffn}}+d+d_{\mathrm{ffn}}$ | $O(m\cdot d\cdot d_{\mathrm{ffn}})$ | $L$ |
| Encoder: Layer Normalization | $2d$ | $O(d)$ | $2L$ |
| Decoder: Multi-head Self-attention | $4d^{2}$ | $O(n^{2}\cdot d)$ | $L$ |
| Decoder: Multi-head Cross-attention | $4d^{2}$ | $O(m\cdot n\cdot d)$ | $L$ |
| Decoder: Feed-forward Network | $2d\cdot d_{\mathrm{ffn}}+d+d_{\mathrm{ffn}}$ | $O(n\cdot d\cdot d_{\mathrm{ffn}})$ | $L$ |
| Decoder: Layer Normalization | $2d$ | $O(d)$ | $3L$ |

Table 1: Numbers of parameters and time complexities of different Transformer modules under different setups. $m=$ source-sequence length, $n=$ target-sequence length, $d=$ default number of dimensions of a hidden layer, $d_{\mathrm{ffn}}=$ number of dimensions of the FFN hidden layer, $\tau=$ number of heads in the attention models, and $L=$ number of encoding or decoding layers. The column $\times$ gives the number of times a sub-model is applied on the encoder or decoder side. The time complexities are estimated by counting the number of multiplications of floating-point numbers.

2.6 Attention Models on the Decoder Side

A decoder layer involves two attention sub-layers, the first of which is a self-attention sub-layer, and the second is a cross-attention sub-layer. These sub-layers are based on either the post-norm or the pre-norm structure, but differ in the design of the attention functions. Consider, for example, the post-norm structure, described in Eq. (35). We can define the cross-attention and self-attention sub-layers for a decoding layer to be

$$\mathbf{S}_{\mathrm{cross}} \;=\; \mathrm{Layer}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}}) \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}})+\mathbf{S}_{\mathrm{self}}) \qquad (39)$$
$$\mathbf{S}_{\mathrm{self}} \;=\; \mathrm{Layer}_{\mathrm{self}}(\mathbf{S}) \;=\; \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{S})+\mathbf{S}) \qquad (40)$$

where $\mathbf{S}\in\mathbb{R}^{n\times d}$ is the input of the self-attention sub-layer, $\mathbf{S}_{\mathrm{cross}}\in\mathbb{R}^{n\times d}$ and $\mathbf{S}_{\mathrm{self}}\in\mathbb{R}^{n\times d}$ are the outputs of the sub-layers, and $\mathbf{H}_{\mathrm{enc}}\in\mathbb{R}^{m\times d}$ is the output of the encoder (for an encoder having $L$ encoder layers, $\mathbf{H}_{\mathrm{enc}}=\mathbf{H}^{L}$).

As with conventional attention models, cross-attention is primarily used to model the correspondence between the source-side and target-side sequences. The $\mathrm{Att}_{\mathrm{cross}}(\cdot)$ function is based on the QKV attention model which generates the result of querying a collection of key-value pairs. More specifically, we define the queries, keys and values as linear mappings of $\mathbf{S}_{\mathrm{self}}$ and $\mathbf{H}_{\mathrm{enc}}$, as follows

$$\mathbf{S}_{\mathrm{self}}^{q} \;=\; \mathbf{S}_{\mathrm{self}}\mathbf{W}_{\mathrm{cross}}^{q} \qquad (41)$$
$$\mathbf{H}_{\mathrm{enc}}^{k} \;=\; \mathbf{H}_{\mathrm{enc}}\mathbf{W}_{\mathrm{enc}}^{k} \qquad (42)$$
$$\mathbf{H}_{\mathrm{enc}}^{v} \;=\; \mathbf{H}_{\mathrm{enc}}\mathbf{W}_{\mathrm{enc}}^{v} \qquad (43)$$

where $\mathbf{W}_{\mathrm{cross}}^{q},\mathbf{W}_{\mathrm{enc}}^{k},\mathbf{W}_{\mathrm{enc}}^{v}\in\mathbb{R}^{d\times d}$ are the parameters of the mappings. In other words, the queries are defined based on $\mathbf{S}_{\mathrm{self}}$, and the keys and values are defined based on $\mathbf{H}_{\mathrm{enc}}$.

$\mathrm{Att}_{\mathrm{cross}}(\cdot)$ is then defined as

$$\mathrm{Att}_{\mathrm{cross}}(\mathbf{H}_{\mathrm{enc}},\mathbf{S}_{\mathrm{self}}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{S}_{\mathrm{self}}^{q},\mathbf{H}_{\mathrm{enc}}^{k},\mathbf{H}_{\mathrm{enc}}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{S}_{\mathrm{self}}^{q}[\mathbf{H}_{\mathrm{enc}}^{k}]^{\mathrm{T}}}{\sqrt{d}}\Big)\mathbf{H}_{\mathrm{enc}}^{v} \qquad (44)$$

The $\mathrm{Att}_{\mathrm{self}}(\cdot)$ function has a similar form to $\mathrm{Att}_{\mathrm{cross}}(\cdot)$, with linear mappings of $\mathbf{S}$ taken as the queries, keys, and values, like this

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{S}) \;=\; \mathrm{Att}_{\mathrm{qkv}}(\mathbf{S}^{q},\mathbf{S}^{k},\mathbf{S}^{v}) \;=\; \mathrm{Softmax}\Big(\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}\Big)\mathbf{S}^{v} \qquad (45)$$

where $\mathbf{S}^{q}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{q}$, $\mathbf{S}^{k}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{k}$, and $\mathbf{S}^{v}=\mathbf{S}\mathbf{W}_{\mathrm{dec}}^{v}$ are linear mappings of $\mathbf{S}$ with parameters $\mathbf{W}_{\mathrm{dec}}^{q},\mathbf{W}_{\mathrm{dec}}^{k},\mathbf{W}_{\mathrm{dec}}^{v}\in\mathbb{R}^{d\times d}$.

This form is similar to that of Eq. (20). A difference compared to self-attention on the encoder side, however, is that the model here needs to follow the rule of left-to-right generation (see Figure 2). That is, given a target-side word at position $i$, we can see only the target-side words in the left context $y_{1}...y_{i-1}$. To do this, we add a masking variable $\mathbf{M}$ to the unnormalized weight matrix, giving $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$. Both $\mathbf{M}$ and $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$ are of size $n\times n$, and a lower value of an entry of $\mathbf{M}$ means a larger bias towards a lower alignment score for the corresponding entry of $\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}$. In order to avoid access to the right context at position $i$, $\mathbf{M}$ is defined to be

$$M(i,k) \;=\; \begin{cases}0 & k\leq i\\ -\infty & k>i\end{cases} \qquad (46)$$
Table 2: Self-attention on the encoder and decoder sides. Each line connects an input and an output of the self-attention model, indicating a dependency of an output state on an input state. For encoder self-attention, the output at any position is computed by having access to the entire sequence. By contrast, for decoder self-attention, the output at position $i$ is computed by seeing only the inputs at positions up to $i$.

where $M(i,k)$ indicates a bias term for the alignment score between positions $i$ and $k$. Below we show an example of how the masking variable is applied (assume $n=4$).

$$\mathrm{Softmax}\Big(\frac{\mathbf{S}^{q}[\mathbf{S}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M}\Big) \;=\; \mathrm{Softmax}\Big(\begin{bmatrix}2&0.1&1&1\\ 0&0.9&0.9&0.9\\ 0.2&0.8&0.7&2\\ 0.3&1&0.3&3\end{bmatrix}+\begin{bmatrix}0&-\infty&-\infty&-\infty\\ 0&0&-\infty&-\infty\\ 0&0&0&-\infty\\ 0&0&0&0\end{bmatrix}\Big)$$
$$=\; \mathrm{Softmax}\Big(\begin{bmatrix}2&-\infty&-\infty&-\infty\\ 0&0.9&-\infty&-\infty\\ 0.2&0.8&0.7&-\infty\\ 0.3&1&0.3&3\end{bmatrix}\Big) \;=\; \begin{bmatrix}1&0&0&0\\ 0.3&0.7&0&0\\ 0.2&0.4&0.4&0\\ 0.05&0.1&0.05&0.8\end{bmatrix} \qquad (47)$$
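The masking variable of Eq. (46) and the masked decoder self-attention of Eq. (45) can be sketched in NumPy as follows (single head, randomly initialized parameters of our own choosing). A softmax over rows containing $-\infty$ simply assigns zero weight to the masked positions, as in the example above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    """Eq. (46): 0 where k <= i, -inf where k > i (the strict upper triangle)."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def masked_self_attention(S, W_q, W_k, W_v):
    """Eq. (45): decoder self-attention with the causal mask added to the scores."""
    d = S.shape[-1]
    Q, K, V = S @ W_q, S @ W_k, S @ W_v
    scores = Q @ K.T / np.sqrt(d) + causal_mask(S.shape[0])
    return softmax(scores, axis=-1) @ V

n, d = 4, 8
S = np.random.randn(n, d)
W_q, W_k, W_v = [np.random.randn(d, d) for _ in range(3)]
print(masked_self_attention(S, W_q, W_k, W_v))
```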

As noted in Section 2.3, it is easy to improve these models by using the multi-head attention mechanism. Also, since decoders are typically the most time-consuming part of practical systems, the bulk of the computational effort in running these systems is very much concerned with the efficiency of the attention modules on the decoder side.

2.7 Training and Inference

Transformers can be trained and used in a regular way. For example, we can train a Transformer model by performing gradient descent to minimize some loss function on the training data, and test the trained model by performing beam search on the unseen data. Below we present some of the techniques that are typically used in the training and inference of Transformer models.

  • Learning Rate Scheduling. As with standard neural networks, Transformers can be directly trained using back-propagation. The training process is generally iterated many times to make the models fit the training data well. In each training step, we update the weights of the neural networks by moving them a small step in the direction of the negative gradients of the errors. There are many ways to design the update rule of training. A popular choice is to use the Adam optimization method (Kingma and Ba, 2014). To adjust the learning rate during training, Vaswani et al. (2017) present a learning rate scheduling strategy which increases the learning rate linearly for a number of steps and then decays it gradually (a code sketch of this schedule is given at the end of this section). They design a learning rate of the form

    $$lr \;=\; lr_{0}\cdot\min\left\{n_{\mathrm{step}}^{-0.5},\ n_{\mathrm{step}}\cdot(n_{\mathrm{warmup}})^{-1.5}\right\} \qquad (48)$$


    where $lr_{0}$ denotes the initial learning rate, $n_{\mathrm{step}}$ denotes the number of training steps we have executed, and $n_{\mathrm{warmup}}$ denotes the number of warmup steps. In the first $n_{\mathrm{warmup}}$ steps, the learning rate $lr$ grows larger as training proceeds. It reaches its highest value at the point $n_{\mathrm{step}}=n_{\mathrm{warmup}}$, and then decreases as an inverse square root function (i.e., $lr_{0}\cdot n_{\mathrm{step}}^{-0.5}$).

  • Batching and Padding. To make a trade-off between global optimization and training convergence, it is common to update the weights each time on a relatively small collection of samples, called a minibatch of samples. Therefore, we can consider a batch version of the forward and backward computation processes in which the whole minibatch is used together to obtain the gradient information. One advantage of batching is that it allows the system to make use of efficient tensor operations to deal with multiple sequences in a single run. This requires that all the input sequences in a minibatch are stored in a single memory block, so that they can be read in and processed together. To illustrate this idea, consider a minibatch containing four samples whose source-sides are


    A B C D E F
    M N
    R S T
    W X Y Z

    We can store these sequences in a 4\times 6 contiguous block where each “row” represents a sequence, like this

    A B C D E F
    M N \square \square \square \square
    R S T \square \square \square
    W X Y Z \square \square

    Here padding words \square are inserted between sequences, so that these sequences are aligned in the memory. Typically, we do not want padding to affect the operation of the system, and so we can simply define \square as a zero vector (call it zero padding). On the other hand, in some cases we are interested in using padding to describe something that is not covered by the input sequences. For example, we can replace padding words with the words in the left (or right) context of a sequence, though this may require modifications to the system to ensure that the newly added context words do not cause additional content to appear in the output. A minimal padding sketch is also given after this list.

  • Search and Caching. At test time, we need to search the space of candidate hypotheses (or candidate target-side sequences) to identify the hypothesis (or target-side sequence) with the highest score.

    \hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\ \mathrm{score}(\mathbf{x},\mathbf{y}) (49)


    where \mathrm{score}(\mathbf{x},\mathbf{y}) is the model score of the target-side sequence \mathbf{y} given the source-side sequence \mathbf{x}. While there are many search algorithms to achieve this, most of them share a similar structure: the search program operates by extending the candidate target-side sequences in a pool, one position at a time. The resulting algorithm can thus be viewed as a left-to-right generation procedure. Note that all designs of \mathrm{score}(\mathbf{x},\mathbf{y}), no matter how complex, are based on computing \mathrm{Pr}(\mathbf{y}|\mathbf{x}). Because the attention models used in Transformers require computing the dot-product of each pair of input vectors of a layer, the time complexity of the search algorithm is a quadratic function of the length of \mathbf{y}. It is therefore inefficient to repeatedly compute the outputs of the attention models for positions that have already been processed. This problem can be addressed by caching the states of each layer for the words we have seen. Figure 4 illustrates the use of the caching mechanism in a search step. All the states for positions <i are maintained and easily accessed in a cache. At position i, all we need is to compute the states for the newly added word, and then to update the cache. A minimal sketch of such a cache is given after this list.

    Figure 4: Illustration of the caching mechanism in Transformer decoders. Rectangles indicate the states of decoding layers or sub-layers. At step i, all the states at previous steps are stored in a cache (see dotted boxes), and we only need to compute the states for this step (see blue rectangles and arrows). Then, we add the newly generated states to the cache, and move on to step i+1.
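To make these techniques more concrete, below are three minimal Python sketches: one for the learning rate schedule of Eq. (48), one for minibatch padding, and one for the per-layer caching used in left-to-right decoding. They are illustrative only; all names (transformer_lr, pad_batch, LayerCache, and so on) are our own, and practical systems implement these ideas with batched tensor operations.

A sketch of the warmup-then-decay schedule of Eq. (48):

    def transformer_lr(n_step, lr0=1.0, n_warmup=4000):
        # Eq. (48): the learning rate grows linearly for n_warmup steps and then
        # decays as an inverse square root of the step number.
        n_step = max(n_step, 1)  # avoid a zero division at step 0
        return lr0 * min(n_step ** -0.5, n_step * n_warmup ** -1.5)

A sketch of right-padding a minibatch of token-id sequences into a rectangular block, together with a 0-1 mask that marks the real (non-padding) positions:

    def pad_batch(batch, pad_id=0):
        # Pad every sequence to the length of the longest sequence in the minibatch,
        # so that the whole minibatch can be stored in a single memory block.
        max_len = max(len(seq) for seq in batch)
        padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
        mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
        return padded, mask

A sketch of the caching mechanism of Figure 4 for a single decoding layer: the states of all positions <i are kept, and at step i we only append the states of the newly added word:

    class LayerCache:
        # Keeps the per-position states (e.g., keys and values) of one decoding layer.
        def __init__(self):
            self.keys, self.values = [], []

        def append(self, k, v):
            # Called once per step, with the states of the newly added word only.
            self.keys.append(k)
            self.values.append(v)

        def context(self):
            # All states attended to at the current step: positions < i plus i itself.
            return self.keys, self.values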

3 Syntax-aware Models

Although Transformer is simply a deep learning model that does not make use of any linguistic structure or assumption, it may be necessary to incorporate our prior knowledge into such systems. This is in part because NLP researchers have long believed that a higher level of abstraction of data is needed to develop ideal NLP systems, and there have been many systems that use structure as priors. However, structure is a wide-ranging topic, and there are several types of structure one may refer to (See, 2018). For example, the inductive biases used in our model design can be thought of as structural priors, while NLP models can also learn the underlying structure of problems by themselves. In this section we will discuss some of these issues, focusing on methods of introducing linguistic structure into Transformer models. As Transformers can be applied to many NLP tasks, which differ greatly in their input and output formats, we will primarily discuss modifications to Transformer encoders (call them syntax-aware Transformer encoders). Our discussion, however, is general, and the methods can be easily extended to Transformer decoders.

3.1 Syntax-aware Input and Output

One of the simplest methods of incorporating structure into NLP systems is to modify the input sequence, leaving the system unchanged. As a simple example, consider a sentence where each word x_{j} is assigned a set of \kappa syntactic labels \{\mathrm{tag}_{j}^{1},...,\mathrm{tag}_{j}^{\kappa}\} (e.g., POS labels and dependency labels). We can write these symbols together to define a new “word”

x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}

Then, the embedding of this word is given by

\mathbf{xp}_{j} = e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa})+\mathrm{PE}(j) (50)

where e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa})\in\mathbb{R}^{d} is the embedding of x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}. Since x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa} is a complex symbol, we decompose the learning problem of e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) into easier problems. For example, we can develop \kappa embedding models, each producing an embedding given a tag. Then, we write e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) as a sum of the word embedding and tag embeddings

e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) = \mathbf{x}_{j}+e(\mathrm{tag}_{j}^{1})+...+e(\mathrm{tag}_{j}^{\kappa}) (51)

where \{e(\mathrm{tag}_{j}^{1}),...,e(\mathrm{tag}_{j}^{\kappa})\} are the embeddings of the tags. Alternatively, we can combine these embeddings via a neural network in the form

e(x_{j}/\mathrm{tag}_{j}^{1}/.../\mathrm{tag}_{j}^{\kappa}) = \mathrm{FFN}_{\mathrm{embed}}(\mathbf{x}_{j},e(\mathrm{tag}_{j}^{1}),...,e(\mathrm{tag}_{j}^{\kappa})) (52)

where \mathrm{FFN}_{\mathrm{embed}}(\cdot) is a feed-forward neural network with one or two layers.
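As a rough illustration of Eqs. (50)-(51), the following sketch (with made-up table sizes and names, and random tables standing in for learned embeddings) sums the word embedding, the \kappa tag embeddings, and the positional encoding; the FFN-based combination of Eq. (52) would simply replace the sum.

    import numpy as np

    d, vocab_size, tag_vocab, kappa, max_len = 8, 100, 20, 2, 64
    rng = np.random.default_rng(0)
    word_emb = rng.normal(size=(vocab_size, d))        # e(x_j)
    tag_emb = rng.normal(size=(kappa, tag_vocab, d))   # e(tag_j^k), one table per tag type
    pos_enc = rng.normal(size=(max_len, d))            # PE(j), any positional encoding

    def syntax_augmented_input(word_id, tag_ids, j):
        # Eq. (51): word embedding plus the kappa tag embeddings ...
        e = word_emb[word_id] + sum(tag_emb[k][t] for k, t in enumerate(tag_ids))
        # ... and Eq. (50): add the positional encoding of position j.
        return e + pos_enc[j]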

We can do the same thing for sentences on the decoder side as well, and treat y_{i}/\mathrm{tag}_{i}^{1}/.../\mathrm{tag}_{i}^{\kappa} as a syntax-augmented word. However, this may lead to a much larger target-side vocabulary and poses a computational challenge for training and inference.

Another form that is commonly used to represent a sentence is the syntax tree. In linguistics, the syntax of a sentence can be interpreted in many different ways, resulting in various grammars and the corresponding tree (or graph)-based representations. While these representations differ in their syntactic forms, a general approach to using them in sequence modeling is tree linearization. Consider the following sentence annotated with a constituency-based parse tree

We can write this tree structure as a sequence of words, syntactic labels and brackets via a tree traversal algorithm, as follows

(S (NP (PRP It )_{\textrm{PRP}} )_{\textrm{NP}} (VP (VBZ 's )_{\textrm{VBZ}} (ADJP (JJ interesting )_{\textrm{JJ}} )_{\textrm{ADJP}} )_{\textrm{VP}} (. ! )_{\textrm{.}} )_{\textrm{S}}

This sequence of syntactic tokens can be used as an input to the system, that is, each token is represented by word and positional embeddings, and then the sum of these embeddings is treated as a regular input of the encoder. An example of the use of linearized trees is tree-to-string machine translation in which a syntax tree in one language is translated into a string in another language (Li et al., 2017; Currey and Heafield, 2018). Linearized trees can also be used for tree generation. For example, we can frame parsing tasks as sequence-to-sequence problems to map an input text to a sequential representation of its corresponding syntax tree (Vinyals et al., 2015; Choe and Charniak, 2016). See Figure 5 for illustrations of these models. It should be noted that the methods described here are not specific to Transformer but could be applied to many models, such as RNN-based models.

Figure 5: Illustration of tree linearization on either the encoder or decoder side. For tree-to-string machine translation, the encoder takes sequential representation of an input parse tree, and the decoder outputs the corresponding translation. For parsing, the encoder takes a sentence, and the decoder outputs the corresponding syntax tree.
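As a rough sketch of how such a linearization can be produced, the following recursive traversal (the nested-tuple tree format and the function name are our own choices, not from the paper) emits the bracketed token sequence shown above:

    def linearize(tree):
        # tree = (label, children); children is either a word (leaf) or a list of subtrees.
        label, children = tree
        if isinstance(children, str):                  # leaf: a single word
            return ["(" + label, children, ")" + label]
        tokens = ["(" + label]
        for child in children:
            tokens += linearize(child)
        tokens.append(")" + label)
        return tokens

    tree = ("S", [("NP", [("PRP", "It")]),
                  ("VP", [("VBZ", "'s"), ("ADJP", [("JJ", "interesting")])]),
                  (".", "!")])
    print(" ".join(linearize(tree)))
    # (S (NP (PRP It )PRP )NP (VP (VBZ 's )VBZ (ADJP (JJ interesting )JJ )ADJP )VP (. ! ). )S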

3.2 Syntax-aware Attention Models

For Transformer models, it also makes sense to make use of syntax trees to guide the process of learning sequence representations. In the previous section we saw how representations of a sequence can be computed by relating different positions within that sequence. This allows us to impose some structure on these relations, which are represented by distributions of attention weights over all the positions. To do this, we use the encoder self-attention with an additive mask

\mathrm{AttSyn}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}+\mathbf{M})\mathbf{H}^{v} (53)

or alternatively with a multiplicative mask

\mathrm{AttSyn}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}(\frac{\mathbf{H}^{q}[\mathbf{H}^{k}]^{\mathrm{T}}}{\sqrt{d}}\odot\mathbf{M})\mathbf{H}^{v} (54)

where \mathbf{M}\in\mathbb{R}^{m\times m} is a matrix of masking variables in which a larger value of M(i,j) indicates a stronger syntactic correlation between positions i and j. In the following description we choose Eq. (54) as the basic form.
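A minimal NumPy sketch of Eq. (54) follows; all names are illustrative, and the query/key/value matrices \mathbf{H}^{q}, \mathbf{H}^{k}, \mathbf{H}^{v} and the mask \mathbf{M} are assumed to be given as arrays.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def syntax_masked_attention(Hq, Hk, Hv, M):
        # Eq. (54): scale the m x m attention scores elementwise by the syntactic mask M.
        d = Hq.shape[-1]
        scores = (Hq @ Hk.T) / np.sqrt(d)
        return softmax(scores * M) @ Hv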

One common way to design \mathbf{M} is to project syntactic relations of the input tree structure into constraints over the sequence. Here we consider constituency parse trees and dependency parse trees for illustration. Generally, two types of masking methods are employed.

  • 0-1 Masking. This method assigns M(i,j) a value of 1 if the words at positions i and j are considered syntactically correlated and a value of 0 otherwise (Zhang et al., 2020; Bai et al., 2021). To model the relation between two words in a syntax tree, we can consider the distance between their corresponding nodes. One of the simplest forms is given by

    M(i,j) = \begin{cases}1&\omega(i,j)\leq\omega_{\mathrm{max}}\\ 0&\textrm{otherwise}\end{cases} (55)


    where \omega(i,j) is the length of the shortest path between the nodes of the words at positions i and j. For example, given a dependency parse tree, \omega(i,j) is the number of dependency edges in the path between the two words. For a constituency parse tree, all the words are leaf nodes, and so \omega(i,j) gives a tree distance between the two leaves in the same branch of the tree. \omega_{\mathrm{max}} is a parameter used to control the maximum distance between two nodes that can be considered syntactically correlated. For example, assuming that there is a dependency parse tree and \omega_{\mathrm{max}}=1, Eq. (55) enforces a constraint that the attention score between positions i and j is computed only if they have a parent-dependent relation.[4]
    [4] For multiplicative masks, M(i,j)=0 does not mean that the attention weight between j and i is zero, because the Softmax function does not give a zero output for a dimension whose corresponding input is of a zero value. A method to “mask” an entry of \mathrm{Softmax}(\frac{\mathbf{H}\mathbf{H}^{\mathrm{T}}}{\sqrt{d}}) is to use an additive mask and set M(i,j)=-\infty if \omega(i,j)>\omega_{\mathrm{max}}.

  • Soft Masking. Instead of treating \mathbf{M} as a hard constraint, we can use it as a soft constraint that scales the attention weight between positions i and j in terms of the degree to which the corresponding words are correlated. An idea is to reduce the attention weight as \omega(i,j) becomes larger. A very simple method to do this is to transform \omega(i,j) in some way that M(i,j) holds a negative correlation relationship with \omega(i,j) and its value falls into the interval [0,1]

    M(i,j) = \mathrm{DNorm}(\omega(i,j)) (56)


    There are several alternative designs for \mathrm{DNorm}(\cdot). For example, one can compute a standardized score of -\omega(i,j) by subtracting its mean and dividing by its standard deviation (Chen et al., 2018a), or can normalize 1/\omega(i,j) over all possible j in the sequence (Xu et al., 2021b). In cases where parsers can output a score between positions i and j, it is also possible to use this score to compute M(i,j). For example, a dependency parser can produce the probability of the word at position i being the parent of the word at position j (Strubell et al., 2018). We can then write M(i,j) as