Introduction to Transformers: an NLP Perspective
Abstract
Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.
1 Background
Transformers are a type of neural network (Vaswani et al., 2017). They were originally known for their strong performance in machine translation, and are now a de facto standard for building large-scale self-supervised learning systems (Devlin et al., 2019; Brown et al., 2020). The past few years have seen the rise of Transformers not only in natural language processing (NLP) but also in several other fields, such as computer vision and multi-modal processing. As Transformers continue to mature, these models are playing an increasingly important role in the research and application of artificial intelligence (AI).
Looking back at the history of neural networks, Transformers have not been around for a long time. While Transformers are “newcomers” in NLP, they were developed on top of several ideas, the origins of which can be traced back to earlier work, such as word embedding (Bengio et al., 2003; Mikolov et al., 2013) and attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015). As a result, Transformers can benefit from the advancements of different sub-fields of deep learning, and provide an elegant way to combine these neural models. On the other hand, Transformers are unique, and differ from previous models in several ways. First, they do not depend on recurrent or convolutional neural networks for modeling sequences of words, but use only attention mechanisms and feed-forward neural networks. Second, the use of self-attention in Transformers makes it easier to deal with global contexts and dependencies among words. Third, Transformers are very flexible architectures and can be easily modified to accommodate different tasks.
The widespread use of Transformers motivates the development of cutting-edge techniques in deep learning. For example, there are significant refinements in self-attention mechanisms, which have been incorporated into many state-of-the-art NLP systems. The resulting techniques, together with the progress in self-supervised learning, have led us to a new era of AI: we are beginning to obtain models of universal language understanding, generation and reasoning. This has been evidenced by recent Transformer-based large language models (LLMs) which demonstrate amazing performance across a broad variety of tasks (Bubeck et al., 2023).
This paper provides an introduction to Transformers while reflecting the recent developments in applying these models to different problems. However, Transformers are so successful that there have been numerous related studies and we cannot give a full description of them. Therefore, we focus this work on the core ideas of Transformers, and present a basic description of the common techniques. We also discuss some recent advances in Transformers, such as model improvements for efficiency and accuracy considerations. Because the field is very active and new techniques are coming out every day, it is impossible to survey all the latest literature and we are not attempting to do so. Instead, we focus on just those concepts and algorithms most relevant to Transformers, aimed at the people who wish to get a general understanding of these models.
2 The Basic Model
Here we consider the model presented in Vaswani et al. (2017)’s work. We start by considering the Transformer architecture and discuss the details of the sub-models subsequently.
2.1 The Transformer Architecture
Figure 1 shows the standard Transformer model which follows the general encoder-decoder framework. A Transformer encoder comprises a number of stacked encoding layers (or encoding blocks). Each encoding layer has two different sub-layers (or sub-blocks), called the self-attention sub-layer and the feed-forward neural network (FFN) sub-layer. Suppose we have a source-side sequence $\mathbf{x} = x_1 \dots x_m$ and a target-side sequence $\mathbf{y} = y_1 \dots y_n$. The input of an encoding layer is a sequence of $m$ vectors $\mathbf{h}_1, \dots, \mathbf{h}_m$, each having $d_{\mathrm{model}}$ dimensions (or $d$ dimensions for simplicity). We follow the notation adopted in the previous chapters, using $\mathbf{H}$ to denote these input vectors (provided each $\mathbf{h}_i$ is a row vector, we have $\mathbf{H} = [\mathbf{h}_1; \dots; \mathbf{h}_m] \in \mathbb{R}^{m \times d}$). The self-attention sub-layer first performs a self-attention operation $\mathrm{Att}_{\mathrm{self}}(\cdot)$ on $\mathbf{H}$ to generate an output $\mathbf{C}$:

$$\mathbf{C} = \mathrm{Att}_{\mathrm{self}}(\mathbf{H}) \tag{1}$$

Here $\mathbf{C}$ is of the same size as $\mathbf{H}$, and can thus be viewed as a new representation of the inputs. Then, a residual connection and a layer normalization unit are added to the output so that the resulting model is easier to optimize.

The original Transformer model employs the post-norm structure where a residual connection is created before layer normalization is performed, like this

$$\mathbf{H}_{\mathrm{self}} = \mathrm{LNorm}(\mathbf{C} + \mathbf{H}) \tag{2}$$

where the addition of $\mathbf{H}$ denotes the residual connection (He et al., 2016a), and $\mathrm{LNorm}(\cdot)$ denotes the layer normalization function (Ba et al., 2016). Substituting Eq. (1) into Eq. (2), we obtain the form of the self-attention sub-layer

$$\mathbf{H}_{\mathrm{self}} = \mathrm{Layer}_{\mathrm{self}}(\mathbf{H}) = \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{H}) + \mathbf{H}) \tag{3}$$

The definitions of $\mathrm{LNorm}(\cdot)$ and $\mathrm{Att}_{\mathrm{self}}(\cdot)$ will be given later in this section.
The FFN sub-layer takes $\mathbf{H}_{\mathrm{self}}$ and outputs a new representation $\mathbf{H}_{\mathrm{ffn}}$. It has the same form as the self-attention sub-layer, with the attention function replaced by the FFN function, given by

$$\mathbf{H}_{\mathrm{ffn}} = \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}_{\mathrm{self}}) = \mathrm{LNorm}(\mathrm{FFN}(\mathbf{H}_{\mathrm{self}}) + \mathbf{H}_{\mathrm{self}}) \tag{4}$$

Here $\mathrm{FFN}(\cdot)$ could be any feed-forward neural network with non-linear activation functions. The most common structure of $\mathrm{FFN}(\cdot)$ is a two-layer network involving two linear transformations and a ReLU activation function between them.
For deep models, we can stack the above neural networks. Let $\mathbf{H}^{l}$ be the output of layer $l$. Then, we can express $\mathbf{H}^{l}$ as a function of $\mathbf{H}^{l-1}$. We write this as a composition of two sub-layers

$$\mathbf{H}^{l} = \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{H}^{l}_{\mathrm{self}}) \tag{5}$$

$$\mathbf{H}^{l}_{\mathrm{self}} = \mathrm{Layer}_{\mathrm{self}}(\mathbf{H}^{l-1}) \tag{6}$$

If there are $L$ encoding layers, then $\mathbf{H}^{L}$ will be the output of the encoder. In this case, $\mathbf{H}^{L}$ can be viewed as a representation of the input sequence that is learned by the Transformer encoder. $\mathbf{H}^{0}$ denotes the input of the encoder. In recurrent and convolutional models, $\mathbf{H}^{0}$ can simply be the word embeddings of the input sequence. Transformer takes a different way of representing the input words, and encodes the positional information explicitly. In Section 2.2 we will discuss the embedding model used in Transformers.
The Transformer decoder has a similar structure to the Transformer encoder. It comprises $L$ stacked decoding layers (or decoding blocks). Let $\mathbf{S}^{l}$ be the output of the $l$-th decoding layer. We can formulate a decoding layer by using the following equations

$$\mathbf{S}^{l}_{\mathrm{self}} = \mathrm{Layer}_{\mathrm{self}}(\mathbf{S}^{l-1}) \tag{7}$$

$$\mathbf{S}^{l}_{\mathrm{cross}} = \mathrm{Layer}_{\mathrm{cross}}(\mathbf{S}^{l}_{\mathrm{self}}, \mathbf{H}^{L}) \tag{8}$$

$$\mathbf{S}^{l} = \mathrm{Layer}_{\mathrm{ffn}}(\mathbf{S}^{l}_{\mathrm{cross}}) \tag{9}$$

Here there are three decoder sub-layers. The self-attention and FFN sub-layers are the same as those used in the encoder. $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ denotes a cross-attention sub-layer (or encoder-decoder attention sub-layer) which models the transformation from the source-side to the target-side. In Section 2.6 we will see that the attention function used in $\mathrm{Layer}_{\mathrm{cross}}(\cdot)$ can be implemented in the same form as that used in $\mathrm{Layer}_{\mathrm{self}}(\cdot)$.
The Transformer decoder outputs a distribution over a vocabulary $V_{y}$ at each target-side position. This is achieved by using a softmax layer that normalizes a linear transformation of $\mathbf{S}^{L}$ to distributions of target-side words. To do this, we map $\mathbf{S}^{L}$ to an $n \times |V_{y}|$ matrix $\mathbf{O}$ by

$$\mathbf{O} = \mathbf{S}^{L} \mathbf{W}_{o} \tag{10}$$

where $\mathbf{W}_{o} \in \mathbb{R}^{d \times |V_{y}|}$ is the parameter matrix of the linear transformation.
Then, the output of the Transformer decoder is given in the form

$$\Pr(\cdot \mid y_0, \dots, y_j, \mathbf{x}) = \mathrm{Softmax}(\mathbf{o}_{j}) \tag{11}$$

where $\mathbf{o}_{j}$ denotes the $j$-th row vector of $\mathbf{O}$, and $y_0$ denotes the start symbol $\langle \mathrm{sos} \rangle$. Under this model, the probability of $\mathbf{y}$ given $\mathbf{x}$ can be defined as usual,

$$\Pr(\mathbf{y} \mid \mathbf{x}) = \prod_{j=0}^{n-1} \Pr(y_{j+1} \mid y_0, \dots, y_j, \mathbf{x}) \tag{12}$$

This equation resembles the general form of language modeling: we predict the word at time $j+1$ given all of the words up to time $j$. Therefore, the input of the Transformer decoder is shifted one word left, that is, the input is $\langle \mathrm{sos} \rangle\, y_1 \dots y_{n-1}$ and the output is $y_1 \dots y_n$.
The Transformer architecture discussed above has several variants which have been successfully used in different fields of NLP. For example, we can use a Transformer encoder to represent texts (call it the encoder-only architecture), can use a Transformer decoder to generate texts (call it the decoder-only architecture), and can use a standard encoder-decoder Transformer model to transform an input sequence to an output sequence. In the rest of this chapter, most of the discussion is independent of the particular choice of application, and will be mostly focused on the encoder-decoder architecture. In Section 6, we will see applications of the encoder-only and decoder-only architectures.
2.2 Positional Encoding
In their original form, both the FFNs and attention models used in Transformer ignore an important property of sequence modeling, which is that the order of the words plays a crucial role in expressing the meaning of a sequence. This means that the encoder and decoder are insensitive to the positional information of the input words. A simple approach to overcoming this problem is to add positional encoding to the representation of each word of the sequence. More formally, a word $x_i$ can be represented as a $d$-dimensional vector

$$\mathbf{e}_{i} = \mathbf{x}_{i} + \mathrm{PE}(i) \tag{13}$$

Here $\mathbf{x}_{i}$ is the embedding of the word, which can be obtained by using word embedding models. $\mathrm{PE}(i)$ is the representation of the position $i$. The vanilla Transformer employs the sinusoidal positional encoding model, which we write in the form

$$\mathrm{PE}(i, 2k) = \sin\big(i / 10000^{2k/d}\big) \tag{14}$$

$$\mathrm{PE}(i, 2k+1) = \cos\big(i / 10000^{2k/d}\big) \tag{15}$$

where $\mathrm{PE}(i, k)$ denotes the $k$-th entry of $\mathrm{PE}(i)$. The idea of positional encoding is to distinguish different positions using continuous systems. Here we use the sine and cosine functions with different frequencies. The interested reader can refer to Appendix A to see that such a method can be interpreted as a carrying system. Because the encoding is based on individual positions, it is also called absolute positional encoding. In Section 4.1 we will see an improvement to this method.
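As a concrete illustration of Eqs. (14)-(15), the following sketch computes the sinusoidal encodings with NumPy; the function name, and the assumption that $d$ is even, are ours:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """Sinusoidal positional encodings of Eqs. (14)-(15); assumes d is even.

    Returns a (max_len, d) array whose i-th row is PE(i).
    """
    pe = np.zeros((max_len, d))
    i = np.arange(max_len)[:, None]           # positions 0, ..., max_len-1
    k = np.arange(0, d, 2)[None, :]           # even entry indices 2k
    angles = i / np.power(10000.0, k / d)     # i / 10000^(2k/d)
    pe[:, 0::2] = np.sin(angles)              # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)              # PE(i, 2k+1)
    return pe

# as in Eq. (13), the encoder input adds these rows to the word embeddings:
# E = X + sinusoidal_positional_encoding(m, d)   for word embeddings X of shape (m, d)
```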
Once we have the above embedding result, $\mathbf{e}_{i}$ is taken as the input to the Transformer encoder, that is,

$$\mathbf{h}^{0}_{i} = \mathbf{e}_{i} \tag{16}$$
Similarly, we can also define the input on the decoder side.
2.3 Multi-head Self-attention
The use of self-attention is perhaps one of the most significant advances in sequence-to-sequence models. It attempts to learn and make use of direct interactions between each pair of inputs. From a representation learning perspective, self-attention models assume that the learned representation at position $i$ (denoted by $\mathbf{c}_{i}$) is a weighted sum of the inputs over the sequence. The output $\mathbf{c}_{i}$ is thus given by

$$\mathbf{c}_{i} = \sum_{j=1}^{m} \alpha_{i,j}\, \mathbf{h}_{j} \tag{17}$$

where $\alpha_{i,j}$ indicates how strongly the input $\mathbf{h}_{i}$ is correlated with the input $\mathbf{h}_{j}$. We thus can view $\mathbf{c}_{i}$ as a representation of the global context at position $i$. $\alpha_{i,j}$ can be defined in different ways if one considers different attention models. Here we use the scaled dot-product attention function to compute $\alpha_{i,j}$, as follows

$$\alpha_{i,j} = \mathrm{Softmax}\Big(\frac{\mathbf{h}_{i} \mathbf{h}_{j}^{\mathrm{T}}}{\beta}\Big) = \frac{\exp(\mathbf{h}_{i} \mathbf{h}_{j}^{\mathrm{T}} / \beta)}{\sum_{j'=1}^{m} \exp(\mathbf{h}_{i} \mathbf{h}_{j'}^{\mathrm{T}} / \beta)} \tag{18}$$

where $\beta$ is a scaling factor and is set to $\sqrt{d}$.
Compared with conventional recurrent and convolutional models, an advantage of self-attention models is that they shorten the computational “distance” between two inputs. Figure 2 illustrates the information flow in these models. We see that, given the input at position $i$, self-attention models can directly access any other input. By contrast, recurrent and convolutional models might need two or more jumps to see the whole sequence.
We can have a more general view of self-attention by using the QKV attention model. Suppose we have a sequence of $n$ queries $\mathbf{q}_1, \dots, \mathbf{q}_n$, and a sequence of $m$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \dots, (\mathbf{k}_m, \mathbf{v}_m)$. The output of the model is a sequence of vectors, each corresponding to a query. The form of the QKV attention is given by

$$\mathbf{c}_{i} = \mathrm{Softmax}\Big(\frac{\mathbf{q}_{i} \mathbf{K}^{\mathrm{T}}}{\sqrt{d}}\Big) \mathbf{V} \tag{19}$$

We can write the output of the QKV attention model as a sequence of row vectors

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\mathbf{c}_1; \dots; \mathbf{c}_n] = \mathrm{Softmax}\Big(\frac{\mathbf{Q} \mathbf{K}^{\mathrm{T}}}{\sqrt{d}}\Big) \mathbf{V} \tag{20}$$

where $\mathbf{Q} = [\mathbf{q}_1; \dots; \mathbf{q}_n]$, $\mathbf{K} = [\mathbf{k}_1; \dots; \mathbf{k}_m]$, and $\mathbf{V} = [\mathbf{v}_1; \dots; \mathbf{v}_m]$. To apply this equation to self-attention, we simply have

$$\mathbf{Q} = \mathbf{H} \mathbf{W}^{q} \tag{21}$$

$$\mathbf{K} = \mathbf{H} \mathbf{W}^{k} \tag{22}$$

$$\mathbf{V} = \mathbf{H} \mathbf{W}^{v} \tag{23}$$

where $\mathbf{W}^{q}, \mathbf{W}^{k}, \mathbf{W}^{v} \in \mathbb{R}^{d \times d}$ are the parameters of the linear transformations of $\mathbf{H}$.
By considering Eq. (1), we then obtain

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Att}_{\mathrm{qkv}}(\mathbf{H}\mathbf{W}^{q}, \mathbf{H}\mathbf{W}^{k}, \mathbf{H}\mathbf{W}^{v}) = \mathbf{A} \cdot \mathbf{H}\mathbf{W}^{v} \tag{24}$$

Here $\mathbf{A} = \mathrm{Softmax}\big(\mathbf{H}\mathbf{W}^{q}(\mathbf{H}\mathbf{W}^{k})^{\mathrm{T}} / \sqrt{d}\big)$ is an $m \times m$ matrix in which each row represents a distribution over $\{1, \dots, m\}$, that is

$$\mathrm{row}_{i}(\mathbf{A}) = (\alpha_{i,1}, \dots, \alpha_{i,m}) \tag{25}$$
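To make Eqs. (21)-(25) concrete, here is a minimal single-head sketch in NumPy; the function names and the random toy inputs are ours, for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, Eq. (24).

    H: (m, d) input representations; Wq, Wk, Wv: (d, d) parameter matrices.
    """
    d = H.shape[-1]
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # Eqs. (21)-(23)
    A = softmax(Q @ K.T / np.sqrt(d))         # (m, m); each row is a distribution, Eq. (25)
    return A @ V                              # weighted sum of the values

# toy usage
m, d = 5, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
C = self_attention(H, Wq, Wk, Wv)             # (m, d) context representations
```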
We can improve the above self-attention model by using a technique called multi-head attention. This method can be motivated from the perspective of learning from multiple lower-dimensional feature sub-spaces, which projects a feature vector onto multiple sub-spaces and learns feature mappings on individual sub-spaces. Specifically, we project the whole of the input space into $\tau$ sub-spaces (call them heads); for example, we transform the queries, keys and values into $\tau$ sets of matrices of size $m \times \frac{d}{\tau}$, denoted by $\{\mathbf{Q}_{h}, \mathbf{K}_{h}, \mathbf{V}_{h}\}$. The attention model is then run $\tau$ times, each time on a head. Finally, the outputs of these model runs are concatenated, and transformed by a linear projection. This procedure can be expressed by

$$\mathrm{Att}_{\mathrm{mhead}}(\mathbf{H}) = \mathrm{Merge}(\mathbf{C}_{1}, \dots, \mathbf{C}_{\tau})\, \mathbf{W}_{c} \tag{26}$$

For each head $h \in \{1, \dots, \tau\}$,

$$\mathbf{Q}_{h} = \mathbf{Q} \mathbf{W}^{q}_{h} \tag{28}$$

$$\mathbf{K}_{h} = \mathbf{K} \mathbf{W}^{k}_{h} \tag{29}$$

$$\mathbf{V}_{h} = \mathbf{V} \mathbf{W}^{v}_{h} \tag{30}$$

$$\mathbf{C}_{h} = \mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q}_{h}, \mathbf{K}_{h}, \mathbf{V}_{h}) \tag{31}$$

Here $\mathrm{Merge}(\cdot)$ is the concatenation function, and $\mathrm{Att}_{\mathrm{qkv}}(\cdot)$ is the attention function described in Eq. (20). $\mathbf{W}^{q}_{h}, \mathbf{W}^{k}_{h}, \mathbf{W}^{v}_{h} \in \mathbb{R}^{d \times \frac{d}{\tau}}$ are the parameters of the projections from a $d$-dimensional space to a $\frac{d}{\tau}$-dimensional space for the queries, keys, and values. Thus, $\mathbf{Q}_{h}$, $\mathbf{K}_{h}$, $\mathbf{V}_{h}$, and $\mathbf{C}_{h}$ are all $m \times \frac{d}{\tau}$ matrices. $\mathrm{Merge}(\cdot)$ produces an $m \times d$ matrix. It is then transformed by a linear mapping $\mathbf{W}_{c} \in \mathbb{R}^{d \times d}$, leading to the final result $\mathrm{Att}_{\mathrm{mhead}}(\mathbf{H})$.
While the notation here seems somewhat tedious, it is convenient to implement multi-head models using various deep learning toolkits. A common method in Transformer-based systems is to store inputs from all the heads in data structures called tensors, so that we can make use of parallel computing resources to have efficient systems.
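The multi-head computation in Eqs. (26)-(31) can be sketched as follows; splitting the full $d \times d$ projections into heads by reshaping is one common implementation choice (equivalent to block-structured per-head matrices $\mathbf{W}^{q}_{h}, \mathbf{W}^{k}_{h}, \mathbf{W}^{v}_{h}$), and scaling by the per-head dimensionality is the usual convention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wc, n_heads):
    """Multi-head self-attention, Eqs. (26)-(31).

    H: (m, d); Wq, Wk, Wv, Wc: (d, d); n_heads: number of heads (tau).
    """
    m, d = H.shape
    dk = d // n_heads                                        # per-head dimensionality d / tau
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                         # (m, d) each

    # split into heads: (n_heads, m, dk), i.e., the per-head projections of Eqs. (28)-(30)
    split = lambda X: X.reshape(m, n_heads, dk).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dk))    # (n_heads, m, m)
    Ch = A @ Vh                                              # Eq. (31): one output per head

    C = Ch.transpose(1, 0, 2).reshape(m, d)                  # Merge(.): concatenate the heads
    return C @ Wc                                            # final linear projection, Eq. (26)
```

Storing the heads along a leading tensor dimension, as above, is exactly the kind of tensor layout that lets deep learning toolkits run all heads in parallel in a single pass.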
2.4 Layer Normalization
Layer normalization provides a simple and effective means to make the training of neural networks more stable by standardizing the activations of the hidden layers in a layer-wise manner. As introduced in Ba et al. (2016)’s work, given a layer’s output $\mathbf{h} \in \mathbb{R}^{d}$, the layer normalization method computes a standardized output $\mathrm{LNorm}(\mathbf{h})$ by

$$\mathrm{LNorm}(\mathbf{h}) = \mathbf{g} \odot \frac{\mathbf{h} - \mu}{\sigma + \epsilon} + \mathbf{b} \tag{32}$$

Here $\mu$ and $\sigma$ are the mean and standard deviation of the activations. Let $h_{k}$ be the $k$-th dimension of $\mathbf{h}$. $\mu$ and $\sigma$ are given by

$$\mu = \frac{1}{d} \sum_{k=1}^{d} h_{k} \tag{33}$$

$$\sigma = \sqrt{\frac{1}{d} \sum_{k=1}^{d} (h_{k} - \mu)^{2}} \tag{34}$$

Here $\mathbf{g}$ and $\mathbf{b}$ are the rescaling and bias terms. They can be treated as parameters of layer normalization, whose values are to be learned together with the other parameters of the Transformer model. The addition of $\epsilon$ to $\sigma$ is used for the purpose of numerical stability. In general, $\epsilon$ is chosen to be a small number.

We can illustrate the layer normalization method on the hidden states of an encoder with a small numerical example.
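A minimal sketch of Eqs. (32)-(34) follows; the example values of $\mathbf{h}$, $\mathbf{g}$, $\mathbf{b}$ and $\epsilon$ are ours and not those of the original illustration:

```python
import numpy as np

def layer_norm(h, g, b, eps=1e-6):
    """Layer normalization, Eqs. (32)-(34): standardize h, then rescale and shift."""
    mu = h.mean()                     # Eq. (33)
    sigma = h.std()                   # Eq. (34)
    return g * (h - mu) / (sigma + eps) + b

h = np.array([1.0, 2.0, 3.0, 4.0])    # a hidden state with d = 4
g = np.ones(4)                        # rescaling term
b = np.zeros(4)                       # bias term
print(layer_norm(h, g, b))            # approx. [-1.342, -0.447, 0.447, 1.342]
```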
As discussed in Section 2.1, the layer normalization unit in each sub-layer is used to standardize the output of a residual block. Here we describe a more general formulation for this structure. Suppose that $F(\cdot)$ is a neural network we want to run. Then, the post-norm structure of $F(\cdot)$ is given by

$$\mathrm{output} = \mathrm{LNorm}(F(\mathrm{input}) + \mathrm{input}) \tag{35}$$

where $\mathrm{input}$ and $\mathrm{output}$ are the input and output of this model. Clearly, Eq. (4) is an instance of this equation.
An alternative approach to introducing layer normalization and residual connections into modeling is to execute the $F(\cdot)$ function right after the $\mathrm{LNorm}(\cdot)$ function, and to establish an identity mapping from the input to the output of the entire sub-layer. This structure, known as the pre-norm structure, can be expressed in the form

$$\mathrm{output} = F(\mathrm{LNorm}(\mathrm{input})) + \mathrm{input} \tag{36}$$
Both post-norm and pre-norm Transformer models are widely used in NLP systems. See Figure 3 for a comparison of these two structures. In general, residual connections are considered an effective means to make the training of multi-layer neural networks easier. In this sense, pre-norm Transformer seems promising because it follows the convention that a residual connection is created to bypass the whole network and that the identity mapping from the input to the output leads to easier optimization of deep models. However, by considering the expressive power of a model, there may be modeling advantages in using post-norm Transformer because it does not so much rely on residual connections and enforces more sophisticated modeling for representation learning. In Section 4.2, we will see a discussion on this issue.
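The contrast between the two structures in Eqs. (35)-(36) (and in Figure 3) can be sketched with a generic callable `F`, standing for an attention or FFN function; the helper names are ours:

```python
import numpy as np

def layer_norm(h, g, b, eps=1e-6):
    return g * (h - h.mean()) / (h.std() + eps) + b

def post_norm_sublayer(x, F, g, b):
    """Post-norm: residual connection first, then layer normalization (Eq. (35))."""
    return layer_norm(F(x) + x, g, b)

def pre_norm_sublayer(x, F, g, b):
    """Pre-norm: normalize the input, run F, then add the identity mapping (Eq. (36))."""
    return x + F(layer_norm(x, g, b))
```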
2.5 Feed-forward Neural Networks
The use of FFNs in Transformer is inspired in part by the fact that complex outputs can be formed by transforming the inputs through nonlinearities. While the self-attention model itself has some nonlinearity (in the Softmax function), a more common way to do this is to consider additional layers with non-linear activation functions and linear transformations. Given an input $\mathbf{h}$ and an output $\mathrm{FFN}(\mathbf{h})$, the $\mathrm{FFN}(\cdot)$ function in Transformer has the following form

$$\mathrm{FFN}(\mathbf{h}) = \mathbf{h}_{\mathrm{hidden}} \mathbf{W}_{2} + \mathbf{b}_{2} \tag{37}$$

$$\mathbf{h}_{\mathrm{hidden}} = \mathrm{ReLU}(\mathbf{h} \mathbf{W}_{1} + \mathbf{b}_{1}) \tag{38}$$

where $\mathbf{h}_{\mathrm{hidden}}$ is the hidden state, and $\mathbf{W}_{1} \in \mathbb{R}^{d \times d_{\mathrm{ffn}}}$, $\mathbf{b}_{1} \in \mathbb{R}^{d_{\mathrm{ffn}}}$, $\mathbf{W}_{2} \in \mathbb{R}^{d_{\mathrm{ffn}} \times d}$, and $\mathbf{b}_{2} \in \mathbb{R}^{d}$ are the parameters. This is a two-layer FFN in which the first layer (or hidden layer) introduces a nonlinearity through $\mathrm{ReLU}(\cdot)$ (where $\mathrm{ReLU}(x) = \max(0, x)$), and the second layer involves only a linear transformation. It is common practice in Transformer to use a larger size of the hidden layer. For example, a common choice is $d_{\mathrm{ffn}} = 4d$, that is, the size of each hidden representation is 4 times as large as the input.
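A sketch of the two-layer FFN in Eqs. (37)-(38); the shapes and names are ours, and the $d_{\mathrm{ffn}} = 4d$ choice is only the common default mentioned above:

```python
import numpy as np

def ffn(h, W1, b1, W2, b2):
    """Position-wise two-layer feed-forward network, Eqs. (37)-(38).

    h: (m, d); W1: (d, d_ffn); b1: (d_ffn,); W2: (d_ffn, d); b2: (d,).
    """
    h_hidden = np.maximum(0.0, h @ W1 + b1)   # ReLU(h W1 + b1), Eq. (38)
    return h_hidden @ W2 + b2                 # second, purely linear layer, Eq. (37)

# typical sizing: d_ffn = 4 * d; the same parameters are applied at every position
```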
Note that using a wide FFN sub-layer has been proven to be of great practical value in many state-of-the-art systems. However, a consequence of this is that a large portion of the model is occupied by the parameters of the FFN sub-layers. Table 1 shows parameter numbers and time complexities for different modules of a standard Transformer system. We see that FFNs dominate the model size when $d_{\mathrm{ffn}}$ is large, though they are not the most time-consuming components. In the case of very big Transformer models, we therefore wish to address this problem for building efficient systems.
| | Sub-model | # of Parameters | Time Complexity | × |
|---|---|---|---|---|
| Encoder | Multi-head Self-attention | $4d^{2}$ | $O(m^{2}d + md^{2})$ | $L$ |
| | Feed-forward Network | $2d \cdot d_{\mathrm{ffn}}$ | $O(m \cdot d \cdot d_{\mathrm{ffn}})$ | $L$ |
| | Layer Normalization | $2d$ | $O(m \cdot d)$ | $2L$ |
| Decoder | Multi-head Self-attention | $4d^{2}$ | $O(n^{2}d + nd^{2})$ | $L$ |
| | Multi-head Cross-attention | $4d^{2}$ | $O(mnd + (m+n)d^{2})$ | $L$ |
| | Feed-forward Network | $2d \cdot d_{\mathrm{ffn}}$ | $O(n \cdot d \cdot d_{\mathrm{ffn}})$ | $L$ |
| | Layer Normalization | $2d$ | $O(n \cdot d)$ | $3L$ |

Table 1: Numbers of parameters and time complexities of different Transformer modules under different settings. $m$ = source-sequence length, $n$ = target-sequence length, $d$ = default dimensionality of the hidden layers, $d_{\mathrm{ffn}}$ = dimensionality of the FFN hidden layer, $\tau$ = number of heads in the attention models, and $L$ = number of encoding or decoding layers. The column × indicates how many times a sub-model is applied on the encoder or decoder side. Time complexity is estimated by counting the number of floating-point multiplications.
2.6 Attention Models on the Decoder Side
A decoder layer involves two attention sub-layers, the first of which is a self-attention sub-layer, and the second is a cross-attention sub-layer. These sub-layers are based on either the post-norm or the pre-norm structure, but differ in the designs of the attention functions. Consider, for example, the post-norm structure described in Eq. (35). We can define the cross-attention and self-attention sub-layers for a decoding layer to be

$$\mathbf{S}_{\mathrm{cross}} = \mathrm{LNorm}(\mathrm{Att}_{\mathrm{cross}}(\mathbf{S}_{\mathrm{self}}, \mathbf{H}) + \mathbf{S}_{\mathrm{self}}) \tag{39}$$

$$\mathbf{S}_{\mathrm{self}} = \mathrm{LNorm}(\mathrm{Att}_{\mathrm{self}}(\mathbf{S}) + \mathbf{S}) \tag{40}$$

where $\mathbf{S}$ is the input of the self-attention sub-layer, $\mathbf{S}_{\mathrm{self}}$ and $\mathbf{S}_{\mathrm{cross}}$ are the outputs of the sub-layers, and $\mathbf{H}$ is the output of the encoder (for an encoder having $L$ encoding layers, $\mathbf{H} = \mathbf{H}^{L}$).
As with conventional attention models, cross-attention is primarily used to model the correspondence between the source-side and target-side sequences. The $\mathrm{Att}_{\mathrm{cross}}(\cdot)$ function is based on the QKV attention model, which generates the result of querying a collection of key-value pairs. More specifically, we define the queries, keys and values as linear mappings of $\mathbf{S}_{\mathrm{self}}$ and $\mathbf{H}$, as follows

$$\mathbf{Q}_{\mathrm{cross}} = \mathbf{S}_{\mathrm{self}} \mathbf{W}^{q}_{\mathrm{cross}} \tag{41}$$

$$\mathbf{K}_{\mathrm{cross}} = \mathbf{H} \mathbf{W}^{k}_{\mathrm{cross}} \tag{42}$$

$$\mathbf{V}_{\mathrm{cross}} = \mathbf{H} \mathbf{W}^{v}_{\mathrm{cross}} \tag{43}$$

where $\mathbf{W}^{q}_{\mathrm{cross}}, \mathbf{W}^{k}_{\mathrm{cross}}, \mathbf{W}^{v}_{\mathrm{cross}} \in \mathbb{R}^{d \times d}$ are the parameters of the mappings. In other words, the queries are defined based on $\mathbf{S}_{\mathrm{self}}$, and the keys and values are defined based on $\mathbf{H}$.
$\mathrm{Att}_{\mathrm{cross}}(\cdot)$ is then defined as

$$\mathrm{Att}_{\mathrm{cross}}(\mathbf{S}_{\mathrm{self}}, \mathbf{H}) = \mathrm{Softmax}\Big(\frac{\mathbf{Q}_{\mathrm{cross}} \mathbf{K}_{\mathrm{cross}}^{\mathrm{T}}}{\sqrt{d}}\Big) \mathbf{V}_{\mathrm{cross}} \tag{44}$$

The $\mathrm{Att}_{\mathrm{self}}(\cdot)$ function has a similar form to $\mathrm{Att}_{\mathrm{cross}}(\cdot)$, with linear mappings of $\mathbf{S}$ taken as the queries, keys, and values, like this

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{S}) = \mathrm{Softmax}\Big(\frac{\mathbf{Q} \mathbf{K}^{\mathrm{T}}}{\sqrt{d}} + \mathbf{Mask}\Big) \mathbf{V} \tag{45}$$

where $\mathbf{Q} = \mathbf{S}\mathbf{W}^{q}$, $\mathbf{K} = \mathbf{S}\mathbf{W}^{k}$, and $\mathbf{V} = \mathbf{S}\mathbf{W}^{v}$ are linear mappings of $\mathbf{S}$ with parameters $\mathbf{W}^{q}, \mathbf{W}^{k}, \mathbf{W}^{v} \in \mathbb{R}^{d \times d}$.
This form is similar to that of Eq. (20). A difference compared to self-attention on the encoder side, however, is that the model here needs to follow the rule of left-to-right generation (see Figure 2). That is, given a target-side word at position $i$, we can see only the target-side words in the left context (positions $1, \dots, i$). To do this, we add a masking variable $\mathbf{Mask}$ to the unnormalized weight matrix $\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}}$. Both $\mathbf{Mask}$ and $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$ are of size $n \times n$, and so a lower value of an entry of $\mathbf{Mask}$ means a larger bias towards lower alignment scores for the corresponding entry of $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$. In order to avoid access to the right context given $i$, $\mathbf{Mask}$ is defined to be

$$\mathrm{Mask}_{i,j} = \begin{cases} 0 & j \le i \\ -\infty & \text{otherwise} \end{cases} \tag{46}$$

where $\mathrm{Mask}_{i,j}$ indicates a bias term for the alignment score between positions $i$ and $j$. Below we show an example of how the masking variable is applied (assume $n = 4$).

$$\mathbf{Mask} = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{47}$$
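Here is a small sketch of building and applying the causal mask of Eqs. (45)-(47); the helper names are ours:

```python
import numpy as np

def causal_mask(n):
    """Mask_{i,j} = 0 if j <= i, and -inf otherwise (Eq. (46))."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf         # block the right context
    return mask

def masked_self_attention(S, Wq, Wk, Wv):
    """Decoder-side self-attention with the additive causal mask (Eq. (45))."""
    n, d = S.shape
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    scores = Q @ K.T / np.sqrt(d) + causal_mask(n)  # masked, unnormalized weights
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)                              # exp(-inf) = 0: no attention to the future
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```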
As noted in Section 2.3, it is easy to improve these models by using the multi-head attention mechanism. Also, since decoders are typically the most time-consuming part of practical systems, the bulk of the computational effort in running these systems is very much concerned with the efficiency of the attention modules on the decoder side.
2.7 Training and Inference
Transformers can be trained and used in a regular way. For example, we can train a Transformer model by performing gradient descent to minimize some loss function on the training data, and test the trained model by performing beam search on the unseen data. Below we present some of the techniques that are typically used in the training and inference of Transformer models.
• Learning Rate Scheduling. As with standard neural networks, Transformers can be directly trained using back-propagation. The training process is generally iterated many times to make the models fit the training data well. In each training step, we update the weights of the neural networks by moving them a small step in the direction of the negative gradients of the errors. There are many ways to design the update rule of training. A popular choice is to use the Adam optimization method (Kingma and Ba, 2014). To adjust the learning rate during training, Vaswani et al. (2017) present a learning rate scheduling strategy which increases the learning rate linearly for a number of steps and then decays it gradually. They design a learning rate of the form

$$lrate = lrate_{0} \cdot \min\big(step^{-0.5},\ step \cdot warmup^{-1.5}\big) \tag{48}$$

where $lrate_{0}$ denotes the initial learning rate, $step$ denotes the number of training steps we have executed, and $warmup$ denotes the number of warmup steps. In the first $warmup$ steps, the learning rate grows larger as training proceeds. It reaches the highest value at the point of $step = warmup$, and then decreases as an inverse square root function (i.e., $lrate_{0} \cdot step^{-0.5}$).
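The schedule of Eq. (48) is essentially a one-liner; `lrate0` and `warmup` below are the quantities named above:

```python
def lrate(step, lrate0, warmup):
    """Warmup-then-decay learning rate of Eq. (48)."""
    step = max(step, 1)               # avoid 0 ** -0.5 at the very first step
    return lrate0 * min(step ** -0.5, step * warmup ** -1.5)

# e.g., with warmup = 4000 the rate rises linearly for 4000 steps,
# peaks at step == warmup, and then decays as lrate0 * step ** -0.5
```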
• Batching and Padding. To make a trade-off between global optimization and training convergence, it is common to update the weights each time on a relatively small collection of samples, called a minibatch of samples. Therefore, we can consider a batch version of the forward and backward computation processes in which the whole minibatch is used together to obtain the gradient information. One advantage of batching is that it allows the system to make use of efficient tensor operations to deal with multiple sequences in a single run. This requires that all the input sequences in a minibatch are stored in a single memory block, so that they can be read in and processed together. To illustrate this idea, consider a minibatch containing four samples whose source-sides are

A B C D E F

M N

R S T W

X Y Z

We can store these sequences in a continuous block where each “row” represents a sequence, like this

A B C D E F

M N ⟨pad⟩ ⟨pad⟩ ⟨pad⟩ ⟨pad⟩

R S T W ⟨pad⟩ ⟨pad⟩

X Y Z ⟨pad⟩ ⟨pad⟩ ⟨pad⟩

Here padding words ⟨pad⟩ are inserted between sequences, so that these sequences are aligned in the memory. Typically, we do not want padding to affect the operation of the system, and so we can simply define ⟨pad⟩ as a zero vector (call it zero padding). On the other hand, in some cases we are interested in using padding to describe something that is not covered by the input sequences. For example, we can replace padding words with the words in the left (or right) context of a sequence, though this may require modifications to the system to ensure that the newly added context words do not cause additional content to appear in the output.
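A sketch of this padding step, using the four example sequences above (the helper name is ours, and the grouping of the letters into samples is reconstructed for illustration):

```python
def pad_batch(sequences, pad="<pad>"):
    """Right-pad every sequence to the length of the longest one in the minibatch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad] * (max_len - len(seq)) for seq in sequences]

batch = [["A", "B", "C", "D", "E", "F"],
         ["M", "N"],
         ["R", "S", "T", "W"],
         ["X", "Y", "Z"]]
for row in pad_batch(batch):
    print(row)    # every row now has length 6 and the batch fits in one tensor
```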
• Search and Caching. At test time, we need to search the space of candidate hypotheses (or candidate target-side sequences) to identify the hypothesis (or target-side sequence) with the highest score

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\ \mathrm{Score}(\mathbf{x}, \mathbf{y}) \tag{49}$$

where $\mathrm{Score}(\mathbf{x}, \mathbf{y})$ is the model score of the target-side sequence $\mathbf{y}$ given the source-side sequence $\mathbf{x}$. While there are many search algorithms to achieve this, most of them share a similar structure: the search program operates by extending the candidate target-side sequences in a pool one word at a time. In this way, the resulting algorithm can be viewed as a left-to-right generation procedure. Note that all of the designs of $\mathrm{Score}(\mathbf{x}, \mathbf{y})$, no matter how complex, are based on computing $\Pr(\mathbf{y} \mid \mathbf{x})$. Because the attention models used in Transformer require computing the dot-product of each pair of the input vectors of a layer, the time complexity of the search algorithm is a quadratic function of the length of $\mathbf{y}$. It is therefore not efficient to repeatedly compute the outputs of the attention models for positions that have been dealt with. This problem can be addressed by caching the states of each layer for the words we have seen. Figure 4 illustrates the use of the caching mechanism in a search step. All the states for positions $1, \dots, j$ are maintained and easily accessed in a cache. At position $j+1$, all we need is to compute the states for the newly added word, and then to update the cache.
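A minimal sketch of the caching idea for one self-attention layer during left-to-right generation; `KVCache` and its method names are our own illustration, not an interface from the text:

```python
import numpy as np

class KVCache:
    """Keeps the keys and values of all positions decoded so far for one layer."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, s_new, Wq, Wk, Wv):
        """Attend from the newly added position only, then update the cache."""
        d = s_new.shape[-1]
        q = s_new @ Wq                               # query of the new position
        self.K = np.vstack([self.K, s_new @ Wk])     # append the new key
        self.V = np.vstack([self.V, s_new @ Wv])     # append the new value
        scores = q @ self.K.T / np.sqrt(d)           # scores over positions 1, ..., j+1
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.V                      # no recomputation for old positions
```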
3 Syntax-aware Models
Although Transformer is simply a deep learning model that does not make use of any linguistic structure or assumption, it may be necessary to incorporate our prior knowledge into such systems. This is in part because NLP researchers have long believed that a higher level of abstraction of the data is needed to develop ideal NLP systems, and there have been many systems that use structure as priors. However, structure is a wide-ranging topic, and there are several types of structure one may refer to. For example, the inductive biases used in our model design can be thought of as some structural prior, while NLP models can also learn the underlying structure of problems by themselves. In this section we will discuss some of these issues. We will focus on the methods of introducing linguistic structure into Transformer models. As Transformer can be applied to many NLP tasks, which differ much in their input and output formats, we will primarily discuss modifications to Transformer encoders (call them syntax-aware Transformer encoders). Our discussion, however, is general, and the methods can be easily extended to Transformer decoders.
3.1 Syntax-aware Input and Output
One of the simplest methods of incorporating structure into NLP systems is to modify the input sequence, leaving the system unchanged. As a simple example, consider a sentence where each word $x_i$ is assigned a set of $K$ syntactic labels $\{\mathrm{tag}_{i,1}, \dots, \mathrm{tag}_{i,K}\}$ (e.g., POS labels and dependency labels). We can write these symbols together to define a new “word”

$$\tilde{x}_{i} = \big(x_{i}, \mathrm{tag}_{i,1}, \dots, \mathrm{tag}_{i,K}\big)$$

Then, the embedding of this word is given by

$$\mathbf{e}_{i} = e(\tilde{x}_{i}) \tag{50}$$

where $e(\cdot)$ is the embedding function of $\tilde{x}_{i}$. Since $\tilde{x}_{i}$ is a complex symbol, we decompose the learning problem of $e(\tilde{x}_{i})$ into easier problems. For example, we can develop $K$ tag embedding models, each producing an embedding given a tag. Then, we write $e(\tilde{x}_{i})$ as a sum of the word embedding and the tag embeddings

$$e(\tilde{x}_{i}) = \mathbf{x}_{i} + e_{\mathrm{tag}}(\mathrm{tag}_{i,1}) + \dots + e_{\mathrm{tag}}(\mathrm{tag}_{i,K}) \tag{51}$$

where $e_{\mathrm{tag}}(\mathrm{tag}_{i,1}), \dots, e_{\mathrm{tag}}(\mathrm{tag}_{i,K})$ are the embeddings of the tags. Alternatively, we can combine these embeddings via a neural network in the form

$$e(\tilde{x}_{i}) = F_{\mathrm{merge}}\big(\big[\mathbf{x}_{i}, e_{\mathrm{tag}}(\mathrm{tag}_{i,1}), \dots, e_{\mathrm{tag}}(\mathrm{tag}_{i,K})\big]\big) \tag{52}$$

where $F_{\mathrm{merge}}(\cdot)$ is a feed-forward neural network that has one layer or two.
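A sketch of the additive combination in Eq. (51); the table sizes, names and lookup scheme are ours:

```python
import numpy as np

class SyntaxAwareEmbedding:
    """Embeds a word together with its syntactic tags by summing their embeddings (Eq. (51))."""
    def __init__(self, vocab_size, tag_vocab_size, d, seed=0):
        rng = np.random.default_rng(seed)
        self.word_table = rng.normal(scale=0.1, size=(vocab_size, d))
        self.tag_table = rng.normal(scale=0.1, size=(tag_vocab_size, d))

    def __call__(self, word_id, tag_ids):
        # word embedding plus the sum of its tag embeddings
        return self.word_table[word_id] + self.tag_table[tag_ids].sum(axis=0)

# e.g., a word (id 12) annotated with a POS tag (id 3) and a dependency label (id 7):
# emb = SyntaxAwareEmbedding(vocab_size=10000, tag_vocab_size=50, d=512)
# e_i = emb(12, [3, 7])
```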
We can do the same thing for sentences on the decoder side as well, and treat $\tilde{y}_{j}$ as a syntax-augmented word. However, this may lead to a much larger target-side vocabulary and poses a computational challenge for training and inference.
Another form that is commonly used to represent a sentence is the syntax tree. In linguistics, the syntax of a sentence can be interpreted in many different ways, resulting in various grammars and the corresponding tree (or graph)-based representations. While these representations differ in their syntactic forms, a general approach to using them in sequence modeling is tree linearization. Consider the following sentence, annotated with a constituency-based parse tree:

It 's interesting !

We can write this tree structure as a sequence of words, syntactic labels and brackets via a tree traversal algorithm, as follows

(S (NP (PRP It ) ) (VP (VBZ 's ) (ADJP (JJ interesting ) ) ) (. ! ) )
This sequence of syntactic tokens can be used as an input to the system, that is, each token is represented by word and positional embeddings, and then the sum of these embeddings is treated as a regular input of the encoder. An example of the use of linearized trees is tree-to-string machine translation in which a syntax tree in one language is translated into a string in another language (Li et al., 2017; Currey and Heafield, 2018). Linearized trees can also be used for tree generation. For example, we can frame parsing tasks as sequence-to-sequence problems to map an input text to a sequential representation of its corresponding syntax tree (Vinyals et al., 2015; Choe and Charniak, 2016). See Figure 5 for illustrations of these models. It should be noted that the methods described here are not specific to Transformer but could be applied to many models, such as RNN-based models.
3.2 Syntax-aware Attention Models
For Transformer models, it also makes sense to make use of syntax trees to guide the process of learning sequence representations. In the previous section we saw how representations of a sequence can be computed by relating different positions within that sequence. This allows us to impose some structure on these relations, which are represented by distributions of attention weights over all the positions. To do this we use the encoder self-attention with an additive mask

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}\Big(\frac{\mathbf{H}\mathbf{W}^{q}(\mathbf{H}\mathbf{W}^{k})^{\mathrm{T}}}{\sqrt{d}} + \mathbf{M}\Big) \mathbf{H}\mathbf{W}^{v} \tag{53}$$

or alternatively with a multiplicative mask

$$\mathrm{Att}_{\mathrm{self}}(\mathbf{H}) = \mathrm{Softmax}\Big(\frac{\mathbf{H}\mathbf{W}^{q}(\mathbf{H}\mathbf{W}^{k})^{\mathrm{T}}}{\sqrt{d}} \odot \mathbf{M}\Big) \mathbf{H}\mathbf{W}^{v} \tag{54}$$

where $\mathbf{M}$ is an $m \times m$ matrix of masking variables in which a larger value of $M_{i,j}$ indicates a stronger syntactic correlation between positions $i$ and $j$. In the following description we choose Eq. (54) as the basic form.
One common way to design $\mathbf{M}$ is to project syntactic relations of the input tree structure into constraints over the sequence. Here we consider constituency parse trees and dependency parse trees for illustration. Generally, two types of masking methods are employed.
• 0-1 Masking. This method assigns $M_{i,j}$ a value of 1 if the words at positions $i$ and $j$ are considered syntactically correlated, and a value of 0 otherwise (Zhang et al., 2020; Bai et al., 2021). To model the relation between two words in a syntax tree, we can consider the distance between their corresponding nodes. One of the simplest forms is given by

$$M_{i,j} = \begin{cases} 1 & \mathrm{dist}(i,j) \le \omega \\ 0 & \text{otherwise} \end{cases} \tag{55}$$

where $\mathrm{dist}(i,j)$ is the length of the shortest path between the nodes of the words at positions $i$ and $j$. For example, given a dependency parse tree, $\mathrm{dist}(i,j)$ is the number of dependency edges in the path between the two words. For a constituency parse tree, all the words are leaf nodes, and so $\mathrm{dist}(i,j)$ gives a tree distance between the two leaves in the same branch of the tree. $\omega$ is a parameter used to control the maximum distance between two nodes that can be considered syntactically correlated. For example, assuming that there is a dependency parse tree and $\omega = 1$, Eq. (55) enforces a constraint that the attention score between positions $i$ and $j$ is computed only if they have a parent-dependent relation. (For multiplicative masks, $M_{i,j} = 0$ does not mean that the attention weight between $i$ and $j$ is zero, because the Softmax function does not give a zero output for a dimension whose corresponding input is of a zero value. A method to “mask” such an entry of the weight matrix is to use an additive mask and set $M_{i,j} = -\infty$ if $\mathrm{dist}(i,j) > \omega$.)
•
Soft Masking. Instead of treating as a hard constraint, we can use it as a soft constraint that scales the attention weight between positions and in terms of the degree to which the corresponding words are correlated. An idea is to reduce the attention weight as becomes larger. A very simple method to do this is to transform in some way that holds a negative correlation relationship with and its value falls into the interval
(56)
• 软掩码。不是将 视为硬约束,我们可以将其用作软约束,根据对应词语的相关程度来缩放位置 和 之间的注意力权重。一个想法是随着 的增大而减少注意力权重。实现这一点的非常简单的方法是将 以某种方式转换,使得 与 保持负相关关系,并且其值落在 区间内。There are several alternative designs for . For example, one can compute a standardized score of by subtracting its mean and dividing by its standard deviation (Chen et al., 2018a), or can normalize over all possible in the sequence (Xu et al., 2021b). In cases where parsers can output a score between positions and , it is also possible to use this score to compute . For example, a dependency parser can produce the probability of the word at position being the parent of the word at position (Strubell et al., 2018). We can then write as
存在几种针对 的替代设计方案。例如,可以通过减去其均值并除以其标准差来计算 的标准分数(Chen 等人,2018a),或者可以在序列中所有可能的 上对