All you need to know about ‘Attention’ and ‘Transformers’ — In-depth Understanding — Part 1
Attention, Self-Attention, Multi-head Attention, and Transformers
This is a long article that covers almost everything one needs to know about the Attention mechanism, including Self-Attention, Queries, Keys, Values, Multi-Head Attention, Masked Multi-Head Attention, and Transformers, with some details on BERT and GPT. I have therefore divided the article into two parts. In this article, I cover all the Attention blocks, and in the next story, I will dive into the Transformer network architecture.
Contents:
1. Challenges with RNNs and how Transformer models can help overcome those challenges
2. The Attention Mechanism
2.1 Self-Attention
2.2 Query, Key, and Values
2.3 Neural network representation of Attention
2.4 Multi-Head Attention
3. Transformers (Continued in next story)
Introduction
The attention mechanism was first used in 2014 in computer vision, to try to understand what a neural network is looking at while making a prediction. This was one of the first steps toward understanding the outputs of Convolutional Neural Networks (CNNs). In 2015, attention was first used in Natural Language Processing (NLP), in aligned machine translation. Finally, in 2017, the attention mechanism was used in Transformer networks for language modeling. Transformers have since surpassed the prediction accuracy of Recurrent Neural Networks (RNNs) to become the state of the art for NLP tasks.
1. Challenges with RNNs and how Transformer models can help overcome those challenges
1.1 RNN problem 1: Struggles with long-range dependencies. RNNs do not work well with long text documents.
Transformer Solution: Transformer networks almost exclusively use attention blocks. Attention draws connections between any two parts of the sequence, so long-range dependencies are no longer a problem. With Transformers, long-range dependencies are as likely to be taken into account as short-range ones.
1.2 RNN problem 2: Suffers from vanishing and exploding gradients.
Transformer Solution: There is little to no vanishing or exploding gradient problem. In Transformer networks, the entire sequence is processed simultaneously rather than one step at a time, and depth comes only from the few layers stacked on top of that. So vanishing or exploding gradients are rarely an issue.
1.3 RNN problem 3: RNNs need more training steps to reach a local or global minimum. An RNN can be visualized as an unrolled network that is very deep, with a depth that depends on the length of the sequence. Unrolling gives rise to many copies of the parameters, and these copies are tightly interlinked with one another. As a result, the optimization takes longer and requires many steps.
Transformer Solution: Requires fewer steps to train than an RNN.
1.4 RNN problem 4: RNNs do not allow parallel computation. GPUs excel at parallel computation, but RNNs are sequence models: each step depends on the result of the previous one, so the computation occurs sequentially and cannot be parallelized across the sequence.
Transformer Solution: Transformer networks have no recurrence, so the computation for every position in the sequence can be done in parallel.
2. The Attention Mechanism
2.1 Self-Attention
Consider the sentence “Bark is very cute and he is a dog”. This sentence has 9 words, or tokens. If we just consider the word ‘he’ in the sentence, we see that ‘and’ and ‘is’ are the two words closest to it. But these words do not give the word ‘he’ any context. Rather, the words ‘Bark’ and ‘dog’ are much more related to ‘he’ in the sentence. From this, we understand that proximity is not always relevant, and that context matters more in a sentence.
When this sentence is fed to a computer, it considers each word as a token t, and each token has a word embedding V. But these word embeddings have no context. So the idea is to apply some kind of weighting or similarity to obtain a final word embedding Y, which has more context than the initial embedding V.
In an embedding space, similar words appear closer together, that is, they have similar embeddings. For example, the word ‘king’ will be more related to the words ‘queen’ and ‘royalty’ than to the word ‘zebra’. Similarly, ‘zebra’ will be more related to ‘horse’ and ‘stripes’ than to the word ‘emotion’. To know more about embedding spaces, please see the video by Andrew Ng (NLP and Word Embeddings).
So, intuitively, if the word ‘king’ appears at the beginning of the sentence and the word ‘queen’ appears at the end, they should provide each other with better context. We use this idea to find the weights W, by taking the dot product of the word embeddings with one another. So, in the sentence “Bark is very cute and he is a dog”, instead of using the word embeddings as they are, we take the dot product of the embedding of each word with the embedding of every other word. Figure 3 should illustrate this better.
As we see in Figure 3, we first find the weights by taking the dot product of the initial embedding of the first word with the embeddings of all the words in the sentence, including itself. These weights (W11 to W19) are then normalized to sum to 1. Next, these weights are multiplied with the initial embeddings of all the words in the sentence.
W11 V1 + W12 V2 + … + W19 V9 = Y1
W11 to W19 are all weights that carry the context of the first word V1. So when we multiply these weights with each word, we are essentially reweighting all the other words towards the first word. So in a sense, the word ‘Bark’ now tends more towards the words ‘dog’ and ‘cute’ than towards the word that comes right after it. And this, in a way, gives some context.
This is repeated for all words so that each word gets some context from every other word in the sentence.
Figure 4 gives a better understanding of the above steps to obtain Y1, using a pictorial diagram.
What is interesting here is that no weights are trained, and the order or proximity of the words has no influence on the result. Also, the process does not depend on the length of the sentence, that is, it does not matter whether the sentence has more or fewer words. This approach of adding some context to the words in a sentence is known as Self-Attention.
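The untrained self-attention described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch: the toy embeddings are made up, the helper names (`dot`, `softmax`, `self_attention`) are my own, and softmax is assumed as the way to normalize the weights so they sum to 1.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1 (assumed normalization)."""
    m = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Return one context-aware vector Y_i for every input vector V_i."""
    outputs = []
    for v_i in embeddings:
        # Raw weights: dot product of this word with every word, itself included.
        scores = [dot(v_i, v_j) for v_j in embeddings]
        weights = softmax(scores)  # W_i1 ... W_in
        # Y_i = W_i1 * V_1 + W_i2 * V_2 + ... + W_in * V_n
        y_i = [sum(w * v_j[d] for w, v_j in zip(weights, embeddings))
               for d in range(len(v_i))]
        outputs.append(y_i)
    return outputs

# Toy 2-dimensional embeddings (made-up numbers, not from a real model).
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(V)
print(Y)
```

Nothing in this function is learned, and it works for any number of input vectors, which matches the two observations made above.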
2.2 Query, Key, and Values
The issue with this form of Self-Attention is that nothing is being trained. But if we add some trainable parameters, the network can learn patterns that give much better context. These trainable parameters can be matrices whose values are learned. This is why the idea of Query, Key, and Values was introduced.
Let's again consider the previous sentence, “Bark is very cute and he is a dog”. In Figure 4, we see that the initial word embeddings (V) are used 3 times in self-attention. The first two uses are in the dot product between the first word embedding (1st use) and the embeddings of all the words in the sentence, including itself (2nd use), which gives the weights. The 3rd use is when the embeddings are multiplied by those weights to obtain the final embedding with context. These 3 occurrences of the V’s can be replaced by the three terms Query, Keys, and Values.
Let's say we want to make all the words similar with respect to the first word V1. We then send V1 as the Query word. This query word takes a dot product with all the words in the sentence (V1 to V9), and these are the Keys. So the combination of the Query and the Keys gives us the weights. These weights are then multiplied with all the words again (V1 to V9), which act as the Values. There we have it: the Query, the Keys, and the Values. If you still have some doubts, Figure 5 should be able to clear them up.
But wait, we have not added any trainable matrix yet. That's pretty simple. We know that if a 1 x k vector is multiplied by a k x k matrix, we get a 1 x k vector as output. Keeping this in mind, let's multiply each key from V1 to V9 (each of shape 1 x k) by a matrix Mk (the Key matrix) of shape k x k. Similarly, the query vector is multiplied by a matrix Mq (the Query matrix), and the value vectors are multiplied by a matrix Mv (the Values matrix). All the values in the matrices Mk, Mq, and Mv can now be trained by the neural network, and give much better context than plain self-attention. Again, for a better understanding, Figure 6 shows a pictorial representation of what I just explained.
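To make the shapes concrete, here is a small sketch of the projection step. The matrices Mq, Mk, and Mv are filled with random numbers that merely stand in for learned weights, the embeddings are made up, and the helper names `vec_mat` and `random_matrix` are hypothetical.

```python
import random

def vec_mat(vec, mat):
    """Multiply a 1 x k row vector by a k x k matrix, giving a 1 x k vector."""
    k = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(k)]

def random_matrix(k, rng):
    """A k x k matrix of small random values, standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(k)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
k = 4
Mq, Mk, Mv = (random_matrix(k, rng) for _ in range(3))

# Nine made-up word embeddings V1..V9, each of shape 1 x k.
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(9)]

query = vec_mat(V[0], Mq)             # V1 acts as the query
keys = [vec_mat(v, Mk) for v in V]    # every word acts as a key
values = [vec_mat(v, Mv) for v in V]  # every word acts as a value

print(len(query), len(keys), len(values))
```

In a real network the entries of Mq, Mk, and Mv would be updated by backpropagation; here they stay fixed, since the point is only that each projection keeps the 1 x k shape.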
Now that we know the intuition behind Keys, Query, and Values, let's look at a database analogy and the formal steps and formulas behind Attention.
Let’s try to understand the Attention mechanism by looking at an example of a database. In a database, if we want to retrieve some value vi based on a query q and a key ki, we can use the query to identify the key that corresponds to that value. Attention can be thought of as a similar process to this database lookup, but in a more probabilistic manner. This is demonstrated in the figure below.
Figure 7 shows the steps of data retrieval in a database. Suppose we send a query into the database. Some operation finds out which key in the database is most similar to the query. Once the key is located, the database sends out the value corresponding to that key as the output. In the figure, the operation finds that the Query is most similar to Key 5, and hence gives us Value 5 as the output.
The Attention mechanism is a neural architecture that mimics this process of retrieval.
- The attention mechanism measures the similarity between the query q and each key ki.
- This similarity returns a weight for each key.
- Finally, it produces an output that is the weighted combination of all the values in our database.
The only difference between database retrieval and attention, in a sense, is that in database retrieval we get only one value as output, whereas here we get a weighted combination of values. In the attention mechanism, if a query is most similar to, say, key 1 and key 4, then both these keys will get the largest weights, and the output will be mostly a combination of value 1 and value 4.
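The contrast between the two kinds of retrieval can be sketched as follows. The function names `hard_lookup` and `soft_lookup`, the dot product as the similarity, and the toy keys and values are all assumptions made for illustration.

```python
import math

def hard_lookup(query, keys, values):
    """Database-style retrieval: return the single value whose key matches best."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    best = scores.index(max(scores))  # ties go to the first best key
    return values[best]

def soft_lookup(query, keys, values):
    """Attention-style retrieval: return a weighted blend of all the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0], [0.0], [30.0]]
query = [5.0, 0.0]  # equally similar to the first and the third key

print(hard_lookup(query, keys, values))  # a single stored value
print(soft_lookup(query, keys, values))  # a blend dominated by the first and third values
```

The hard lookup has to pick one of the two equally good keys, while the soft lookup returns something close to the average of their two values.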
Figure 8 shows the steps required to get to the final attention value from the query, keys, and values. Each step is explained in detail below.
(The keys k are vectors, the similarity values S are scalars, the softmax weights a are scalars, and the Values V are vectors.)
Step 1.
Step 1 takes the keys and the query and computes the respective similarity measures. The similarity S is some function of the query q and the keys k, both of which are embedding vectors. It can be calculated using various methods, as shown in Figure 9.
Similarity can be a simple dot product of the query and the key. It can also be a scaled dot product, where the dot product of q and k is divided by the square root of the dimensionality d of each key. These are the two most commonly used techniques for finding the similarity.
Often, a query is projected into a new space using a weight matrix W, and then a dot product is taken with the key k. Kernel methods can also be used as a similarity measure.
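The similarity options from Step 1 can be written out as small functions. This is a sketch under assumptions: the function names are mine, the vectors are made up, and the Gaussian kernel is just one possible choice of kernel, since the text does not name a specific one.

```python
import math

def dot_similarity(q, k):
    """Plain dot product of query and key."""
    return sum(a * b for a, b in zip(q, k))

def scaled_dot_similarity(q, k):
    """Dot product divided by the square root of the key dimensionality d."""
    return dot_similarity(q, k) / math.sqrt(len(k))

def general_similarity(q, W, k):
    """Project the query with a weight matrix W, then take the dot product with k."""
    projected = [sum(q[i] * W[i][j] for i in range(len(q)))
                 for j in range(len(W[0]))]
    return dot_similarity(projected, k)

def gaussian_kernel_similarity(q, k, sigma=1.0):
    """One possible kernel (an assumption): large when q and k are close together."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, k))
    return math.exp(-sq_dist / (2 * sigma ** 2))

q = [1.0, 2.0, 3.0, 4.0]
k = [4.0, 3.0, 2.0, 1.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(dot_similarity(q, k))                 # 20.0
print(scaled_dot_similarity(q, k))          # 10.0, since sqrt(4) = 2
print(general_similarity(q, identity, k))   # 20.0, the identity leaves q unchanged
print(gaussian_kernel_similarity(q, k))
```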
Step 2.
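Step 2 is finding the weights a. This is done using softmax: each weight is the exponential of its similarity divided by the sum of the exponentials of all the similarities, a_i = exp(S_i) / (exp(S_1) + exp(S_2) + … + exp(S_n)), where exp is the exponential function.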
The similarities are connected to the weights like a fully connected layer: every weight depends on all of the similarities.
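A minimal sketch of Step 2 follows. The function name `softmax` and the example similarities are illustrative, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something the article prescribes.

```python
import math

def softmax(similarities):
    """Map similarities S_1..S_n to weights a_1..a_n that are positive and sum to 1."""
    m = max(similarities)  # shifting by the max does not change the result
    exps = [math.exp(s - m) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

S = [2.0, 1.0, 0.1]
a = softmax(S)
print(a, sum(a))

# Changing a single similarity changes every weight, because all the
# weights share the same denominator.
b = softmax([2.0, 1.0, 5.0])
print(b)
```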
Step 3.
Step 3 is a weighted combination of the softmax results (a) with the corresponding values (V). The 1st value of a is multiplied by the 1st value of V, this is summed with the product of the 2nd value of a and the 2nd value of V, and so on. The final output is the desired attention value.
Summary of the three steps:
With the help of the query q and the keys k, we obtain the attention value, which is a weighted sum/linear combination of the Values V, and the weights come from some sort of similarity between the query and the keys.
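Putting the three steps together for a single query gives roughly the following. The function name `attention` is illustrative, and the scaled dot product is assumed as the similarity; any of the other similarity measures could be swapped in.

```python
import math

def attention(query, keys, values):
    """Return (attention value, weights) for one query over a set of key/value pairs."""
    d = len(query)
    # Step 1: similarity between the query and every key (scaled dot product).
    S = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Step 2: softmax turns the similarities into weights that sum to 1.
    m = max(S)
    exps = [math.exp(s - m) for s in S]
    total = sum(exps)
    a = [e / total for e in exps]
    # Step 3: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(a, values)) for j in range(dim)]
    return output, a

# Two identical keys receive equal weight, so the output is the mean of the values.
out, weights = attention([1.0, 0.0],
                         [[1.0, 1.0], [1.0, 1.0]],
                         [[0.0, 2.0], [4.0, 6.0]])
print(weights)  # [0.5, 0.5]
print(out)      # [2.0, 4.0]
```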
2.3 Neural network representation of Attention
Figure 10 shows the neural network representation of an attention block. The word embeddings are first passed into some linear layers. These linear layers do not have a ‘bias’ term, and hence are nothing but matrix multiplications. One of these layers is denoted as ‘keys’, another as ‘queries’, and the last one as ‘values’. If a matrix multiplication is performed between the keys and the queries and the result is normalized, we get the weights. These weights are then multiplied by the values and summed up to get the final attention vector. This block can now be used in a neural network and is known as the Attention block. Multiple such attention blocks can be added to provide more context. And the best part is that gradients can be backpropagated to update the attention block (the weights of the keys, queries, and values layers).
2.4 Multi-Head Attention
To overcome some of the pitfalls of using a single attention, multi-head attention is used. Let's go back to the sentence “Bark is very cute and he is a dog”. Here, if we take the word ‘dog’, grammatically we understand that the words ‘Bark’, ‘cute’, and ‘he’ should have some significance or relevance to the word ‘dog’. These words say that the dog’s name is Bark, that it is a male dog, and that he is a cute dog. A single attention mechanism may not be able to correctly identify all three words as relevant to ‘dog’, and we can say that three attentions would do better at relating the three words to the word ‘dog’. This reduces the load on any one attention to find all the significant words, and also increases the chances of finding more relevant words easily.
So let's add more linear layers as the keys, queries, and values. These linear layers are trained in parallel and have weights independent of one another. So now, each of the values, keys, and queries gives us three outputs instead of one. The 3 sets of keys and queries give three different sets of weights. Each set of weights is then matrix-multiplied with its corresponding values to give three outputs. These three attention outputs are finally concatenated to give one final attention output. This representation is shown in Figure 11.
But 3 is just an arbitrary number we chose. In practice, there can be any number of such sets of linear layers, and these are called heads (h). That is, there can be h sets of linear layers giving h attention outputs, which are then concatenated together. And this is exactly why it is called multi-head attention (multiple heads). A simpler version of Figure 11, but with h heads, is shown in Figure 12.
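As a rough sketch of the idea, each head below has its own bias-free key, query, and value matrices, and the per-head outputs are concatenated. The random matrices stand in for trained weights, the embeddings are made up, the helper names are hypothetical, and the scaled dot product is assumed as the similarity. The final linear layer that the original Transformer paper applies after concatenation is left out.

```python
import math
import random

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, Mq, Mk, Mv):
    """One attention block: X is n x k, one row per word."""
    Q, K, V = matmul(X, Mq), matmul(X, Mk), matmul(X, Mv)
    d = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                           for key in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads):
    """Run every head on X and concatenate the outputs word by word."""
    per_head = [attention_head(X, *params) for params in heads]
    concatenated = []
    for i in range(len(X)):
        row = []
        for head_out in per_head:
            row.extend(head_out[i])
        concatenated.append(row)
    return concatenated

rng = random.Random(0)
n, k, h = 9, 4, 3  # 9 words, embedding size 4, 3 heads

def random_matrix():
    """A k x k matrix of random values, standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(k)]

X = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
heads = [(random_matrix(), random_matrix(), random_matrix()) for _ in range(h)]

out = multi_head_attention(X, heads)
print(len(out), len(out[0]))  # 9 words, each with h * k = 12 features
```

Because the heads do not share weights, each one is free to produce a different set of attention weights over the same sentence.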
Now that we understand the mechanism and idea behind Attention, Query, Keys, Values, and Multi-Head attention, we have covered all the important building blocks of a Transformer network. In the next story, I will talk about how all these blocks are stacked together to form the Transformer Network, and also talk about some networks based on Transformers such as BERT and GPT.
References:
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.