# Assignment 1: Neural Machine Translation
Welcome to the first assignment of Course 4. Here, you will build an English-to-Portuguese neural machine translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is an important task in natural language processing and is useful not only for translating from one language to another but also for word sense disambiguation (e.g. determining whether the word "bank" refers to a financial institution or to the land alongside a river). Implementing this with just a Recurrent Neural Network (RNN) with LSTMs can work for short to medium-length sentences but can result in vanishing gradients for very long sequences. To help with this, you will add an attention mechanism that allows the decoder to access all relevant parts of the input sentence regardless of its length. By completing this assignment, you will:
- Implement an encoder-decoder system with attention
- Build the NMT model from scratch using TensorFlow
- Generate translations using greedy and Minimum Bayes Risk (MBR) decoding
## Table of Contents
- [1 - Data Preparation](#1)
- [2 - NMT model with attention](#2)
- [Exercise 1 - Encoder](#ex1)
- [Exercise 2 - CrossAttention](#ex2)
- [Exercise 3 - Decoder](#ex3)
- [Exercise 4 - Translator](#ex4)
- [3 - Training](#3)
- [4 - Using the model for inference](#4)
- [Exercise 5 - translate](#ex5)
- [5 - Minimum Bayes-Risk Decoding](#5)
- [Exercise 6 - rouge1_similarity](#ex6)
- [Exercise 7 - average_overlap](#ex7)
```python
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Setting this env variable prevents TF warnings from showing up

import numpy as np
import tensorflow as tf
from collections import Counter
from utils import (sentences, train_data, val_data, english_vectorizer, portuguese_vectorizer,
                   masked_loss, masked_acc, tokens_to_text)
```

```python
import w1_unittest
```
<a name="1"></a>
## 1 - Data Preparation
The text pre-processing bits have already been taken care of (if you are interested in this be sure to check the `utils.py` file). The steps performed can be summarized as:
- Reading the raw data from the text files
- Cleaning the data (lowercasing, adding spaces around punctuation, trimming whitespace, etc.)
- Splitting it into training and validation sets
- Adding the start-of-sentence and end-of-sentence tokens to every sentence
- Tokenizing the sentences
- Creating a Tensorflow dataset out of the tokenized sentences
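The cleaning steps above can be sketched roughly as follows. This is a simplified illustration, not the exact code in `utils.py`; the `[SOS]`/`[EOS]` token strings and the `preprocess` name are assumptions made here:

```python
import re

def preprocess(sentence):
    """Lowercase, pad punctuation with spaces, trim whitespace,
    and add start/end-of-sentence tokens (mirrors the steps above)."""
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)  # add space around punctuation
    sentence = re.sub(r"\s+", " ", sentence).strip()   # collapse extra whitespace
    return f"[SOS] {sentence} [EOS]"

print(preprocess("I love languages!"))  # [SOS] i love languages ! [EOS]
```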
Take a moment to inspect the raw sentences:
```python
portuguese_sentences, english_sentences = sentences

print(f"English (to translate) sentence:\n\n{english_sentences[-5]}\n")
print(f"Portuguese (translation) sentence:\n\n{portuguese_sentences[-5]}")
```

```python
# sentences
# english_sentences[-2]
```
You don't have much use for the raw sentences so delete them to save memory:
```python
del portuguese_sentences
del english_sentences
del sentences
```
Notice that you imported an `english_vectorizer` and a `portuguese_vectorizer` from `utils.py`. These were created using [tf.keras.layers.TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) and they provide interesting features such as ways to visualize the vocabulary and convert text into tokenized ids and vice versa. In fact, you can inspect the first ten words of the vocabularies for both languages:
```python
print(f"First 10 words of the english vocabulary:\n\n{english_vectorizer.get_vocabulary()[:10]}\n")
print(f"First 10 words of the portuguese vocabulary:\n\n{portuguese_vectorizer.get_vocabulary()[:10]}")
```
Notice that the first 4 words are reserved for special words. In order, these are:
- the empty string
- a special token to represent an unknown word
- a special token to represent the start of a sentence
- a special token to represent the end of a sentence
You can see how many words are in a vocabulary by using the `vocabulary_size` method:
```python
# Size of the vocabulary
vocab_size_por = portuguese_vectorizer.vocabulary_size()
vocab_size_eng = english_vectorizer.vocabulary_size()

print(f"Portuguese vocabulary is made up of {vocab_size_por} words")
print(f"English vocabulary is made up of {vocab_size_eng} words")
```
You can define [tf.keras.layers.StringLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) objects that will help you map from words to ids and vice versa. Do this for the Portuguese vocabulary, since it will be useful later on when you decode the predictions from your model:
```python
# This helps you convert from words to ids
word_to_id = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(),
    mask_token="",
    oov_token="[UNK]"
)

# This helps you convert from ids to words
id_to_word = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(),
    mask_token="",
    oov_token="[UNK]",
    invert=True,
)
```
Try it out for the special tokens and a random word:
```python
unk_id = word_to_id("[UNK]")
sos_id = word_to_id("[SOS]")
eos_id = word_to_id("[EOS]")
baunilha_id = word_to_id("baunilha")

print(f"The id for the [UNK] token is {unk_id}")
print(f"The id for the [SOS] token is {sos_id}")
print(f"The id for the [EOS] token is {eos_id}")
print(f"The id for baunilha (vanilla) is {baunilha_id}")
```
Finally, take a look at what the data that will be fed to the neural network looks like. Both `train_data` and `val_data` are of type `tf.data.Dataset` and are already arranged in batches of 64 examples. To get the first batch out of a TF dataset you can use the `take` method. To get the first example out of the batch you can slice the tensor and use the `numpy` method for nicer printing:
```python
for (to_translate, sr_translation), translation in train_data.take(1):
    print(f"Tokenized english sentence:\n{to_translate[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence (shifted to the right):\n{sr_translation[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence:\n{translation[0, :].numpy()}\n\n")
```

```python
# train_data.take(1)
```
There are a couple of important details to notice.
- Padding has already been applied to the tensors and the value used for this is 0
- Each example consists of 3 different tensors:
- The sentence to translate
- The shifted-to-the-right translation
- The translation
The first two can be considered the features, while the third one is the target. By doing this, your model can perform teacher forcing, as you saw in the lectures.
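As a toy illustration of the shifted-to-the-right translation (the token ids below are made up: assume 2 is the start token and 3 the end token), it is just the target prepended with the start id and with the last token dropped, so at every step the decoder sees the ground-truth previous token:

```python
# Hypothetical token ids: 2 = start-of-sentence, 3 = end-of-sentence
translation = [7, 12, 5, 3]     # target sequence, ends with the end token
sr_translation = [2, 7, 12, 5]  # shifted right: starts with the start token

# The shifted input is just the start id followed by the target minus its last token
assert sr_translation == [2] + translation[:-1]

# With teacher forcing, at step t the decoder receives the ground-truth
# previous token and must predict the current one
for prev_token, current_token in zip(sr_translation, translation):
    print(f"decoder sees {prev_token} -> must predict {current_token}")
```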
Now it is time to begin coding!
<a name="2"></a>
## 2 - NMT model with attention
The model you will build uses an encoder-decoder architecture. This Recurrent Neural Network (RNN) takes in a tokenized version of a sentence in its encoder, then passes it on to the decoder for translation. As mentioned in the lectures, a regular sequence-to-sequence model with LSTMs works effectively for short to medium sentences but starts to degrade for longer ones. You can picture it like the figure below, where all of the context of the input sentence is compressed into a single vector that is passed to the decoder block. You can see how this becomes an issue for very long sentences (e.g. 100 tokens or more) because the context of the first parts of the input will have very little effect on the final vector passed to the decoder.
<img src='images/plain_rnn.png'>
Adding an attention layer to this model avoids this problem by giving the decoder access to all parts of the input sentence. To illustrate, let's use a 4-word input sentence as shown below. Remember that a hidden state is produced at each timestep of the encoder (represented by the orange rectangles). These are all passed to the attention layer and each is given a score based on the current activation (i.e. hidden state) of the decoder. For instance, consider the figure below, where the first prediction "como" has already been made. To produce the next prediction, the attention layer will first receive all the encoder hidden states (i.e. orange rectangles) as well as the decoder hidden state from producing the word "como" (i.e. the first green rectangle). Given this information, it scores each of the encoder hidden states to decide which one the decoder should focus on to produce the next word. As a result of training, the model might have learned that it should align to the second encoder hidden state, and it subsequently assigns a high probability to the word "você". If we are using greedy decoding, we output that word as the next symbol, then repeat the process to produce each following word until we reach an end-of-sentence prediction.
<img src='images/attention_overview.png'>
There are different ways to implement attention and the one we'll use for this assignment is the Scaled Dot Product Attention which has the form:
$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$
You will dive deeper into this equation in the next week but for now, you can think of it as computing scores using queries (Q) and keys (K), followed by a multiplication of values (V) to get a context vector at a particular timestep of the decoder. This context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word. The division by square root of the keys dimensionality ($\sqrt{d_k}$) is for improving model performance and you'll also learn more about it next week. For our machine translation application, the encoder activations (i.e. encoder hidden states) will be the keys and values, while the decoder activations (i.e. decoder hidden states) will be the queries.
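The equation above can be sketched in a few lines of NumPy. This is an illustrative stand-alone version for intuition only; in the model itself you will use `tf.keras.layers.MultiHeadAttention`:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Computes softmax(Q K^T / sqrt(d_k)) V for 2-D inputs of shape (timesteps, depth)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (Tq, Tk): one score per query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys axis
    return weights @ v                              # (Tq, d_v): one context vector per query

# Decoder hidden states act as queries; encoder hidden states act as keys and values
queries = np.random.rand(15, 256)        # 15 decoder timesteps
keys = values = np.random.rand(14, 256)  # 14 encoder timesteps

context = scaled_dot_product_attention(queries, keys, values)
print(context.shape)  # (15, 256): one context vector per decoder timestep
```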
You will see in the upcoming sections that this complex architecture and mechanism can be implemented with just a few lines of code.
First you will define two important global variables:
- The size of the vocabulary
- The number of units in the LSTM layers (the same number will be used for all LSTM layers)
In this assignment, the vocabulary sizes for English and Portuguese are the same, so a single constant `VOCAB_SIZE` is used throughout the notebook. In other settings the two vocabulary sizes could differ, but that is not the case here.
```python
VOCAB_SIZE = 12000
UNITS = 256
```
<a name="ex1"></a>
## Exercise 1 - Encoder
Your first exercise is to code the encoder part of the neural network. For this, complete the `Encoder` class below. Notice that in the constructor (the `__init__` method) you need to define all of the sublayers of the encoder and then use these sublayers during the forward pass (the `call` method).
The encoder consists of the following layers:
- [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding). For this layer you need to define the appropriate `input_dim` and `output_dim` and let it know that you are using '0' as padding, which can be done by using the appropriate value for the `mask_zero` parameter.
+ [Bidirectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). In TF you can implement bidirectional behaviour for RNN-like layers. This part is already taken care of but you will need to specify the appropriate type of layer as well as its parameters. In particular you need to set the appropriate number of units and make sure that the LSTM returns the full sequence and not only the last output, which can be done by using the appropriate value for the `return_sequences` parameter.
You need to define the forward pass using the syntax of TF's [functional API](https://www.tensorflow.org/guide/keras/functional_api). What this means is that you chain function calls together to define your network like this:
```python
encoder_input = keras.Input(shape=(28, 28, 1), name="original_img")
x = layers.Conv2D(16, 3, activation="relu")(encoder_input)
x = layers.MaxPooling2D(3)(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
encoder_output = layers.GlobalMaxPooling2D()(x)
```
```python
# GRADED CLASS: Encoder
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Encoder, self).__init__()

        ### START CODE HERE ###

        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )

        self.rnn = tf.keras.layers.Bidirectional(
            layer=tf.keras.layers.LSTM(
                units=units,
                return_sequences=True
            ),
            merge_mode="sum"
        )

        ### END CODE HERE ###

    def call(self, context):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): The sentence to translate

        Returns:
            tf.Tensor: Encoded sentence to translate
        """
        ### START CODE HERE ###

        # Pass the context through the embedding layer
        x = self.embedding(context)

        # Pass the output of the embedding through the RNN
        x = self.rnn(x)

        ### END CODE HERE ###

        return x
```

```python
# Do a quick check of your implementation

# Create an instance of your class
encoder = Encoder(VOCAB_SIZE, UNITS)

# Pass a batch of sentences to translate from english to portuguese
encoder_output = encoder(to_translate)

print(f'Tensor of sentences in english has shape: {to_translate.shape}\n')
print(f'Encoder output has shape: {encoder_output.shape}')
```
##### __Expected Output__
```
Tensor of sentences in english has shape: (64, 14)
Encoder output has shape: (64, 14, 256)
```
```python
# Test your code!
w1_unittest.test_encoder(Encoder)
```
<a name="ex2"></a>
## Exercise 2 - CrossAttention
Your next exercise is to code the layer that will perform cross attention between the original sentences and the translations. For this, complete the `CrossAttention` class below. Notice that in the constructor (the `__init__` method) you need to define all of the sublayers and then use these sublayers during the forward pass (the `call` method). For this particular case some of these bits are already taken care of.
The cross attention consists of the following layers:
- [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention). For this layer you need to define the appropriate `key_dim`, which is the size of the key and query tensors. You will also need to set the number of heads to 1, since you aren't implementing multi-head attention but attention between two tensors. The reason why this layer is preferred over [Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) is that it allows simpler code during the forward pass.
A couple of things to notice:
- You need a way to pass both the output of the attention alongside the shifted-to-the-right translation (since this cross attention happens in the decoder side). For this you will use an [Add](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Add) layer so that the original dimension is preserved, which would not happen if you use something like a [Concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Concatenate) layer.
+ Layer normalization is also performed for better stability of the network by using a [LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) layer.
- You don't need to worry about these last steps as these are already solved.
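A quick shape check on why the residual `Add` is used here: adding keeps the last dimension at `units`, while concatenating would double it. This is a plain-NumPy illustration with the batch shapes used later in this notebook:

```python
import numpy as np

# Illustrative shapes: (batch, target_timesteps, units)
target = np.zeros((64, 15, 256))
attn_output = np.zeros((64, 15, 256))

added = target + attn_output  # element-wise, like tf.keras.layers.Add
concatenated = np.concatenate([target, attn_output], axis=-1)

print(added.shape)         # (64, 15, 256): original dimension preserved
print(concatenated.shape)  # (64, 15, 512): dimension doubled
```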
```python
# GRADED CLASS: CrossAttention
class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        """Initializes an instance of this class

        Args:
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        self.mha = tf.keras.layers.MultiHeadAttention(
            key_dim=units,
            num_heads=1
        )

        ### END CODE HERE ###

        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, context, target):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The embedded shifted-to-the-right translation

        Returns:
            tf.Tensor: Cross attention between context and target
        """
        ### START CODE HERE ###

        # Call the MH attention by passing in the query and value
        # For this case the query should be the translation and the value the encoded sentence to translate
        # Hint: Check the call arguments of MultiHeadAttention in the docs
        attn_output = self.mha(
            query=target,
            value=context
        )

        ### END CODE HERE ###

        x = self.add([target, attn_output])
        x = self.layernorm(x)

        return x
```

```python
# Do a quick check of your implementation

# Create an instance of your class
attention_layer = CrossAttention(UNITS)

# The attention layer expects the embedded sr-translation and the context
# The context (encoder_output) is already embedded so you need to do this for sr_translation:
sr_translation_embed = tf.keras.layers.Embedding(VOCAB_SIZE, output_dim=UNITS, mask_zero=True)(sr_translation)

# Compute the cross attention
attention_result = attention_layer(encoder_output, sr_translation_embed)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of translations has shape: {sr_translation_embed.shape}')
print(f'Tensor of attention scores has shape: {attention_result.shape}')
```
##### __Expected Output__
```
Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)
```
```python
# Test your code!
w1_unittest.test_cross_attention(CrossAttention)
```
<a name="ex3"></a>
## Exercise 3 - Decoder
Now you will implement the decoder part of the neural network by completing the `Decoder` class below. Notice that in the constructor (the `__init__` method) you need to define all of the sublayers of the decoder and then use these sublayers during the forward pass (the `call` method).
The decoder consists of the following layers:
- [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding). For this layer you need to define the appropriate `input_dim` and `output_dim` and let it know that you are using '0' as padding, which can be done by using the appropriate value for the `mask_zero` parameter.
+ Pre-attention [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). Unlike in the encoder, where you used a Bidirectional LSTM, here you will use a vanilla LSTM. Don't forget to set the appropriate number of units and make sure that the LSTM returns the full sequence and not only the last output, which can be done by using the appropriate value for the `return_sequences` parameter. It is very important that this layer returns its state, since you will need it for inference, so make sure to set the `return_state` parameter accordingly. Notice that LSTM layers return state as a tuple of two tensors called `memory_state` and `carry_state`, **however these names have been changed to `hidden_state` and `cell_state` respectively, to better reflect what you have seen in the lectures**.
- The attention layer that performs cross attention between the sentence to translate and the right-shifted translation. Here you need to use the `CrossAttention` layer you defined in the previous exercise.
+ Post-attention [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). Another LSTM layer. For this one you don't need it to return the state.
- Finally a [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer. This one should have the same number of units as the size of the vocabulary since you expect it to compute the logits for every possible word in the vocabulary. Make sure to use a `logsoftmax` activation function for this one, which you can get as [tf.nn.log_softmax](https://www.tensorflow.org/api_docs/python/tf/nn/log_softmax).
```python
# GRADED CLASS: Decoder
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Decoder, self).__init__()

        ### START CODE HERE ###

        # The embedding layer
        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )

        # The RNN before attention
        self.pre_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True,
            return_state=True
        )

        # The attention layer
        self.attention = CrossAttention(units)

        # The RNN after attention
        self.post_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True
        )

        # The dense layer with logsoftmax activation
        self.output_layer = tf.keras.layers.Dense(
            units=vocab_size,
            activation=tf.nn.log_softmax
        )

        ### END CODE HERE ###

    def call(self, context, target, state=None, return_state=False):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The shifted-to-the-right translation
            state (list[tf.Tensor, tf.Tensor], optional): Hidden state of the pre-attention LSTM. Defaults to None.
            return_state (bool, optional): If set to true return the hidden states of the LSTM. Defaults to False.

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """
        ### START CODE HERE ###

        # Get the embedding of the input
        x = self.embedding(target)

        # Pass the embedded input into the pre attention LSTM
        # Hints:
        # - The LSTM you defined earlier should return the output alongside the state (made up of two tensors)
        # - Pass in the state to the LSTM (needed for inference)
        x, hidden_state, cell_state = self.pre_attention_rnn(x, initial_state=state)

        # Perform cross attention between the context and the output of the LSTM (in that order)
        x = self.attention(context, x)

        # Do a pass through the post attention LSTM
        x = self.post_attention_rnn(x)

        # Compute the logits
        logits = self.output_layer(x)

        ### END CODE HERE ###

        if return_state:
            return logits, [hidden_state, cell_state]

        return logits
```

```python
# Do a quick check of your implementation

# Create an instance of your class
decoder = Decoder(VOCAB_SIZE, UNITS)

# Notice that you don't need the embedded version of sr_translation since this is done inside the class
logits = decoder(encoder_output, sr_translation)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')
```
##### __Expected Output__
```
Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
```
```python
# Test your code!
w1_unittest.test_decoder(Decoder, CrossAttention)
```
<a name="ex4"></a>
## Exercise 4 - Translator
Now you have to put together all of the layers you previously coded into an actual model. For this, complete the `Translator` class below. Notice that, unlike the `Encoder` and `Decoder` classes, which inherit from `tf.keras.layers.Layer`, the `Translator` class inherits from `tf.keras.Model`.
Remember that `train_data` will yield a tuple with the sentence to translate and the shifted-to-the-right translation, which are the "features" of the model. This means that the inputs of your network will be tuples containing context and targets.
```python
# GRADED CLASS: Translator
class Translator(tf.keras.Model):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        # Define the encoder with the appropriate vocab_size and number of units
        self.encoder = Encoder(vocab_size, units)

        # Define the decoder with the appropriate vocab_size and number of units
        self.decoder = Decoder(vocab_size, units)

        ### END CODE HERE ###

    def call(self, inputs):
        """Forward pass of this layer

        Args:
            inputs (tuple(tf.Tensor, tf.Tensor)): Tuple containing the context (sentence to translate) and the target (shifted-to-the-right translation)

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """
        ### START CODE HERE ###

        # In this case inputs is a tuple consisting of the context and the target, unpack it into single variables
        context, target = inputs

        # Pass the context through the encoder
        encoded_context = self.encoder(context)

        # Compute the logits by passing the encoded context and the target to the decoder
        logits = self.decoder(encoded_context, target)

        ### END CODE HERE ###

        return logits
```

```python
# Do a quick check of your implementation

# Create an instance of your class
translator = Translator(VOCAB_SIZE, UNITS)

# Compute the logits for every word in the vocabulary
logits = translator((to_translate, sr_translation))

print(f'Tensor of sentences to translate has shape: {to_translate.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')
```
##### __Expected Output__
```
Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
```
```python
w1_unittest.test_translator(Translator, Encoder, Decoder)
```
<a name="3"></a>
## 3 - Training
Now that you have an untrained instance of the NMT model, it is time to train it. You can use the `compile_and_train` function below to achieve this:
```python
def compile_and_train(model, epochs=20, steps_per_epoch=500):
    model.compile(optimizer="adam", loss=masked_loss, metrics=[masked_acc, masked_loss])

    history = model.fit(
        train_data.repeat(),
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        validation_data=val_data,
        validation_steps=50,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    )

    return model, history
```

```python
# Train the translator (this takes some minutes so feel free to take a break)
trained_translator, history = compile_and_train(translator)
```
<a name="4"></a>
## 4 - Using the model for inference
Now that your model is trained, you can use it for inference. To help you with this, the `generate_next_token` function is provided. Notice that this function is meant to be used inside a for-loop, so you feed it the information from the previous step to generate the information of the next step. In particular, you need to keep track of the state of the pre-attention LSTM in the decoder and of whether the translation is done. Also notice that a `temperature` parameter is introduced, which determines how the next token is selected given the predicted logits:
```python
def generate_next_token(decoder, context, next_token, done, state, temperature=0.0):
    """Generates the next token in the sequence

    Args:
        decoder (Decoder): The decoder
        context (tf.Tensor): Encoded sentence to translate
        next_token (tf.Tensor): The predicted next token
        done (bool): True if the translation is complete
        state (list[tf.Tensor, tf.Tensor]): Hidden states of the pre-attention LSTM layer
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(tf.Tensor, np.float, list[tf.Tensor, tf.Tensor], bool): The next token, log prob of said token, hidden state of LSTM and if translation is done
    """
    # Get the logits and state from the decoder
    logits, state = decoder(context, next_token, state=state, return_state=True)

    # Trim the intermediate dimension
    logits = logits[:, -1, :]

    # If temp is 0 then next_token is the argmax of logits
    if temperature == 0.0:
        next_token = tf.argmax(logits, axis=-1)

    # If temp is not 0 then next_token is sampled out of logits
    else:
        logits = logits / temperature
        next_token = tf.random.categorical(logits, num_samples=1)

    # Trim dimensions of size 1
    logits = tf.squeeze(logits)
    next_token = tf.squeeze(next_token)

    # Get the logit of the selected next_token
    logit = logits[next_token].numpy()

    # Reshape to (1,1) since this is the expected shape for text encoded as TF tensors
    next_token = tf.reshape(next_token, shape=(1, 1))

    # If next_token is End-of-Sentence token you are done
    if next_token == eos_id:
        done = True

    return next_token, logit, state, done
```
See how it works by running the following cell:
```python
# PROCESS SENTENCE TO TRANSLATE AND ENCODE

# A sentence you wish to translate
eng_sentence = "I love languages"

# Convert it to a tensor
texts = tf.convert_to_tensor(eng_sentence)[tf.newaxis]

# Vectorize it and pass it through the encoder
context = english_vectorizer(texts).to_tensor()
context = encoder(context)

# SET STATE OF THE DECODER

# Next token is Start-of-Sentence since you are starting fresh
next_token = tf.fill((1, 1), sos_id)

# Hidden and Cell states of the LSTM can be mocked using uniform samples
state = [tf.random.uniform((1, UNITS)), tf.random.uniform((1, UNITS))]

# You are not done until next token is EOS token
done = False

# Generate next token
next_token, logit, state, done = generate_next_token(decoder, context, next_token, done, state, temperature=0.5)
print(f"Next token: {next_token}\nLogit: {logit:.4f}\nDone? {done}")
```
<a name="ex5"></a>
## Exercise 5 - translate
Now you can put everything together to translate a given sentence. For this, complete the `translate` function below. This function will take care of the following steps:
- Process the sentence to translate and encode it
- Set the initial state of the decoder
- Get predictions of the next token (starting with the \<SOS> token) for a maximum number of iterations (in case the \<EOS> token is never returned)
- Return the translated text (as a string), the logit of the last iteration (this helps measure how confident the model was that the sequence was translated in its entirety) and the translation in token format

Hints:
- The previous cell provides a lot of insight into how this function should work, so if you get stuck refer to it.
- Some useful docs:
  - [tf.newaxis](https://www.tensorflow.org/api_docs/python/tf#newaxis)
  - [tf.fill](https://www.tensorflow.org/api_docs/python/tf/fill)
  - [tf.zeros](https://www.tensorflow.org/api_docs/python/tf/zeros)
xxxxxxxxxx
# GRADED FUNCTION: translate
def translate(model, text, max_length=50, temperature=0.0):
    """Translate a given sentence from English to Portuguese

    Args:
        model (tf.keras.Model): The trained translator
        text (string): The sentence to translate
        max_length (int, optional): The maximum length of the translation. Defaults to 50.
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(str, np.float, tf.Tensor): The translation, logit that predicted <EOS> token and the tokenized translation
    """
    # Lists to save tokens and logits
    tokens, logits = [], []

    ### START CODE HERE ###

    # PROCESS THE SENTENCE TO TRANSLATE

    # Convert the original string into a tensor
    texts = tf.convert_to_tensor(text)[tf.newaxis]

    # Vectorize the text using the correct vectorizer
    context = english_vectorizer(texts).to_tensor()

    # Get the encoded context (pass the context through the encoder)
    # Hint: Remember you can get the encoder by using model.encoder
    context = model.encoder(context)

    # INITIAL STATE OF THE DECODER

    # First token should be SOS token with shape (1,1)
    next_token = tf.fill((1, 1), sos_id)

    # Initial hidden and cell states should be tensors of zeros with shape (1, UNITS)
    state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]

    # You are done when you draw an EOS token as next token (initial state is False)
    done = False

    # Iterate for max_length iterations
    for _ in range(max_length):
        # Generate the next token
        try:
            next_token, logit, state, done = generate_next_token(
                model.decoder, context, next_token, done, state, temperature)
        except:
            raise Exception("Problem generating the next token")

        # If done then break out of the loop
        if done:
            break

        # Add next_token to the list of tokens
        tokens.append(next_token)

        # Add logit to the list of logits
        logits.append(logit)

    ### END CODE HERE ###

    # Concatenate all tokens into a tensor
    tokens = tf.concat(tokens, axis=-1)

    # Convert the translated tokens into text
    translation = tf.squeeze(tokens_to_text(tokens, id_to_word))
    translation = translation.numpy().decode()

    return translation, logits[-1], tokens
xxxxxxxxxx
Try your function with a temperature of 0, which yields a deterministic output and is equivalent to greedy decoding:
xxxxxxxxxx
# Running this cell multiple times should return the same output since temp is 0
temp = 0.0
original_sentence = "I love languages"
translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)
print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")
xxxxxxxxxx
Try your function with a temperature of 0.7, which yields a stochastic output:
xxxxxxxxxx
# Running this cell multiple times should return different outputs since temp is not 0
# You can try different temperatures
temp = 0.7
original_sentence = "I love languages"
translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)
print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")
xxxxxxxxxx
w1_unittest.test_translate(translate, trained_translator)
xxxxxxxxxx
<a name="5"></a>
## 5 - Minimum Bayes-Risk Decoding
As mentioned in the lectures, picking the most probable token at each step may not produce the best overall translation. Another approach is Minimum Bayes Risk (MBR) decoding. The general steps to implement this are:
- Take several random samples
- Score each sample against all other samples
- Select the one with the highest score

You will be building helper functions for these steps in the following sections.
Since setting different temperature values lets you generate different translations, you can do what you saw in the lectures: generate a batch of candidate translations and then determine which one is the best. You will do this using the provided `generate_samples` function, which returns any desired number of candidate translations along with the log-probability of each:
xxxxxxxxxx
def generate_samples(model, text, n_samples=4, temperature=0.6):
    samples, log_probs = [], []

    # Iterate for n_samples iterations
    for _ in range(n_samples):
        # Save the logit and the translated tensor
        _, logp, sample = translate(model, text, temperature=temperature)

        # Save the translated tensors
        samples.append(np.squeeze(sample.numpy()).tolist())

        # Save the logits
        log_probs.append(logp)

    return samples, log_probs
xxxxxxxxxx
samples, log_probs = generate_samples(trained_translator, 'I love languages')

for s, l in zip(samples, log_probs):
    print(f"Translated tensor: {s} has logit: {l:.3f}")
xxxxxxxxxx
## Comparing overlaps
Now that you can generate multiple translations it is time to come up with a method to measure the goodness of each one. As you saw in the lectures, one way to achieve this is by comparing each sample against the others.
There are several metrics you can use for this purpose, as shown in the lectures, and you can experiment with any of them. For this assignment, you will be calculating scores for **unigram overlaps**.
One of these metrics is the widely used yet simple [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), which computes the intersection over union of two sets. The `jaccard_similarity` function returns this metric for any pair of candidate and reference translations:
xxxxxxxxxx
def jaccard_similarity(candidate, reference):
    # Convert the lists to sets to get the unique tokens
    candidate_set = set(candidate)
    reference_set = set(reference)

    # Get the set of tokens common to both candidate and reference
    common_tokens = candidate_set.intersection(reference_set)

    # Get the set of all tokens found in either candidate or reference
    all_tokens = candidate_set.union(reference_set)

    # Compute the percentage of overlap (divide the number of common tokens by the number of all tokens)
    overlap = len(common_tokens) / len(all_tokens)

    return overlap
xxxxxxxxxx
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]
js = jaccard_similarity(l1, l2)
print(f"jaccard similarity between lists: {l1} and {l2} is {js:.3f}")
xxxxxxxxxx
##### __Expected Output__
```
jaccard similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.750
```
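One caveat worth noting: because Jaccard similarity converts the token lists to sets, repeated tokens are collapsed, so it cannot penalize a candidate that repeats itself. A tiny illustration with made-up token ids:

```python
candidate = [1, 1, 1, 2]  # hypothetical candidate that repeats token 1
reference = [1, 2]

# set() collapses the repeats, so both sides look identical to Jaccard
print(set(candidate) == set(reference))  # True
common = set(candidate) & set(reference)
union = set(candidate) | set(reference)
print(len(common) / len(union))  # 1.0 (a perfect score despite the repetition)
```

ROUGE-1, which you implement next, counts tokens with `Counter` instead of sets, so repetitions do affect the score.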
xxxxxxxxxx
<a name="ex6"></a>
## Exercise 6 - rouge1_similarity
Jaccard similarity is good, but a more commonly used metric in machine translation is the ROUGE score. For unigrams, this is called ROUGE-1, and as shown in the lectures, you can output the scores for both precision and recall when comparing two samples. To get the final score, you will want to compute the F1-score as given by:
$$score = 2* \frac{(precision * recall)}{(precision + recall)}$$
For the implementation of the `rouge1_similarity` function you want to use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) class from the Python standard library:
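Before implementing the function, you can sanity-check the F1 formula by hand on two toy token lists. All three tokens of the candidate appear in the reference, so precision is 1.0 while recall is 3/4:

```python
candidate = [1, 2, 3]
reference = [1, 2, 3, 4]

overlap = 3  # clipped count of tokens shared by both lists (1, 2 and 3)
precision = overlap / len(candidate)  # 3/3 = 1.0
recall = overlap / len(reference)     # 3/4 = 0.75

f1_score = 2 * (precision * recall) / (precision + recall)
print(round(f1_score, 3))  # 0.857
```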
xxxxxxxxxx
# GRADED FUNCTION: rouge1_similarity
def rouge1_similarity(candidate, reference):
    """Computes the ROUGE 1 score between two token lists

    Args:
        candidate (list[int]): Tokenized candidate translation
        reference (list[int]): Tokenized reference translation

    Returns:
        float: Overlap between the two token lists
    """
    ### START CODE HERE ###

    # Make a frequency table of the candidate and reference tokens
    # Hint: use the Counter class (already imported)
    candidate_word_counts = Counter(candidate)
    reference_word_counts = Counter(reference)

    # Initialize overlap at 0
    overlap = 0

    # Iterate over the tokens in the candidate frequency table
    # Hint: Counter is a subclass of dict and you can get the keys
    # out of a dict using the keys method like this: dict.keys()
    for token in candidate_word_counts.keys():
        # Get the count of the current token in the candidate frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_candidate = candidate_word_counts[token]

        # Get the count of the current token in the reference frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_reference = reference_word_counts[token]

        # Update the overlap by getting the minimum between the two token counts above
        overlap += min(token_count_candidate, token_count_reference)

    # Compute the precision
    # Hint: precision = overlap / (number of tokens in candidate list)
    precision = overlap / len(candidate)

    # Compute the recall
    # Hint: recall = overlap / (number of tokens in reference list)
    recall = overlap / len(reference)

    if precision + recall != 0:
        # Compute the Rouge1 Score
        # Hint: This is equivalent to the F1 score
        f1_score = 2 * (precision * recall) / (precision + recall)
        return f1_score

    ### END CODE HERE ###

    return 0  # If precision + recall = 0 then return 0
xxxxxxxxxx
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]
r1s = rouge1_similarity(l1, l2)
print(f"rouge 1 similarity between lists: {l1} and {l2} is {r1s:.3f}")
xxxxxxxxxx
##### __Expected Output__
```
rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857
```
xxxxxxxxxx
w1_unittest.test_rouge1_similarity(rouge1_similarity)
xxxxxxxxxx
## Computing the Overall Score
You will now build a function to generate the overall score for a particular sample. As mentioned in the lectures, you need to compare each sample with all other samples. For instance, if we generated 30 sentences, we will need to compare sentence 1 to sentences 2 through 30. Then, we compare sentence 2 to sentences 1 and 3 through 30, and so forth. At each step, we get the average score of all comparisons to get the overall score for a particular sample. To illustrate, these will be the steps to generate the scores of a 4-sample list.
- Get similarity score between sample 1 and sample 2
- Get similarity score between sample 1 and sample 3
- Get similarity score between sample 1 and sample 4
- Get average score of the first 3 steps. This will be the overall score of sample 1
- Iterate and repeat until samples 1 to 4 have overall scores.
The results will be stored in a dictionary for easy lookups.
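As a sanity check of the procedure above, the overall score of the first of three toy samples `[1, 2, 3]`, `[1, 2, 4]`, `[1, 2, 4, 5]` can be worked out by hand with Jaccard similarity (intersection over union of token sets):

```python
# Jaccard(s1, s2) = |{1, 2}| / |{1, 2, 3, 4}|    = 2/4 = 0.5
# Jaccard(s1, s3) = |{1, 2}| / |{1, 2, 3, 4, 5}| = 2/5 = 0.4
score_sample_1 = (0.5 + 0.4) / 2  # average over the two comparisons
print(round(score_sample_1, 3))   # 0.45
```

Your implementation should reproduce this value for the first sample.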
<a name="ex7"></a>
## Exercise 7 - average_overlap
Complete the `average_overlap` function below which should implement the process described above:
xxxxxxxxxx
# GRADED FUNCTION: average_overlap
def average_overlap(samples, similarity_fn):
    """Computes the arithmetic mean of each candidate sentence in the samples

    Args:
        samples (list[list[int]]): Tokenized version of translated sentences
        similarity_fn (Function): Similarity function used to compute the overlap

    Returns:
        dict[int, float]: A dictionary mapping the index of each translation to its score
    """
    # Initialize dictionary
    scores = {}

    # Iterate through all samples (enumerate helps keep track of indexes)
    for index_candidate, candidate in enumerate(samples):
        ### START CODE HERE ###

        # Initially overlap is zero
        overlap = 0

        # Iterate through all samples (enumerate helps keep track of indexes)
        for index_sample, sample in enumerate(samples):
            # Skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue

            # Get the overlap between candidate and sample using the similarity function
            sample_overlap = similarity_fn(candidate, sample)

            # Add the sample overlap to the total overlap
            overlap += sample_overlap

        ### END CODE HERE ###

        # Get the score for the candidate by computing the average
        score = overlap / (len(samples) - 1)

        # Only use 3 decimal points
        score = round(score, 3)

        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score

    return scores
xxxxxxxxxx
# Test with Jaccard similarity
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
avg_ovlp = average_overlap([l1, l2, l3], jaccard_similarity)
print(f"average overlap between lists: {l1}, {l2} and {l3} using Jaccard similarity is:\n\n{avg_ovlp}")
xxxxxxxxxx
##### __Expected Output__
```
average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:
{0: 0.45, 1: 0.625, 2: 0.575}
```
xxxxxxxxxx
# Test with Rouge1 similarity
l1 = [1, 2, 3]
l2 = [1, 4]
l3 = [1, 2, 4, 5]
l4 = [5,6]
avg_ovlp = average_overlap([l1, l2, l3, l4], rouge1_similarity)
print(f"average overlap between lists: {l1}, {l2}, {l3} and {l4} using Rouge1 similarity is:\n\n{avg_ovlp}")
xxxxxxxxxx
##### __Expected Output__
```
average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:
{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}
```
xxxxxxxxxx
w1_unittest.test_average_overlap(average_overlap)
xxxxxxxxxx
In practice, it is also common to see the weighted mean used to calculate the overall score instead of just the arithmetic mean. This is implemented in the `weighted_avg_overlap` function below, and you can use it in your experiments to see which one gives better results:
xxxxxxxxxx
def weighted_avg_overlap(samples, log_probs, similarity_fn):
    # Scores dictionary
    scores = {}

    # Iterate over the samples
    for index_candidate, candidate in enumerate(samples):
        # Initialize overlap and weighted sum
        overlap, weight_sum = 0.0, 0.0

        # Iterate over all samples and log probabilities
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):
            # Skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue

            # Convert log probability to linear scale
            sample_p = float(np.exp(logp))

            # Update the weighted sum
            weight_sum += sample_p

            # Get the unigram overlap between candidate and sample
            sample_overlap = similarity_fn(candidate, sample)

            # Update the overlap
            overlap += sample_p * sample_overlap

        # Compute the score for the candidate
        score = overlap / weight_sum

        # Only use 3 decimal points
        score = round(score, 3)

        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score

    return scores
xxxxxxxxxx
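To see what the weighting does, the score of the first sample can be worked out by hand for three hypothetical samples: its similarities to the other two are 0.5 and 0.4, and the log-probabilities are 0.4, 0.2 and 0.5 (the candidate's own log-probability is skipped, matching the loop above):

```python
import numpy as np

sim_01, sim_02 = 0.5, 0.4          # hypothetical similarities of sample 0 vs samples 1 and 2
w1, w2 = np.exp(0.2), np.exp(0.5)  # weights: exp of the *other* samples' log-probabilities
score_0 = (w1 * sim_01 + w2 * sim_02) / (w1 + w2)
print(round(score_0, 3))  # 0.443
```

Compared with the plain arithmetic mean of 0.45, the more probable sample 2 pulls the weighted score toward its similarity of 0.4.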
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
log_probs = [0.4, 0.2, 0.5]
w_avg_ovlp = weighted_avg_overlap([l1, l2, l3], log_probs, jaccard_similarity)
print(f"weighted average overlap using Jaccard similarity is:\n\n{w_avg_ovlp}")
xxxxxxxxxx
## mbr_decode
You will now put everything together in the `mbr_decode` function below. This final step is not graded, as this function is just a wrapper around all the cool stuff you have coded so far!
You can use it to play around, trying different numbers of samples, temperatures and similarity functions!
xxxxxxxxxx
def mbr_decode(model, text, n_samples=5, temperature=0.6, similarity_fn=jaccard_similarity):
    # Generate samples
    samples, log_probs = generate_samples(model, text, n_samples=n_samples, temperature=temperature)

    # Compute the overlap scores
    scores = weighted_avg_overlap(samples, log_probs, similarity_fn)

    # Decode samples
    decoded_translations = [tokens_to_text(s, id_to_word).numpy().decode('utf-8') for s in samples]

    # Find the key with the highest score
    max_score_key = max(scores, key=lambda k: scores[k])

    # Get the translation
    translation = decoded_translations[max_score_key]

    return translation, decoded_translations
xxxxxxxxxx
english_sentence = "I love languages"

translation, candidates = mbr_decode(trained_translator, english_sentence, n_samples=10, temperature=0.6)

print("Translation candidates:")
for c in candidates:
    print(c)

print(f"\nSelected translation: {translation}")
xxxxxxxxxx
**Congratulations!** Next week, you'll dive deeper into attention models and study the Transformer architecture. You will build another network but without the recurrent part. It will show that attention is all you need! It should be fun!
**Keep up the good work!**
xxxxxxxxxx