CS224N Assignment 1: Exploring Word Vectors (25 Points)
Due 4:30pm, Tue April 9th 2024
Welcome to CS224N!
Before you start, make sure you read the README.md in the same directory as this notebook for important setup information. You need to install some Python libraries before you can successfully do this assignment. A lot of code is provided in this notebook, and we highly encourage you to read and understand it as part of the learning :)
If you aren't super familiar with Python, Numpy, or Matplotlib, we recommend you check out the review session on Friday. The session will be recorded and the material will be made available on our website. The CS231N Python/Numpy tutorial is also a great resource.
Assignment Notes: Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.
# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------
import sys
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 8
from platform import python_version
assert int(python_version().split(".")[1]) >= 8, "Please upgrade your Python version following the instructions in \
the README.md file found in the same directory as this notebook. Your Python version is " + python_version()
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
from datasets import load_dataset
imdb_dataset = load_dataset("stanfordnlp/imdb")
import re
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
START_TOKEN = '<START>'
END_TOKEN = '<END>'
NUM_SAMPLES = 150
np.random.seed(0)
random.seed(0)
# ----------------
Word Vectors
Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via GloVe.
Note on Terminology: The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".
Part 1: Count-Based Word Vectors (10 points)
Most word vector models start from the following idea:
You shall know a word by the company it keeps (Firth, J. R. 1957:11)
Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here or here).
Co-Occurrence
A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.
Example: Co-Occurrence with Fixed Window of $n=1$:
Document 1: "all that glitters is not gold"
Document 2: "all is well that ends well"
| * | `<START>` | all | that | glitters | is | not | gold | well | ends | `<END>` |
|---|---|---|---|---|---|---|---|---|---|---|
| `<START>` | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| `<END>` | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
In NLP, we commonly use `<START>` and `<END>` tokens to mark the beginning and end of sentences, paragraphs, or documents. These tokens are included in co-occurrence counts, encapsulating each document, for example: "`<START>` All that glitters is not gold `<END>`".
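To make the counting concrete, here is a minimal sketch (ours, not part of the starter code; the helper name toy_cooccurrence is made up for illustration) that tallies window co-occurrences for the two toy documents above with a window size of 1:

from collections import defaultdict

def toy_cooccurrence(docs, window_size=1):
    # counts[(center, context)] = number of times `context` appears within
    # `window_size` words of `center`, across all documents
    counts = defaultdict(int)
    for doc in docs:
        for i, center in enumerate(doc):
            lo, hi = max(0, i - window_size), min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, doc[j])] += 1
    return counts

docs = ["<START> all that glitters is not gold <END>".split(),
        "<START> all is well that ends well <END>".split()]
counts = toy_cooccurrence(docs)
print(counts[("all", "<START>")])  # 2, matching the "all" row of the table above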
The matrix rows (or columns) provide word vectors based on word-word co-occurrence, but they can be large. To reduce dimensionality, we employ Singular Value Decomposition (SVD), akin to PCA, selecting the top $k$ principal components. The SVD process decomposes the co-occurrence matrix $A$ into $A = U S V^\top$, with the singular values on the diagonal of $S$ and new, shorter $k$-dimensional word vectors in $U_k$.
This dimensionality reduction maintains semantic relationships; for instance, doctor and hospital will be closer than doctor and dog.
For those unfamiliar with eigenvalues and SVD, a beginner-friendly introduction to SVD is available here. Additional resources for in-depth understanding include lectures 7, 8, and 9 of CS168, providing high-level treatment of these algorithms. For practical implementation, utilizing pre-programmed functions from Python packages like numpy, scipy, or sklearn is recommended. While applying full SVD to large corpora can be memory-intensive, scalable techniques such as Truncated SVD exist for extracting the top $k$ vector components efficiently.
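As a quick illustration of what the reduction looks like in code, the sketch below (ours, not part of the assignment) reduces a small random count matrix to 2 dimensions with sklearn's TruncatedSVD; the matrix A here is a made-up stand-in for a co-occurrence matrix:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(10, 10)).astype(float)  # stand-in co-occurrence counts

svd = TruncatedSVD(n_components=2, n_iter=10, random_state=0)
A_reduced = svd.fit_transform(A)  # returns U * S: one 2-d vector per row/word
print(A_reduced.shape)  # (10, 2)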
Plotting Co-Occurrence Word Embeddings
Here, we will be using the Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. We provide a `read_corpus` function below that pulls out the text of a movie review from the dataset. The function also adds `<START>` and `<END>` tokens to each of the documents, and lowercases words. You do not have to perform any other kind of pre-processing.
def read_corpus():
    """ Read files from the Large Movie Review Dataset.
        Return:
            list of lists, with words from each of the processed files
    """
    files = imdb_dataset["train"]["text"][:NUM_SAMPLES]
    return [[START_TOKEN] + [re.sub(r'[^\w]', '', w.lower()) for w in f.split(" ")] + [END_TOKEN] for f in files]
Let's have a look at what these documents are like…
imdb_corpus = read_corpus()
pprint.pprint(imdb_corpus[:3], compact=True, width=100)
print("corpus size: ", len(imdb_corpus[0]))
Question 1.1: Implement `distinct_words` [code] (2 points)
Write a method to work out the distinct words (word types) that occur in the corpus.
- You can use `for` loops to process the input `corpus` (a list of list of strings), but try using Python list comprehensions (which are generally faster). In particular, a comprehension may be useful to flatten a list of lists (see the short sketch after these hints). If you're not familiar with Python list comprehensions in general, here's more information.
- Your returned `corpus_words` should be sorted. You can use Python's `sorted` function for this.
- You may find it useful to use Python sets to remove duplicate words.
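For instance, the flattening idiom looks like this on a toy input (ours, for illustration; it is not the full solution):

toy_corpus = [["a", "b", "b"], ["b", "c"]]
flattened = [w for doc in toy_corpus for w in doc]  # ['a', 'b', 'b', 'b', 'c']
unique_sorted = sorted(set(flattened))              # ['a', 'b', 'c']
print(unique_sorted)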
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1

    # ------------------
    # Write your implementation here.
    # ------------------

    return corpus_words, n_corpus_words
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------
# Define toy corpus
test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "), "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)
# Correct answers
ans_test_corpus_words = sorted([START_TOKEN, "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", END_TOKEN])
ans_num_corpus_words = len(ans_test_corpus_words)
# Test correct number of words
assert(num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)
# Test correct words
assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours: {}".format(str(ans_test_corpus_words), str(test_corpus_words))
# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
Question 1.2: Implement `compute_co_occurrence_matrix` [code] (3 points)
Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n Python NumPy tutorial.
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.

              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    # ------------------

    return M, word2ind
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------
# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "), "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
# Correct M and word2ind
M_test_ans = np.array(
[[0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
[0., 0., 1., 1., 0., 0., 0., 0., 0., 0.,],
[0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
[0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
[1., 0., 0., 0., 0., 0., 0., 1., 0., 0.,],
[0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
[0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,],
[1., 0., 0., 1., 1., 0., 0., 0., 1., 0.,]]
)
ans_test_corpus_words = sorted([START_TOKEN, "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", END_TOKEN])
word2ind_ans = dict(zip(ans_test_corpus_words, range(len(ans_test_corpus_words))))
# Test correct word2ind
assert (word2ind_ans == word2ind_test), "Your word2ind is incorrect:\nCorrect: {}\nYours: {}".format(word2ind_ans, word2ind_test)
# Test correct M shape
assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)
# Test correct M values
for w1 in word2ind_ans.keys():
    idx1 = word2ind_ans[w1]
    for w2 in word2ind_ans.keys():
        idx2 = word2ind_ans[w2]
        student = M_test[idx1, idx2]
        correct = M_test_ans[idx1, idx2]
        if student != correct:
            print("Correct M:")
            print(M_test_ans)
            print("Your M: ")
            print(M_test)
            raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))
# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
Question 1.3: Implement `reduce_to_k_dim` [code] (1 point)
Construct a method that performs dimensionality reduction on the matrix to produce $k$-dimensional embeddings. Use SVD to take the top $k$ components and produce a new matrix of $k$-dimensional embeddings.
Note: All of numpy, scipy, and scikit-learn (`sklearn`) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use `sklearn.decomposition.TruncatedSVD`.
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    # ------------------

    print("Done.")
    return M_reduced
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness
# In fact we only check that your M_reduced has the right dimensions.
# ---------------------
# Define toy corpus and run student code
test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "), "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)
# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)
# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
Question 1.4: Implement `plot_embeddings` [code] (1 point)
Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (`plt`).
For this example, you may find it useful to adapt this code. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they give.
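As a rough starting point, a labeled scatter plot in Matplotlib typically combines plt.scatter with per-point plt.annotate; the snippet below is a generic illustration on made-up points and labels, not the assignment solution:

import matplotlib.pyplot as plt
import numpy as np

pts = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
labels = ["alpha", "beta", "gamma"]  # made-up labels for illustration

plt.scatter(pts[:, 0], pts[:, 1], marker="x", color="red")
for (x, y), label in zip(pts, labels):
    plt.annotate(label, (x, y), xytext=(x + 0.05, y + 0.05))  # label next to each point
plt.show()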
def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # ------------------
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# The plot produced should look like the included file question_1.4_test.png
# ---------------------
print ("-" * 80)
print ("Outputted Plot:")
M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2ind_plot_test, words)
print ("-" * 80)
Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)
Now we will put together all the parts you have written! We will compute the co-occurrence matrix with a fixed window of 4 (the default window size), over the Large Movie Review corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we need to normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). Note: The line of code below that does the normalizing uses the NumPy concept of broadcasting. If you don't know about broadcasting, check out Computation on Arrays: Broadcasting by Jake VanderPlas.
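If the broadcasting step is unfamiliar, here is a tiny standalone illustration (ours, separate from the cell below) of the same row-normalization pattern: dividing a (2, 2) matrix by a (2, 1) column of row norms rescales each row to unit length.

X = np.array([[3.0, 4.0], [6.0, 8.0]])
lengths = np.linalg.norm(X, axis=1)   # shape (2,)
X_unit = X / lengths[:, np.newaxis]   # (2, 2) / (2, 1) broadcasts across rows
print(X_unit)                         # each row now has unit length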
Run the below cell to produce the plot. It can take up to a few minutes to run.
# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
imdb_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(imdb_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting
words = ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge']
plot_embeddings(M_normalized, word2ind_co_occurrence, words)
Verify that your figure matches "question_1.5.png" in the assignment zip. If not, use the figure in "question_1.5.png" to answer the next two questions.
a. Find at least two groups of words that cluster together in 2-dimensional embedding space. Give an explanation for each cluster you observe.
Write your answer here.
b. What doesn't cluster together that you might think should have? Describe at least two examples.
Write your answer here.
Part 2: Prediction-Based Word Vectors (15 points)
As discussed in class, more recently prediction-based word vectors have demonstrated better performance, such as word2vec and GloVe (which also utilizes the benefit of counts). Here, we shall explore the embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the word2vec and GloVe algorithms. If you're feeling adventurous, challenge yourself and try reading GloVe's original paper.
Then run the following cells to load the GloVe vectors into memory. Note: If this is your first time running these cells, i.e. downloading the embedding model, it will take a couple of minutes. If you've run these cells before, rerunning them will load the model without redownloading it, which will take about 1 to 2 minutes.
def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin
wv_from_bin = load_embedding_model()
Note: If you are receiving a "reset by peer" error, rerun the cell to restart the download.
Reducing dimensionality of Word Embeddings
Let's directly compare the GloVe embeddings to those of the co-occurrence matrix. In order to avoid running out of memory, we will work with a sample of 40000 GloVe vectors instead.
Run the following cells to:
- Put 40000 GloVe vectors into a matrix M
- Run `reduce_to_k_dim` (your Truncated SVD function) to reduce the vectors from 200-dimensional to 2-dimensional.
def get_matrix_of_vectors(wv_from_bin, required_words):
    """ Put the GloVe vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
            required_words (list of strings): words whose vectors must be included in M
        Return:
            M: numpy matrix shape (num words, 200) containing the vectors
            word2ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.index_to_key)
    print("Shuffling words ...")
    random.seed(225)
    random.shuffle(words)
    words = words[:40000]  # work with a sample of 40000 vectors, per the note above
    print("Putting %i words into word2ind and matrix M..." % len(words))
    word2ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        if w in words:
            continue
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2ind
# -----------------------------------------------------------------
# Run Cell to Reduce 200-Dimensional Word Embeddings to k Dimensions
# Note: This should be quick to run
# -----------------------------------------------------------------
M, word2ind = get_matrix_of_vectors(wv_from_bin, words)
M_reduced = reduce_to_k_dim(M, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting
Note: If you are receiving out of memory issues on your local machine, try closing other applications to free more memory on your device. You may want to try restarting your machine so that you can free up extra memory. Then immediately run the jupyter notebook and see if you can load the word vectors properly. If you still have problems with loading the embeddings onto your local machine after this, please go to office hours or contact course staff.
Question 2.1: GloVe Plot Analysis [written] (3 points)
Run the cell below to plot the 2D GloVe embeddings for `['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge']`.
words = ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge']
plot_embeddings(M_reduced_normalized, word2ind, words)
Verify that your figure matches "question_2.1.png" in the assignment zip. If not, use the figure in "question_2.1.png" (and the figure in "question_1.5.png", if applicable) to answer the next two questions.
a. What is one way the plot is different from the one generated earlier from the co-occurrence matrix? What is one way it's similar?
Write your answer here.
b. Why might the GloVe plot (question_2.1.png) differ from the plot generated earlier from the co-occurrence matrix (question_1.5.png)?
Write your answer here.
Cosine Similarity
Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are "close" and "far" from one another.
We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective, L1 and L2 distances help quantify the amount of space "we must travel" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:

$$p \cdot q = \|p\| \|q\| \cos\theta$$

Instead of computing the actual angle, we can leave the similarity in terms of $\cos\theta$. Formally, the Cosine Similarity $s$ between two vectors $p$ and $q$ is defined as:

$$s = \frac{p \cdot q}{\|p\| \|q\|}, \textrm{ where } s \in [-1, 1]$$
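As a quick numeric check of the formula (our own illustration, with arbitrary vectors):

import numpy as np

p, q = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
cos_sim = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
print(cos_sim)  # 1.0: parallel vectors have maximal cosine similarity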
Question 2.2: Words with Multiple Meanings (1.5 points) [code + written]
Polysemes and homonyms are words that have more than one meaning (see this wiki page to learn more about the difference between polysemes and homonyms). Find a word with at least two different meanings such that the top-10 most similar words (according to cosine similarity) contain related words from both meanings. For example, "leaves" has both "go_away" and "a_structure_of_a_plant" meaning in the top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemous or homonymic words before you find one.
Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain one of the meanings of the words)?
Note: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 most similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the GenSim documentation.
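For example, a first probe might look like this, using "leaves" (one of the words mentioned above; substitute any candidate word):

# List the 10 nearest neighbors of a candidate polysemous word.
pprint.pprint(wv_from_bin.most_similar("leaves"))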
# ------------------
# Write your implementation here.
# ------------------
Write your answer here.
Question 2.3: Synonyms & Antonyms (2 points) [code + written]
When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
Find three words $(w_1, w_2, w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance$(w_1, w_3) <$ Cosine Distance$(w_1, w_2)$.

As an example, $w_1$ = "happy" is closer to $w_3$ = "sad" than to $w_2$ = "cheerful". Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.
You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the GenSim documentation for further assistance.
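As a starting point, the stated "happy"/"sad"/"cheerful" example can be reproduced directly; note that gensim's distance is exactly 1 minus its similarity (our illustration of the call pattern):

w1, w2, w3 = "happy", "cheerful", "sad"   # w2: synonym, w3: antonym
print(wv_from_bin.distance(w1, w3))       # distance to the antonym ...
print(wv_from_bin.distance(w1, w2))       # ... is smaller than to the synonym
assert abs(wv_from_bin.distance(w1, w2) - (1 - wv_from_bin.similarity(w1, w2))) < 1e-6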
# ------------------
# Write your implementation here.
# ------------------
Write your answer here.
Question 2.4: Analogies with Word Vectors [written] (1.5 points)
Word vectors have been shown to sometimes exhibit the ability to solve analogies.
As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?
In the cell below, we show you how to use word vectors to find x using the `most_similar` function from the GenSim documentation. The function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list (while omitting the input words, which are often the most similar; see this paper). The answer to the analogy will have the highest cosine similarity (largest returned numerical value).
# Run this cell to answer the analogy -- man : grandfather :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))
Let $m$, $g$, $w$, and $x$ denote the word vectors for `man`, `grandfather`, `woman`, and the answer, respectively. Using only the vectors $m$, $g$, $w$, and the vector arithmetic operators $+$ and $-$ in your answer, what is the expression in which we are maximizing cosine similarity with $x$?
Hint: Recall that word vectors are simply multi-dimensional vectors that represent a word. It might help to draw out a 2D example using arbitrary locations of each vector. Where would `man` and `woman` lie in the coordinate plane relative to `grandfather` and the answer?
Write your answer here.
Question 2.5: Finding Analogies [code + written] (1.5 points)
a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the `most_similar` function gives us words like "granddaughter", "daughter", or "mother"?
Write your answer here.
b. Find an example of an analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain in one or two sentences why the analogy holds.
Note: You may have to try many analogies to find one that works!
# For example: x, y, a, b = ("", "", "", "")
# ------------------
# Write your implementation here.
# ------------------
# Test the solution
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b
Write your answer here.
Question 2.6: Incorrect Analogy [code + written] (1.5 points)
a. Below, we expect to see the intended analogy "hand : glove :: foot : sock", but we see an unexpected result instead. Give a potential reason as to why this particular analogy turned out the way it did.
pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))
Write your answer here.
b. Find another example of analogy that does not hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the incorrect value of b according to the word vectors (in the previous example, this would be '45,000-square').
# For example: x, y, a, b = ("", "", "", "")
# ------------------
# Write your implementation here.
# ------------------
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] != b
Write your answer here.
Question 2.7: Guided Analysis of Bias in Word Vectors [written] (1 point)
It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.
Run the cell below, to examine (a) which terms are most similar to "man" and "profession" and most dissimilar to "woman" and (b) which terms are most similar to "woman" and "profession" and most dissimilar to "man". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it is reflecting gender bias.
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))
Write your answer here.
Question 2.8: Independent Analysis of Bias in Word Vectors [code + written] (1 point)
Use the `most_similar` function to find another pair of analogies that demonstrates some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.
# ------------------
# Write your implementation here.
# ------------------
Write your answer here.
Question 2.9: Thinking About Bias [written] (2 points)
a. Give one explanation of how bias gets into the word vectors. Briefly describe a real-world example that demonstrates this source of bias. Your real-world example should be focused on word vectors, as opposed to bias in other AI systems (e.g., ChatGPT).
Write your answer here.
b. What is one method you can use to mitigate bias exhibited by word vectors? Briefly describe a real-world example that demonstrates this method.
Write your answer here.
Submission Instructions
- Click the Save button at the top of the Jupyter Notebook.
- Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells).
- Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
- Once you've rerun everything, select File -> Download as -> PDF via LaTeX. (If you have trouble using "PDF via LaTeX", you can also save the webpage as a pdf. Make sure all your solutions, especially the coding parts, are displayed in the pdf; it's okay if the provided code gets cut off because lines are not wrapped in code cells.)
- Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing your graders will see!
- Submit your PDF on Gradescope.