
What are embeddings

Vicki Boykis

Abstract

Over the past decade, embeddings - numerical representations of machine learning features used as input to deep learning models - have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data.

However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data.

As the volume, velocity, and variety of data captured by modern applications have exploded, creating approaches specifically tailored to scale has become increasingly important.

Google's Word2Vec paper made an important step in moving from simple statistical representations to semantic meaning of words.

The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure.

This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.

Colophon

This paper is typeset with LaTeX. The cover art is Kandinsky's "Circles in a Circle", 1923. ChatGPT was used to generate some of the figures.

Code, LaTeX, and Website

The latest version of the paper and code examples are available here. The website for this project is here.

About the Author

Vicki Boykis is a machine learning engineer. Her website is vickiboykis.com and her semantic search side project is viberary.pizza.

Acknowledgements

I'm grateful to everyone who has graciously offered technical feedback but especially to Nicola Barbieri, Peter Baumgartner, Luca Belli, James Kirk, and Ravi Mody. All remaining errors, typos, and bad jokes are mine.

Thank you to Dan for your patience, encouragement, for parenting while I was in the latent space, and for once casually asking, "How do you generate these 'embeddings', anyway?"

License

This work is licensed under a Creative Commons "Attribution-NonCommercial-ShareAlike 3.0 Unported" license.

Contents

1 Introduction
2 Recommendation as a business problem
2.1 Building a web app
2.2 Rules-based systems versus machine learning
2.3 Building a web app with machine learning
2.4 Formulating a machine learning problem
2.4.1 The Task of Recommendations
2.4.2 Machine learning features
2.5 Numerical Feature Vectors
2.6 From Words to Vectors in Three Easy Pieces
3 Historical Encoding Approaches
3.1 Early Approaches
3.2 Encoding
3.2.1 Indicator and one-hot encoding
3.2.2 TF-IDF
3.2.3 SVD and PCA
3.3 LDA and LSA
3.4 Limitations of traditional approaches
3.4.1 The curse of dimensionality
3.4.2 Computational complexity
3.5 Support Vector Machines
3.6 Word2Vec
4 Modern Embeddings Approaches
4.1 Neural Networks
4.1.1 Neural Network architectures
4.2 Transformers
4.2.1 Encoders/Decoders and Attention
4.3 BERT
4.4 GPT
5 Embeddings in Production
5.1 Embeddings in Practice
5.1.1 Pinterest
5.1.2 YouTube and Google Play Store
5.1.3 Twitter
5.2 Embeddings as an Engineering Problem
5.2.1 Embeddings Generation
5.2.2 Storage and Retrieval
5.2.3 Drift Detection, Versioning, and Interpretability
5.2.4 Inference and Latency
5.2.5 Online and Offline Model Evaluation
5.2.6 What makes embeddings projects successful
6 Conclusion

1 Introduction

Implementing deep learning models has become an increasingly important machine learning strategy for companies looking to build data-driven products.

In order to build and power deep learning models, companies collect and feed hundreds of millions of terabytes of multimodal data into deep learning models.

As a result, embeddings - deep learning models' internal representations of their input data - are quickly becoming a critical component of building machine learning systems.
For example, they make up a significant part of Spotify's item recommender systems [27], YouTube video recommendations of what to watch [11], and Pinterest's visual search [31].

Even if they are not explicitly presented to the user through recommendation system UIs, embeddings are also used internally at places like Netflix to make content decisions around which shows to develop based on user preference popularity.

Figure 1: Left to right: Products that use embeddings to generate recommended items: Spotify Radio, YouTube video recommendations, visual recommendations at Pinterest, BERT embeddings in suggested Google search results.
The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google's Word2Vec paper [47].
Figure 2: Embeddings papers in arXiv by month. It's interesting to note the decline in frequency of embeddings-specific papers, possibly in tandem with the rise of deep learning architectures like GPT (source).
Building and expanding on the concepts in Word2Vec, the Transformer [66] architecture, with its self-attention mechanism, a much more specialized case of calculating context around a given word, has become the de-facto way to learn representations of growing multimodal vocabularies, and its rise in popularity both in academia and in industry has caused embeddings to become a staple of deep learning workflows.
However, the concept of embeddings can be elusive because they're neither data flow inputs nor output results - they are intermediate elements that live within machine learning services to refine models. So it's helpful to define them explicitly from the beginning.
As a general definition, embeddings are data that has been transformed into n-dimensional matrices for use in deep learning computations. The process of embedding (as a verb):
  • Transforms multimodal input into representations that are easier to perform intensive computation on, in the form of vectors, tensors, or graphs [51]. For the purpose of machine learning, we can think of vectors as a list (or array) of numbers.
  • Compresses input information for use in a machine learning task - the type of methods available to us in machine learning to solve specific problems - such as summarizing a document or identifying tags or labels for social media posts or performing semantic search on a large text corpus.

    The process of compression changes variable feature dimensions into fixed inputs, allowing them to be passed efficiently into downstream components of machine learning systems.
  • Creates an embedding space that is specific to the data the embeddings were trained on but that, in the case of deep learning representations, can also generalize to other tasks and domains through transfer learning - the ability to switch contexts - which is one of the reasons embeddings have exploded in popularity across machine learning applications.
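As a rough sketch of the compression step in the list above, the snippet below turns token lists of different lengths into vectors of one fixed length. The toy vocabulary, the random lookup table, and the choice of mean pooling are all assumptions made for illustration; real embedding tables are learned by a model.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table; real tables are learned.
vocab = {"hold": 0, "fast": 1, "to": 2, "dreams": 3, "bird": 4, "fly": 5}
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 4))  # 6 tokens x 4 dimensions


def embed(tokens):
    """Map a variable-length token list to one fixed-length vector by mean pooling."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids].mean(axis=0)


short = embed(["bird", "fly"])
longer = embed(["hold", "fast", "to", "dreams"])
print(short.shape, longer.shape)  # (4,) (4,)
```

Whatever the input length, the downstream component always receives a vector with the same number of dimensions.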
What do embeddings actually look like? Here is one single embedding, also called a vector, in three dimensions. We can think of this as a representation of a single element in our dataset.

For example, this hypothetical embedding represents a single word "fly", in three dimensions. Generally, we represent individual embeddings as row vectors.
And here is a tensor, also known as a matrix, which is a multidimensional combination of vector representations of multiple elements. For example, this could be the representation of "fly" and "bird."
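To make the two shapes concrete, here is a small NumPy sketch of a single row vector and of a matrix that stacks two of them. The numbers are hypothetical and chosen only for illustration; they are not output from any model.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
fly = np.array([1.0, 4.0, 9.0])   # one embedding: a row vector in 3 dimensions
bird = np.array([4.0, 5.0, 6.0])  # a second embedding of the same length

# Stacking the row vectors gives a 2 x 3 matrix: one row per element.
embeddings = np.stack([fly, bird])
print(fly.shape)         # (3,)
print(embeddings.shape)  # (2, 3)
```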
These embeddings are the output of the process of learning embeddings, which we do by passing raw input data into a machine learning model.

We transform that multidimensional input data by compressing it, through the algorithms we discuss in this paper, into a lower-dimensional space. The result is a set of vectors in an embedding space.
Figure 3: The process of embedding.
We often talk about item embeddings being in n dimensions, with n ranging anywhere from 100 to 1000, with diminishing returns in usefulness somewhere beyond 200-300 in the context of using them for machine learning problems. This means that each item (image, song, word, etc.) is represented by a vector of length n, where each value is a coordinate in an n-dimensional space.
We just made up an embedding for "bird", but let's take a look at what a real one for the word "hold" would look like in the quote, as generated by the BERT deep learning model,
"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly." — Langston Hughes
We've highlighted this quote because we'll be working with this sentence as our input example throughout this text.
import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird
that cannot fly."""

# Add BERT's special tokens, then tokenize the sentence with the BERT tokenizer.
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print(tokenized_text)

['[CLS]', 'hold', 'fast', 'to', 'dreams', ',', 'for', 'if', 'dreams', 'die',
 ',', 'life', 'is', 'a', 'broken', '-', 'winged', 'bird', 'that', 'cannot',
 'fly', '.', '[SEP]']

# BERT code truncated to show the final output, an embedding
[tensor([-3.0241e-01, -1.5066e+00, -9.6222e-01, 1.7986e-01, -2.7384e+00,
         -1.6749e-01, 7.4106e-01, 1.9655e+00, 4.9202e-01, -2.0871e+00,
         -5.8469e-01, 1.5016e+00, 8.2666e-01, 8.7033e-01, 8.5101e-01,
         5.5919e-01, -1.4336e+00, 2.4679e+00, 1.3920e+00,
         -1.2054e+00, 1.4637e+00, 1.9681e+00, 3.6572e-01,
         -4.4693e-01, -1.1637e+00, 2.8804e-01, -8.3749e-01, 1.5026e+00,
         -2.1318e+00, 1.9633e+00, -4.5096e-01, -1.8215e+00, 3.2744e+00,
         5.2591e-01, 1.0686e+00, 3.7893e-01, -1.0792e-01,
         -1.0443e+00, 1.7513e+00, 1.3895e-01, -6.6757e-01, -4.8434e-01,
         -2.1621e+00, -1.5593e+01, 1.5249e+00, 1.6911e+00,
         1.2339e+00, -3.6064e-01, -9.6036e-01, 1.3226e+00,
         1.4588e+00, -1.8806e+00, 6.3620e-01, 1.1713e+00, 1.1050e+00, ...
Figure 4: Analyzing Embeddings with BERT. See full notebook source
We can see that this embedding is a PyTorch tensor object, a multidimensional matrix containing multiple levels of embeddings, and that's because in BERT's embedding representation, we have 13 different layers. One embedding layer is computed for each layer of the neural network.

Each level represents a different view of our given token - or simply a sequence of characters. We can get the final embedding by pooling several layers, details we'll get into as we work our way up to understanding embeddings generated using BERT.
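As a minimal sketch of what pooling several layers can look like, the snippet below uses a random array as a stand-in for BERT's hidden states (13 layers, the 23 tokens listed above, 768 dimensions per token in bert-base-uncased). Summing the last four layers is one common pooling choice and is an assumption here, not necessarily the method used later in this text.

```python
import numpy as np

# Stand-in for BERT hidden states: 13 layers x 23 tokens x 768 dimensions.
# The values are random; a real run would take them from the model output.
rng = np.random.default_rng(seed=0)
hidden_states = rng.normal(size=(13, 23, 768))

token_index = 1  # position of 'hold' in the token list above

# One common pooling choice (an assumption here): sum the last four layers.
token_embedding = hidden_states[-4:, token_index, :].sum(axis=0)
print(token_embedding.shape)  # (768,)
```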
When we create an embedding for a word, sentence, or image that represents the artifact in the multidimensional space, we can do any number of things with this embedding.

For example, for tasks that focus on content understanding in machine learning, we are often interested in comparing two given items to see how similar they are. Projecting text as a vector allows us to do so with mathematical rigor and compare words in a shared embedding space.
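One widely used way to compare two vectors is cosine similarity. The sketch below assumes three made-up, 3-dimensional embeddings; the function name and the values are illustrative only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional embeddings, made up for illustration.
bird = np.array([0.9, 0.8, 0.1])
fly = np.array([0.8, 0.9, 0.2])
dream = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(bird, fly) > cosine_similarity(bird, dream))  # True
```

With these values, "bird" sits closer to "fly" than to "dream" in the shared space, which is the kind of comparison the projection makes possible.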
Figure 5: Projecting words into a shared embedding space

Figure 6: Embeddings in the context of an application.
Engineering systems based on embeddings can be computationally expensive to build and maintain [61]. The need to create, store, and manage embeddings has also recently resulted in the explosion of an entire ecosystem of related products.

Examples include the recent rise in the development of vector databases to facilitate production-ready use of nearest-neighbor semantic queries in machine learning systems, and the rise of embeddings as a service.
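To give a feel for what a nearest-neighbor semantic query does, here is a brute-force sketch in NumPy. The item names, the vectors, and the `nearest_neighbors` helper are hypothetical; production vector databases typically rely on approximate indexes rather than scanning every row.

```python
import numpy as np

# Hypothetical item embeddings (one row per item) with made-up values.
item_names = ["bird", "fly", "dream", "wing"]
items = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9],
    [0.9, 0.7, 0.2],
])


def nearest_neighbors(query, matrix, k=2):
    """Return indices of the k rows most similar to the query by cosine similarity."""
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = normed @ (query / np.linalg.norm(query))
    return np.argsort(-scores)[:k]


query = np.array([1.0, 0.8, 0.1])
print([item_names[i] for i in nearest_neighbors(query, items)])
```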
As such, it's important to understand their context as end-consumers, as product management teams, and as developers who work with them. But in my deep-dive into the embeddings reference material, I found that there are two types of resources: very deeply technical academic papers, for people who are already NLP experts, and surface-level marketing spam blurbs for people looking to buy embeddings-based tech, and that the two do not overlap in what they cover.
In Thinking in Systems, Donella Meadows writes, "You think that because you understand 'one' that you must therefore understand 'two' because one and one make two. But you forget that you must also understand 'and.'" [45] In order to understand the current state of embedding architectures and be able to decide how to build them, we must understand how they came to be.

In building my own understanding, I wanted a resource that was technical enough to be useful to ML practitioners, but one that also put embeddings in their correct business and engineering contexts as they become more often used in ML architecture stacks.

This is, hopefully, that text.
In this text, we'll examine embeddings from three perspectives, working our way from the highest level view to the most technical.

We'll start with the business context, followed by the engineering implementation, and finally look at the machine learning theory, focusing on the nuts and bolts of how they work.

On a parallel axis, we'll also travel through time, surveying the earliest approaches and moving towards modern embedding approaches.
In writing this text, I strove to balance the need to have precise technical and mathematical definitions for concepts and my desire to stay away from explanations that make people's eyes glaze over.

I've defined all technical jargon when it appears for the first time to build context.

I include code as a frame of reference for practitioners, but don't go as deep as a code tutorial would. So, it would be helpful for the reader to have some familiarity with programming and machine learning basics, particularly after the sections that discuss business context.

But, ultimately the goal is to educate anyone who is willing to sit through this, regardless of level of technical understanding.
It's worth also mentioning what this text does not try to be: it does not try to explain the latest advancements in GPT and generative models, it does not try to explain transformers in their entirety, and it does not try to cover all of the exploding field of vector databases and semantic search.

I've tried my best to keep it simple and focus on really understanding the core concept of embeddings.

2 Recommendation as a business problem

Let's step back and look at the larger context with a concrete example before diving into implementation details. Let's build a social media network, Flutter, the premier social network for all things with wings.

Flutter is a web and mobile app where birds can post short snippets of text, videos, images, and sounds, to let other birds, insects and bats in the area know what's up.

Its business model is based on targeted advertising, and its app architecture includes a "home" feed based on birds that you follow, made up of small pieces of multimedia content called "flits", which can be either text, videos, or photos.

The home feed itself is by default in reverse chronological order that is curated by the user. But we also would like to offer personalized, recommended flits so that the user finds interesting content on our platform that they might have not known about before.
Figure 7: Flutter's content timeline in a social feed with a blend of organic followed content, advertising, and recommendations.
How do we solve the problem of what to show in the timeline here so that our users find the content relevant and interesting, and balance the needs of our advertisers and business partners?
In many cases, we can approach engineering solutions without involving machine learning. In fact, we should definitely start without it [76] because machine learning adds a tremendous amount of complexity to our working application [57].

In the case of the Flutter home feed, though, machine learning forms a business-critical part of the product offering.

From the business product perspective, the objective is to offer Flutter's users content that is relevant, interesting, and novel so they continue to use the platform.

If we do not build discovery and personalization into our content-centric product, Flutter users will not be able to discover more content to consume and will disengage from the platform.
This is the case for many content-based businesses, all of which have feed-like surface areas for recommendations, including Netflix, Pinterest, Spotify, and Reddit.

It also covers e-commerce platforms, which must surface relevant items to the user, and information retrieval platforms like search engines, which must provide relevant answers to users upon keyword queries.

There is a new category of hybrid applications involving question-and-answering in semantic search contexts that is arising as a result of work around the GPT series of models, but for the sake of simplicity, and because that landscape changes every week, we'll stick to understanding the fundamental underlying concepts.
In subscription-based platforms, there is a clear business objective that's tied directly to the bottom line, as outlined in this 2015 paper [64] about Netflix's recsys:
The main task of our recommender system at Netflix is to help our members discover content that they will watch and enjoy to maximize their long-term satisfaction.

This is a challenging problem for many reasons, including that every person is unique, has a multitude of interests that can vary in different contexts, and needs a recommender system most when they are not sure what they want to watch.

Doing this well means that each member gets a unique experience that allows them to get the most out of Netflix. As a monthly subscription service, member satisfaction is tightly coupled to a person's likelihood to retain with our service, which directly impacts our revenue.
Knowing this business context, and given that personalized content is more relevant and generally gets higher rates of engagement [30] than non-personalized forms of recommendation on online platforms, how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally? We need to first understand how web apps work and where embeddings fit into them.

2.1 Building a web app

Most of the apps we use today - Spotify, Gmail, Reddit, Slack, and Flutter - are all designed based on the same foundational software engineering patterns. They are all apps available on web and mobile clients.

They all have a front-end where the user interacts with the various product features of the applications, an API that connects the front-end to back-end elements, and a database that processes data and remembers state.
As an important note, features have many different definitions in machine learning and engineering. In this specific case, we mean collections of code that make up some front-end element, such as a button or a panel of recommendations.

We'll refer to these as product features, in contrast with machine learning features, which are input data into machine learning models.
This application architecture is commonly known as the model-view-controller pattern [20], or in common industry lingo, a CRUD app, named for the basic operations that its API allows to manage application state: create, read, update, and delete.
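The four CRUD operations can be sketched with a tiny in-memory store. The `FlitStore` class and its method names are hypothetical stand-ins for the database behind an API, not part of any real framework.

```python
class FlitStore:
    """In-memory stand-in for the database behind a CRUD API (hypothetical)."""

    def __init__(self):
        self._flits = {}
        self._next_id = 1

    def create(self, text):
        """Store a new flit and return its id."""
        flit_id = self._next_id
        self._flits[flit_id] = text
        self._next_id += 1
        return flit_id

    def read(self, flit_id):
        """Return the flit text, or None if it does not exist."""
        return self._flits.get(flit_id)

    def update(self, flit_id, text):
        """Replace the text of an existing flit."""
        if flit_id not in self._flits:
            raise KeyError(flit_id)
        self._flits[flit_id] = text

    def delete(self, flit_id):
        """Remove the flit if it exists."""
        self._flits.pop(flit_id, None)


store = FlitStore()
flit_id = store.create("Hold fast to dreams")
store.update(flit_id, "Hold fast to dreams, for if dreams die")
print(store.read(flit_id))
store.delete(flit_id)
print(store.read(flit_id))  # None
```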

Figure 8: Typical CRUD web app architecture
When we think of structural components in the architectures of these applications, we might think first in terms of product features. In an application like Slack, for example, we have the ability to post and read messages, manage notifications, and add custom emojis.

Each of these can be seen as an application feature. In order to create features, we have to combine common elements like databases, caches, and web services. All of this happens as the client talks to the API, which talks to the database to process data.

At a more granular, program-specific level, we might think of foundational data structures like arrays or hash maps, and lower still, we might think about memory management and network topologies. These are all foundational elements of modern programming.
At the feature level, though, we see that it not only includes the typical CRUD operations, such as the ability to post and read Slack messages, but also elements that are more than operations that alter database state.

Some features, such as personalized channel suggestions, returning relevant results through search queries, and predicting Slack connection invites, necessitate the use of machine learning.
Figure 9: CRUD App with Machine learning service

2.2 Rules-based systems versus machine learning

To understand where embeddings fit into these systems, it first makes sense to understand where machine learning fits in at Flutter, or any given company, as a whole.

In a typical consumer company, the user-facing app is made up of product features written in code, typically written as services or parts of services. To add a new web app feature, we write code based on a set of business logic requirements.

This code acts on data in the app to develop our new feature.
In a typical data-centric software development lifecycle, we start with the business logic. For example, let's take the ability to post messages.

We'd like users to be able to input text and emojis in their language of choice, have the messages sorted chronologically, and render correctly on web and mobile. These are the business requirements.

We use the input data, in this case, user messages, and format them correctly and sort chronologically, at low latency, in the UI.
Figure 10: A typical application development lifecycle
Machine learning-based systems are typically also services in the backend of web applications. They are integrated into production workflows. But, they process data much differently. In these systems, we don't start with business logic.

We start with input data that we use to build a model that will suggest the business logic for us. For more on the specifics of how to think about these data-centric engineering systems, see Kleppmann [35].
This requires thinking about application development slightly differently. When we write an application that includes machine learning models, we're inverting the traditional app lifecycle. What we have instead is data plus our desired outcome.

The data is combined into a model, and it is this model which instead generates our business logic that builds features.
Figure 11: ML Development lifecycle
In short, the difference between programming and machine learning development is that we are not generating answers through business rules, but business rules through data. These rules are then re-incorporated into the application.
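The contrast can be sketched in a few lines of Python. In the first function a person writes the rule; in the second, the rule (a threshold) is derived from data and known answers. The function names, the like counts, and the labels are all hypothetical, and a single learned threshold is about the simplest possible stand-in for a model.

```python
# Classical programming: a hand-written rule plus data produces answers.
def is_popular_rule(likes):
    return likes >= 100  # threshold chosen by a person


# Machine learning: data plus known answers produce the rule.
def learn_threshold(likes, labels):
    """Pick the threshold that classifies the labelled examples most accurately."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(likes)):
        correct = sum((x >= candidate) == y for x, y in zip(likes, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical training data: like counts and whether each flit was popular.
likes = [3, 10, 25, 40, 60, 90]
labels = [False, False, False, True, True, True]
threshold = learn_threshold(likes, labels)
print(threshold)  # 40
```

The learned threshold then plays the role of the business rule that gets re-incorporated into the application.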
Figure 12: Generating answers via machine learning. The top chart shows a classical programming approach with rules and data as inputs, while the bottom chart shows a machine learning approach with data and answers as inputs.
As an example, with Slack, for the channel recommendations product feature, we are not hard-coding a list of channels that need to be called from the organization's API.

We are feeding in data about the organization's users (what other channels they've joined, how long they've been users, which channels the people they've interacted with the most in Slack are in), and building a model on that data that recommends a non-deterministic, personalized list of channels for each user that we then surface through the UI.
Figure 13: Traditional versus ML architecture and infra

2.3 Building a web app with machine learning

All machine learning systems can be examined through how they accomplish these four steps.

When we build models, our key questions should be, "what kind of input do we have and how is it formatted", and "what do we get as a result." We'll be asking this for each of the approaches we look at.

When we build a machine learning system, we start by processing data and finish by serving a learned model artifact.
The four components of a machine learning system are

- Input data - processing data from a database or streaming from a production application for use in modeling

- Feature Engineering and Selection - The process of examining the data and cleaning it to pick features. In this case, we mean features as attributes of any given element that we use as inputs into machine learning.

Examples of features are: user name, geographic location, how many times they've clicked on a button for the past 5 days, and revenue.

This piece always takes the longest in any given machine learning system, and is also known as finding representations [4] of the data that best fit the machine learning algorithm. This is where, in the new model architectures, we use embeddings as input.

- Model Building - We select the features that are important and train our model, iterating on different performance metrics over and over again until we have an acceptable model we can use. Embeddings are also the output of this step that we can use in other, downstream steps.

- Model Serving - Now that we have a model we like, we serve it to production, where it hits a web service, potentially a cache, and our API, and then propagates to the front-end for the user to consume as part of our web app.
Figure 14: CRUD app with
Within machine learning, there are many approaches we can use to fit different tasks. Machine learning workflows that are most effective are formulated as solutions to both a specific business need and a machine learning task.

Tasks can best be thought of as approaches to modeling within the categorized solution space. For example, learning a regression model is a specific case of a task. Others include clustering, machine translation, anomaly detection, similarity matching, or semantic search.

The three highest-level types of ML tasks are supervised, unsupervised, and reinforcement learning. In supervised learning, we have training data that can tell us whether the results the model predicted are correct according to some model of the world. In unsupervised learning, there is not a single ground-truth answer.

An example here is clustering of our customer base. A clustering model can detect patterns in your data but won't explicitly label what those patterns are.
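As a rough illustration of the clustering example, here is a minimal sketch using scikit-learn's KMeans. The customer data, the two feature columns, and the choice of two clusters are all assumptions made up for this example; the point is only that the model returns group assignments without saying what the groups mean.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (orders per month, average basket size).
customers = np.array([[1.0, 1.0], [1.5, 2.0], [10.0, 10.0], [11.0, 9.0]])

# n_clusters=2 is an assumption we make up front; the data carries no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# The output is only a cluster index per customer, not a description of the cluster.
print(labels)
```

Interpreting what each cluster represents (for example "occasional" versus "frequent" customers) is left to us.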

The third is reinforcement learning which is separate from these two categories and formulated as a game theory problem: we have an agent moving through an environment and we'd like to understand how to optimally move them through a given environment using explore-exploit techniques.

We'll focus on supervised learning, with a look at unsupervised learning with PCA and Word2Vec.
Figure 15: Machine learning task solution space and model families

2.4 Formulating a machine learning problem

As we saw in the last section, machine learning is a process that takes data as input to produce rules for how we should classify something or filter it or recommend it, depending on the task at hand.

In any of these cases, for example, to generate a set of potential candidates, we need to construct a model.
A machine learning model is a set of instructions for generating a given output from data. The instructions are learned from the features of the input data itself.

For Flutter, an example of a model we'd like to build is a candidate generator that picks flits similar to flits our birds have already liked, because we think users will like those, too.

For the sake of building up the intuition for a machine learning workflow, let's pick a super-simple example that is not related to our business problem, linear regression, which gives us a continuous variable as output in response.
For example, let's say, given the number of posts a user has made and how many posts they've liked, we'd like to predict how many days they're likely to continue to stay on Flutter.

For traditional supervised modeling approaches using tabular data, we start with our input data, or a corpus as it's generally known in machine learning problems that deal with text in the field known as NLP (natural language processing).
We're not doing NLP yet, though, so our input data may look something like this, where we have a UID (userid) and some attributes of that user, such as the number of times they've posted and number of posts they've liked. These are our machine learning features.
Table 1: Tabular Input Data for Flutter Users
bird_id  bird_posts  bird_likes
012      2           5
013      0           4
056      57          70
612      0           120
We'll need part of this data to train our model, part of it to test the accuracy of the model we've trained, and part to tune meta-aspects of our model. These are known as hyperparameters.
We take two parts of this data as holdout data that we don't feed into the model. The first part, the test set, we use to validate the final model on data it's never seen before.

We use the second split, called the validation set, to check our hyperparameters during the model training phase.

In the case of linear regression, there are no true hyperparameters, but we'll need to keep in mind that we will need to tune the model's metadata for more complicated models.
Let's assume we have 100 of these values. A commonly accepted split is to use 80% of the data for training and 20% for testing. The reasoning is that we want our model to have access to as much data as possible so it learns a more accurate representation.
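A minimal sketch of such a split, assuming scikit-learn's train_test_split and 100 randomly generated rows standing in for real Flutter data (the feature values and the "days on Flutter" target are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 birds, two features each (bird_posts, bird_likes).
rng = np.random.default_rng(seed=0)
X = rng.integers(0, 100, size=(100, 2))
y = rng.integers(1, 365, size=100)  # made-up target: days the bird stays on Flutter

# Hold out 20% of the rows as a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

A validation set for hyperparameter tuning could be carved out the same way by splitting the training portion a second time.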
In general, our goal is to feed our input into the model, through a function that we pick, and get some predicted output, $f(X) \rightarrow y$.
Figure 16: How inputs map to outputs in ML functions [34]
For our simple dataset, we can use the linear regression equation:

$$y = x_1\beta_1 + x_2\beta_2 + \varepsilon$$

This tells us that the output, $y$, can be predicted by two input variables, $x_1$ (bird posts) and $x_2$ (bird likes), with their given weights, $\beta_1$ and $\beta_2$, plus an error term $\varepsilon$, or the distance between each data point and the regression line generated by the equation. Our task is to find the smallest sum of squared differences between each point and the line, in other words to minimize the error, because it will mean that, at each point, our predicted $\hat{y}$ is as close to our actual $y$ as we can get it, given the other points.
The heart of machine learning is this training phase, which is the process of finding a combination of model instructions and data that accurately represent our real data, which, in supervised learning, we can validate by checking the correct "answers" from the test set.
Figure 17: The cycle of machine learning model development
As the first round of training starts, we have our data. We train - or build - our model by initializing it with a set of inputs, $X$. These are from the training data. $\beta_1$ and $\beta_2$ are either initialized by setting to zero or initialized randomly (depending on the model, different approaches work best), and we calculate $\hat{y}$, our predicted value for the model. $\varepsilon$ is derived from the data and the estimated coefficients once we get an output.
How do we know our model is good? We initialize it with some set of values, weights, and we iterate on those weights, usually by minimizing a cost function.

The cost function is a function that models the difference between our model's predicted value and the actual output for the training data.

The first output may not be the most optimal, so we iterate over the model space many times, optimizing for the specific metric that will make the model as representative of reality as possible and minimize the difference between the actual and predicted values.

So in our case, we compare $\hat{y}$ to $y$. The average squared difference between an observation's actual and predicted values is the cost, otherwise known as MSE - mean squared error.
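Written out, the cost described above takes the following form. The notation is introduced here for illustration: $N$ is the number of training observations, $i$ indexes them, and $\hat{y}_i$ is the prediction of the two-feature linear model from this section.

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = x_{i1}\beta_1 + x_{i2}\beta_2
```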
We'd like to minimize this cost, and we do so with gradient descent.

When we say that the model learns, we mean that we can learn what the correct inputs into a model are through an iterative process where we feed the model data, evaluate the output, and see if the predictions it generates improve through the process of gradient descent. We'll know because our loss should incrementally decrease in every training iteration.
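To make the training loop concrete, here is a small sketch of gradient descent on mean squared error for a two-weight linear model, written in plain NumPy. The four feature rows reuse the bird_posts and bird_likes values from Table 1; the target values, the learning rate, and the number of iterations are assumptions chosen only so the example runs, not values from the paper.

```python
import numpy as np

# Features: (bird_posts, bird_likes) for the four birds in Table 1.
X = np.array([[2, 5], [0, 4], [57, 70], [0, 120]], dtype=float)
# Hypothetical target: days each bird stays on Flutter (made up for illustration).
y = np.array([10.0, 5.0, 200.0, 90.0])


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)


beta = np.zeros(2)     # weights initialized to zero
learning_rate = 1e-5   # small step size, because the raw features are unscaled
losses = []

for _ in range(500):
    y_hat = X @ beta                               # current predictions
    losses.append(mse(y, y_hat))                   # record the cost
    gradient = -2.0 / len(y) * X.T @ (y - y_hat)   # derivative of MSE w.r.t. beta
    beta -= learning_rate * gradient               # step against the gradient

print(losses[0], losses[-1])
```

The recorded losses shrink as the loop runs, which is the "loss should incrementally decrease" behavior described above. In practice a library such as scikit-learn would handle this fitting for us.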
We have finally trained our model. Now, we test the model's predictions on the 20 values that we've used as a hold-out set; i.e. the model has not seen these before and we can confidently assume that they won't influence the training data.

We compare how many elements of the hold-out set the model was able to predict correctly to see what the model's accuracy was.

2.4.1 The Task of Recommendations

We just saw a simple example of machine learning as it relates to predicting continuous response variables. When our business question is, "What would be good content to show our users," we are facing the machine learning task of recommendation.

Recommender systems are systems set up for information retrieval, a field closely related to NLP that's focused on finding relevant information in large collections of documents. The goal of information retrieval is to synthesize large collections of unstructured text documents.

Within information retrieval, there are two complementary solutions in how we can offer users the correct content in our app: search, and recommendations.
Search is the problem of directed [17] information seeking, i.e. the user offers the system a specific query and would like a set of refined results. Search engines at this point are a well-established traditional solution in the space.
Recommendation is a problem where "man is the query." [58] Here, we don't know what the person is looking for exactly, but we would like to infer what they like, and recommend items based on their learned tastes and preferences.
The first industrial recommender systems were created to filter messages in email and newsgroups [22] at the Xerox Palo Alto Research Center based on a growing need to filter incoming information from the web.

The most common recommender systems today are those at Netflix, YouTube, and other large-scale platforms that need a way to surface relevant content to users.
The goal of recommender systems is to surface items that are relevant to the user. Within the framework of machine learning approaches for recommendation, the main machine learning task is to determine which items to show to a user in a given situation [5].

There are several common ways to approach the recommendation problem.
  • Collaborative filtering - The most common approach for creating recommendations is to formulate our data as a problem of finding missing user-item interactions in a given set of user-item interaction history.

    We start by collecting either explicit (ratings) data or implicit user interaction data like clicks, pageviews, or time spent on items.

    The simplest form of interactions are neighborhood models, where ratings are predicted initially by finding users similar to our given target user. We use similarity functions to compute the closeness of users.

    Another common approach is using methods such as matrix factorization, the process of representing users and items in a feature matrix made up of low-dimensional factor vectors, which in our case, are also known as embeddings, and learning those feature vectors through the process of minimizing a cost function.

    This process can be thought of as similar to Word2Vec [43], a deep learning model which we'll discuss in depth in this document. There are many different approaches to collaborative filtering, including matrix factorization and factorization machines.
  • Content filtering - This approach uses metadata available about our items (for example in movies or music, the title, year released, genre, and so on) as initial or additional features input into models. It works well when we don't have much information about user activity, although it is often used in combination with collaborative filtering approaches.

    Many embeddings architectures fall into this category since they help us model the textual features for our items.
  • Learn to Rank - Learn to rank methods focus on ranking items in relation to each other based on a known set of preferred rankings. The error is the number of cases in which pairs or lists of items are ranked incorrectly.

    Here, the problem is not presenting a single item, but a set of items and how they interplay. This step normally takes place after candidate generation, in a filtering step, because it's computationally expensive to rank extremely large lists.
  • Neural Recommendations - The process of using neural networks to capture the same relationships that matrix factorization does without explicitly having to create a user/item matrix and based on the shape of the input data.

    This is where deep learning networks, and recently, large language models, come into play.

    Examples of deep learning architectures used for recommendation include Word2Vec and BERT, which we'll cover in this document, and convolutional and recurrent neural networks for sequential recommendation (such as is found in music playlists, for example).

    Deep learning allows us to better model content-based recommendations and gives us representations of our items in an embedding space. [73]
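The neighborhood approach to collaborative filtering described in the list above can be sketched in a few lines. This is an illustrative example only: the ratings matrix is invented, and cosine similarity is just one of several similarity functions that could be used to compute the closeness of users.

```python
import numpy as np

# Hypothetical explicit ratings: rows are users, columns are items, 0 = no rating.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = 0  # the user we want recommendations for
similarities = [
    cosine_similarity(ratings[target], ratings[u]) for u in range(len(ratings))
]

# The nearest neighbor is the most similar user other than the target itself.
neighbor = max(
    (u for u in range(len(ratings)) if u != target),
    key=lambda u: similarities[u],
)
print(neighbor, similarities)
```

Items the neighbor rated highly that the target has not yet seen would then become candidate recommendations.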
Recommender systems have evolved their own unique architectures, which usually take the form of a four-stage system made up of several machine learning models, each of which performs a different machine learning task.
Figure 18: Recommender systems as a machine learning problem

- Candidate Generation - First, we ingest data from the web app. This data goes into the initial piece, which hosts our first-pass model generating candidate recommendations.

This is where collaborative filtering takes place, and we whittle our list of potential candidates down from millions to thousands or hundreds.

- Ranking - Finally, we need a way to order the filtered list of recommendations based on what we think the user will prefer the most, so the next stage is ranking, and then we serve them out in the timeline or the ML product interface we're working with.

- Filtering - Once we have a generated list of candidates, we want to continue to filter them using business logic (e.g. we don't want to see NSFW content, or items that are not on sale). This is generally a heavily heuristic-based step.

- Retrieval - This is the piece where the web application usually hits a model endpoint to get the final list of items served to the user through the product UI.
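The stages listed above can be sketched as a chain of plain functions. Everything here is hypothetical: the Flit class, its two score fields, and the tiny catalog stand in for real models and data, and a production system would call separate model services at each stage.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    flit_id: int
    candidate_score: float  # hypothetical score from a first-pass model
    rank_score: float       # hypothetical score from a ranking model
    nsfw: bool


def generate_candidates(catalog, limit):
    """Candidate generation: cut a large catalog down to a shortlist."""
    return sorted(catalog, key=lambda f: f.candidate_score, reverse=True)[:limit]


def rank(candidates):
    """Ranking: order the shortlist by predicted user preference."""
    return sorted(candidates, key=lambda f: f.rank_score, reverse=True)


def apply_filters(ranked):
    """Filtering: heuristic business rules, here dropping NSFW items."""
    return [f for f in ranked if not f.nsfw]


def retrieve(catalog, limit):
    """Retrieval: what an endpoint would return to the product UI."""
    shortlist = generate_candidates(catalog, limit)
    return [f.flit_id for f in apply_filters(rank(shortlist))]


catalog = [
    Flit(1, 0.9, 0.2, False),
    Flit(2, 0.8, 0.9, True),
    Flit(3, 0.7, 0.5, False),
    Flit(4, 0.1, 0.99, False),
]
print(retrieve(catalog, limit=3))  # [3, 1]
```

Flit 4 never reaches the ranking model because the cheap first pass drops it, which is the trade-off that makes ranking a small list affordable.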
Databases have become the fundamental tool in building backend infrastructure that performs data lookups. Embeddings have become similar building blocks in the creation of many modern search and recommendation product architectures.

Embeddings are a type of machine learning feature, or model input data, that we use first as input into the feature engineering stage. They are also the first set of results that come from our candidate generation stage, which are then incorporated into the downstream processing steps of ranking and retrieval to produce the final items the user sees.

2.4.2 Machine learning features

Now that we have a high-level conceptual view of how machine learning and recommender systems work, let's build towards a candidate generation model that will offer relevant flits.
Let's start by modeling a traditional machine learning problem and contrast it with our NLP problem. For example, let's say that one of our business problems is predicting whether a bird is likely to continue to stay on Flutter or to churn - disengage and leave the platform.
When we predict churn, we have a given set of machine learning feature inputs for each user and a final binary output of 1 or 0 from the model, 1 if the bird is likely to churn, or 0 if the user is likely to stay on the platform.
We might have the following inputs:

- How many posts the bird has clicked through in the past month (we'll call this bird_posts in our input data)

- The geographical location of the bird from the browser headers (bird_geo)

- How many posts the bird has liked over the past month (bird_likes)
Table 2: Tabular Input Data for Flutter Users
bird_id  bird_posts  bird_geo  bird_likes
012      2           US        5
013      0           UK        4
056      57          NZ        70
612      0           UK        120
We start by selecting our model features and arranging them in tabular format. We can formulate this data as a table (which, if we look closely, is also a matrix) based on rows of the bird id and our bird features.
Tabular data is any structured data. For example, for a given Flutter user we have their user id, how many posts they've liked, how old the account is, and so on. This approach works well for what we consider traditional machine learning approaches which deal with tabular data.

As a general rule the creation of the correct formulation of input data is perhaps the heart of machine learning. I.e. if we have bad input, we will get bad output.

So in all cases, we want to spend our time putting together our input dataset and engineering features very carefully.
These are all discrete features that we can feed into our model and learn weights from, which is fairly easy as long as we have numerical features.

But, something important to note here is that, in our bird interaction data, we have both numerical and textual features (bird geography). So what do we do with these textual features? How do we compare "US" to "UK"?
The process of formatting data correctly to feed into a model is called feature engineering. When we have a single continuous, numerical feature, like "the age of the flit in days", it's easy to feed these features into a model.

But, when we have textual data, we need to turn it into numerical representations so that we can compare these representations.

2.5 Numerical Feature Vectors

Within the context of working with text in machine learning, we represent features as numerical vectors. We can think of each row in our tabular feature data as a vector. And a collection of features, or our tabular representation, is a matrix.

For example, in the vector for our first user, [012, 2, 'US', 5], we can see that this particular value is represented by four features.

When we create vectors, we can run mathematical computations over them and use them as inputs into ML models in the numerical form we require.
Mathematically, vectors are collections of coordinates that tell us where a given point is in space among many dimensions. For example, in two dimensions, we have a point $(x_1, x_2)$, representing bird_posts and bird_likes; for the first bird in our table, that point is $(2, 5)$.
In three dimensions, with three features including the bird id, we would have a vector such as $(12, 2, 5)$, which tells us where that user falls on all three axes.
Figure 19: Projecting a vector into the space
But how do we represent "US" or "UK" in this space? Because modern models converge by performing operations on matrices [39], we need to encode geography as some sort of numerical value so that the model can calculate them as inputs. So, once we have a combination of vectors, we can compare it to other points. So in our case, each row of data tells us where to position each bird in relation to any other given bird based on the combination of features.

And that's really what our numerical features allow us to do.
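As a small illustration of rows as vectors and the table as a matrix, the sketch below stores the numerical columns of the bird table in a NumPy array and compares birds by distance. Euclidean distance is an assumption chosen for simplicity; the text does not commit to a particular way of comparing points.

```python
import numpy as np

# Rows are birds, columns are (bird_posts, bird_likes) from the tabular data.
bird_ids = ["012", "013", "056", "612"]
features = np.array([[2, 5], [0, 4], [57, 70], [0, 120]], dtype=float)

# Each row is a vector; the whole table is a matrix with shape (4, 2).
print(features.shape)


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Bird 012 sits much closer to bird 013 than to bird 056 in this space.
print(distance(features[0], features[1]))
print(distance(features[0], features[2]))
```

The textual bird_geo column is left out on purpose: until it is encoded as numbers, it cannot take part in this kind of arithmetic.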

2.6 From Words to Vectors in Three Easy Pieces

In "Operating Systems: Three Easy Pieces", the authors write, "Like any system built by humans, good ideas accumulated in operating systems over time, as engineers learned what was important in their design." [3] Today's large language models were likewise built on hundreds of foundational ideas over the course of decades.

There are, similarly, several fundamental concepts that make up the work of transforming words to numerical representations.
These show up over and over again, in every deep learning architecture and every NLP-related task:

- Encoding - We need to represent our non-numerical, multimodal data as numbers so we can create models out of them. There are many different ways of doing this.

- Vectors - We need a way to store the data we have encoded and have the ability to perform mathematical functions in an optimized way on them. We store encodings as vectors, usually floating-point representations.

- Lookup matrices - Often times, the end-result we are looking for from encoding and embedding approaches is to give some approximation about the shape and format of our text, and we need to be able to quickly go from numerical to word representations across large chunks of text.

So we use lookup tables, also known as hash tables, also known as attention, to help us map between the words and the numbers.
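In its simplest form, the word-to-number mapping described above is a pair of dictionaries. The sketch below is a toy illustration with an invented sentence and whitespace tokenization, both assumptions made for brevity; real tokenizers and vocabularies are considerably more involved.

```python
# A made-up sentence, split on whitespace as a stand-in for real tokenization.
text = "the bird likes the flit"
tokens = text.split()

# Build the vocabulary: each unique word gets an integer id (sorted for determinism).
vocab = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocab)}
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Go from words to numbers and back again.
encoded = [word_to_id[word] for word in tokens]
decoded = [id_to_word[idx] for idx in encoded]

print(encoded)  # [3, 0, 2, 3, 1]
print(decoded)  # ['the', 'bird', 'likes', 'the', 'flit']
```

Both directions are constant-time dictionary lookups, which is what makes this workable across large chunks of text.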
As we go through the historical context of embeddings, we'll build our intuition from encoding to BERT and beyond. What we'll find as we go further into the document is that the explanations for each concept get successively shorter, because we've already done the hard work of understanding the building blocks at the beginning.
Figure 20: Pyramid of fundamental concepts building to BERT

3 Historical Encoding Approaches

Compressing content into lower dimensions for compact numerical representations and calculations is not a new idea. For as long as humans have been overwhelmed by information, we've been trying to synthesize it so that we can make decisions based on it.

Early approaches have included one-hot encoding, TF-IDF, bag-of-words, LSA, and LDA.
The earlier approaches were count-based methods. They focused on counting how many times a word appeared relative to other words and generating encodings based on that.

LDA and LSA can be considered statistical approaches, but they are still concerned with inferring the properties of a dataset through heuristics rather than modeling.

Prediction-based approaches came later and instead learned the properties of a given text through models such as support vector machines, Word2Vec, BERT, and the GPT series of models, all of which use learned embeddings.
Figure 21: Embedding Method Solution Space


A Note on the Code: In looking at these approaches programmatically, we'll start by using scikit-learn, the de facto standard machine learning library for smaller datasets, with some implementations in native Python for clarity in understanding functionality that scikit-learn wraps.

As we move into deep learning, we'll move to PyTorch, a deep learning library that's quickly becoming industry-standard for deep learning implementation.

There are many different ways of implementing the concepts we discuss here; these are just the easiest to illustrate using Python's ML lingua franca libraries.

3.1 Early Approaches

The first approaches to generating textual features were count-based, relying on simple counts or a high-level understanding of statistical properties. They were descriptive rather than predictive: a predictive model attempts to guess a value based on a set of input values.

The first methods were encoding methods, a precursor to embedding. Encoding is often a process that still happens as the first stage of data preparation for input into more complex modeling approaches.

There are several methods to create text features using a process known as encoding so that we can map the geography feature into the vector space:
  • Ordinal encoding
  • Indicator encoding
  • One-Hot encoding
In all these cases, what we are doing is creating a new feature that maps to the text feature column but is a numerical representation of the variable so that we can project it into that space for modeling purposes.

We'll motivate these examples with simple code snippets from scikit-learn, the most common library for demonstrating basic ML concepts. We'll start with count-based approaches.

3.2 Encoding

Ordinal encoding - Let's again come back to our dataset of flits. We encode our data using sequential numbers. For example, "1" is "finch", "2" is "bluejay" and so on. We can use this method only if the variables have a natural ordered relationship to each other. For example, in this case "bluejay" is not "more" than "finch" and so would be incorrectly represented in our model. The case is the same if, in our flit data, we encode "US" as 1 and "UK" as 2.
Table 3: Bird Geographical Location Encoding
bird_id  bird_posts  bird_geo  bird_likes  enc_bird_geo
012      2           US        5           2
013      0           UK        4           1
056      57          NZ        70          0
612      0           UK        120         1
from sklearn.preprocessing import OrdinalEncoder

# our label features: one categorical column, one row per observation
data = [['US'], ['UK'], ['NZ']]
print(data)

# categories are sorted alphabetically, so NZ -> 0, UK -> 1, US -> 2
encoder = OrdinalEncoder()
result = encoder.fit_transform(data)
print(result)
Figure 22: Ordinal Encoding in Scikit-Learn

3.2.1 Indicator and one-hot encoding

Indicator encoding, given $n$ categories (i.e. "US", "UK", and "NZ"), encodes the variables into $n-1$ categories, creating a new feature for each category. So, if we have three variables, indicator encoding encodes into two indicator variables. Why would we do this?

If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, if someone is in the US, we know for sure they're not in the UK and not in NZ, so it reduces computational overhead.
If we instead use all the variables and they are very closely correlated, there is a chance we'll fall into something known as the indicator variable trap. We can predict one variable from the others, which means we no longer have feature independence.

This generally isn't a risk for geolocation, since there are more than two or three categories, and if you're not in the US, it's not guaranteed that you're in the UK. So, if we have US, UK, and NZ, and prefer more compact representations, we can use indicator encoding. However, many modern ML approaches don't require linear feature independence and use L1 regularization to prune feature inputs that don't minimize the error, and as such only use one-hot encoding.
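To make the difference between the two encodings concrete, here is a minimal sketch in plain Python, assuming the same three country categories. The helper names `one_hot` and `indicator` are made up for illustration and are not part of scikit-learn, and the choice of which category indicator encoding drops (here the alphabetically first one, "NZ") is arbitrary.

```python
# Hypothetical helpers for illustration only (not a scikit-learn API).
categories = sorted(["US", "UK", "NZ"])  # ['NZ', 'UK', 'US']


def one_hot(value: str) -> list[int]:
    """One column per category: n categories -> n columns."""
    return [1 if value == c else 0 for c in categories]


def indicator(value: str) -> list[int]:
    """Drop the first category as the reference: n categories -> n - 1 columns."""
    return [1 if value == c else 0 for c in categories[1:]]


for country in ["US", "UK", "NZ"]:
    print(country, one_hot(country), indicator(country))
```

The dropped category is represented by all zeros in the indicator version, which is what removes the redundancy between columns.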
One-hot encoding is the most commonly-used of the count-based methods. This process creates a new variable for each feature that we have. Everywhere the element is present in the sentence, we place a "1" in the vector.

We are creating a mapping of all the elements in the feature space, where 0 indicates a non-match and 1 indicates a match, and comparing how similar those vectors are.
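from sklearn.preprocessing import OneHotEncoder
import numpy as np
enc = OneHotEncoder(handle_unknown='ignore')
data = np.asarray([['US'], ['UK'], ['NZ']])
# learn the categories, which are sorted alphabetically
enc.fit(data)
print(enc.categories_)
# [array(['NZ', 'UK', 'US'], dtype='<U2')]
# one column per category, in the same order as enc.categories_
onehotlabels = enc.transform(data).toarray()
print(onehotlabels)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]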
Figure 23: One-hot encoding in scikit-learn
Table 4: Our one-hot encoded data with labels
bird_id  US  UK  NZ
012      1   0   0
013      0   1   0
056      0   0   1
Now that we've encoded our textual features as vectors, we can feed them into the model we're developing to predict churn.

The function we've been learning will minimize the loss of the model, or the distance between the model's prediction and the actual value, by predicting correct parameters for each of these features.

The learned model will then return a value between 0 and 1 that is the probability that the event, either churn or no-churn, has taken place, given the input features of our particular bird.

Since this is a supervised model, we then evaluate this model for accuracy by feeding our test data into the model and comparing the model's prediction against the actual data, which tells us whether the bird has churned or not.
What we've built is a standard logistic regression model.
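As a rough sketch of what such a logistic regression model computes at prediction time, assuming it has already been trained: the weights and bias below are made-up numbers rather than values learned from real data, and `predict_churn` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up parameters for illustration; a real model learns these by minimizing loss.
# Feature order: bird_posts, bird_likes, US, UK, NZ (one-hot encoded location)
weights = [-0.05, -0.02, 0.3, -0.1, 0.2]
bias = 0.1


def predict_churn(features: list[float]) -> float:
    """Weighted sum of the features plus a bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# A bird with 2 posts and 5 likes, located in the US
probability = predict_churn([2, 5, 1, 0, 0])
print(round(probability, 3))
```

Training amounts to adjusting `weights` and `bias` until the predicted probabilities are as close as possible to the observed churn labels.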

These days the machine learning community has generally converged on using gradient-boosted decision tree methods for dealing with tabular data, but we'll see that neural networks build on simple linear and logistic regression models to generate their output, so it's a good starting point.

Embeddings as larger feature inputs

Once we have encoded our feature data, we can use this input for any type of model that accepts tabular features. In our machine learning task, we were looking for output that indicated whether a bird was likely to leave the platform based on their location and some usage data.

Now, we'd like to focus specifically on surfacing flits that are similar to other flits the user has already interacted with, so we'll need feature representations of either our users or our content.
Let's go back to the original business question we posed at the beginning of this document: how do we recommend interesting new content for Flutter users, given that we know what past content they consumed (i.e. liked and shared)?
In the traditional collaborative filtering approach to recommendations, we start by constructing a user-item matrix based on our input data that, when factored, gives us the latent properties of each flit and allows us to recommend similar ones.
In our case, we have Flutter users who might have liked a given flit. What other flits would we recommend given the textual properties of that one?
Here's an example. We have a flit that our bird users liked.
"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."
We also have other flits we may or may not want to surface in our bird's feed.
"No bird soars too high if he soars with his own wings."
"A bird does not sing because it has an answer, it sings because it has a song."
How would we turn this into a machine learning problem that takes features as input and a prediction as an output, knowing what we know about how to do this already?

First, in order to build this matrix, we need to turn each word into a feature that's a column value and each user remains a row value.
The best way to think of the difference between tabular and free-form representations as model inputs is that a row of tabular data looks like this, [012, 2, "US", 5], and a "row" or document of text data looks like this, ["No bird soars too high if he soars with his own wings."] In both cases, each of these are vectors, or a list of values that represents a single bird.
In traditional machine learning, rows are our user data about a single bird and columns are features about the bird. In recommendation systems, our rows are the individual data about each user, and our column data represents the given data about each flit.

If we can factor this matrix, that is, decompose it into two matrices (Q and P^T) whose product is our original matrix (R), we can learn the "latent factors" or features that allow us to group similar users and items together to recommend them.
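The idea of two factor matrices whose product is the user-item matrix can be sketched in a few lines of plain Python. The numbers below are made up for illustration, and in practice the factors are learned from the observed matrix rather than written by hand.

```python
# Made-up embeddings for illustration: 3 users and 3 flits, 2 latent factors each.
user_factors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per user
flit_factors = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # one row per flit


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Multiplying the user factors by the transposed flit factors rebuilds the
# user-item matrix: each cell is the dot product of one user row and one flit row.
user_item = [[dot(u, f) for f in flit_factors] for u in user_factors]
for row in user_item:
    print(row)
```

Each cell is a score for one user and one flit, including pairs for which no interaction was ever observed, which is what makes the factorization useful for recommendations.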
Another way to think about this is that in traditional ML, we have to actively engineer features, but they are then available to us as matrices.

In text and deep-learning approaches, we don't need to do feature engineering, but need to perform the extra step of generating valuable numeric features anyway.
The factorization of our feature matrix into these two matrices, where the rows in Q are actually embeddings [43] for users and the rows in matrix P are embeddings for flits, allows us to fill in values for flits that Flutter users have not explicitly liked, and then perform a search across the matrix to find other words they might be interested in.

The end-result is our generated recommendation candidates, which we then filter downstream and surface to the user because the core of the recommendation problem is to recommend items to the user.
In this base-case scenario, each column could be a single word in the entire vocabulary of every flit we have, and the vector we create, shown in the matrix frequency table, would be an insanely large, sparse vector that has a 0 for every word in our vocabulary that does not occur in the flit.

The way we can build toward this representation is to start with a structure known as a bag of words, or simply the frequency of appearance of text in a given document (in our case, each flit is a document.) This matrix is the input data structure for many of the early approaches to embedding.
In scikit-learn, we can create an initial matrix of our inputs across documents using 'CountVectorizer'.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# each flit is one document
flits = [
    "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.",
    "No bird soars too high if he soars with his own wings.",
    "A bird does not sing because it has an answer, it sings because it has a song.",
]
# binary=True records presence (1) or absence (0) rather than raw counts
vect = CountVectorizer(binary=True)
vects = vect.fit_transform(flits)
# rows are flits and columns are vocabulary terms
td = pd.DataFrame(vects.toarray(), columns=vect.get_feature_names_out())
# transpose so that terms are rows and flits are columns
term_document_matrix = td.T
term_document_matrix.columns = ['flit_' + str(i) for i in range(1, 4)]
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)
print(term_document_matrix.drop(columns=['total_count']).head(10))
#          flit_1  flit_2  flit_3
# an            0       0       1
# answer        0       0       1
# because       0       0       1
# bird          1       1       1
# broken        1       0       0
# cannot        1       0       0
# die           1       0       0
# does          0       0       1
# dreams        1       0       0
# fast          1       0       0
Figure 24: Creating a matrix frequency table to create a user-item matrix

3.2.2 TF-IDF

One-hot encoding just deals with presence and absence of a single term in a single document. However, when we have large amounts of data, we'd like to consider the weights of each term in relation to all the other terms in a collection of documents.
To address the limitations of one-hot encoding, TF-IDF, or term frequency-inverse document frequency, was developed. TF-IDF was introduced in the 1970s as a way to create a vector representation of a document by averaging all the document's word weights. It worked really well for a long time and still does in many cases.

For example, one of the most-used search functions, BM25, uses TF-IDF as a baseline [56] and is the default search strategy in Elasticsearch/OpenSearch. It extends TF-IDF to develop a probability of relevance for each pair of words in a document, and it is still being applied in neural search today [65].
TF-IDF will tell you how important a single word is in a corpus by assigning it a weight while, at the same time, down-weighting common words like "a", "and", and "the".

This calculated weight gives us a TF-IDF feature for a single word, and also the relevance of the features across the vocabulary.
We take all of our input data that's structured in sentences and break it up into individual words, and perform counts on its values, generating the bag of words.

TF is term frequency, or the number of times a term appears in a document relative to the other terms in the document.
And IDF is the inverse frequency of the term across all documents in our vocabulary.
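Written out, one common unsmoothed form of these two quantities looks like the following. The symbols are notation chosen here for illustration: $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$. Other variants add smoothing terms or use a different logarithm base.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10} \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

For example, a term that appears twice in a 16-word document, and in only one of two documents overall, gets a weight of $\frac{2}{16} \cdot \log_{10} 2 \approx 0.0376$.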
Let's take a look at how to implement it from scratch:
import math
import pandas as pd
# Process documents into individual words
document_a = ['Hold', 'fast', 'to', 'dreams', 'for', 'if', 'dreams', 'die',
              'life', 'is', 'a', 'broken-winged', 'bird', 'that', 'cannot', 'fly']
document_b = ['No', 'bird', 'soars', 'too', 'high', 'if',
              'he', 'soars', 'with', 'his', 'own', 'wings']
# dicts are frequency counts of every corpus word per doc (0 if the word is absent)
corpus = set(document_a) | set(document_b)
dict_a = dict.fromkeys(corpus, 0)
dict_b = dict.fromkeys(corpus, 0)
for word in document_a:
    dict_a[word] += 1
for word in document_b:
    dict_b[word] += 1
def tf(doc_dict: dict, doc_elements: list[str]) -> dict:
    """Term frequency of a word in a document over total words in document"""
    tf_dict = {}
    doc_length = len(doc_elements)
    for word, count in doc_dict.items():
        tf_dict[word] = count / float(doc_length)
    return tf_dict
def idf(doc_list: list[dict]) -> dict:
    """Log of the total number of documents over the number of documents containing each term"""
    N = len(doc_list)
    idf_dict = dict.fromkeys(doc_list[0].keys(), 0)
    # count the number of documents in which each term appears
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                idf_dict[word] += 1
    for word, doc_freq in idf_dict.items():
        idf_dict[word] = math.log10(N / float(doc_freq))
    return idf_dict
# inverse document frequencies for all words
idfs = idf([dict_a, dict_b])
def tfidf(tf_dict: dict, idfs: dict) -> dict:
    """TF * IDF per word, given its term frequency and inverse document frequency"""
    tfidf_dict = {}
    for word, val in tf_dict.items():
        tfidf_dict[word] = val * idfs[word]
    return tfidf_dict
# Calculate the term frequency for each document individually
tf_a = tf(dict_a, document_a)
tf_b = tf(dict_b, document_b)
# Weight each term frequency by the inverse document frequency
tfidf_a = tfidf(tf_a, idfs)
tfidf_b = tfidf(tf_b, idfs)
# Return weight of each word in each document wrt to the total corpus
document_tfidf = pd.DataFrame([tfidf_a, tfidf_b], index=['doc 0', 'doc 1'])
print(document_tfidf.T)
# selected rows:
#            doc 0     doc 1
# a       0.018814  0.000000
# dreams  0.037629  0.000000
# No      0.000000  0.025086
Once we understand the underlying fundamental concept, we can use the scikit-learn implementation which does the same thing, and also surfaces the TF-IDF of each word in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
    "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.",
    "No bird soars too high if he soars with his own wings.",
]
# langston hughes and william blake
text_titles = ["quote_lh", "quote_wb"]
vectorizer = TfidfVectorizer()
vector = vectorizer.fit_transform(corpus)
# TF-IDF weight of each vocabulary word in the first quote
dict(zip(vectorizer.get_feature_names_out(), vector.toarray()[0]))
tfidf_df = pd.DataFrame(vector.toarray(), index=text_titles,
                        columns=vectorizer.get_feature_names_out())
# number of documents in which each word appears
tfidf_df.loc['doc_freq'] = (tfidf_df > 0).sum()
# How common or unique a word is in a given document wrt to the vocabulary
print(tfidf_df.T)
# first rows:
#         quote_lh  quote_wb  doc_freq
# bird    0.172503  0.197242       2.0
# broken  0.242447  0.000000       1.0
# cannot  0.242447  0.000000       1.0
# die     0.242447  0.000000       1.0
Figure 26: Implementation of TF-IDF in scikit-learn
Given that inverse document frequency is a measure of whether the word is common or not across the documents, we can see that "dreams" is important because it is rare across the documents and therefore more interesting to us than "bird." We see that the tf-idf for a given word, "dreams", is slightly different for each of these implementations, and that's because scikit-learn normalizes the denominator and uses a slightly different formula.

You'll also note that in the first implementation we separate the corpus words ourselves, don't remove any stop words, and don't lowercase everything. Many of these steps are done automatically in scikit-learn or can be set as parameters into the processing pipeline.

We'll see later that these are critical NLP steps that we perform each time we work with text.
TF-IDF enforces several important ordering rules on our text corpus:
  • Uprank term frequency when it occurs many times in a small number of documents
  • Downrank term frequency when it occurs many times in many documents, i.e. when it is not relevant
  • Really downrank the term when it appears across your entire document base [56].
There are numerous ways to calculate and create weights for individual words in TF-IDF. In each case, we calculate a score for each word that tells us how important that word is in relation to each other word in our corpus, which gives it a weight.

Once we figure out how common each word is in the set of all possible flits, we can get a weighted score for the entire sentence in relation to other sentences.
Generally, when we work with textual representations, we're trying to understand which words, phrases, or concepts are similar to each other.

Within our specific recommendations task, we are trying to understand which pieces of content are similar to each other, so that we can recommend content that users will like based on either their item history or the user history of users similar to them.
So, when we perform embedding in the context of recommender systems, we are looking to create neighborhoods from items and users, based on the activity of those users on our platform.

This is the initial solution to the problem of "how do we recommend flits that are similar to a flit that the user has liked?" This is the process of collaborative filtering.
There are many approaches to collaborative filtering, including a neighborhood-based approach, which looks at weighted averages of user ratings and computes cosine similarity between users. It then finds groups, or neighborhoods, of users who are similar to each other.
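A heavily simplified sketch of the neighborhood idea is shown below, assuming a tiny made-up ratings table. The user names, ratings, and the `nearest_neighbor` helper are all hypothetical, and a real neighborhood method would also weight ratings and handle missing values more carefully.

```python
import math

# Made-up ratings for illustration: one list per user, one position per flit
# (0 means the user has not rated that flit).
ratings = {
    "bird_a": [5, 4, 0, 1],
    "bird_b": [4, 5, 0, 2],
    "bird_c": [0, 1, 5, 4],
}


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(user: str) -> str:
    """Return the other user whose rating vector is most similar in direction."""
    others = [name for name in ratings if name != user]
    return max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))


print(nearest_neighbor("bird_a"))
```

Flits that the nearest neighbor liked but the user has not yet seen then become recommendation candidates.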
A fundamental problem in collaborative filtering, and in recommendation systems in general, is finding similar sets of items among very large collections [42].
Mathematically, we can do this by looking at the distance metric between any two given sets of items, and there are a number of different approaches, including Euclidean distance, edit distance (more specifically, Levenshtein distance and Hamming distance), cosine distance, and more advanced compression approaches like minhashing.
The most commonly used approach in most models where we're trying to ascertain the semantic closeness of two items is cosine similarity, which is the cosine of the angle between two objects represented as vectors, bounded between -1 and 1.

-1 means the two items are completely "opposite" of each other and 1 means they are completely the same item, assuming unit length.

Zero means that you should probably use a distance measure other than cosine similarity because the vectors are completely orthogonal to each other. One point of clarification here is that cosine distance is the actual distance measure and is calculated as 1 - cos(θ), where θ is the angle between the two vectors.
We use cosine similarity over other measures like Euclidean distance for large text corpuses, for example, because in very large, sparse spaces, the direction of the vectors is just as important as, and even more important than, the actual values.
The higher the cosine similarity is for two words or documents, the more similar they are. We can use TF-IDF as a way to look at cosine similarity.

Once we've given each of our words a tf-idf score, we can also assign a vector to each word in our sentence, and create a vector out of each quote to assess how similar they are.
Figure 27: Illustration of cosine similarity between bird and wings vectors.
Let's take a look at the actual equation for cosine similarity. We start with the dot product between two vectors, which is just the sum of each value multiplied by the corresponding value in our second vector, and then we divide by the product of the two vectors' norms (their lengths).
v1 = [0,3,4,5,6]
v2 = [4,5,6,7,8]
def dot(v1, v2):
    dot_product = sum((a * b) for a,b in zip(v1,v2))
    return dot_product
def cosine_similarity(v1, v2):
    """(v1 dot v2) / (||v1|| * ||v2||)"""
    products = dot(v1,v2)
    # the norm of a vector is the square root of its dot product with itself
    denominator = (dot(v1,v1) ** 0.5) * (dot(v2,v2) ** 0.5)
    similarity = products / denominator
    return similarity
print(cosine_similarity(v1, v2))
# 0.9544074144996451
Figure 28: Implementation of cosine similarity from scratch
Or, once again, in scikit-learn, as a pairwise metric:
from sklearn.metrics import pairwise
v1 = [0,3,4,5,6]
v2 = [4,5,6,7,8]
# inputs need to be 2D array-likes, one row per vector
print(pairwise.cosine_similarity([v1], [v2]))
# [[0.95440741]]
Figure 29: Implementation of cosine similarity in scikit-learn
Other commonly-used distance measures in semantic similarity and recommendations include:
  • Euclidean distance - Calculates the straight-line distance between two points
  • Manhattan distance - Measures the distance between two points by summing the absolute differences of their coordinates
  • Jaccard distance - Computes the dissimilarity between two sets as 1 minus the size of their intersection divided by the size of their union
  • Hamming distance - Measures the dissimilarity between two strings by counting the positions in which they differ
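The four measures in this list are short enough to sketch in plain Python. These are minimal textbook versions written for illustration, not the implementations used by any particular library.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences of the coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the ratio of intersection size to union size."""
    return 1 - len(a & b) / len(a | b)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
print(jaccard_distance({"bird", "wings"}, {"bird", "song"}))
print(hamming("bird", "bard"))    # 1
```

Euclidean and Manhattan distance operate on numeric vectors, while Jaccard and Hamming distance work on sets and strings, so the right choice depends on how the items are represented.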

3.2.3 SVD and PCA

There is a problem with the vectors we created in one-hot encoding and TF-IDF: they are sparse. A sparse vector is one that is mostly populated by zeroes. They are sparse because most sentences don't contain all the same words as other sentences.

For example, across our flits, we might encounter the word "bird" in two sentences, but the rest of the words will be completely different.
sparse_vector = [1,0,0,0,0,0,0,0,0,0]
dense_vector = [1,2,2,3,0,4,5,8,8,5]
Figure 30: Two types of vectors in text processing
Sparse vectors result in a number of problems, among them cold start: the idea that we don't know how to recommend items that haven't been interacted with, or what to recommend to users who are new.

What we'd like, instead, is to create dense vectors, which will give us more information about the data, the most important of which is accounting for the weight of a given word in proportion to other words.

This is where we leave one-hot encodings and TF-IDF to move into approaches that are meant to solve for this sparsity. Dense vectors are just vectors that have mostly non-zero values. We call these dense representations dynamic representations [68].
Several other related early approaches were used in lieu of TF-IDF for creating compact representations of items: principal components analysis (PCA) and singular value decomposition (SVD).
SVD and PCA are both dimensionality reduction techniques that, applied through matrix transformations to our original text input data, show us the latent relationship between two items by breaking items down into latent components.
SVD is a type of matrix factorization that represents a given input feature matrix as the product of three matrices.

It then uses the component matrices to create linear combinations of features that are the largest differences from each other and which are directionally different based on the variance of the clusters of points from a given line.

Those clusters represent the "feature clusters" of the compressed features.
In the process of performing SVD and decomposing these matrices, we generate a matrix representation that includes the eigenvectors and eigenvalue pairs or the sample covariance pairs.
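As a small illustration of the "product of three matrices" idea, the sketch below factors a made-up term-document matrix with NumPy's `np.linalg.svd`. The matrix values are invented for the example, and keeping only the largest singular value is just one simple way to show the compression step.

```python
import numpy as np

# A small made-up term-document matrix: 4 terms (rows) x 3 flits (columns)
A = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Factor A into three pieces: U, the singular values S, and V transposed
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three matrices back together recovers the original matrix
reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, reconstructed))  # True

# Keeping only the largest singular value gives a compressed rank-1 approximation
rank_1 = S[0] * np.outer(U[:, 0], Vt[0, :])
print(rank_1.round(2))
```

The singular values come back sorted from largest to smallest, so truncating them keeps the directions that account for the most variation in the data.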
PCA uses the same initial input feature matrix, but whereas one-hot encoding simply converts the text features into numerical features that we can work with, PCA also performs compression and projects our items into a two-dimensional feature space.

The first principal component is the scaled eigenvector of the data, the weights of the variables that describe your data best, and the second is the weights of the next set of variables that describe your data best.
The resulting model is a projection of all the words, clustered into a single space based on these dimensions.

While we can't get individual meanings of all these components, it's clear that the clusters of words, aka features, are semantically similar, that is, they are close to each other in meaning.
The difference between the two is often confusing (people admitted as much in the 80s [21] when these approaches were still being worked out), and for the purposes of this survey paper we'll say that PCA can often be implemented using SVD.
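One way to see how PCA can be implemented using SVD is the sketch below, which assumes a tiny made-up feature matrix. It centers the data, runs NumPy's SVD, and projects each item onto the first two principal directions. This is an illustrative outline, not a replacement for a library implementation.

```python
import numpy as np

# Made-up data for illustration: 5 items described by 3 features
X = np.array([
    [2.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
    [4.0, 0.0, 2.0],
    [5.0, 2.0, 1.0],
    [6.0, 1.0, 3.0],
])

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project each item onto the first two principal components
X_2d = X_centered @ Vt[:2].T

# Variance captured by each component
explained_variance = S ** 2 / (len(X) - 1)
print(X_2d.shape)  # (5, 2)
print(explained_variance)
```

The projected coordinates are the same as the first two columns of `U * S`, which is why the two techniques are so often discussed together.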

3.3 LDA and LSA

Because PCA performs computation on each combination of features to generate the two dimensions, it becomes immensely computationally expensive as the number of features grows.

Many of these early methods, like PCA, worked well for smaller datasets, like many of the ones used in traditional NLP research, but as datasets continued to grow, they didn't quite scale.
Other approaches grew out of TF-IDF and PCA to address their limitations, including latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) [12]. Both of these approaches start with the input document matrix that we built in the last section.

The underlying principle behind both of these models is that words that occur close together more frequently have more important relationships.

LSA uses the same word weighting that we used for TF-IDF and looks to combine that matrix into a lower rank matrix, a cosine similarity matrix. In the matrix, the values for the cells range from -1 to 1, where -1 represents documents that are complete opposites and 1 means the documents are identical. LSA then runs over the matrix and groups items together.
LDA takes a slightly different approach. Although it uses the same matrix for input, it instead outputs a matrix where the rows are words and columns are documents.

The distance measure, instead of cosine similarity, is the numerical value for the topic that the intersection of the word and document provides.

The assumption is that any sentence we input will contain a collection of topics, based on proportions of representation in relation to the input corpus, and that there are a number of topics that we can use to classify a given sentence.

We initialize the algorithm by assuming that there is a non-zero probability that each word could appear in a topic.

LDA initially assigns words to topics at random, and then iterates until it converges to a point where it maximizes the probability for assigning a current word to a current topic.

In order to do the word-to-topic mapping, LDA generates an embedding that creates a space of clusters of words or sentences that work together semantically.

3.4 Limitations of traditional approaches

All of these traditional methods look to address, in various ways, the problem of generating relationships between items in our corpus in the latent space, that is, the relationships between words that are not explicitly stated but that we can tease out based on how we model the data.
However, in all these cases, as our corpus starts to grow, we start to run into two problems: the curse of dimensionality and compute scale.

3.4.1 The curse of dimensionality

As we one-hot encode more features, our tabular data set grows. Going back to our churn model, what happens once we have 181 instead of two or three countries? We'll have to encode each of them into their own vector representations.

What happens if we have millions of vocabulary words, for example thousands of birds posting millions of messages every day? Our sparse matrix for tf-idf becomes computationally intensive to factor.
Whereas our input vectors for tabular machine learning and naive text approaches have only three entries because we only use three features, multimodal data effectively has a dimensionality of the number of written words in existence, and image data has a dimensionality of height times width in pixels for each given image.

Video and audio data have similar exponential properties. We can profile the performance of any code we write using Big O notation, which will classify an algorithm's runtime.

There are programs that perform worse and those that perform better based on the number of elements the program processes. This means that one-hot encodings, in terms of computing performance, are O(n) in the worst case complexity. So, if our text is a corpus of a million unique words, we'll get to a million columns, or vectors, each of which will be sparse, since most sentences will not contain the words of other sentences.
Let's take a more concrete case.

Even in our simple case of our initial bird quote, we have 28 features, one for each word in the sentence, assuming we don't remove and process the most common stop words - extremely common words like "the", "who", and "is" that appear in most texts but don't add semantic meaning.

How can we create a model that has 28 features? That's fairly simple if tedious - we encode each word as a numerical value.
Table 5: One-hot encoding and the growing curse of dimensionality for our flit
flit_id  bird_id  hold  fast  dreams  die  life  bird
9823420  012      1     1     1       1    1     1
9823421  013      1     0     0       0    0     1
Not only will it be hard to run computations over a linearly increasing set; once we start generating a large number of features (columns), we also start running into the curse of dimensionality. This means that the more features we accumulate, the more data we need in order to say anything about them with statistical confidence. The result is models that may not accurately represent our data [29] if we have extremely sparse features, which is generally the case in user/item interactions in recommendations.
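To get a feel for how quickly the number of columns grows and how sparse the result is, here is a small sketch over three bird quotes. The `tokenize` helper is a deliberately naive assumption made for illustration (it lowercases, strips punctuation, and splits on whitespace), so its word counts will differ from those of a real NLP tokenizer.

```python
flits = [
    "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.",
    "No bird soars too high if he soars with his own wings.",
    "A bird does not sing because it has an answer, it sings because it has a song.",
]


def tokenize(text: str) -> list[str]:
    """Naive tokenizer for illustration: lowercase, drop punctuation, split on spaces."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


# Every new unique word adds one more one-hot column
vocabulary: set[str] = set()
for number, flit in enumerate(flits, start=1):
    vocabulary |= set(tokenize(flit))
    print(f"after {number} flit(s): {len(vocabulary)} one-hot columns")

# Share of cells that are zero in the flit-by-word one-hot matrix
nonzero = sum(len(set(tokenize(flit))) for flit in flits)
sparsity = 1 - nonzero / (len(flits) * len(vocabulary))
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents, most cells in the matrix are already zero, and the proportion of zeros keeps rising as more documents with new words are added.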

3.4.2 Computational complexity
3.4.2 计算复杂度

In production machine learning systems, the statistical properties of our algorithm are important. But just as critical is how quickly our model returns data, or the system's efficiency.

System efficiency can be measured in many ways, and it is critical in any well-performing system to find the performance bottleneck that leads to latency, or the time spent waiting before an operation is performed [26].

If you have a recommendation system in production, you cannot risk showing the user an empty feed or a feed that takes more than a few milliseconds to render.

If you have a search system, you cannot risk the results taking more than a few milliseconds to return, particularly in ecommerce settings [2].

From the holistic systems perspective then, we can also have latency in how long it takes to generate data for a model, read in data, and train the model.
The two big drivers of latency are:
  • I/O processing - We can only send as many items over the network as our network speed allows
  • CPU processing - We can only process as many items as we have memory available to us in any given system
Generally, TF-IDF performs well in terms of identifying key terms in the document.

However, since the algorithm processes all the elements in a given corpus, the time complexity grows for both the numerator and the denominator in the equation and overall, the time-complexity of computing the TF-IDF weights for all the terms in all the documents is , where is the total number of terms in the corpus and is the number of documents in the corpus. Additionally, because TF-IDF creates a matrix as output, what we end up doing is processing enormous state matrices. For example, if you have documents and need to store frequency counts and features for the top five thousand words appearing in those documents, we get a matrix of size . This complexity only grows.
由于算法处理了给定语料库中的所有元素,因此分子和分母的时间复杂度都会增长。总体而言,计算所有文档中所有词项的 TF-IDF 权重的时 间复杂度为 ,其中 是语料库中词项总数, 是语料库中文件数目。此外,由于 TF-IDF 创建矩阵 作为输出,我们最终处理的是巨大的状态矩阵。例如,如果您有 个文档,并且需要存储频率计数和前五千个词的特征, 那么我们得到一个大小为 的矩阵,这个复杂度只会增加。
This linear time complexity growth becomes an issue when we're trying to process millions or hundreds of millions of tokens - usually a synonym for words but can also be sub-words such as syllables.
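To make the matrix-size concern concrete, here is a rough back-of-the-envelope sketch. The document count and the 8-byte cell size are assumptions chosen for illustration, not figures from the text; only the five-thousand-word vocabulary comes from the example above.

```python
# Rough memory estimate for a dense document-term matrix.
# n_docs and bytes_per_cell are illustrative assumptions.

def dense_matrix_bytes(n_docs, vocab_size, bytes_per_cell=8):
    """Bytes needed to hold one number per (document, term) cell."""
    return n_docs * vocab_size * bytes_per_cell

n_docs = 100_000      # hypothetical corpus size
vocab_size = 5_000    # the "top five thousand words"
total = dense_matrix_bytes(n_docs, vocab_size)
print(f"{n_docs} x {vocab_size} cells -> {total / 1e9:.1f} GB as 64-bit floats")
```

Sparse storage formats keep only the non-zero cells, which eases the memory pressure, but every term in every document still has to be counted.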

This is a problem that became especially prevalent as, over time in industry, storage became cheap.
From newsgroups to emails, and finally, to public internet text, we began to generate a lot of digital exhaust, and companies collected it in the form of append-only logs [36]: sequences of records ordered by time and configured to continuously append new records.
Companies started emitting, keeping, and using these endless log streams for data analysis and machine learning. All of a sudden, the algorithms that had worked well on a collection of less than a million documents struggled to keep up.
Capturing log data at scale began the rise of the Big Data era, which resulted in a great deal of variety, velocity, and volume of data movement.

The rise in data volumes coincided with data storage becoming much cheaper, enabling companies to store everything they collected on racks of commodity hardware.
Companies were already retaining analytical data needed to run critical business operations in relational databases, but access to that data was structured and processed in batch increments on a daily or weekly basis.

This new logfile data moved quickly, and with a level of variety absent from traditional databases.
The resulting corpuses for NLP, search, and recommendation problems also exploded in size, leading people to look for more performant solutions.

3.5 Support Vector Machines

The first modeling approaches were shallow models - models that perform machine learning tasks using only one layer of weights and biases [9].

Support vector machines (SVM), developed at Bell Laboratories in the mid-1990s, were used in high-dimensional spaces for NLP tasks like text categorization [32].

SVMs separate data clusters into points that are linearly separable by a hyperplane, a decision boundary that separates elements into separate classes.

In a two-dimensional vector space, the hyperplane is a line; in a space of three or more dimensions, the separator is itself multi-dimensional.
The goal of the SVM is to find the optimal hyperplane: the one that maximizes the distance between the plane and the nearest elements, so that new objects (words in our case) projected into the space have less chance of being misclassified.
Figure 31: Example of points in the vector space in an SVM separated by a hyperplane
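As a small illustration of the idea, the sketch below evaluates a hyperplane that is already given rather than fitting one, so it is not an SVM training routine. The weights, bias, and points are hypothetical values picked for the example.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

def classify(w, b, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

# Hypothetical 2-D example: the separating line x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
points = [([0.0, 1.0], -1), ([1.0, 0.0], -1), ([3.0, 2.0], 1), ([2.0, 3.0], 1)]

# The margin is the distance from the hyperplane to the closest point.
margin = min(abs(signed_distance(w, b, x)) for x, _ in points)
predictions = [classify(w, b, x) for x, _ in points]
print(predictions, round(margin, 3))
```

An SVM searches over candidate values of `w` and `b` for the pair that makes this margin as large as possible.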
Examples of supervised machine learning tasks performed with SVMs included next word prediction, predicting the missing word in a given sequence, and predicting words that occur in a window.

As an example, the classical word embedding inference task is autocorrect when we're typing on our phones. We type a word, and it's the job of the autocorrect to predict the correct word based on both the word itself and the surrounding context in the sentence.

It therefore needs to learn a vocabulary of embeddings that will give it probabilities that it is selecting the correct word.
However, as in other cases, when we reach high dimensions, SVMs completely fail to work with sparse data because they rely on computing distances between points to determine the decision boundaries.

Because in our sparse vector representations of elements most of the distances are zero, the hyperplane will fail to cleanly separate the boundaries and classify words incorrectly.

3.6 Word2Vec

To get around the limitations of earlier textual approaches and keep up with the growing size of text corpuses, in 2013, researchers at Google came up with an elegant solution to this problem using neural networks, called Word2Vec [47].
So far, we've moved from simple heuristics like one-hot encoding, to machine learning approaches like LSA and LDA that look to learn a dataset's modeled features.

Previously, like our original one-hot encodings, all the approaches to embedding focused on generating sparse vectors that can give an indication that two words are related, but not that there is a semantic relationship between them.

For example, "The dog chased the cat" and "the cat chased the dog" would have the same distance in the vector space, even though they're two completely different sentences.
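A minimal sketch of why this happens with count-based representations: once word order is discarded, both sentences reduce to the same counts. The helper name `bag_of_words` is ours, used only for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercased words, discarding word order."""
    return Counter(sentence.lower().split())

a = bag_of_words("The dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a == b)
```

Any distance computed from these two count vectors is zero, even though the sentences describe opposite events.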
Word2Vec is a family of models that has several implementations, each of which focuses on transforming the entire input dataset into vector representations and, more importantly, on capturing not only the inherent labels of individual words, but the relationships between those representations.
There are two modeling approaches to Word2Vec - continuous bag of words (CBOW) and skipgrams, both of which generate dense vectors of embeddings but model the problem slightly differently.

The end-goal of the Word2Vec model in either case is to learn the parameters that maximize the probability of a given word or group of words being an accurate prediction [23].
In training skipgrams, we take a word from the initial input corpus and predict the probability that a given set of words surround it.

In the case of our initial flit quote, "Hold fast to dreams for if dreams die, life is a broken-winged bird that cannot fly", the model's intermediate steps generate a set of embeddings that represent the distances between all the words in the dataset, and fill in the next several probabilities for the entire phrase, using the word "fast" as input.
Figure 32: Word2Vec Architecture
In training CBOW, we do the opposite: we remove a word from the middle of a phrase known as the context window and train a model to predict the probability that a given word fills the blank, shown in the equation below where we attempt to maximize.
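One common way to write the quantity being maximized is sketched here. The notation is an assumption made for illustration rather than taken from the text: $w_t$ is the word at position $t$ in a corpus of $T$ words, $m$ is the number of context words on each side, and $\theta$ is the set of model parameters.

```latex
\hat{\theta} = \arg\max_{\theta} \prod_{t=1}^{T}
  p\left(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m};\ \theta\right)
```

In practice the product is usually replaced by a sum of log-probabilities, which has the same maximizer and is numerically easier to work with.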
If we optimize these parameters - theta - and maximize the probability that the word belongs in the sentence, we'll learn good embeddings for our input corpus.
Let's focus on a detailed implementation of CBOW to better understand how this works. This time, for the code portion, we'll move on from scikit-learn, which works great for smaller data, to PyTorch for neural net operations.
At a high level, we have a list of input words that are processed through a second layer, the embedding layer, and then through the output layer, which is just a linear model that returns probabilities.
Figure 33: Word2Vec CBOW Neural Network architecture
We'll run this implementation in PyTorch, the popular library for building neural network models.

The best way to implement Word2Vec, especially if you're dealing with smaller datasets, is using Gensim, but Gensim abstracts away the layers into inner classes, which makes for a fantastic user experience.

But, since we're just learning about them, we'd like to see a bit more explicitly how they work, and PyTorch, although it does not have a native implementation of Word2Vec, lets us see the inner workings a bit more clearly.
To model our problem in PyTorch, we'll use the same approach as with any problem in machine learning:
  • Inspect and clean our input data.
  • Build the layers of our model. (For traditional ML, we'll have only one)
  • Feed the input data into the model and track the loss curve
  • Retrieve the trained model artifact and use it to make predictions on new items that we analyze
Figure 34: Steps for creating Word2Vec model
Let's start from our input data. In this case, our corpus is all of the flits we've collected. We first need to process them as input into our model.
responses = [
    "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.",
    "No bird soars too high if he soars with his own wings.",
    "A bird does not sing because it has an answer, it sings because it has a song.",
]
Figure 35: Our Word2Vec input dataset
Let's start with our input training data, which is our list of flits. To prepare input data for PyTorch, we can use the DataLoader or Vocab classes, which split our text into tokens - smaller, word-level representations of each sentence - for processing.

For each line in the file, we generate tokens by splitting each line into single words, removing whitespace and punctuation, and lowercasing each individual word.
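A minimal sketch of that per-line step, using only the standard library. The function name `tokenize` and the decision to keep hyphens inside words are assumptions made for this example.

```python
import re

def tokenize(line):
    """Lowercase a line, drop punctuation, and split on whitespace."""
    line = line.lower()
    line = re.sub(r"[^\w\s-]", "", line)  # keep letters, digits, underscores, hyphens
    return line.split()

tokens = tokenize(
    "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."
)
print(tokens)
```

Real pipelines usually make more careful choices about hyphens, apostrophes, and Unicode, which is part of why this step takes time to get right.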
This kind of processing pipeline is extremely common in NLP and spending time to get this step right is extremely critical so that we get clean, correct input data. It typically includes [48]:
  • Tokenization - transforming a sentence or a word into its component characters by splitting it
  • Removing noise - including URLs, punctuation, and anything else in the text that is not relevant to the task at hand
  • Word segmentation - splitting our sentences into individual words
  • Correcting spelling mistakes
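from torchtext.vocab import Vocab, build_vocab_from_iterator

class TextPreProcessor:
    def __init__(self) -> None:
        # input_file is assumed to be defined at module level (path to the corpus file)
        self.input_file = input_file

    def generate_tokens(self):
        # Yield one list of whitespace-separated tokens per line of the file
        with open(self.input_file, encoding="utf-8") as f:
            for line in f:
                line = line.replace("\\", "")
                yield line.strip().split()

    def build_vocab(self) -> Vocab:
        # Keep tokens seen at least min_freq times; <unk> is reserved as a special token
        vocab = build_vocab_from_iterator(
            self.generate_tokens(), specials=["<unk>"], min_freq=100
        )
        return vocab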
Figure 36: Processing our input vocabulary and building a Vocabulary object from our dataset in PyTorch source
Now that we have an input vocabulary object we can work with, the next step is to create one-hot encodings of each word to a numerical position, and each position back to a word, so that we can easily reference both our words and vectors.

The goal is to be able to map back and forth when we do lookups and retrieval.
This occurs in the Embedding layer. Within the Embedding layer of PyTorch, we initialize an Embedding matrix based on the size we specify and size of our vocabulary, and the layer indexes the vocabulary into a dictionary for retrieval. The embedding layer is a lookup table that matches a word to the corresponding word vector on an index by index basis. Initially, we create our one-hot encoded word to term dictionary. Then, we create a mapping of each word to a dictionary entry and a dictionary entry to each word. This is known as bijection.
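The two-way mapping can be sketched with plain dictionaries. The names mirror the `word_to_ix` and `ix_to_word` attributes that appear in the model code, but this helper is a hypothetical stand-in, not the library's implementation.

```python
def build_vocab(tokens):
    """Return (word_to_ix, ix_to_word) for the unique tokens, in first-seen order."""
    word_to_ix = {}
    for tok in tokens:
        if tok not in word_to_ix:
            word_to_ix[tok] = len(word_to_ix)
    ix_to_word = {ix: word for word, ix in word_to_ix.items()}
    return word_to_ix, ix_to_word

tokens = "no bird soars too high if he soars with his own wings".split()
word_to_ix, ix_to_word = build_vocab(tokens)

# Round-tripping a word through both mappings returns the original word.
assert all(ix_to_word[word_to_ix[w]] == w for w in tokens)
print(word_to_ix["soars"], ix_to_word[10])
```

Because every word has exactly one index and every index exactly one word, lookups work in both directions without ambiguity.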

In this way, the Embedding layer is like a one-hot encoded matrix, and allows us to perform lookups. The lookup values in this layer are initialized to a set of random weights, which we next pass onto the linear layer.
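To see why a lookup table and a one-hot matrix are interchangeable here, the sketch below multiplies a one-hot vector by a small embedding matrix and compares the result with indexing the row directly. The sizes and the random initialization range are assumptions for illustration.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3

# Embedding matrix: one row of random initial weights per vocabulary index.
embedding = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
             for _ in range(vocab_size)]

def one_hot(ix, size):
    """Vector of zeros with a single 1.0 at position ix."""
    return [1.0 if i == ix else 0.0 for i in range(size)]

def matvec_lookup(ix):
    """Multiply a one-hot row vector by the embedding matrix."""
    v = one_hot(ix, vocab_size)
    return [sum(v[r] * embedding[r][c] for r in range(vocab_size))
            for c in range(embedding_dim)]

def row_lookup(ix):
    """Index the row directly, which is what a lookup table does."""
    return embedding[ix]

print(matvec_lookup(2))
print(row_lookup(2))
```

Both calls produce the same vector, but the direct lookup skips the multiplications by zero, which is where the speed advantage comes from.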
Embeddings resemble hash maps and also have their performance characteristics ($O(1)$ retrieval and insert time), which is why they can scale easily when other approaches cannot.

In the embedding layer of Word2Vec, each value in the vector represents the word on a specific dimension, and more importantly, unlike many of the other methods, the value of each vector is in direct relationship to the other words in the input dataset.
import torch
from torch import nn

class CBOW(torch.nn.Module):
    def __init__(self):  # vocab_size and embedding_dim are set below as hyperparams
        super().__init__()
        self.num_epochs = 3
        self.context_size = 2 # 2 words to the left, 2 words to the right
        self.embedding_dim = 100 # Size of your embedding vector
        self.learning_rate = 0.001
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.vocab = TextPreProcessor().build_vocab()
        self.word_to_ix = self.vocab.get_stoi()
        self.ix_to_word = self.vocab.get_itos()
        self.vocab_list = list(self.vocab.get_stoi().keys())
        self.vocab_size = len(self.vocab)
        self.model = None
        # out: 1 x embedding_dim
        self.embeddings = nn.Embedding(
            self.vocab_size, self.embedding_dim
        ) # initialize an Embedding matrix based on our inputs
        self.linear1 = nn.Linear(self.embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        # out: 1 x vocab_size
        self.linear2 = nn.Linear(128, self.vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)
Figure 37: Word2Vec CBOW implementation in PyTorch. source
Once we have our lookup values, we can process all our words. For CBOW, we take a single word and we pick a sliding window, in our case, two words before, and two words after, and try to infer what the actual word is.

This is called the context vector, and in other cases, we'll see that it's called attention. For example, if we have the phrase "No bird [blank] too high", we're trying to predict that the answer is "soars" with a given softmax probability, aka ranked against other words.

Once we have the context vector, we look at the loss - the difference between the true word and the predicted word as ranked by probability - and then we continue.
The way we train this model is through context windows. For each given word in the model, we create a sliding window that includes that word and 2 words before it, and 2 words after it.
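The sliding window can be sketched as follows. The function name `build_training_pairs` is hypothetical, and skipping words too close to the edges of the sentence is a simplifying assumption of this example.

```python
def build_training_pairs(tokens, context_size=2):
    """Build (context, target) pairs: context_size words on each side of the target."""
    pairs = []
    for i in range(context_size, len(tokens) - context_size):
        context = tokens[i - context_size:i] + tokens[i + 1:i + 1 + context_size]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "no bird soars too high if he soars with his own wings".split()
pairs = build_training_pairs(tokens)
print(pairs[0])
```

The first pair holds the context "no bird ... too high" with the target "soars", which is the fill-in-the-blank example from earlier in this section.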
We activate the linear layer with a ReLU activation function, which decides whether a given weight is important or not.

In this case, ReLU squashes all the negative values we initialize our embeddings layer with down to zero since we can't have inverse word relationships, and we perform linear regression by learning the weights of the model of the relationship of the words.

Then, for each batch we examine the loss - the difference between the real word and the word that we predicted should be there given the context window - and we minimize it.
At the end of each epoch, or pass through the model, we pass the weights, or backpropagate them, back to the linear layer, and then again, update the weights of each word, based on the probability.

The probability is calculated through a softmax function, which converts a vector of real numbers into a probability distribution - that is, each number in the vector, i.e. the value of the probability of each word, is in the interval between 0 and 1 and all of the word probabilities add up to one.
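A minimal sketch of the softmax step and the loss computed from it. The candidate words and their scores are hypothetical; the loss shown is the negative log-likelihood that the model code pairs with a log-softmax output.

```python
import math

def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(probs, target_ix):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[target_ix])

# Hypothetical scores for the candidates ["soars", "sings", "flies"]
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = nll_loss(probs, target_ix=0)
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the correct word approaches one.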

The distance, as backpropagated to the embeddings table, should converge or shrink depending on the model understanding how close specific words are.
import logging
import torch
from torch import nn, optim
from tqdm import tqdm

# The two functions below are methods of the CBOW class shown earlier.
def make_context_vector(self, context, word_to_ix) -> torch.LongTensor:
    """
    For each word in the vocab, find sliding windows of [-2,-1,0,1,2] indexes
    relative to the position of the word
    :param context: list of words in the context window
    :param word_to_ix: mapping from each word to its index
    :return: torch.LongTensor
    """
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return tensor

def train_model(self):
    # Loss and optimizer
    self.model = CBOW().to(self.device)
    optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
    loss_function = nn.NLLLoss()
    logging.warning('Building training data')
    data = self.build_training_data()
    logging.warning('Starting forward pass')
    for epoch in tqdm(range(self.num_epochs)):
        # we start tracking how accurate our initial words are
        total_loss = 0
        # for the x, y in the training data:
        for context, target in data:
            context_vector = self.make_context_vector(context, self.word_to_ix)
            # we look at loss
            log_probs = self.model(context_vector)
            # compare loss
            total_loss += loss_function(
                log_probs, torch.tensor([self.word_to_ix[target]])
            )
        # optimize at the end of each epoch
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        # Log out some metrics to see if loss decreases
        logging.warning("end of epoch {} | loss {:2.3f}".format(epoch, total_loss))
    torch.save(self.model.state_dict(), self.model_path)
    logging.warning(f'Save model to {self.model_path}')
Figure 38: Word2Vec CBOW implementation in PyTorch, see full implementation here
Once we've completed our iteration through our training set, we have learned a model that retrieves both the probability of a given word being the correct word, and the entire embedding space for our vocabulary.

4 Modern Embeddings Approaches

Word2Vec became one of the first neural network architectures to use the concept of embedding to create a fixed feature vocabulary. But neural networks as a whole were gaining popularity for natural language modeling because of several key factors.

First, in the 1980s, researchers made advancements in using the technique of backpropagation for training neural networks [53].

Backpropagation is how a model learns to converge by calculating the gradient of the loss function with respect to the weights of the neural network, using the chain rule, a concept from calculus which allows us to calculate the derivative of a function made up of multiple functions.
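As a worked example of the chain rule at its smallest scale, the sketch below computes the gradient for a single linear neuron with a squared-error loss and checks it numerically. The input, target, weight, and learning rate are hypothetical values.

```python
def forward(w, b, x, target):
    """One linear neuron followed by a squared-error loss."""
    prediction = w * x + b
    return (prediction - target) ** 2

def grad_w(w, b, x, target):
    """Chain rule: dL/dw = dL/dprediction * dprediction/dw."""
    prediction = w * x + b
    dloss_dpred = 2 * (prediction - target)
    dpred_dw = x
    return dloss_dpred * dpred_dw

w, b, x, target = 0.5, 0.1, 2.0, 3.0
analytic = grad_w(w, b, x, target)

# Numerical check with a small central finite difference.
eps = 1e-6
numeric = (forward(w + eps, b, x, target) - forward(w - eps, b, x, target)) / (2 * eps)

# One gradient-descent step moves w in the direction that lowers the loss.
learning_rate = 0.1
w_new = w - learning_rate * analytic
print(analytic, numeric, w_new)
```

In a multi-layer network the same product of local derivatives is extended backward through every layer, which is what gives backpropagation its name.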

This mechanism allows the model to understand when it has reached a global minimum for loss and pick the correct weights for the model parameters by training through gradient descent.

Earlier approaches, such as the perceptron learning rule, tried to do this, but had limitations, such as working only on simple layer architectures, taking a long time to converge, and experiencing vanishing gradients, which made it hard to effectively update the model's weights.
These advances gave rise to the first kinds of multi-level neural networks, feed-forward neural networks.

In 1998, a paper used backpropagation over multilayer perceptrons to correctly perform the task of recognizing handwritten digit images [40], demonstrating a practical use-case practitioners and researchers could apply.

This MNIST dataset is now one of the canonical "Hello World" examples of deep learning.
Second, in the 2000s, the rise of petabytes of aggregated log data resulted in the creation of large databases of multimodal input data scraped from the internet.

This made it possible to conduct wide-ranging experiments to prove that neural networks work on large amounts of data.

For example, ImageNet was developed by researchers at Stanford who wanted to focus on improving model performance by creating a gold set of neural network input data, the first step in processing. Fei-Fei Li assembled a team of students and paid gig workers from Amazon Mechanical Turk to correctly label a set of 3.2 million images scraped from the internet and organized based on categories according to WordNet, a taxonomy put together by researchers in the 1980s [55].
Researchers saw the power of using standard datasets.

In 2012, Alex Krizhevsky, in collaboration with Ilya Sutskever, who now works at OpenAI as one of the leading researchers behind the GPT series of models that form the basis of the current generative AI wave, submitted an entry to the ImageNet competition called AlexNet.

This model was a convolutional neural network that outperformed many other methods. There were two things that were significant about AlexNet. The first was that it had eight stacked layers of weights and biases, which was unusual at the time. Today, 12-layer neural networks like BERT and other transformers are completely normal, but at the time, more than two layers was revolutionary. The second was that it ran on GPUs, a new architectural concept at the time, since GPUs were used mostly for gaming.
Neural networks started to become popular as ways to generate representations of vocabularies.

In particular, neural network architectures such as recurrent neural networks (RNNs) and, later, long short-term memory networks (LSTMs) also emerged as ways to deal with textual data for all kinds of machine learning tasks, from NLP to computer vision.

4.1 Neural Networks

Neural networks are extensions on traditional machine learning models, but they have a few critical special properties. Let's think back to our definition of a model when we formalized a machine learning problem.

A model is a function with a set of learnable input parameters that takes some set of inputs and one set of tabular input features, and gives us an output. In traditional machine learning approaches, there is one set, or layer, of learnable parameters and one model.

If our data doesn't have complex interactions, our model can learn the feature space fairly easily and make accurate predictions.
However, when we start dealing with extremely large, implicit feature spaces, such as are present in text, audio, or video, we will not be able to derive specific features that wouldn't be obvious if we were manually creating them.

A neural network, by stacking neurons, each of which represents some aspect of the model, can tease out these latent representations.

Neural networks are extremely good at learning representations of data, with each level of the network transforming a learned representation of the level to a higher level until we get a clear picture of our data [41].

4.1.1 Neural Network architectures

We've already encountered our first neural network, Word2Vec, which seeks to understand relationships between words in our text that the words themselves would not tell us. Within the neural network space, there are several popular architectures:
  • Feed-forward networks, which extract meaning from fixed-length inputs. Results of these models are not fed back into the model for iteration
  • Convolutional neural nets (CNNs), used mainly for image processing, which involve a convolutional layer made up of a filter that moves across an image to check for feature representations, which are then multiplied via dot product with the filter to pull out specific features
  • Recurrent neural networks (RNNs), which take a sequence of items and produce a vector that summarizes the sentence
RNNs and CNNs are used mainly in feature extraction - they generally do not represent the entire modeling flow, but are fed later into feed-forward models that do the final work of classification, summarization, and more.
Figure 39: Types of Neural Networks
Neural networks are complex to build and manage for a number of reasons. First, they require extremely large corpuses of clean, well-labeled data to be optimized.

They also require special GPU architectures for processing, and, as we'll see in the production section, they have their own metadata management and latency considerations.

Finally, within the network itself, we need to complete a large number of passes over the model object using batches of our training data to get it to converge. The number of feature matrices that we need to run calculations over and, consequently, the amount of data we have to keep in memory through the lifecycle of the model ends up accumulating and requires a great deal of performance tuning.
These features made developing and running neural networks prohibitively expensive until the last fifteen years or so.

First, the exponential increase in storage space provided by the growing size of commodity hardware both on-prem and in the cloud meant that we could now store that data for computation, and the explosion of log data gave companies such as Google a lot of training data to work with.

Second, the GPU rose as a tool that takes advantage of the neural network's ability to perform embarrassingly parallel computation - a characteristic of computation where it's easy to separate steps out into ones that can be performed in parallel, such as word count.

In a neural network, we can generally parallelize computation in any given number of ways, including at the level of a single neuron.
While GPUs were initially used for working with computer graphics, in the early 2000s [49], researchers discovered the potential to use them for general computation, and Nvidia made an enormous bet on this kind of computing by introducing CUDA, an API layer on top of GPUs.

This in turn allowed for the creation and development of popular high-level deep learning frameworks like PyTorch and TensorFlow.
Neural networks could now be trained and experimented with at scale.

To come back to a comparison to our previous approaches, when we calculate TF-IDF, we need to loop over each individual word and perform our computations over the entire dataset in sequence to arrive at a score in proportion to all other words, which means that our computational complexity will be $O(Nd)$, for $N$ terms and $d$ documents [10].
However, with a neural network, we can either distribute the model training across different GPUs in a process known as model parallelism, or compute batches - the size of the training data fed into the model and used in a training loop before its parameters are updated - in parallel and update at the end of each minibatch, which is known as data parallelism [60].
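A toy sketch of data parallelism, under simplifying assumptions: the "workers" run one after another in a single process, the model is a single weight `w` fitting y = w * x, and the shards are equal-sized. None of these details come from the text.

```python
def gradient(w, batch):
    """Mean gradient of squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical (x, y) training pairs drawn from y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0

# Each worker holds a copy of w and an equal-sized shard of the batch.
shards = [data[:2], data[2:]]
worker_grads = [gradient(w, shard) for shard in shards]  # would run concurrently
averaged = sum(worker_grads) / len(worker_grads)

# The averaged gradient matches the single-device gradient on the full batch.
full = gradient(w, data)
w = w - 0.01 * averaged
print(averaged, full, w)
```

Because the averaged result equals the full-batch gradient, splitting the batch changes how long the step takes but not what the step computes.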

4.2 Transformers

Word2Vec is a feed-forward network. The model weights and information only flow from the encoding state, to the hidden embedding layer, to the output probability layer.

There is no feedback between the second and third layers, which means that each given layer doesn't know anything about the state of the layers that follow it. It can't make inference suggestions longer than the context window.

This works really well for machine learning problems where we're fine with a single, static vocabulary.
However, it doesn't work well on long ranges of text that require understanding words in context of each other. For example, over the course of a conversation, we might say, "I read that quote by Langston Hughes. I liked it, but didn't really read his later work." We understand that "it" refers to the quote, context from the previous sentence, and "his" refers to "Langston Hughes", mentioned two sentences ago.
One of the other limitations was that Word2Vec can't handle out-of-vocabulary words - words that the model has not been trained on and needs to generalize to.

This means that if our users search for a new trending term or we want to recommend a flit that was written after our model was trained, they won't see any relevant results from our model unless the model is retrained frequently [14].
Another problem is that Word2Vec encounters context collapse around polysemy - the coexistence of many possible meanings for the same phrase: for example, if you have "jail cell" and "cell phone" in the same sentence, it won't understand that the context of both words is different.

Much of the work of NLP based in deep learning has been in understanding and retaining that context to propagate through the model and pull out semantic meaning.
Different approaches were proposed to overcome these limitations. Researchers experimented with recurrent neural networks (RNNs). An RNN builds on traditional feed-forward networks, with the difference being that layers of the model give feedback to previous layers. This allows the model to keep memory of the context around words in a sentence.
Figure panels: (a) Feedforward Neural Network; (b) Recurrent Neural Network
A problem with traditional RNNs was that, because during backpropagation the gradients had to be carried back through many earlier steps of the network, they experienced the problem of vanishing gradients.

This occurs when we repeatedly multiply partial derivatives through the chain rule during backpropagation and the product approaches zero. Once the gradient is close to zero, the weight updates become negligible and the network effectively stops learning before convergence, much as if it had reached a local optimum.
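To make the shrinking product concrete, here is a minimal sketch that multiplies the same chain-rule factor across a growing number of steps. It assumes a sigmoid activation and a recurrent weight of 1.0, both chosen only for illustration; the function names are hypothetical.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid; it is never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_through_steps(num_steps, z=0.0, weight=1.0):
    """Product of chain-rule factors accumulated over `num_steps` recurrent steps."""
    grad = 1.0
    for _ in range(num_steps):
        # Each step contributes one factor of (weight * activation derivative).
        grad *= weight * sigmoid_derivative(z)
    return grad


for steps in (1, 5, 20, 50):
    print(steps, gradient_through_steps(steps))
```

Because each factor is at most 0.25 in this setup, the gradient that reaches the earliest steps shrinks geometrically with sequence length, which is why long-range context is hard for a plain RNN to learn.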
A very popular variation of an RNN that worked around this problem was the long short-term memory network (LSTM), developed initially by Hochreiter and Schmidhuber and brought to popularity for use in text applications, speech recognition, and image captioning [33].

Whereas our previous model takes only a vector at a time as input, RNN variants such as LSTMs and GRUs (gated recurrent units) operate on sequences of vectors using gating mechanisms, which allow the network to control how much information is passed in for analysis. While LSTMs worked fairly well, they had their own limitations.

They were architecturally complicated and processed their input sequentially, so they couldn't be trained in parallel; as a result, they took much longer to train, and at a higher computational cost.

4.2.1 Encoders/Decoders and Attention

Two concepts allowed researchers to overcome the computational expense of remembering long sequences over a larger context window than what was available in RNNs, and in Word2Vec before them: the encoder/decoder architecture, and the attention mechanism.
The encoder / decoder architecture is a neural network architecture comprised of two neural networks, an encoder that takes the input vectors from our data and creates an embedding of a fixed length, and a decoder, also a neural network, which takes the embeddings encoded as input and generates a static set of outputs such as translated text or a text summary.

In between the two types of layers is the attention mechanism, a way to hold the state of the entire input by continuously performing weighted matrix multiplications that highlight the relevance of specific terms in relation to each other in the vocabulary.

We can think of attention as a very large, complex hash table that keeps track of the words in the text and how they map to different representations both in the input and the output.
Figure 41: The encoder/decoder architecture
import torch.nn as nn
import torch.nn.functional as F


class EncoderDecoder(nn.Module):
    """Defining the encoder/decoder steps."""

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed  # embeds source tokens
        self.tgt_embed = tgt_embed  # embeds target tokens
        self.generator = generator  # maps decoder output to vocabulary log-probabilities

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        # Embed the source tokens, then run them through the encoder stack.
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        # `memory` is the encoder output that the decoder attends over.
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
Figure 42: A typical encoder/decoder architecture, from the Annotated Transformer
"Attention is All You Need" [66], released in 2017, combined both of these concepts into a single architecture. The paper immediately saw a great deal of success, and today Transformers are one of the de-facto models used for natural language tasks.
Based on the success of the original model, many variations on the Transformer architecture have been released, including GPT and BERT in 2018, DistilBERT, a smaller and more compact version of BERT, in 2019, and GPT-3 in 2020.
Figure 43: Timeline of Transformer Models
Transformer architectures themselves are not new, but they contain all the concepts we've discussed so far: vectors, encodings, and hash maps.

The goal of a transformer model is to take a piece of multimodal content, and learn the latent relationships by creating multiple views of groups of words in the input corpus (multiple context windows).

The self-attention mechanism, implemented as scaled dot-product attention in the Transformer paper, creates different context windows of the data a number of times through the six encoder and six decoder layers.

The output is the result of the specific machine learning task (a translated sentence or a summarized paragraph), and the next-to-last layer holds the model's embeddings, which we can use for downstream work.
Figure 44: View into transformer layers, inspired by multiple sources including this diagram
The transformer model described in the paper takes a corpus of text as input. We first transform our text to token embeddings by tokenizing and mapping every word or subword to an index. This is the same process as in Word2Vec: we simply assign each word to an element in a matrix.

However, these alone will not help us with context, so, on top of this, we also add positional embeddings, computed with sine and cosine functions and stored in a matrix, that encode the position of each token relative to all the other tokens in the sequence.

The final output of this process is the positional vector or the word encoding.
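The sine and cosine construction can be sketched as follows. This is an illustrative version of the sinusoidal scheme from the Transformer paper, where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies; the function name and the small sizes are assumptions for the example.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=8)
for row in pe:
    print([round(value, 3) for value in row])
```

Each row would then be added element-wise to the token embedding at the same position, so that identical words at different positions end up with different input vectors.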
Next, these positional vectors are passed in parallel to the model. Within the Transformer paper, the model consists of six layers that perform encoding and six that perform decoding.

We start with the encoder layer, which consists of two sub-layers: the self-attention layer, and a feed-forward neural network.

The self-attention layer is the key piece, which performs the process of learning the relationship of each term in relation to the other through scaled dot-product attention.

We can think of self-attention in several ways: as a differentiable lookup table, or as a large lookup dictionary that contains both the terms and their positions, with the weights of each term in relationship to the other obtained from previous layers.
The scaled dot-product attention is the product of three matrices: key, query, and value.

All three are computed from the same input, the output of the previous layer, by multiplying it with three separate weight matrices; these weights are initialized at random and adjusted at each step by gradient descent.

We calculate the dot product between query and key, scale it, and normalize the resulting weights via softmax. For each embedding, we then generate a weighted average of the values based on these learned attention weights.
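The computation just described can be sketched in plain Python. This is a minimal, unbatched illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, using nested lists rather than tensors; the function names and toy inputs are assumptions for the example, and a real model would first project the input through learned query, key, and value weight matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, attention_weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy vectors used as both queries and keys
values = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, values)
print(weights)
print(outputs)
```

Every query row is processed independently of the others, which is what allows the real, matrix-based version to run in parallel across all tokens.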

Multi-head attention means that we perform the process of calculating the scaled dot product attention multiple times in parallel and concatenate the outcome into one vector.
What's great about scaled dot-product attention (and about all of the layers of the encoder) is that the work can be done in parallel across all the tokens in our input: we don't need to wait for one word to finish processing in order to process the next one, as we do in sequential models like RNNs, so the number of input steps remains the same, regardless of how big our vocabulary is.
The decoder piece differs slightly from the encoder. It starts with a different input dataset: in the transformer paper, it's the target language dataset we'd like to translate the text into.

So, for example, if we were translating our Flit from English to Italian, we'd expect to train on the Italian corpus. Otherwise, we perform all the same actions: we create indexed embeddings that we then convert into positional embeddings.

We then feed the positional embeddings for the target text into a layer that has three parts: masked multi-headed attention, multi-headed attention, and a feed-forward neural network.

The masked multi-headed attention component is just like self-attention, with one extra piece: the mask matrix introduced in this step acts as a filter to prevent the attention head from looking at future tokens, since the input vocabulary for the decoder are our "answers", i.e. what the translated text should be.

The output from the masked multi-head self-attention layer is passed to the encoder-decoder attention portion, which accepts the final output of the six encoder layers for the key and value, and uses the output of the previous decoder layer as the query, and then performs scaled dot-product attention over this.
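The masking step can be illustrated with a small sketch. It assumes a lower-triangular "causal" mask, where position i may attend only to positions up to and including i, and applies it to one row of raw attention scores before the softmax; the helper names and scores are hypothetical.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with disallowed positions forced to zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    largest = max(masked)  # the diagonal is always allowed, so this is finite
    exps = [math.exp(s - largest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
raw_scores = [1.0, 2.0, 3.0]          # scores of one query against three keys
for position in range(3):
    print(position, masked_softmax(raw_scores, mask[position]))
```

Setting the masked scores to negative infinity before the softmax gives future tokens a weight of exactly zero, so the decoder cannot peek at the answer it is being trained to produce.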

Each output is then fed into the feed-forward layer, producing a finalized set of embeddings.
Once we have the hidden state for each token, we can then attach the task head. In our case, this is prediction of what a word should be. At each step of the process, the decoder looks at the previous steps and generates based on those steps so we form a complete sentence [54]. We then get the predicted word, just like in Word2Vec.
Transformers were revolutionary because they solved several problems people had been working on:
  • Parallelization - Each step in the model is parallelizable, meaning we don't need to wait to know the positional embedding of one word in order to work on another, since each embedding lookup matrix focuses attention on a specific word, with a lookup table of all other words in relationship to that word - each matrix for each word carries the context window of the entire input text.
  • Vanishing gradients - Previous models like RNNs can suffer from vanishing or exploding gradients, which means that the model stops learning effectively before it's fully trained, making it challenging to capture long-term dependencies.

    Transformers mitigate this problem by allowing direct connections between any two positions in the sequence, enabling information to flow more effectively during both forward and backward propagation.
  • Self-attention - The attention mechanism allows us to learn the context of an entire text that's longer than a 2- or 3-word sliding context window, allowing us to learn different words in different contexts and predict answers with more accuracy.

4.3 BERT

Figure 45: Encoder-only architecture
After the explosive success of "Attention is All You Need", a variety of transformer architectures arose, and research and implementation around this architecture exploded in deep learning. The next transformer architecture to be considered a significant step forward was BERT.

BERT stands for Bidirectional Encoder Representations from Transformers and was released in 2018 [13], based on a paper written by Google, as a way to solve common natural language tasks like sentiment analysis, question answering, and text summarization.

BERT is a transformer model, also based on the attention mechanism, but its architecture is such that it only includes the encoder piece. Its most prominent usage is in Google Search, where it helps surface relevant search results.

In the blog post they released on including BERT in search ranking in 2019, Google specifically discussed adding context to queries, as a replacement for keyword-based methods, as a reason they did this.
BERT works as a masked language model. Masking is simply what we did when we implemented Word2Vec by removing words and building our context window. When we created our representations with Word2Vec, we only looked at sliding windows moving forward.

The B in BERT is for bidirectional, which means it pays attention to words in both directions through scaled dot-product attention. BERT has 12 transformer layers. It starts by using WordPiece, an algorithm that segments words into subwords, to split the input into tokens.

To train BERT, the goal is to predict a token given its context.
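As a rough sketch of how masked training examples could be built, the code below hides a random subset of tokens and records the hidden token as the label to predict. It is a simplification made for illustration: it splits on whitespace instead of using WordPiece, always substitutes a `[MASK]` placeholder, and uses hypothetical names and a fixed seed.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was hidden."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)      # the model sees the placeholder...
            labels.append(token)     # ...and is trained to recover the original token
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "hold fast to dreams for if dreams die".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

Because the hidden token can sit anywhere in the sentence, the model has to use context on both its left and its right to fill it in, which is where the bidirectionality comes from.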
The output of BERT is latent representations of words and their context: a set of embeddings. BERT is, essentially, an enormous parallelized Word2Vec that remembers longer context windows.

Given how flexible BERT is, it can be used for a number of tasks, from translation, to summarization, to autocomplete. Because it doesn't have a decoder component, it can't generate text, which paved the way for GPT models to pick up where BERT left off.

4.4 GPT

Around the same time that BERT was being developed, another transformer architecture, the GPT series, was being developed at OpenAI. GPT differs from BERT in that it encodes as well as decodes text from embeddings and therefore can be used for probabilistic inference.
The original, first GPT model was trained as a 12-layer, 12-headed transformer with only a decoder piece, based on data from BookCorpus. Subsequent versions built on this foundation to try and improve context understanding.

The largest breakthrough was in GPT-4, which was trained with reinforcement learning from human feedback (RLHF), a property which allows it to make inferences from text that feel much closer to what a human would write.
We've now reached the forefront of what's possible with embeddings in this paper.

With the rise of generative methods and methods based on reinforcement learning from human feedback like OpenAI's ChatGPT, as well as the nascent open-source Llama, Alpaca, and other models, anything written in this paper would already be impossibly out of date by the time it was published.

5 Embeddings in Production

With the advent of Transformer models, and more importantly, BERT, generating representations of large, multimodal objects for use in all sorts of machine learning tasks suddenly became much easier, the representations became more accurate, and if the company had GPUs available, computations could now be sped up by running them in parallel.

Now that we understand what embeddings are, what should we do with them? After all, we're not doing this just as a math exercise. If there is one thing to take away from this entire text, it is this:
The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. [37]
The model that is deployed is always better and more accurate than the model that is only ever a prototype. We've gone through the process of training embeddings end to end here, but there are several modalities for working with embeddings. We can:
  • Train our own embeddings model - We can train BERT or some variation of BERT from scratch. BERT uses an enormous amount of training data, so this is not really advantageous to us, unless we want to better understand the internals and have access to a lot of GPUs.
  • Use pretrained embeddings and fine-tune - There are many variations on BERT models, and they have been used to generate embeddings to use as downstream input into many recommender and information retrieval systems. One of the largest gifts that the transformer architecture gives us is the ability to perform transfer learning.
Before, when we learned embeddings in pre-transformer architectures, our representation of whatever dataset we had at hand was fixed - we couldn't change the weights of the words in TF-IDF without regenerating an entire dataset.
Now, we have the ability to treat the output of the layers of BERT as input into the next neural network layer of our own, custom model.

In addition to transfer learning, there are also numerous more compact models for BERT, such as DistilBERT and RoBERTa, and for many of the larger models, available in places like the Hugging Face Model Hub.
Armed with this knowledge, we can think of several use cases of embeddings, given their flexibility as a data structure.

- Feeding them into another model - For example, we can now perform collaborative filtering using both user and item embeddings that were learned from our data instead of coding the users and items themselves.

- Using them directly - We can use item embeddings directly for content filtering - finding items that are closest to other items, a task recommendation shares with search.

There are a host of algorithms used to perform vector similarity lookups by projecting items into our embedding space and performing similarity search, including the HNSW algorithm and libraries such as faiss.
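Before reaching for an approximate index, it can help to see the exact lookup that those systems approximate. The sketch below is a brute-force nearest-neighbor search over cosine similarity; it is not how faiss or HNSW are implemented, and the item names and three-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, items, k=2):
    """Return the names of the k items whose embeddings are most similar to `query`."""
    scored = [(cosine_similarity(query, vector), name) for name, vector in items.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


items = {
    "bird_photo": [0.9, 0.1, 0.0],
    "bird_song": [0.8, 0.3, 0.1],
    "tax_form": [0.0, 0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.2, 0.0], items, k=2))
```

This scan compares the query against every item, so its cost grows linearly with the catalog; approximate methods trade a small amount of recall for much faster lookups at production scale.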

5.1 Embeddings in Practice

Many companies are working with embeddings in all of these contexts today, across areas that span all aspects of information retrieval.

Embeddings generated with deep learning models are being used in wide and deep models for App Store recommendations at Google Play [73], as dual embeddings for complementary product recommendations at Overstock [38], for personalization of search results at Airbnb via real-time ranking [25], for content understanding at Netflix [16], for understanding visual styles at Shutterstock [24], and in many other examples.

5.1.1 Pinterest

One notable example is Pinterest. Pinterest as an application has a wide variety of content that needs to be personalized and classified for recommendation to users across multiple surfaces, particularly the Homefeed and shopping tab.

The scale of the content and audience, 350 million monthly users and 2 billion items (Pins, or cards with an image described by text), necessitates a strong filtering and ranking policy.
To represent a user's interest and surface interesting content, Pinterest developed PinnerSage [50], which represents user interests through multiple 256-dimension embeddings that are clustered based on similarity and represented by medoids - items that are representative of the center of a given interest cluster.
The foundation of this system is a set of embeddings developed through an algorithm called PinSage [72].

PinSage generates embeddings using a graph convolutional neural network, which is a neural net that takes into account the graph structure of relationships between nodes in the network.

The algorithm looks at the nearest neighbors of a Pin and samples from nearby Pins based on related neighborhood visits. The input is the embeddings of a Pin, both the image embeddings and the text embeddings, and the algorithm finds the nearest neighbors.
PinSage embeddings are then passed to PinnerSage, which takes the Pins the user has acted on for the past 90 days and clusters them.

It computes the medoid of each cluster and takes the top 3 medoids based on importance, and, given a user query that is a medoid, performs an approximate nearest neighbors search using HNSW to find the Pins closest to the query in the embedding space.
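The idea of a medoid can be made concrete with a short sketch: unlike a mean, a medoid is always one of the actual items in the cluster. The example below assumes Euclidean distance and a tiny two-dimensional cluster purely for illustration; it does not reproduce PinnerSage's clustering or importance scoring.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid(cluster):
    """Return the member of `cluster` with the smallest total distance to all members."""
    return min(cluster, key=lambda point: sum(euclidean(point, other) for other in cluster))


cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]
print(medoid(cluster))
```

Representing a cluster by a real item means the representative already has an embedding in the same space as every other item, so it can be used directly as a query for nearest-neighbor retrieval.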
Figure 46: PinnerSage and PinSage embeddings-based similarity retrieval

5.1.2 YouTube and Google Play Store

YouTube

YouTube was one of the first large companies to publicly share their work on embeddings used in the context of a production recommender system with "Deep Neural Networks for YouTube Recommendations."
YouTube has over 800 million pieces of content (videos) and 2.6 billion active users that they'd like to recommend those videos to. The application needs to recommend existing content to users, while also generalizing to new content, which is uploaded frequently.

They need to be able to serve these recommendations with low latency at inference time, when the user loads a new page.
In this paper [11], YouTube shares how they created a two-stage recommender system for videos based on two deep learning models. The machine learning task is to predict the correct next video to show the user at a given time in YouTube recommendations so that they click.

The final output is formulated as a classification problem: given a user's input features and the input features of a video, can we predict a class for the user that includes the predicted watch time for the user for a specific video with a specific probability?
Figure 47: YouTube's end-to-end video recommender system, including a candidate generator and ranker [11]
We set this task given a user, U, and context, C.
Given the size of the input corpus, we need to formulate the problem as a two-stage recommender: the first is the candidate generator that reduces the candidate video set to hundreds of items and a second model, similar in size and shape, called a ranker that ranks these hundreds of videos by the probability that the user will click on them and watch.
The candidate generator is a softmax deep learning model with several layers, all activated with ReLU activation functions - rectified linear unit activation that outputs the input directly if positive; otherwise, it's zero.

To build the model, we use two sets of embeddings as input data: one that's the user plus context as features, and a set of video items. The model uses several hundred features, both tabular and embeddings-based, all of which are combined. For the embeddings-based features, we include elements like:
  • User watch history - represented by a vector of sparse video ID elements mapped into a dense vector representation
  • User's search history - Maps search term to video clicked from the search term, also in a sparse vector mapped into the same space as the user watch history
  • User's geography, age, and gender - mapped as tabular features
  • The number of previous impressions a video had, normalized per user over time
These are all combined into a single item embedding, and in the case of the user, a single embedding that's a blended map of all the user embedding features, and fed into the models' softmax layers, which compare the distance between the output of the softmax layer, i.e. the probability that the user will click on an item, and a set of ground truth items, i.e. a set of items that the user has already interacted with.

The unnormalized log probability (the logit) of an item is the dot product of two n-dimensional vectors, i.e. the query and item embeddings; the softmax turns these scores into probabilities.
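A minimal sketch of this scoring step follows: each candidate video's score is the dot product of the user embedding with that video's embedding, and a softmax over all candidates converts the scores into probabilities. The two-dimensional embeddings and video names are invented for illustration, and the real model scores on the order of a million classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_probabilities(user_embedding, video_embeddings):
    """Map each video name to the probability assigned by a softmax over dot products."""
    names = list(video_embeddings)
    logits = [sum(u * v for u, v in zip(user_embedding, video_embeddings[name]))
              for name in names]
    return dict(zip(names, softmax(logits)))


videos = {
    "cooking": [0.9, 0.4],
    "sports": [-0.5, 0.2],
    "music": [0.1, 0.1],
}
probabilities = candidate_probabilities([1.0, 0.5], videos)
print(probabilities)
```

Candidate generation then amounts to keeping the highest-probability items, which is the same as keeping the items with the largest dot products against the user embedding.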
We consider this an example of implicit feedback - feedback the user did not explicitly give (unlike, say, a rating), but that we can capture in our log data. Each class response, of which there are approximately a million, is given a probability as output.
The DNN is a generalization of the matrix factorization model we discussed earlier.
Figure 48: YouTube's multi-step neural network model for video recommendations using input embeddings [11]

Google Play App Store

Similar work, although with a different architecture, was done in the App Store in Google Play in "Wide and Deep Learning for Recommender Systems" [7].

This one crosses the search and recommendation space because it returns correct ranked and personalized app recommendations as the result of a search query. The input is clickstream data collected when a user visits the app store.
Figure 49: Wide and deep
The recommendations problem is formulated here as two jointly-trained models. The weights are shared and cross-propagated between the two models between epochs.
There are two problems when we try to build models that recommend items: memorization - the model needs to learn patterns by learning how items occur together given the historical data, and generalization - the model needs to be able to give new recommendations the user has not seen before that are still relevant to the user, improving recommendation diversity.

Generally, one model alone cannot encompass both of these tradeoffs.
Wide and deep is made up of two models that look to complement each other:
  • A wide model which uses traditional tabular features to improve the model's memorization. This is a generalized linear model trained on sparse, one-hot encoded features like user_installed_app=netflix across thousands of apps.

    Memorization works here by creating binary features that are combinations of features, such as AND(user_installed_app=netflix, impression_app=pandora), allowing us to see different combinations of co-occurrence in relationship to the target, i.e. likelihood to install the app. However, this model cannot generalize if it gets a new value outside of the training data.
  • A deep model that supports generalization across items that the model has not seen before, using a feed-forward neural network made up of categorical features that are translated to embeddings, such as user language, device class, and whether a given app has an impression.

    Each of these embeddings ranges from 0 to 100 in dimensionality and is initialized randomly. They are combined jointly into a concatenated embedding space with dense vectors in 1200 dimensions.

    The embedding values are trained to minimize loss of the final function, which is a logistic loss function common to the deep and wide model.
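The two parts described above can be sketched together. The wide part contributes a linear score over binary features, including a cross-product feature that fires only when all of its conditions hold; the deep part contributes a score computed from its final hidden activations; and the two are summed and passed through a sigmoid. All weights, activations, and helper names below are made up for illustration, and the deep network that would produce `deep_activations` is omitted.

```python
import math


def cross_feature(features, *conditions):
    """Binary AND over (name, value) conditions, e.g. AND(installed=netflix, impression=pandora)."""
    return 1.0 if all(features.get(name) == value for name, value in conditions) else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def wide_and_deep_probability(wide_inputs, wide_weights, deep_activations, deep_weights, bias=0.0):
    """Probability of install: sigmoid of the summed wide and deep scores."""
    wide_score = sum(w * x for w, x in zip(wide_weights, wide_inputs))
    deep_score = sum(w * a for w, a in zip(deep_weights, deep_activations))
    return sigmoid(wide_score + deep_score + bias)


features = {"user_installed_app": "netflix", "impression_app": "pandora"}
crossed = cross_feature(
    features, ("user_installed_app", "netflix"), ("impression_app", "pandora")
)
probability = wide_and_deep_probability(
    wide_inputs=[crossed],
    wide_weights=[0.8],             # hypothetical learned weight for the crossed feature
    deep_activations=[0.2, -0.1],   # hypothetical output of the deep network's last layer
    deep_weights=[0.5, 0.3],
)
print(crossed, probability)
```

Because both scores feed a single logistic output, one loss function can train the two parts jointly: the wide weights memorize specific co-occurrences while the deep embeddings generalize to combinations that were never seen together.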
Figure 50: The deep part of the wide and deep model [7]
The model is trained on 500 billion examples and evaluated offline using AUC and online using app acquisition rate, the rate at which people download the app. According to the paper, this approach improved the app acquisition rate on the main landing page of the app store relative to the control group.
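To make the two evaluation metrics concrete, here is a small sketch. The function names and the toy numbers are assumptions for illustration, not values from the paper. AUC is computed here as the fraction of (positive, negative) pairs that the model ranks correctly, and acquisition rate as installs divided by impressions.

```python
def auc(labels, scores):
    """Probability that a random positive example is scored above a random
    negative one; ties count as half a correct ranking."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def acquisition_rate(installs, impressions):
    """Online metric: share of app impressions that led to an install."""
    return installs / impressions


# Toy data: 1 = the user installed the app, 0 = they did not.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.3, 0.4, 0.6]
print(auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise loop is quadratic, which is fine for a sketch; production evaluation would use a sort-based computation from an established library.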

5.1.3 Twitter

At Twitter, pre-computed embeddings were a critical part of recommendations for many app surface areas, including user onboarding topic interest prediction, recommended Tweets, home timeline construction, users to follow, and recommended ads.
Twitter had a number of embeddings-based models, but we'll cover two projects here. Twice [44], content embeddings for Tweets, aims to find rich representations of Tweets that include both text and visual data, for use in surfacing Tweets in the home timeline, Notifications, and Topics.

Twitter also developed TwHIN [18], the Twitter Heterogeneous Information Network, a set of graph-based embeddings built for tasks like personalized ads ranking, account follow recommendation, offensive content detection, and search ranking, based on nodes (such as users and advertisers) and edges that represent entity interactions.
Figure 51: Twitter's Twice Embeddings, a trained BERT model [44]
Twice is a BERT model trained from scratch on an input corpus of 200 million Tweets that users engaged with, sampled over 90 days. The corpus also includes associations to the users themselves.

The model is optimized jointly on several tasks: topic prediction (the topic or topics associated with a Tweet), engagement prediction (the likelihood that a user engages with a Tweet), and language prediction, so that Tweets in the same language are clustered closer together.
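One common way to optimize a single model on several tasks is to sum a per-task loss. The sketch below assumes that setup; the source does not say how Twice combines its objectives, so the equal task weights, the function names, and the choice of loss per task are all illustrative assumptions. Topic prediction is treated as multi-label, engagement as binary, and language as multi-class.

```python
import math


def binary_cross_entropy(probability, label):
    """Loss for a single yes/no prediction, clamped to avoid log(0)."""
    p = min(max(probability, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(topic_probs, topic_labels, engagement_prob, engaged,
                   language_logits, language_index, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (weights are hypothetical)."""
    # Multi-label: each topic gets its own independent yes/no loss.
    topic_loss = sum(
        binary_cross_entropy(p, y) for p, y in zip(topic_probs, topic_labels)
    ) / len(topic_labels)
    # Binary: did the user engage with the Tweet?
    engagement_loss = binary_cross_entropy(engagement_prob, engaged)
    # Multi-class: exactly one language per Tweet.
    language_loss = -math.log(softmax(language_logits)[language_index])
    w_topic, w_engagement, w_language = weights
    return (w_topic * topic_loss
            + w_engagement * engagement_loss
            + w_language * language_loss)


good = multitask_loss([0.9, 0.1], [1, 0], 0.8, 1, [2.0, 0.0, 0.0], 0)
bad = multitask_loss([0.1, 0.9], [1, 0], 0.2, 1, [0.0, 2.0, 0.0], 0)
print(good, bad)
```

Predictions that agree with the labels on all three tasks produce the lower total, so a shared encoder trained on this sum is pushed toward representations that serve every task at once.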
TwHIN, rather than focusing only on Tweet content, treats all entities in Twitter's environment (Tweets, users, advertiser entities) as belonging together in a single graph with a joint embedding space.
Joint embedding is performed using data from user-Tweet engagement, advertising, and follows to create multi-modal embeddings. TwHIN is used for candidate generation.

The candidate generator finds users to follow or Tweets to engage with, using an approximate nearest neighbor index such as HNSW or Faiss to retrieve candidate items. TwHIN embeddings are then used to query candidate items and increase diversity in the candidate pool.
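The retrieval step can be illustrated with an exact, brute-force nearest neighbor search. This is a stand-in, not HNSW or Faiss: those trade a little accuracy for speed at scale, while the sketch below simply scores every item. The vectors, item names, and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_embedding, item_embeddings, k):
    """Return the k item ids whose embeddings are most similar to the query."""
    ranked = sorted(
        item_embeddings,
        key=lambda item_id: cosine_similarity(query_embedding, item_embeddings[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical two-dimensional embeddings in a shared space.
user_embedding = [1.0, 0.0]
tweet_embeddings = {
    "tweet_a": [0.9, 0.1],
    "tweet_b": [0.0, 1.0],
    "tweet_c": [0.7, 0.7],
}
candidates = retrieve_candidates(user_embedding, tweet_embeddings, k=2)
print(candidates)  # tweet_a is closest, then tweet_c
```

Because users and Tweets live in the same embedding space, the same function could take a user embedding as the query and account embeddings as the items to suggest users to follow.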
[Figure: an example heterogeneous information network (HIN), labeled Figure 2 in the original source [18], with four entity types ('User', 'Tweet', 'Advertiser', and 'Ad') and seven relationship types ('Follows', 'Authors', 'Favorites', 'Replies', 'Retweets', 'Promotes', and 'Clicks').]
Figure 52: Twitter's model of the app's heterogeneous information network [18]
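A heterogeneous information network of this kind can be represented as typed entities connected by typed edges. The sketch below uses the four entity types and seven relationship types named in the figure; the class, the identifiers, and which entity types each relationship connects are assumptions for illustration only.

```python
from typing import NamedTuple

ENTITY_TYPES = {"User", "Tweet", "Advertiser", "Ad"}
RELATION_TYPES = {
    "Follows", "Authors", "Favorites", "Replies", "Retweets", "Promotes", "Clicks",
}


class Edge(NamedTuple):
    source: str
    relation: str
    target: str


class HeterogeneousNetwork:
    """Entities carry a type; edges carry a relationship type."""

    def __init__(self):
        self.entity_type = {}  # entity id -> entity type
        self.edges = []        # list of (source, relation, target) triples

    def add_entity(self, entity_id, entity_type):
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.entity_type[entity_id] = entity_type

    def add_edge(self, source, relation, target):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.entity_type or target not in self.entity_type:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, relation, target))

    def neighbors(self, source, relation):
        """Targets reachable from source through one relationship type."""
        return [e.target for e in self.edges
                if e.source == source and e.relation == relation]


# Hypothetical example graph.
hin = HeterogeneousNetwork()
hin.add_entity("user:alice", "User")
hin.add_entity("user:bob", "User")
hin.add_entity("tweet:1", "Tweet")
hin.add_entity("advertiser:acme", "Advertiser")
hin.add_entity("ad:42", "Ad")
hin.add_edge("user:alice", "Follows", "user:bob")
hin.add_edge("user:bob", "Authors", "tweet:1")
hin.add_edge("user:alice", "Favorites", "tweet:1")
hin.add_edge("advertiser:acme", "Promotes", "ad:42")
hin.add_edge("user:alice", "Clicks", "ad:42")

print(hin.neighbors("user:alice", "Follows"))
```

Each (source, relation, target) triple is one training signal, which is why engagement, advertising, and follow data can all contribute to a single joint embedding space.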

Embeddings at Flutter

Once we synthesize enough of these architectures, patterns start to emerge that we can adapt when developing our own recommendation system at Flutter.
First, we need a great deal of input data to make accurate predictions, and that data should capture either explicit feedback or, more likely, implicit feedback like user clicks and purchases, so that we can construct our model of user preferences.

The reason we need a lot of data is two-fold. First, neural networks are data-hungry and require far more training data than traditional models to correctly infer relationships. Second, large data requires a large pipeline.
If we don't have a lot of data, a simpler model will work well enough, so we need to make sure we are actually at the scale where embeddings and neural networks help our business problem. It's likely that we can start much simpler.