
A Beginner-friendly and Comprehensive Deep Dive on Vector Databases

Understanding every little detail on vector databases and their utility in LLMs, along with a hands-on demo.


Introduction

It’s pretty likely that in the generative AI era (since the release of ChatGPT, to be more precise), you have at least heard of the term “vector databases.”

It’s okay if you don’t know what they are, as this article is primarily intended to explain everything about vector databases in detail.

But given how popular they have become lately, I think it is crucial to understand what makes them so powerful and to see their practical utility, not just in LLMs but in other applications as well.

Let’s dive in!


What are vector databases?

Objective

To begin, we must note that vector databases are NOT new.

In fact, they have existed for a pretty long time now. You have been indirectly interacting with them daily, even before they became widely popular lately, in applications like recommendation systems and search engines, for instance.

Simply put, a vector database stores unstructured data (text, images, audio, video, etc.) in the form of vector embeddings.

Each data point, whether a word, a document, an image, or any other entity, is transformed into a numerical vector using ML techniques (which we shall see ahead).

This numerical vector is called an embedding, and the model is trained in such a way that these vectors capture the essential features and characteristics of the underlying data.

Considering word embeddings, for instance, we may discover that in the embedding space, the embeddings of fruits are found close to each other, while cities form another cluster, and so on.

This shows that embeddings can learn the semantic characteristics of entities they represent (provided they are trained appropriately).

Once stored in a vector database, we can retrieve original objects that are similar to the query we wish to run on our unstructured data.

In other words, encoding unstructured data allows us to run many sophisticated operations like similarity search, clustering, and classification over it, which otherwise is difficult with traditional databases.

To exemplify, when an e-commerce website provides recommendations for similar items or searches for a product based on the input query, we’re (in most cases) interacting with vector databases behind the scenes.


Before we get into the technical details, let me give you a couple of intuitive examples to understand vector databases and their immense utility.

Example #1

Let's imagine we have a collection of photographs from various vacations we’ve taken over the years. Each photo captures different scenes, such as beaches, mountains, cities, and forests.

Now, we want to organize these photos in a way that makes it easier to find similar ones quickly.

Traditionally, we might organize them by the date they were taken or the location where they were shot.

However, we can take a more sophisticated approach by encoding them as vectors.

More specifically, instead of relying solely on dates or locations, we could represent each photo as a set of numerical vectors that capture the essence of the image.

💡
While Google Photos doesn't explicitly disclose the exact technical details of its backend systems, I speculate that it uses a vector database to facilitate its image search and organization features, which you may have already used many times.

Let’s say we use an algorithm that converts each photo into a vector based on its color composition, prominent shapes, textures, people, etc.

Each photo is now represented as a point in a multi-dimensional space, where the dimensions correspond to different visual features and elements in the image.

Now, when we want to find similar photos, say, based on our input text query, we encode the text query into a vector and compare it with image vectors.

Photos that match the query are expected to have vectors that are close together in this multi-dimensional space.

Suppose we wish to find images of mountains.

In that case, we can quickly find such photos by querying the vector database for images close to the vector representing the input query.

A point to note here is that a vector database is NOT just a database to keep track of embeddings.

Instead, it maintains both the embeddings and the raw data which generated those embeddings.

Why is that necessary, you may wonder?

Considering the above image retrieval task again, if our vector database is only composed of vectors, we would also need a way to reconstruct the image because that is what the end-user needs.

When a user queries for images of mountains, they would receive a list of vectors representing similar images, but without the actual images.

By storing both the embeddings (the vectors representing the images) and the raw image data, the vector database ensures that when a user queries for similar images, it not only returns the closest matching vectors but also provides access to the original images.

Example #2

In this example, consider all-text unstructured data, say, thousands of news articles, and suppose we wish to search for an answer within that data.

Traditional search methods rely on exact keyword search, which is entirely a brute-force approach and does not consider the inherent complexity of text data.

In other words, languages are incredibly nuanced, and each language provides various ways to express the same idea or ask the same question.

For instance, a simple inquiry like "What's the weather like today?" can be phrased in numerous ways, such as "How's the weather today?", "Is it sunny outside?", or "What are the current weather conditions?".

This linguistic diversity makes traditional keyword-based search methods inadequate.

As you may have already guessed, representing this data as vectors can be pretty helpful in this situation too.

Instead of relying solely on keywords and following a brute-force search, we can first represent text data in a high-dimensional vector space and store them in a vector database.

When users pose queries, the vector database can compare the vector representation of the query with that of the text data, even if they don't share the exact same wording.
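
To make this concrete, here is a minimal sketch of such a semantic search. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, both of which are illustrative choices rather than anything this article prescribes:

# Encode a small corpus and a query, then rank the corpus by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "How's the weather today?",
    "Is it sunny outside?",
    "What are the current weather conditions?",
    "The stock market closed higher today.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "What's the weather like today?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))

Even though the query does not share its exact wording with any corpus sentence, the weather-related paraphrases should score far higher than the unrelated one.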


How to generate embeddings?

At this point, if you are wondering how we even transform words (strings) into vectors (a list of numbers), let me explain.

We also covered this in a recent newsletter issue here but not in much detail, so let’s discuss those details here.

A Pivotal Moment in NLP Research Which Made Static Embeddings (Almost) Obsolete
Looking back to the pre-Transformer times.
If you already know what embedding models are, feel free to skip this part.

To build models for language-oriented tasks, it is crucial to generate numerical representations (or vectors) for words.

This allows words to be processed and manipulated mathematically so that we can perform various computational operations on them.

The objective of embeddings is to capture semantic and syntactic relationships between words. This helps machines understand and reason about language more effectively.

In the pre-Transformers era, this was primarily done using pre-trained static embeddings.

Essentially, someone would train embeddings on, say, 100k, or 200k common words using deep learning techniques and open-source them.

Consequently, other researchers would utilize those embeddings in their projects.

The most popular models at that time (around 2013-2017) were:

  • GloVe
  • Word2Vec
  • FastText, etc.

These embeddings genuinely showed some promising results in learning the relationships between words.

For instance, at that time, an experiment showed that the vector operation (King - Man) + Woman returned a vector near the word Queen.

That’s pretty interesting, isn’t it?

In fact, the following relationships were also found to be true:

  • Paris - France + Italy ≈ Rome
  • Summer - Hot + Cold ≈ Winter
  • Actor - Man + Woman ≈ Actress
  • and more.
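
If you want to reproduce these analogy-style results yourself, here is a small sketch that loads pre-trained static GloVe vectors through gensim (the glove-wiki-gigaword-100 download key is just one convenient choice; these vectors are lowercased):

# Word-analogy arithmetic with static word embeddings.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

# (King - Man) + Woman is expected to land near "queen"
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# (Paris - France) + Italy is expected to land near "rome"
print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))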

So, while these embeddings captured relative word representations, there was a major limitation.

Consider the following two sentences:

  • Convert this data into a table in Excel.
  • Put this bottle on the table.

Here, the word “table” conveys two entirely different meanings:

  • The first sentence refers to a “data” specific sense of the word “table.”
  • The second sentence refers to a “furniture” specific sense of the word “table.”

Yet, static embedding models assigned them the same representation.

Thus, these embeddings didn’t consider that a word may have different usages in different contexts.

But this was addressed in the Transformer era, which resulted in contextualized embedding models powered by Transformers, such as:

  • BERT: A language model trained using two techniques:
    • Masked Language Modeling (MLM): Predict a missing word in the sentence, given the surrounding words.
    • Next Sentence Prediction (NSP).
    • We shall discuss it in a bit more detail shortly.
  • DistilBERT: A simple, effective, and lighter version of BERT, which is around 40% smaller:
    • Utilizes a common machine learning strategy called student-teacher theory.
    • Here, the student is the distilled version of BERT, and the teacher is the original BERT model.
    • The student model is supposed to replicate the teacher model’s behavior.
    • If you want to learn how this is implemented practically, we discussed it here:
Model Compression: A Critical Step Towards Efficient Machine Learning
Four critical ways to reduce model footprint and inference time.
  • Sentence-level embedding models (such as SBERT) take this a step further and produce a single embedding for an entire sentence or passage.
    • This differs from the BERT and DistilBERT models, which produce an embedding for all words in the sentence.

There are more models, but we won't go into more detail here, and I hope you get the point.

The idea is that these models are quite capable of generating context-aware representations, thanks to their self-attention mechanism and the way they are trained.

BERT

For instance, if we consider BERT again, we discussed above that it uses the masked language modeling (MLM) technique and next sentence prediction (NSP).

These steps are also called the pre-training step of BERT because they involve training the model on a large corpus of text data before fine-tuning it on specific downstream tasks.

💡
Pre-training, in the context of machine learning model training, refers to the initial phase of training where the model learns general language representations from a large corpus of text data. The goal of pre-training is to enable the model to capture the syntactic and semantic properties of language, such as grammar, context, and relationships between words. While the text itself is unlabeled, MLM and NSP are two tasks that help us train the model in a supervised fashion. Once the model is trained, we can use the language understanding capabilities that the model acquired from the pre-training phase, and fine-tune the model on task-specific data. The following animation depicts fine-tuning:

Moving on, let’s see how the pre-training objectives of masked language modeling (MLM) and next sentence prediction (NSP) help BERT generate embeddings.

#1) Masked Language Modeling (MLM)

  • In MLM, BERT is trained to predict missing words in a sentence. To do this, a certain percentage of words in most (not all) sentences are randomly replaced with a special token, [MASK].
  • BERT then processes the masked sentence bidirectionally, meaning it considers both the left and right context of each masked word, which is why it is called “Bidirectional Encoder Representations from Transformers (BERT).”
  • For each masked word, BERT predicts what the original word is supposed to be from its context. It does this by assigning a probability distribution over the entire vocabulary and selecting the word with the highest probability as the predicted word.
  • During training, BERT is optimized to minimize the difference between the predicted words and the actual masked words, using techniques like cross-entropy loss.
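
To see MLM in action, here is a tiny sketch using the Hugging Face transformers fill-mask pipeline with bert-base-uncased (the model choice is mine, purely for illustration):

# Ask a pre-trained BERT to fill in the [MASK] token.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))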

#2) Next Sentence Prediction (NSP)

  • In NSP, BERT is trained to determine whether two input sentences appear consecutively in a document or whether they are randomly paired sentences from different documents.
  • During training, BERT receives pairs of sentences as input. Half of these pairs are consecutive sentences from the same document (positive examples), and the other half are randomly paired sentences from different documents (negative examples).
  • BERT then learns to predict whether the second sentence follows the first sentence in the original document (label 1) or whether it is a randomly paired sentence (label 0).
  • Similar to MLM, BERT is optimized to minimize the difference between the predicted labels and the actual labels, using techniques like binary cross-entropy loss.
💡
If we look back to MLM and NSP, in both cases, we did not need a labeled dataset to begin with. Instead, we used the structure of the text itself to create the training examples. This allows us to leverage large amounts of unlabeled text data, which is often more readily available than labeled data.

Now, let's see how these pre-training objectives help BERT generate embeddings:

  • MLM: By predicting masked words based on their context, BERT learns to capture the meaning and context of each word in a sentence. The embeddings generated by BERT reflect not just the individual meanings of words but also their relationships with surrounding words in the sentence.
  • NSP: By determining whether sentences are consecutive or not, BERT learns to understand the relationship between different sentences in a document. This helps BERT generate embeddings that capture not just the meaning of individual sentences but also the broader context of a document or text passage.

With consistent training, the model learns how different words relate to each other in sentences. It learns which words often come together and how they fit into the overall meaning of a sentence.

This learning process helps BERT create embeddings for words and sentences, which are contextualized, unlike earlier embeddings like GloVe and Word2Vec:

Contextualized means that the embedding model can dynamically generate the embedding for a word based on the context it is used in.

As a result, if a word appears in a different context, the model returns a different representation.

This is precisely depicted in the image below for different uses of the word Bank.

For visualization purposes, the embeddings have been projected into 2d space using t-SNE.

As depicted above, the static embedding models (GloVe and Word2Vec) produce the same embedding for different usages of a word.

However, contextualized embedding models don’t.

In fact, contextualized embeddings understand the different meanings/senses of the word “Bank”:

  • A financial institution
  • Sloping land
  • A Long Ridge, and more.

As a result, these contextualized embedding models address the major limitations of static embedding models.
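
Here is a small sketch of that behavior, assuming the transformers and torch packages and using bert-base-uncased (again an arbitrary choice). The contextual vector for “bank” in a finance sentence and in a river sentence should be noticeably different, whereas a static model would return the identical vector in both cases:

# Compare the contextual embedding of "bank" in two different sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("I deposited cash at the bank.")
v2 = bank_vector("We had a picnic on the river bank.")
print(torch.cosine_similarity(v1, v2, dim=0).item())          # typically well below 1.0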


The point of the above discussion is that modern embedding models are quite proficient at the encoding task.

As a result, they can easily transform documents, paragraphs, or sentences into numerical vectors that capture their semantic meaning and context.


Querying a vector database

A couple of sub-sections back, we provided an input query, which was encoded, and then we searched the vector database for vectors that were similar to the input vector.

In other words, the objective was to return the nearest neighbors as measured by a similarity metric (sketched in code right after this list), which could be:

  • Euclidean distance (the lower the metric, the higher the similarity).
  • Manhattan distance (the lower the metric, the higher the similarity).
  • Cosine similarity (the higher the metric, the higher the similarity).
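
Here is a quick NumPy sketch of these three metrics (the example vectors are arbitrary):

# The three similarity/distance metrics from the list above.
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)                 # lower  -> more similar

def manhattan(a, b):
    return np.abs(a - b).sum()                   # lower  -> more similar

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # higher -> more similar

q = np.array([1.0, 2.0, 3.0])
v = np.array([1.5, 1.8, 3.2])
print(euclidean(q, v), manhattan(q, v), cosine_similarity(q, v))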

The idea resonates with what we do in a typical k-nearest neighbors (kNN) setting.

We can match the query vector against the already encoded vectors and return the most similar ones.

The problem with this approach is that to find, say, only the first nearest neighbor, the input query must be matched across all vectors stored in the vector database.

This is computationally expensive, especially when dealing with large datasets, which may have millions of data points. As the size of the vector database grows, the time required to perform a nearest neighbor search increases proportionally.

But in scenarios where real-time or near-real-time responses are necessary, this brute-force approach becomes impractical.

In fact, this problem is also observed in typical relational databases. If we were to fetch rows that match a particular criterion, the whole table must be scanned.

Indexing the database provides a quick lookup mechanism, especially in cases where near real-time latency is paramount.

More specifically, when columns used in WHERE clauses or JOIN conditions are indexed, it can significantly speed up query performance.

A similar idea of indexing is also utilized in vector databases, which results in something we call an approximate nearest neighbor (ANN), which is quite self-explanatory, isn't it?

Well, the core idea is to have a tradeoff between accuracy and run time. Thus, approximate nearest neighbor algorithms are used to find the closest neighbors to a data point, though the returned neighbors may not always be the exact closest ones.

That is why they are also called non-exhaustive search algorithms.

The motivation is that when we use vector search, exact matches aren't absolutely necessary in most cases.

Approximate nearest neighbor (ANN) algorithms utilize this observation and compromise a bit of accuracy for run-time efficiency.

Thus, instead of exhaustively searching through all vectors in the database to find the closest matches, ANN algorithms provide fast, sub-linear time complexity solutions that yield approximate nearest neighbors.

Let’s discuss these techniques in the next section!


Approximate nearest neighbors (ANN)

While approximate nearest neighbor algorithms may sacrifice a certain degree of precision compared to exact nearest neighbor methods, they offer significant performance gains, particularly in scenarios where real-time or near-real-time responses are required.

The core idea is to narrow down the search space for the query vector, thereby improving the run-time performance.

The search space is reduced with the help of indexing. There are five popular indexing strategies here.

Let’s go through each of them.

#1) Flat Index

A flat index is another name for the brute-force search we saw earlier, which is exactly what kNN does. All vectors are stored in a single index structure without any hierarchical organization.

That is why this indexing technique is called “flat” – it involves no indexing strategy, and stores the data vectors as they are, i.e., in a ‘flat’ data structure.

As it searches the entire vector database, it delivers the best accuracy of all indexing methods we shall see ahead. However, this approach is incredibly slow and impractical.

Nonetheless, I wouldn’t recommend adopting any other sophisticated approach over a flat index when the data conditions are favorable, such as having only a few data points to search over and a low-dimensional dataset.
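
For such favorable cases, a flat index is literally just exhaustive search. As a minimal sketch, this is roughly what it looks like with FAISS (assuming the faiss-cpu package; IndexFlatL2 performs exact L2 search with no training and no compression):

# Flat (exhaustive) index: every query is compared against every stored vector.
import faiss
import numpy as np

d = 128                                              # vector dimensionality
xb = np.random.rand(10_000, d).astype("float32")     # "database" vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

index = faiss.IndexFlatL2(d)
index.add(xb)
distances, ids = index.search(xq, 4)                 # exact 4 nearest neighbors per query
print(ids)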

But, of course, not all datasets are small, and using a flat index is impractical in most real-life situations.

Thus, we need more sophisticated approaches to index our vectors in the vector database.

#2) Inverted File Index

IVF is probably one of the simplest and most intuitive indexing techniques. While it is commonly used in text retrieval systems, it can be adapted to vector databases for approximate nearest neighbor searches.

Here’s how!

Given a set of vectors in a high-dimensional space, the idea is to organize them into different partitions, typically using clustering algorithms like k-means.

As a result, each partition has a corresponding centroid, and every vector gets associated with only one partition corresponding to its nearest centroid.

Thus, every centroid maintains information about all the vectors that belong to its partition.

When searching for the nearest vector to the query vector, instead of searching across all vectors, we first find the closest centroid to the query vector.

The nearest neighbor is then searched in only those vectors that belong to the partition of the closest centroid found above.

Let’s estimate the search run-time difference it provides over using a flat index.

To reiterate, in flat-index, we compute the similarity of the query vector with all the vectors in the vector database.

If we have N vectors and each vector is D dimensional, the run-time complexity is O(ND) to find the nearest vector.

Compare it to the Inverted File Index, wherein, we first compute the similarity of the query vector with all the centroids obtained using the clustering algorithm.

Let’s say there are k centroids, a total of N vectors, and each vector is D dimensional.

Also, for simplicity, let’s say that the vectors are equally distributed across all partitions. Thus, each partition will have N/k data points.

First, we compute the similarity of the query vector with all the centroids, whose run-time complexity is O(kD).

Next, we compute the similarity of the query vector to the data points that belong to the centroid’s partition, with a run-time complexity of O(ND/k).

Thus, the overall run-time complexity comes out to be O(kD + ND/k).

To get some perspective, let’s assume we have 10M vectors in the vector database and divide them into k=100 centroids. Thus, each partition is expected to have roughly 100k data points.

In the flat index, we shall compare the input query across all data points – 10M.

In IVF, first, we shall compare the input query across all centroids (100), and then compare it to the vectors in the obtained partition (100k). Thus, the total comparisons, in this case, will be 100,100, which is nearly 100 times fewer, making the search roughly 100 times faster.

Of course, it is essential to note that if some vectors are actually close to the input vector but still happen to be in the neighboring partition, we will miss them.

But recalling our objective, we were never aiming for the best solution but an approximate best (that's why we call it “approximate nearest neighbors”), so this accuracy tradeoff is something we willingly accept for better run-time performance.
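
As a hands-on sketch of this partitioned search, here is roughly how an IVF index looks in FAISS (assuming the faiss-cpu package; the sizes, nlist, and nprobe values are purely illustrative):

# IVF: cluster the database into nlist partitions, probe only a few at query time.
import faiss
import numpy as np

d, nlist = 128, 100                                  # dimensionality, number of partitions
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                     # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                      # k-means to learn the nlist centroids
index.add(xb)

index.nprobe = 5                                     # search the 5 nearest partitions, not just 1
distances, ids = index.search(xq, 4)
print(ids)

Raising nprobe searches more neighboring partitions, which is exactly the accuracy-versus-run-time dial described above.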

In fact, if you remember the model compression deep dive, we followed the same idea there as well:

Model Compression: A Critical Step Towards Efficient Machine Learning
Four critical ways to reduce model footprint and inference time.

#3) Product Quantization

The idea of quantization in general refers to compressing the data while preserving as much of the original information as possible.

Thus, Product Quantization (PQ) is a vector compression technique used for memory-efficient nearest neighbor search.

Let’s understand how it works in detail.

Step 1) Create data segments

Say we have some vectors, and each vector is 256-dimensional. Assuming each dimension is represented by a number that takes 32 bits, the memory consumed by each vector would be 256 x 32 bits = 8192 bits.

In Product Quantization (PQ), we first split all vectors into sub-vectors. A demonstration is shown below, where we split the vector into M (a parameter) segments, say 8:

As a result, each segment will be 32-dimensional.

Step 2) Run KMeans

Next, we run KMeans on each segment separately, which will generate k centroids for each segment.

Do note that each centroid will represent the centroid for the subspace (32 dimensions) but not the entire vector space (256 dimensions in this demo).

For instance, if k=100, this will generate 100*8 = 800 centroids in total.

Training complete.

Step 3) Encode vectors

Next, we move to the encoding step.

The idea is that for each segment of every vector in the database, we find the nearest of the k centroids that were obtained for that segment in the training step above.

For instance, consider the first segment of the 256-dimensional data we started with:

n denotes the number of vectors in the database

We compare these segment vectors to the corresponding k centroids and find the nearest centroid for all segment vectors:

After obtaining the nearest centroid for each vector segment, we replace the entire segment with a unique centroid ID, which can be thought of as indices (a number from 0 to k-1) of the centroids in that sub-space.

We do this for all individual segments of the vectors:

This provides us with a quantized (or compressed) representation of all vectors in the vector database, which is composed of centroid IDs, and they are also known as PQ codes.

To recall, what we did here is that we’ve encoded all the vectors in the vector database with a vector of centroid IDs, each of which is a number from 0 to k-1, so every dimension now consumes only 8 bits of memory.

As there are 8 segments, total memory consumed is 8*8=64 bits, which is 128 times lower memory usage than what we had earlier – 8192 bits.

The memory saving scales immensely well when we are dealing with millions of vectors.

Of course, the encoded representation isn’t entirely accurate, but don’t worry about that as it is not that important for us to be perfectly precise on all fronts.

Step 4) Find an approximate nearest neighbor

Now, you might be wondering, how exactly do we search for the nearest neighbor based on the encoded representations?

More specifically, given a new query vector Q, we have to find a vector that is most similar (or closest) to Q in our database.

We begin by splitting the query vector Q into M segments, as we did earlier.

Next, we calculate the distance between all segments of the vector Q to all the respective centroids of that segment obtained from the KMeans step above.

This gives us a distance matrix:

The final step is to estimate the distance of the query vector Q from the vectors in the vector database.

To do this, we go back to the PQ matrix we generated earlier:

Next, we look up the corresponding entries in the distance matrix generated above.

For instance, the first vector in the above PQ matrix is this:

To get the distance of our query vector to this vector, we check the corresponding entries in the distance matrix.

We sum all the segment-wise distances to get a rough estimate of the distance of the query vector Q from all the vectors in the vector database.

We repeat this for all vectors in the database, find the lowest distance, and return the corresponding vectors from the database.

Of course, it is important to note that the above PQ matrix lookup is still a brute-force search. This is because we look up all the distances in the distance matrix for all entries of the PQ matrix.

Moreover, since we are not estimating the vector-to-vector distances but rather vector-to-centroid distances, the obtained values are just approximated distances but not true distances.

Increasing the number of centroids and segments will increase the precision of the approximate nearest neighbor search, but it will also increase the run-time of the search algorithm.

Here’s a summary of the product quantization approach:

  • Divide the vectors in the vector database into M segments.
  • Run KMeans on each segment. This will give k centroids per segment.
  • Encode the vectors in the vector database by replacing each segment of the vector with the centroid ID of the cluster it belongs to. This generates a PQ matrix, which is immensely memory-efficient.
  • Next, to determine the approximate nearest neighbor for a query vector Q, generate a distance matrix, whose each entry denotes the distance of a segment of the vector Q to all the centroids.
  • Go back to the PQ codes now, and look up the distances in the distance matrix above to get an estimate of the distance between all vectors in the database and the query vector Q. Select the vector with the minimum distance to get the approximate nearest neighbor.

Done!
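
To make the summary above concrete, here is a compact NumPy/scikit-learn sketch of the four steps. The sizes (N, D, M, k) mirror the running example and are otherwise arbitrary; a real system would use an optimized implementation instead:

# Product quantization, end to end (illustrative, not production code).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, D, M, k = 2_000, 256, 8, 100               # vectors, dims, segments, centroids per segment
seg = D // M                                  # 32 dimensions per segment
X = rng.random((N, D)).astype("float32")

# Steps 1-2: split every vector into M segments and run k-means per segment.
codebooks = []                                # M codebooks, each of shape (k, seg)
for m in range(M):
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(X[:, m*seg:(m+1)*seg])
    codebooks.append(km.cluster_centers_)

# Step 3: encode every vector as M centroid IDs (the PQ codes).
codes = np.empty((N, M), dtype=np.uint8)
for m in range(M):
    d2 = ((X[:, None, m*seg:(m+1)*seg] - codebooks[m][None]) ** 2).sum(-1)
    codes[:, m] = d2.argmin(axis=1)

# Step 4: at query time, build an (M, k) distance table and look distances up.
q = rng.random(D).astype("float32")
table = np.stack([((codebooks[m] - q[m*seg:(m+1)*seg]) ** 2).sum(-1) for m in range(M)])
approx_dist = table[np.arange(M), codes].sum(axis=1)     # one estimate per database vector
print("approximate nearest neighbor:", approx_dist.argmin())

Libraries such as FAISS ship this technique as IndexPQ (and as IndexIVFPQ when combined with the inverted file idea from the previous section), so in practice you would rely on those rather than rolling your own.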

Approximate nearest neighbor search with product quantization is suitable for medium-sized systems, and it’s pretty clear that there is a tradeoff between precision and memory utilization.

Let’s understand some more effective ways to search for the nearest neighbor.


#4-5) Hierarchical Navigable Small World (HNSW)

HNSW is possibly one of the most effective and efficient indexing methods designed specifically for nearest neighbor searches in high-dimensional spaces.

The core idea is to construct a graph structure, where each node represents a data vector, and edges connect nodes based on their similarity.

HNSW organizes the graph in a way that facilitates fast search operations, efficiently navigating through the graph to find approximate nearest neighbors.

But before we understand HNSW, it is crucial to understand NSW (Navigable Small World), which is foundational to the HNSW algorithm.


The upcoming discussion is based on the assumption that you have some idea about graphs.

While we cannot cover them in full detail, here are some details that will suffice to understand the upcoming concepts.

A graph is composed of vertices and edges, where edges connect vertices together. In this context, connected vertices are often referred to as neighbors.

Recalling what we discussed earlier about vectors, we know that similar vectors are usually located close to each other in the vector space.

Thus, if we represent these vectors as vertices of a graph, vertices that are close together (i.e., vectors with high similarity) should be connected as neighbors.

That said, even if two nodes are not directly connected, they should be reachable by traversing other vertices.

This means that we must create a navigable graph.

More formally, for a graph to be navigable, every vertex must have neighbors; otherwise, there will be no way to reach some vertices.

Also, while having neighbors is beneficial for traversal, at the same time, we want to avoid such situations where every node has too many neighbors.

This can be costly in terms of memory, storage, and computational complexity during search time.

Ideally, we want a navigable graph that resembles a small-world network, where each vertex has only a limited number of connections, and the average number of edge traversals between two randomly chosen vertices is low.

This type of graph is efficient for similarity search in large datasets.

If this is clear, we can understand how the Navigable Small World (NSW) algorithm works.


→ Graph construction in NSW

The first step in NSW is graph construction, which we call G.

This is done by shuffling the vectors randomly and constructing the graph by sequentially inserting vertices in a random order.

When adding a new vertex (V) to the graph (G), it shares an edge with K existing vertices in the graph that are closest to it.

This demo will make it more clear.

Say we set K=3.

Initially, we insert the first vertex A. As there are no other vertices in the graph at this point, A remains unconnected.

Next, we add vertex B, connecting it to A since A is the only existing vertex, and it will anyway be among the top K closest vertices. Now the graph has two vertices {A, B}.

Next, when vertex C is inserted, it is connected to both A and B. The exact process takes place for the vertex D as well.

Now, when vertex E is inserted into the graph, it connects only to the K=3 closest vertices, which, in this case, are A, B, and D.

This sequential insertion process continues, gradually building the NSW graph.

The good thing is that as more and more vertices are added, connections formed in the early stages of graph construction may become longer-range links, which makes it easier to navigate long distances in a few hops.

This is evident from the following graph, where the connections A — C and B — D span greater distances.

By constructing the graph in this manner, we get an NSW graph, which, most importantly, is navigable.

In other words, any node can be reached from any other node in the graph within a few hops.
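
Here is a minimal sketch of this insertion rule (K=3 mirrors the demo above; vectors is assumed to be a NumPy array holding the database embeddings):

# NSW-style construction: connect each new vertex to its K nearest existing vertices.
import numpy as np

def build_nsw(vectors: np.ndarray, K: int = 3, seed: int = 0) -> dict[int, set[int]]:
    rng = np.random.default_rng(seed)
    graph: dict[int, set[int]] = {}
    for v in rng.permutation(len(vectors)):
        v = int(v)
        existing = list(graph)                          # vertices inserted so far
        graph[v] = set()
        if existing:
            dists = np.linalg.norm(vectors[existing] - vectors[v], axis=1)
            for idx in np.argsort(dists)[:K]:
                u = existing[idx]
                graph[v].add(u)                         # undirected edge v <-> u
                graph[u].add(v)
    return graph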

→ Search in NSW

In the NSW graph (G) constructed above, the search process is conducted using a simple greedy search method that relies on local information at each step.

Say we want to find the nearest neighbor to the yellow node in the graph below:

To start the search, an entry point is randomly selected, which is also the beauty of this algorithm. In other words, a key advantage of NSW is that a search can be initiated from any vertex in the graph G.

Let’s choose the node A as the entry point:

After selecting the initial point, the algorithm iteratively finds a neighbor (i.e., a connected vertex) that is nearest to the query vector Q.

For instance, in this case, the vertex A has neighbors (D, B, C, and E). Thus, we shall compute the distance (or similarity, whichever you chose as a metric) of these 4 neighbors to the query vector Q.

In this case, node C is the closest, so we move to that node C from node A.

Next, the search moves toward the vertex with the least distance to the query vector.

The only unevaluated neighbor of node C is H, which turns out to be closer to the query vector, so we move to node H now.

This process is repeated until no neighbor closer to the query vector can be found, which gives us the nearest neighbor in the graph for a query vector.
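
Continuing the sketch above, the greedy routine itself is only a few lines (graph and vectors come from build_nsw, and query is the encoded query vector):

# Greedy NSW search: hop to whichever neighbor is closest to the query until stuck.
import numpy as np

def greedy_search(graph: dict[int, set[int]], vectors: np.ndarray,
                  query: np.ndarray, entry_point: int) -> int:
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        neighbors = graph[current]
        if not neighbors:
            return current
        dists = {n: np.linalg.norm(vectors[n] - query) for n in neighbors}
        best = min(dists, key=dists.get)
        if dists[best] >= current_dist:                 # no neighbor is closer -> stop
            return current
        current, current_dist = best, dists[best]

# e.g., approx_nn = greedy_search(graph, vectors, query_vector, entry_point=0)

Because the result depends on where we enter the graph, the search is typically repeated from a few random entry points, for the reason discussed next.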

One thing I like about this search algorithm is how intuitive and easy to implement it is.

That said, the search is still approximate, and it is not guaranteed that we will always find the closest neighbor, and it may return highly suboptimal results.

For instance, consider this graph below, where node A is the entry point, and the yellow node is the vector we need the nearest neighbor for:

Following the above procedure of nearest neighbor search, we shall evaluate the neighbors of the node A, which are C and B.

It is clear that both nodes are farther from the query vector than node A. Thus, the algorithm returns the node A as the final nearest neighbor.

To avoid such situations, it is recommended to repeat the search process with multiple entry points, which, of course, consumes more time.


→ Skip list data structure

While NSW is quite a promising and intuitive approach, another major issue is that we end up traversing the graph many times (or repeating the search multiple times) to arrive at an optimal approximate nearest neighbor node.

HNSW speeds up the search process by indexing the vector database into a more optimal graph structure, which is based on the idea of a skip list.

First, let me give you some details about the skip list data structure, as that is important here.

And to do this, let’s consider an intuitive example that will make it pretty clear.

Say you wish to travel from New York to California.

If we were following an NSW approach, this journey would be like traveling from one city to another, say, via an intercity taxi, which takes many hops, but gradually moves us closer to our destination, as shown below:

Is that optimal?

No, right?

Now, think about it for a second.

How can you more optimally cover this route, or how would you more optimally cover this in real life?

If you are thinking of flights, you are thinking in the right direction.

Think of skip lists as a way to plan your trip using different modes of transportation, of which, some modes can travel larger distances in small hops.

So essentially, instead of hopping from one city to another, we could start by taking a flight from New York to a major city closer to California, say Denver.

This flight covers a longer distance in a single hop, analogous to skipping several vertices in the graph that we would have covered otherwise going from one city to another.

👉
Of course, I know there is a direct flight between New York and California. This is just for demonstration purposes so assume that there is no such flight between New York and California.

From Denver, we can take another faster mode of transport, which will involve fewer hops, like a train to reach California:

To add even more granularity, say, once we reach the train station in California, we wish to travel to some place within Los Angeles, California.

Now we need something that can take smaller hops, so a taxi is perfect here.

So what did we do here?

This combination of longer flights, a relatively shorter train trip, and a taxi to travel within the city allowed us to reach our destination in relatively very few stops.

This is precisely what skip lists do.

Skip lists are a data structure that allows for efficient searching of elements in a sorted list. They are similar to linked lists but with an added layer of "skip pointers" that allow faster traversal.

This is what linked lists look like:

In a skip list, each element (or node) contains a value and a set of forward pointers that can "skip" over several elements in the list.

These forward pointers create multiple layers within the list (layer 0, layer 1, layer 2, in the above visual), with each level representing a different "skip distance."

  • Top layer (layer 2) can be thought of as a flight that can travel longer distances in one hop.
  • Middle layer (layer 1) can be thought of as a train that can travel relatively shorter distances than a flight in one hop.
  • Bottom layer (layer 0) can be thought of as a taxi that can travel short distances in one hop.

The nodes that must be kept in each layer are decided using a probabilistic approach.

The basic idea is that nodes are included in higher layers with decreasing probability, resulting in fewer nodes at higher levels, while the bottom layer ALWAYS contains all nodes.

More specifically, before skip list construction, each node is randomly assigned an integer L, which indicates the maximum layer at which it can be present in the skip list data structure. This is typically done by computing L = floor(-ln(uniform(0,1)) * CLM), where:

  • uniform(0,1) generates a random number between 0 and 1.
  • floor() rounds the result down to the nearest integer.
  • CLM is a layer multiplier constant that adjusts the overlap between layers. Increasing this parameter leads to more overlap.
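
Here is a tiny sketch of this level-assignment rule (the CLM value and sample count are arbitrary; the rule itself is the standard one matching the description above):

# Random level assignment: L = floor(-ln(uniform(0,1)) * C_LM).
import math
import random

def random_level(c_lm: float) -> int:
    if c_lm == 0:
        return 0                                  # everything stays on layer 0
    u = 1.0 - random.random()                     # uniform in (0, 1], avoids log(0)
    return math.floor(-math.log(u) * c_lm)

levels = [random_level(0.5) for _ in range(10_000)]
print({l: levels.count(l) for l in sorted(set(levels))})   # exponentially fewer nodes per layer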

For instance, if a node has L=2, which means it must exist on layer 2, layer 1 and layer 0.

Also, say the layer multiplier (CLM) was set to 0. This would mean that L=0 for all nodes:

As discussed above, L indicates the maximum layer at which a node can be present in the skip list data structure. If L=0 for all nodes, this means that the skip list will only have one layer.

Increasing this parameter leads to more overlap between layers and more layers, as shown in the plots below:

As depicted above:

  • With CLM=0, the skip list can only have layer 0, which is similar to the NSW search.
  • With CLM=0.25, we get one more layer, which has around 6-7 nodes.
  • With CLM=1, we get four layers.
  • In all cases, layer 0 always has all nodes.

The objective is to decide an optimal value for CLM: we do not want too many layers with too much overlap, but, at the same time, we also do not want only one layer (when CLM=0), which would result in no speedup at all.

Now, let me explain how skip lists speedup the search process.

Let’s say we want to find the element 50 in this list.

If we were using the typical linked list, we would have started from the first element (HEAD), and scanned each node one by one to see and check if it matches the query or not (50).

See how a skip list helps us optimize this search process.

We begin from the top layer (layer 2) and check the value corresponding to the next node in the same layer, which is 65.

As 65>50 and it is a unidirectional linked list, we must go down one level.

In layer 1, we check the value corresponding to the next node in the same layer, which is 36.
layer 1 ,我们检查同一层中下一个节点对应的值,即 36

As 50>36, it is wise to move to the node corresponding to the value 36.

Now again in layer 1, we check the value corresponding to the next node in the same layer, which is 65.

Again, as 65>50 and it is a unidirectional linked list, we must go down one level.

We reach layer 0, which can be traversed the way we usually would.

Had we traversed the linked list without building a skip list, we would have taken 5 hops:

But with a skip list, we completed the same search in 3 hops instead:

That was pretty simple and elegant, wasn't it?

While reducing the number of hops from 5 to 3 might not sound like a big improvement, it is important to note that typical vector databases have millions of nodes.

Thus, such improvements scale pretty quickly to provide run-time benefits.

→ Graph construction in HNSW

Now that we understand how skip lists work, understanding the graph construction process of Hierarchical Navigable Small World is also pretty straightforward.

Consider that this is our current graph structure:

I understand we are starting in the middle of the construction process, but just bear with me for a while, as it will clear everything up.

Essentially, the above graph has three layers in total, and it is in the middle of construction. Also, as we go up, the number of nodes decreases, which is what happens ideally in skip lists.

Now, let’s say we wish to insert a new node (blue node in the image below), and its max level (determined by the probability distribution) is L=1. This means that this node will be present on layer 1 and layer 0.

Now, our objective is to connect this new node to other nodes in the graph on layer 0 and layer 1.

This is how we do it:

  • We start from the topmost layer (layer 2) and select an entry point for this new node randomly:
  • We explore the neighbors of this entry point and select the one that is nearest to the new node to be inserted.
  • The nearest neighbor found for the blue node becomes the entry point for the next layer. Thus, we move to the corresponding node of the nearest neighbor in the next layer (layer 1):
  • With that, we have arrived at the layer where this blue node must be inserted.

Here, note that had there still been more layers, we would have repeated the above process of finding the nearest neighbor of the entry point and moving one layer down until we had arrived at the layer of interest.

For instance, imagine that the max level value for this node was L=0. Thus, the blue node would have only existed on the bottom-most layer.

After moving to layer 1 (as shown in the figure above), we haven’t arrived at the layer of interest yet.

So we explore the neighborhood of the entry point in layer 1 and find the nearest neighbor of the blue point again.

Now, the nearest neighbor found on layer 1 becomes our entry point for layer 0.


Coming back to the situation where the max level value was L=1. We are currently at layer 1, where the entry point is marked in the image below, and we must insert the blue node on this layer:

To insert a node, here’s what we do:

  • Explore the neighbors of the entry point in the current layer and connect the new node to the top K nearest neighbors. To determine the top K neighbors, a total of efConstruction (a hyperparameter) neighbors are greedily explored in this step. For instance, if K=2, then in the above diagram, we connect the blue node to the following nodes:

However, to determine these top K=2 nodes, we may have explored, say efConstruction=3 neighbors instead (the purpose of doing this will become clear shortly), as shown below:

Now, we must insert the blue node at a layer that is below the max layer value L of the blue node.

In such a case, we don’t keep just one entry point to the next layer, like we did earlier, as shown below:

However, all efConstruction nodes explored in the above layer are considered as entry points for the next layer.

Once we enter the next layer, the process is repeated: we connect the blue node to the top K neighbors by exploring all efConstruction entry nodes.

The layer-by-layer insertion process ends when the new node gets connected to nodes at level 0.

💡
I have intentionally cut out a few minor details here because it will get too complicated to understand. Also, in any real-life situation, we would hardly have to implement this algorithm as it is already implemented in popular vector databases that use HNSW for indexing. The only thing that we must know is how HNSW works on a high level.
→ Search in HNSW

Consider that after all the nodes have been inserted, we get the following graph:

Let’s understand how the approximate nearest neighbor search would work.

Say we want to find the nearest neighbor of the yellow vector in the image below:

We begin our search with an entry point in the top layer (layer 2):

We explore the connected neighbors of A and see which is closest to the yellow node. In this layer, it is C.

The algorithm greedily explores the neighborhood of vertices in a layer. We consistently move towards the query vector during this process.

When no node closer to the query vector can be found in the current layer, we move to the next layer, considering the nearest neighbor (C, in this case) as the entry point to the next layer:

The process of neighborhood exploration is repeated again.

We explore the neighbors of C and greedily move to that specific neighbor which is closest to the query vector:

Again, as no node in layer 1 is closer to the query vector, we move down, considering the nearest neighbor (F, in this case) as the entry point to the next layer.

But this time, we have reached layer 0. Thus, an approximate nearest neighbor will be returned from this layer.

When we move to layer 0 and start exploring the neighborhood of F, we notice that it has no neighbors that are closer to the query vector:

Thus, the vector corresponding to node F is returned as the approximate nearest neighbor, which, coincidentally, also happens to be the true nearest neighbor.
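To tie the walkthrough together, here is a minimal, simplified sketch of the layer-by-layer greedy search. This is my own illustration, not library code: each layer is assumed to be an adjacency dict, vectors is a dict of NumPy arrays, and only a single best candidate is tracked per layer, whereas real implementations keep an efSearch-sized candidate list.

```python
import numpy as np

def greedy_search_layer(query, entry, layer, vectors):
    """Hop to whichever neighbor is closest to the query until no
    neighbor improves on the current node (a local minimum)."""
    current = entry
    best_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in layer.get(current, []):
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < best_dist:
                current, best_dist, improved = neighbor, d, True
    return current

def hnsw_search(query, entry_point, layers, vectors):
    """Descend from the top layer to layer 0; the local minimum found
    in each layer becomes the entry point for the layer below."""
    ep = entry_point
    for layer in layers:                 # layers ordered top -> bottom
        ep = greedy_search_layer(query, ep, layer, vectors)
    return ep                            # approximate nearest neighbor
```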

→ HNSW vs NSW

In the above search process, it only took 2 hops (descending a layer is not counted as a hop) to return the nearest neighbor to the query vector.

Let’s see how many hops it would have taken us to find the nearest neighbor with NSW. For simplicity, let’s consider that the graph constructed by NSW is the one represented by layer 0 of the HNSW graph:

We started with the node A as the entry point earlier, so let’s consider the same here as well.

We begin with the node A, explore its neighbors, and move to the node E as that is the closest to the query vector:

From node E, we move to node B, which is closer to the query vector than the node E.

Next, we explore the neighbors of node B, and notice that node I is the closest to the query vector, so we hop onto that node now:

As none of node I's connected neighbors is closer to the query vector than node I itself, the algorithm returns node I as the nearest neighbor.

What happened there?

Not only did the algorithm take more hops (3) to return a nearest neighbor, but it also returned a less optimal nearest neighbor.

HNSW, on the other hand, took fewer hops and returned a more accurate and optimal nearest neighbor.


Perfect!

With that, you have learned five pretty common indexing strategies to index vectors in a vector database for efficient search.


Using Vector Databases in LLMs

At this point, one interesting thing to learn is how exactly Large Language Models (LLMs) take advantage of vector databases.

In my experience, the biggest conundrum many people face is the following:

Once we have trained our LLM, it will have some model weights for text generation. Where do vector databases fit in here?

And it is a pretty genuine question, in my opinion.

Let me explain how vector databases help LLMs be more accurate and reliable in what they produce.


To begin, we must understand that an LLM is deployed after learning from a static version of the corpus it was fed during training.

For instance, if the model was deployed after considering the data until 31st Jan 2024, and we use it, say, a week after training, it will have no clue about what happened during that week.

Repeatedly training a new model (or adapting the latest version) every single day on new data is impractical and not cost-effective. In fact, LLMs can take weeks to train.

Also, what if we open-sourced the LLM and someone else wants to use it on their privately held dataset, which, of course, was not seen during training?

As expected, the LLM will have no clue about it.

But if you think about it, is it really our objective to train an LLM to know every single thing in the world?

A BIGGG NOOOO!

That’s not our objective.

Instead, it is more about helping the LLM learn the overall structure of the language, and how to understand and generate it.

So, once we have trained this model on a ridiculously large training corpus, it can be expected that the model will have a decent level of language understanding and generation capabilities.

Thus, if we could figure out a way for LLMs to look up new information they were not trained on and use it during text generation (without training the model again), that would be great!


One way could be to provide that information in the prompt itself.

In other words, if training or fine-tuning the model isn’t desired, we can provide all the necessary details in the prompt given to the LLM.

Unfortunately, this will only work for a small amount of information.

This is because LLMs are auto-regressive models.

💡
Auto-regressive models are those models that generate outputs one step at a time, where each step depends on the previous steps. In the case of LLMs, this means that the model generates text one word at a time, based on the words it has already generated.

Thus, because the LLM must consider all previous tokens, there is a practical limit on how many tokens its prompt can contain.

Overall, this approach of providing everything in the prompt is not that promising because it limits the utility to a few thousand tokens, whereas in real life, the additional information can span millions of tokens.

This is where vector databases help.

Instead of retraining the LLM every time new data emerges or changes, we can leverage vector databases to update the model's understanding of the world dynamically.

How?

It’s simple.


As discussed earlier in the article, vector databases help us store information in the form of vectors, where each vector captures semantic information about the piece of text being encoded.

Thus, we can maintain our available information in a vector database by encoding it into vectors using an embedding model.

When the LLM needs to access this information, it can query the vector database using a similarity search with the prompt vector.

More specifically, the similarity search will try to find contents in the vector database that are similar to the input query vector.

This is where indexing becomes important because our vector database can possibly have millions of vectors.

Theoretically, we could compare the input vector to every vector in the vector database.

But for practical utility, we must find the nearest neighbor as quickly as possible.

That is why the indexing techniques we discussed earlier become so important. They help us find an approximate nearest neighbor in almost real-time.

Moving on, once the approximate nearest neighbors get retrieved, we gather the context from which those specific vectors were generated. This is possible because a vector database not only stores vectors but also the raw data that generated those vectors.

This search process retrieves context that is similar to the query vector, which represents the context or topic the LLM is interested in.

We can append this retrieved content to the actual prompt provided by the user and give the combined text as input to the LLM.

Consequently, the LLM can easily incorporate this info while generating text because it now has the relevant details available in the prompt.
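Here is a compact sketch of that whole retrieve–augment–generate loop. Everything in it is illustrative: vector_db and llm are hypothetical objects standing in for your vector database client and LLM API, and similarity_search is an assumed interface rather than a specific library call.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence embedding model works

def answer_with_rag(question: str, vector_db, llm, top_k: int = 3) -> str:
    # 1) Retrieval: embed the question and fetch similar chunks from the vector DB.
    query_vector = embedder.encode(question).tolist()
    hits = vector_db.similarity_search(query_vector, top_k=top_k)   # hypothetical API

    # 2) Augmentation: place the retrieved raw text into the prompt.
    context = "\n".join(hit["text"] for hit in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3) Generation: the LLM now has the relevant, up-to-date details in its prompt.
    return llm(prompt)
```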

And... Congratulations!

You just learned Retrieval-Augmented Generation (RAG). I am sure you must have heard this term many times by now, and what we discussed above is the entire idea behind RAG.

I intentionally did not mention RAG anywhere earlier to build the desired flow and avoid intimidating you with this term first.

In fact, its name precisely describes what we do with this technique:

  • Retrieval: Accessing and retrieving information from a knowledge source, such as a database or memory.
  • Augmented: Enhancing or enriching something, in this case, the text generation process, with additional information or context.
  • Generation: The process of creating or producing something, in this context, generating text or language.

Another critical advantage of RAG is that it drastically helps the LLM reduce hallucinations in its responses. I am sure you must have heard of this term somewhere too.

Hallucinations happen when a language model generates information that is not grounded in reality or when it makes things up.

This can lead to the model generating incorrect or misleading information, which can be problematic in many applications.

With RAG, the language model can use the retrieved information (which is expected to be reliable) from the vector database to ensure that its responses are grounded in real-world knowledge and context, reducing the likelihood of hallucinations.

This makes the model's responses more accurate, reliable, and contextually relevant, improving its overall performance and utility.

This idea makes intuitive sense as well.


Vector database providers

These days, there are tons of vector database providers that can help us store and retrieve vector representations of our data efficiently.

  • Pinecone: Pinecone is a managed vector database service that provides fast, scalable, and efficient storage and retrieval of vector data. It offers a range of features for building AI applications, such as similarity search and real-time analytics.
  • Weaviate: Weaviate is an open-source vector database that is robust, scalable, cloud-native, and fast. With Weaviate, one can turn text, images, and more into a searchable vector database using state-of-the-art ML models.
  • Milvus: Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment.
  • Qdrant: Qdrant is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant offers extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.

Pinecone demo

Going ahead, let’s get into some practical details on how we can index vectors in a vector database and perform search operations over them.

For this demonstration, I will be using Pinecone as it’s possibly one of the easiest to begin with and understand. But many other providers are linked above, which you can explore if you wish to.

To begin, we first install a few dependencies, like Pinecone and Sentence Transformers:
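The original install cell isn't preserved in this snapshot, but it would look roughly like this (package names as of the Pinecone v3 client; they may have changed since):

```python
!pip install pinecone-client sentence-transformers
```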

To store our text data in the vector database, we must encode it into vector embeddings. For this purpose, we will leverage the SentenceTransformers library.

It provides pre-trained transformer-based architectures that can efficiently encode text into dense vector representations, often called embeddings.

The SentenceTransformers library offers various pre-trained architectures, such as BERT, RoBERTa, and DistilBERT, fine-tuned specifically for sentence embeddings.

These embeddings capture semantic similarities and relationships between textual inputs, making them suitable for downstream tasks like classification and clustering.

DistilBERT is a relatively smaller model, so we shall be using it in this demonstration.

Next, open a Jupyter Notebook and import the above libraries:
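A reconstruction of the import cell (the exact code isn't shown in the snapshot):

```python
from pinecone import Pinecone, PodSpec
from sentence_transformers import SentenceTransformer
```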

Moving on, we download and instantiate the DistilBERT sentence transformer model as follows:
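Something along these lines; the exact DistilBERT checkpoint used in the article isn't shown, so this is one reasonable choice that produces 768-dimensional embeddings:

```python
# DistilBERT-based sentence transformer (768-dimensional embeddings)
model = SentenceTransformer("multi-qa-distilbert-cos-v1")
```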

To get started with Pinecone and create a vector database, we need a Pinecone API key.

To get this, head over to the Pinecone website and create an account here: https://app.pinecone.io/?sessionType=signup. We get to the following page after signing up:

Grab your API key from the left panel in the dashboard below:

Click on API Keys -> Create API Key -> Enter API Key Name -> Create.

Done!

Grab this API key (using the copy-to-clipboard button), go back to the Jupyter Notebook, and establish a connection to Pinecone using this API key, as follows:
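The connection cell would look roughly like this (replace the placeholder with your own key):

```python
pc = Pinecone(api_key="YOUR_API_KEY")   # the key created in the dashboard above
```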

In Pinecone, we store vector embeddings in indexes. The vectors in any index we create must share the same dimensionality and distance metric for measuring similarity.

We create an index using the create_index() method of the Pinecone class object created above.

On a side note, since we currently have no indexes, running the list_indexes() method returns an empty list under the indexes key of the returned dictionary:
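For reference, this is roughly what that looks like:

```python
pc.list_indexes()
# -> {'indexes': []}    (no indexes created yet)
```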

Coming back to creating an index, we use the create_index() method of the Pinecone class object as follows:
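A sketch reconstructed from the breakdown that follows (the index name is a placeholder of my choosing):

```python
pc.create_index(
    name="dummy-index",                        # user-defined index name
    dimension=768,                             # matches the SentenceTransformer output
    metric="euclidean",                        # distance metric for similarity
    spec=PodSpec(environment="gcp-starter"),   # GCP starter environment
)
```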

Here’s a breakdown of this function call:

  • name: The name of the index. This is a user-defined name that can be used to refer to the index later when performing operations on it.
  • dimension: The dimensionality of the vectors that will be stored in the index. This should match the dimensionality of the vectors that will be inserted into the index. We have specified 768 here because that is the embedding dimension returned by the SentenceTransformer model.
  • metric: The distance metric used to calculate the similarity between vectors. In this case, euclidean is used, which means that the Euclidean distance will be used as the similarity metric.
  • spec: A PodSpec object that specifies the environment in which the index will be created. In this example, the index is created in a GCP (Google Cloud Platform) environment named gcp-starter.

Executing this method creates an index, which we can also see in the dashboard:

Now that we have created an index, we can push vector embeddings to it.

To do this, let’s create some text data and encode it using the SentenceTransformer model.

I have created some dummy data below:
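The article's exact sentences aren't preserved in this snapshot, so here is an invented stand-in with the same shape (10 short sentences, each with an id and text):

```python
data = [
    {"id": "vec1",  "text": "Vector databases store embeddings for fast similarity search."},
    {"id": "vec2",  "text": "HNSW is a popular graph-based indexing technique."},
    {"id": "vec3",  "text": "I went for a long walk along the beach yesterday."},
    {"id": "vec4",  "text": "Embeddings capture the semantic meaning of text."},
    {"id": "vec5",  "text": "The football match ended in a dramatic penalty shootout."},
    {"id": "vec6",  "text": "Pinecone is a managed vector database service."},
    {"id": "vec7",  "text": "My favourite dessert is chocolate cake."},
    {"id": "vec8",  "text": "Approximate nearest neighbor search trades a little accuracy for speed."},
    {"id": "vec9",  "text": "It rained heavily in the mountains last weekend."},
    {"id": "vec10", "text": "Recommendation systems often rely on vector similarity."},
]
```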

We create embeddings for these sentences as follows:
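A sketch of the encoding cell, matching the description that follows:

```python
vector_data = []
for sentence in data:
    embedding = model.encode(sentence["text"]).tolist()        # 768 floats
    vector_info = {"id": sentence["id"], "values": embedding}  # sentence ID + vector
    vector_data.append(vector_info)
```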

This code snippet iterates over each sentence in the data list we defined earlier and encodes the text of each sentence into a vector using the downloaded sentence transformer model (model).

It then creates a dictionary vector_info containing the sentence ID (id) and the corresponding vector (values), and appends this dictionary to the vector_data list.

In practice, there can be multiple indexes under the same account, so we must create an index object that specifies the index we wish to add these embeddings to. This is done as follows:
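Assuming the index name used above:

```python
index = pc.Index("dummy-index")
```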

Now that we have the embeddings and the index, we upsert these vectors.

Upsert is a database operation that combines the actions of update and insert. It inserts a new document into a collection if the document does not already exist, or updates an existing document if it does exist. Upsert is a common operation in databases, especially in NoSQL databases, where it is used to ensure that a document is either inserted or updated based on its existence in the collection.
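The upsert cell would look roughly like this:

```python
index.upsert(vectors=vector_data)
# -> {'upserted_count': 10}
```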

Done!

We have added these vectors to the index. While the output does highlight this, we can double-check it by using the describe_index_stats operation to verify that the current vector count matches the number of vectors we upserted:
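Roughly like this (the printed values mirror the breakdown below; the index_fullness shown is only illustrative):

```python
index.describe_index_stats()
# -> {'dimension': 768,
#     'index_fullness': 0.0,
#     'namespaces': {'': {'vector_count': 10}},
#     'total_vector_count': 10}
```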

Here's what each key in the returned dictionary represents:

  • dimension: The dimensionality of the vectors stored in the index (768, in this case).
  • index_fullness: A measure of how full the index is, typically indicating the percentage of slots in the index that are occupied.
  • namespaces: A dictionary containing statistics for each namespace in the index. In this case, there is only one namespace (''), with a vector_count of 10, indicating that there are 10 vectors in the index.
  • total_vector_count: The total number of vectors in the index across all namespaces (10, in this case).

Now that we have stored the vectors in the above index, let’s run a similarity search to see the obtained results.

We can do this using the query() method of the index object we created earlier.

First, we define a search text and generate its embedding:
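Using the search text quoted later in the article:

```python
search_text = "Vector database are really helpful"
search_embedding = model.encode(search_text).tolist()
```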

Next, we query this as follows:
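A sketch of the query cell:

```python
index.query(vector=search_embedding, top_k=3)
# Returns a dictionary with 'matches', 'namespace', and 'usage' keys,
# which are described below.
```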

This code snippet calls the query method on the index object, which performs a nearest neighbor search for a given query vector (search_embedding) and returns the top 3 matches.

Here's what each key in the returned dictionary represents:

  • matches: A list of dictionaries, where each dictionary contains information about a matching vector. Each dictionary includes the id of the matching vector and the score indicating how close the query vector and the matching vector are. As we specified euclidean as our metric while creating this index, the score is a distance, so a lower score indicates a closer (more similar) match.
  • namespace: The namespace of the index where the query was performed. In this case, the namespace is an empty string (''), indicating the default namespace.
  • usage: A dictionary containing information about the usage of resources during the query operation. In this case, read_units indicates the number of read units consumed by the query operation, which is 5 here even though we originally upserted 10 vectors; read units measure how much data was read to serve the query rather than the number of vectors compared.

From the above results, we notice that the top 3 neighbors of the search_text ("Vector database are really helpful") are:

Awesome!


Conclusion

With this, we come to the end of this deep dive into vector databases.

To recap, we learned that vector databases are specialized databases designed to efficiently store and retrieve vector representations of data.

By organizing vectors into indexes, vector databases enable fast and accurate similarity searches, making them invaluable for tasks like recommendation systems and information retrieval.

Moreover, the Pinecone demo showcased how easy it is to create and query a vector index using Pinecone's service.

Before I end this article, there’s one important point I want to mention.

Just because vector databases sound cool, it does not mean that you have to adopt them in each and every place where you wish to find vector similarities.

It is important to assess whether using a vector database is necessary for your specific use case.

For small-scale applications with a limited number of vectors, simpler solutions, like storing the embeddings in NumPy arrays and doing an exhaustive search, will suffice.

There’s no need to move to vector databases unless you see clear benefits, such as latency improvement in your application, cost reduction, and more.

I hope you learned something new today!

I know we discussed many details in this deep dive, so if there’s any confusion, feel free to post your questions in the comments.

Or, if you wish to connect privately, feel free to initiate a chat here:

Connect via chat

Thanks for reading!


If you loved reading this deep dive, I am sure you will learn a lot of practical skills from other rich deep dives too, like these:

Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
Learn real-world ML model development with a primary focus on data privacy – A practical guide.
You Cannot Build Large Data Projects Until You Learn Data Version Control!
The underappreciated, yet critical, skill that most data scientists overlook.
Why Bagging is So Ridiculously Effective At Variance Reduction?
Diving into the mathematical motivation for using bagging.

Become a full member (if you aren't already) so that you never miss an article:

