
How do domain-specific chatbots work? An Overview of Retrieval Augmented Generation (RAG)

Demystifying and deconstructing Q&A bots that work over your data.

Aug. 25, 2023

There’s a popular open-source library called LangChain that can create chatbots that—among other things—do Q&A over any website/document in 3 lines of code. Here’s an example of that from the langchain docs.

from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator
loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")
index = VectorstoreIndexCreator().from_loaders([loader])
index.query("What should I work on?")

Which outputs an answer specific to Paul Graham’s essay:

The work you choose should ideally have three qualities: it should be something you have a natural aptitude for, something you have a deep interest in, and something that offers scope to do great work.

If you're unsure, you could start by working on your own projects that seem excitingly ambitious to you. It's also beneficial to be curious, try lots of things, meet lots of people, read lots of books, and ask lots of questions. When in doubt, optimize for interestingness.

It's okay to guess and be wrong sometimes, as this can lead to discovering what you're truly good at or interested in.

Note: if you’re interested you can try the chatbot built from Paul Graham’s essay here.

The first time you run this it feels like pure magic. How the heck does this work?

The answer is a process called retrieval augmented generation, or RAG for short. It is a remarkably simple concept, though also has incredible depth in the details of its implementation.

This post will provide a high-level overview of RAG. We’ll start from the big picture workflow of what’s happening, and then zoom in on all the individual pieces.

By the end of it, you should have a solid understanding of how those three magic lines of code work, and all the principles involved in creating these Q&A bots.

If you’re a developer trying to build bots like this, you’ll learn which knobs you can tweak and how to tweak them. If you’re a non-developer hoping to use AI tools on your dataset, you'll gain knowledge that will help you get the most out of them.

And, if you’re just a curious mind, you’ll hopefully learn a thing or two about some of the technology that's upending our lives.

Let’s dive in.

What is Retrieval Augmented Generation?

Retrieval augmented generation is the process of supplementing a user’s input to a large language model (LLM) like ChatGPT with additional information that you have retrieved from somewhere else. The LLM can then use that information to augment the response that it generates.

The following diagram shows how it works in practice:

[Figure: Retrieval Augmented Generation: An Overview]

It starts with a user’s question. For example “How do I do <something>?”

The first thing that happens is the retrieval step. This is the process that takes the user’s question and searches for the most relevant content from a knowledge base that might answer it. The retrieval step is by far the most important, and most complex part of the RAG chain.

But for now, just imagine a black box that knows how to pull out the best chunks of relevant information related to the user’s query.

Can't we just give the LLM the whole knowledge base?

You might be wondering why we bother with retrieval instead of just sending the whole knowledge base to the LLM. One reason is that models have built-in limits on how much text they can consume at a time (though these are quickly increasing).

A second reason is cost—sending huge amounts of text gets quite expensive. Finally, there is evidence suggesting that sending small amounts of relevant information results in better answers.

Once we’ve gotten the relevant information out of our knowledge base, we send it, along with the user’s question, to the large language model (LLM). The LLM—most commonly ChatGPT—then “reads” the provided information and answers the question. This is the augmented generation step.

Pretty simple, right?

Working backwards: Giving an LLM extra knowledge to answer a question

We’ll start at the last step: answer generation. That is, let’s assume we already have the relevant information pulled from our knowledge base that we think answers the question. How do we use that to generate an answer?

[Figure: Augmented Answer Generation]

This process may feel like black magic, but behind the scenes it is just a language model. So in broad strokes, the answer is “just ask the LLM”. How do we get an LLM to do something like this?

We'll use ChatGPT as an example. And just like regular ChatGPT, it all comes down to prompts and messages.

Giving the LLM custom instructions with the system prompt

The first component is the system prompt. The system prompt gives the language model its overall guidance. For ChatGPT, the system prompt is something like “You are a helpful assistant.”

In this case we want it to do something more specific. And, since it’s a language model, we can just tell it what we want it to do. Here’s an example short system prompt that gives the LLM more detailed instructions:

You are a Knowledge Bot. You will be given the extracted parts of a knowledge base (labeled with DOCUMENT) and a question. Answer the question using information from the knowledge base.

We’re basically saying, “Hey AI, we’re gonna give you some stuff to read. Read it and then answer our question, k? Thx.” And, because AIs are great at following our instructions, it kind of just... works.

Giving the LLM our specific knowledge sources

Next we need to give the AI its reading material. And again—the latest AIs are really good at just figuring stuff out. But, we can help it a bit with a bit of structure and formatting.

Here’s an example format you can use to pass documents to the LLM:

------------ DOCUMENT 1 -------------

This document describes the blah blah blah...

------------ DOCUMENT 2 -------------

This document is another example of using x, y and z...

------------ DOCUMENT 3 -------------

[more documents here...]

Do you need all this formatting? Probably not, but it’s nice to make things as explicit as possible. You can also use a machine-readable format like JSON or YAML. Or, if you're feeling frisky, you can just dump everything in a giant blob of text.

But, having some consistent formatting becomes important in more advanced use-cases, for example, if you want the LLM to cite its sources.
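
If you want to see that in code, here's a minimal sketch of a helper that produces the format above (the function name and inputs are ours, for illustration only):

def format_documents(snippets):
    # Label each snippet so the LLM (and any later citation logic) can tell them apart.
    parts = []
    for i, snippet in enumerate(snippets, start=1):
        parts.append(f"------------ DOCUMENT {i} -------------")
        parts.append(snippet)
    return "\n\n".join(parts)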

Once we’ve formatted the documents we just send it as a normal chat message to the LLM. Remember, in the system prompt we told it we were gonna give it some documents, and that’s all we’re doing here.

Putting everything together and asking the question

Once we’ve got our system prompt and our “documents” message, we just send the user’s question to the LLM alongside them. Here’s how that looks in Python code, using the OpenAI ChatCompletion API:

import openai

openai_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": get_system_prompt(),  # the system prompt as per above
        },
        {
            "role": "system",
            "content": get_sources_prompt(),  # the formatted documents as per above
        },
        {
            "role": "user",
            "content": user_question,  # the question we want to answer
        },
    ],
)
Python code for doing a retrieval-augmented answer with OpenAI's ChatGPT

That’s it! A custom system prompt, two messages, and you have context-specific answers!

This is a simple use case, and it can be expanded and improved on. One thing we haven't done is tell the AI what to do if it can't find an answer in the sources.

We can add these instructions to the system prompt—typically either telling it to refuse to answer, or to use its general knowledge, depending on your bot's desired behavior. You can also get the LLM to cite the specific sources it used to answer the question.
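
For example, to make the bot refuse to answer, you might add a line like this to the system prompt (the wording is ours, just one way to phrase it):

If the knowledge base does not contain the answer to the question, say "I don't know" rather than guessing.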

We’ll talk about those tactics in future posts, but for now, that’s the basics of answer generation.

With the easy part out of the way, it’s time to come back to that black box we skipped over...

The retrieval step: getting the right information out of your knowledge base

Above we assumed we had the right knowledge snippets to send to the LLM. But how do we actually get these from the user’s question? This is the retrieval step, and it is the core piece of infrastructure in any “chat with your data” system.

[Figure: Retrieval]

At its core, retrieval is a search operation—we want to look up the most relevant information based on a user’s input. And just like search, there are two main pieces:

  1. Indexing: Turning your knowledge base into something that can be searched/queried.
  2. Querying: Pulling out the most relevant bits of knowledge from a search term.

It’s worth noting that any search process could be used for retrieval. Anything that takes a user input and returns some results would work.

So, for example, you could just try to find text that matches the user's question and send that to the LLM, or you could Google the question and send the top results across—which, incidentally, is approximately how Bing's chatbot works.

That said, most RAG systems today rely on something called semantic search, which uses another core piece of AI technology: embeddings. Here we’ll focus on that use case.

So...what are embeddings?

What are embeddings? And what do they have to do with knowledge retrieval?

LLMs are weird. One of the weirdest things about them is that nobody really knows how they understand language. Embeddings are a big piece of that story.

If you ask a person how they turn words into meaning, they will likely fumble around and say something vague and self-referential like "because I know what they mean".

Somewhere deep in our brains there is a complex structure that knows "child" and "kid" are basically the same, "red" and "green" are both colors, and "pleased," "happy," and "elated" represent the same emotion with varying magnitudes. We can't explain how this works, we just know it.

Language models have a similarly complex understanding of language, except, since they are computers it's not in their brains, but made up of numbers. In an LLM's world, any piece of human language can be represented as a vector (list) of numbers. This vector of numbers is an embedding.

A critical piece of LLM technology is a translator that goes from human word-language to AI number-language. We'll call this translator an "embedding machine", though under the hood it's just an API call. Human language goes in, AI numbers come out.

[Figure: Embeddings Multiple]

What do these numbers mean? No human knows! They are only “meaningful” to the AI. But, what we do know is that similar words end up with similar sets of numbers. Because behind the scenes, the AI uses these numbers to “read” and “speak”. So the numbers have some kind of magic comprehension baked into them in AI-language—even if we don't understand it. The embedding machine is our translator.

Now, since we have these magic AI numbers, we can plot them. A simplified plot of the above examples might look something like this—where the axes are just some abstract representation of human/AI language:

[Figure: Embedding Plot]

Once we’ve plotted them, we can see that the closer two points are to each other in this hypothetical language space, the more similar they are. “Hello, how are you?” and “Hey, how’s it going?” are practically on top of each other.

“Good morning,” another greeting, is not too far from those. And “I like cupcakes” is on a totally separate island from the rest.

Naturally, you can’t represent the entirety of human language on a two-dimensional plot, but the theory is the same. In practice, embeddings have many more coordinates (1,536 for the current model used by OpenAI). But you can still do basic math to determine how close two embeddings—and therefore two pieces of text—are to each other.
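
Here's a small sketch of that math, using OpenAI's embeddings endpoint (the same pre-1.0 openai library as the chat example earlier) and cosine similarity, where scores near 1 mean very similar and scores near 0 mean unrelated:

import numpy as np
import openai

def embed(text):
    # text-embedding-ada-002 returns a 1,536-dimensional vector
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

greetings = cosine_similarity(embed("Hello, how are you?"), embed("Hey, how's it going?"))
cupcakes = cosine_similarity(embed("Hello, how are you?"), embed("I like cupcakes"))
print(greetings, cupcakes)  # the greetings pair should score noticeably higher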

These embeddings, and the math for determining “closeness,” are the core principles behind semantic search, which powers the retrieval step.


Finding the best pieces of knowledge using embeddings

Once we understand how search with embeddings works, we can construct a high-level picture of the retrieval step.

On the indexing side, first we have to break up our knowledge base into chunks of text. This process is an entire optimization problem in and of itself, and we’ll cover it next, but for now just assume we know how to do it.

[Figure: Knowledge Splitting]

Once we’ve done that, we pass each knowledge snippet through the embedding machine (which is actually an OpenAI API or similar) and get back our embedding representation of that text. Then we save the snippet, along with the embedding in a vector database—a database that is optimized for working with vectors of numbers.

[Figure: Embedding Knowledge Snippets]
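
In code, saving those snippets and embeddings could look something like this sketch with Chroma, one popular open-source vector database (the snippets list is a stand-in for real data, and embed is the helper from the similarity sketch above):

import chromadb

client = chromadb.Client()  # in-memory for this sketch; persistent clients exist too
collection = client.create_collection("knowledge_base")

snippets = ["First snippet of text...", "Second snippet of text..."]  # from the splitting step
collection.add(
    ids=[str(i) for i in range(len(snippets))],
    documents=snippets,
    embeddings=[embed(s).tolist() for s in snippets],
)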

Now we have a database with the embeddings of all our content in it. Conceptually, you can think of it as a plot of our entire knowledge base on our “language” graph:

[Figure: Knowledge Snippet Plot]

Once we have this graph, on the query side, we do a similar process. First we get the embedding for the user’s input:

[Figure: Embedding the user query]

Then we plot it in the same vector-space and find the closest snippets (in this case 1 and 2):

[Figure: Knowledge Snippet Plot with Query]

The magic embedding machine thinks these are the most related answers to the question that was asked, so these are the snippets that we pull out to send to the LLM!

In practice, this “what are the closest points” question is done via a query into our vector database. So the actual process looks more like this:

[Figure: Retrieval with Embeddings]

The query itself involves some semi-complicated math—usually using something called a cosine distance, though there are other ways of computing it.

The math is a whole space you can get into, but is out of scope for the purposes of this post, and from a practical perspective can largely be offloaded to a library or database.
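
Continuing the hypothetical Chroma sketch from above, the query step is essentially a one-liner, with the database handling the distance math for us:

results = collection.query(
    query_embeddings=[embed(user_question).tolist()],
    n_results=2,  # the two closest snippets, as in the diagram
)
closest_snippets = results["documents"][0]  # ready to format and send to the LLM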

Back to LangChain

In our LangChain example from the beginning, we have now covered everything done by this single line of code. That little function call is hiding a whole lot of complexity!

index.query("What should I work on?")

Indexing your knowledge base

Alright, we’re almost there. We now understand how we can use embeddings to find the most relevant bits of our knowledge base, pass everything to the LLM, and get our augmented answer back. The final step we’ll cover is creating that initial index from your knowledge base.

In other words, the “knowledge splitting machine” from this picture:

[Figure: Knowledge Splitting High-Level]

Perhaps surprisingly, indexing your knowledge base is usually the hardest and most important part of the whole thing. And unfortunately, it’s more art than science and involves lots of trial and error.

Big picture, the indexing process comes down to two high-level steps.

  1. Loading: Getting the contents of your knowledge base out of wherever it is normally stored.
  2. Splitting: Splitting up the knowledge into snippet-sized chunks that work well with embedding searches.

Technical clarification

Technically, the distinction between "loaders" and "splitters" is somewhat arbitrary. You could imagine a single component that does all the work at the same time, or break the loading stage into multiple sub-components.

That said, "loaders" and "splitters" are how it is done in LangChain, and they provide a useful abstraction on top of the underlying concepts.

Let’s use my own use-case as an example. I wanted to build a chatbot to answer questions about my SaaS boilerplate product, SaaS Pegasus. The first thing I wanted to add to my knowledge base was the documentation site. The loader is the piece of infrastructure that goes to my docs, figures out what pages are available, and then pulls down each page. When the loader is finished it will output individual documents—one for each page on the site.

[Figure: Loader]

Inside the loader a lot is happening! We need to crawl all the pages, scrape each one’s content, and then format the HTML into usable text. And loaders for other things—e.g. PDFs or Google Drive—have different pieces.

There’s also parallelization, error handling, and so on to figure out. Again—this is a topic of nearly infinite complexity, but one that we’ll mostly offload to a library for the purposes of this write up.

So for now, once more, we’ll just assume we have this magic box where a “knowledge base” goes in, and individual “documents” come out.

LangChain Loaders

Built-in loaders are one of the most useful pieces of LangChain. The library provides a long list of loaders that can be used to extract content from anything from a Microsoft Word doc to an entire Notion site.

The interface to LangChain loaders is exactly the same as depicted above. A "knowledge base" goes in, and a list of "documents" comes out.
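
As a quick sketch, using the same WebBaseLoader from the opening example:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")
docs = loader.load()         # a list of Document objects
print(docs[0].page_content)  # the extracted text
print(docs[0].metadata)      # metadata such as the source URL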

Coming out of the loader, we’ll have a collection of documents corresponding to each page in the documentation site. Also, ideally at this point the extra markup has been removed and just the underlying structure and text remains.

Now, we could just pass these whole webpages to our embedding machine and use those as our knowledge snippets. But, each page might cover a lot of ground! And, the more content in the page, the more “unspecific” the embedding of that page becomes.

This means that our “closeness” search algorithm may not work so well.

What’s more likely is that the topic of a user’s question matches some piece of text inside the page. This is where splitting enters the picture. With splitting, we take any single document, and split it up into bite-size, embeddable chunks, better-suited for searches.

[Figure: Splitter]

Once more, there’s an entire art to splitting up your documents, including how big to make the snippets on average (too big and they don’t match queries well, too small and they don’t have enough useful context to generate answers), how to split things up (usually by headings, if you have them), and so on.

But—a few sensible defaults are good enough to start playing with and refining your data.

Splitters in LangChain

In LangChain, splitters fall under a larger category of things called "document transformers". In addition to providing various strategies for splitting up documents, they also have tools for removing redundant content, translation, adding metadata, and so on. We only focus on splitters here as they represent the overwhelming majority of document transformations.
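
For instance, here's a minimal sketch using one of LangChain's splitters (the chunk settings are arbitrary starting points, not recommendations):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per snippet
    chunk_overlap=200,  # overlap between adjacent snippets, to preserve context
)
snippets = splitter.split_documents(docs)  # docs from the loader step above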

Once we have the document snippets, we save them into our vector database, as described above, and we’re finally done!

Here’s the complete picture of indexing a knowledge base.

[Figure: Knowledge Indexing Complete]
Back to LangChain

In LangChain, the entire indexing process is encapsulated in these two lines of code. First we initialize our website loader and tell it what content we want to use:

loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")

Then we build the entire index from the loader and save it to our vector database:
然后,我们从加载器中建立整个索引,并将其保存到我们的矢量数据库中:

index = VectorstoreIndexCreator().from_loaders([loader])

The loading, splitting, embedding, and saving are all happening behind the scenes.
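
To make that a little less magical, here's a rough sketch of what those two lines expand to. It's an approximation rather than LangChain's exact internals, though VectorstoreIndexCreator does default to OpenAI embeddings and the Chroma vector store:

from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load: knowledge base in, documents out
docs = WebBaseLoader("http://www.paulgraham.com/greatwork.html").load()

# Split: documents in, snippet-sized chunks out
snippets = RecursiveCharacterTextSplitter().split_documents(docs)

# Embed and save: chunks in, queryable vector store out
db = Chroma.from_documents(snippets, OpenAIEmbeddings())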

Recapping the whole process

At last we can fully flesh out the entire RAG pipeline. Here’s how it looks:

[Figure: RAG: Complete]

First we index our knowledge base. We get the knowledge and turn into individual documents using a loader, and then use a splitter to turn it into bite-size chunks or snippets. Once we have those, we pass them to our embedding machine, which turns them into vectors that can be used for semantic searching. We save these embeddings, alongside their text snippets in our vector database.

Next comes retrieval. It starts with the question, which is then sent through the same embedding machine and passed into our vector database to determine the closest matched snippets, which we’ll use to answer the question.

Finally, augmented answer generation. We take the snippets of knowledge, format them alongside a custom system prompt and our question, and, finally, get our context-specific answer.

Whew!

Hopefully you now have a basic understanding of how retrieval augmented generation works. If you’d like to try it out on a knowledge base of your own, without all the work of setting it up, check out Scriv.ai, which lets you build domain-specific chatbots in just a few minutes with no coding skills required.

In future posts we’ll expand on many of these concepts, including all the ways you can improve on the “default” set up outlined here. As I mentioned, there is nearly infinite depth to each of these pieces, and in the future we’ll dig into those one at a time.

If you’d like to be notified when those posts come out, sign up to receive updates here. I don’t spam, and you can unsubscribe whenever you want.


Thanks to Will Pride and Rowena Luk for reviewing drafts of this.