Applied AI Software Engineering: RAG
Retrieval-Augmented Generation (RAG) is a common building block of AI software engineering. A deep dive into what it is, its limitations, and some alternative use cases. By Ross McNairn.
👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. To get articles like this in your inbox, every week, subscribe:
I recently spoke with Karthik Hariharan, who heads up engineering at VC firm Goodwater Capital, and he highlighted a trend he’d spotted:
“There’s an engineering project I’m seeing almost every startup building with Large Language Models (LLMs) put in place: building their own Retrieval Augmented Generation (RAG) pipelines.
RAGs are a common pattern for anyone building an LLM application. This is because it provides a layer of ‘clean prompts’ and fine-tuning. There are some existing open-source solutions, but almost everyone just builds their own, anyway.”
I asked a few Artificial Intelligence (AI) startups about this, and sure enough, all do build their own RAG. So, I reached out to a startup I know is doing the same: Wordsmith AI. It’s an AI startup for in-house legal teams that’s making heavy use of RAG, and was co-founded by Ross McNairn. He and I worked for years together at Skyscanner and he offered to share Wordsmith AI’s approach for building RAG pipelines, and some learnings. Declaration of interest: I’m an investor in Wordsmith, and the company has recently launched out of stealth.
Today, we cover:
Providing an LLM with additional context
The simplest RAGs
What is a RAG pipeline?
Preparing the RAG pipeline data store
Bringing it all together
RAG limitations
Real-world learnings building RAG pipelines
Today’s article includes a “code-along,” so you can build your own RAG. View the code used in this article at this GitHub repository: hello-wordsmith. To keep up with Ross, subscribe to his blog or follow him on LinkedIn.
With that, it’s over to Ross:
Introduction
Hi there! This post is designed to help you get familiar with one of the most fundamental patterns of AI software engineering: RAG, aka Retrieval Augmented Generation.
I co-founded a legal tech startup called Wordsmith, where we are building a platform for running a modern in-house legal team. Our founding team previously worked at Meta, Skyscanner, Travelperk and KPMG.
We are working in a targeted domain – legal texts – and building AI agents to give in-house legal teams a suite of AI tools to remove bottlenecks and improve how they work with the rest of the business. Performance and accuracy are key characteristics for us, so we’ve invested a lot of time and effort in how to best enrich and “turbo charge” these agents with custom data and objectives.
We ended up building our own RAG pipeline, and I will now walk you through how we did it and why. We’ll go into our learnings, and how we benchmark our solution. I hope the lessons we learned are useful for all budding AI engineers.
1. Providing an LLM with additional context
Have you ever asked ChatGPT a question it does not know how to answer, or its answer is too high level? We’ve all been there, and all too often, interacting with a GPT feels like talking to someone who speaks really well, but doesn’t know the facts. Even worse, they can make up the information in their responses!
Here is one example. On 1 February 2024, during an earnings call, Mark Zuckerberg laid out the strategic benefits of Meta’s AI strategy. But when we ask ChatGPT a question about this topic, this model will make up an answer that is high-level, but is not really what we want:

ChatGPT 3.5’s response to a question about Meta’s AI strategy. The answer is too general, and misses a key source that addresses the question.
This makes sense, as the model’s training cutoff date was before Mark Zuckerberg made the comments. If the model had access to that information, it would have likely been able to summarize the facts of that meeting, which are:
“So I thought it might be useful to lay out the strategic benefits [of Meta’s open source strategy] here. (...)
The short version is that open sourcing improves our models. (...)
First, open-source software is typically safer and more secure as well as more compute-efficient to operate due to all the ongoing feedback, scrutiny and development from the community. (...)
Second, open-source software often becomes an industry standard. (...)
Third, open source is hugely popular with developers and researchers. (...)
The next part of our playbook is just taking a long-term approach towards the development.”
LLMs’ understanding of the world is limited to the data they’re trained on. If you’ve been using ChatGPT for some time, you might remember this constraint in the earlier version of ChatGPT, when the bot responded: “I have no knowledge after April 2021,” in several cases.
Providing an LLM with additional information
There’s often a bunch of additional information you want an LLM to use. In the above example, I might have the transcripts of all of Meta’s shareholder meetings that I want the LLM to use. But how can we provide this additional information to an existing model?
Option 1: input via a prompt
The most obvious solution is to input the additional information via a prompt; for example, by prompting “Using the following information: [input a bunch of data] please answer the question of [ask your question].”
This is a pretty good approach. The biggest problem is that it may not scale, for these reasons:
The input token limit. Every model has an input prompt token limit. At the time of publication, this is 4,096 tokens for GPT-3, 16,385 for GPT-3.5, 8,192 for GPT-4, 128,000 for GPT-4 Turbo, and 200,000 for Anthropic models. Google’s Gemini model allows for an impressive one million token limit. While a million-token limit greatly increases the possibilities, it might still be too low for use cases with a lot of additional text to input.
Performance. The performance of LLMs substantially decreases with longer input prompts; in particular, you get degradation of context in the middle of your prompt. Even when creating long input prompts is a possibility, the performance tradeoff might make it impractical.
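In code, the prompt-input approach boils down to packing context into a template until a token budget runs out. Here's a minimal sketch; the function name is mine, and tokens are approximated as whitespace-separated words, whereas a real implementation would count them with the model's actual tokenizer (e.g. tiktoken):

```python
def build_stuffed_prompt(question: str, documents: list[str],
                         max_context_tokens: int = 3000) -> str:
    """Pack as many context documents as fit into a token budget.

    Tokens are approximated as whitespace-separated words; use the
    model's tokenizer for an accurate count in a real pipeline.
    """
    context_parts: list[str] = []
    used = 0
    for doc in documents:
        doc_tokens = len(doc.split())
        if used + doc_tokens > max_context_tokens:
            break  # budget exhausted; remaining documents are dropped
        context_parts.append(doc)
        used += doc_tokens

    context = "\n\n".join(context_parts)
    return (
        "Using the following information:\n"
        f"{context}\n\n"
        f"Please answer the question: {question}"
    )
```

Note how the budget forces documents to be dropped once the limit is hit: this is exactly the scaling problem described above, since there is no guarantee the dropped documents weren't the relevant ones.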
Option 2: fine-tune the model
We know LLMs are based on a massive weights matrix. Read more on how ChatGPT works in this Pragmatic Engineer issue. All LLMs use the same principles.
An option is to update these weight matrices based on additional information we’d like our model to know. This can be a good option, but it carries a much higher upfront cost in time, money, and computing resources. Also, it can only be done with access to the model’s weights, which you don’t have when using “closed source” models like those from OpenAI or Anthropic.
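To make the effort concrete: hosted fine-tuning APIs such as OpenAI's accept training data as JSONL, one chat conversation per line. A sketch of preparing such a file, where the helper name and the example system message are my own illustrations:

```python
import json


def write_finetune_jsonl(examples: list[tuple[str, str]], path: str) -> None:
    """Write (question, ideal_answer) pairs in the chat-format JSONL
    used by OpenAI's fine-tuning API: one JSON object per line,
    each containing a list of messages."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in examples:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a legal assistant."},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record) + "\n")
```

Even with the file format this simple, the real cost is elsewhere: curating hundreds or thousands of high-quality example pairs, paying for training runs, and re-running the whole process whenever the underlying data changes.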
Option 3: RAG
The term ‘RAG’ originated in a 2020 paper led by Patrick Lewis. One thing many people notice is that “Retrieval Augmented Generation” sounds a bit ungrammatical. Patrick agrees, and has said this:
“We always planned to have a nicer-sounding name, but when it came time to write the paper, no one had a better idea.”
RAG is a collection of techniques which help modify an LLM, so it can fill in the gaps and speak with authority; some RAG implementations even let you cite sources. The biggest benefits of the RAG approach:
Give an LLM domain-specific knowledge. You can pick what data you want your LLM to draw from, and even turn it into a specialist on any topic there is data about.
This flexibility means you can also extend your LLM’s awareness far beyond the model’s training cutoff date, and even expose it to near-real-time data, if available.
Optimal cost and speed. For all but a handful of companies, it's impractical to even consider training their own foundational model as a way to personalize the output of an LLM, due to the very high cost and skill thresholds.
In contrast, deploying a RAG pipeline will get you up-and-running relatively quickly for minimal cost. The tooling available means a single developer can have something very basic functional in a few hours.
Reduce hallucinations. “Hallucination” is the term for when LLMs “make up” responses. A well-designed RAG pipeline that presents relevant data will all but eliminate this frustrating side effect, and your LLM will speak with much greater authority and relevance on the domain about which you have provided data.
For example, in the legal sector it’s often necessary to ensure an LLM draws its insight from a specific jurisdiction. Take the example of asking a model a seemingly simple question, like:
How do I hire someone?
Your LLM will offer context based on the training data. However, you do not want the model to extract hiring practices from a US state like California, and combine this with British visa requirements!
With RAG, you control the underlying data source, meaning you can scope the LLM to only have access to a single jurisdiction’s data, which ensures responses are consistent.
Better transparency and observability. Tracing inputs and answers through LLMs is very hard. The LLM can often feel like a “black box,” where you have no idea where some answers come from. With RAG, you can see the additional source information injected, and debug your responses.
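The jurisdiction scoping described above can be pictured as a filter-then-rank step over the data store. This is a toy sketch, with a data layout and function name of my own invention, not Wordsmith's actual implementation; the term-overlap ranking stands in for real embedding search:

```python
def scoped_search(documents: list[dict], jurisdiction: str,
                  query: str) -> list[dict]:
    """Restrict retrieval to a single jurisdiction before matching,
    so the LLM never sees out-of-scope sources."""
    in_scope = [d for d in documents if d["jurisdiction"] == jurisdiction]
    terms = set(query.lower().split())
    # Rank by naive term overlap; a real pipeline would use embeddings.
    return sorted(
        in_scope,
        key=lambda d: len(terms & set(d["text"].lower().split())),
        reverse=True,
    )


docs = [
    {"jurisdiction": "uk", "text": "right to work checks before hiring"},
    {"jurisdiction": "us-ca", "text": "at-will hiring rules in California"},
]
results = scoped_search(docs, "uk", "how do I hire someone")
```

Because the filter runs before any matching, the Californian document can never be retrieved for a UK query, regardless of how similar its text looks.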
2. The simplest RAGs
The best way to understand new technology is often just to play with it. Getting a basic implementation up and running is relatively simple, and can be done with just a few lines of code. To help, Wordsmith has created a wrapper around the LlamaIndex open source project to abstract away some of the complexity, so you can get up and running easily. It has a README file in place that will get you set up with a local RAG pipeline on your machine, which chunks and embeds a copy of the US Constitution, and lets you search away from your command line.
This is as simple as RAGs get; you can “swap out” the additional context provided in this example by simply changing the source text documents!
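Under the hood, "chunking" just means splitting the source text into overlapping pieces small enough to embed. Here's a minimal word-based sketch; the sizes and function name are illustrative, not what the hello-wordsmith repo does internally:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of ~chunk_size words, each overlapping
    the previous by `overlap` words, so that sentences straddling a
    boundary appear intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already covers the end of the text
    return chunks
```

Production chunkers usually split on sentence or paragraph boundaries rather than raw word counts, but the overlap idea is the same: it reduces the chance that the answer to a query is cut in half at a chunk boundary.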
This article is designed as a code-along, so I'm going to link you to sections of this repo, so you can see where specific concepts manifest in code.
To follow along with the example, the following is needed:
An active OpenAI subscription with API usage. Set one up here if needed. Note: running a query will cost in the realm of $0.25-$0.50 per run.
Follow the instructions to set up a virtual Python environment, configure your OpenAI key, and start the virtual assistant.
This example loads the text of the US Constitution from this text file as the RAG input. However, the application can be extended to load your own data from a text file, and to “chat” with this data.
Here’s an example of how the application works when set up, and when the OpenAI API key is configured:

The example RAG pipeline application answering questions, using the US Constitution as additional context
If you’ve followed along and have run this application: congratulations! You have just executed a RAG pipeline. Now, let’s get into explaining how it works.
3. What is a RAG pipeline?
A RAG pipeline is the collection of technologies needed to answer queries using provided context. In our example, this context is the US Constitution, and our LLM is enriched with additional data extracted from the US Constitution document.
Here are the steps to building a RAG pipeline:
Step 1: Take an inbound query and deconstruct it into relevant concepts
Step 2: Collect similar concepts from your data store
Step 3: Recombine these concepts with your original query to build a more relevant, authoritative answer.
Weaving this together:

A RAG pipeline at work. It extends the context an LLM can access by pulling similar concepts from a data store to answer questions.
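The three steps above can be sketched end to end in a few lines. In this toy version, word-overlap (Jaccard) similarity stands in for real embedding search, and the function returns the recombined prompt instead of sending it to an LLM; all the names here are illustrative:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap between word sets -- a stand-in for cosine
    similarity over real embedding vectors."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def rag_answer(query: str, data_store: list[str], top_k: int = 2) -> str:
    # Steps 1 & 2: match the query's concepts against the store and
    # keep the most similar passages.
    ranked = sorted(data_store, key=lambda p: similarity(query, p),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    # Step 3: recombine the retrieved context with the original query.
    # In a real pipeline, this prompt would be sent to the LLM.
    return f"Context:\n{context}\n\nQuestion: {query}"
```

For example, querying a small store with "open source models" surfaces the passages about open sourcing and leaves unrelated passages out of the prompt entirely, which is what keeps the eventual LLM answer grounded.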
While this process appears simple, there is quite a bit of nuance in how to approach each step. A number of decisions are required to tailor to your use case, starting with how to prepare the data for use in your pipeline.