To understand the latest advance in generative AI, imagine a courtroom.
Judges hear and decide cases based on their general understanding of the law. Sometimes a case — like a malpractice suit or a labor dispute — requires special expertise, so judges send court clerks to a law library, looking for precedents and specific cases they can cite.
Like a good judge, large language models (LLMs) can respond to a wide variety of human queries. But to deliver authoritative answers that cite sources, the model needs an assistant to do some research.
The court clerk of AI is a process called retrieval-augmented generation, or RAG for short.
How It Got Named ‘RAG’
Patrick Lewis, lead author of the 2020 paper that coined the term, apologized for the unflattering acronym that now describes a growing family of methods across hundreds of papers and dozens of commercial services he believes represent the future of generative AI.

“We definitely would have put more thought into the name had we known our work would become so widespread,” Lewis said in an interview from Singapore, where he was sharing his ideas with a regional conference of database developers.
“We always planned to have a nicer sounding name, but when it came time to write the paper, no one had a better idea,” said Lewis, who now leads a RAG team at AI startup Cohere.
So, What Is Retrieval-Augmented Generation (RAG)?
Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.
In other words, it fills a gap in how LLMs work. Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences.
That deep understanding, sometimes called parameterized knowledge, makes LLMs useful in responding to general prompts at light speed. However, it does not serve users who want a deeper dive into a current or more specific topic.
Combining Internal, External Resources
Lewis and colleagues developed retrieval-augmented generation to link generative AI services to external resources, especially ones rich in the latest technical details.
The paper, with coauthors from the former Facebook AI Research (now Meta AI), University College London and New York University, called RAG “a general-purpose fine-tuning recipe” because it can be used by nearly any LLM to connect with practically any external resource.
Building User Trust
Retrieval-augmented generation gives models sources they can cite, like footnotes in a research paper, so users can check any claims. That builds trust.
What’s more, the technique can help models clear up ambiguity in a user query. It also reduces the possibility a model will make a wrong guess, a phenomenon sometimes called hallucination.
Another great advantage of RAG is that it's relatively easy to implement. A blog by Lewis and three of the paper's coauthors said developers can implement the process with as few as five lines of code.
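Those five lines likely refer to the RAG implementation the authors released in the Hugging Face Transformers library. Here is a minimal sketch along those lines, assuming the facebook/rag-sequence-nq checkpoint and its small dummy retrieval index:

```python
# A hedged sketch of RAG in Hugging Face Transformers; the dummy index
# keeps the download small but returns toy passages, not real Wikipedia.
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

inputs = tokenizer("who coined the term RAG?", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```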
That makes the method faster and less expensive than retraining a model with additional datasets. And it lets users hot-swap new sources on the fly.
How People Are Using RAG
With retrieval-augmented generation, users can essentially have conversations with data repositories, opening up new kinds of experiences. This means the applications for RAG could be multiple times the number of available datasets.
For example, a generative AI model supplemented with a medical index could be a great assistant for a doctor or nurse. Financial analysts would benefit from an assistant linked to market data.
In fact, almost any business can turn its technical or policy manuals, videos or logs into resources called knowledge bases that can enhance LLMs. These sources can enable use cases such as customer or field support, employee training and developer productivity.
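As a sketch of the first step, here is one simple way a technical manual might be split into passages for a knowledge base. The file name and chunk sizes are illustrative; production pipelines typically split on sentence or section boundaries instead:

```python
# A minimal, illustrative chunker; real pipelines usually respect
# sentence and section boundaries rather than fixed character counts.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so retrieval returns focused passages."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# "policy_manual.txt" is a hypothetical file standing in for any manual or log.
with open("policy_manual.txt") as f:
    chunks = chunk_text(f.read())  # each chunk is later embedded and indexed
```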
The broad potential is why companies including AWS, IBM, Glean, Google, Microsoft, NVIDIA, Oracle and Pinecone are adopting RAG.
Getting Started With Retrieval-Augmented Generation
To help users get started, NVIDIA developed an AI workflow for retrieval-augmented generation. It includes a sample chatbot and the elements users need to create their own applications with this new method.
The workflow uses NVIDIA NeMo, a framework for developing and customizing generative AI models, as well as software like NVIDIA Triton Inference Server and NVIDIA TensorRT-LLM for running generative AI models in production.
The software components are all part of NVIDIA AI Enterprise, a software platform that accelerates development and deployment of production-ready AI with the security, support and stability businesses need.
Getting the best performance for RAG workflows requires massive amounts of memory and compute to move and process data. The NVIDIA GH200 Grace Hopper Superchip, with its 288GB of fast HBM3e memory and 8 petaflops of compute, is ideal — it can deliver a 150x speedup over using a CPU.
Once companies get familiar with RAG, they can combine a variety of off-the-shelf or custom LLMs with internal or external knowledge bases to create a wide range of assistants that help their employees and customers.
RAG doesn’t require a data center. LLMs are debuting on Windows PCs, thanks to NVIDIA software that enables all sorts of applications users can access even on their laptops.

A sample application of RAG on a PC.
PCs equipped with NVIDIA RTX GPUs can now run some AI models locally. By using RAG on a PC, users can link to a private knowledge source – whether that be emails, notes or articles – to improve responses. The user can then feel confident that their data source, prompts and response all remain private and secure.
A recent blog provides an example of RAG accelerated by TensorRT-LLM for Windows to get better results fast.
The History of RAG
The roots of the technique go back at least to the early 1970s. That’s when researchers in information retrieval prototyped what they called question-answering systems, apps that use natural language processing (NLP) to access text, initially in narrow topics such as baseball.
The concepts behind this kind of text mining have remained fairly constant over the years. But the machine learning engines driving them have grown significantly, increasing their usefulness and popularity.
In the mid-1990s, the Ask Jeeves service, now Ask.com, popularized question answering with its mascot of a well-dressed valet. IBM’s Watson became a TV celebrity in 2011 when it handily beat two human champions on the Jeopardy! game show.
Today, LLMs are taking question-answering systems to a whole new level.
Insights From a London Lab
The seminal 2020 paper arrived as Lewis was pursuing a doctorate in NLP at University College London and working for Meta at a new London AI lab. The team was searching for ways to pack more knowledge into an LLM’s parameters and using a benchmark it developed to measure its progress.
Building on earlier methods and inspired by a paper from Google researchers, the group “had this compelling vision of a trained system that had a retrieval index in the middle of it, so it could learn and generate any text output you wanted,” Lewis recalled.

The IBM Watson question-answering system became a celebrity after its big success on the TV game show Jeopardy!
When Lewis plugged a promising retrieval system from another Meta team into the work in progress, the first results were unexpectedly impressive.
“I showed my supervisor and he said, ‘Whoa, take the win. This sort of thing doesn’t happen very often,’ because these workflows can be hard to set up correctly the first time,” he said.
Lewis also credits major contributions from team members Ethan Perez and Douwe Kiela, then of New York University and Facebook AI Research, respectively.
When complete, the work, which ran on a cluster of NVIDIA GPUs, showed how to make generative AI models more authoritative and trustworthy. It’s since been cited by hundreds of papers that amplified and extended the concepts in what continues to be an active area of research.
How Retrieval-Augmented Generation Works
At a high level, here’s how an NVIDIA technical brief describes the RAG process.
When users ask an LLM a question, the AI model sends the query to another model that converts it into a numeric format so machines can read it. The numeric version of the query is sometimes called an embedding or a vector.

Retrieval-augmented generation combines LLMs with embedding models and vector databases.
The embedding model then compares these numeric values to vectors in a machine-readable index of an available knowledge base. When it finds a match or multiple matches, it retrieves the related data, converts it to human-readable words and passes it back to the LLM.
Finally, the LLM combines the retrieved words and its own response to the query into a final answer it presents to the user, potentially citing sources the embedding model found.
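Here is a compact sketch of that retrieve-then-generate loop, using the sentence-transformers library for embeddings and cosine similarity over an in-memory index. The passages, model choice and the final generate() call are stand-ins for whatever knowledge base and LLM an application actually uses:

```python
# A hedged sketch of the RAG loop: embed the query, find nearby passages,
# then hand them to an LLM as context. Passages and model are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "RAG was coined in a 2020 paper by Patrick Lewis and colleagues.",
    "NVIDIA NeMo is a framework for building generative AI models.",
]
# One normalized vector per passage; the dot product is then cosine similarity.
index = embedder.encode(passages, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages whose embeddings lie closest to the query's."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    return [passages[i] for i in np.argsort(-scores)[:k]]

query = "Who coined the term RAG?"
context = "\n".join(retrieve(query))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
# answer = generate(prompt)  # hypothetical call to whichever LLM you deploy
print(prompt)
```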
Keeping Sources Current
In the background, the embedding model continuously creates and updates machine-readable indices, sometimes called vector databases, for new and updated knowledge bases as they become available.
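Continuing the sketch above (reusing its embedder, passages and index), appending new documents to the in-memory index looks like this; a vector database would handle the same upsert, plus persistence, on your behalf:

```python
# Extends the retrieval sketch above: embed new documents and append
# their vectors so subsequent queries can find them.
new_docs = ["TensorRT-LLM accelerates LLM inference on NVIDIA GPUs."]
passages.extend(new_docs)
index = np.vstack([index, embedder.encode(new_docs, normalize_embeddings=True)])
```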

Many developers find LangChain, an open-source library, can be particularly useful in chaining together LLMs, embedding models and knowledge bases. NVIDIA uses LangChain in its reference architecture for retrieval-augmented generation.
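A hedged sketch of the same pipeline expressed with LangChain follows; the class names come from the langchain-community package and the FAISS backend, and may shift between library versions:

```python
# Requires the langchain-community, sentence-transformers and faiss-cpu
# packages; the sample text and model name are illustrative.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

db = FAISS.from_texts(
    ["RAG was coined in a 2020 paper by Patrick Lewis and colleagues."],
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
)
retriever = db.as_retriever()
docs = retriever.invoke("Who coined RAG?")  # passages to hand to an LLM
print(docs[0].page_content)
```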
The LangChain community provides its own description of a RAG process.
Looking forward, the future of generative AI lies in creatively chaining all sorts of LLMs and knowledge bases together to create new kinds of assistants that deliver authoritative results users can verify.
Get hands-on experience using retrieval-augmented generation with an AI chatbot in this NVIDIA LaunchPad lab.
Explore generative AI sessions and experiences at NVIDIA GTC, the global conference on AI and accelerated computing, running March 18-21 in San Jose, Calif., and online.