What We Learned from a Year of Building with LLMs (Part I)

By Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu and Shreya Shankar
May 28, 2024

To hear directly from the authors on this topic, sign up for the upcoming virtual event on June 20th, and learn more from the Generative AI Success Stories Superstream on June 12th.

Part II of this series can be found here and part III is forthcoming. Stay tuned.

It’s an exciting time to build with large language models (LLMs). Over the past year, LLMs have become “good enough” for real-world applications. The pace of improvements in LLMs, coupled with a parade of demos on social media, will fuel an estimated $200B investment in AI by 2025. LLMs are also broadly accessible, allowing everyone, not just ML engineers and scientists, to build intelligence into their products. While the barrier to entry for building AI products has been lowered, creating those effective beyond a demo remains a deceptively difficult endeavor.

We’ve identified some crucial, yet often neglected, lessons and methodologies informed by machine learning that are essential for developing products based on LLMs. Awareness of these concepts can give you a competitive advantage against most others in the field without requiring ML expertise! Over the past year, the six of us have been building real-world applications on top of LLMs. We realized that there was a need to distill these lessons in one place for the benefit of the community.

We come from a variety of backgrounds and serve in different roles, but we’ve all experienced firsthand the challenges that come with using this new technology. Two of us are independent consultants who’ve helped numerous clients take LLM projects from initial concept to successful product, seeing the patterns determining success or failure. One of us is a researcher studying how ML/AI teams work and how to improve their workflows. Two of us are leaders on applied AI teams: one at a tech giant and one at a startup. Finally, one of us has taught deep learning to thousands and now works on making AI tooling and infrastructure easier to use. Despite our different experiences, we were struck by the consistent themes in the lessons we’ve learned, and we’re surprised that these insights aren’t more widely discussed.

Our goal is to make this a practical guide to building successful products around LLMs, drawing from our own experiences and pointing to examples from around the industry. We’ve spent the past year getting our hands dirty and gaining valuable lessons, often the hard way. While we don’t claim to speak for the entire industry, here we share some advice and lessons for anyone building products with LLMs.

This work is organized into three sections: tactical, operational, and strategic. This is the first of three pieces. It dives into the tactical nuts and bolts of working with LLMs. We share best practices and common pitfalls around prompting, setting up retrieval-augmented generation, applying flow engineering, and evaluation and monitoring. Whether you’re a practitioner building with LLMs or a hacker working on weekend projects, this section was written for you. Look out for the operational and strategic sections in the coming weeks.

Ready to delve in? Let’s go.

Tactical

In this section, we share best practices for the core components of the emerging LLM stack: prompting tips to improve quality and reliability, evaluation strategies to assess output, retrieval-augmented generation ideas to improve grounding, and more. We also explore how to design human-in-the-loop workflows. While the technology is still rapidly developing, we hope these lessons, the by-product of countless experiments we’ve collectively run, will stand the test of time and help you build and ship robust LLM applications.

Prompting

We recommend starting with prompting when developing new applications. It’s easy to both underestimate and overestimate its importance. It’s underestimated because the right prompting techniques, when used correctly, can get us very far. It’s overestimated because even prompt-based applications require significant engineering around the prompt to work well.

Focus on getting the most out of fundamental prompting techniques

A few prompting techniques have consistently helped improve performance across various models and tasks: n-shot prompts + in-context learning, chain-of-thought, and providing relevant resources.

The idea of in-context learning via n-shot prompts is to provide the LLM with a few examples that demonstrate the task and align outputs to our expectations. A few tips:

  • If n is too low, the model may over-anchor on those specific examples, hurting its ability to generalize. As a rule of thumb, aim for n ≥ 5. Don’t be afraid to go as high as a few dozen.
  • Examples should be representative of the expected input distribution. If you’re building a movie summarizer, include samples from different genres in roughly the proportion you expect to see in practice.
  • You don’t necessarily need to provide the full input-output pairs. In many cases, examples of desired outputs are sufficient.
  • If you are using an LLM that supports tool use, your n-shot examples should also use the tools you want the agent to use.
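
To make this concrete, here is a minimal sketch of assembling an n-shot prompt for the movie-summarizer example above; the examples, the helper name, and the exact wording are illustrative assumptions rather than a prescribed format.

# A minimal sketch of n-shot prompting for a movie summarizer.
# The examples are hypothetical; in practice aim for n >= 5, drawn from the
# genres you expect to see in production.
FEW_SHOT_EXAMPLES = [
    {"review": "A slow-burn sci-fi epic about memory, grief, and first contact...",
     "summary": "A meditative sci-fi drama about memory and loss."},
    {"review": "Non-stop car chases, explosions, and one-liners from start to finish...",
     "summary": "A high-octane action comedy."},
]

def build_summary_prompt(review: str) -> str:
    shots = "\n\n".join(
        f"Review: {ex['review']}\nSummary: {ex['summary']}" for ex in FEW_SHOT_EXAMPLES
    )
    return (
        "Summarize each movie review in one sentence.\n\n"
        f"{shots}\n\nReview: {review}\nSummary:"
    )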

In chain-of-thought (CoT) prompting, we encourage the LLM to explain its thought process before returning the final answer. Think of it as providing the LLM with a sketchpad so it doesn’t have to do it all in memory. The original approach was to simply add the phrase “Let’s think step-by-step” as part of the instructions. However, we’ve found it helpful to make the CoT more specific, where adding specificity via an extra sentence or two often reduces hallucination rates significantly. For example, when asking an LLM to summarize a meeting transcript, we can be explicit about the steps, such as:

  • First, list the key decisions, follow-up items, and associated owners in a sketchpad.
  • Then, check that the details in the sketchpad are factually consistent with the transcript.
  • Finally, synthesize the key points into a concise summary.
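
Spelled out as a prompt, those steps might look something like the sketch below; the exact wording is an illustrative assumption, not the only way to phrase it.

# Sketch of a more specific chain-of-thought prompt for meeting-transcript
# summarization; the wording is illustrative, not prescriptive.
COT_SUMMARY_PROMPT = """Summarize the meeting transcript below.

First, list the key decisions, follow-up items, and associated owners in a sketchpad.
Then, check that the details in the sketchpad are factually consistent with the transcript.
Finally, synthesize the key points into a concise summary.

Transcript:
{transcript}
"""

def render_prompt(transcript: str) -> str:
    return COT_SUMMARY_PROMPT.format(transcript=transcript)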

Recently, some doubt has been cast on whether this technique is as powerful as believed. Additionally, there’s significant debate about exactly what happens during inference when chain-of-thought is used. Regardless, this technique is one to experiment with when possible.

Providing relevant resources is a powerful mechanism to expand the model’s knowledge base, reduce hallucinations, and increase the user’s trust. Often accomplished via retrieval augmented generation (RAG), providing the model with snippets of text that it can directly utilize in its response is an essential technique. When providing the relevant resources, it’s not enough to merely include them; don’t forget to tell the model to prioritize their use, refer to them directly, and sometimes to mention when none of the resources are sufficient. These help “ground” agent responses to a corpus of resources.

Structure your inputs and outputs

Structured input and output help models better understand the input as well as return output that can reliably integrate with downstream systems. Adding serialization formatting to your inputs can help provide more clues to the model as to the relationships between tokens in the context, additional metadata to specific tokens (like types), or relate the request to similar examples in the model’s training data.

As an example, many questions on the internet about writing SQL begin by specifying the SQL schema. Thus, you may expect that effective prompting for Text-to-SQL should include structured schema definitions; indeed.

Structured output serves a similar purpose, but it also simplifies integration into downstream components of your system. Instructor and Outlines work well for structured output. (If you’re importing an LLM API SDK, use Instructor; if you’re importing Huggingface for a self-hosted model, use Outlines.) Structured input expresses tasks clearly and resembles how the training data is formatted, increasing the probability of better output.
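
As a rough sketch of the structured-output side (assuming the instructor and openai packages, plus a hypothetical ProductInfo schema and model name), extraction with Instructor might look like this:

# Sketch: structured output with Instructor + Pydantic. Assumes the `instructor`
# and `openai` packages; the model name and schema are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    size: str
    price: float
    color: str

client = instructor.from_openai(OpenAI())

product = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ProductInfo,  # Instructor validates (and retries) until the output parses
    messages=[{"role": "user", "content": "Extract the product details from: ..."}],
)
print(product.model_dump())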

When using structured input, be aware that each LLM family has their own preferences. Claude prefers xml while GPT favors Markdown and JSON. With XML, you can even pre-fill Claude’s responses by providing a response tag like so.

messages = [
    {
        "role": "user",
        # The user turn asks for specific fields and wraps the source text in a
        # <description> tag so the model can tell instructions from data.
        "content": """Extract the <name>, <size>, <price>, and <color>
from this product description into your <response>.
<description>The SmartHome Mini is a compact smart home assistant
available in black or white for only $49.99. At just 5 inches wide,
it lets you control lights, thermostats, and other connected
devices via voice or app—no matter where you place it in your home.
This affordable little hub brings convenient hands-free control to
your smart devices.</description>""",
    },
    {
        # Pre-filling the assistant turn with an opening tag nudges Claude to
        # continue inside the requested XML structure.
        "role": "assistant",
        "content": "<response><name>",
    },
]

Have small prompts that do one thing, and only one thing, well

A common anti-pattern/code smell in software is the “God Object,” where we have a single class or function that does everything. The same applies to prompts too.

A prompt typically starts simple: A few sentences of instruction, a couple of examples, and we’re good to go. But as we try to improve performance and handle more edge cases, complexity creeps in. More instructions. Multi-step reasoning. Dozens of examples. Before we know it, our initially simple prompt is now a 2,000-token Frankenstein. And to add insult to injury, it has worse performance on the more common and straightforward inputs! GoDaddy shared this challenge as their No. 1 lesson from building with LLMs.

Just like how we strive (read: struggle) to keep our systems and code simple, so should we for our prompts. Instead of having a single, catch-all prompt for the meeting transcript summarizer, we can break it into steps to:

  • Extract key decisions, action items, and owners into structured format
  • Check extracted details against the original transcription for consistency
  • Generate a concise summary from the structured details

As a result, we’ve split our single prompt into multiple prompts that are each simple, focused, and easy to understand. And by breaking them up, we can now iterate and eval each prompt individually.
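
A minimal sketch of that decomposition, with each step as its own small prompt; the llm() helper stands in for whatever completion call you use and is purely hypothetical.

# Sketch: the meeting summarizer as three small, focused prompts chained together.
# `llm` is a hypothetical stand-in for your completion call.
from typing import Callable

def summarize_meeting(transcript: str, llm: Callable[[str], str]) -> str:
    extracted = llm(
        "Extract the key decisions, action items, and owners from this transcript "
        f"as JSON.\n\n{transcript}"
    )
    checked = llm(
        "Remove any detail below that is not supported by the transcript, and "
        f"return the corrected JSON.\n\nTranscript:\n{transcript}\n\nDetails:\n{extracted}"
    )
    return llm(f"Write a concise summary from these structured details.\n\n{checked}")

# Each step can now be iterated on and evaluated in isolation.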

Craft your context tokens

Rethink, and challenge your assumptions about how much context you actually need to send to the agent. Be like Michelangelo, do not build up your context sculpture—chisel away the superfluous material until the sculpture is revealed. RAG is a popular way to collate all of the potentially relevant blocks of marble, but what are you doing to extract what’s necessary?

We’ve found that taking the final prompt sent to the model—with all of the context construction, and meta-prompting, and RAG results—putting it on a blank page and just reading it, really helps you rethink your context. We have found redundancy, self-contradictory language, and poor formatting using this method.

The other key optimization is the structure of your context. Your bag-of-docs representation isn’t helpful for humans, don’t assume it’s any good for agents. Think carefully about how you structure your context to underscore the relationships between parts of it, and make extraction as simple as possible.

Information Retrieval/RAG

Beyond prompting, another effective way to steer an LLM is by providing knowledge as part of the prompt. This grounds the LLM on the provided context which is then used for in-context learning. This is known as retrieval-augmented generation (RAG). Practitioners have found RAG effective at providing knowledge and improving output, while requiring far less effort and cost compared to finetuning.

RAG is only as good as the retrieved documents’ relevance, density, and detail

The quality of your RAG’s output is dependent on the quality of retrieved documents, which in turn can be considered along a few factors.

The first and most obvious metric is relevance. This is typically quantified via ranking metrics such as Mean Reciprocal Rank (MRR) or Normalized Discounted Cumulative Gain (NDCG). MRR evaluates how well a system places the first relevant result in a ranked list while NDCG considers the relevance of all the results and their positions. They measure how good the system is at ranking relevant documents higher and irrelevant documents lower. For example, if we’re retrieving user summaries to generate movie review summaries, we’ll want to rank reviews for the specific movie higher while excluding reviews for other movies.
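
For reference, a small sketch of computing MRR over a batch of queries, assuming you already have relevance labels for the retrieved results:

# Sketch: Mean Reciprocal Rank. `ranked` is the list of retrieved doc IDs in rank
# order; `relevant` is the set of doc IDs labeled relevant for that query.
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in queries) / len(queries)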

Like traditional recommendation systems, the rank of retrieved items will have a significant impact on how the LLM performs on downstream tasks. To measure the impact, run a RAG-based task but with the retrieved items shuffled—how does the RAG output perform?

Second, we also want to consider information density. If two documents are equally relevant, we should prefer one that’s more concise and has fewer extraneous details. Returning to our movie example, we might consider the movie transcript and all user reviews to be relevant in a broad sense. Nonetheless, the top-rated reviews and editorial reviews will likely be more dense in information.

Finally, consider the level of detail provided in the document. Imagine we’re building a RAG system to generate SQL queries from natural language. We could simply provide table schemas with column names as context. But, what if we include column descriptions and some representative values? The additional detail could help the LLM better understand the semantics of the table and thus generate more correct SQL.

Don’t forget keyword search; use it as a baseline and in hybrid search.

Given how prevalent the embedding-based RAG demo is, it’s easy to forget or overlook the decades of research and solutions in information retrieval.

Nonetheless, while embeddings are undoubtedly a powerful tool, they are not the be all and end all. First, while they excel at capturing high-level semantic similarity, they may struggle with more specific, keyword-based queries, like when users search for names (e.g., Ilya), acronyms (e.g., RAG), or IDs (e.g., claude-3-sonnet). Keyword-based search, such as BM25, is explicitly designed for this. And after years of keyword-based search, users have likely taken it for granted and may get frustrated if the document they expect to retrieve isn’t being returned.

Vector embeddings do not magically solve search. In fact, the heavy lifting is in the step before you re-rank with semantic similarity search. Making a genuine improvement over BM25 or full-text search is hard.

Aravind Srinivas, CEO Perplexity.ai

We’ve been communicating this to our customers and partners for months now. Nearest Neighbor Search with naive embeddings yields very noisy results and you’re likely better off starting with a keyword-based approach.

Beyang Liu, CTO Sourcegraph

Second, it’s more straightforward to understand why a document was retrieved with keyword search—we can look at the keywords that match the query. In contrast, embedding-based retrieval is less interpretable. Finally, thanks to systems like Lucene and OpenSearch that have been optimized and battle-tested over decades, keyword search is usually more computationally efficient.

In most cases, a hybrid will work best: keyword matching for the obvious matches, and embeddings for synonyms, hypernyms, and spelling errors, as well as multimodality (e.g., images and text). Shortwave shared how they built their RAG pipeline, including query rewriting, keyword + embedding retrieval, and ranking.
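
One common way to combine the two retrievers is reciprocal rank fusion (a widely used option, not something the Shortwave write-up prescribes); here is a sketch with placeholder retriever calls:

# Sketch: hybrid retrieval by fusing a keyword (BM25) ranking with an embedding
# ranking via reciprocal rank fusion. `bm25_search` and `vector_search` are
# hypothetical placeholders that each return doc IDs in rank order.
from collections import defaultdict
from typing import Callable

def hybrid_search(
    query: str,
    bm25_search: Callable[[str, int], list[str]],
    vector_search: Callable[[str, int], list[str]],
    k: int = 10,
    c: int = 60,  # the usual RRF smoothing constant
) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_search(query, k * 2), vector_search(query, k * 2)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:k]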

Prefer RAG over fine-tuning for new knowledge

Both RAG and fine-tuning can be used to incorporate new information into LLMs and increase performance on specific tasks. Thus, which should we try first?

Recent research suggests that RAG may have an edge. One study compared RAG against unsupervised fine-tuning (a.k.a. continued pre-training), evaluating both on a subset of MMLU and current events. They found that RAG consistently outperformed fine-tuning for knowledge encountered during training as well as entirely new knowledge. In another paper, they compared RAG against supervised fine-tuning on an agricultural dataset. Similarly, the performance boost from RAG was greater than fine-tuning, especially for GPT-4 (see Table 20 of the paper).

Beyond improved performance, RAG comes with several practical advantages too. First, compared to continuous pretraining or fine-tuning, it’s easier—and cheaper!—to keep retrieval indices up-to-date. Second, if our retrieval indices have problematic documents that contain toxic or biased content, we can easily drop or modify the offending documents.

In addition, the R in RAG provides finer grained control over how we retrieve documents. For example, if we’re hosting a RAG system for multiple organizations, by partitioning the retrieval indices, we can ensure that each organization can only retrieve documents from their own index. This ensures that we don’t inadvertently expose information from one organization to another.

Long-context models won’t make RAG obsolete

With Gemini 1.5 providing context windows of up to 10M tokens in size, some have begun to question the future of RAG.

I tend to believe that Gemini 1.5 is significantly overhyped by Sora. A context window of 10M tokens effectively makes most of existing RAG frameworks unnecessary—you simply put whatever your data into the context and talk to the model like usual. Imagine how it does to all the startups/agents/LangChain projects where most of the engineering efforts goes to RAG 😅 Or in one sentence: the 10m context kills RAG. Nice work Gemini.

Yao Fu

While it’s true that long contexts will be a game-changer for use cases such as analyzing multiple documents or chatting with PDFs, the rumors of RAG’s demise are greatly exaggerated.

First, even with a context window of 10M tokens, we’d still need a way to select information to feed into the model. Second, beyond the narrow needle-in-a-haystack eval, we’ve yet to see convincing data that models can effectively reason over such a large context. Thus, without good retrieval (and ranking), we risk overwhelming the model with distractors, or may even fill the context window with completely irrelevant information.

Finally, there’s cost. The Transformer’s inference cost scales quadratically (or linearly in both space and time) with context length. Just because there exists a model that could read your organization’s entire Google Drive contents before answering each question doesn’t mean that’s a good idea. Consider an analogy to how we use RAM: we still read and write from disk, even though there exist compute instances with RAM running into the tens of terabytes.

So don’t throw your RAGs in the trash just yet. This pattern will remain useful even as context windows grow in size.

Tuning and optimizing workflows

Prompting an LLM is just the beginning. To get the most juice out of them, we need to think beyond a single prompt and embrace workflows. For example, how could we split a single complex task into multiple simpler tasks? When is finetuning or caching helpful with increasing performance and reducing latency/cost? In this section, we share proven strategies and real-world examples to help you optimize and build reliable LLM workflows.

Step-by-step, multi-turn “flows” can give large boosts.

We already know that by decomposing a single big prompt into multiple smaller prompts, we can achieve better results. An example of this is AlphaCodium: By switching from a single prompt to a multi-step workflow, they increased GPT-4 accuracy (pass@5) on CodeContests from 19% to 44%. The workflow includes:

  • Reflecting on the problem
  • Reasoning on the public tests
  • Generating possible solutions
  • Ranking possible solutions
  • Generating synthetic tests
  • Iterating on the solutions on public and synthetic tests.

Small tasks with clear objectives make for the best agent or flow prompts. It’s not required that every agent prompt requests structured output, but structured outputs help a lot to interface with whatever system is orchestrating the agent’s interactions with the environment.

Some things to try

  • An explicit planning step, as tightly specified as possible. Consider having predefined plans to choose from (c.f. https://youtu.be/hGXhFa3gzBs?si=gNEGYzux6TuB1del).
  • Rewriting the original user prompts into agent prompts. Be careful, this process is lossy!
  • Agent behaviors as linear chains, DAGs, and State-Machines; different dependency and logic relationships can be more and less appropriate for different scales. Can you squeeze performance optimization out of different task architectures?
  • Planning validations; your planning can include instructions on how to evaluate the responses from other agents to make sure the final assembly works well together.
  • Prompt engineering with fixed upstream state—make sure your agent prompts are evaluated against a collection of variants of what may happen before.

Prioritize deterministic workflows for now

While AI agents can dynamically react to user requests and the environment, their non-deterministic nature makes them a challenge to deploy. Each step an agent takes has a chance of failing, and the chances of recovering from the error are poor. Thus, the likelihood that an agent completes a multi-step task successfully decreases exponentially as the number of steps increases. As a result, teams building agents find it difficult to deploy reliable agents.

A promising approach is to have agent systems that produce deterministic plans which are then executed in a structured, reproducible way. In the first step, given a high-level goal or prompt, the agent generates a plan. Then, the plan is executed deterministically. This allows each step to be more predictable and reliable. Benefits include:

  • Generated plans can serve as few-shot samples to prompt or finetune an agent.
  • Deterministic execution makes the system more reliable, and thus easier to test and debug. Furthermore, failures can be traced to the specific steps in the plan.
  • Generated plans can be represented as directed acyclic graphs (DAGs) which are easier, relative to a static prompt, to understand and adapt to new situations.
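
A rough sketch of that plan-then-execute pattern is below; the step functions and the llm_plan call are hypothetical placeholders, and a real system would validate the plan before running it.

# Sketch: generate a plan once with the LLM, then execute it deterministically.
# `llm_plan` and the step functions are hypothetical placeholders.
import json
from typing import Any, Callable

def fetch_reviews(state: dict, movie: str) -> Any: ...
def extract_topics(state: dict) -> Any: ...
def write_summary(state: dict) -> Any: ...

STEPS: dict[str, Callable[..., Any]] = {
    "fetch_reviews": fetch_reviews,
    "extract_topics": extract_topics,
    "write_summary": write_summary,
}

def run(goal: str, llm_plan: Callable[[str], str]) -> dict:
    # e.g. '[{"step": "fetch_reviews", "args": {"movie": "..."}}, {"step": "extract_topics"}]'
    plan = json.loads(llm_plan(goal))
    state: dict = {}
    for item in plan:  # executed in a fixed, reproducible order
        state[item["step"]] = STEPS[item["step"]](state, **item.get("args", {}))
    return state  # a failure traces back to the specific step that raised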

The most successful agent builders may be those with strong experience managing junior engineers because the process of generating plans is similar to how we instruct and manage juniors. We give juniors clear goals and concrete plans, instead of vague open-ended directions, and we should do the same for our agents too.

In the end, the key to reliable, working agents will likely be found in adopting more structured, deterministic approaches, as well as collecting data to refine prompts and finetune models. Without this, we’ll build agents that may work exceptionally well some of the time, but on average, disappoint users which leads to poor retention.

Getting more diverse outputs beyond temperature

Suppose your task requires diversity in an LLM’s output. Maybe you’re writing an LLM pipeline to suggest products to buy from your catalog given a list of products the user bought previously. When running your prompt multiple times, you might notice that the resulting recommendations are too similar—so you might increase the temperature parameter in your LLM requests.

Briefly, increasing the temperature parameter makes LLM responses more varied. At sampling time, the probability distributions of the next token become flatter, meaning that tokens which are usually less likely get chosen more often. Still, when increasing temperature, you may notice some failure modes related to output diversity. For example:

  • Some products from the catalog that could be a good fit may never be output by the LLM.
  • The same handful of products might be overrepresented in outputs, if they are highly likely to follow the prompt based on what the LLM has learned at training time.
  • If the temperature is too high, you may get outputs that reference nonexistent products (or gibberish!)

In other words, increasing temperature does not guarantee that the LLM will sample outputs from the probability distribution you expect (e.g., uniform random). Nonetheless, we have other tricks to increase output diversity. The simplest way is to adjust elements within the prompt. For example, if the prompt template includes a list of items, such as historical purchases, shuffling the order of these items each time they’re inserted into the prompt can make a significant difference.

Additionally, keeping a short list of recent outputs can help prevent redundancy. In our recommended products example, by instructing the LLM to avoid suggesting items from this recent list, or by rejecting and resampling outputs that are similar to recent suggestions, we can further diversify the responses. Another effective strategy is to vary the phrasing used in the prompts. For instance, incorporating phrases like “pick an item that the user would love using regularly” or “select a product that the user would likely recommend to friends” can shift the focus and thereby influence the variety of recommended products.
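
A small sketch combining two of these tricks, shuffling the prompt items and rejecting suggestions that repeat recent ones; the llm call is a hypothetical placeholder.

# Sketch: diversify recommendations by shuffling prompt items and resampling
# outputs that repeat recent suggestions. `llm` is a hypothetical placeholder.
import random
from collections import deque
from typing import Callable

recent_suggestions: deque[str] = deque(maxlen=20)  # short memory of recent outputs

def recommend(purchases: list[str], llm: Callable[[str], str], retries: int = 3) -> str:
    items = purchases[:]
    random.shuffle(items)  # vary the prompt itself, not just the temperature
    prompt = (
        "The user previously bought: " + ", ".join(items) + ".\n"
        "Suggest one product from our catalog that they would love using regularly."
    )
    suggestion = llm(prompt).strip()
    for _ in range(retries):
        if suggestion not in recent_suggestions:
            break
        suggestion = llm(prompt).strip()  # resample if it duplicates a recent output
    recent_suggestions.append(suggestion)
    return suggestion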

Caching is underrated.

Caching saves cost and eliminates generation latency by removing the need to recompute responses for the same input. Furthermore, if a response has previously been guardrailed, we can serve these vetted responses and reduce the risk of serving harmful or inappropriate content.

One straightforward approach to caching is to use unique IDs for the items being processed, such as if we’re summarizing new articles or product reviews. When a request comes in, we can check to see if a summary already exists in the cache. If so, we can return it immediately; if not, we generate, guardrail, and serve it, and then store it in the cache for future requests.
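
Sketched out, that cache-by-ID flow might look like the following; the generation and guardrail helpers are hypothetical placeholders, and a real deployment would likely use a shared store such as Redis rather than an in-process dict.

# Sketch: cache guardrailed summaries keyed by a stable item ID.
# `generate_summary` and `passes_guardrails` are hypothetical placeholders.
from typing import Callable, Optional

cache: dict[str, str] = {}  # in-process for illustration only

def get_summary(
    item_id: str,
    text: str,
    generate_summary: Callable[[str], str],
    passes_guardrails: Callable[[str], bool],
) -> Optional[str]:
    if item_id in cache:
        return cache[item_id]  # serve the previously vetted response
    summary = generate_summary(text)
    if not passes_guardrails(summary):
        return None  # don't cache or serve flagged output
    cache[item_id] = summary
    return summary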

For more open-ended queries, we can borrow techniques from the field of search, which also leverages caching for open-ended inputs. Features like autocomplete and spelling correction also help normalize user input and thus increase the cache hit rate.

When to fine-tune

We may have some tasks where even the most cleverly designed prompts fall short. For example, even after significant prompt engineering, our system may still be a ways from returning reliable, high-quality output. If so, then it may be necessary to finetune a model for your specific task.

Successful examples include:

  • Honeycomb’s Natural Language Query Assistant: Initially, the “programming manual” was provided in the prompt together with n-shot examples for in-context learning. While this worked decently, fine-tuning the model led to better output on the syntax and rules of the domain-specific language.
  • ReChat’s Lucy: The LLM needed to generate responses in a very specific format that combined structured and unstructured data for the frontend to render correctly. Fine-tuning was essential to get it to work consistently.

Nonetheless, while fine-tuning can be effective, it comes with significant costs. We have to annotate fine-tuning data, finetune and evaluate models, and eventually self-host them. Thus, consider if the higher upfront cost is worth it. If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment. However, if we do decide to fine-tune, to reduce the cost of collecting human annotated data, we can generate and finetune on synthetic data, or bootstrap on open-source data.

Evaluation & Monitoring

Evaluating LLMs can be a minefield. The inputs and the outputs of LLMs are arbitrary text, and the tasks we set them to are varied. Nonetheless, rigorous and thoughtful evals are critical—it’s no coincidence that technical leaders at OpenAI work on evaluation and give feedback on individual evals.

Evaluating LLM applications invites a diversity of definitions and reductions: it’s simply unit testing, or it’s more like observability, or maybe it’s just data science. We have found all of these perspectives useful. In the following section, we provide some lessons we’ve learned about what is important in building evals and monitoring pipelines.

Create a few assertion-based unit tests from real input/output samples

Create unit tests (i.e., assertions) consisting of samples of inputs and outputs from production, with expectations for outputs based on at least three criteria. While three criteria might seem arbitrary, it’s a practical number to start with; fewer might indicate that your task isn’t sufficiently defined or is too open-ended, like a general-purpose chatbot. These unit tests, or assertions, should be triggered by any changes to the pipeline, whether it’s editing a prompt, adding new context via RAG, or other modifications. This write-up has an example of an assertion-based test for an actual use case.

Consider beginning with assertions that specify phrases or ideas to either include or exclude in all responses. Also consider checks to ensure that word, item, or sentence counts lie within a range. For other kinds of generation, assertions can look different. Execution-evaluation is a powerful method for evaluating code-generation, wherein you run the generated code and determine that the state of runtime is sufficient for the user-request.

As an example, if the user asks for a new function named foo; then after executing the agent’s generated code, foo should be callable! One challenge in execution-evaluation is that the agent code frequently leaves the runtime in slightly different form than the target code. It can be effective to “relax” assertions to the absolute most weak assumptions that any viable answer would satisfy.
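
As a minimal sketch of that check, under the (relaxed) assumption that any viable answer at least defines a callable named foo:

# Sketch: execution-evaluation for code generation. Run the generated code in a
# scratch namespace and assert only the weakest property any viable answer satisfies.
def assert_defines_callable_foo(generated_code: str) -> None:
    namespace: dict = {}
    exec(generated_code, namespace)  # run the agent's code
    assert callable(namespace.get("foo")), "expected the code to define a callable `foo`"

# Example usage with a trivially passing generation:
assert_defines_callable_foo("def foo():\n    return 42")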

Finally, using your product as intended for customers (i.e., “dogfooding”) can provide insight into failure modes on real-world data. This approach not only helps identify potential weaknesses, but also provides a useful source of production samples that can be converted into evals.

LLM-as-Judge can work (somewhat), but it’s not a silver bullet

LLM-as-Judge, where we use a strong LLM to evaluate the output of other LLMs, has been met with skepticism by some. (Some of us were initially huge skeptics.) Nonetheless, when implemented well, LLM-as-Judge achieves decent correlation with human judgements, and can at least help build priors about how a new prompt or technique may perform. Specifically, when doing pairwise comparisons (e.g., control vs. treatment), LLM-as-Judge typically gets the direction right though the magnitude of the win/loss may be noisy.

Here are some suggestions to get the most out of LLM-as-Judge:

  • Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.
  • Control for position bias: The order of options presented can bias the LLM’s decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping!
  • Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn’t have to arbitrarily pick a winner.
  • Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final preference can increase eval reliability. As a bonus, this allows you to use a weaker but faster LLM and still achieve similar results. Because frequently this part of the pipeline is in batch mode, the extra latency from CoT isn’t a problem.
  • Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length.
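
Putting the pairwise and position-bias suggestions together, a hedged sketch might look like this; the judge call is a hypothetical placeholder that prompts a strong model and returns "A", "B", or "tie".

# Sketch: pairwise LLM-as-Judge with position-bias control. `judge` is a
# hypothetical placeholder that returns "A", "B", or "tie" for the pair it is shown.
from typing import Callable

def compare(control: str, treatment: str, judge: Callable[[str, str], str]) -> str:
    first = judge(control, treatment)                  # control shown as option A
    second = judge(treatment, control)                 # swap the order to control for position bias
    second = {"A": "B", "B": "A"}.get(second, second)  # re-attribute the winner after the swap
    if first == second and first != "tie":
        return first                                   # "A" = control wins, "B" = treatment wins
    return "tie"                                       # disagreement across orderings counts as a tie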

One particularly powerful application of LLM-as-Judge is checking a new prompting strategy against regression. If you have tracked a collection of production results, sometimes you can rerun those production examples with a new prompting strategy, and use LLM-as-Judge to quickly assess where the new strategy may suffer.

Here’s an example of a simple but effective approach to iterate on LLM-as-Judge, where we simply log the LLM response, judge’s critique (i.e., CoT), and final outcome. They are then reviewed with stakeholders to identify areas for improvement. Over three iterations, agreement with human and LLM improved from 68% to 94%!

LLM-as-Judge is not a silver bullet though. There are subtle aspects of language where even the strongest models fail to evaluate reliably. In addition, we’ve found that conventional classifiers and reward models can achieve higher accuracy than LLM-as-Judge, and with lower cost and latency. For code generation, LLM-as-Judge can be weaker than more direct evaluation strategies like execution-evaluation.

The “intern test” for evaluating generations

We like to use the following “intern test” when evaluating generations: If you took the exact input to the language model, including the context, and gave it to an average college student in the relevant major as a task, could they succeed? How long would it take?

If the answer is no because the LLM lacks the required knowledge, consider ways to enrich the context.

If the answer is no and we simply can’t improve the context to fix it, then we may have hit a task that’s too hard for contemporary LLMs.

If the answer is yes, but it would take a while, we can try to reduce the complexity of the task. Is it decomposable? Are there aspects of the task that can be made more templatized?

If the answer is yes, they would get it quickly, then it’s time to dig into the data. What’s the model doing wrong? Can we find a pattern of failures? Try asking the model to explain itself before or after it responds, to help you build a theory of mind.

Overemphasizing certain evals can hurt overall performance

“When a measure becomes a target, it ceases to be a good measure.”

— Goodhart’s Law

An example of this is the Needle-in-a-Haystack (NIAH) eval. The original eval helped quantify model recall as context sizes grew, as well as how recall is affected by needle position. However, it’s been so overemphasized that it’s featured as Figure 1 for Gemini 1.5’s report. The eval involves inserting a specific phrase (“The special magic {city} number is: {number}”) into a long document which repeats the essays of Paul Graham, and then prompting the model to recall the magic number.

While some models achieve near-perfect recall, it’s questionable whether NIAH truly reflects the reasoning and recall abilities needed in real-world applications. Consider a more practical scenario: Given the transcript of an hour-long meeting, can the LLM summarize the key decisions and next steps, as well as correctly attribute each item to the relevant person? This task is more realistic, going beyond rote memorization and also considering the ability to parse complex discussions, identify relevant information, and synthesize summaries.

Here’s an example of a practical NIAH eval. Using transcripts of doctor-patient video calls, the LLM is queried about the patient’s medication. It also includes a more challenging NIAH, inserting a phrase for random ingredients for pizza toppings, such as “The secret ingredients needed to build the perfect pizza are: Espresso-soaked dates, Lemon and Goat cheese.” Recall was around 80% on the medication task and 30% on the pizza task.

Tangentially, an overemphasis on NIAH evals can lead to lower performance on extraction and summarization tasks. Because these LLMs are so finetuned to attend to every sentence, they may start to treat irrelevant details and distractors as important, thus including them in the final output (when they shouldn’t!)

This could also apply to other evals and use cases. For example, summarization. An emphasis on factual consistency could lead to summaries that are less specific (and thus less likely to be factually inconsistent) and possibly less relevant. Conversely, an emphasis on writing style and eloquence could lead to more flowery, marketing-type language that could introduce factual inconsistencies.

Simplify annotation to binary tasks or pairwise comparisons

Providing open-ended feedback or ratings for model output on a Likert scale is cognitively demanding. As a result, the data collected is more noisy—due to variability among human raters—and thus less useful. A more effective approach is to simplify the task and reduce the cognitive burden on annotators. Two tasks that work well are binary classifications and pairwise comparisons.

In binary classifications, annotators are asked to make a simple yes-or-no judgment on the model’s output. They might be asked whether the generated summary is factually consistent with the source document, or whether the proposed response is relevant, or if it contains toxicity. Compared to the Likert scale, binary decisions are more precise, have higher consistency among raters, and lead to higher throughput. This was how Doordash set up their labeling queues for tagging menu items through a tree of yes-no questions.

In pairwise comparisons, the annotator is presented with a pair of model responses and asked which is better. Because it’s easier for humans to say “A is better than B” than to assign an individual score to either A or B individually, this leads to faster and more reliable annotations (over Likert scales). At a Llama2 meetup, Thomas Scialom, an author on the Llama2 paper, confirmed that pairwise-comparisons were faster and cheaper than collecting supervised finetuning data such as written responses. The former’s cost is $3.5 per unit while the latter’s cost is $25 per unit.

If you’re starting to write labeling guidelines, here are some reference guidelines from Google and Bing Search.

(Reference-free) evals and guardrails can be used interchangeably

Guardrails help to catch inappropriate or harmful content while evals help to measure the quality and accuracy of the model’s output. In the case of reference-free evals, they may be considered two sides of the same coin. Reference-free evals are evaluations that don’t rely on a “golden” reference, such as a human-written answer, and can assess the quality of output based solely on the input prompt and the model’s response.

Some examples of these are summarization evals, where we only have to consider the input document to evaluate the summary on factual consistency and relevance. If the summary scores poorly on these metrics, we can choose not to display it to the user, effectively using the eval as a guardrail. Similarly, reference-free translation evals can assess the quality of a translation without needing a human-translated reference, again allowing us to use it as a guardrail.

LLMs will return output even when they shouldn’t

A key challenge when working with LLMs is that they’ll often generate output even when they shouldn’t. This can lead to harmless but nonsensical responses, or more egregious defects like toxicity or dangerous content. For example, when asked to extract specific attributes or metadata from a document, an LLM may confidently return values even when those values don’t actually exist. Alternatively, the model may respond in a language other than English because we provided non-English documents in the context.

While we can try to prompt the LLM to return a “not applicable” or “unknown” response, it’s not foolproof. Even when the log probabilities are available, they’re a poor indicator of output quality. While log probs indicate the likelihood of a token appearing in the output, they don’t necessarily reflect the correctness of the generated text. On the contrary, for instruction-tuned models that are trained to respond to queries and generate coherent responses, log probabilities may not be well-calibrated. Thus, while a high log probability may indicate that the output is fluent and coherent, it doesn’t mean it’s accurate or relevant.

While careful prompt engineering can help to some extent, we should complement it with robust guardrails that detect and filter/regenerate undesired output. For example, OpenAI provides a content moderation API that can identify unsafe responses such as hate speech, self-harm, or sexual output. Similarly, there are numerous packages for detecting personally identifiable information (PII). One benefit is that guardrails are largely agnostic of the use case and can thus be applied broadly to all output in a given language. In addition, with precise retrieval, our system can deterministically respond “I don’t know” if there are no relevant documents.
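
For instance, a minimal guardrail over the moderation endpoint might look like the sketch below (assuming the openai package; the fallback behavior is an illustrative choice, not the only option):

# Sketch: screen LLM output with OpenAI's content moderation API before serving it.
# Assumes the `openai` package; the fallback message is an illustrative choice.
from openai import OpenAI

client = OpenAI()

def guardrail_output(output: str) -> str:
    moderation = client.moderations.create(input=output)
    if moderation.results[0].flagged:
        return "Sorry, I can't help with that."  # or filter/regenerate instead of refusing
    return output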

A corollary here is that LLMs may fail to produce outputs when they are expected to. This can happen for various reasons, from straightforward issues like long tail latencies from API providers to more complex ones such as outputs being blocked by content moderation filters. As such, it’s important to consistently log inputs and (potentially a lack of) outputs for debugging and monitoring.

Hallucinations are a stubborn problem.

Unlike content safety or PII defects which have a lot of attention and thus seldom occur, factual inconsistencies are stubbornly persistent and more challenging to detect. They’re more common and occur at a baseline rate of 5 – 10%, and from what we’ve learned from LLM providers, it can be challenging to get it below 2%, even on simple tasks such as summarization.

To address this, we can combine prompt engineering (upstream of generation) and factual inconsistency guardrails (downstream of generation). For prompt engineering, techniques like CoT help reduce hallucination by getting the LLM to explain its reasoning before finally returning the output. Then, we can apply a factual inconsistency guardrail to assess the factuality of summaries and filter or regenerate hallucinations. In some cases, hallucinations can be deterministically detected. When using resources from RAG retrieval, if the output is structured and identifies what the resources are, you should be able to manually verify they’re sourced from the input context.

About the authors

Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He’s currently a Senior Applied Scientist at Amazon where he builds RecSys serving millions of customers worldwide (RecSys 2022 keynote) and applies LLMs to serve customers better (AI Eng Summit 2023 keynote). Previously, he led machine learning at Lazada (acquired by Alibaba) and a Healthtech Series A. He writes & speaks about ML, RecSys, LLMs, and engineering at eugeneyan.com and ApplyingML.com.

Bryan Bischof is the Head of AI at Hex, where he leads the team of engineers building Magic—the data science and analytics copilot. Bryan has worked all over the data stack leading teams in analytics, machine learning engineering, data platform engineering, and AI engineering. He started the data team at Blue Bottle Coffee, led several projects at Stitch Fix, and built the data teams at Weights and Biases. Bryan previously co-authored the book Building Production Recommendation Systems with O’Reilly, and teaches Data Science and Analytics in the graduate school at Rutgers. His Ph.D. is in pure mathematics.

Charles Frye teaches people to build AI applications. After publishing research in psychopharmacology and neurobiology, he got his Ph.D. at the University of California, Berkeley, for dissertation work on neural network optimization. He has taught thousands the entire stack of AI application development, from linear algebra fundamentals to GPU arcana and building defensible businesses, through educational and consulting work at Weights and Biases, Full Stack Deep Learning, and Modal.

Hamel Husain is a machine learning engineer with over 25 years of experience. He has worked with innovative companies such as Airbnb and GitHub, which included early LLM research used by OpenAI for code understanding. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies operationalize Large Language Models (LLMs) to accelerate their AI product journey.

Jason Liu is a distinguished machine learning consultant known for leading teams to successfully ship AI products. Jason’s technical expertise covers personalization algorithms, search optimization, synthetic data generation, and MLOps systems. His experience includes companies like Stitchfix, where he created a recommendation framework and observability tools that handled 350 million daily requests. Additional roles have included Meta, NYU, and startups such as Limitless AI and Trunk Tools.

Shreya Shankar is an ML engineer and PhD student in computer science at UC Berkeley. She was the first ML engineer at 2 startups, building AI-powered products from scratch that serve thousands of users daily. As a researcher, her work focuses on addressing data challenges in production ML systems through a human-centered approach. Her work has appeared in top data management and human-computer interaction venues like VLDB, SIGMOD, CIDR, and CSCW.

Contact Us

We would love to hear your thoughts on this post. You can contact us at contact@applied-llms.org. Many of us are open to various forms of consulting and advisory. We will route you to the correct expert(s) upon contact with us if appropriate.

Acknowledgements

This series started as a conversation in a group chat, where Bryan quipped that he was inspired to write “A Year of AI Engineering.” Then, ✨magic✨ happened in the group chat, and we were all inspired to chip in and share what we’ve learned so far.

The authors would like to thank Eugene for leading the bulk of the document integration and overall structure in addition to a large proportion of the lessons, as well as for primary editing responsibilities and document direction. The authors would like to thank Bryan for the spark that led to this writeup, restructuring the write-up into tactical, operational, and strategic sections and their intros, and for pushing us to think bigger on how we could reach and help the community. The authors would like to thank Charles for his deep dives on cost and LLMOps, as well as weaving the lessons to make them more coherent and tighter—you have him to thank for this being 30 instead of 40 pages! The authors appreciate Hamel and Jason for their insights from advising clients and being on the front lines, for their broad generalizable learnings from clients, and for deep knowledge of tools. And finally, thank you Shreya for reminding us of the importance of evals and rigorous production practices and for bringing her research and original results to this piece.

Finally, the authors would like to thank all the teams who so generously shared your challenges and lessons in your own write-ups which we’ve referenced throughout this series, along with the AI communities for your vibrant participation and engagement with this group.
