
Creating a LLM-as-a-Judge That Drives Business Results

LLMs
evals
A step-by-step guide with my learnings from 30+ AI implementations.
Author

Hamel Husain

Published

October 29, 2024

Earlier this year, I wrote Your AI product needs evals. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over 30 companies set up their evaluation systems.

The Problem: AI Teams Are Drowning in Data

Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs:

  1. Too Many Metrics: Creating numerous measurements that become unmanageable.
  2. Arbitrary Scoring Systems: Using uncalibrated scales (like 1-5) across multiple dimensions, where the difference between scores is unclear and subjective. What makes something a 3 versus a 4? Nobody knows, and different evaluators often interpret these scales differently.
  3. Ignoring Domain Experts: Not involving the people who understand the subject matter deeply.
  4. Unvalidated Metrics: Using measurements that don’t truly reflect what matters to the users or the business.

The result? Teams end up buried under mountains of metrics or data they don’t trust and can’t use. Progress grinds to a halt. Everyone gets frustrated.

For example, it’s not uncommon for me to see dashboards that look like this:

An illustrative example of a bad eval dashboard

Tracking a bunch of scores on a 1-5 scale is often a sign of a bad eval process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is to use a technique that I call “Critique Shadowing”. Here’s how to do it, step by step.

Step 1: Find The Principal Domain Expert

In most organizations, there are usually one or two key individuals whose judgment is crucial to the success of your AI product. These are the people with deep domain expertise, or who represent your target users. Identifying and involving this Principal Domain Expert early in the process is critical.

Why is finding the right domain expert so important?

  • They Set the Standard: This person not only defines what is acceptable technically, but also helps you understand if you’re building something users actually want.

  • Capture Unspoken Expectations: By involving them, you uncover their preferences and expectations, which they might not be able to fully articulate upfront. Through the evaluation process, you help them clarify what a “passable” AI interaction looks like.

  • Consistency in Judgment: People in your organization may have different opinions about the AI’s performance. Focusing on the principal expert ensures that evaluations are consistent and aligned with the most critical standards.

  • Sense of Ownership: Involving the expert gives them a stake in the AI’s development. They feel invested because they’ve had a hand in shaping it. In the end, they are more likely to approve of the AI.

Examples of Principal Domain Experts:

  • A psychologist for a mental health AI assistant.
  • A lawyer for an AI that analyzes legal documents.
  • A customer service director for a support chatbot.
  • A lead teacher or curriculum developer for an educational AI tool.

Exceptions

In a smaller company, this might be the CEO or founder. If you are an independent developer, you should be the domain expert (but be honest with yourself about your expertise).

If you must rely on leadership, you should regularly validate their assumptions against real user feedback.

Many developers attempt to act as the domain expert themselves, or find a convenient proxy (ex: their superior). This is a recipe for disaster. People will have varying opinions about what is acceptable, and you can’t make everyone happy. What’s important is that your principal domain expert is satisfied.

Remember: This doesn’t have to take a lot of the domain expert’s time. Later in this post, I’ll discuss how you can make the process efficient. Their involvement is absolutely critical to the AI’s success.

Next Steps

Once you’ve found your expert, we need to give them the right data to review. Let’s talk about how to do that next.

Step 2: Create a Dataset

With your principal domain expert on board, the next step is to build a dataset that captures problems that your AI will encounter. It’s important that the dataset is diverse and represents the types of interactions that your AI will have in production.

Why a Diverse Dataset Matters

  • Comprehensive Testing: Ensures your AI is evaluated across a wide range of situations.
  • Realistic Interactions: Reflects actual user behavior for more relevant evaluations.
  • Identifies Weaknesses: Helps uncover areas where the AI may struggle or produce errors.

Dimensions for Structuring Your Dataset

You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:

  1. Features: Specific functionalities of your AI product.
  2. Scenarios: Situations or problems the AI may encounter and needs to handle.
  3. Personas: Representative user profiles with distinct characteristics and needs.

Examples of Features, Scenarios, and Personas

Features

| Feature | Description |
|---|---|
| Email Summarization | Condensing lengthy emails into key points. |
| Meeting Scheduler | Automating the scheduling of meetings across time zones. |
| Order Tracking | Providing shipment status and delivery updates. |
| Contact Search | Finding and retrieving contact information from a database. |
| Language Translation | Translating text between languages. |
| Content Recommendation | Suggesting articles or products based on user interests. |

Scenarios

Scenarios are situations the AI needs to handle (not based on the outcome of the AI’s response).

| Scenario | Description |
|---|---|
| Multiple Matches Found | User’s request yields multiple results that need narrowing down. For example: User asks “Where’s my order?” but has three active orders (#123, #124, #125). AI must help identify which specific order they’re asking about. |
| No Matches Found | User’s request yields no results, requiring alternatives or corrections. For example: User searches for order #ABC-123 which doesn’t exist. AI should explain valid order formats and suggest checking their confirmation email. |
| Ambiguous Request | User input lacks necessary specificity. For example: User says “I need to change my delivery” without specifying which order or what aspect of delivery (date, address, etc.) they want to change. |
| Invalid Data Provided | User provides incorrect data type or format. For example: User tries to track a return using a regular order number instead of a return authorization (RMA) number. |
| System Errors | Technical issues prevent normal operation. For example: While looking up an order, the inventory database is temporarily unavailable. AI needs to explain the situation and provide alternatives. |
| Incomplete Information | User omits required details. For example: User wants to initiate a return but hasn’t provided the order number or reason. AI needs to collect this information step by step. |
| Unsupported Feature | User requests functionality that doesn’t exist. For example: User asks to change payment method after order has shipped. AI must explain why this isn’t possible and suggest alternatives. |

Personas

| Persona | Description |
|---|---|
| New User | Unfamiliar with the system; requires guidance. |
| Expert User | Experienced; expects efficiency and advanced features. |
| Non-Native Speaker | May have language barriers; uses non-standard expressions. |
| Busy Professional | Values quick, concise responses; often multitasking. |
| Technophobe | Uncomfortable with technology; needs simple instructions. |
| Elderly User | May not be tech-savvy; requires patience and clear guidance. |

This taxonomy is not universal

This taxonomy (features, scenarios, personas) is not universal. For example, it may not make sense to even have personas if users aren’t directly engaging with your AI. The idea is you should outline dimensions that make sense for your use case and generate data that covers them. You’ll likely refine these after the first round of evaluations.

Generating Data

To build your dataset, you can:

  • Use Existing Data: Sample real user interactions or behaviors from your AI system.
  • Generate Synthetic Data: Use LLMs to create realistic user inputs covering various features, scenarios, and personas.

Often, you’ll do a combination of both to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.

Regardless of whether you use existing data or synthetic data, you want good coverage across the dimensions you’ve defined.
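To make coverage concrete, here is a minimal sketch (in Python) of enumerating every feature × scenario × persona combination as a slot you want at least one example for. The dimension values are the illustrative ones from the tables above; substitute your own.

```python
# A minimal sketch of planning coverage across the dimensions you've defined.
# The dimension values are illustrative; replace them with your own.
from itertools import product

features = ["Email Summarization", "Meeting Scheduler", "Order Tracking",
            "Contact Search", "Language Translation", "Content Recommendation"]
scenarios = ["Multiple Matches Found", "No Matches Found", "Ambiguous Request",
             "Invalid Data Provided", "System Errors", "Incomplete Information",
             "Unsupported Feature"]
personas = ["New User", "Expert User", "Non-Native Speaker",
            "Busy Professional", "Technophobe", "Elderly User"]

# Each (feature, scenario, persona) combination is a slot to fill with at
# least one real or synthetic example.
coverage_plan = [
    {"feature": f, "scenario": s, "persona": p}
    for f, s, p in product(features, scenarios, personas)
]
print(f"{len(coverage_plan)} combinations to cover")  # 6 * 7 * 6 = 252
```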

Incorporating System Information

When making test data, use your APIs and databases where appropriate. This will create realistic data and trigger the right scenarios. Sometimes you’ll need to write simple programs to get this information. That’s what the “Assumptions” column is referring to in the examples below.

Example LLM Prompts for Generating User Inputs

Here are some example prompts that illustrate how to use an LLM to generate synthetic user inputs for different combinations of features, scenarios, and personas:

| ID | Feature | Scenario | Persona | LLM Prompt to Generate User Input | Assumptions (not directly in the prompt) |
|---|---|---|---|---|---|
| 1 | Order Tracking | Invalid Data Provided | Frustrated Customer | “Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number #1234567890. Include hints of previous negative experiences.” | Order number #1234567890 does not exist in the system. |
| 2 | Contact Search | Multiple Matches Found | New User | “Create a user input from someone who seems unfamiliar with the system, using hesitant language and asking for help to find contact information for a person named ‘Alex’. The user should appear unsure about what information is needed.” | Multiple contacts named ‘Alex’ exist in the system. |
| 3 | Meeting Scheduler | Ambiguous Request | Busy Professional | “Simulate a user input from someone who is clearly in a hurry, using abbreviated language and minimal details to request scheduling a meeting. The message should feel rushed and lack specific information.” | N/A |
| 4 | Content Recommendation | No Matches Found | Expert User | “Produce a user input from someone who demonstrates in-depth knowledge of their industry, using specific terminology to request articles on sustainable supply chain management. Use the information in this article involving sustainable supply chain management to formulate a plausible query: {{article}}” | No articles on ‘Emerging trends in sustainable supply chain management’ exist in the system. |

Generating Synthetic Data

When generating synthetic data, you only need to create the user inputs. You then feed these inputs into your AI system to generate the AI’s responses. It’s important that you log everything so you can evaluate your AI. To recap, here’s the process (a minimal sketch follows the list):

  1. Generate User Inputs: Use the LLM prompts to create realistic user inputs.
  2. Feed Inputs into Your AI System: Input the user interactions into your AI as it currently exists.
  3. Capture AI Responses: Record the AI’s responses to form complete interactions.
  4. Organize the Interactions: Create a table to store the user inputs, AI responses, and relevant metadata.
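Here is a hedged sketch of that four-step loop. The `call_llm` and `run_ai_system` functions are hypothetical stand-ins for your LLM client and your actual AI application; the point is only to show the shape of the pipeline and that everything gets logged.

```python
# A sketch of the generate -> feed -> capture -> organize loop.
# `call_llm` and `run_ai_system` are hypothetical stand-ins for your own
# LLM client and AI application.
import json

PROMPT_TEMPLATE = (
    "Generate a realistic user input for a {feature} feature. "
    "The situation is: {scenario}. The user is: {persona}."
)

def call_llm(prompt: str) -> str:
    """Call your LLM provider here and return the generated text."""
    raise NotImplementedError

def run_ai_system(user_input: str) -> dict:
    """Run your actual AI system and return its response plus any trace data."""
    raise NotImplementedError

def generate_interactions(coverage_plan, path="interactions.jsonl"):
    with open(path, "w") as f:
        for slot in coverage_plan:
            # 1. Generate a synthetic user input for this combination.
            user_input = call_llm(PROMPT_TEMPLATE.format(**slot))
            # 2. & 3. Feed it into the AI system and capture the response.
            ai_output = run_ai_system(user_input)
            # 4. Log everything (dimensions, input, output) for later review.
            record = {**slot, "user_input": user_input, "ai_output": ai_output}
            f.write(json.dumps(record) + "\n")
```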

How much data should you generate?

There is no right answer here. At a minimum, you want to generate enough data so that you have examples for each combination of dimensions (in this toy example: features, scenarios, and personas). However, you also want to keep generating more data until you feel like you have stopped seeing new failure modes. The amount of data I generate varies significantly depending on the use case.

Does synthetic data actually work?

You might be skeptical of using synthetic data. After all, it’s not real data, so how can it be a good proxy? In my experience, it works surprisingly well. Some of my favorite AI products, like Hex, use synthetic data to power their evals:

“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.” Bryan Bischof, Head of AI Engineering at Hex

Next Steps

With your dataset ready, now comes the most important part: getting your principal domain expert to evaluate the interactions.

Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques

The domain expert’s job is to focus on one thing: “Did the AI achieve the desired outcome?” No complex scoring scales or multiple metrics. Just a clear pass or fail decision. In addition to the pass/fail decision, the domain expert should write a critique that explains their reasoning.

Why are simple pass/fail metrics important?

  • Clarity and Focus: A binary decision forces everyone to consider what truly matters. It simplifies the evaluation to a single, crucial question.

  • Actionable Insights: Pass/fail judgments are easy to interpret and act upon. They help you quickly identify whether the AI meets the user’s needs.

  • Forces Articulation of Expectations: When domain experts must decide if an interaction passes or fails, they are compelled to articulate their expectations clearly. This process uncovers nuances and unspoken assumptions about how the AI should behave.

  • Efficient Use of Resources: Keeps the evaluation process manageable, especially when starting out. You avoid getting bogged down in detailed metrics that might not be meaningful yet.

The Role of Critiques

Alongside a binary pass/fail judgment, it’s important to write a detailed critique of the LLM-generated output. These critiques:

  • Capture Nuances: The critique allows you to note if something was mostly correct but had areas for improvement.

  • Guide Improvement: Detailed feedback provides specific insights into how the AI can be enhanced.

  • Balance Simplicity with Depth: While the pass/fail offers a clear verdict, the critique offers the depth needed to understand the reasoning behind the judgment.

Why Write Critiques?

In practice, domain experts may not have fully internalized all the judgment criteria. By forcing them to make a pass/fail decision and explain their reasoning, they clarify their expectations and provide valuable guidance for refining the AI.

“But my problem is complex!” Trust me—starting simple forces you to focus on what truly matters. You can introduce more complexity later if needed.

Examples of Good Critiques

To illustrate how simple pass/fail judgments combined with detailed critiques work in practice, here are examples of user interactions with an AI assistant. They include both passes and fails, with critiques explaining why the AI received that judgment. In cases where the AI passed despite critical issues, the critique highlights these aspects and justifies why it still passed overall. For failed interactions, the critique explains the critical elements that led to the failure.

User: “I need to cancel my flight booking for next week.”
AI: “Your flight booking for next week has been canceled. A confirmation email has been sent to you.”
Judgment: Pass
Critique: The AI successfully executed the user’s request by canceling the flight and sending a confirmation, achieving the primary goal. However, it should have asked for confirmation before proceeding to prevent accidental cancellations. Despite this critical oversight, the AI fulfilled the user’s request, so it passes, but adding a confirmation step would improve user safety.

User: “What’s the capital of France?”
AI: “The capital of France is Paris.”
Judgment: Pass
Critique: The AI correctly answered the user’s question, providing accurate information promptly. Although the response is sufficient, the AI could enhance the interaction by offering additional interesting facts or asking if the user has more questions. Despite this opportunity for added engagement, the AI met the user’s primary need, so it passes.

User: “I can’t log into my account; it says my password is incorrect.”
AI: “Please reset your password by clicking ‘Forgot Password’.”
Judgment: Fail
Critique: The AI provided a generic solution without acknowledging the user’s frustration or checking for other issues like account lockout. It failed to offer empathetic support or additional assistance. Due to the lack of personalized help and poor user experience, it fails.

User: “Book a table for two at an Italian restaurant tonight at 7 PM.”
AI: “Table for two at Luigi’s Italian Restaurant booked for tonight at 7 PM. Confirmation number is 12345.”
Judgment: Pass
Critique: The AI efficiently completed the booking as requested, providing all necessary details and a confirmation number. However, it could have asked if the user had seating preferences or dietary restrictions. The AI accomplished the primary task successfully, so it passes, with minor opportunities for enhanced service.

User: “I need help with my billing statement.”
AI: “Please provide your account number and full name.”
Judgment: Fail
Critique: The AI requested sensitive information without verifying the user’s identity through secure means, potentially compromising security. It should guide the user through a secure authentication process first. Due to this critical oversight in user data protection, it fails.

These examples demonstrate how the AI can receive both “Pass” and “Fail” judgments. In the critiques:

  • For passes, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.

  • For fails, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors like user experience or security.

Most importantly, the critique should be detailed enough so that you can use it in a few-shot prompt for a LLM judge. In other words, it should be detailed enough that a new employee could understand it. Being too terse is a common mistake.

Note that the example user interactions with the AI are simplified for brevity - but you might need to give the domain expert more context to make a judgement. More on that later.

Note

At this point, you don’t need to perform a root cause analysis into the technical reasons behind why the AI failed. Many times, it’s useful to get a sense of overall behavior before diving into the weeds.

Don’t stray from binary pass/fail judgments when starting out

A common mistake is straying from binary pass/fail judgments. Let’s revisit the dashboard from earlier:

If your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale (or any other scale), you’re doing it wrong. Let’s unpack why.

  1. It’s not actionable: People don’t know what to do with a 3 or 4. It’s not immediately obvious how this number is better than a 2. You need to be able to say “this interaction passed because…” and “this interaction failed because…”.
  2. More often than not, these metrics do not matter. Every time I’ve analyzed data on domain expert judgments, they tend not to correlate with these kinds of metrics. By having a domain expert make a binary judgment, you can figure out what truly matters.

This is why I hate off-the-shelf metrics that come with many evaluation frameworks. They tend to lead people astray.

Common Objections to Pass/Fail Judgments:

  • “The business said that these 8 dimensions are important, so we need to evaluate all of them.”
  • “We need to be able to say why an interaction passed or failed.”

I can guarantee you that if someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for. They are just guessing. You have to let the domain expert drive and make a pass/fail judgment with critiques so you can figure out what truly matters. Stand your ground here.

Make it easy for the domain expert to review data

Finally, you need to remove all friction from reviewing data. I’ve written about this here. Sometimes, you can just use a spreadsheet. It’s a judgement call in terms of what is easiest for the domain expert. I found that I often have to provide additional context to help the domain expert understand the user interaction, such as:

  • Metadata about the user, such as their location, subscription tier, etc.
  • Additional context about the system, such as the current time, inventory levels, etc.
  • Resources so you can check if the AI’s response is correct (ex: ability to search a database, etc.)

All of this data needs to be presented on a single screen so the domain expert can review it without jumping through hoops. That’s why I recommend building a simple web app to review data.
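As a rough sketch of what “one screen” can mean in practice, here is how you might flatten each logged interaction plus its context into a single row for a spreadsheet or a simple review app. The field names are illustrative, not a required schema.

```python
# A rough sketch: one flat row per interaction, with user metadata and system
# context inlined, plus empty judgment/critique columns for the expert to fill in.
# Field names are illustrative, not a required schema.
import csv
import json

def build_review_sheet(interactions_path, out_path="review.csv"):
    fields = ["user_input", "ai_response", "user_metadata", "system_context",
              "judgment", "critique"]
    with open(interactions_path) as f, open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for line in f:
            rec = json.loads(line)
            writer.writerow({
                "user_input": rec["user_input"],
                "ai_response": rec.get("ai_output", {}).get("final_response", ""),
                "user_metadata": json.dumps(rec.get("user_metadata", {})),
                "system_context": json.dumps(rec.get("system_context", {})),
                "judgment": "",  # expert fills in Pass / Fail
                "critique": "",  # expert explains their reasoning
            })
```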

How many examples do you need?

The number of examples you need depends on the complexity of the task. My heuristic is that I start with around 30 examples and keep going until I do not see any new failure modes. From there, I keep going until I’m not learning anything new.

Next, we’ll look at how to use this data to build an LLM judge.

Step 4: Fix Errors

After looking at the data, it’s likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it’s totally fine if you find them earlier!

If you have already developed Level 1 evals as outlined in my previous post, you should not have any pervasive errors. However, these errors can sometimes slip through the cracks. If you find pervasive errors, fix them and go back to step 3. Keep iterating until you feel like you have stabilized your system.

Step 5: Build Your LLM as A Judge, Iteratively

The Hidden Power of Critiques

You cannot write a good judge prompt until you’ve seen the data. The paper from Shankar et al., “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences” summarizes this well:

to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.

Start with Expert Examples

Let me share a real-world example of building an LLM judge you can apply to your own use case. When I was helping Honeycomb build their Query Assistant feature, we needed a way to evaluate if the AI was generating good queries. Here’s what our LLM judge prompt looked like, including few-shot examples of critiques from our domain expert, Phillip:

You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
You understand the nuances of the Honeycomb query language, including what is likely to be
most useful from an analytics perspective. 

Here is information about the Honeycomb query language:
{{query_language_info}}

Here are some guidelines for evaluating queries:
{{guidelines}}

Example evaluations:

<examples>

<example-1>
<nlq>show me traces where ip is 10.0.2.90</nlq>
<query>
{
  "breakdowns": ["trace.trace_id"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "net.host.ip", "op": "=", "value": "10.0.2.90"}]
}
</query>
<critique>
{
  "critique": "The query correctly filters for traces with an IP address of 10.0.2.90 
   and counts the occurrences of those traces, grouped by trace.trace_id. The response 
   is good as it meets the requirement of showing traces from a specific IP address 
   without additional complexities.",
  "outcome": "good"
}
</critique>
</example-1>

<example-2>
<nlq>show me slowest trace</nlq>
<query>
{
  "calculations": [{"column": "duration_ms", "op": "MAX"}],
  "orders": [{"column": "duration_ms", "op": "MAX", "order": "descending"}],
  "limit": 1
}
</query>
<critique>
{
  "critique": "While the query attempts to find the slowest trace using MAX(duration_ms) 
   and ordering correctly, it fails to group by trace.trace_id. Without this grouping, 
   the query only shows the MAX(duration_ms) measurement over time, not the actual 
   slowest trace.",
  "outcome": "bad"
}
</critique>
</example-2>

<example-3>
<nlq>count window-hash where window-hash exists per hour</nlq>
<query>
{
  "breakdowns": ["window-hash"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "window-hash", "op": "exists"}],
  "time_range": 3600
}
</query>
<critique>
{
  "critique": "While the query correctly counts window-hash occurrences, the time_range 
   of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say 'per hour', 
   we need a time_range of at least 36000 seconds to show meaningful hourly patterns.",
  "outcome": "bad"
}
</critique>
</example-3>

</examples>

For the following query, first write a detailed critique explaining your reasoning,
then provide a pass/fail judgment in the same format as above.

<nlq>{{user_input}}</nlq>
<query>
{{generated_query}}
</query>
<critique>

Notice how each example includes:

  1. The natural language query (NLQ) in <nlq> tags
  2. The generated query in <query> tags
  3. The critique and outcome in <critique> tags

In the prompt above, the example critiques are fixed. An advanced approach is to include examples dynamically based upon the item you are judging. You can learn more in this post about Continual In-Context Learning.
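For completeness, here is a hedged sketch of running a judge prompt like the one above against new interactions. It assumes the OpenAI Python client and the `gpt-4o` model name purely as an illustration; the `{{user_input}}` and `{{generated_query}}` placeholders are filled by string replacement, and the model’s completion of the open `<critique>` tag is parsed as JSON.

```python
# A hedged sketch of invoking the judge prompt above and parsing its output.
# Assumes the OpenAI Python client and "gpt-4o" purely as an illustration;
# swap in whatever provider and model you actually use.
import json
import re
from openai import OpenAI

client = OpenAI()

def judge_interaction(judge_prompt_template: str, nlq: str, generated_query: str) -> dict:
    prompt = (judge_prompt_template
              .replace("{{user_input}}", nlq)
              .replace("{{generated_query}}", generated_query))
    resp = client.chat.completions.create(
        model="gpt-4o",  # use the most capable model your budget allows
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # The judge completes the open <critique> tag with a JSON object like
    # {"critique": "...", "outcome": "good"} -- pull that object out.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0))
```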

Keep Iterating on the Prompt Until Convergence With Domain Expert

In this case, I used a low-tech approach to iterate on the prompt. I sent Phillip a spreadsheet with the following information:

  1. The NLQ
  2. The generated query
  3. The critique
  4. The outcome (pass or fail)

Phillip would then fill out his own version of the spreadsheet with his critiques. I used this to iteratively improve the prompt. The spreadsheet looked like this:

I also tracked agreement rates over time to ensure we were converging on a good prompt.

It took us only three iterations to achieve > 90% agreement between the LLM and Phillip. Your mileage may vary depending on the complexity of the task. For example, Swyx has conducted a similar process hundreds of times for AI News, an extremely popular news aggregator with high-quality recommendations. The quality of the AI that results from this process is why the product has received critical acclaim.
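The agreement calculation itself is trivial; a small sketch like the one below (with made-up outcomes) is all you need to track it per iteration. If you want something fancier, you can also report chance-corrected agreement such as Cohen’s kappa, but raw agreement plus a per-class breakdown usually tells the story.

```python
# A small sketch of tracking expert-vs-judge agreement per prompt iteration.
# The outcome lists below are made up for illustration.
def agreement_rate(expert_outcomes, judge_outcomes):
    assert len(expert_outcomes) == len(judge_outcomes)
    matches = sum(e == j for e, j in zip(expert_outcomes, judge_outcomes))
    return matches / len(expert_outcomes)

expert = ["good", "bad", "good", "good", "bad"]
judge  = ["good", "bad", "bad",  "good", "bad"]
print(f"Agreement: {agreement_rate(expert, judge):.0%}")  # Agreement: 80%
```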

How to Optimize the LLM Judge Prompt?

I usually adjust the prompts by hand. I haven’t had much luck with prompt optimizers like DSPy. However, my friend Eugene Yan has just released a promising tool named ALIGN Eval. I like it because it’s simple and effective. Also, don’t forget the approach of continual in-context learning mentioned earlier - it can be effective when implemented correctly.

In rare cases, I might fine-tune a judge, but I prefer not to. I talk about this more in the FAQ section.

The Human Side of the Process

Something unexpected happened during this process. Phillip Carter, our domain expert at Honeycomb, found that reviewing the LLM’s critiques helped him articulate his own evaluation criteria more clearly. He said,

“Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.”

This is a pattern I’ve seen repeatedly—the process of building an LLM judge often helps standardize evaluation criteria.

Furthermore, because this process forces the domain expert to look at data carefully, I always uncover new insights about the product, AI capabilities, and user needs. The resulting benefits are often more valuable than creating a LLM judge!

How Often Should You Evaluate?

I conduct this human review at regular intervals and whenever something material changes. For example, if I update a model, I’ll run the process again. I don’t get too scientific here; instead, I rely on my best judgment. Also note that after the first two iterations, I tend to focus more on errors rather than sampling randomly. For example, if I find an error, I’ll search for more examples that I think might trigger the same error. However, I always do a bit of random sampling as well.

What if this doesn’t work?

I’ve seen this process fail when:

  • The AI is overscoped: Example - a chatbot in a SaaS product that promises to do anything you want.
  • The process is not followed correctly: Not using the principal domain expert, not writing proper critiques, etc.
  • The expectations of alignment are unrealistic or not feasible.

In each of these cases, I try to address the root cause instead of trying to force alignment. Sometimes, you may not be able to achieve the alignment you want and may have to lean heavier on human annotations. However, after following the process described here, you will have metrics that help you understand how much you can trust the LLM judge.

Mistakes I’ve noticed in LLM judge prompts

Most of the mistakes I’ve seen in LLM judge prompts have to do with not providing good examples:

  1. Not providing any critiques.
  2. Writing extremely terse critiques.
  3. Not providing external context. Your examples should contain the same information you use to evaluate, including external information like user metadata, system information, etc.
  4. Not providing diverse examples. You need a wide variety of examples to ensure that your judge works for a wide variety of inputs.

Sometimes, you may encounter difficulties with fitting everything you need into the prompt, and may have to get creative about how you structure the examples. However, this is becoming less of an issue thanks to expanding context windows and prompt caching.

Step 6: Perform Error Analysis

After you have created a LLM as a judge, you will have a dataset of user interactions with the AI, and the LLM’s judgments. If your metrics show an acceptable agreement between the domain expert and the LLM judge, you can apply this judge against real or synthetic interactions. After this, you can calculate error rates for different dimensions of your data. You should calculate the error on unseen data only, to make sure you aren’t getting biased results.

For example, if you have segmented your data by persona, scenario, feature, etc., your data analysis may look like this:

Error Rates by Key Dimensions

| Feature | Scenario | Persona | Total Examples | Failure Rate |
|---|---|---|---|---|
| Order Tracking | Multiple Matches | New User | 42 | 24.3% |
| Order Tracking | Multiple Matches | Expert User | 38 | 18.4% |
| Order Tracking | No Matches | Expert User | 30 | 23.3% |
| Order Tracking | No Matches | New User | 20 | 75.0% |
| Contact Search | Multiple Matches | New User | 35 | 22.9% |
| Contact Search | Multiple Matches | Expert User | 32 | 19.7% |
| Contact Search | No Matches | New User | 25 | 68.0% |
| Contact Search | No Matches | Expert User | 28 | 21.4% |
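If your judged interactions are logged with their dimensions, a table like the one above is a short groupby away. Here is a sketch using pandas; the file and column names are illustrative.

```python
# A sketch of computing failure rates by dimension with pandas.
# File and column names are illustrative.
import pandas as pd

df = pd.read_json("judged_interactions.jsonl", lines=True)
df["failed"] = df["outcome"].eq("bad")

error_rates = (
    df.groupby(["feature", "scenario", "persona"])
      .agg(total_examples=("failed", "size"), failure_rate=("failed", "mean"))
      .sort_values("failure_rate", ascending=False)
      .reset_index()
)
error_rates["failure_rate"] = (error_rates["failure_rate"] * 100).round(1)
print(error_rates.head(10))
```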

Classify Traces

Once you know where the errors are, you can perform an error analysis to get to the root cause of the errors. My favorite way is to look at examples of each type of error and classify them by hand. I recommend using a spreadsheet for this. For example, a trace for Order Tracking where there are no matches for new users might look like this:

Example Trace

In this example trace, the user provides an invalid order number. The AI correctly identifies that the order number is invalid but provides an unhelpful response. If you are not familiar with logging LLM traces, refer to my previous post on evals.

Note that this trace is formatted for readability.

{
 "user_input": "Where's my order #ABC123?",
 "function_calls": [
   {
     "name": "search_order_database",
     "args": {"order_id": "ABC123"},
     "result": {
       "status": "not_found",
       "valid_patterns": ["XXX-XXX-XXX"]
     }
   },
   {
     "name": "retrieve_context",
     "result": {
       "relevant_docs": [
         "Order numbers follow format XXX-XXX-XXX",
         "New users should check confirmation email"
       ]
     }
   }
 ],
 "llm_intermediate_steps": [
   {
     "thought": "User is new and order format is invalid",
     "action": "Generate help message with format info"
   }
 ],
 "final_response": "I cannot find that order #. Please check the number and try again."
}

In this case, you might classify the error as: Missing User Education. The system retrieved new user context and format information but failed to include it in the response, which suggests we could improve our prompt. After you have classified a number of errors, you can calculate the distribution of errors by root cause. That might look like this:

Root Cause Distribution (20 Failed Interactions)

| Root Cause | Count | Percentage |
|---|---|---|
| Missing User Education | 8 | 40% |
| Authentication/Access Issues | 6 | 30% |
| Poor Context Handling | 4 | 20% |
| Inadequate Error Messages | 2 | 10% |
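Tallying the hand-labeled root causes into a distribution like this takes only a few lines; the labels and counts below are the illustrative ones from this example.

```python
# Turning hand-labeled root causes into the distribution above.
# The labels and counts are the illustrative ones from this example.
from collections import Counter

root_causes = (["Missing User Education"] * 8
               + ["Authentication/Access Issues"] * 6
               + ["Poor Context Handling"] * 4
               + ["Inadequate Error Messages"] * 2)

counts = Counter(root_causes)
total = sum(counts.values())
for cause, n in counts.most_common():
    print(f"{cause}: {n} ({n / total:.0%})")
```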

Now you know where to focus your efforts. This doesn’t have to take an extraordinary amount of time. You can get quite far in just 15 minutes. Also, you can use a LLM to help you with this classification, but that is beyond the scope of this post (you can use a LLM to help you do anything in this post, as long as you have a process to verify the results).

An Interactive Walkthrough of Error Analysis

Error analysis has been around in Machine Learning for quite some time. This video by Andrew Ng does a great job of walking through the process interactively:

Fix Your Errors, Again

Now that you have a sense of the errors, you can go back and fix them again. Go back to step 3 and iterate until you are satisfied. Note that every time you fix an error, you should try to write a test case for it. Sometimes, this can be an assertion in your test suite, but other times you may need to create a more “specialized” LLM judge for these failures. We’ll talk about this next.
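For the “assertion in your test suite” case, here is a hedged sketch of a regression test for the “Missing User Education” failure from the trace above. `run_ai_system` is the same hypothetical stand-in for your application used earlier; fuzzier failure modes would call for a specialized judge instead.

```python
# A hedged sketch of a regression test for the "Missing User Education" failure:
# after the fix, an invalid order number should get a response that teaches the
# user the valid format. `run_ai_system` is a hypothetical stand-in for your app.
from my_app import run_ai_system  # hypothetical import

def test_invalid_order_number_explains_format():
    result = run_ai_system("Where's my order #ABC123?")
    response = result["final_response"].lower()
    # A simple code-based assertion is enough for this failure mode.
    assert "xxx-xxx-xxx" in response or "confirmation email" in response
```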

Doing this well requires data literacy

Investigating your data is much harder in practice than I made it look in this post. It requires a nose for data that only comes from practice. It also helps to have some basic familiarity with statistics and data analysis tools. My favorite post on data literacy is this one by Jason Liu and Eugene Yan.

Step 7: Create More Specialized LLM Judges (if needed)

Now that you have a sense for where the problems in your AI are, you can decide where and if to invest in more targeted LLM judges. For example, if you find that the AI has trouble with citing sources correctly, you can create a targeted eval for that. You might not even need a LLM judge for some errors (and can use a code-based assertion instead).

The key takeaway is: don’t jump directly to using specialized LLM judges until you have gone through this critique shadowing process. This will help you rationalize where to invest your time.

Recap of Critique Shadowing

Using an LLM as a judge can streamline your AI evaluation process if approached correctly. Here’s a visual illustration of the process (there is a description of the process below the diagram as well):

[Flowchart: Start → 1 Find Principal Domain Expert → 2 Create Dataset → 3 Domain Expert Reviews Data → Found Errors? → 4 Fix Errors → 5 Build LLM Judge → Test Against Domain Expert → Acceptable Agreement? → Refine Prompt → 6 Perform Error Analysis → Critical Issues Found? → 7 Fix Issues & Create Specialized Judges → Material Changes or Periodic Review?]

The Critique Shadowing process is iterative, with feedback loops. Let’s list out the steps:

  1. Find Principal Domain Expert
  2. Create A Dataset
    • Generate diverse examples covering your use cases
    • Include real or synthetic user interactions
  3. Domain Expert Reviews Data
    • Expert makes pass/fail judgments
    • Expert writes detailed critiques explaining their reasoning
  4. Fix Errors (if found)
    • Address any issues discovered during review
    • Return to expert review to verify fixes
    • Go back to step 3 if errors are found
  5. Build LLM Judge
    • Create prompt using expert examples
    • Test against expert judgments
    • Refine prompt until agreement is satisfactory
  6. Perform Error Analysis
    • Calculate error rates across different dimensions
    • Identify patterns and root causes
    • Fix errors and go back to step 3 if needed
    • Create specialized judges as needed

This process never truly ends. It repeats periodically or when material changes occur.

It’s Not The Judge That Created Value, After All

The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!

That’s right. The real business value comes from looking at your data. But hey, potato, potahto.

Do You Really Need This?

Phew, this seems like a lot of work! Do you really need this? Well, it depends. There are cases where you can take a shortcut through this process. For example, let’s say:

  1. You are an independent developer who is also a domain expert.
  2. You are working with test data that is already available (tweets, etc.).
  3. Looking at data is not costly (e.g., you can manually look at enough data in a few hours).

In this scenario, you can jump directly to something that looks like step 3 and start looking at data right away. Also, since it’s not that costly to look at data, it’s probably fine to just do error analysis without a judge (at least initially). You can incorporate what you learn directly back into your primary model right away. This example is not exhaustive, but gives you an idea of how you can adapt this process to your needs.

However, you can never completely eliminate looking at your data! This is precisely the step that most people skip. Don’t be that person.

FAQ

I received a lot of questions about this topic. Here are answers to the most common ones:

If I have a good judge LLM, isn’t that also the LLM I’d want to use?

Effective judges often use larger models or more compute (via longer prompts, chain-of-thought, etc.) than the systems they evaluate.

However, if the cost of the most powerful LLM is not prohibitive, and latency is not an issue, then you might want to consider where you invest your efforts differently. In this case, it might make sense to put more effort towards specialist LLM judges, code-based assertions, and A/B testing. However, you should still go through the process of looking at data and critiquing the LLM’s output before you adopt specialized judges.

Do you recommend fine-tuning judges?

I prefer not to fine-tune LLM judges. I’d rather spend the effort fine-tuning the actual LLM instead. However, fine-tuning guardrails or other specialized judges can be useful (especially if they are small classifiers).

As a related note, you can leverage a LLM judge to curate and transform data for fine-tuning your primary model. For example, you can use the judge to:

  • Eliminate bad examples for fine-tuning.
  • Generate higher quality outputs (by referencing the critique).
  • Simulate high quality chain-of-thought with critiques.

Using a LLM judge for enhancing fine-tuning data is even more compelling when you are trying to distill a large LLM into a smaller one. The details of fine-tuning are beyond the scope of this post. If you are interested in learning more, see these resources.
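As a hedged sketch of the first bullet above, eliminating bad examples can be as simple as filtering your logged interactions by the judge’s outcome before exporting them for fine-tuning. Field names are illustrative.

```python
# A sketch of curating fine-tuning data with the judge: keep only interactions
# the judge marked "good". Field names are illustrative.
import json

def filter_finetuning_data(judged_path, out_path="finetune.jsonl"):
    kept = dropped = 0
    with open(judged_path) as f, open(out_path, "w") as out:
        for line in f:
            rec = json.loads(line)
            if rec["judge"]["outcome"] == "good":
                out.write(json.dumps({
                    "prompt": rec["user_input"],
                    "completion": rec["final_response"],
                }) + "\n")
                kept += 1
            else:
                dropped += 1  # or route to a rewrite step that uses the critique
    print(f"kept {kept}, dropped {dropped}")
```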

What’s wrong with off-the-shelf LLM judges?

Nothing is strictly wrong with them. It’s just that many people are led astray by them. If you are disciplined you can apply them to your data and see if they are telling you something valuable. However, I’ve found that these tend to cause more confusion than value.

How do you evaluate the LLM judge?

You will collect metrics on the agreement between the domain expert and the LLM judge. This tells you how much you can trust the judge and in what scenarios. Your domain expert doesn’t have to inspect every single example; you just need a representative sample so you can have reliable statistics.

What model do you use for the LLM judge?

For the kind of judge articulated in this blog post, I like to use the most powerful model I can afford in my cost/latency budget. This budget might be different from that of my primary model, depending on the number of examples I need to score. This can vary significantly according to the use case.

What about guardrails?

Guardrails are a separate but related topic. They are a way to prevent the LLM from saying/doing something harmful or inappropriate. This blog post focuses on helping you create a judge that’s aligned with business goals, especially when starting out.

I’m using LLM as a judge and getting tremendous value, but I didn’t follow this approach.

I believe you. This blog post is not the only way to use a LLM as a judge. In fact, I’ve seen people use a LLM as a judge in all sorts of creative ways, which include ranking, classification, model selection and so on. I’m focused on an approach that works well when you are getting started, and avoids the pitfalls of confusing metric sprawl. However, the general process of looking at the data is still central no matter what kind of judge you are building.

How do you choose between traditional ML techniques, LLM-as-a-judge and human annotations?

The answer to this (and many other questions) is: do the simplest thing that works. And simple doesn’t always mean traditional ML techniques. Depending on your situation, it might be easier to use a LLM API as a classifier than to train a model and deploy it.

Can you make judges from small models?

Yes, potentially. I’ve only used the larger models for judges. You have to base the answer to this question on the data (i.e. the agreement with the domain expert).

How do you ensure consistency when updating your LLM model?

You have to go through the process again and measure the results.

How do you phase out human in the loop to scale this?

You don’t need a domain expert to grade every single example. You just need a representative sample. I don’t think you can eliminate humans completely, because the LLM still needs to be aligned to something, and that something is usually a human. As your evaluation system gets better, it naturally reduces the amount of human effort required.

Resources

These are some of the resources I recommend to learn more on this topic:

  • Your AI Product Needs Evals: This blog post is the predecessor to this one, and provides a high-level overview of evals for LLM based products.
  • Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: This paper by Shreya Shankar et al. provides a good overview of the challenges of evaluating LLMs, and the importance of following a good process.
  • Align Eval: Eugene Yan’s new tool that helps you build LLM judges by following a good process. Also read his accompanying blog post.
  • Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge): This is a great survey of different use-cases and approaches for LLM judges, also written by Eugene Yan.
  • Enhancing LLM-As-A-Judge with Grading Notes by Yi Liu et al.: Describes an approach very similar to the one in this blog post, and provides another point of view regarding the utility of writing critiques (they call them grading notes).
  • Custom LLM as a Judge to Detect Hallucinations with Braintrust by Ankur Goyal and Shaymal Anadkt: Provides an end-to-end example of building a LLM judge; for the use case highlighted, the authors found that a classification approach was more reliable than numeric ratings (consistent with this blog post).
  • Techniques for Self-Improving LLM Evals by Eric Xiao from Arize: Shows a nice approach to building LLM Evals with some additional tools that are worth checking out.
  • How Dosu Used LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering by Langchain: Shows a nice approach to building LLM prompts with dynamic examples. The idea is simple, but effective. I’ve been adapting it for my own use cases, including LLM judges. Here is a video walkthrough of the approach.
  • What We’ve Learned From A Year of Building with LLMs: A great overview of many practical aspects of building with LLMs, with an emphasis on the importance of evaluation.

Stay Connected

I’m continuously learning about LLMs, and enjoy sharing my findings. If you’re interested in this journey, consider subscribing.

What to expect:

  • Occasional emails with my latest insights on LLMs
  • Early access to new content
  • No spam, just honest thoughts and discoveries
    无垃圾邮件,只有真诚的想法和发现