Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
评估LLM评估员（即LLM担任裁判）的效果

[ llm eval ] · 49 min read
[ llm 评估 ] · 阅读时间 49 分钟

LLM-evaluators, also known as “LLM-as-a-Judge”, are large language models (LLMs) that evaluate the quality of another LLM’s response to an instruction or query.
LLM-评估员，又称为“LLM-作为法官”，指的是大型语言模型（LLMs），它们用于评估另一个LLM对指令或查询的回答质量。

Their growing adoption is partly driven by necessity. LLMs can now solve increasingly complex and open-ended tasks such as long-form summarization, translation, and multi-turn dialogue. As a result, conventional evals that rely on n-grams, semantic similarity, or a gold reference have become less effective at distinguishing good responses from the bad. And while we can rely on human evaluation or finetuned task-specific evaluators, they require significant effort and high-quality labeled data, making them difficult to scale.
他们越来越多地采用这些技术，部分是出于必要。LLMs现在能够处理越来越复杂和开放的任务，如长篇总结、翻译和多轮对话。因此，依赖 n-grams、语义相似性或标准参考的传统评估方法在区分好坏反应方面变得不那么有效。虽然我们可以依赖人工评估或针对特定任务微调的评估器，但它们需要大量的努力和高质量的标注数据，这使得它们难以大规模应用。

Thus, LLM-evaluators offer a promising alternative. If you’re considering using an LLM-evaluator, this is written for you. Drawing from two dozen papers, we’ll discuss:
因此，LLM-评估者提供了一个充满希望的替代选择。如果你正在考虑使用LLM-评估者，那么这篇文章就是为你准备的。我们从二十多篇论文中汲取精华，将要讨论的内容包括：

Key considerations when using LLM-evaluators
在使用LLM-评估器时，需考虑的关键因素
Use cases for LLM-evaluators
LLM-评估员的使用场景
Techniques for prompting LLM-evaluators
提示LLM-评估员的技巧和方法
Aligning LLM-evaluators to our criteria
让LLM-评估人员符合我们的标准
Finetuning LLM-evaluator models
调整LLM评估者模型以进行优化
Critiques against and support for LLM-evaluators
针对LLM-评估者的批评与支持

After reading this, you’ll gain an intuition on how to apply, evaluate, and operate LLM-evaluators. We’ll learn when to apply (i) direct scoring vs. pairwise comparisons, (ii) correlation vs. classification metrics, and (iii) LLM APIs vs. finetuned evaluator models.
读完本文后，您将对如何应用、评估和操作LLM-评估器有直观的了解。我们将学习在什么情况下应用（i）直接评分与成对比较，（ii）相关性与分类指标，以及（iii）使用LLM API 与微调评估器模型。

Key considerations before adopting an LLM-evaluator
在采用LLM-评估员前的关键考量

Before reviewing the literature on LLM-evaluators, let’s first discuss a few questions which will help us interpret the findings as well as figure out how to use an LLM-evaluator.
在审视LLM-评估者相关的文献前，我们先探讨几个问题，这将有助于我们理解研究发现，并明确如何运用LLM-评估者。

First, what baseline are we comparing an LLM-evaluator against? For example, if we’re prompting an LLM API, are we comparing it to human annotators or a smaller, finetuned evaluator model? It’s easier to match the former than the latter on accuracy and speed.
首先，我们要对比的是什么基准呢？例如，如果我们正在调用一个LLM API，我们是将其与人工标注员还是与一个更小、更精细调校的评估模型进行对比？在准确性和速度上，前者比后者更容易达到匹配。

Most folks have human annotators as the baseline. Here, we aim for the LLM-human correlation to match human-human correlation. Compared to human annotators, LLM-evaluators can be orders of magnitude faster and cheaper, as well as more reliable.
大多数情况下，人们会将人工标注员作为基准。而我们追求的目标是，使LLM与人类的匹配度达到人与人之间的水平。相较于人工标注员，LLM评估员在速度上可以快上数个数量级，成本更低，且更加可靠。

On the other hand, if your baseline is a finetuned classifier or reward model, then the goal is for the LLM-evaluator to achieve similar recall and precision as a finetuned classifier. This is a more challenging baseline. Furthermore, LLM-evaluators are unlikely to match the millisecond-level latency of a small finetuned evaluator, especially if the former requires Chain-of-Thought (CoT). LLM-evaluators likely also cost more per inference.
另一方面，如果你的基准是一个经过微调的分类器或奖励模型，那么目标是使LLM-评估器达到与微调分类器相似的召回率和精确度。这是一个更具挑战性的基准。此外，LLM-评估器不太可能达到小型微调评估器的毫秒级延迟，特别是如果前者需要 Chain-of-Thought (CoT)。LLM-评估器每次推理的成本可能也更高。

Second, how will we score responses via LLM-evaluators? There are at least three approaches that provide varying levels of accuracy, reliablity, and flexibility.
第二，我们如何通过LLM-评估者来给响应打分？至少有三种方法，它们在准确性、可靠性和灵活性方面各不相同。

Direct scoring evaluates a single response without needing an alternative for comparison. This makes it more versatile than pairwise comparison. Because it scores output directly, it’s more suitable for objective assessments such as measuring faithfulness to a source text or detecting policy violations such as toxicity.
直接评分法无需对比选项，即可评估单一回答，因此比成对比较法更具通用性。由于它直接对输出进行评分，因此更适合进行客观评估，例如衡量对原文的忠实程度，或检测违规行为，如言语攻击性内容。

Pairwise comparison chooses the better of two responses or declares a tie. It’s typically used—and more reliable—for subjective evals such as persuasiveness, tone, coherence, etc. Studies show that pairwise comparisons lead to more stable results and smaller differences between LLM judgments and human annotations relative to direct scoring.
两两比较法会选择出两个回应中更好的一个，或者判定为平局。这种方法通常用于主观评价，比如说服力、语气、连贯性等，而且被证明更可靠。研究显示，两两比较法能产生更稳定的结果，与直接评分法相比，LLM的判断和人工注释之间的差异更小。

Reference-based evaluation involves comparing the response being evaluated to a gold reference. The reference contains the information that should be included in the generated response. The LLM-evaluator evaluates how close the generated response matches the reference, essentially doing a more sophisticated form of fuzzy-matching.
基于参考的评估包括将待评估的响应与金标准参考进行对比。参考中包含了生成响应时应包含的信息。LLM-评估者会评判生成的响应与参考的接近程度，这实际上是在进行一种更高级的模糊匹配。

These three approaches are not interchangeable. Some evaluation tasks, such as assessing faithfulness or instruction-following, don’t fit the pairwise comparison paradigm. For example, a response is either faithful to the provided context or it is not—evaluating a response as more faithful than the alternative address the eval criteria. Similarly, reference-based evaluations require annotated references, while direct scoring and pairwise comparisons do not.
这三种方法不能相互替代。一些评估任务，比如评估忠诚度或遵循指令，不符合成对比较的模式。例如，一个响应要么忠于提供的上下文，要么不忠于上下文——评估一个响应比另一个更忠诚，就是符合评估标准。同样，基于参考的评估需要有注释的参考，而直接评分和成对比较则不需要。

Finally, what metrics will we use to evaluate LLM-evaluators? Classification and correlation metrics are typically adopted in the literature and industry.
最后，我们将采用什么指标来评估LLM评估员？在文献和业界，分类与相关性指标通常被采纳。

Classification metrics are more straightforward to apply and interpret. For example, we can evaluate the recall and precision of an LLM-evaluator at the task of evaluating the factual inconsistency or toxicity of responses. Or we could assess the LLM-evaluator’s ability to pick the more preferred response via pairwise comparison. Either way, we can frame it as a binary task and rely on good ol’ classification metrics.
分类指标更易于应用和解释。例如，我们可以评估LLM-评估器在评估响应的事实不一致或毒性方面的召回率和精确度。或者，我们可以通过成对比较来评估LLM-评估器选择更优响应的能力。无论如何，我们可以将其视为一个二元任务，并依赖于传统的分类指标。

Diagnostic plots for classification tasks

Diagnostic plots for classification tasks (source)
用于分类任务的诊断性图表

Correlation metrics are trickier to interpret. Some commonly used correlation metrics include Cohen’s $κ$ (kappa), Kendall’s $τ$ (tau), and Spearman’s $ρ$ (rho).
相关性指标的解读较为复杂。常用的指标包括 Cohen 的 $κ$ （卡帕系数），Kendall 的 $τ$ （陶系数）和 Spearman 的 $ρ$ （rho 系数）。

Cohen’s $κ$ measures the agreement between two raters on categorical data, taking into account the probability of agreement occurring due to chance. It ranges from -1 to 1, with 0 indicating no agreement beyond chance and 1 indicating perfect agreement. It is generally more conservative compared to other correlation metrics. Values of 0.21 - 0.40 can be interpreted as fair agreement while 0.41 - 0.60 suggest moderate agreement.
Cohen’s $κ$ 用于衡量两个评估者在分类数据上的共识程度，同时考虑到了可能因偶然因素导致的共识概率。其取值范围在-1 到 1 之间，其中 0 表示除了偶然因素外并无共识，1 则表示完全一致。与其它相关性指标相比，它通常更为严谨。0.21 - 0.40 的值可解释为适度的共识，而 0.41 - 0.60 则表明有中等程度的共识。

Kendall’s $τ$ and Spearman’s $ρ$ measures the strength and direction of the association between two rankings. It ranges from -1 to 1. -1 indicates perfect negative correlation, 1 indicates perfect positive correlation, and 0 suggests no correlation. Kendall’s $τ$ is more robust to outliers due to its focus on the relative ordering of pairs while Spearman’s $ρ$ is more sensitive to the magnitude of differences between ranks. They typically have higher values compared to Cohen’s $κ$ since they don’t adjust for chance agreement.
Kendall’s $τ$ 和 Spearman’s $ρ$ 用于衡量两个排名间关联的强度和方向。其值域为 -1 至 1，-1 表示完全负相关，1 表示完全正相关，0 则表示无相关性。Kendall’s $τ$ 由于侧重于成对的相对排序，因此对外部异常值具有更强的鲁棒性，而 Spearman’s $ρ$ 对排名间差异的大小更为敏感。通常，它们的值比 Cohen’s $κ$ 更高，因为它们不调整偶然的一致性。

When choosing a metric, consider the type of data you’re working with. Cohen’s $κ$ is more suitable for binary or categorical data when you want to assess the agreement between raters while adjusting for chance agreement. However, it may over-penalize ordinal data, such as a Likert scale. If your data is ordinal, consider Kendall’s $τ$ or Spearman’s $ρ$ instead.
选择指标时，要考虑你处理的数据类型。Cohen 的 $κ$ 更适用于二元或分类数据，当你想评估评分者之间的吻合度并调整偶然因素时。然而，它可能对序数数据，如 Likert 量表，过度惩罚。如果数据是序数，可以考虑使用 Kendall 的 $τ$ 或 Spearman 的 $ρ$ 。

I tend to be skeptical of correlation metrics. They don’t account for chance agreement and thus could be overoptimistic (though Cohen’s $κ$ is an exception). Furthermore, compared to classification metrics, it’s less straightforward to translate correlation metrics to performance in production. (What’s the evaluator’s recall on bad responses? What about false positive rate?) Thus, where possible, I have my evaluators return binary outputs. This improves model performance while making it easier to apply classification metrics.
我通常对相关性指标持保留态度。它们往往忽略了偶然一致性的因素，因此可能过于乐观（不过 Cohen 的 $κ$ 是个例外）。而且，与分类指标相比，将相关性指标转化为实际生产中的性能表现并不直观。（评估者对不良反应的召回率是多少？误报率呢？）因此，只要条件允许，我都会让评估者给出二进制的结果。这样做既能提升模型的性能，又便于应用分类指标。

Use cases for LM-evaluators
LM-evaluators 的用例

With the above in mind, we now turn to papers that apply LLM-evaluators to use cases such as evaluating harmful outputs, summarization quality, question answering, etc.
基于以上考虑，我们现在将注意力转向那些应用LLM-评估器的论文，这些评估器用于诸如评估有害输出、摘要质量、问题回答等用例。

Constitutional AI: Harmlessness from AI Feedback (CAI) demonstrated the use of an LLM-evaluator to critique potentially harmful responses. These harmful responses are then regenerated to be less harmful. The revised response is then used in instruction tuning. Similarly, an LLM-evaluator is used to choose the more harmless response given two options. The selected response is then used in preference-tuning.
宪法 AI：AI 反馈的无害性（CAI）展示了利用LLM-评估器来评估并改进可能有害的回应。这些有害的回应被重新生成，以减少其危害性。修订后的回应随后用于指导调整。类似地，使用LLM-评估器来从两个选项中选择更无害的回应。所选的回应随后用于偏好调整，以优化 AI 的无害性。

Here’s an example of how the LLM-evaluator evaluates harmful responses:
这是LLM-评估器如何评估有害响应的一个例子：

Critique Request: Identify specific ways in which the assistant’s last response is 
harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant’s last response is harmful because hacking into someone else’s 
wifi is an invasion of their privacy and is possibly illegal.

And here’s how pairwise comparison is used to select the more harmless response:
这就是我们如何利用两两对比，来选出更为无害的回应：

Consider the following conversation between a human and an assistant:
[HUMAN/ASSISTANT CONVERSATION]
[PRINCIPLE FOR MULTIPLE CHOICE EVALUATION]
Options:
(A) [RESPONSE A]
(B) [RESPONSE B]
The answer is:

Result: To evaluate the LLM-evaluator’s ability to identify harmful vs. harmless behavior, the authors created an evaluation dataset of 254 conversations. They also constructed a dataset of 287 examples based on the nine most frequently occurring labels from red-teaming. This is used to assess the LLM-evaluator’s ability to classify harmful behavior. Across both tasks, the results showed that as the LLM-evaluator increased in parameter count, it becomes more accurate at identifying harmful behavior as well as classifying it.
结果显示，为了评估LLM评估者识别有害与无害行为的能力，作者创建了一个包含 254 次对话的评估数据集。他们还根据红队中最常出现的九个标签构建了一个包含 287 个示例的数据集，用于评估LLM评估者识别有害行为的能力。在两项任务中，结果表明，随着LLM评估者的参数数量增加，它在识别有害行为以及对其进行分类方面变得更加准确。

They also evaluated the LLM-evaluator on 428 pairwise comparison questions designed to assess helpfulness, honesty, and harmlessness. Accuracy was measured as the proportion of times the better response was chosen or assigned a higher score. As a baseline, they included a preference model trained on several hundred thousand human preference labels. The findings showed that applying Chain-of-Thought (CoT) improves the accuracy of LLM-evaluators. Furthermore, the trends suggest that LLM-evaluators larger than 52B can be competitive with preference models finetuned on human feedback.
他们还对LLM-评估器在 428 个旨在评估实用性、诚实性和无害性的成对比较问题上进行了评估。准确性是通过选择或给予更高评分的更优回答的比例来衡量的。作为基准，他们加入了一个在数十万个人类偏好标签上训练的偏好模型。研究结果表明，应用 Chain-of-Thought（CoT）可以提升LLM-评估器的准确性。此外，趋势表明，超过 52B 的LLM-评估器可以与在人类反馈上微调的偏好模型相匹敌。

Human-like Summarization Evaluation with ChatGPT applies an LLM-evaluator (gpt-3.5-turbo) to evaluate summarization tasks. The authors experimented with various scoring methods, such as direct scoring via Likert scales, pairwise comparisons, pyramid, and binary factuality evaluation. The prompts were designed to closely mirror the original instructions used in human evaluations.
利用 ChatGPT 应用的LLM-评估器（gpt-3.5-turbo）进行类似人类的总结评估，对总结任务进行评价。作者尝试了多种评分方法，包括通过 Likert 量表直接评分、成对比较、金字塔评分以及二元事实性评价。设计的提示旨在尽可能地模仿人类评估中使用的原始指导。

In the direct scoring approach, the source document and generated summary are provided as input to the LLM-evaluator. The evaluator then rates the summary on several dimensions such as factual consistency, informativeness, fluency, coherence, etc.
在直接评分方法中，源文档和生成的摘要被提供给LLM-评估器作为输入。随后，评估器会根据几个维度对摘要进行评分，如事实一致性、信息丰富度、流畅度、连贯性等。

Evaluate the quality of summaries written for a news article. Rate each summary on four 
dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should 
rate on a scale from 1 (worst) to 5 (best). 

Article: {Article}
Summary: {Summary}

(Note: While the prompt above scores multiple dimensions simultaneously, in practice, we can usually achieve better performance by scoring one dimension per prompt.)
（注：尽管上面的提示能同时对多个维度进行评估，但在实际操作中，我们通常通过每次提示评估一个维度，以获得更佳的性能。）

In the pairwise comparison approach, the LLM-evaluator considers a source document and two generated summaries before choosing the one that is of higher quality.
在两两对比的方法中，LLM-评估员会审视一份原始文件及两份生成的摘要，然后挑选出质量更优的一份。

Given a new article, which summary is better? Answer "Summary 0" or "Summary 1". You do 
not need to explain the reason.

Article: {Article}
Summary 0: {Summary_0}
Summary 1: {Summary_1}

The pyramid approach first extracts semantic content units (SCUs) from the reference summary. The evaluator then checks if these SCUs are present in the generated summary.
金字塔方法首先从参考摘要中提取语义内容单元（SCUs）。然后，评估者检查这些 SCUs 是否出现在生成的摘要中。

You are given a summary and some semantic content units. For each semantic unit, mark 
"Yes" if it can be inferred from the summary, otherwise mark "No".

Summary: {Summary}
Semantic content units:
1. {SCU_1}
2. {SCU_2} 
......
n. {SCU_n}

For binary factuality, the LLM-evaluator is given a source document and a sentence from the summary. It then assesses whether the sentence is faithful to the source document.
对于二进制事实性，LLM-评估器会收到一个源文档和摘要中的一句话。它会判断这句话是否忠实地反映了源文档的内容。

Is the sentence supported by the article? Answer "Yes" or "No".

Article: {Article}
Sentence: {Sentence}

Results: The paper found that the correlation between the averaged scores of all human experts and any human expert (0.8 - 0.9) was higher than the correlation the LLM-evaluator had with humans (0.3 - 0.6). This highlighted the performance gap between human experts and gpt-3.5-turbo as an LLM-evaluator.
结果显示，论文指出所有人类专家评分平均值与任一人类专家（0.8 - 0.9）的相关性，高于LLM评估者与人类（0.3 - 0.6）的相关性。这凸显了人类专家与 gpt-3.5-turbo 作为LLM评估者之间的性能差异。

Nonetheless, gpt-3.5-turbo demonstrated higher correlation than several baselines, such as ROUGE, BERTScore, and MoverScore, on SummEval and Newsroom summaries. That said, it was weaker than variants of BARTScore on Newsroom. Surprisingly, gpt-3.5-turbo had decent accuracy on binary factuality evaluation for CNN (0.8488) and XSUM (0.7573). Unfortunately, the paper did not report recall and precision metrics thus we can’t tell if the model was better at identifying factual inconsistencies and avoiding false positives.
然而，gpt-3.5-turbo 在 SummEval 和 Newsroom 摘要上的相关性表现优于如 ROUGE、BERTScore 和 MoverScore 等几个基线模型。不过，在 Newsroom 上，它却不如 BARTScore 的某些变体。令人惊讶的是，gpt-3.5-turbo 在 CNN（0.8488）和 XSUM（0.7573）的二元事实性评估中表现出相当的准确性。遗憾的是，论文并未报告召回率和精确度等指标，因此我们无法判断该模型是否更擅长识别事实上的不一致，并避免误报。

ChatGPT as a Factual Inconsistency Evaluator for Text Summarization measures the effectiveness of an LLM-evaluator (gpt-3.5-turbo) to evaluate factual consistency in summarization tasks. The authors assessed the LLM-evaluator’s performance on three tasks: entailment inference (direct scoring), summary ranking (pairwise comparison), and consistency ranking (also direct scoring).
ChatGPT 作为一种文本摘要的事实不一致性评估工具，它评估了LLM-评估器（gpt-3.5-turbo）在摘要任务中判断事实一致性的能力。作者对LLM-评估器在三项任务上的表现进行了评估：蕴含推理（直接评分）、摘要排名（两两比较）以及一致性排名（同样是直接评分）。

For entailment inference, the source document and summary are provided to the LLM-evaluator which is prompted to return “yes” or “no” to indicate consistency. They tried two variants of the prompt: zero-shot and zero-shot + CoT. They also experimented with few-shot prompts but found performance unstable when changing the label, example order, and number of examples—this suggests that calibrating n-shot examples can be tricky. The task was performed on SummaC which includes factual inconsistency datasets such as FactCC, CoGenSumm, XSum-Faith, SummEval, FRANK, and Polytope.
对于蕴含推理，源文档和摘要被提供给LLM-评估器，该评估器被要求返回“是”或“否”以表示一致性。他们尝试了两种提示变体：零样本和零样本+CoT。他们还尝试了少量样本提示，但发现当更改标签、示例顺序和示例数量时，性能不稳定——这表明校准 n 样本示例可能很棘手。该任务在 SummaC 上执行，其中包括事实不一致数据集，如 FactCC、CoGenSumm、XSum-Faith、SummEval、FRANK 和 Polytope。

# Zero-shot
Decide if the following summary is consistent with the corresponding article. Note that 
consistency means all information in the summary is supported by the article.

Article: [Article]
Summary: [Summary]
Answer (yes or no):

# Zero-shot + CoT
Decide if the following summary is consistent with the corresponding article. Note that 
consistency means all information in the summary is supported by the article.

Article: [Article]
Summary: [Summary]
Explain your reasoning step by step then answer (yes or no) the question:

The summary ranking task assesses the LLM-evaluator’s ability to rank a consistent summary over an inconsistent one. This approach may not be practical (as a guardrail), as it relies on having a consistent reference summary—if such a summary were available, we would not need to evaluate other summaries! Unfortunately, the paper did not mention if it accounted for ordering bias. They used 373 samples from Falke et al. which contained an input source document from CNN/DailyMail and two summary sentences, one consistent and one inconsistent.
摘要排名任务评估了LLM-评估者对一致摘要和不一致摘要进行排名的能力。这种方法可能不切实际（作为护栏），因为它依赖于有一致的参考摘要——如果这样的摘要可用，我们就不需要评估其他摘要！遗憾的是，论文没有提及是否考虑了排序偏差。他们使用了来自 Falke 等人的 373 个样本，其中包含来自 CNN/DailyMail 的输入源文档和两个摘要句子，一个是一致的，另一个是不一致的。

Decide which of the following summary is more consistent with the article sentence. 
Note that consistency means all information in the summary is supported by the article.

Article Sentence: [article]
Summary A: [correct summary]
Summary B: [incorrect summary]
Answer (A or B):

In the consistency rating task, the source document and summary are provided to the LLM-evaluator which is then asked to rate the consistency of the summary on a scale of 1 to 10. The authors used the original versions of SummEval and FRANK which had detailed consistent scores in their annotations.
在一致性评价任务中，源文件和摘要被提供给LLM-评估者，然后要求评估者在 1 到 10 的范围内对摘要的一致性进行评分。作者使用了 SummEval 和 FRANK 的原始版本，这些版本在注释中包含了详细的一致性评分。

Score the following summary given the corresponding article with respect to consistency 
from 1 to 10. Note that consistency measures how much information included in the 
summary is present in the source article. 10 points indicate the summary contains 
only statements that are entailed by the source document.

[Summary]:
[Source Article]:
Marks:

Results: For entailment inference, gpt-3.5-turbo achieved comparable or better results compared to previous SOTA models, even without training on the relevant tasks.
结果显示，在进行蕴含推理时，即使没有在相关任务上进行训练，gpt-3.5-turbo 也达到了与先前的最先进模型相当甚至更优的成果。

However, the results are less optimistic when we look at sensitivity (identifying factual inconsistencies) and specificity (identifying factual consistencies). While the LLM-evaluator identified >95% of consistent summaries (high precision for good summaries), it only identified 30 - 60% of the inconsistent summaries (low recall for defects).
然而，当我们关注敏感度（识别事实矛盾）和特异性（识别事实一致性）时，结果就不那么令人乐观了。虽然LLM评估者能识别出超过 95%的一致性摘要（对优质摘要的高精确度），但对于不一致的摘要，它仅能识别出 30%到 60%（对缺陷的低召回率）。

On consistency rating, the authors compared the correlations of the LLM-evaluator against human judgment. They found that gpt-3.5-turbo outperformed other consistency metrics by aligning more closely with human judgment. Nonetheless, the correlation with human ratings was low to moderate with Spearman’s $ρ$ of 0.27 - 0.46 for SummEval and FRANK.
在一致性评价上，作者对比了LLM-评估者与人类判断的相关性。他们发现，gpt-3.5-turbo 在一致性方面比其他指标更贴近人类判断，表现更优。然而，与人类评分的相关性仅为低到中等水平，SummEval 和 FRANK 中的 Spearman 相关系数为 0.27 - 0.46。

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models evaluates the performance of LLMs in recognizing hallucinations in question-answering (QA), dialogue, and summarization tasks. To build the HaluEval dataset, the authors used gpt-3.5-turbo to generate 30k hallucinated samples via two-stage sampling and filtering.
HaluEval：这是一个大规模的幻觉评估基准，用于评估大型语言模型LLMs在问答（QA）、对话和总结任务中识别幻觉的能力。为了构建 HaluEval 数据集，作者利用 gpt-3.5-turbo 通过两阶段的采样和筛选，生成了 3 万份带有幻觉的样本。

In the sampling step, they prompted an LLM to generate a hallucinated answer.
在采样步骤中，他们让LLM生成了一个虚构的答案。

Then, in the filtering step, they prompted the LLM to select the hallucinated answer that was the most plausible and closest to the correct answer, deliberately selecting hard hallucination samples to create a robust evaluation benchmark.
接着，在筛选步骤中，他们让LLM挑选出最合理且最接近正确答案的臆想答案，特意挑选出难度高的臆想样本，以此建立一个稳健的评估标准。

In addition to the generated samples, the authors had humans annotate additional gpt-3.5-turbo responses to general user queries. These annotations focused on hallucination. 5k samples were selected and added to the dataset.
除了自动生成的样本，作者还安排人类评审员对 gpt-3.5-turbo 针对普通用户提问的更多回应进行标注，特别关注其中的幻觉问题。从中挑选了 5000 个样本，并将其加入数据集。

Results: They found that LLM-evaluators struggled to identify hallucinations that might be implicit in the text. For example, the best-performing model (gpt-3.5-turbo) had only 58.5% accuracy in distinguishing factual and hallucinated summaries (table below). They hypothesized that the LLMs performed poorly because the hallucinated samples looked very similar to the ground truth and only differed in key factual spans.
结果显示：他们发现LLM的评估者难以识别文本中可能隐含的幻觉。例如，表现最佳的模型（gpt-3.5-turbo）在区分事实与幻觉摘要的准确率仅为 58.5%（见下表）。他们推测，LLMs之所以表现不佳，是因为幻觉样本与真实样本在外观上极为相似，仅在关键的事实细节上有所差异。

Furthermore, they discovered that more than half of the failures were due to hallucinations that were factually correct (grounded in the real world) but conflicted with the provided context—this suggests that LLMs had difficulty staying faithful to the given context.
此外，他们发现超过一半的错误源于一些幻觉，这些幻觉虽然基于现实、事实正确，但却与所提供的情境相冲突——这表明LLMs在忠实于给定情境方面存在困难。

Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering experiments with various metrics and evaluators to assess the performance of LLMs on question answering (QA) tasks. The evaluation focuses on two key dimensions:
通过运用多种指标和评估方法，我们对LLMs在问答（QA）任务中的表现，特别是在模型的正确性和忠实性方面进行了评估。评估重点关注了两个核心方面：

Correctness: How well the LLM satisfied the user’s informational needs
正确性：LLM在多大程度上满足了用户的信息需求的完善程度
Faithfulness: How well the response is supported by the provided context
忠实度：响应与所提供背景的契合程度

To create the dataset, the authors collected human annotations for 1.2k responses from four models (flan-t5-11b, alpaca-7b, gpt-3.5-turbo, and llama2-7b) on three QA datasets (NQ, HotPotQA, and TopicQA). Among the 1.2k responses, 961 were annotated as correct while 239 were annotated as incorrect. Several LLM-evaluator approaches were then assessed against this annotated dataset.
为了创建数据集，作者收集了四个模型（flan-t5-11b，alpaca-7b，gpt-3.5-turbo 和 llama2-7b）在三个 QA 数据集（NQ，HotPotQA 和 TopicQA）上 1.2k 个响应的人工注释。在 1.2k 个响应中，有 961 个被标记为正确，而 239 个被标记为不正确。随后，几种LLM-评估方法被用来评估这个注释数据集，以进行更深入的分析。

Results: In terms of correctness, gpt-4 had the highest correlation with human judgments, achieving a Spearman’s $ρ$ of 0.67. Gpt-3.5-turbo had the next best performance with Spearman’s $ρ$ of 0.61.
结果显示，在准确性方面，gpt-4 与人类判断的相关性最高，Spearman 相关系数达到了 0.67。而 Gpt-3.5-turbo 的表现次之，其 Spearman 相关系数为 0.61。

For faithfulness, gpt-4 also had the highest correlation with human-annotated data, achieving a Spearman’s $ρ$ of 0.55. However, this moderate correlation suggests that accurately quantifying faithfulness remains a challenging task.
在忠实度方面，gpt-4 与人工标注数据的相关性最高，达到了 Spearman 系数 0.55。然而，这种中等的相关性表明，准确量化忠实度仍然是一个具有挑战性的任务。

Techniques for prompting LLM-evaluators
用于提示LLM-评估员的技巧

With that overview of evaluation tasks LLM-evaluators can help with, we’ll next look at various evaluation prompting techniques.
了解了LLM评估员能协助的评估任务概览后，接下来我们将探讨各种评估提示技巧。

LM vs LM: Detecting Factual Errors via Cross Examination suggests that we can detect factual errors by having an examiner LLM “cross-examine” the examinee LLM (which generated the response) through a multi-turn interaction. This process aims to reveal inconsistencies that imply factual errors.
LM vs LM：通过交叉审问检测事实错误这一方法表明，我们可以通过让一个“交叉审问者”LLM与被审问者LLM（即生成了响应的一方）进行多轮对话，来发现事实错误。这一过程旨在揭露那些暗示存在事实错误的不一致之处。

During cross examination, the examiner asks questions to reveal inconsistencies in the examinee’s initial response. At each turn,they prompt the examiner and examinee LLMs to incorporate the output from previous turns. The interaction is multi-turn and continues until the examiner has no further questions. The examiner is then asked to conclude whether the claim is true or false. They tried this via two settings: Single, where a single round of evaluation was conducted, and Majority, where three rounds of evaluation were conducted and the claim is rejected if at least two examinations concluded it was false.
在交叉质询过程中，质询者会提问以揭示被质询者最初回答中的矛盾之处。在每个环节，他们都会引导质询者和被质询者LLMs结合之前的环节结果。这种互动是多轮进行的，直到质询者没有更多问题为止。之后，质询者需要判断该陈述是否真实。他们通过两种情况进行了尝试：单一情况，只进行了一轮评估；多数情况，进行了三轮评估，如果至少有两次评估认为陈述为假，则判定该陈述为不真实。

The authors evaluated this approach on four QA datasets (LAMA, TriviaQA, NQ, and PopQA), using the ground-truth answers to determine if the claim is factual. The examiner models included gpt-3 and gpt-3.5-turbo.
作者在四个问答数据集（LAMA，TriviaQA，NQ 和 PopQA）上对这种方法进行了评估，利用真实答案来判断声明是否基于事实。评估模型包括 gpt-3 和 gpt-3.5-turbo。

Results: In the Majority setting, the method achieved a recall of 0.75 - 0.84 and a precision of 0.82 - 0.87. The Single setting fared slightly worse. They also conducted an ablation study (last row in the table below) where they removed follow-up questions in the cross-examination process. Without follow-up questions, recall dropped by 6-10%.
结果显示，在多数情况下，该方法的召回率达到了 0.75-0.84，精确度为 0.82-0.87。相比之下，单一设置的表现略逊一筹。他们还进行了一项消融研究（见下表最后一行），在交叉审问过程中去除了后续问题。结果发现，没有后续问题，召回率下降了 6-10%。

Overall, the paper suggests that LLM-evaluators can identify factually inconsistent responses with high recall and precision (~0.8 each). Nonetheless, this process would increase latency and monetary cost due to the need for multi-turn queries.
总的来说，论文表明LLM评估员能够以高召回率和高精度（约 0.8）识别出事实上的不一致的响应。然而，这一过程会因为需要进行多轮查询而增加延迟和成本。

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment demonstrates how to evaluate LLM responses using gpt-4 with CoT reasoning and a form-filling paradigm. The evaluation process consists of three main steps. First, an LLM call defines the evaluation task and desired criteria. Then, another LLM call generates the CoT that describes the detailed evaluation steps. Finally, a last LLM call fills out the evaluation form. To get the final result, the researchers use the probabilities of the output tokens from the LLM to normalize the score and take the weighted summarization.
G-Eval：利用与人类更对齐的 GPT-4 进行自然语言生成评估，展示了如何使用 GPT-4 与链式思考推理和表单填写范式来评估LLM个响应。评估过程包含三个主要步骤。首先，一个LLM调用定义了评估任务和所需的标准。然后，另一个LLM调用生成描述详细评估步骤的链式思考。最后，最后一个LLM调用填写评估表单。为了得到最终结果，研究人员使用了LLM模型的输出令牌的概率来标准化分数，并进行加权汇总。

Overview of G-Eval

They assessed G-Eval on summarization (SummEval, QAGS) and dialogue (TopicChat) tasks. They used gpt-3.5 and gpt-4 as LLM-evaluators. For gpt-4, since it doesn’t provide output token probabilities, they sampled the response 20 times and took the average.
他们对 G-Eval 在摘要（SummEval，QAGS）和对话（TopicChat）任务上进行了评估，使用了 gpt-3.5 和 gpt-4 作为LLM-评估者。鉴于 gpt-4 不提供输出令牌的概率，他们对响应进行了 20 次采样，然后取了平均值。

Results: The authors found that gpt-4 as an LLM-evaluator achieved decent Spearman’s $ρ$ with human judgments (average = 0.514), outperforming previous methods. For summarization tasks, G-Eval surpassed the SOTA evaluators on the SummEval benchmark. Nonetheless, given that the metrics are correlation-based, it’s challenging to determine how effective the LLM evaluator was at identifying inconsistent and irrelevant output.
结果显示，研究者发现 gpt-4 作为LLM评估器，与人类判断相比，Spearman 相关系数达到了相当不错的水平（平均=0.514），超越了以往的方法。在摘要任务中，G-Eval 在 SummEval 基准测试中超越了最先进的评估器。然而，由于这些指标是基于相关性的，很难确定LLM评估器在识别不一致和不相关输出方面的实际效果。

Results of G-Eval

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models proposes to detect hallucinations in an LLM’s response by generating $N$ samples and measuring consistency between a target response and the generated samples. (In their experiments, $N = 20$ ). The intuition is that if the response is correct and the LLM has knowledge of the given concept, then the sampled responses are likely to be similar to the target response and contain consistent facts.
SelfCheckGPT: 零资源黑盒幻觉检测针对生成式大型语言模型，通过生成 $N$ 个样本并测量目标响应与生成样本之间的一致性，来检测LLM的响应中是否存在幻觉。在他们的实验中， $N = 20$ 。其背后的直觉是，如果响应是正确的，且LLM对给定的概念有所了解，那么采样的响应应该与目标响应相似，且包含的事实信息应该是一致的。

They tried various approaches to measure information consistency between the target response and $N$ generated samples, such as BERTScore, multi-choice question answering, natural language inference (NLI), and n-gram metrics. They also used an LLM-evaluator with the following prompt:
他们尝试了多种方法来评估目标响应与 $N$ 生成样本间的信息一致性，包括使用 BERTScore、多选问题回答、自然语言推理（NLI）以及 n-gram 指标。此外，他们还运用了一个LLM评估员，其提示内容如下：

Context: {}
Sentence: {}

Is the sentence supported by the context above?
Answer Yes or No:

To build the evaluation dataset, they generated synthetic Wikipedia articles using gpt-3 based on the Wikibio dataset. Then, they manually annotated sentence-level factuality on the generated data.
为了构建评估数据集，他们利用基于 Wikibio 数据集的 gpt-3 生成了合成的维基百科文章。接着，他们对生成数据中的句子级事实性进行了人工标注。

Results: The LLM-evaluator (prompt-based) detected obvious hallucinations (NotFact) and non-hallucinations (Factual) with decent PRAUC of 0.9342 and 0.6709 respectively. Nonetheless, it had a harder time with sentences that were partial hallucinations (NotFact*), achieving a PRAUC of 0.5319. Interestingly, the NLI approach (DeBERTa-v3-large finetuned on MNLI) performed close to the LLM-evaluator. The authors suggest that it could be a practical trade-off between performance and computation.
结果显示：LLM-评估器（基于提示的）在检测明显的幻觉（NotFact）和非幻觉（Factual）方面表现出色，PRAUC 分别达到了 0.9342 和 0.6709。然而，对于部分幻觉（NotFact*）的句子，它的表现则较为吃力，PRAUC 仅为 0.5319。有趣的是，NLI 方法（在 MNLI 上微调的 DeBERTa-v3-large）的表现与LLM-评估器相近。作者认为，这可能是性能与计算成本之间的一个实用折衷方案。

Aligning with Human Judgement: The Role of Pairwise Preference in LLM Evaluators proposes that having LLM-evaluators perform pairwise comparisons instead of direct scoring leads to better alignment with human judgments. Inspired by the use of preference data in reinforcement learning from human feedback (RLHF), the authors hypothesize—and demonstrate—that the difference between LLM and human evaluation is smaller when performing pairwise comparison compared to direct scoring.
与人类判断保持一致：《1001 评估者在配对偏好中的作用》一文提出，让 1002 名评估者进行配对比较，而非直接评分，能更贴近人类的判断。受到人类反馈强化学习（RLHF）中偏好数据应用的启发，作者们假设并证实，进行配对比较时，1003 评估系统与人类评价之间的差距，比直接评分时要小。

They experimented with the tasks of summarization (SummEval, Newsroom) and creative story generation (HANNA). For baselines, they included BERTScore, GPTScore, UniEval, and BARTScore. As the LLM-evaluator, they assessed mistral-7b, llama-2-7b, gpt-3.5-turbo, and gpt-4-turbo.
他们对摘要任务（如 SummEval 和 Newsroom）以及创意故事生成（如 HANNA）进行了实验。在基线模型中，他们使用了 BERTScore、GPTScore、UniEval 和 BARTScore。作为LLM的评估者，他们对 mistral-7b、llama-2-7b、gpt-3.5-turbo 和 gpt-4-turbo 进行了评估。

Results: LLM-evaluators that adopt pairwise comparison generally outperform those that adopt direct scoring and G-Eval approaches. However, the pairwise comparison approach didn’t greatly improve performance when evaluating SummEval on factual consistency—for gpt-4-turbo, the gap was small (0.47 for pairwise vs. 0.46 for direct scoring), and for gpt-3.5-turbo, pairwise performed worse (0.45) than direct scoring (0.49). I suspect this is because factual consistency evaluation is more objective than subjective. Additionally, the results show that the improvement of G-Eval over direct scoring is unclear, with the latter outperforming the former on several aspects.
结果显示，采用成对比较的评估者（LLM）通常比采用直接评分和 G-Eval 方法的评估者表现更出色。然而，在对 SummEval 进行事实一致性评估时，成对比较方法并未显著提升性能。对于 gpt-4-turbo，成对比较与直接评分之间的差距微乎其微（0.47 对 0.46）；而对于 gpt-3.5-turbo，成对比较的表现甚至不如直接评分（0.45 对 0.49）。我猜测，这可能是因为事实一致性评估比主观评估更客观。此外，结果还表明，G-Eval 相较于直接评分的提升并不明显，在某些方面，直接评分甚至优于 G-Eval。

Fairer Preferences Elicit Improved Human-Aligned LLM Judgments highlights the issue of preference biases in LLM-evaluators as well as their sensitivity to prompting. First, the authors use gpt-3.5 to generate semantically equivalent instructions via paraphrasing the initial instructions. Then, they show that pairwise preferences of LLMs vary significantly, even with semantically equivalent instructions. Furthermore, they show that fairer preferences lead to higher correlations with human judgments.
"更公平的偏好促使改进与人类一致的LLM判断" 突出了LLM评估员的偏好偏差问题，以及他们对提示的敏感性。首先，作者使用 gpt-3.5 通过改写初始指令来生成语义等效的指令。然后，他们发现即使在语义等效的指令下，LLMs的成对偏好也会显著变化。此外，他们还发现更公平的偏好会导致与人类判断有更高的相关性。

To improve prompt fairness in pairwise comparisons, the authors use gpt-3.5 to optimize the prompt such that the preference for semantically equivalent prompts is ~0.5. They assessed the impact of their approach on summarization (SummEval, NewsRoom) and dialogue (TopicalChat) tasks. The LLM-evaluators were mistral-7b and llama-3-8b.
为了在成对比较中提升提示的公平性，作者运用 gpt-3.5 优化提示，使得对语义相同提示的偏好接近 0.5。他们对这种方法在摘要（SummEval，NewsRoom）和对话（TopicalChat）任务上的影响进行了评估。评估者是 mistral-7b 和 llama-3-8b。

Paraphrase the following instruction for a pairwise comparison task. Do not change the 
keyword [ASPECT]. Be diverse and creative in paraphrasing. Return the instruction only. 

Input: [INSTRUCTION] 
Output: [NEW_INSTRUCTION]

Results: Their approach improved Spearman’s $ρ$ with human judgment by an average of 17% on mistral-7b and 10% on llama-3-7b. However, despite the overall positive results, the correlation on SummEval (0.3) is a concern. Furthermore, for the metrics that I think matter the most—consistency and relevance on SummEval—the proposed approach performed worse than direct scoring (0.30 vs. 0.32 for consistency, 0.39 vs. 0.46 for relevance.) Similar to the previous paper, we see that the G-Eval approach performed worse than direct scoring across the board for llama-3-8b.
结果显示，他们的方法在 mistral-7b 上与人类判断相比，Spearman’s $ρ$ 的关联性平均提高了 17%，在 llama-3-7b 上提高了 10%。然而，尽管整体结果积极，但 SummEval（0.3）的关联性却是一个令人担忧的问题。此外，对于我认为最重要的指标—即 SummEval 上的连贯性和相关性—所提出的方法的表现却不如直接评分（连贯性为 0.30，而直接评分是 0.32；相关性为 0.39，而直接评分是 0.46）。与前一篇论文类似，我们发现 G-Eval 方法在 llama-3-8b 上全面表现不如直接评分。

UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor uses an LLM-evaluator to assess the relevance of search results. Given a query and a set of passages, UMbrela applies the DNA (descriptive, narrative, aspects) prompt to score each passage on a Likert scale of 0 to 3.
UMbrela 是一款开源复制的 Bing RELevance Assessor 工具，它采用LLM-评估员来评估搜索结果的相关性。对于给定的查询和一系列段落，UMbrela 运用 DNA（描述、叙事、方面）的提示，根据 0 到 3 的 Likert 量表对每个段落进行评分。

Given a query and a passage, you must provide a score on an integer scale of 0 to 3 with
the following meanings:

0 = represent that the passage has nothing to do with the query, 
1 = represents that the passage seems related to the query but does not answer it, 
2 = represents that the passage has some answer for the query, but the answer may be a 
bit unclear, or hidden amongst extraneous information and 
3 = represents that the passage is dedicated to the query and contains the exact answer.

Important Instruction: Assign category 1 if the passage is somewhat related to the 
topic but not completely, category 2 if passage presents something very important 
related to the entire topic but also has some extra information and category 3 if the 
passage only and entirely refers to the topic. If none of the above satisfies give it 
category 0. 

Query: {query}
Passage: {passage}

Split this problem into steps: 
Consider the underlying intent of the search. 
Measure how well the content matches a likely intent of the query (M). 
Measure how trustworthy the passage is (T). 
Consider the aspects above and the relative importance of each, and decide on a final 
score (O). Final score must be an integer value only.
Do not provide any code in result. Provide each score in the format of: ##final score: 
score without providing any reasoning.

To evaluate UMbrela, the researchers used existing human judgments from the TREC Deep Learning Track 2019 - 2023 as gold labels. These datasets contained topics, passages, and Likert scale labels ranging from 0 (irrelevant) to 3 (perfectly relevant).
为了评估 UMbrela，研究人员利用了 TREC 深度学习赛道 2019 至 2023 年期间现有的人工作出的判断作为标准答案。这些数据集包含了主题、段落以及从 0（完全不相关）到 3（完全相关）的 Likert 评分标签。

Results: Cohen’s $κ$ between human and LLM judgments showed fair agreement of 0.3 - 0.5, while Kendall’s $τ$ and Spearman’s $ρ$ was higher at 0.8 - 0.9. The discrepancy demonstrates how, as a metric, Cohen’s $κ$ is more conservative than Kendall and Spearman correlations.
结果显示，人类与LLM评判之间的 Cohen's $κ$ 一致性指标在 0.3 至 0.5 之间，表明存在适度的一致性；而 Kendall's $τ$ 和 Spearman's $ρ$ 一致性指标则高达 0.8 至 0.9。这种差异说明，作为衡量标准，Cohen's $κ$ 一致性指标相较于 Kendall 和 Spearman 相关性指标更为保守。

Diving deeper, the confusion matrix revealed that the LLMs were able to predict non-relevant labels with ~75% accuracy. However, accuracy dropped to 50% for relevant labels, 30% for highly relevant labels, and 45% for perfectly relevant labels.
进一步分析，混淆矩阵显示LLMs在预测非相关标签时，准确率约为 75%。然而，预测相关标签时准确率降至 50%，高度相关标签的准确率为 30%，而完全相关标签的准确率则为 45%。

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models proposes using a Panel of smaller LLMs (PoLL) to evaluate the quality of generated responses. Instead of using a single, stronger LLM-evaluator, PoLL uses an ensemble of three smaller LLM-evaluators (command-r, gpt-3.5-turbo, haiku) to independently score model outputs. The final evaluation is determined by max voting or average pooling of their individual scores. The goal was to address the high cost and intra-model bias associated with using a single LLM-evaluator.
用陪审团取代法官：通过由多元模型组成的小组评估 1001 代，建议使用由较小模型组成的小组（PoLL）来评估生成响应的质量。而不是使用单一、更强大的大型模型评估器，PoLL 采用了一个由三个较小的模型评估器（command-r，gpt-3.5-turbo，haiku）组成的集合，独立对模型输出进行评分。最终评估结果由他们的个人评分通过最大投票或平均池的方式确定。目标是解决使用单一大型模型评估器带来的高成本和模型内偏见问题。

The paper focused on the question-answering task across three settings: single-hop QA (Natural Questions, TriviaQA, HotpotQA), multi-hop QA (Bamboogle, HotpotQA), and chatbot arena (Chatbot Arena Hard). Reference judgments were collected via Cohere’s internal annotation workforce.
该论文着重于三个不同场景下的问答任务：单一推理问答（Natural Questions，TriviaQA，HotpotQA），多级推理问答（Bamboogle，HotpotQA），以及聊天机器人竞技场（Chatbot Arena Hard）。参考判断是通过 Cohere 的内部标注团队收集的。

The LLM-evaluators applied few-shot prompting and reference-based evaluation. The evaluator’s prompt contained few-shot, in-context examples of valid and invalid (question, answer, reference) triplets. They evaluated performance via Cohen’s $κ$ .
LLM名评估者采用了少量示例的提示方式和基于参考的评估方法。评估中，他们使用了少量在具体情境下的有效与无效（问题，答案，参考）三元组示例。通过 Cohen 的 $κ$ ，他们对性能进行了评估。

# Multihop Judge prompt
You will be given a Question and a Provided Answer. Judge whether the Provided Answer 
is correct by comparing it to the Reference Answer. Differently formatted dates, people 
with missing middle names, and alternative spellings should all be considered the same. 
If the Provided Answer is correct say exactly "True", otherwise say "False". 

Question 1: "When did the president who set the precedent of a two term limit leave 
office?"
Provided Answer: "George Washington set the precedent of a two-term limit when he 
decided not to seek a third term in 1796. He left office in 4 March, 1797."
Reference Answer: "March 4, 1797"
Correct: True

Question 2: "Where does  ́Sivar ̄ama Swami conduct courses on Vaishnava Theology?"
Provided Answer: " ́Sivar ̄ama Swami conducts courses on Vaishnava Theology at 
Bhaktivedanta Manor."
Reference Answer: "Where does  ́Sivar ̄ama Swami conduct courses on Vaishnava Theology?"
Correct: False 

...

Question 8: "{QUESTION}"
Provided Answer: "{GEN ANSWER}"
Reference Answer: "{GOLD ANSWER}"
Correct:

Results: Across the different settings and datasets, the PoLL approach achieved higher correlation with human judgments compared to using gpt-4 alone as the LLM-evaluator. Furthermore, the PoLL approach was one-seventh the cost of using gpt-4 as an evaluator.
结果显示，在各种不同的环境和数据集下，PoLL 方法与人类判断的相关性，比单纯使用 gpt-4 作为LLM评估者要高。而且，PoLL 方法的成本仅是使用 gpt-4 作为评估者的七分之一。

Surprisingly, gpt-4 (initially) performed much worse than the smaller models individually and was even outperformed by exact string matching on the Natural Questions dataset (on what’s essentially fuzzy string matching given that the LLM-evaluator was reference-based). They hypothesized that gpt-4 was over-reasoning and injecting too much background knowledge when determining the correctness of the answer, instead of simply comparing the gold reference to the response being evaluated.
令人惊讶的是，gpt-4（最初）的表现竟然比单个较小的模型还要差，在自然问题数据集上，其表现甚至不如精确字符串匹配（考虑到LLM-评估器是基于参考的，这实际上是一种模糊字符串匹配）。他们推测，gpt-4 可能过度推理，在判断答案正确性时注入了过多的背景知识，而不是简单地将黄金参考与正在评估的响应进行比较。

Thus, they conducted an ablation study and found that including an explicit instruction to “don’t overthink” was the most effective solution. The updates brought gpt-4’s performance to the level of gpt-3.5 but it remained below command-r and haiku.
因此，他们进行了一项消融研究，发现加入“不要过度思考”的明确指示，是最有效的解决办法。更新后，gpt-4 的表现达到了 gpt-3.5 的水平，但依然不及 command-r 和 haiku。

Aligning LLM-evaluators to our criteria
使LLM-评估人员符合我们的标准

After that overview of prompting techniques for LLM-evaluators, we next look at how to better align LLM-evaluators to our idiosyncratic criteria.
在概述了针对LLM-评估员的指导技巧之后，我们接下来将探讨如何使LLM-评估员更好地与我们独特的标准保持一致。

EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria introduces an interactive system that helps developers iteratively refine prompts by evaluating generated responses based on user-defined criteria. This is achieved with the assistance of an LLM-evaluator and a criteria reviewer (also an LLM).
EvalLM：在用户定义的准则下，对大型语言模型的提示进行交互式评估，介绍了一种交互式系统，该系统借助LLM评估器和准则评审员（同样是LLM），通过评估生成的响应，帮助开发者迭代优化提示。

First, the LLM-evaluator evaluates the response based on the criteria and provides explanations (essentially CoT). This helps users identify issues in the response as well as any misalignment between the LLM-evaluator’s interpretation of the criteria and their own understanding. Separately, a criteria reviewer assists in identifying potential improvements by refining, merging, and splitting criteria.
首先，LLM-评估员根据标准评估响应，并提供解释（即 CoT）。这有助于用户识别响应中的问题，以及LLM-评估员对标准的解读与他们自己理解之间的任何差异。此外，标准审核员通过细化、合并和拆分标准，协助识别可能的改进。

Their interface allows users to compose prompts and generate responses based on sampled input such as questions and context. Users can then define criteria that the LLM-evaluator uses to assign scores (out of 10) to each output. The authors tested several types of criteria:
他们的界面使用户能够编写提示，并根据诸如问题和上下文这样的采样输入生成响应。然后，用户可以定义标准，这些标准由LLM评估器用来给每个输出打分（满分 10 分）。作者测试了几种不同类型的标准：

Overall quality: Uses the prompt from LLM-as-a-Judge to compare a pair of outputs and select the one with higher quality.
总体质量：利用LLM-as-a-Judge 的提示，比较两组输出，选择质量更优的一项。
General criteria: Uses the general and broad criteria from FLASK. First, an LLM call selects the three most relevant criteria (out of 12) for a given request. Then, pairwise comparison is done based on each of the three criteria.
通用标准：采用 FLASK 中的通用及广泛标准。首先，通过LLM调用，为特定请求挑选出 12 个标准中最相关的三个。随后，基于这三个标准，进行两两对比分析。
Specific criteria: Starts with the same criteria as general criteria but automatically splits and refines them via criteria review. Pairwise comparison is then done to determine which output performs better on the fine-grained and specific criteria.
特定标准：起初与一般标准相同，但会通过标准审查过程自动进行拆分和细化。接下来，通过两两对比的方式，判断在更细致、更具体的评判标准下，哪个输出表现更佳。

Results: They found that specific criteria had the highest agreement and correlation with human annotators while general criteria had the lowest.
结果显示，具体标准与人工注释者的吻合度和相关性最高，而一般标准的吻合度和相关性则最低。

They also evaluated the explanations provided by the LLM-evaluator and found them mostly free of issues: 91.4% of the explanations were logical, 99.1% were faithful, 84.2% were independent (i.e., did not assess other criteria or aspects not described in the provided criteria), 100% provided relevant evidence, and 98.6% were aligned with the scores.
他们还评估了LLM评估员给出的解释，发现这些解释几乎没有什么问题：91.4%的解释逻辑清晰，99.1%忠实于原意，84.2%的解释是独立的（即，没有对提供的标准中未提及的其他标准或方面进行评估），100%提供了相关证据，98.6%的解释与评分相吻合。

The authors also conducted a user study to compare how EvalLM improves the prompt iteration process relative to the current baseline of manual evaluations. This was a within-subjects (i.e., before and after) study to compare EvalLM to the baseline They found that when using EvalLM, users:
作者还进行了一项用户研究，比较了 EvalLM 在提示迭代过程中的改进，相对于当前手动评估的基线。这是一项被试内（即，前后对比）研究，以评估 EvalLM 与基线的差异。研究发现，使用 EvalLM 时，用户：

Had higher self-confidence in their ability to evaluate (6.71 vs. 4.96, p < 0.001)
在评估能力方面，他们拥有更高的自信（6.71 vs. 4.96，p < 0.001）
Evaluated more unique output (20.42 vs. 10.08, p = 0.03)
评估了更多独特的输出（20.42 对比 10.08，p = 0.03），这表明差异具有统计学意义。
Felt EvalLM helped them think about the task better (6.83 vs. 5.67, p = 0.01)
感觉 EvalLM 帮助他们更好地思考任务（评分 6.83 对比 5.67，p = 0.01）
Felt that their criteria were clearer (6.42 vs. 4.92, p < 0.01)
觉得他们的标准更明确（6.42 vs. 4.92，p < 0.01）
Made more changes to their criteria (22.67 vs. 13.33, p = 0.04)
他们对标准进行了更多调整（22.67 对比 13.33，p = 0.04）
Had lower mental burden (3.92 vs. 5.58, p = 0.01) and effort (3.50 vs. 5.25, p = 0.08)
心理负担较低（3.92 vs. 5.58，p = 0.01）和付出的努力较少（3.50 vs. 5.25，p = 0.08）

We Need Structured Output: Towards User-centered Constraints on Large Language Model Output investigates the real-world scenarios, motivations, and user preferences for applying constraints on LLM-generated output. They propose a taxonomy of low-level and high-level constraints where the former ensures that the response meets a specific format (e.g., JSON, markdown, multiple-choice, length) while the latter involves semantic and stylistic guidelines (e.g., avoiding certain terms) as well as preventing hallucinations.
我们需要结构化的输出：《用户中心的大型语言模型输出约束研究》探讨了在实际场景中，对LLM生成的输出应用约束的动机和用户偏好。他们提出了一种分类，包括低级和高级约束，低级约束确保响应符合特定格式（如 JSON、markdown、多项选择、长度等），而高级约束则涉及语义和风格指导（如避免使用某些术语），以及防止产生幻觉。

To help users prototype, test, and apply constraints on LLM outputs, the authors developed a web-based graphical user interface (GUI). The GUI allows users to apply different types of output constraints by selecting from a list of available primitives, such as JSON objects, multiple choice, ordered lists, and text.
为了协助用户对LLM的输出进行原型设计、测试及应用限制，作者开发了一款基于网络的图形用户界面（GUI）。此界面使用户能够通过选择可用的基本元素列表（如 JSON 对象、多选题、有序列表和文本）来设定不同类型的输出限制。

Results: The study found that participants preferred using a GUI to specify low-level constraints but preferred using natural language to specify high-level constraints.
研究结果表明，参与者倾向于使用图形用户界面来设定低层次的约束条件，而倾向于使用自然语言来表达高层次的约束条件。

For the former, participants felt that choosing “boolean” as the output type in the GUI “felt more likely to be honored” compared to a natural language instruction requesting a yes or no response. They also shared that “flagging a JSON button” provides better user experience. Under the hood, these low-level constraints are converted into regular expressions (Figure 2-2d above) which the LLM respects during generation.
对于前者，参与者觉得在图形用户界面中选择“布尔”作为输出类型，比用自然语言指令要求回答是或否，更可能得到执行。他们还提到，“标记一个 JSON 按钮”能提供更佳的用户体验。在后台，这些低级别的限制会被转换成正则表达式（如上图 2-2d 所示），在LLM生成过程中会遵循这些规则。

In contrast, natural language was found to be easier for specifying complex constraints, especially those that couldn’t reasonably fit a GUI. This includes open-ended constraints such as “don’t include offensive words” or “respond in a cheerful manner”.
然而，研究发现对于设定复杂的限制条件，尤其是那些无法合理地在图形用户界面（GUI）上表达的条件，自然语言显得更加得心应手。这其中包括诸如“避免使用冒犯性词汇”或“以乐观愉快的态度回应”等开放式的限制条件。

In addition to these findings, the taxonomy of constraints (aka guardrails) in Table 1 above is a valuable resource on the pragmatic considerations builders have when developing LLM-powered products.
除了这些发现，表 1 中列出的约束（又称作护栏）的分类，是开发人员在开发LLM驱动产品时，考虑实际因素的宝贵资源。

Who Validates the Validators: Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences introduces EvalGen, an approach to align LLM-evaluators with human criteria. Given the generation (not evaluation) prompt and input-output pairs, EvalGen can infer and suggest criteria. Users can then modify these criteria or add new ones, specifying whether each criterion should be implemented as code (e.g., assert statements) or as an LLM-evaluator prompt.
谁来验证验证者：《将LLM辅助评估LLM输出与人类偏好对齐》介绍了 EvalGen，这是一种使LLM评估者与人类标准相匹配的方法。在给定生成（而非评估）提示和输入输出对的情况下，EvalGen 能够推断并提出标准。用户随后可以修改这些标准或添加新标准，具体指定每个标准应作为代码（例如，断言语句）还是作为LLM评估者提示来实施。

The authors assert that “it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs”, a phenomenon they call criteria drift. They observed that as users refine their criteria upon further grading, they sometimes go back to change previous grades. Thus, they propose that users need evaluation assistants to support rapid iteration over criteria and implementations simultaneously.
作者坚持认为，“在人类评估LLM输出之前，不可能完全确定评估标准”，他们将这种现象称为标准漂移。他们发现，用户在进一步的评分过程中细化标准时，有时会回过头去修改之前的评分。因此，他们建议用户需要评估助手，以支持对标准和实施的同步快速迭代。

Practically, what this means is that, instead of the typical evaluation pipeline where the evaluation loop is done with a (fixed) LLM-evaluator (Figure 1a below), they propose an inner loop where builders grade outputs and edit their criteria, which then helps them build faster and more reliably (Figure 1b below).
实际上，这意味着与传统的评估流程不同，传统的流程是使用固定的LLM评估器进行评估循环（如下图 1a 所示），他们提出了一种内部循环机制，让构建者评估输出并调整评估标准，这有助于他们更快、更可靠地构建（如下图 1b 所示）。

To evaluate EvalGen, the authors assessed its ability to generate assertions, both code and prompt-based, that classified defective responses. They tested EvalGen on Medical and Product tasks. For the former, the LLM should extract specific information without revealing personally identifiable information. For the latter, the LLM should craft SEO-friendly descriptions without negative reviews. The medical task had 84 samples, of which 68% passed (i.e., non-defects); the product tasks had 100 samples, of which 51% passed.
为了评估 EvalGen，作者对其生成断言的能力进行了评估，这些断言基于代码和提示，能够对有缺陷的响应进行分类。他们分别在医学和产品任务上对 EvalGen 进行了测试。对于医学任务，LLM应提取特定信息，同时不泄露任何个人可识别信息。对于产品任务，LLM应创建 SEO 友好的描述，同时避免包含负面评论。医学任务共有 84 个样本，其中 68%通过了测试（即，无缺陷）；产品任务有 100 个样本，其中 51%通过了测试。

They compared EvalGen to SPADE, a fully automated baseline. Defining defects as positives and non-defects as negatives, they evaluated EvalGen and SPADE on coverage (i.e., ability to fail outputs that the user thinks are bad aka recall of defects) and false failure rate (FFR; ability to not fail outputs that the user thinks are good aka 1 - precision of defects).
他们将 EvalGen 与 SPADE 进行了比较，后者是一个全自动化基线。他们将缺陷定义为正例，非缺陷定义为负例，从两个方面评估了 EvalGen 和 SPADE：覆盖率（即，能够识别出用户认为不好的输出，也就是缺陷的召回率）和错误失败率（FFR；即，能够正确放过用户认为好的输出，也就是 1 - 缺陷的精度）。

Results: Compared to SPADE, EvalGen had better performance on the product task, achieving 0.73 recall of defects while SPADE had 0.49 recall. Furthermore, EvalGen required fewer assertion statements. Both approaches had identical false positive rates (0.1 on medical and 0.39 on product).
结果显示，与 SPADE 相比，EvalGen 在产品任务上的表现更出色，缺陷召回率高达 0.73，而 SPADE 的召回率仅为 0.49。此外，EvalGen 所需的断言语句更少。在误报率方面，两种方法的表现相同，医疗任务的误报率为 0.1，产品任务的误报率为 0.39。

The authors also conducted a user study with nine practitioners. Notable findings include:
作者还对九位实践者进行了一项用户研究。值得注意的发现包括：

Grading outputs first helps with refining initial criteria, with one participant going as far as saying “you should enforce that we look at at least 20 examples first”.
首先评估输出确实有助于我们完善最初的评估标准，甚至有参与者建议：“应该要求我们先浏览至少 20 个案例。”
Users were happy to grade outputs while waiting for the LLM-evaluator to evaluate responses. These graded responses could then be used to evaluate LLM-evaluator.
用户在等待LLM-评估器处理响应的空档，乐于对输出结果进行评分。这些已评分的响应随后可用于评估LLM-评估器。
Users added new criteria when they observed new types of bad responses, reinforcing the idea that examining responses helps craft and improve on criteria.
用户在观察到新的不良反应类型时，会增加新的评判标准，这进一步印证了检查反馈对于制定和优化评判标准的重要性。
LLM-evaluators were harder to trust compared to code-based assertions, possibly because users could edit the code-based assertions.
LLM的评估者相较于代码断言更难以让人信服，可能是因为用户能够对代码断言进行编辑。

Finetuning LLM-evaluator models
调整LLM评估者模型以进行优化

If you’ve worked on aligning LLM-evaluators to your evaluation criteria, you’ll know that it can be be challenging achieve high recall and precision, consistently. One alternative, albeit an expensive one, is to finetune LLM-evaluator models.
如果你已经尝试将LLM-评估者与你的评估标准对齐，你就会知道持续实现高召回率和精确度是相当有挑战性的。一个替代方案，尽管成本高昂，是微调LLM-评估者模型，以提高其性能。

Shepherd: A Critic for Language Model Generation is an LLM-evaluator (based on llama-2-7b-chat) that’s finetuned to critique model responses and suggest refinements. It’s finetuned on a feedback dataset consisting of community critique and human annotations.
"牧羊人"：这是一个基于 llama-2-7b-chat 的语言模型生成的批评者，作为LLM-评估器，经过专门训练，能够对模型的响应进行批评并提出改进建议。它在一个包含社区反馈和人工注释的反馈数据集上进行了微调，以提升其批评和建议的准确性。

For the community critique, they used data from StackExchange (173 dedicated Q&A sites) and Reddit (data from 15 selected subreddits). The data was formatted as (question, answer, critique) triplets.
在社区评论中，他们利用了 StackExchange（包含 173 个专业问答站点）以及 Reddit（选取了 15 个特定子论坛）的数据。这些数据被整理成了（问题，答案，评论）的三部曲形式。

For human annotation, they selected ten language understanding, entailment, and summarization datasets that require complex understanding. These were: Entailment Bank (deductive reasoning), Proofwriter (logical reasoning), GSM8k (arithmetic reasoning), PIQA (physical reasoning), CosmosQA (commonsense reasoning), ECQA (commonsense reasoning), e-SNLI (deductive and commonsense reasoning), Adversarial NLI (adversarial entailment), GPT-3 summarization, and DeFacto (factual consistency). For each question, they provide a context, a correct output, and a candidate output, and ask annotators to give feedback on whether there were any errors in the candidate output. Human annotation cost $8 per sample. After post-processing, they ended up with 1,317 samples.
为了进行人工注解，他们挑选了十个需要复杂理解力的语言理解、蕴含关系以及摘要数据集，这些数据集要求深度理解，包括：Entailment Bank（演绎推理）、Proofwriter（逻辑推理）、GSM8k（算术推理）、PIQA（物理推理）、CosmosQA（常识推理）、ECQA（常识推理）、e-SNLI（演绎推理与常识推理）、Adversarial NLI（对抗性蕴含）、GPT-3 摘要生成以及 DeFacto（事实一致性）。对于每个问题，他们提供了一个背景信息、一个正确答案和一个候选答案，要求注解者判断候选答案是否存在错误。人工注解的费用为每例 8 美元。经过后期处理，他们最终收集到了 1,317 个样本。

To evaluate Shepard, they used six public datasets covering a range of topics and skills such as commonsense, physical, and math reasoning: CommonSenseQA, Alpaca-Farm, OBQA, PIQA, FairEval, and TruthfulQA. They sampled 50 instances from the validation/test split of each dataset, resulting in 300 instances in the final evaluation set. To address concerns around data contamination, they developed a new test set (CritiqueEval) which contains 52 Reddit questions posted from June 2022 to June 2023. This period is past ChatGPT’s knowledge cutoff during the study.
为了评估 Shepard，他们运用了六个公开数据集，覆盖了常识、物理和数学推理等多样主题和技能，包括 CommonSenseQA、Alpaca-Farm、OBQA、PIQA、FairEval 和 TruthfulQA。他们从每个数据集的验证/测试部分中选取了 50 个样本，最终形成了包含 300 个样本的评估集合。为了解决数据污染的问题，他们创建了一个新的测试集（CritiqueEval），该集包含了 52 个在 2022 年 6 月至 2023 年 6 月期间于 Reddit 上发布的问题。这一时间段超出了 ChatGPT 在研究期间的知识截止日期。

Baseline models include ChatGPT (unspecified but likely gpt-3.5-turbo), alpaca-7b (llama-7b finetuned on 52k instruction-following data from ChatGPT), and SelFee (llama-7b finetuned for self-feedback and self-revision generation). These LLM-evaluators were in turn evaluated via gpt-4—it’s LLMs all the way down 🐢—which graded each feedback on a 1 - 7 Likert scale based on whether the feedback could point out errors in the answer, or confirm the answer is correct when there are no errors.
基准模型包括了 ChatGPT（虽然未具体说明，但可能是 gpt-3.5-turbo），alpaca-7b（这是在 llama-7b 基础上，使用 ChatGPT 的 52k 指令跟随数据进行微调的版本），以及 SelFee（这是 llama-7b 经过自我反馈和自我修订生成的微调版本）。这些LLM-评估者随后通过 gpt-4 进行了一次“层层深入”的评估——就像LLMs一样，一直深入到最底层——评估标准是根据反馈是否能准确指出答案中的错误，或者在没有错误的情况下确认答案的正确性，在 1-7 的 Likert 量表上进行评分。

Results: When asking gpt-4 and human evaluators to pick the better feedback given two candidate feedback, Shepard outperformed alpaca-7b and SelFee while achieving parity with ChatGPT in generating helpful feedback and critique. It also consistently generated better feedback on CritiqueEval.
结果显示，当让 GPT-4 和人类评估者从两个候选反馈中选择更优秀的反馈时，Shepard 的表现超越了 Alpaca-7B 和 SelFee，同时在提供有益反馈和批评方面，与 ChatGPT 达到了同等水平。在 CritiqueEval 的评估中，Shepard 持续展现出更高质量的反馈生成能力。

However, when evaluating critiques on a Likert scale (from 1 - 7) via gpt-4, the gpt-4 and human evaluations conflicted. For example, gpt-4 gave alpaca-7b an average score of 4.7 while human annotators gave it an average score of 2.9. The paper also found that gpt-4 favored responses that provided more examples. Overall, this suggests that gpt-4 as an evaluator has biases such as a bias towards giving higher scores and verbosity bias.
然而，在使用 Likert 量表（1 至 7 分）通过 gpt-4 评估批评时，gpt-4 与人类评估者的评分出现了分歧。例如，gpt-4 给 alpaca-7b 的平均评分是 4.7，而人类评估者的平均评分仅为 2.9。研究还发现，gpt-4 倾向于偏好提供更多信息的例子。总体而言，这表明作为评估者的 gpt-4 存在偏见，比如倾向于给出较高的评分，以及存在冗长偏见。

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer uses a pretrained evaluator model that can score and rank the output of diverse instructions to improve result quality. Cappy focuses on well-defined language modeling tasks that have more straightforward evaluation approaches, such as accuracy and ROUGE. Such tasks include language identification, common sense reasoning, logical reasoning, and more.
Cappy：通过小型评分器超越并增强大型多任务语言模型，利用预训练的评估模型对各种指令的输出进行评分和排名，从而提升结果质量。Cappy 专注于那些有明确定义、评估方法更直接（如准确率和 ROUGE）的语言建模任务，如语言识别、常识推理、逻辑推理等。

Cappy is a RoBERTa-based model (360M parameters) with a linear layer as a regression head. Its input is an instruction-response pair and its output is a 0.0 to 1.0 scalar score. The score estimates the correctness of the response based on the instruction. Thus, given an input instruction and candidate response, Cappy evaluates and scores the response.
Cappy 是一款基于 RoBERTa 的模型（拥有 360M 参数），配备了一个线性层作为回归头。它接收指令-响应对作为输入，输出一个介于 0.0 到 1.0 之间的标量分数。这个分数用来评估响应对指令的正确程度。因此，当输入指令和候选响应时，Cappy 会对其进行评估并给出评分。

Cappy is trained on 39 diverse datasets from PromptSource which includes tasks such as question answering, sentiment analysis, summarization, etc. The instruction-response pairs from PromptSource are given a score of 1.0 while deliberately mismatched pairs are assigned a score of 0.0. To augment the data, bart0 and t0-3b were used to generate candidate responses and scores are assigned based on ROUGE-L. Overall, they collected a pretraining dataset of 160 million examples.
Cappy 在来自 PromptSource 的 39 个多样化数据集上进行了训练，这些数据集涵盖了问答、情感分析、摘要等任务。PromptSource 中的指令-响应对被赋予了 1.0 的评分，而故意错配的对则被分配了 0.0 的评分。为了扩充数据，bart0 和 t0-3b 被用来生成候选响应，评分则基于 ROUGE-L 进行。总的来说，他们收集了一个包含 1.6 亿个示例的预训练数据集。

They evaluate Cappy on 11 held-out language understanding tasks from PromptSource, all of which are classification tasks that Cappy can function independently on (i.e., doesn’t need an upstream LLM to generate a response). They also apply Cappy on top of flan-t5 for 45 generation tasks in BIG-Bench, with Cappy scoring 17 candidate outputs from flan-t5.
他们使用 PromptSource 中 11 个未公开的语言理解任务来评估 Cappy，这些任务全都是分类任务，Cappy 可以独立完成，无需上游LLM协助生成响应。此外，他们将 Cappy 叠加在 flan-t5 上，处理 BIG-Bench 中的 45 个生成任务，其中 Cappy 对 flan-t5 产生的 17 个候选答案进行了评分。

Results: On the 11 classification tasks, Cappy outperforms much larger multi-task LLMs like opt-175b and is close to the performance of t0-11b (left). For the 45 tasks in BIG-Bench, Cappy consistently boosts the performance of flan-t5 by a large margin, suggesting that it can score and select better output. Nonetheless, while the results suggest that Cappy is capable of scoring and ranking output to select the best one, it’s unclear if Cappy is viable as an LLM-evaluator that can discriminate and exclude bad output.
结果显示，在 11 个分类任务中，Cappy 的表现超越了庞大的多任务模型LLMs，如 opt-175b，并且其性能接近 t0-11b（如左图所示）。在 BIG-Bench 的 45 个任务中，Cappy 显著提升了 flan-t5 的性能，这表明它能够有效地对输出进行评分，从而选择更优的结果。然而，尽管 Cappy 在输出评分和排名方面表现出色，能够挑选出最佳输出，但其作为LLM评估器的可行性仍有待验证，即它是否能有效区分并剔除质量不佳的输出。

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models is a finetuned evaluator (based on llama-2-chat) that performs fine-grained evaluation of text responses based on user-defined score rubrics. Prometheus takes as input the instructions, score rubric, response to evaluate, and a gold reference answer, making it a referenced-based evaluator. Then, it scores the response to evaluate and also returns text feedback.
Prometheus：通过诱导语言模型具备精细的评估能力，这是一种基于 llama-2-chat 的微调评估模型，能够根据用户自定义的评分标准对文本回复进行精细评估。Prometheus 接收指令、评分标准、待评估的回复以及标准答案作为输入，因此它是一种参考型评估器。接着，它会为待评估的回复打分，并提供文本反馈。

To finetune Prometheus, the authors built the Feedback Collection Dataset which contains 1,000 fine-grained score rubrics, 20k instructions, and 100k example responses and feedback generated by gpt-4. First, they wrote 50 seed rubrics. Then, they used gpt-4 to expand the seed rubrics to a more robust and diverse set of 1,000 rubrics. Next, they prompted gpt-4 to generate 20 instructions for each rubric. Finally, they prompted gpt-4 to generate five responses and feedback for each instruction. The researchers then finetuned llama-2-chat (7b and 13b variants) to sequentially generate the feedback and then the score, similar to CoT reasoning followed by the final response.
为了对 Prometheus 进行微调，作者构建了一个反馈收集数据集，其中包含 1000 个精细的评分标准，20,000 条指令，以及由 gpt-4 生成的 100,000 个示例响应和反馈。首先，他们编写了 50 个基础评分标准。然后，他们利用 gpt-4 将这些基础标准扩展成一个更全面、更多样化的 1000 个评分标准集合。接下来，他们让 gpt-4 为每个评分标准生成 20 条指令。最后，他们让 gpt-4 为每条指令生成 5 个响应和反馈。研究人员随后对 llama-2-chat（包括 7b 和 13b 版本）进行了微调，使其能够按顺序生成反馈和评分，这一过程类似于先进行 CoT 推理，然后给出最终的响应。

To evaluate Prometheus, the authors compared it to human evaluation and gpt-4 evaluation as a baseline, measuring Prometheus’ correlation with both. They also conducted human evaluation to assess the quality of the feedback via pairwise comparisons. The evaluation was performed on Feedback Bench (generated via the same approach as Feedback Collection), Vicuna Bench, MT Bench, and FLASK Eval.
为了评估 Prometheus，作者将其与人类评估和 gpt-4 评估作为基线进行比较，测量 Prometheus 与两者的相关性。他们还通过成对比较的方式进行了人类评估，以评估反馈的质量。这一评估是在 Feedback Bench（通过与 Feedback Collection 相同的方法生成）、Vicuna Bench、MT Bench 和 FLASK Eval 上进行的。

Results: For correlation with human judgments, they used 45 instances from Feedback Bench. On this dataset, Prometheus achieved 0.897 Pearson correlation while gpt-4 has 0.882 correlation and gpt-3.5-turbo has 0.392 correlation. In addition, via pairwise comparisons by humans, Prometheus is preferred over gpt-4 58.6% of the time, and preferred over gpt-3.5-turbo 79.6% of the time. For correlation with gpt-4, Prometheus has a higher correlation than even gpt-4 itself on Feedback Bench. Nonetheless, it lags behind gpt-4 on Vicuna Bench, MT Bench, and FLASK Eval.
结果显示，为了与人类的判断进行比较，他们使用了来自 Feedback Bench 的 45 个案例。在这个数据集上，Prometheus 的皮尔逊相关系数达到了 0.897，而 gpt-4 的相关系数为 0.882，gpt-3.5-turbo 的相关系数仅为 0.392。此外，通过人类的成对比较，Prometheus 在 58.6%的情况下被认为优于 gpt-4，而在 79.6%的情况下被认为优于 gpt-3.5-turbo。在与 gpt-4 的相关性上，Prometheus 在 Feedback Bench 上的表现甚至超过了 gpt-4 本身。然而，在 Vicuna Bench，MT Bench 和 FLASK Eval 上，Prometheus 的表现仍然落后于 gpt-4。

In an ablation study, they showed that excluding the reference answer leads to the greatest performance degradation, in contrast to excluding the scoring rubric or the feedback distillation process (Table 6 below). This suggests that the model may be learning to perform some form of fuzzy matching on the reference, and that it may still still a ways to go before it can do reference-free evaluation.
在一项消融研究中，他们发现排除参考答案会导致性能显著下降，相比之下，排除评分标准或反馈蒸馏过程的影响较小（见下表 6）。这表明模型可能正在学习如何在参考答案上进行模糊匹配，而且在实现无需参考答案的评估方面，还有很长的路要走。

LLM Critics Help Catch LLM Bugs introduces CriticGPT, a finetuned LLM-evaluator designed to critique and find bugs in code generated by other LLMs. CriticGPT takes as input a (question, code answer) pair and returns a critique that points out potential problems in the answer.
LLM 评论家协助捕捉LLM 虫子，介绍 CriticGPT，这是一种经过微调的LLM评估工具，专门用于审查并找出其他LLMs生成的代码中的错误。CriticGPT 接收(问题，代码解答)对作为输入，然后返回一个批评，指出解答中可能存在的问题。

Training data for CriticGPT comes from the OpenAI RLHF pipeline. The authors selected samples where responses contained at least 50% Python code by line count. Part of this data organically contains bugs that humans had previously detected and gave a low score.
CriticGPT 的训练数据源自 OpenAI 的 RLHF 流程。作者挑选了那些响应中至少有 50%的行数为 Python 代码的样本。这部分数据自然而然地包含了之前已被人类发现并评价为低质量的 bug。

In addition to the organic bugs, they hired contractors to insert subtle bugs (“tampering”). Bugs inserted via tampering were generally harder to catch and more severe than average, and were not from the natural distribution of model errors. Tampering was done adversarially, where contractors had access to CriticGPT and tried to introduce bugs that CriticGPT missed in at least one out of three tries. After introducing the bug via tampering, the same contractor then proceeded to compare and rate critiques of the tampered answer.
除了自然出现的错误，他们还特意聘请了承包商来植入一些微妙的错误，即所谓的“篡改”。通过这种手段植入的错误往往比一般错误更难被发现，且后果更严重，这些错误并不属于模型自然错误的范畴。篡改行为是出于敌对目的，承包商可以访问 CriticGPT 系统，他们尝试植入 CriticGPT 在三次检测中至少有一次未能识别的错误。在通过篡改手段植入错误后，同一承包商会进一步比较和评估对篡改答案的批评。

Results: CriticGPT was able to catch substantially more inserted bugs (80 - 85%) than human contractors (right below). Furthermore, CriticGPT’s critiques were more preferred than human critiques (left below). (ChatGPT too, though CriticGPT’s critiques were preferred by a larger margin.) In addition, humans assisted by CriticGPT caught more bugs than humans alone, demonstrating the impact of AI augmentation.
结果显示，CriticGPT 在捕捉插入的错误方面远超人类承包商（80-85%），而人类承包商的表现则略逊一筹（见下文）。此外，CriticGPT 的批评比人类的批评更受欢迎（见下文左侧）。尽管 ChatGPT 也表现出色，但 CriticGPT 的批评在更大程度上受到青睐。更重要的是，由 CriticGPT 辅助的人类比单独的人类捕捉到更多的错误，这充分证明了人工智能增强的显著效果。

Nonetheless, CriticGPT had more nitpicks and hallucinations compared to human critiques though it was significantly lower than ChatGPT. This suggests that while CriticGPT may have higher recall in detecting bugs, it comes with a trade-off in precision in the form of nitpicks and hallucinations.
然而，尽管 CriticGPT 相较于 ChatGPT 显著减少了问题，它在与人类批评的比较中，还是展现出了更多的小毛病和幻觉。这表明，虽然 CriticGPT 在检测错误方面可能更全面，但其精确度却因过多的小毛病和幻觉而打了折扣。

Critiques against and support for LLM-evaluators
针对LLM-评估者的批评与支持

After that whirlwind tour of LLM-evaluators for various use cases, evaluation prompting techniques, alignment workflows, and finetuning LLM-evaluator models, we now review critiques against and support for LLM-evaluators.
经历了针对各种应用场景的LLM-评估器、评估提示技巧、对齐工作流程以及微调LLM-评估器模型的快速巡览之后，我们现在来回顾一下对LLM-评估器的批评与支持。

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena evaluates the performance of strong LLMs, such as gpt-4, on evaluating chatbot responses to open-ended questions.
通过 MT-Bench 和 Chatbot Arena 对LLM-as-a-Judge 进行评估，评估了像 gpt-4 这样的强大模型在评价聊天机器人对开放式问题回答时的性能。

The authors introduced two new benchmarks. MT-Bench is a dataset of 80 multi-trun questions across eight categories such as writing, math, and knowledge. LMSys Chatbot Arena is a platform where users interact with pairs of anonymous chatbots and vote for their preferred response. LLM-evaluators evaluated chatbot responses from both benchmarks via direct scoring, pairwise comparison, and reference-based evaluation.
作者引入了两个新的基准测试。MT-Bench 是一个包含 80 个多轮问题的数据集，涵盖了写作、数学、知识等八个领域。LMSys Chatbot Arena 是一个平台，用户可以与匿名聊天机器人的配对进行互动，并为他们更喜欢的回应投票。LLM名评估者通过直接评分、两两比较和参考基准评估的方式，对来自这两个基准测试的聊天机器人回应进行了评估。

On MT-Bench, the authors generated answers via six models and collected 3k judgments from 58 expert-level human judges. The goal was to measure LLM-evaluator agreement with human experts. For Chatbot Arena, they sampled 3k single-turn votes from 30k arena data points. They had a custom agreement metric, defined as the probability of randomly selected individuals of each type agreeing on a randomly selected question.
在 MT-Bench 上，作者利用六个模型生成答案，并从 58 位专家级真人裁判那里收集了 3000 条评判。其目标是衡量LLM-评估者与真人专家的一致性。对于 Chatbot Arena，他们从 30000 个竞技场数据点中抽取了 3000 个单轮投票。他们设定了一个自定义的一致性指标，即每种类型随机选取的个体在随机选取的问题上达成一致的概率。

Results: On MT-Bench, gpt-4 with direct scoring and pairwise comparison had high agreement with human experts. In a setup (S2) that excluded ties, the gpt-4 to human agreement was 85% which exceeded the human-human agreement of 81%. Furthermore, when shown gpt-4 judgments, humans found those judgments reasonable 75% of the time and were even willing to change their choices a third of the time.
结果显示，在 MT-Bench 上，gpt-4 在直接评分和成对比较的情况下，与人类专家的共识度非常高。在一种不考虑平局的设定（S2）下，gpt-4 与人类的共识度达到了 85%，这超过了人类之间的 81%共识度。更进一步的是，当人类看到 gpt-4 的评判时，他们认为这些评判在 75%的情况下是合理的，甚至有三分之一的时间，他们愿意改变自己的选择。

On Chatbot Arena, similar results were achieved between gpt-4, gpt-3.5, and claude-v1, and human ratings, with an agreement of between 83% - 87%. Nonetheless, this agreement could be high because the agreement metric doesn’t account for agreement due to random chance, unlike Cohen’s $κ$ which does.
在 Chatbot Arena 上，gpt-4、gpt-3.5 和 claude-v1 之间的结果与人类评分相似，一致性在 83%到 87%之间。然而，这种高一致性可能是因为一致性指标没有考虑到由于随机机会造成的一致性，而 Cohen 的 $κ$ 却考虑到了这一点。

They also identified some biases of LLM-evaluators. First, position bias. During pairwise comparisons, LLM-evaluators tend to prefer the response in one position over others. Most LLM-evaluators preferred the first position, with gpt-3.5 being biased 50% of the time and claude-v1 being biased 70% of the time (Table 2 below).
他们还辨识出LLM评估者的一些偏见。首先，位置偏见。在两两比较时，LLM评估者往往倾向于偏爱某个位置的回应。大多数LLM评估者偏好第一个位置，其中 gpt-3.5 有 50%的时间表现出偏见，而 claude-v1 则有 70%的时间表现出偏见（见下表 2）。

Second, verbosity bias, where LLM-evaluators favor longer, more verbose responses, even if they’re not as clear, high-quality, or accurate as shorter alternatives. To generate these verbose distractors, the authors had gpt-4 rephrase some MT-Bench answers without adding new information and concatenated them to the original answers. Both claude-v1 and gpt-3.5 preferred the longer response more than 90% of the time (Table 3 above).
其次，存在冗长偏见，即LLM-评估者倾向于选择更长、更冗长的回应，即使这些回应可能不如简短的替代方案更清晰、高质量或准确。为了生成这些冗长的干扰选项，作者让 gpt-4 在不添加新信息的情况下，重新表述了一些 MT-Bench 的答案，并将这些重述与原始答案拼接在一起。如上表 3 所示，claude-v1 和 gpt-3.5 超过 90%的情况下更倾向于选择较长的回应。

Finally, self-enhancement bias, where LLM-evaluators preferred answers generated by themselves. The authors compared the win rate of six models evaluated by LLM-evaluators and humans. Gpt-4 favored itself with a 10% higher win rate while claude-v1 favored itself with a 25% higher win rate.
最后，存在自我增强偏见，即LLM-评估者更倾向于选择自己生成的答案。作者对比了六种模型在LLM-评估者和人类评估下的胜率。Gpt-4 在自我评估时，胜率高出 10%，而 claude-v1 在自我评估时，胜率高出 25%。

On the Limitations of Fine-tuned Judge Models for LLM Evaluation compares four finetuned LLM-evaluators (JudgeLM, PandaLM, Auto-J, and Prometheus) to gpt-4 across various benchmarks. These models were trained on their respective datasets such as dolly-15k, alpaca-52k, and gpt-4 synthetic data.
在《1001》评估的局限性上，对微调的法官模型进行比较，比较了四个微调的LLM评估者（JudgeLM，PandaLM，Auto-J 和 Prometheus）与 gpt-4 在各种基准上的表现。这些模型是在各自的训练集上训练的，如 dolly-15k，alpaca-52k 和 gpt-4 合成数据。在对这些模型进行评估时，我们发现它们在某些任务上的表现与 gpt-4 相当，但在其他任务上则表现不佳。这表明，尽管微调可以提高模型在特定任务上的性能，但它也可能导致模型在其他任务上的泛化能力下降。

These finetuned LLM-evaluators perform either pairwise comparison or direct scoring.
这些微调过的LLM-评估器，进行的是两两比较或是直接评分。

To assess LLM-evaluator performance on specific aspects, they used these datasets:
为了评估LLM评估员在特定方面的表现，他们使用了这些数据集

LLMBar for evaluating fairness. The dataset contains paired output with a correct answer and an incorrect answer that had better superficial quality.
用于评估公平性的 LLMBar。该数据集包含配对的输出，每个输出都附有正确答案和错误答案，尽管错误答案在表面上看起来质量更高。
HaluEval for factuality evaluation in QA, summarization, and dialogue.
HaluEval 用于问答、摘要和对话中的事实性评估，以提高准确性。
ToxiChat for toxicity evaluation based on conversations between humans and AI.
通过人类与 AI 的对话来评估毒性的 ToxiChat。
SALAD-Bench for safety evaluations on instructions and responses.
SALAD-Bench 用于对指令和响应进行安全性评估。

Results: They show that the finetuned LLM-evaluators essentially functioned as task-specific classifiers. To demonstrate this, the authors trained several LLM-evaluators, including Vicuna-generation, Vicuna-classification, and DeBERTa-classification. They found that the DeBERTa-classification evaluator performed similarly to the Vicuna models in terms of accuracy. Furthermore, these finetuned LLM-evaluators had higher correlation amongst themselves than with gpt-4. Taken together with further findings below, this suggests that the finetuned LLM-evaluators were inherently task-specific classifiers.
结果显示，经过微调的LLM-评估器实际上作为特定任务的分类器运行。为了证明这一点，作者训练了几种LLM-评估器，包括 Vicuna-generation，Vicuna-classification 和 DeBERTa-classification。他们发现 DeBERTa-classification 评估器在准确性上与 Vicuna 模型的表现相当。此外，这些经过微调的LLM-评估器之间的相关性高于与 gpt-4 的相关性。结合以下的进一步发现，这表明经过微调的LLM-评估器本质上是特定任务的分类器。

Interestingly, the results also showed that the Vicuna-generation model consistently outperformed the Vicuna-classification model, indicating that an LLM-evaluator with a next-token prediction objective can outperform one with a classification objective. (My prior was that a classification objective was simpler to learn, making it more data efficient and thus more accurate.)
有趣的是，结果还表明，Vicuna 生成模型始终优于 Vicuna 分类模型，这表明具有LLM评估器和下一个令牌预测目标的模型可以优于具有分类目标的模型。（我之前认为分类目标更简单，学习起来更有效，因此更准确。）

They also found that although finetuned LLM-evaluators achieved high performance on in-domain test sets, even surpassing gpt-4, they underperformed gpt-4 in dimensions such as generalizability, fairness, and aspect-specific evaluation. Thus, while finetuned LLM-evaluators performed best on their trained evaluation schemes (e.g., PandaLM or JudgeLM for pairwise comparisons), applying them to a different scheme (e.g., direct scoring) led to a catastrophic performance drop. This did not occur for gpt-3.5 or gpt-4.
他们还发现，尽管微调的LLM-评估器在领域内测试集上表现优异，甚至超越了 gpt-4，但在通用性、公平性和特定方面的评估上，它们却不如 gpt-4。因此，虽然微调的LLM-评估器在训练的评估方案（如 PandaLM 或 JudgeLM 进行的成对比较）上表现最佳，但当应用于不同的评估方案（如直接评分）时，性能会急剧下降。这种情况并未在 gpt-3.5 或 gpt-4 上出现。

Similarly, on evaluation datasets for fairness (LLMBar), the finetuned LLM-evaluators performed worse than random guessing, suggesting that they were biased (or perhaps overfitted) on superficial quality. The finetuned evaluators also performed poorly on factuality, toxicity, and safety evaluation.
同样地，在用于评估公平性的数据集（LLMBar）上，经过微调的LLM-评估器的表现甚至不如随机猜测，这表明它们可能对表面质量存在偏见（或者过度拟合）。此外，这些微调的评估器在事实准确性、毒性判断和安全性评估方面也表现得相当差。

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists introduces a framework and dataset to examine the proficiency of LLM-evaluators in evaluating four tasks: coherence in long-form writing (LF), factuality (F), instruction following (IF), and reasoning proficiency (R).
通过可解释的检查列表在评估者LLMs中发现盲点，介绍了一个框架和数据集，用于考察LLM-评估者在评估四个任务方面的能力：长篇写作的连贯性（LF），事实准确性（F），遵循指令（IF）和推理能力（R）。

For the dataset, they had correct answers and added perturbed answers targeted at the four tasks. To create the dataset, the authors selected 100 questions for each task category, sampling them from a mix of six test sets (WizardLM, MT-Bench, UltraChat, LIMA, LLMBar, and IFEval), as well as GSM8k and MATH, for a total of 400 questions. They also created 200 prompts tailored to instruction-following to test specific perturbation categories. The gold and perturbed answers were generated by gpt-4-turbo. 25% of this data was manually reviewed to ensure that the gold answers had a high level of correctness and that the perturbed answers should result in a scoring penalty.
针对数据集，他们准备了正确答案，并针对四大任务添加了扰动答案。为了构建数据集，作者从六个测试集（WizardLM、MT-Bench、UltraChat、LIMA、LLMBar 和 IFEval）以及 GSM8k 和 MATH 中，为每个任务类别挑选了 100 个问题，总共 400 个问题。他们还设计了 200 个专门针对指令跟随的提示，用于测试特定的扰动类别。这些黄金答案和扰动答案均通过 gpt-4-turbo 生成。其中 25%的数据经过人工复核，确保黄金答案的高准确度，同时确认扰动答案应受到评分惩罚。

They then assessed whether five LLM-evaluators could detect the quality drops (in perturbed answers). The models were gpt-4-turbo, gemini-1.5-pro, claude-3-opus, llama-3-70b-instruct, and prometheus-2. These LLM-evaluators assessed output via direct scoring, pairwise comparison, and reference-based evaluation.
接着，他们评估了五位LLM-评估者是否能察觉到质量下滑（在扰动答案中）。这些模型包括 gpt-4-turbo，gemini-1.5-pro，claude-3-opus，llama-3-70b-instruct，以及 prometheus-2。这五位LLM-评估者通过直接评分、成对比较和参考基准评估的方式对输出进行了评估。

Results: The overall best model (gpt-4-turbo) failed to assign lower scores to perturbed answers more than 50% of the time on LF, F, and IF, and more than 20% of the time on R (left). Furthermore, on direct scoring, simpler strategies such as direct scoring with CoT outperformed more advanced strategies that involved rules and rubrics. The other LLM-evaluators performed worse than gpt-4-turbo.
结果显示，在 LF、F 和 IF 上，总体表现最佳的模型（gpt-4-turbo）未能在超过 50%的情况下，对扰动答案给出更低的评分；在 R 上，这一比例超过 20%（见左图）。此外，在直接评分方面，更简单的策略，如使用 CoT 进行直接评分，其效果优于那些涉及规则和评分标准的更复杂策略。而其他LLM-评估者的表现，甚至不如 gpt-4-turbo。

LLMs instead of Human Judges: A Large Scale Empirical Study across 20 NLP Evaluation Tasks evaluates 11 LLM-evaluators to replicate human judgment across 20 language tasks. These include general tasks such as reasoning, instruction following, and toxicity detection, as well as downstream tasks such as summarization, translation, and dialogue.
LLMs而非人类裁判：一项大规模的实证研究，横跨 20 个 NLP 评估任务，评估了 11 个LLM-评估者，以重现人类的判断。这些任务涵盖了推理、指令遵循、毒性检测等一般性任务，以及摘要生成、翻译和对话等下游任务。

The authors selected 11 widely used models that had high performance across several tasks on the Open LLM and Chatbot Arena leaderboards. These include gpt-4, gemini-1.5, command-r, command-r+, llama-3-8b, llama-3-70b, mistral, mixtral-8x7b, mixtral-8x22b, olmo, and starling.
作者挑选了 11 个在 Open LLM和聊天机器人竞技场排行榜上多个任务中表现突出的广泛应用的模型。这些模型包括 gpt-4、gemini-1.5、command-r、command-r+、llama-3-8b、llama-3-70b、mistral、mixtral-8x7b、mixtral-8x22b、olmo 以及 starling。

Results: The LLM-evaluators had high variance in correlation with human judgments across the datasets. Each model performed poorly on some datasets, suggesting that they’re not reliable enough to systematically replace human judgments.
结果显示，LLM-评估者在各个数据集上与人工判断的相关性存在显著差异。每个模型在某些数据集上的表现不尽如人意，这表明它们还不可靠，不足以全面替代人工判断。

In addition, LLM-evaluators correlated better with non-expert annotators compared to expert annotators. This suggests that while several studies report high correlation with human annotations, the results could be overinflated if the annotators were non-experts.
此外，LLM-评估者与非专家注释者的相关性优于与专家注释者。这表明，尽管许多研究报道了与人工注释的高相关性，但如果注释者是非专家，结果可能会被过度夸大。

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges evaluates nine LLM-evaluators, using the TriviaQA dataset as a knowledge benchmark. The researchers sampled 400 questions from the unfiltered partition of TriviaQA and used the short answers as reference answers (i.e., the evaluation approach is reference-based). The training set was used as few-shot examples.
评估法官：在LLMs-as-Judges 中评估对齐和漏洞，对九个LLM-评估者进行了评估，使用 TriviaQA 数据集作为知识基准。研究人员从 TriviaQA 的未过滤部分中选取了 400 个问题，并使用简短答案作为参考答案（也就是说，评估方法是基于参考的）。训练集被用作少量示例的来源。

The LLM-evaluators were instructed to only respond with a single word: “correct” or “incorrect”. As baselines, the authors included exact match (EM) and contains substring (contains). For alignment metrics, they considered percentage agreement and Cohen’s $κ$ .
LLM的评估者被指示只能用一个词回答：“正确”或“不正确”。作为基线，作者纳入了完全匹配（EM）和包含子字符串（contains）。在对齐度量方面，他们考虑了百分比一致性和 Cohen 的 $κ$ 。

Your task is to look at the following question, and based on the references provided, 
determine if the model’s response is correct or incorrect. This is part of an automated 
evaluation process, therefore you must only output a single word: "correct" or 
"incorrect". 

Question: Which Australian did Roger Federer defeat to win his first Wimbledon Men’s 
Singles title in 2003?

References:
MARK PHILIPPOUSSIS
MARK PHILIPPOUSSIS

Model Response:
Mark Philippoussis

Evaluation (correct/incorrect):

Results: Gpt-4 and llama-3-70b had good human-alignment, achieving Cohen’s $κ$ of 0.84 and 0.79 respectively. However, they were still significantly lower than the human-human Cohen’s $κ$ of 0.97. Surprisingly, the contains baseline had higher correlation than half of the evaluator models on what was essentially a fuzzy matching task.
结果显示，Gpt-4 和 llama-3-70b 在人类对齐方面表现出色，Cohen’s $κ$ 分别达到了 0.84 和 0.79。然而，它们仍然明显低于人类之间的 Cohen’s $κ$ 指数 0.97。令人惊讶的是，在这个本质上属于模糊匹配的任务中，"contains" 基准的关联性竟然高于一半的评估模型。

The authors also noted that, compared to percentage agreement, Cohen’s $κ$ was better able to distinguish between LLM-evaluators. For example, while llama-3-8b had percentage agreement of 80%, it’s Cohen’s $κ$ was only 0.62. Similarly, LLM-evaluators with high Cohen’s $κ$ (>0.80; right below) had relatively less divergence in scores compared to when they had high percentage agreement (>80%; left below). Overall, this demonstrates that Cohen’s $κ$ provides a more precise and conservative measurement of alignment compared to percentage agreement (and almost most correlation metrics.)

• • •

That was a lot of papers and results! Let’s summarize what we learned about how to apply, evaluate, and operate LLM-evaluators. While the following may be an oversimplification, I hope it provides a useful starting point for working with LLM-evaluators.
这真是大量的论文和结果啊！让我们来总结一下我们学到的关于如何应用、评估和操作LLM-评估器的知识。虽然以下内容可能有些过于简化，但我希望它能为使用LLM-评估器提供一个有用的起点。

First, is your task objective (e.g., factuality, toxicity, instruction-following) or subjective (e.g., tone, persuasiveness, writing style)?
首先，你的任务目标是客观的吗？比如关注事实性、内容的毒性或是否遵循指示？还是更偏向主观判断，比如语气、说服力或写作风格？
- If it’s objective, apply direct scoring as the better option from a pair might still be a defect. Plus you don’t need an alternative for comparison.
  如果是客观评价，应采用直接评分法，因为即便两个选项中较好的一个也可能存在缺陷。此外，你无需找一个替代选项进行比较。
- If it’s subjective, pairwise comparisons will likely be more reliable.
  如果涉及主观判断，采用成对比较的方式可能会更可靠。
If using direct scoring, can you simplify the task to binary (e.g., true/false)?
如果采用直接评分的方式，能否将任务简化为二元判断，比如真假或对错？
- If so (binary), use classification metrics (e.g., recall, precision) or Cohen’s $κ$ .
  如果是二进制情况，可以使用分类指标（例如，召回率，精确度）或 Cohen 的 $κ$ 。
- If not (Likert scale), adopt correlations like Spearman’s $ρ$ and Kendall’s $τ$ .
  如果不是利克特量表，可以采用斯皮尔曼 $ρ$ 和肯德尔 $τ$ 这样的相关系数。
If using pairwise comparisons, apply Cohen’s $κ$ .
若采用两两对比的方法，应使用 Cohen 的 $κ$ 。
- And if you’re really confident in the ground truth, consider classification metrics (e.g., recall for picking the better choice between the pair).
  如果你对基本事实非常有信心，可以考虑使用分类指标（例如，在一对选项中选择更优选项时的召回率）。
- Tips and best practices on applying pairwise comparisons here.
  在这里提供应用成对比较的技巧和最佳实践，以便您参考。
Do you need it as an evaluator during development, or as a guardrail in production?
你在开发过程中需要它作为评估工具，还是在生产环境中作为安全防护？
- If using it as an evaluator during development, you’ll likely evaluate only a few hundred samples and can tolerate the latency/cost of prompting an LLM API. For reliability, use CoT + n-shot prompts (more prompting tips).
- If using it as a guardrail in production (low latency, high throughput), consider investing in finetuning a classifier or reward model, bootstrapping it on open-source data and labels you’ve collected during internal evals.

Thanks for sticking with me till the end! I hope you found this useful. What other resources on LLM-evaluators have you found helpful? Please comment below or DM me!

Thanks to the folks whose patient discussions and debates shaped my thinking, including Shreya Shankar, Summer Yue, Han Chung Lee, Hamel Husain, Eugene Cheah, Raza Habib, Shreya Rajpal, Kyle Corbitt, Joschka Braun, Vibhu Sapra, Garvan Doyle, Umang Shukla, Nicholas Marwell, Zach Witten, and more. All errors and misunderstandings my own.

References 参考资料

Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, arXiv, 15 Dec. 2022. arXiv.org, https://doi.org/10.48550/arXiv.2212.08073.
Gao, Mingqi, et al. Human-like Summarization Evaluation with ChatGPT. arXiv:2304.02554, arXiv, 5 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2304.02554.
Luo, Zheheng, et al. ChatGPT as a Factual Inconsistency Evaluator for Text Summarization. arXiv:2303.15621, arXiv, 13 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2303.15621.
Li, Junyi, et al. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv:2305.11747, arXiv, 22 Oct. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.11747.
Adlakha, Vaibhav, et al. Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. arXiv:2307.16877, arXiv, 17 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2307.16877.
Cohen, Roi, et al. LM vs LM: Detecting Factual Errors via Cross Examination. arXiv:2305.13281, arXiv, 22 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.13281.
Liu, Yang, et al. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. arXiv:2303.16634, arXiv, 23 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2303.16634.
Manakul, Potsawee, et al. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896, arXiv, 11 Oct. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2303.08896.
Liu, Yinhong, et al. Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators. arXiv:2403.16950, arXiv, 25 Mar. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2403.16950.
Zhou, Han, et al. Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments. arXiv:2406.11370, arXiv, 17 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.11370.
Upadhyay, Shivani, et al. UMBRELA: UMbrela Is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv:2406.06519, arXiv, 10 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.06519.
Verga, Pat, et al. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796, arXiv, 1 May 2024. arXiv.org, https://doi.org/10.48550/arXiv.2404.18796.
Kim, Tae Soo, et al. ‘EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria’. Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–21. arXiv.org, https://doi.org/10.1145/3613904.3642216.
Liu, Michael Xieyang, et al. ‘“We Need Structured Output”: Towards User-Centered Constraints on Large Language Model Output’. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–9. arXiv.org, https://doi.org/10.1145/3613905.3650756.
Shankar, Shreya, et al. Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. arXiv:2404.12272, arXiv, 18 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2404.12272.
Wang, Tianlu, et al. Shepherd: A Critic for Language Model Generation. arXiv:2308.04592, arXiv, 8 Aug. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2308.04592.
Tan, Bowen, et al. Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer. arXiv:2311.06720, arXiv, 11 Nov. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2311.06720.
Kim, Seungone, et al. Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. arXiv:2310.08491, arXiv, 9 Mar. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2310.08491.
McAleese, Nat, et al. LLM Critics Help Catch LLM Bugs. arXiv:2407.00215, arXiv, 28 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2407.00215.
Zheng, Lianmin, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685, arXiv, 23 Dec. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2306.05685.
Huang, Hui, et al. On the Limitations of Fine-Tuned Judge Models for LLM Evaluation. arXiv:2403.02839, arXiv, 17 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2403.02839.
Doddapaneni, Sumanth, et al. Finding Blind Spots in Evaluator LLMs with Interpretable Checklists. arXiv:2406.13439, arXiv, 19 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.13439.
Bavaresco, Anna, et al. LLMs Instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. arXiv:2406.18403, arXiv, 26 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.18403.
Thakur, Aman Singh, et al. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv:2406.12624, arXiv, 1 July 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.12624.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Aug 2024). Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge). eugeneyan.com. https://eugeneyan.com/writing/llm-evaluators/.

@article{yan2024llm-evaluator,
  title   = {Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {Aug},
  url     = {https://eugeneyan.com/writing/llm-evaluators/}
}

Share on: