
SelfCheck: using LLMs to zero-shot check their own step-by-step reasoning

Ning Miao*, Yee Whye Teh, Tom Rainforth. Department of Statistics, University of Oxford. *Email: ning.miao@stats.ox.ac.uk.
Abstract

The recent progress in large language models (LLMs), especially the invention of chain-of-thought prompting, has made it possible to automatically answer questions by stepwise reasoning. However, when faced with more complicated problems that require non-linear thinking, even the strongest LLMs make mistakes. To address this, we explore whether LLMs are able to recognize errors in their own step-by-step reasoning, without resorting to external resources. To this end, we propose SelfCheck, a general-purpose zero-shot verification schema for recognizing such errors. We then use the results of these checks to improve question-answering performance by conducting weighted voting on multiple solutions to the question. We test SelfCheck on three datasets—GSM8K, MathQA, and MATH—and find that it successfully recognizes errors and, in turn, increases final answer accuracies.

1 Introduction

Recent years have witnessed dramatic changes in the areas of NLP and AI brought on by significant advances in LLMs. From GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), Llama (Touvron et al., 2023) and Falcon (Almazrouei et al., 2023) to GPT-4 (OpenAI, 2023) and PaLM-2 (Google, 2023), the increasing model sizes and exploding amount of training data have empowered LLMs to achieve human-level performance on a large range of tasks, including summarization, translation, and question answering. The invention of Chain-of-Thought prompting (CoT, Wei et al. (2022)) has further enhanced LLMs’ ability to solve complex problems by generating step-by-step solutions.

However, the performance of even the largest LLMs is still unsatisfactory on more difficult reasoning problems. For example, GPT-4 with CoT prompting only correctly answers 42.5% of problems in the MATH dataset (Bubeck et al., 2023, Hendrycks et al., 2021), which is far below human level. Such problems require careful and extensive multi-step reasoning to solve, and LLMs are consequently prone to make mistakes: even though their error rate on individual steps may be low, the probability of generating at least one erroneous step can still be quite high, undermining the final answer.

Recent works have tried to overcome this limitation by checking for errors in these step-by-step solutions (Cobbe et al., 2021, Li et al., 2022, Ling et al., 2023). Such checks can then be used to provide confidence scores in answers and select between different possible alternatives. This checking has typically been performed either by using an external verification model (Cobbe et al., 2021, Lyu et al., 2023, Peng et al., 2023), or through few-shot in-context learning (Brown et al., 2020) of an LLM (Weng et al., 2022, Ling et al., 2023).

Unfortunately, existing methods generally require extra training data and/or domain-specific exemplars, which often makes them inconvenient to use in practice and restricts them to specific domains or data formats. The aim of our work is thus to instead provide a general-purpose, zero-shot approach to checking that relies only on the original LLM, without the need for additional external resources.

To this end, we introduce SelfCheck, a zero-shot step-by-step checker for self-identifying errors in LLM reasoning chains. SelfCheck uses the LLM to individually check the conditional correctness of each step in the chain based on the preceding steps, in a manner similar to a human going back to check their working. The results of these individual checks are then integrated to form an overall correctness estimation for the whole reasoning chain.

Key to SelfCheck’s success is a novel mechanism for performing the checking of individual steps. As we will show, the naive approach of directly asking the LLM to check a step is typically ineffective. Instead, we introduce a multi-stage approach that breaks the problem down into a series of simpler tasks, leverages the generative strengths of the LLM, and decorrelates errors between the original generation and checking. Specifically, using separate calls to the LLM we first extract the target and relevant context for the step, then regenerate an independent alternative step from these, and finally compare the two. The original step is then deemed to pass the check if it matches the regeneration.

Besides providing an estimation of correctness for each solution, SelfCheck can also boost final answer accuracies for the original questions by weighted voting. Namely, given multiple solutions to a question, it uses confidence scores as weights to vote among the answers, which provides a soft way to focus on more accurate solutions.

We evaluate SelfCheck on three math tasks, namely GSM8K (Cobbe et al., 2021), MathQA (Amini et al., 2019), and MATH (Hendrycks et al., 2021). For all datasets, we find that using SelfCheck achieves a significant increase in final answer accuracies compared with simple majority voting and other baselines. We also see that SelfCheck provides an accurate confidence estimation for LLM’s solutions, which decreases the proportion of incorrect solutions by 9%, 22.8%, and 16.2% on the three datasets respectively when filtering out solutions with low confidence scores. We further perform a number of ablations to justify some of our key design choices in the SelfCheck approach.

To summarize, we introduce SelfCheck as a novel and effective zero-shot schema for self-checking step-by-step reasoning in LLMs. Unlike previous methods, SelfCheck does not need any finetuning or example crafting, so it can be directly applied to reasoning tasks in different domains. Our experiments confirm that it can, in turn, be used to improve the final predictive performance of LLMs. Our code is available at https://github.com/NingMiao/SelfCheck.

2 Related Work

How to automatically check the correctness of a sequence of reasoning steps is a long-standing question. We now discuss how previous methods have tried to tackle this in an LLM context. We note that none of these works can operate in the zero-shot setting covered by SelfCheck: they require problem-specific exemplars, an external model, and/or finetuning.

Few-shot verification

Though our focus will be on zero-shot checking, for some problems one may have hand-crafted exemplars available that are specifically designed for that particular question-answering task. Previous methods have been designed to perform checking of LLMs’ generated solutions in this few-shot checking scenario.

For example, the Self-Verification (SV) approach of Weng et al. (2022) verifies the whole solution by backward prediction. That is, it uses the conclusion from CoT reasoning to predict a masked condition in the question. However, it only supports single-step checking and is based on the assumption that every piece of information in the question can be recovered using a correct solution of it, which is often not the case. Consequently, it is only applicable to simpler tasks, such as GSM8K.

The Deductive Verification (DV) approach of Ling et al. (2023) instead looks to verify independent sub-tasks, as SelfCheck does. However, its verifier only supports checking reasoning chains in a special format called Natural Programs. As a result, it can only work with a specific specialised generator, and cannot serve as a general verifier for multi-step reasoning.

Verification with external resources

In some cases, there might be external resources available to verify the logical correctness or faithfulness of LLM outputs. Lyu et al. (2023) translate a question into a symbolic reasoning chain using an LLM and solve the problem by a symbolic logic solver. Peng et al. (2023) introduced an external database to check for incorrect knowledge in LLM outputs. These methods are limited by the availability of external resources and are typically restricted to checking for certain types of errors.

Training/finetuning a verifier

A few other methods train or finetune a separate verifier model to check reasoning chains. Cobbe et al. (2021) finetuned a GPT-3 model on GSM8K to predict the correctness of a solution as a whole. Li et al. (2022) trained a binary deberta-v3-large (He et al., 2020) classifier on each domain to predict step correctness. More recently, Lightman et al. (2023) built a large dataset, which contains step-wise correctness labels from human labelers, and finetuned a GPT-4 model on it. Unlike SelfCheck, all of these methods require extra data and external computational resources, restricting their applicability and ease of use.

3 SelfCheck: Using LLMs to Check Their Own Reasoning

Rather than relying on external resources or problem-specific data like the aforementioned approaches, it would be highly beneficial if we could develop self-contained checking schemes that require only the original LLM itself. In other words, we would like to use the LLM to identify errors in its own step-by-step reasoning, analogously to how a human might go back to check their working.

Unfortunately, directly asking the LLM to check its own reasoning is largely ineffective: it almost invariably declares that the original answer is correct, with Ling et al. (2023) finding answers checked in this way are deemed correct more than 90% of the time regardless of whether they actually are. As we will show in Section 5, individually prompting the LLM to check each step in the CoT reasoning fares slightly better, but is still only able to offer marginal gains compared to not checking at all.

A more nuanced method to perform this checking is thus required. To this end, we introduce SelfCheck, a general-purpose, zero-shot checking schema for self-identifying errors in LLM CoT reasoning. Given a question, $q$, and its step-by-step solution, $s$, produced by some generator (which will generally be an LLM with appropriate CoT prompting), SelfCheck considers each step of $s$ in turn and tries to establish its individual correctness based on the preceding steps. This checking is done by leveraging an LLM (which can either be the same LLM used to generate $s$ or a separate one), but rather than directly asking the LLM to perform the check, we instead introduce a novel step checking method (see Section 3.1) that exploits their generative modeling strengths. The results of the checks on individual steps are then combined into a single confidence score, $w \in [0,1]$, for the whole solution. These confidence scores, in turn, allow us to improve predictive performance, by using them to perform weighted voting on multiple solutions to the same question.

Figure 1: Example of using SelfCheck, focusing on the checking of a particular step (Step 5). To check the correctness of the step, SelfCheck goes through 4 stages. First, in the target extraction stage, it figures out that the main purpose of Step 5 is to complete the square. In the information collection stage, it then establishes that Step 5 only directly relies on Step 4. Next, the step regeneration stage instructs the LLM to complete the square independently, only using Step 4 as context. The regeneration result shows that the center and radius of the circle are $(3,0)$ and $3$, which is different from what is implied by the original Step 5. Consequently, the result comparison stage concludes that Step 5 is likely to be wrong. After checking all the steps, SelfCheck integrates the results to form an overall confidence score, $w$. See Appendix A for a complete version of the example.

3.1 Step checking

To check individual steps of the reasoning process, the first thing we should note is that the correctness of each step is highly dependent on its context, namely the question and previous steps in the solution. For example, we usually need to refer to previous steps for the definition of variables and the meaning of specific numbers. If each step is conditionally correct based on the provided context and the last step provides an answer in the required format, then the overall reasoning will itself be correct. The target of the step checking is thus simply to check the conditional correctness of each step based on the provided context. That is, we only care about catching errors at the current step, and can assume all information from its context to be correct.

A simple idea to try and achieve this would be to feed the current step as well as all its context to an LLM and directly ask it to ‘check the correctness of the step’. However, in practice, we find that this task is too difficult for the LLM to do effectively, even with careful prompting that exemplifies how to do the checking in detail (see Section 5). This difficulty comes first from the fact that there are multiple aspects to the checking problem that the checker must deal with simultaneously: it needs to understand the key content in the step and then collect all related information from the context, before actually checking for its correctness. Second, ‘checking’ is a less common task in the training corpus of most LLMs, such that it is a problem that does not necessarily play to their strengths. Finally, there are likely to be strong correlations between the errors such a checker will make with the errors made in the original generation, undermining its usefulness.

To address these difficulties, SelfCheck instead decomposes the checking task for each step into four stages: target extraction, information collection, step regeneration, and result comparison. The LLM is used to execute each stage successively, with the outcome of the result comparison providing the correctness prediction.

The idea behind this decomposition is to make the LLM focus on an easier task at each stage and ensure the individual tasks carried out are more closely aligned to the LLM’s strengths. Moreover, by focusing on regenerating and then comparing, we hope to reduce the correlations between the errors of the checking and the original generation.

At a high level, the stages work by first prompting the LLM to figure out the target of the current step and what information it uses to achieve the target; we find that the LLM is usually able to perform these tasks extremely accurately. Then we ask the LLM to re-achieve the target using only the collected information, providing an alternative to the original step that maintains the same purpose in the overall reasoning process. Here the clear description of the target and the simplified context we provide make the regeneration stage less challenging. As a result, we hope its output will be more reliable and thus serve as a useful reference. Even if this is not the case, it will still hopefully provide a viable alternative, with a distinct generation, that can be used for comparison. The last stage then uses the LLM to compare the original step with the regenerated output. If their main conclusions match/mismatch, this provides evidence that the original step was correct/incorrect.

A worked example of this step-checking process is provided in Figure 1. In the following, we describe each of the subtasks in detail and provide our specific instructions to the LLM. We note here that the different LLM queries are made independently, rather than keeping the queries and answers from previous stages in context. Thus, for example, when the LLM is called to carry out the step regeneration, it does not have access to the original generation. The same prompts are used across LLMs and datasets, thereby providing a general-purpose approach.
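Before describing each stage in detail, the following minimal Python sketch shows how the four stages compose into a single step check. The `llm` callable and the `*_prompt` helpers are hypothetical placeholders that stand in for independent model calls using the prompts given below; `extract_ids`, `regeneration_prompt`, and `parse_verdict` are sketched in the subsections that follow.

```python
def check_step(llm, question, question_sentences, steps, i):
    """Check the conditional correctness of steps[i] given the question
    and steps[:i]. Each llm(...) call is made independently, so no
    stage sees the queries or replies of the other stages."""
    # Stage 1: target extraction -- what is steps[i] trying to achieve?
    target = llm(target_extraction_prompt(question, steps[: i + 1]))

    # Stage 2: information collection -- which question sentences and
    # previous steps does steps[i] directly rely on? Ids are parsed
    # out of the free-text reply by regular expression (see below).
    reply = llm(information_collection_prompt(
        question, question_sentences, steps[:i], steps[i]))
    info_ids, step_ids = extract_ids(reply)

    # Stage 3: step regeneration from the selected context only; the
    # original steps[i] is deliberately withheld from this call.
    regen = llm(regeneration_prompt(
        [question_sentences[j] for j in info_ids],
        [steps[j] for j in step_ids],
        target))

    # Stage 4: result comparison, mapped to r_i in {-1, 0, +1}.
    return parse_verdict(llm(comparison_prompt(regen, steps[i])))
```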

Target extraction

To check a step (for example, Step 5 in Figure 1), we first need to figure out what the step is trying to achieve. Without a specific target, the regeneration stage would proceed in a random direction, making it impossible to serve as a reference to the original step. We thus use the LLM itself to extract the target of a step using the question and all previous steps (Steps 0-4 in Figure 1) with the following prompt (we omit some line breaks due to space limitations):

  • The following is a part of the solution to the problem [Question]: [Step 0,…, Step i]. What specific action does the step [Step i] take? Please give a brief answer using a single sentence and do not copy the steps.

During execution, we copy the question and steps into [Question] and [Step 0, …, Step i] to form the actual input to the LLM. The reason for requesting a brief answer is to try and keep the amount of information retained to the minimum needed, thereby avoiding unnecessary influence on the regeneration and hopefully reducing correlations in errors in turn.

Information collection

To reduce the difficulty of the regeneration stage and avoid unrelated information from affecting the result, we filter out information that is not directly related to the current step. Specifically, we ask the LLM to select useful items from the question and all previous items with the following prompt, where [Information j] is simply the j-th sentence in the question:

  • This is a math question: [Question]. The following is information extracted from the question:
    Information 0: [Information 0] Information 1: [Information 1] …
    The following are the first few steps in a solution to the problem:
    Step 0: [Step 0] Step 1: [Step 1] … Step i-1: [Step i-1]
    Which previous steps or information does the next step [Step i] directly follow from?

After retrieving the free-text response from the LLM, we extract step or information ids by regular expression. For example in Figure 1, the current step requires Step 4 and no information from the question as context. The selected steps and information are then fed into the regeneration stage.
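As an illustration, this id extraction can be done with a regular expression along the following lines; the exact pattern is our own sketch and is not specified in the paper.

```python
import re

def extract_ids(response: str):
    """Parse 'Information j' and 'Step i' ids out of the checker's
    free-text reply, e.g. 'The next step directly follows from
    Step 4.' -> ([], [4])."""
    info_ids = [int(m) for m in re.findall(r"[Ii]nformation\s+(\d+)", response)]
    step_ids = [int(m) for m in re.findall(r"[Ss]tep\s+(\d+)", response)]
    return info_ids, step_ids
```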

Step regeneration

Given the target and necessary information of the step, we can now ask the LLM to achieve the target independently with only the collected information, without seeing the original step. Because the step is usually a small jump from previous conclusions, and the information collection stage has already filtered out irrelevant information, we can usually trust regeneration results. The prompt for this stage is:

  • We are in the process of solving a math problem. We have some information from the problem:
    Information 0: [Information $I_0$] Information 1: [Information $I_1$] …
    The following are some previous steps: Step 0: [Step $S_0$] Step 1: [Step $S_1$] …
    The target for the next step is: [Target]
    Please try to achieve the target with the information from the problem or previous steps.

Here [Target] is the output from the target extraction stage. [Information $I_i$] and [Step $S_i$] correspond to the specific items selected by the information collection stage. In Figure 1, only Step 4, and no information from the question, is directly related to the current step, so SelfCheck simply copies the content of Step 4 into [Step $S_0$] and removes the block containing [Information $I_i$].
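A sketch of how the regeneration prompt might be assembled, dropping the information or step blocks when the collection stage selected nothing, as in the Figure 1 example; the helper is illustrative, not the paper’s exact implementation.

```python
def regeneration_prompt(infos, prev_steps, target):
    """Fill the regeneration template, omitting empty blocks and
    re-numbering the selected items from 0 (so the original Step 4
    becomes Step 0 here, as in the Figure 1 example)."""
    parts = ["We are in the process of solving a math problem."]
    if infos:
        parts.append("We have some information from the problem:")
        parts += [f"Information {j}: {info}" for j, info in enumerate(infos)]
    if prev_steps:
        parts.append("The following are some previous steps:")
        parts += [f"Step {j}: {s}" for j, s in enumerate(prev_steps)]
    parts.append(f"The target for the next step is: {target}")
    parts.append("Please try to achieve the target with the information "
                 "from the problem or previous steps.")
    return "\n".join(parts)
```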

Result comparison

The last step is to compare results from the regeneration stage and the original step with the following prompt:

  • The following are 2 solutions to a math problem. Solution 1: [Regeneration output] Solution 2: [Step i]

    Compare the key points from both solutions step by step and then check whether Solution 1 ‘supports’, ‘contradicts’ or ‘is not directly related to’ the conclusion in Solution 2. Pay special attention to the difference in numbers.

If the regeneration output ‘supports’/‘contradicts’ the original step, we can conclude that the original step is likely correct/incorrect respectively. Sometimes, the correctness of the original step cannot be directly inferred from the regeneration output. For example, when the target is to simplify an equation, then there may be multiple valid solutions. In such cases, we are not sure about the correctness of the original step, which makes ‘is not directly related to’ the third possible outcome of the check.
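The three outcomes map naturally onto the per-step result $r_i$ used in the next subsection; a minimal parsing sketch follows, where the keyword matching is our own assumption rather than the paper’s exact rule.

```python
def parse_verdict(response: str) -> int:
    """Map the comparison reply to r_i: 'contradicts' -> -1,
    'supports' -> +1, and 'is not directly related to' (or any
    other reply) -> 0. A rough keyword heuristic; negations such
    as 'does not support' would need more careful handling."""
    text = response.lower()
    if "contradict" in text:
        return -1
    if "support" in text:
        return 1
    return 0
```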

3.2 Results integration

After running step-checking and getting a checking result for each step, we need an integration function $\phi$ to give a confidence score, $w \in [0,1]$, for the overall correctness of the solution. The input of $\phi$ should be a vector of the form $[r_0, r_1, \ldots, r_n]$, where each item $r_i$ represents the step checking result for Step $i$. We will use $r_i = -1$, $0$, and $1$ to represent the step-checking results ‘contradict’, ‘is not directly related to’ and ‘support’ respectively. We find that the following simple integration function works well in practice

$$w = \phi([r_0, r_1, \ldots, r_n]) = 2 \cdot \mathrm{Sigmoid}\left(-\lambda_{-1}\sum_{i=0}^{n}\mathbb{1}_{r_i=-1} - \lambda_{0}\sum_{i=0}^{n}\mathbb{1}_{r_i=0}\right), \qquad (1)$$

where $\lambda_{-1}$ and $\lambda_{0}$ are two non-negative hyperparameters with $\lambda_{-1} > \lambda_{0}$; we fix $\lambda_{-1} = 1$ and $\lambda_{0} = 0.3$ in our experiments. The rationale of this setup is that the more failed checks we see, the more likely the overall reasoning process, and thus the final solution, are wrong. Note here that, because the checks are themselves imperfect, we do not necessarily want to immediately reject the whole solution from a single step-check failure, especially for $r_i = 0$ cases. This is why we take a ‘soft’ approach to the verification with a confidence score. The number of successful checks, i.e. $\sum_{i=0}^{n}\mathbb{1}_{r_i=1}$, is deliberately not included in our integration function, as an increased number of successful checks does not actually increase our confidence in the overall solution: shorter reasoning chains are generally preferable to longer ones for a given question and LLM.
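Equation (1) translates directly into code. The following is a minimal Python sketch of the integration function with the hyperparameter values above; the function and argument names are our own.

```python
import math

def integrate(results, lam_fail=1.0, lam_unsure=0.3):
    """Confidence score w in [0, 1] from step results r_i in {-1, 0, 1},
    following Eq. (1). lam_fail and lam_unsure play the roles of
    lambda_{-1} and lambda_0; passed checks (r_i = 1) deliberately
    do not contribute."""
    n_fail = sum(1 for r in results if r == -1)
    n_unsure = sum(1 for r in results if r == 0)
    x = -lam_fail * n_fail - lam_unsure * n_unsure
    return 2.0 / (1.0 + math.exp(-x))  # 2 * Sigmoid(x); x <= 0, so w <= 1

# For example, one failed and one inconclusive check:
# integrate([1, -1, 0, 1]) ~= 0.43, whereas integrate([1, 1]) == 1.0.
```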

Once calculated, the resulting confidence score can be directly used as a weight for voting between different possible solutions. We can thus use SelfCheck to increase the accuracy of an LLM’s answers by generating multiple possible solutions, calculating confidence scores for each, and then choosing our final answer through weighted voting.
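To illustrate the voting rule, here is a minimal Python sketch with our own naming; plain majority voting is recovered when every confidence equals one.

```python
from collections import defaultdict

def weighted_vote(answers, confidences):
    """Choose the final answer by summing each solution's SelfCheck
    confidence score w per distinct final answer, rather than the raw
    counts used by majority voting."""
    totals = defaultdict(float)
    for answer, w in zip(answers, confidences):
        totals[answer] += w
    return max(totals, key=totals.get)

# Three sampled solutions, two distinct answers: the low-confidence
# majority loses to the single high-confidence solution.
# weighted_vote(["9*pi/4", "9*pi", "9*pi"], [0.9, 0.3, 0.4]) -> "9*pi/4"
```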

4 Experiments

We now run experiments on three math-reasoning datasets to evaluate SelfCheck’s effectiveness in checking multi-step reasoning and improving final answer accuracies. Note here that our focus on math-reasoning problems is due to ease of performance evaluation and dataset availability; SelfCheck is directly applicable to other question-answering problems with nominal changes to our prompts.

Datasets

GSM8K (Cobbe et al., 2021), MathQA (Amini et al., 2019), and MATH (Hendrycks et al., 2021) consist of math problems at primary school, middle school, and competition levels, containing 1319, 2985, and 5000 test samples, respectively. For GSM8K and MathQA, we evaluate SelfCheck on the whole test sets. Due to limited resources, we use a subset of the MATH test set taken from Ling et al. (2023), available at https://github.com/lz1oceani/verify_cot/tree/main/results/chatgpt3.5/natural_program/MATH_np.json.
Besides the levels of difficulty, the three datasets differ from each other in the following aspects. Firstly, MathQA provides 5 options to choose from for each problem, while GSM8K and MATH have no options. Secondly, GSM8K only has arithmetic problems, while MathQA and MATH contain more diverse problems in geometry, physics, probability, and algebra.

LLMs

We use GPT-3.5 (gpt-3.5-0301) and GPT-4 (gpt-4-0613) as our LLMs, focusing in particular on the former due to budget restrictions. Note that the same prompts are used for all datasets with both LLMs during evaluation; no dataset-specific customization or tuning has been performed. When devising the prompts, a small number of training samples from the MathQA dataset were utilized.

Baselines

We use majority voting (also known as Self-Consistency Decoding (Wang et al., 2022) in the context of CoT reasoning) as our main baseline, following Ling et al. (2023) and Lightman et al. (2023). Despite its simplicity, this is still quite a strong baseline in the current literature. In particular, most existing few-shot methods report similar results compared with it (Weng et al., 2022, Ling et al., 2023). We also compare with previously quoted results from Self-Verification (SV, Weng et al. (2022)) and Deductive Verification (DV, Ling et al. (2023)) when possible. We note though that these approaches are not directly comparable to SelfCheck in general, as they require additional exemplars which will often not be available in practice. Despite this, we will find that SelfCheck outperforms them when comparisons are possible.

We omit results from Faithful-CoT (Lyu et al., 2023) because it has already been shown to decrease accuracies on GSM8K and MATH by 11.8% and 4.2%, respectively, compared to majority voting (Ling et al., 2023). We are also unable to compare with training/finetuning-based methods such as Lightman et al. (2023), because we have neither access to their finetuned models nor the computational resources to repeat their training/finetuning. The significant extra data and resources they require also mean their contributions are somewhat tangential to SelfCheck regardless.

4.1 Final answer correctness

[Figure 2: three panels, (a) GSM8K, (b) MathQA, (c) MATH*]
Figure 2: The upper plots show the accuracies of SelfCheck and majority voting for different numbers of generated solutions per question with GPT-3.5. The lower plots show the accuracy gaps between each method and majority voting, where SV and DV stand for Self-Verification (Weng et al., 2022) and Deductive Verification (Ling et al., 2023), respectively. It is difficult to compare with DV and SV in terms of absolute accuracy because they use different generator models. However, we can see that SelfCheck achieves higher relative performance gains than both in their reported settings.

Figure 2 shows the performance gains from using the confidence scores from SelfCheck to do weighted voting, compared with baseline methods. The upper plots show that the accuracies of both SelfCheck and majority voting have the same increasing tendency as the number of generated solutions per question increases, which is a result of the variance reduction provided by averaging over more solutions. The bottom plots show the difference in accuracy between the two, including the standard error of the estimate. We can see that by allocating higher weights to correct solutions, SelfCheck achieves significantly higher accuracies than majority voting for all numbers of solutions per question. We also find the improvements of SelfCheck (compared with majority voting) to be higher than those of Deductive Verification and Self-Verification in their reported settings, despite their use of in-context learning from additional examples. We perform additional ablations on how performance changes when ensembling over a larger number of solutions in Section 5.1.

| Dataset | Generator | Checker | ✗✗ (%) | ✗✓ (%) | ✓✓ (%) | Acc (MV, %) | Acc (SelfCheck, %) | ΔAcc (%) |
|---|---|---|---|---|---|---|---|---|
| GSM8K | GPT-3.5 | GPT-3.5 | 16.8 | 23.0 | 60.2 | 71.7 | 74.3 | 2.8 ± 0.9 |
| GSM8K | GPT-4 | GPT-4 | 8.8 | 8.2 | 83.0 | 87.1 | 86.9 | -0.2 ± 0.2 |
| GSM8K | GPT-4 | GPT-3.5 | 8.8 | 8.2 | 83.0 | 87.1 | 88.1 | 1.0 ± 0.3 |
| MathQA | GPT-3.5 | GPT-3.5 | 27.6 | 26.4 | 46.0 | 59.2 | 64.6 | 5.4 ± 1.1 |
| MathQA | GPT-4 | GPT-4 | 16.2 | 11.0 | 72.8 | 78.3 | 80.9 | 2.6 ± 0.4 |
| MathQA | GPT-4 | GPT-3.5 | 16.2 | 11.0 | 72.8 | 78.3 | 81.2 | 3.0 ± 0.4 |
| MATH* | GPT-3.5 | GPT-3.5 | 52.6 | 23.2 | 24.2 | 35.8 | 38.0 | 2.2 ± 0.7 |
| MATH* | GPT-4 | GPT-4 | 42.0 | 20.2 | 37.8 | 47.9 | 51.3 | 3.4 ± 0.6 |
| MATH* | GPT-4 | GPT-3.5 | 42.0 | 20.2 | 37.8 | 47.9 | 48.9 | 1.0 ± 0.8 |
Table 1: SelfCheck significantly increases final answer accuracies with both GPT-3.5 and GPT-4, even when we only have 2 candidate solutions for each question. ΔAcc is the performance gain of SelfCheck compared with majority voting (MV), with the ± indicating the standard error. ✗✗, ✗✓ and ✓✓ represent the proportions of questions with 0, 1, or 2 correct solutions. We see that the gains from SelfCheck are typically larger in cases where it is common for only one of the solutions to be correct, as these are the cases where weighted voting can influence the final answer.

To investigate the effect of using more powerful LLMs, and of using a different LLM for the generation and checking, we further conducted experiments with GPT-4 and a mix of GPT-4 and GPT-3.5. Because of the high cost of calling the GPT-4 API, we randomly sample 500 questions from each dataset to form the test sets and generate 2 (instead of 10) answers to each question. In Table 1, we see that SelfCheck significantly outperforms majority voting with both GPT-3.5 and GPT-4. We also notice that using GPT-3.5 to check GPT-4 generated answers yields surprisingly good results, actually outperforming checking with GPT-4 on the simpler GSM8K and MathQA tasks. This is likely because using different LLMs helps to further decorrelate the errors of the generator and the checker, and shows that using a cheaper LLM can still often be sufficient for the checking. For the more difficult problems in MATH, using GPT-4 as checker always produces better results, but even here the checking from GPT-3.5 is beneficial compared to doing no checking at all.

[Figure 3: three panels, (a) GSM8K, (b) MathQA, (c) MATH*]
Figure 3: When raising the classification threshold $t$, the proportion of genuinely correct solutions among predicted-correct solutions (Real + in Pred +) increases for GSM8K (67.5% → 76.5%), MathQA (59.4% → 82.2%) and MATH (34.6% → 50.8%).

4.2 Verification performance

Besides serving as a confidence score calculator to improve the performance of voting, SelfCheck can also predict the correctness of a single solution. To do so, we simply set a threshold $t$ on the confidence score, where solutions with confidence scores $w \geq t$ are classified as correct.
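A small sketch of how this classification, and the ‘Real + in Pred +’ proportion from Figure 3, can be computed; the helper names are ours, with labels equal to 1 for genuinely correct solutions and 0 otherwise.

```python
def classify(confidence: float, t: float) -> bool:
    """Predict a single solution as correct when its confidence w >= t."""
    return confidence >= t

def precision_at_threshold(confidences, labels, t):
    """Fraction of predicted-correct solutions that are genuinely
    correct ('Real + in Pred +'); raising t trades recall for precision."""
    kept = [y for w, y in zip(confidences, labels) if w >= t]
    return sum(kept) / len(kept) if kept else float("nan")
```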

Figure 4: True positive rates (TP) vs. false positive rates (FP) as the classification threshold, $t$, is varied.

Figure 4 shows the ROC curves for each dataset. As a comparison, directly prompting GPT-3.5 to verify whole reasoning chains leads to no meaningful control on the false and true positive rates (FP and TP): they are always both 100% on MATH and 98% on GSM8K, as observed by  Ling et al. (2023). In other words, the checker always predicts the answer as correct, providing no useful information.

As well as verification accuracies, we may also care about the quality of the solutions that remain after filtering out those with low confidence scores $w$. Figure 3 shows that by increasing the threshold $t$, SelfCheck can filter out more incorrect solutions, such that a higher proportion of the solutions that pass the check are indeed correct (Real + in Pred +). Though this comes at the cost of misclassifying more genuinely correct solutions as incorrect, it can be a useful feature in cases where the risk of choosing an incorrect solution is higher than that of rejecting a correct one.

5 Analysis

We now perform some ablations to justify some of the key design choices made by SelfCheck and provide insights on its behavior. Limited by budget and time, all experiments in this section are performed on a subset of the MathQA test set with 100 randomly selected questions.

Figure 5: SelfCheck achieves significantly higher final answer accuracies than majority voting for large ensembles of solutions.

5.1 More solutions per question?

Serving as a method to reduce variance, majority voting increased final answer accuracies on different datasets when we increased from 2 to 10 solutions in Figure 2. In cases where we only care about final predictive performance, one might thus question whether it is better to simply use our computational resources to keep increasing the size of this ensemble, rather than relying on a checking scheme.

However, as shown in Figure 5, this effect saturates for larger solution ensembles, with the accuracy of majority voting never going above that achieved when $n=9$, thereby never reaching the performance we already achieved by SelfCheck for the smaller ensemble. Moreover, the performance of SelfCheck continues to increase as the ensemble grows. By lowering the weights (confidence) of incorrect solutions, SelfCheck increases the chance of selecting the correct answers, even when their generation probabilities in the generator LLM are low. Therefore, with SelfCheck, LLMs can effectively rectify their own biased beliefs by themselves.

5.2 Ablation studies

In order to pick apart the effect of several critical design choices for SelfCheck, we compare SelfCheck with some of its variants with respect to final answer and verification accuracies on MathQA.

Global vs. step-by-step checking

The first question is whether we can simply ask an LLM to check the whole solution without considering individual steps. To answer this, we prompt the LLM to perform global checking with the following instruction:

  • The following is a question and a solution to it from a student. Carefully check whether the solution is correct step by step. End your response with your conclusion that starts with "Correct", "Wrong" or "Not Sure".

    Question: [Question] Solution: [Step 0, Step 1,…, Step n]

Figure 6: Generation accuracies for variants of SelfCheck on MathQA with GPT-3.5.

Similar to the findings of Ling et al. (2023), we find that the global checker outputs "correct" most of the time and rarely recognizes an error. Consequently, its final answer accuracies are very close to majority voting (in Figure 6) and its verification accuracy (55.0%) is only marginally above random guess (50.0%). This lack of ability to deal with the difficulty of global checking is what makes step checking necessary.

Single-stage vs. multi-stage step checking

Next, we ask whether we really need to decompose the step checking into several stages. To answer this, we design the following prompt to use the LLM directly:

  • The following is a question and the first few steps in its solution.
    Question: [Question] Solution: [Step 0, Step 1,…, Step i-1]
    Check the correctness of the next step: [Step i]
    Please consider the information it relies on and check step by step. Please end your response with your conclusion that starts with "Correct", "Wrong" or "Not Sure".

| Method | Verification accuracy (%) |
|---|---|
| SelfCheck | 66.7 |
| Global Check | 55.0 |
| Single-stage Check | 57.2 |
| Error Check (0-shot) | 63.1 |
| Error Check (1-shot) | 64.2 |
Table 2: Verification accuracies for variants of SelfCheck on MathQA with GPT-3.5. The reported verification accuracy is the average of true positive and true negative rates.

Figure 6 and Table 2 show that although this is better than global checking, it is still significantly worse than SelfCheck with its multi-stage checking. This indicates that checking a step in a single stage is still too challenging for the LLM, so it is necessary to further decompose step checking into a pipeline of easier sub-tasks.

Error check vs. regenerate-and-compare

We now justify the choice to perform step regeneration and comparison instead of direct error checking for each step. To do so, we replace our regeneration stage and comparison stage with a single error-checking stage. We first compare with a zero-shot version of the variant with the following prompt:

  • Given the following information:
    Information 0: [Information $I_0$] Information 1: [Information $I_1$] …
    Step 0: [Step $S_0$] Step 1: [Step $S_1$] …
    Check the correctness of the next step [Step i].
    Please check for grounding errors, reasoning errors and calculation errors step by step. Please end your response with your conclusion that starts with "Correct", "Wrong" or "Not Sure".

We then add an exemplar from Ling et al. (2023) (see Appendix B) to make a more powerful one-shot error checker. However, results in Figure 6 and Table 2 show that even with a very detailed and instructive example, direct error checking still performs worse than our regenerate-and-compare approach, which supports our previous argument that LLMs are better at generation than checking.

6 Conclusions

In this paper, we have introduced SelfCheck, a general-purpose, zero-shot, step-by-step checking scheme for LLMs. Unlike previous approaches, SelfCheck does not require any additional data or external resources: it uses the LLM to identify errors in its own reasoning, leveraging a novel regenerate-and-compare approach. By using the results of this checking to perform weighted voting over different solutions, we find that SelfCheck is able to, in turn, increase final predictive accuracy.

References

  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. Falcon-40B: an open large language model with state-of-the-art performance. 2023.
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  2357–2367, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1245. URL https://aclanthology.org/N19-1245.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Google (2023) Google. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2020.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Li et al. (2022) Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022.
  • Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  • Ling et al. (2023) Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. arXiv preprint arXiv:2306.03872, 2023.
  • Lyu et al. (2023) Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023.
  • OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Weng et al. (2022) Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. Large language models are reasoners with self-verification. arXiv preprint arXiv:2212.09561, 2022.

Appendix A: A complete example of SelfCheck

To help readers get a better idea of how SelfCheck works, we show a complete version of the example in Figure 1, including the exact inputs and outputs at each stage. Please note that we have rendered all LaTeX for readability, though the inputs and outputs to the LLM are in their original form. The question is

Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$. This set forms a curve. Find the area of the region inside the curve.

We feed the following instruction to the LLM that serves as the generator:

Solve the following problem step by step. Please start each step with "Step :" and split sentences in each step with "\n\n". Please finish your response with "So the answer is …". Problem: Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$. This set forms a curve. Find the area of the region inside the curve.
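For concreteness, the following is a minimal Python sketch of how such a generation prompt can be assembled and the response split back into steps; the `query_llm` helper is a hypothetical stand-in for the underlying chat-completion API, which we do not specify here.

```python
# Minimal sketch: assemble the generator prompt and split the response into
# steps. `query_llm` is a hypothetical stand-in for the chat-completion API.
GENERATOR_TEMPLATE = (
    'Solve the following problem step by step. Please start each step with '
    '"Step :" and split sentences in each step with "\\n\\n". Please finish '
    'your response with "So the answer is ...". Problem: {question}'
)

def generate_solution(question, query_llm):
    response = query_llm(GENERATOR_TEMPLATE.format(question=question))
    # Each step opens with the "Step" marker requested in the prompt,
    # so splitting on that marker recovers the individual reasoning steps.
    return [part.strip() for part in response.split("Step") if part.strip()]
```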

After getting the following solution, we start checking Step 5 as an example:

Solution:
(Step 0) Let $z=x+yi$ be a complex number, where $x$ and $y$ are real numbers.
(Step 1) We are given that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$.
(Step 2) Using the fact that the real part of a complex number $a+bi$ is $a$, we have
$$\operatorname{Re}\left(\frac{1}{z}\right)=\operatorname{Re}\left(\frac{1}{x+yi}\right)=\frac{x}{x^{2}+y^{2}}.$$
(Step 3) Setting $\operatorname{Re}\left(\frac{1}{z}\right)=\frac{1}{6}$, we get $\frac{x}{x^{2}+y^{2}}=\frac{1}{6}$.
(Step 4) Cross-multiplying and rearranging, we get the equation $6x=x^{2}+y^{2}$.
(Step 5) Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$.
(Step 6) This is the equation of a circle with center $\left(\frac{3}{2},0\right)$ and radius $\frac{3}{2}$.
(Step 7) The area of the region inside the circle is $\frac{9\pi}{4}$. So the answer is $\frac{9\pi}{4}$.

Variable definition extraction

This optional stage extracts the definitions of variables that appear in the reasoning chain. SelfCheck feeds these definitions to the regeneration stage so that the LLM does not misinterpret the meaning of a variable. The input to the LLM is:

The following is the solution to a math question: Question: Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$. This set forms a curve. Find the area of the region inside the curve.


Solution:
(Step 0) Let $z=x+yi$ be a complex number, where $x$ and $y$ are real numbers.

(Step 7) The area of the region inside the circle is $\frac{9\pi}{4}$. So the answer is $\frac{9\pi}{4}$.

Are there variables in the solution? If so, please list the definition of variable in the form of "1. variable is defined as…".

The underlined parts are the type of question, the question itself and the steps in its solution, respectively.

The output from the LLM is:

Yes. There are variables in the solution.
1. $\boldsymbol{z}$ is defined as a complex number of the form $\boldsymbol{x+yi}$, where $\boldsymbol{x}$ and $\boldsymbol{y}$ are real numbers.

The bold part is then saved to form part of the input to the regeneration stage.
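The exact parsing code is not given here, but as a sketch of this bookkeeping, the numbered definitions can be pulled out of the checker's output with a regular expression (the pattern below is our own assumption):

```python
import re

def parse_variable_definitions(checker_output):
    """Extract items of the form '1. z is defined as ...' from the output."""
    # Each item starts with an item number and runs until the next item
    # number or the end of the text (re.S lets '.' span newlines).
    pattern = r"\d+\.\s*(.+?)(?=\n\d+\.|\Z)"
    return [item.strip() for item in re.findall(pattern, checker_output, re.S)]
```

For the output above, this returns a single definition, the one for $z$.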

Target extraction

To obtain a brief and clear statement of the current step's target, the input to the LLM is:

The following is a part of the solution to the problem: Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$. This set forms a curve. Find the area of the region inside the curve.

(Step 0) Let $z=x+yi$ be a complex number, where $x$ and $y$ are real numbers.

(Step 5) Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$.

What specific action does the step "Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$." take? Please give a brief answer using a single sentence and do not copy the steps.

The underlined parts are the question and the reasoning steps up to and including the current one.

The output of the LLM is:

The step completes the square to rewrite the equation in standard form of a circle.

The whole sentence is saved and forms the most important part of the input in the regeneration stage.

Information Collection

To collect the sentences in the question and the previous steps in the solution that are directly related to the current step, the input to the LLM is:

This is a math question: Question: Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$. This set forms a curve. Find the area of the region inside the curve.


The following is information extracted from the question:
Information 0: Let $S$ be the set of complex numbers $z$ such that the real part of $\frac{1}{z}$ is equal to $\frac{1}{6}$.

Information 1: This set forms a curve.
Information 2: Find the area of the region inside the curve.

The following are the first few steps in a solution to the problem:
Step 0: Let $z=x+yi$ be a complex number, where $x$ and $y$ are real numbers.

Step 4: Cross-multiplying and rearranging, we get the equation $6x=x^{2}+y^{2}$.

Which previous steps or information does the next step "Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$." directly follow from?

The underlined information (the contents after "Information X:") is simply a list of the sentences in the question. The underlined steps are the reasoning steps before (not including) the current one in the solution.

The output from the LLM is:

The next step "Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$." directly follows from Step 4.

Then we use regular expressions to extract the information and step IDs from the LLM's output. Some steps may rely only on information from the question, while others may use only results from previous steps.
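The exact patterns are not reproduced here; the sketch below is one plausible implementation of this extraction step, with illustrative patterns of our own choosing:

```python
import re

def extract_references(checker_output):
    """Collect the information and step IDs cited by the checker."""
    info_ids = [int(n) for n in re.findall(r"[Ii]nformation\s+(\d+)", checker_output)]
    step_ids = [int(n) for n in re.findall(r"[Ss]tep\s+(\d+)", checker_output)]
    return info_ids, step_ids

# For the output above, extract_references(...) returns ([], [4]):
# Step 5 depends only on Step 4 and on no sentence from the question.
```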

Step regeneration

In this stage, we instruct the LLM to achieve the extracted target using only the collected information, via the following prompt:

We are in a process of solving a math problem.

Variables are defined as: 1. $z$ is defined as a complex number of the form $x+yi$, where $x$ and $y$ are real numbers.

The following are some previous steps: Step 0: Cross-multiplying and rearranging, we get the equation $6x=x^{2}+y^{2}$.

The target for next step is: The step completes the square to rewrite the equation in standard form of a circle.

Please try to achieve the target with the information from previous steps.

The underlined parts are the type of question, the variable definitions (if any), the information collected from the question (if any), the collected previous steps (if any), and the extracted target. Note that, since the current step does not directly rely on any information from the question, we removed the corresponding block from the input.
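The conditional structure of this prompt can be made explicit with a small sketch. The template strings mirror the prompt above, while the function itself and its argument names are our own illustration:

```python
def build_regeneration_prompt(problem_type, definitions, question_info,
                              prev_steps, target):
    """Assemble the regeneration prompt; empty blocks are simply omitted."""
    parts = [f"We are in a process of solving a {problem_type} problem."]
    if definitions:
        parts.append("Variables are defined as: "
                     + " ".join(f"{i + 1}. {d}" for i, d in enumerate(definitions)))
    if question_info:
        parts.append("The following is information from the question: "
                     + " ".join(question_info))
    if prev_steps:
        # Collected steps are renumbered from 0, as in the example above,
        # where the original Step 4 appears as "Step 0".
        parts.append("The following are some previous steps: "
                     + " ".join(f"Step {i}: {s}" for i, s in enumerate(prev_steps)))
    parts.append(f"The target for next step is: {target}")
    parts.append("Please try to achieve the target with the information "
                 "from previous steps.")
    return "\n\n".join(parts)
```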

The output from the LLM is:

To rewrite the equation in standard form of a circle, we need to complete the square for both $\boldsymbol{x}$ and $\boldsymbol{y}$.

Starting from the given equation: $\boldsymbol{6x=x^{2}+y^{2}}$.
We add and subtract $\boldsymbol{(6/2)^{2}=9}$ on the right-hand side:
$\boldsymbol{6x+9-9=x^{2}+y^{2}+9-9}$.
Simplifying: $\boldsymbol{(x-3)^{2}+y^{2}=9}$. This is now in the standard form of a circle with center $\boldsymbol{(3,0)}$ and radius $\boldsymbol{3}$.

Then we can directly compare the regenerated step with the original one in the next stage.

Result comparison

The input to the LLM is:

The following are 2 solutions to a math problem:

Solution 1: To rewrite the equation in standard form of a circle, we need to complete the square for both x𝑥x and y𝑦y.

Starting from the given equation: $6x=x^{2}+y^{2}$.
We add and subtract $(6/2)^{2}=9$ on the right-hand side: $6x+9-9=x^{2}+y^{2}+9-9$.
Simplifying: $(x-3)^{2}+y^{2}=9$. This is now in the standard form of a circle with center $(3,0)$ and radius $3$.

Solution 2: Completing the square, we obtain $\left(x-\frac{3}{2}\right)^{2}+y^{2}=\frac{9}{4}$.

Compare the key points from both solutions step by step and then check whether Solution 1 "supports", "contradicts" or "is not directly related to" the conclusion in Solution 2. Pay special attention to difference in numbers.

The underlined parts are simply the problem type, the regenerated step and the original step.

The output is:

In both solutions, the goal is to rewrite the given equation in standard form of a circle. However, in Solution 1, the completion of the square results in the center of the circle being $(3,0)$ and a radius of $3$. In Solution 2, the completion of the square results in the center of the circle being $(3/2,0)$ and a radius of $3/2$.

Therefore, Solution 1 contradicts the conclusion in Solution 2 as the centers and radii of the circles are different.

If the word ‘support’ or ‘contradict’ appears in the last line of the output, SelfCheck outputs a step score of 1 or -1, respectively. Otherwise, SelfCheck outputs 0, meaning that the two steps are not related.
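A minimal sketch of this scoring rule follows; the precedence when both keywords appear in the same line is not specified, so checking ‘contradict’ first is our own assumption:

```python
def step_score(comparison_output):
    """Map the comparison verdict to a step score in {-1, 0, 1}."""
    last_line = comparison_output.strip().splitlines()[-1].lower()
    if "contradict" in last_line:   # also matches "contradicts"
        return -1
    if "support" in last_line:      # also matches "supports"
        return 1
    return 0  # the two steps are not directly related
```

For the comparison output above, the last line contains "contradicts", so the step receives a score of -1.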

Appendix B: Exemplar for one-shot error checking

The following example from Ling et al. (2023) is used to perform one-shot error checking.

Here is some information:
“1. Alice gets 2 apples.
2. Alice gets twice as many oranges as bananas.
3. Original price per apple: $2
Discount: 50%
Price per apple after discount: $2 * (1 - 0.50) = $2 * 0.50 = $1"

Based on the given information, here is a reasoning process:
"Calculate Alice’s cost of the 5 apples after a 50% discount.
Price per apple after discount: $1
Apples be eaten: 3
Oranges be eaten: 6
Total apples: 5
Total oranges: x
Total bananas: 2x
Alice’s total number of fruits: 5 + x + 2x = 5 + 4x
Alice’s total cost of 5 apples: $1 * (5 - 3) = $2”

Double-check the reasoning process, let’s analyze its correctness, and end with "yes" or "no".

Answer:
Let’s think step by step.

Grounding check
Purpose: Find references for numbers in the reasoning process sequentially
Rule: Any ungrounded number makes the check fail

"x" is a variable, we do not care about variables.
"y" is a variable, we do not care about variables.

"5" is a quantity from the given information that represents the number of apples.
"50%" is a quantity from the given information that represents the discount.
"$1" is a quantity from the given information that represents the price per apple after the discount.
"3" is a quantity not from the given information, it is ungrounded.
"6" is a quantity not from the given information, it is ungrounded.
"$2" is a quantity from the given information that represents the original price per apple.
Result: Based on the check, there are ungrounded numbers, so the grounding check fails.

Reasoning check:
Purpose: Check whether the reasoning logic correctly answers the question
Rule: Any logic error makes the check fail

To calculate the total cost of apples after a 50% discount, we should multiply the number of apples by the discounted price. But to answer the question, we do not need to calculate the total number of fruit Alice gets.

Result: Based on the check, there are some logic mistakes, so the reasoning check fails.

Calculation check:
Purpose: Check whether the calculation process is consistent
Rule: Any inconsistent calculation makes the check fail
calculation1:

equation: $1 * (5 - 3), answer: $2
(5 - 3) = 2
$1 * 2 = $2 is consistent with the answer, so the calculation is correct.
calculation2:
equation: 5 + x + 2x, answer: 5 + 4x
x + 2x = 3x
5 + 3x is inconsistent with the answer, so the calculation is incorrect.
Result: Based on the check, the calculation process is inconsistent, so the calculation check fails.

Check results: Grounding check fails, Reasoning check fails, Calculation check fails.
Rule: Any failed check makes the reasoning incorrect.
So the answer is "no".
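
To make the usage of this exemplar concrete, the following sketch (our assumption of the exact concatenation; `query_llm` is again a hypothetical API stand-in) prepends it to a new information/reasoning pair and reads off the final "yes"/"no" verdict:

```python
EXEMPLAR = "..."  # the full worked example above, included verbatim

CHECK_TEMPLATE = (
    "Here is some information:\n\"{information}\"\n\n"
    "Based on the given information, here is a reasoning process:\n"
    "\"{reasoning}\"\n\n"
    "Double-check the reasoning process, let's analyze its correctness, "
    "and end with \"yes\" or \"no\".\n\n"
    "Answer:\nLet's think step by step.\n"
)

def one_shot_check(information, reasoning, query_llm):
    # Prepend the worked exemplar so the model sees one complete
    # demonstration of the grounding/reasoning/calculation checks.
    prompt = EXEMPLAR + "\n\n" + CHECK_TEMPLATE.format(
        information=information, reasoning=reasoning)
    # The reasoning passes the check only if the final verdict is "yes".
    return "yes" in query_llm(prompt).strip().splitlines()[-1].lower()
```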