Chain-of-Thought Reasoning without Prompting

Xuezhi Wang and Denny Zhou
Google DeepMind, {xuezhiw, dennyzhou}@google.com

In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason effectively without prompting? Our findings reveal that, intriguingly, CoT reasoning paths can be elicited from pre-trained LLMs by simply altering the decoding process. Rather than conventional greedy decoding, we investigate the top-k alternative tokens, uncovering that CoT paths are frequently inherent in these sequences. This approach not only bypasses the confounders of prompting but also allows us to assess the LLMs' intrinsic reasoning abilities. Moreover, we observe that the presence of a CoT in the decoding path correlates with a higher confidence in the model's decoded answer. This confidence metric effectively differentiates between CoT and non-CoT paths. Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding effectively elicits reasoning capabilities from language models, which were previously obscured by standard greedy decoding.

1. Introduction

Large language models (LLMs) have demonstrated remarkable performance on various complicated reasoning benchmarks (Anil et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Gemini, 2023; OpenAI, 2023; Romera-Paredes et al., 2023). These reasoning capabilities of LLMs are typically elicited by prompting techniques (Brown et al., 2020), which can be few-shot prompting with demonstration exemplars augmented with intermediate steps (Chen et al., 2023b; Gao et al., 2022; Nye et al., 2021; Wei et al., 2022; Yao et al., 2023; Zhou et al., 2023a), or zero-shot prompting with specific instructions which ask for showing certain intermediate steps (Kojima et al., 2022; Yasunaga et al., 2023). The other prevalent strategy for eliciting LLM reasoning is through model training or instruction tuning using a substantial amount of chain-of-thought (CoT) reasoning data (Chung et al., 2022; Cobbe et al., 2021b; Ling et al., 2017; Nye et al., 2021).
Prompting techniques, while effective, often encode task-specific human priors, thereby making it difficult to assess a language model's intrinsic reasoning abilities. Ideally, a language model should be able to reason independently and provide the optimal response, without requiring humans to tweak the prompts or refine repeatedly if the initial response is unsatisfactory. Model-tuning can be expensive and requires a substantial amount of supervised data. In this work, we explore a different perspective and ask: Can LLMs reason effectively without prompting? And to what extent can they reason? We find that, perhaps surprisingly, there exists a task-agnostic way to elicit CoT reasoning from pre-trained LLMs by simply altering the decoding procedure. Figure 1 illustrates this phenomenon: given a reasoning question, the LLM generates a wrong answer via the standard greedy decoding path, yet inspecting the alternative top-k tokens unveils inherent CoT paths (e.g., decoding paths 2 and 4) that accurately resolve the query. This decoding modification bypasses prompting and is entirely unsupervised without the need for model tuning.
In more detail, we formulate the input using the standard question-answer (QA) format: "Q:
Figure 1 | Illustration of CoT-decoding. Pre-trained LLMs are capable of inherent reasoning without prompting by considering alternative top-k tokens, rather than solely relying on the top-1 greedy decoding path. Moreover, these models tend to display higher confidence in decoding the final answer (indicated by a darker shaded color) when a CoT reasoning path is present.
[question] \nA:". While most existing work suggests that LLMs falter in such direct-QA scenarios on reasoning (Cobbe et al., 2021a; Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022), our findings reveal a more nuanced picture. We observe that LLMs indeed struggle with reasoning when relying solely on greedily decoded paths. However, when we consider alternative paths among the top-k tokens, CoT reasoning patterns emerge naturally within the decoding trajectories of LLMs. In addition, we observe an interesting pattern: the model demonstrates increased confidence in the final answer when a CoT reasoning path is present in the decoding process. As illustrated in Figure 1, paths 2 and 4 show heightened certainty in arriving at the correct answer "8", contrasting sharply with the high uncertainty of paths that lead to the incorrect "5". Leveraging this phenomenon, we develop a method to sift through the top-k decoding paths, which we refer to as CoT-decoding, thereby isolating the most reliable paths for model output.
Our contributions are summarized as follows:
  • We present a novel finding that LLMs can reason by simple decoding changes, without the use of prompting. In contrast to prior research that focuses on refining prompts to elicit reasoning from LLMs, our work, for the first time, shows that the reasoning process can be readily elicited by simple decoding changes. Moreover, we challenge the prevailing notion in the literature that LLMs are inherently incapable of effective reasoning without prompting. We show that this belief is an artifact of considering only the greedy path during decoding, and the model's reasoning paths can be revealed by traversing the alternative decoding paths.
  • Our method enables a better understanding of LLMs' intrinsic reasoning capabilities without imposing human priors. The employment of intricate prompting techniques often introduces various human priors, making it difficult to distinguish between the extent of "human teaching" and the degree to which LLMs can reason independently. Our approach bypasses the confounders introduced by prompting, enabling a more truthful assessment of the models' intrinsic reasoning abilities. Our study reveals that pre-trained language models inherently possess reasoning capabilities for many tasks including math and commonsense reasoning, and existing prompting approaches mostly serve the role of bringing those inherent reasoning paths forward as the top decoding paths. In contrast, the CoT-paths are less prevalent in complex and highly synthetic tasks, where the few-shot CoT demonstrations play a "teaching" role in guiding how models solve a task, with models primarily mimicking the format of these prompts to generate accurate reasoning paths.
  • We further propose CoT-decoding that reliably selects CoT-paths based on answer confidence. We find that the language model's confidence in its final answers increases when a CoT is present in its decoding path. Leveraging this increased confidence, we propose CoT-decoding to select more reliable decoding paths, demonstrating significant improvements over greedy decoding across various reasoning benchmarks.

2. Chain-of-Thought (CoT) Decoding

2.1. Pre-trained Language Models Can Reason without Prompting

We investigate whether pre-trained language models inherently possess reasoning capabilities, without explicit prompts or human intervention. In Table 1, we show example decoding paths across math (GSM8K, Cobbe et al. (2021a)) and commonsense reasoning (year parity, Allen-Zhu and Li (2023)). We employ the pre-trained PaLM-2 Large model (Anil et al., 2023) to compare its greedy decoding path (k = 0), predominantly used in state-of-the-art LLMs for reasoning tasks, with alternative decoding paths (k > 0), where k represents the choice of the k-th token at the first decoding step.
[Year Parity] Was Nicolas Cage born in an even or odd year?
Greedy path:
k = 0: Nicolas Cage was born in an odd year. (0.117)
Alternative top-k paths:
Even (0.207)
Odd (0.198)
1964, an even year. (0.949)
He was born in an even year. (0.0)
Cage was born in 1964, an even year. (0.978)
Table 1 | Examples of greedy decoded paths and alternative top-k paths over the PaLM-2 Large model. The model's confidence over the answers (bolded) is highlighted in blue (see §2.2 for details).
LLMs indeed cannot reason if we only consider the greedy decoding path. First, we observe that the greedy decoding path often does not contain a CoT, with the model opting to solve problems directly. This tendency may stem from the model's skewed perception of problem difficulty, shaped by its pre-training on predominantly simpler questions. Consequently, the model is predisposed to immediate problem-solving. This observation aligns with findings in (Cobbe et al., 2021a; Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022), which show that direct-answer prompts generally result in low accuracy on reasoning tasks even for large language models.
LLMs can reason if we consider the alternative decoding paths. Contrastingly, an intriguing phenomenon emerges when exploring alternative top-k tokens at the first decoding step. Continuing with greedy decoding from this point reveals natural CoT reasoning in many cases. These findings suggest that large language models possess inherent reasoning capabilities for numerous tasks following pre-training, but these abilities are obscured by the predominant use of greedy decoding. These reasoning paths can be easily uncovered by incorporating alternative decoding paths.
For instance, in the GSM8K question (Table 1), a valid CoT emerges among the alternative top-k continuations. Similarly, in the year parity task, greedy decoding (k = 0) attempts to answer the parity question directly, leading to a random choice between "even" and "odd" that often results in an incorrect answer. However, when exploring k > 0, the model naturally generates CoT paths in which it first determines the year before resolving the parity.
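The first-step branching described above can be sketched as follows. The `step` function stands in for a real model's next-token distribution (one forward pass returning post-softmax probabilities over the vocabulary); the toy table below, including its tokens and probabilities, is purely illustrative and not actual PaLM-2 output.

```python
from typing import Callable, Dict, List, Tuple

def cot_decode_paths(
    step: Callable[[Tuple[str, ...]], Dict[str, float]],
    prompt: Tuple[str, ...],
    k: int = 10,
    max_len: int = 16,
    eos: str = "</s>",
) -> List[List[str]]:
    """Branch over the top-k tokens at the first decoding step, then
    continue greedily from each branch (only the first step differs
    from standard greedy decoding)."""
    first = step(prompt)  # next-token distribution after the prompt
    top_k = sorted(first, key=first.get, reverse=True)[:k]
    paths = []
    for token in top_k:
        seq = [token]
        while seq[-1] != eos and len(seq) < max_len:
            dist = step(prompt + tuple(seq))
            seq.append(max(dist, key=dist.get))  # greedy continuation
        paths.append(seq)
    return paths

# Toy stand-in for the LM on the year-parity question (hypothetical numbers).
TABLE = {
    (): {"Even": 0.40, "Odd": 0.35, "He": 0.25},
    ("He",): {"was": 1.0},
    ("He", "was"): {"born": 1.0},
    ("He", "was", "born"): {"in": 1.0},
    ("He", "was", "born", "in"): {"1964,": 1.0},
    ("He", "was", "born", "in", "1964,"): {"even": 1.0},
}
toy_step = lambda ctx: TABLE.get(ctx, {"</s>": 1.0})

paths = cot_decode_paths(toy_step, (), k=3)
# Greedy decoding alone would return only the top-1 path ("Even");
# branching also surfaces the CoT-style path starting with "He".
```

With a real model, `step` would be replaced by a forward pass, and the continuation loop would append the argmax token of each successive distribution exactly as above.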

2.2. CoT-Decoding for Extracting CoT Paths

In this section, we further show how we can reliably extract those CoT-paths during the decoding process. Table 1 illustrates that CoT paths do not consistently outrank non-CoT ones in the model's probability assessment. Moreover, they often do not represent the predominant answer among all paths, rendering methods like self-consistency (Wang et al., 2023a) inapplicable. For instance, in the GSM8K question, the prevalent answer "60", which aligns with the greedy decoding result, fails to serve as a reliable indicator for identifying the correct path.
Interestingly, upon examining the model's logits, we found that the presence of a CoT path typically leads to a more confident decoding of the final answer, characterized by a significant probability disparity between the top and secondary tokens:

Δ_{k,answer} = (1 / |answer|) Σ_{x_t ∈ answer} p(x_t^1 | x_{<t}) − p(x_t^2 | x_{<t}).
Here x_t^1 and x_t^2 represent the top two tokens at the t-th decoding step in the k-th decoding path, chosen for their maximum post-softmax probabilities from the vocabulary, given x_t being part of the answer tokens. This uncertainty measure is similar to the minimum-margin approach in (Jiang and Gupta, 2019), and in our case, the model's overall confidence in decoding the final answer is approximated by averaging these probability differences over all relevant answer tokens x_t. For example, for the GSM8K question in Table 1, given the answer "60", we average the probability differences for all tokens in that answer, i.e., "6" and "0".
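As a concrete sketch, the confidence score reduces to averaging the top-1/top-2 probability margin over the answer's token positions. The probability pairs below are illustrative placeholders, not actual model outputs:

```python
def answer_confidence(top2_probs):
    """Average margin p(top-1) - p(top-2) over the answer's tokens.

    top2_probs: one (p1, p2) pair per answer token, where p1 and p2
    are the two largest post-softmax probabilities at that decoding
    step of the path."""
    if not top2_probs:
        return 0.0
    return sum(p1 - p2 for p1, p2 in top2_probs) / len(top2_probs)

# Hypothetical margins for the two answer tokens "6" and "0" of "60":
# a CoT path decodes them with wide margins, a direct-answer path
# with narrow ones.
delta_cot = answer_confidence([(0.98, 0.01), (0.97, 0.02)])     # wide margins
delta_direct = answer_confidence([(0.55, 0.40), (0.60, 0.35)])  # narrow margins
```

The path with the larger averaged margin is the one CoT-decoding would select.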
This method, referred to as CoT-decoding, extracts such CoT paths among the decoded paths from the model. As illustrated in Table 1, each decoding path is marked with its corresponding Δ value in blue (the answer tokens are bolded). It is evident that paths with a CoT component exhibit a significantly higher Δ, highlighting the model's increased confidence, as opposed to paths without CoT. We also performed a quantitative analysis by manually examining the first 100 questions in GSM8K: if we take the decoding path with the highest answer confidence among the top-10 decoding paths, the vast majority of them contain CoT paths. This shows an overwhelmingly high correlation between the model's answer confidence and the CoT paths.
Comparing different CoT-path extraction approaches. In Table 2, we compare different ways to extract the CoT-paths out of the top-10 decoded paths. It is easy to see that the model's own probability measure does not serve as a reliable indicator, nor does the model's length-normalized probability (an intuition might be that a CoT-path should usually be a longer decoding path, which is not always the case, e.g., on the year parity task). In contrast, CoT-decoding can reliably extract the CoT-paths, yielding a significant boost in the model's reasoning performance.
                                                                      GSM8K (top-100)   Year Parity
Greedy decoding
Decode 10 paths, rank by model's highest log-prob
Decode 10 paths, rank by model's highest length-normalized log-prob
CoT-decoding (decode 10 paths, rank by model's answer confidence)

Table 2 | CoT-decoding reliably extracts the CoT-paths compared to other methods (on PaLM-2 L).
Identify the answer spans. Computing Δ requires identifying the answer spans in a model's response. One common approach used for public models is to extract the last numerical value in math reasoning tasks, or the final option in set-based reasoning tasks, as the answer, following the Tülu evaluation (Ivison et al., 2023; Liu et al., 2024; Wang et al., 2023b). Alternatively, similarly to the method used in Kojima et al. (2022), we can extend the model's output with the prompt "So the answer is", and then align these continuations with spans in the model's decoding path as the answer.
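A minimal sketch of the first heuristic (last numeric value as the answer span). The regex and examples are our own illustration, not the exact Tülu evaluation code:

```python
import re
from typing import Optional

def extract_answer_span(text: str) -> Optional[str]:
    """Heuristically take the last numerical value in the response as
    the answer span (thousands separators are stripped first)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

extract_answer_span("We have 3 + 5 = 8 apples. So the answer is 8.")  # "8"
extract_answer_span("Cage was born in 1964, an even year.")           # "1964"
```

For multiple-choice or set-based tasks, the analogous heuristic would match the final option label instead of a number.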

Sampling under the standard QA format. CoT-decoding explores alternative tokens at the first decoding step. A natural question arises: can sampling achieve a similar effect and unveil the CoT reasoning paths?

                                                     Mistral-7B    PaLM-2 L
Greedy decoding
Self-consistency without CoT-prompt (10 paths)
CoT-decoding (10 paths)

Table 3 | CoT-decoding and self-consistency w/o prompts on GSM8K.
We found that, although sampling works well under few-shot CoT prompting (Wang et al., 2023a), it does not exhibit the desired behaviour without the prompts. We compare CoT-decoding with self-consistency when no CoT prompt is used in Table 3. The ineffectiveness of sampling stems from the model's strong tendency to provide a direct answer during decoding; hence the first token tends to have less diversity compared to CoT-decoding. In contrast, CoT-decoding works by explicitly encouraging diversity at the first decoding step.
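This tendency can be illustrated with a toy, heavily peaked first-token distribution (the tokens and probabilities below are hypothetical): sampling rarely leaves the dominant direct-answer token, while top-k enumeration covers the CoT opener by construction.

```python
import random

# Hypothetical first-token distribution under the direct-QA format:
# the model heavily favors answering directly ("Odd") over starting
# a CoT ("He").
first_token = {"Odd": 0.90, "Even": 0.06, "He": 0.04}

def sample_token(dist, rng):
    """Inverse-CDF sampling over the token distribution."""
    r, acc = rng.random(), 0.0
    for token, p in dist.items():
        acc += p
        if r < acc:
            return token
    return token  # guard against floating-point round-off

rng = random.Random(0)
sampled_firsts = {sample_token(first_token, rng) for _ in range(10)}

# Top-k enumeration at the first step covers all three tokens, "He"
# included, regardless of how peaked the distribution is.
enumerated_firsts = set(sorted(first_token, key=first_token.get, reverse=True)[:3])
```

Ten sampled paths will, with high probability, all begin with the dominant token, whereas the enumerated set always contains the low-probability CoT opener.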
Figure 2 | Decoded paths by considering alternative tokens at various decoding steps.
Branching at other decoding steps. Another natural question is whether branching is viable at later decoding stages, compared to branching only at the first decoding step. In Figure 2, we highlight the impact of alternative token consideration in subsequent decoding steps. It is evident that early branching, e.g., at the first decoding step, significantly enhances the diversity of potential paths.
Conversely, later-stage branching is significantly influenced by previously generated tokens. For instance, initiating with the token "5" greatly decreases the likelihood of rectifying an erroneous path. Nonetheless, the optimal branching point may vary with the task; in the year parity task, for instance, mid-path branching can effectively yield correct CoT paths.
Aggregation of the decoding paths. Since we already decode the top-k paths, one natural extension is to aggregate the answers over all those paths, similar to self-consistency (Wang et al., 2023a) but without the use of prompts. The rationale behind this aggregation is to mitigate sensitivity to small differences in the model's logits, particularly when relying solely on the path with the maximum Δ. The examples in Table 1 show that the majority answer is unlikely to be the correct one. Instead, we propose a weighted aggregation method: we take the answer that maximizes Σ_k Δ_{k,answer}, summing over all decoding paths k that produce that answer. We found that adopting this approach enhances the stability of the results, and further analysis is presented in Section §3.3.
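The weighted aggregation can be sketched as follows; the (answer, Δ) pairs are illustrative placeholders showing how a low-confidence majority answer loses to a confidently decoded minority answer:

```python
from collections import defaultdict

def aggregate_answer(paths):
    """paths: (answer, delta) pairs from the top-k decoded paths.
    Return the answer maximizing the summed answer confidence,
    rather than the raw majority vote."""
    scores = defaultdict(float)
    for answer, delta in paths:
        scores[answer] += delta
    return max(scores, key=scores.get)

# Hypothetical paths: "60" is the majority answer but decoded with
# low confidence; one CoT path answers "8" with high confidence.
paths = [("60", 0.02), ("60", 0.03), ("60", 0.05), ("8", 0.95)]
majority = max(set(a for a, _ in paths), key=[a for a, _ in paths].count)
weighted = aggregate_answer(paths)
```

Unweighted voting would return the majority answer; weighting each vote by Δ lets the single confident CoT path dominate.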

3. Experiments

Experiment Setup. For all experiments, the default input to the model is the standard QA format of Q: [question] \nA:, where [question] is filled with the actual question depending on the task, and we ask the model to continue the generation given that prefix. During decoding, we use k = 10 by default for the alternative top-k tokens at the first decoding position, and continue greedy decoding afterwards. We show ablation studies with respect to different choices of k in Section §3.1.
Datasets. For mathematical reasoning, we use the Grade-school math problems (GSM8K; Cobbe et al., 2021a) and the multi-step arithmetic dataset (MultiArith; Roy and Roth, 2015). For commonsense reasoning, we investigate the "year parity" task, which recent literature finds large language models still struggle with. The task is to query the model with "Was [person] born in an even or odd year?", where "[person]" is filled with a random celebrity name. Existing work (Allen-Zhu and Li, 2023; Berglund et al., 2023) shows that even SoTA models like GPT-4 struggle with such tasks, achieving at-chance accuracy (around 50%) when prompted directly. Additionally, we investigate symbolic reasoning tasks from Big-Bench-Hard (bench authors, 2023; Suzgun et al., 2022).
Models. We use three public models: (1) PaLM-2 (Anil et al., 2023) with different scales, ranging from X-Small, Small, Medium, and Large; (2) Mistral-7B (Jiang et al., 2023), and (3) Gemma-7B (Team et al., 2024). Our experiments primarily focus on pre-trained models, but we also include experiments with instruction-tuned models (denoted as "inst-tuned" or "IT").

3.1. CoT-Decoding Effectively Elicits Reasoning from Language Models

CoT-decoding is the only decoding strategy that effectively improves language model reasoning. In Table 4, we present results from popular decoding baselines on the Mistral-7B pre-trained model, including temperature sampling (Ackley et al., 1985; Ficler and Goldberg, 2017), top-k sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019), nucleus sampling (Holtzman et al., 2020), and beam search. We can see that CoT-decoding is the only decoding strategy that effectively enables language models to reason, while some of the decoding methods even hurt model reasoning compared to greedy decoding.

                                                GSM8K Acc
Top-k sampling
Top-p / Nucleus sampling
Beam search
Temperature sampling
Greedy decoding
Self-consistency w/o CoT prompt (10 paths)
CoT-decoding (10 paths)

Table 4 | CoT-decoding is the only decoding strategy that can significantly enhance language models' reasoning.
Figure 3 | CoT-decoding effectively elicits reasoning across multiple language model families including PaLM-2, Mistral and Gemma, with significant accuracy gains over three reasoning tasks.
CoT-decoding effectively elicits reasoning across language models. In Figure 3, we show that across three language model families, PaLM-2, Mistral and Gemma, CoT-decoding effectively elicits the models' reasoning, yielding consistent accuracy gains on both math and commonsense reasoning tasks, sometimes doubling or even tripling the performance compared to greedy decoding.
CoT-decoding elicits reasoning across model scales. In Figure 4, we show that CoT-decoding enhances reasoning across different model scales over the PaLM-2 model family. On GSM8K, CoT-decoding consistently yields absolute accuracy gains. On year parity, when using greedy decoding, the model's performance remains flat even after scaling up model sizes, consistent with the findings in (Allen-Zhu and Li, 2023). In contrast, CoT-decoding significantly boosts the performance by recovering the CoT paths, achieving almost perfect accuracy at larger model scales.
Figure 4 | CoT-decoding reliably improves reasoning performance across model scales (PaLM-2), even when the task does not naturally improve by scaling up only (e.g., year parity).
CoT-decoding partially closes the reasoning gap between pre-trained and instruction-tuned models, without using any supervised data. Intriguingly, we observe that CoT-decoding enables a pre-trained model to achieve performance similar to that of an instruction-tuned model: in Figure 4 (left), CoT-decoding on the pre-trained PaLM-2 Large model achieves accuracy close to that of the instruction-tuned model of the same scale. The results demonstrate that the effect of instruction-tuning with sufficient CoT data (Chung et al., 2022) can be partially achieved by modifying the decoding procedure within pre-trained models.
More interestingly, we observe that CoT-decoding can further improve the instruction-tuned model (Figure 4 (left) and Table 5). The instruction-tuning procedure (Chung et al., 2022) has already incorporated abundant CoT annotations during the fine-tuning process. Consequently, the model is expected to inherently generate CoT paths when addressing reasoning tasks. However, upon analyzing specific examples, we found that even after instruction-tuning, the model occasionally persists in attempting to

Table 5 | CoT-decoding improves both pre-trained and instruction-tuned Mistral-7B models.

                            Pre-trained   Inst-tuned
GSM8K         Greedy             9.9         31.2
              CoT-decoding
MultiArith    Greedy            14.3         37.8
              CoT-decoding
Year Parity   Greedy            35.0         62.2
              CoT-decoding
directly address a question. In contrast, CoT-decoding can enhance the exploration of alternative paths by triggering a CoT first, consequently leading to more accurate answers.
Choice of k. In Figure 5, we illustrate how the choice of k, representing the number of top alternative tokens considered, influences the overall accuracy. Overall we found that higher values of k typically result in improved model performance, suggesting that in many cases, the correct CoT paths may indeed exist but are often ranked lower during the model's decoding. For instruction-tuned models, the effect of k is less significant, indicating that the process of instruction-tuning effectively brings the majority of CoT-paths forward into the first few decoding paths.

Figure 5 | The effect of k on reasoning accuracy w.r.t. PaLM-2 model scales and task difficulty.

3.2. CoT-decoding Enables a Better Understanding of Model's Intrinsic Reasoning Abilities

Compared to existing works that improve model's reasoning via better human-written prompts, a key distinction of our proposed approach lies in the complete elimination of human-provided prompts. This modification enables a more truthful assessment of a language model's intrinsic reasoning capabilities. In the previous section, we show that language models inherently possess reasoning capabilities for grade-school-level math problems and simple commonsense reasoning tasks. In this section, we will systematically vary the difficulty levels of synthetic tasks to gain a more comprehensive understanding of language models' inherent reasoning abilities via CoT-decoding.
We consider the following symbolic reasoning tasks: (1) the Coin Flip task from (Wei et al., 2022), with 2, 3, or 4 rounds of potential flips; and two tasks from Big-Bench-Hard (BIG-bench authors, 2023; Suzgun et al., 2022): (2) Web of Lies, with 3, 4, or 5 truth/lie statements, and (3) Multi-step Arithmetic, with varying depth d and length l. For each task, we produce 100 examples per difficulty level, except for Web of Lies (5), where we use the existing dataset from (Suzgun et al., 2022). We also include two natural-language-based but synthetic tasks from Big-Bench, Sports Understanding and Object Counting, to probe the model's intrinsic abilities in solving synthetic tasks.
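As an illustration of the difficulty-controlled generation above, a Coin Flip instance with a given number of potential flip rounds could be produced as in the following sketch; the names and phrasing are our own assumptions, not the exact templates used for the tasks.

```python
import random

def coin_flip_example(n_rounds, rng):
    """Generate one Coin Flip question at difficulty n_rounds, plus its
    gold yes/no answer. A flip toggles the coin's heads-up state."""
    names = ["Alice", "Bob", "Carol", "Dave", "Erin"]
    heads_up = True
    steps = []
    for name in rng.sample(names, n_rounds):
        does_flip = rng.random() < 0.5
        heads_up = heads_up != does_flip  # toggle only on an actual flip
        verb = "flips" if does_flip else "does not flip"
        steps.append(f"{name} {verb} the coin.")
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, ("yes" if heads_up else "no")
```

The gold answer is "yes" exactly when an even number of actual flips occurred, which is the parity invariant the model must track step by step.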
Task | Coin Flip | Web of Lies | Multi-step Arithmetic | Sports Understanding | Object Counting
Difficulty | 2 / 3 / 4 | 3 / 4 / 5 | four (d, l) settings | — | —
Greedy | 70.0 / 53.0 / 48.0 | 76.0 / 58.0 / 53.6 | 39.0 / 19.0 / 8.0 / 0.0 | 58.8 | 36.0
CoT-decoding | 58.0 (the remaining values in this row were not preserved in this snapshot)

Table 6 | The model's intrinsic reasoning ability varies depending on the task difficulty levels.
The presence of correct CoT paths depends on the task difficulty level and correlates with task prominence in the pre-training distribution. The results in Table 6 (on PaLM-2 L) show that although CoT-decoding helps elicit better reasoning across almost all tasks, the gains vary significantly with task difficulty: the simpler the task, the better the chance that a correct reasoning path can be found. We also examined the model's top-k decoding paths and found that the model can generate correct CoT paths when the solution involves at most 1 or 2 steps of knowledge manipulation, but starts to struggle when 3 or more steps are required. See Figure 5 (right), where the model's accuracy improves only for larger k's as task complexity increases (higher d's and l's). This phenomenon suggests that correct CoT paths become harder to find as the task becomes more synthetic. This mirrors the finding in (McCoy et al., 2023), where the authors show language models are highly influenced by the distribution they have been trained on.
CoT-decoding unveils the model's intrinsic vulnerabilities in reasoning. Our results also reveal the specific areas where language models still struggle: for example, on Coin Flip and Web of Lies, we observe that the model can generate CoT paths that simulate the process step by step, but it easily loses track of the states, especially as task complexity increases. This reveals the model's intrinsic vulnerability in performing accurate state tracking. On Multi-step Arithmetic, we observe that the model tends to perform calculations from left to right in its CoT-decoding paths, rather than following the correct mathematical order of operations. These observations point to concrete directions for future model improvement.
In addition, over these synthetic tasks, we found that existing CoT prompts on Big-Bench-Hard (Suzgun et al., 2022) play a larger "teaching" role in guiding the model to solve such tasks, and in most cases the model just mimics the patterns in the CoT prompts to generate the correct response: e.g., the few-shot CoT prompts teach the model to perform explicit state tracking in each step for Web-of-lies. On the Sports Understanding task, CoT-decoding can better reveal LLMs' intrinsic strategy in solving a problem (see Appendix A), without being influenced by the external prompts which could be biased by the prompt designers. In contrast, few-shot CoT prompting constrains the model to follow an artificial strategy curated through human knowledge and intervention.

3.3. Combining CoT-decoding with CoT-Prompting

We further show that CoT-decoding can be easily combined with CoT-prompting, yielding even larger reasoning gains over multiple language models (Table 7). CoT-decoding maintains strong performance compared to self-consistency (Wang et al., 2023a) when both are combined with CoT prompts. Since self-consistency aggregates over multiple paths, we also report performance based on our path-aggregation algorithm, which significantly improves the model's reasoning at a similar cost. For a fair comparison, we use the same number of decoding paths for all methods that require multiple paths.
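The path-aggregation idea can be sketched in a few lines. This is a hedged illustration rather than the exact algorithm: each decoded path contributes its answer-confidence margin to its final answer, and the answer with the largest total wins. The (answer, margin) pairs below are invented for illustration.

```python
from collections import defaultdict

def aggregate_paths(decoded):
    """decoded: list of (answer, margin) pairs, one per top-k path.
    Returns the answer with the largest summed confidence margin."""
    totals = defaultdict(float)
    for answer, margin in decoded:
        totals[answer] += margin
    return max(totals, key=totals.get)

paths = [("64", 0.9), ("60", 0.95), ("64", 0.4), ("64", 0.3)]
# "max path" would pick "60" (single margin 0.95); aggregation picks "64"
# (total 0.9 + 0.4 + 0.3 = 1.6), which is more robust to one
# overconfident wrong path.
```

This mirrors how self-consistency benefits from majority voting, but weights each vote by the decoded answer's confidence.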
Method | Mistral-7B | PaLM-2 L | Compute

Methods without CoT prompt:
- Greedy decoding
- Self-consistency without CoT
- CoT-decoding (max path)
- CoT-decoding (agg path)

Methods with CoT prompt:
- Zero-shot CoT prompting
- Self-consistency with zero-shot CoT prompt
- CoT-decoding (max path) + zero-shot CoT prompt
- CoT-decoding (agg path) + zero-shot CoT prompt

(The accuracy and compute values were not preserved in this snapshot.)
Table 7 | Adding CoT-decoding on top of zero-shot CoT-prompting can further boost the reasoning performance on both models. The accuracy number here is computed over the GSM8K test set.
4. Related Work

Chain-of-thought reasoning in large language models. Existing work on enhancing the reasoning abilities of large language models predominantly proposes better prompting techniques to elicit CoT reasoning paths from the model (Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022; Yao et al., 2023; Yasunaga et al., 2023; Zhou et al., 2023a). Despite achieving high performance, few-shot prompting techniques are often task-specific, requiring prompt designs tailored to each task, which limits their generalizability across tasks. Advanced prompting techniques often require manually intensive prompt engineering, and their effectiveness varies with the choice of prompts, resulting in inconsistent performance outcomes (Wang et al., 2022; Ye and Durrett, 2022; Zhou et al., 2023b). Efforts to discover improved prompts (Yang et al., 2024; Zhou et al., 2023b) further entail model-specific and task-specific tuning.
In addition, these prompting techniques can subtly alter the vocabulary's posterior distribution in ways that remain largely elusive (Min et al., 2022; Webson and Pavlick, 2022). Specifically, prompts may assist in task decomposition, induce the model to generate additional tokens, or directly "teach" the model the exact underlying procedure to solve particular problems via manually crafted few-shot demonstrations. Dissecting the distinct influence of each aspect, however, presents a significant challenge. In contrast, our work explores a different perspective within the decoding stage, demonstrating that, even without explicit prompting, the model inherently holds the capability to generate chain-of-thought reasoning paths across a wide set of tasks.
Recent work proposes to improve the CoT generation process by better controlling and verifying the steps generated, e.g., step-by-step verification (Lightman et al., 2023), process-based feedback (Uesato et al., 2022), self-evaluation guided beam search (Xie et al., 2023), and PathFinder (Golovneva et al., 2023). Note that all these works still require CoT prompting to generate the CoT reasoning paths, whereas our work removes CoT prompting entirely. In addition, these works focus on searching over and verifying the "steps" produced by the language model, while ours searches purely in the token-level decoding space and utilizes confidence scores when decoding the answer.
Additionally, recent works (Feng et al., 2023; Li et al., 2023b; McCoy et al., 2023; Prystawski et al., 2023; Razeghi et al., 2022) demonstrate a similar phenomenon, where the pre-training distribution heavily influences the model's performance in few-shot reasoning.
Instruction-tuning to elicit CoTs in language models. When supervision is allowed, techniques such as instruction-tuning or distillation offer another way to elicit reasoning paths from language models without explicit prompting (Chung et al., 2022; Huang et al., 2023; Magister et al., 2023). However, these approaches typically involve resource-intensive fine-tuning of large language models and require a large set of examples annotated with CoTs, which may not be readily available.
Liu et al. (2024) show that a language model can be tuned by a proxy. Their method requires a few additional models, and implicitly assumes that the tuned model is well-optimized, e.g., on reasoning benchmarks the model needs to be tuned with CoT paths to enable contrasting logits with respect to the base untuned model. In contrast, our approach is entirely unsupervised and examines a model's intrinsic ability in generating CoT paths, without resorting to fine-tuning or any additional models.
Decoding algorithms for language models. The predominant focus in existing literature on decoding for language models revolves around aspects such as fluency, coherence, reduction of repetitiveness, and diversity in responses. Popular decoding algorithms used for language models include greedy decoding, temperature sampling (Ackley et al., 1985; Ficler and Goldberg, 2017), top-k sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019), and nucleus sampling (Holtzman et al., 2020). Additionally, there exist refined algorithms such as minimum Bayes risk decoding (Eikema and Aziz, 2020), and typical decoding (Meister et al., 2022). Diverse beam search (Vijayakumar et al., 2018) is another way to explore alternative paths in a model's generation. However, it emphasizes generation diversity rather than accuracy.
There is relatively little research dedicated to enhancing decoding algorithms specifically for reasoning tasks. Wang et al. (2023a) improve upon CoT prompting by sampling and aggregating over multiple generated responses. Contrastive decoding (Li et al., 2023a) improves a model's generation quality by penalizing the logits from smaller models, and recent work (O'Brien and Lewis, 2023) shows that contrastive decoding can also enhance reasoning performance. Shi et al. (2023) propose context-aware decoding to improve the faithfulness of language models. These approaches typically require additional information, such as additional models to generate contrasting logits or additional contexts. In contrast, our work relies solely on a single model without the need for supplementary knowledge.
Decoding algorithms for efficiency. In addition to decoding algorithms for improving quality, there is a substantial body of research dedicated to improving decoding efficiency, e.g., speculative decoding (Chen et al., 2023a; Leviathan et al., 2022; Zhou et al., 2024). This line of work is orthogonal to our work as their primary focus is not on improving a model's reasoning performance. However, these techniques could potentially be leveraged to improve the efficiency of CoT-decoding.

5. Conclusion and Discussion

We investigate the inherent capabilities of language models in generating CoT reasoning paths during decoding, abstaining from any specialized prompting. Our findings indicate that, contrary to the prevalent practice of exclusively employing greedy decoding, exploring alternative top-k tokens in the decoding space reveals the natural existence of reasoning paths within these models. Furthermore, our empirical observations highlight that the presence of a CoT reasoning path correlates with increased model confidence in decoding its final answer. Based on this observation, we introduce CoT-decoding to extract more reliable decoding paths from language models, thereby enhancing their overall reasoning performance.
Discussion and Limitations. The exploration of alternative decoding paths incurs additional computational cost. Future work could leverage the CoT-decoding paths to fine-tune the model and further enhance its reasoning capabilities. Additionally, when answers are more open-ended, the probability difference between the top two tokens may be a less precise indicator of how strongly the model prefers one answer over another. While existing work (Burns et al., 2023) leverages the model's activation space to uncover latent knowledge, its applicability is restricted to yes-no questions. We hope that future research can address this limitation by delving deeper into the model's internal representations across a broader, more open-ended answer space.
Furthermore, our current exploration branches at the first token; future work can explore branching at any token and searching for the best possible paths during the decoding phase. The computational cost will be substantially higher, though, and how to reliably identify the best token during the search is an interesting direction to explore.

Acknowledgements

We would like to thank Yongchao Zhou, Yifeng Lu, Dale Schuurmans, and Ed Chi for helpful discussion and feedback on this work.

References

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169, 1985. ISSN 0364-0213. URL https://www.sciencedirect.com/science/article/pii/S0364021385800124.
Z. Allen-Zhu and Y. Li. Physics of language models: Part 3.2, knowledge manipulation, 2023.
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
BIG-bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. C. Stickland, T. Korbak, and O. Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A", 2023.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020.
C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGubyOhcs.
C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. M. Jumper. Accelerating large language model decoding with speculative sampling. ArXiv, abs/2302.01318, 2023a. URL https://api.semanticscholar.org/CorpusID:256503945.
W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023b. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1-113, 2023.

  1. The QA format is only needed because without it a pre-trained language model will continue the question instead of answering. It is also the most basic formatting employed in existing works for pre-trained models.
  2. [GSM8K] Kylar went to the store to buy glasses for his new apartment. One glass costs , but every second glass costs only of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?
     Greedy path: (0.029)
     Alternative top-k paths:
     (0.058)
     : Kylar needs to pay for 16 glasses. (0.058)
     : If Kylar buys 16 glasses, he will pay . (0.032)
     : We can calculate the price of 16 glasses by multiplying the price of one glass by 16. However, we need to remember that every second glass costs only of the price. Therefore, we need to multiply the price of one glass by 16 and then subtract of the price of 8 glasses. Kylar needs to pay for 16 glasses. (0.994)
  3. We also considered other popular choices for measuring the model's uncertainty (Settles, 2009), e.g., using the model's probability of the top token alone rather than the margin between the top two, which performs slightly worse than the min-margin approach. In addition, an entropy estimate is not accurate due to the large vocabulary size in LLMs and the common use of vocabulary truncation.
  4. We curate a list of the top 100 celebrity names from (Berglund et al., 2023): https://github.com/lukasberglund/reversal_curse/blob/main/data/celebrity_relations/top_celebrities.txt