
Reasoning with Language Model is Planning with World Model

Shibo Hao*♣  Yi Gu*♣  Haodi Ma♦  Joshua Jiahua Hong♣

Zhen Wang♣♠  Daisy Zhe Wang♦  Zhiting Hu♣

♣UC San Diego, ♦University of Florida

♠Mohamed bin Zayed University of Artificial Intelligence

{s5hao, yig025, jjhong, zhw085, zhh019}@ucsd.edu
{ma.haodi, daisyw}@ufl.edu
*Equal contribution
Abstract

Large language models (LLMs) have shown remarkable reasoning capabilities, particularly with chain-of-thought (CoT) prompting. However, LLMs sometimes still struggle with problems that are easy for humans, such as generating action plans to achieve given goals in an environment, or performing complex math or logical reasoning. The deficiency stems from the key fact that LLMs lack an internal world model to predict the world state (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, Reasoning via Planning (RAP). RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm based on Monte Carlo Tree Search for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and rewards, and efficiently obtains a high-reward reasoning path with a proper balance between exploration vs. exploitation. We apply RAP to various challenging reasoning problems including plan generation, math reasoning, and logical inference, and demonstrate its superiority over strong baselines. RAP with LLaMA-33B even surpasses CoT with GPT-4, achieving 33% relative improvement in a plan generation setting.¹

¹The code is available at https://github.com/Ber666/llm-reasoners

1 Introduction

Figure 1: An overview of Reasoning via Planning (RAP). Compared with previous LLM reasoning methods like Chain-of-Thought Wei et al. (2022), we explicitly model the world state from a world model (repurposed from the language model), and leverage advanced planning algorithms to solve the reasoning problems.

Large language models (LLMs) have exhibited emergent reasoning abilities in a wide range of tasks Brown et al. (2020); Chowdhery et al. (2022); OpenAI (2023). Recent approaches further boost their ability by prompting LLMs to generate intermediate reasoning steps, e.g., Chain-of-Thought, CoT Wei et al. (2022) or answer a series of subquestions, e.g., least-to-most prompting Zhou et al. (2022). However, LLMs still face difficulties with tasks that humans find easy. For example, in creating action plans to move blocks to a target state, GPT-3 Brown et al. (2020) achieves a success rate of only 1%, compared to 78% for humans Valmeekam et al. (2022); these models also struggle with complex tasks that require multiple steps of math, logical, or commonsense reasoning Huang and Chang (2022); Mialon et al. (2023).

Humans possess an internal world model, a mental representation of the environment Johnson-Laird (1983, 2010); Gentner and Stevens (2014), which enables humans to simulate actions and their effects on the world’s state for deliberate planning for complex tasks of motor control, imagery, inference, and decision making Tolman (1948); Briscoe (2011); Schulkin (2012); LeCun (2022). For example, to make an action plan towards a goal, planning with the world model involves exploring various alternative courses of actions, assessing the likely outcomes by rolling out possible future scenarios, and iteratively refining the plan based on the assessment Huys et al. (2012); Gasparski and Orel (2014); Ho et al. (2021). This is in stark contrast to the current LLM reasoning, which instinctively generates a reasoning trace in an autoregressive manner. In particular, we identify several key limitations of the current reasoning with LLMs, including (1) the lack of an internal world model to simulate the state of the world (e.g., the configuration of blocks, the values of intermediate variables), which is the foundation of human planning; (2) the absence of a reward mechanism to assess and guide the reasoning towards the desired state; and due to both limitations, (3) the incapability of balancing exploration vs. exploitation to efficiently explore vast reasoning space.

To address these limitations, this paper proposes a new framework, Reasoning via Planning (RAP), that enables LLMs to reason in a manner close to humans’ conscious planning. RAP augments the LLM with a world model, and reasons with principled planning (specifically Monte Carlo Tree Search, MCTS) to produce high-reward reasoning traces after efficient exploration (Figure 1). Notably, we acquire the world model by repurposing the LLM itself with appropriate prompts. During the reasoning, the LLM strategically builds a reasoning tree by iteratively considering the most promising reasoning steps (actions) and using the world model (the same, repurposed LLM) to look ahead for future outcomes. The estimated future rewards are then backpropagated to update the LLM’s beliefs about the current reasoning steps, guiding it to refine the reasoning by exploring better alternatives. Our MCTS-based planning effectively maintains a proper balance between exploration (of unvisited reasoning traces) and exploitation (of the best reasoning steps identified so far).

Figure 2: RAP for plan generation in Blocksworld (left), math reasoning in GSM8K (middle), and logical reasoning in PrOntoQA (right).

We show RAP is a general framework applicable to a diverse range of challenging problems and achieves substantial improvements over recent popular LLM reasoning methods. For plan generation, particularly in 2/4/6-step problems of Blocksworld Valmeekam et al. (2023), RAP achieves an average success rate of 64% while CoT fails almost completely. Moreover, LLaMA-33B with RAP surpasses GPT-4 with CoT by 33% relative improvement. In the domains of mathematical reasoning, such as GSM8K Cobbe et al. (2021) and logical inference exemplified by PrOntoQA Saparov and He (2022), RAP also consistently improves over strong baselines, including CoT, least-to-most prompting, and their self-consistency variants.

2 Related Work

Reasoning with LLMs. LLM reasoning typically involves decomposing complex questions into sequential intermediate steps (a.k.a. chains) before producing the final answer, exemplified by Chain-of-Thought (CoT) prompting and its variants Wei et al. (2022); Kojima et al. (2022). The basic CoT generates chains all at once and can induce additional errors as the step count increases. Self-Consistency Wang et al. (2022) samples multiple chains to choose the best answer via majority voting. Least-to-most prompting Zhou et al. (2022) reduces the question into simpler subquestions and answers them sequentially. Similar to our reward formulation, recent works have explored self-evaluation approaches to provide feedback for intermediate steps Welleck et al. (2022); Shinn et al. (2023); Paul et al. (2023). Aligned with our state formulation, Li et al. (2022) incorporate latent “situations” into LLMs, referring to the state of entities from the context. More relevantly, recent works have started to explore more complex structures guided by some search algorithms. For instance, CoRe Zhu et al. (2022) fine-tunes reasoning step generator and verifier for math word problems with MCTS for decoding. Concurrently to our work, Yao et al. (2023) apply heuristic-based search, like depth-/breadth-first search, for better reasoning paths. However, none of the above methods formally introduce the world model and instantiates the reward and state into a unified framework. Compared with these search-guided methods, RAP is a more principled framework to combine world model and reward with advanced planning.

Planning with LLMs. Planning, a central ability in intelligent agents, involves generating a series of actions to achieve a specific goal McCarthy (1963); Bylander (1994). Classical planning methods have been widely adopted in robots and embodied environments Camacho and Alba (2013); Jiang et al. (2019). Recently, prompting LLMs to do planning directly has gained attention and shown potential Huang et al. (2022); Singh et al. (2022); Ding et al. (2023). Moreover, based on LLMs’ powerful programming ability Lyu et al. (2023); Jojic et al. (2023); Liu et al. (2023), recent works first translate natural language instructions into executable programming languages, such as the Planning Domain Description Language (PDDL), and run classical planning algorithms, such as LLM+P Liu et al. (2023). However, code-based planning is constrained by its narrow domains and the environment, while RAP can handle open-domain problems, such as math and logical reasoning. More related works on world models and planning are discussed in Appendix D.

3 Reasoning via Planning (RAP)

In this section, we present the Reasoning via Planning (RAP) framework that enables LLMs to strategically plan a coherent reasoning trace for solving a wide range of reasoning tasks. We first build the world model by repurposing the LLM with prompting (Section 3.1). The world model serves as the foundation for deliberate planning, by allowing the LLM to plan ahead and seek out the expected outcomes in the future. We then introduce the rewards for assessing each state during reasoning in Section 3.2. Guided by the world model and rewards, the planning with Monte Carlo Tree Search (MCTS) efficiently explores the vast reasoning space and finds optimal reasoning traces (Section 3.3). Finally, when multiple promising reasoning traces are acquired during planning, we further introduce an aggregation method in Section 3.4 that yields an ensembled result and further boosts the reasoning performance.

3.1 Language Model as World Model

In general, a world model predicts the next state of the reasoning after applying an action to the current state Ha and Schmidhuber (2018b); Matsuo et al. (2022). RAP enables us to instantiate the general concepts of state and action in different ways depending on the specific reasoning problems at hand (Figure 2). For example, in Blocksworld, it is natural to define a state as the configuration of blocks (described in natural language), and an action to be a behavior of moving a block (e.g., ‘‘pickup the orange block’’). In a math reasoning problem, we use the state to represent the values of intermediate variables, and set an action to be a subquestion that drives the reasoning to derive new values. In logical reasoning, a state is a fact we are focusing on, and an action is to choose a rule for the next deduction.

With the definition of state and action, the reasoning process can thus be described as a Markov decision process (MDP): given the current state $s_t$, $t=0,1,\dots,T$, e.g., the initial state $s_0$, the LLM (as a reasoning agent) generates an action space by sampling from its generative distribution $a_t \sim p(a \mid s_t, c)$, where $c$ is a proper prompt (e.g., in-context demonstrations). Once an action is chosen, the world model then predicts the next state $s_{t+1}$ of the reasoning. Specifically, we repurpose the same LLM to obtain a state transition distribution $p(s_{t+1} \mid s_t, a_t, c')$, where $c'$ is another prompt to guide the LLM to generate a state. For instance, in Blocksworld, the LLM (as the world model) generates text $s_{t+1}$ to describe the new configuration of blocks, given the previous state $s_t$ and the action $a_t$.
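To make this formulation concrete, below is a minimal Python sketch of one reasoning step and of unrolling a full trace, assuming a hypothetical `llm` client with a `sample(prompt, n)` method; the prompt templates and helper names are illustrative and not the paper's actual prompts.

```python
def propose_action(llm, state: str, agent_prompt: str) -> str:
    # The LLM acts as the reasoning agent: a_t ~ p(a | s_t, c).
    return llm.sample(f"{agent_prompt}\nState: {state}\nNext action:", n=1)[0]

def predict_next_state(llm, state: str, action: str, world_prompt: str) -> str:
    # The same LLM, under a different prompt c', acts as the world model:
    # s_{t+1} ~ p(s | s_t, a_t, c').
    return llm.sample(
        f"{world_prompt}\nState: {state}\nAction: {action}\nNew state:", n=1
    )[0]

def rollout_trace(llm, s0: str, agent_prompt: str, world_prompt: str, T: int) -> list[str]:
    # Unrolled, this produces the interleaved trace (s_0, a_0, s_1, ..., s_T).
    trace, state = [s0], s0
    for _ in range(T):
        action = propose_action(llm, state, agent_prompt)
        state = predict_next_state(llm, state, action, world_prompt)
        trace += [action, state]
    return trace
```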

Continuing the process results in a reasoning trace, which consists of a sequence of interleaved states and actions $(s_0, a_0, s_1, \dots, a_{T-1}, s_T)$. This differs from previous reasoning methods, such as Chain-of-Thought Wei et al. (2022), where the intermediate reasoning steps consist of only a sequence of actions, e.g., ($a_0 =$ “pickup red block”, $a_1 =$ “stack on yellow block”, …) (see comparisons in Figure 1). Augmenting the reasoning with the (predicted) world states helps the LLM with a more grounded and coherent inference. Note that the full reasoning trace is simulated by the LLM itself (as a reasoning agent with an internal world model) without interacting with the external real environment. This resembles humans contemplating a possible plan in their minds. The capability of simulating future states, by introducing the world model, allows us to incorporate principled planning algorithms to efficiently explore the vast reasoning space as described in Section 3.3.

3.2 Reward Design

During reasoning, we want to assess the feasibility and desirability of each reasoning step, and guide the reasoning based on the assessment (Section 3.3). The assessment of each reasoning step (i.e., applying an action $a_t$ to the state $s_t$) is performed by a reward function $r_t = r(s_t, a_t) \in \mathbb{R}$. Similar to the state and action, the reward function can be specified in different ways to accommodate any knowledge or preferences about the reasoning problem of interest. Here we introduce several common rewards applicable to different tasks and shown to be effective in our experiments.

Likelihood of the action. When an action is generated by the LLM conditioning on the in-context demonstration and the current state, the probability of the specific action reflects the LLM’s preference. We thus can incorporate the log probability of the action as a reward. This reward reflects the “instinct” of LLMs as an agent, and can be also used as a prior for which action to explore.
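As a hedged illustration, the sketch below computes this reward via a hypothetical `llm.token_logprobs(context, continuation)` call that returns per-token log-probabilities of the continuation; the length normalization is an optional choice of this sketch, not prescribed by the paper.

```python
def action_likelihood_reward(llm, state: str, action: str, agent_prompt: str) -> float:
    # Log-probability of the action text conditioned on the in-context
    # demonstrations (agent_prompt) and the current state; this reflects the
    # agent's "instinct" for the step.
    logprobs = llm.token_logprobs(
        context=f"{agent_prompt}\nState: {state}\nNext action:",
        continuation=f" {action}",
    )
    return sum(logprobs) / max(len(logprobs), 1)  # length-normalized log-likelihood
```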

Confidence of the state. State prediction is nontrivial in some problems, e.g., in math reasoning (Figure 2, middle), given an action (i.e., a subquestion), the world model predicts the next state by answering the subquestion. We incorporate the confidence of the state (i.e., answers in this case) as a reward. Specifically, we draw multiple sample answers from the world model, and use the proportion of the most frequent answer as the confidence. Higher confidence indicates that the state prediction is more consistent with the world knowledge of LLMs Hao et al. (2023b), which typically leads to a more reliable reasoning step.
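One possible implementation, again assuming the hypothetical `llm.sample` interface; the sample count `n` and the answer normalization are free choices of this sketch.

```python
from collections import Counter

def state_confidence_reward(llm, world_prompt: str, subquestion: str, n: int = 8):
    # Draw n sampled answers to the subquestion and use the frequency of the
    # most common (normalized) answer as the confidence reward.
    answers = [a.strip() for a in llm.sample(f"{world_prompt}\n{subquestion}\nAnswer:", n=n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # (predicted value for the next state, confidence in [0, 1])
```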

Self-evaluation by the LLM. It’s sometimes easier to recognize the errors in reasoning than avoid generating them in advance. Thus, it’s beneficial to allow the LLM to criticize itself with the question “Is this reasoning step correct?”, and use the next-word probability of the token “Yes” as a reward. The reward evaluates LLM’s own estimation of the correctness of reasoning. Note that the specific problems for self-evaluation can be different depending on the tasks.
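For example, one might read off the probability of the token “Yes” through a hypothetical `llm.next_token_probs` call, as sketched below; the exact evaluation question is task-dependent.

```python
def self_eval_reward(llm, context: str, step: str, eval_prompt: str) -> float:
    # Ask the LLM to critique its own step and use P("Yes") for the next token
    # as the reward.
    probs = llm.next_token_probs(
        f"{eval_prompt}\n{context}\nStep: {step}\nIs this reasoning step correct? Answer:"
    )
    return probs.get(" Yes", probs.get("Yes", 0.0))
```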

Task-specific heuristics. RAP also allows us to flexibly plug in other task-specific heuristics into the reward function. For example, in plan generation for Blocksworld, we compare the predicted current state of blocks with the goal to calculate a reward (Section 4.1). The reward encourages the plan of movements to actively pace towards the target.

3.3 Planning with Monte Carlo Tree Search

Figure 3: An illustration of the four phases in an iteration in MCTS planning (Section 3.3).

Once equipped with the world model (Section 3.1) and rewards (Section 3.2), LLMs can reason with any planning algorithms. We adopt Monte Carlo Tree Search (MCTS) Kocsis and Szepesvári (2006); Coulom (2007), a powerful planning algorithm that strategically explores the space of reasoning trees and strikes a proper balance between exploration and exploitation to find high-reward reasoning traces efficiently.

Specifically, MCTS builds a reasoning tree iteratively, where each node represents a state, and each edge represents an action and the transition from the current state to the next state after applying the action (Figure 1). To guide the LLM agent to expand and explore the most promising nodes of the tree, the algorithm maintains a state-action value function $Q: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, where $Q(s, a)$ estimates the expected future reward of taking action $a$ in state $s$. Figure 3 illustrates the four operations in each iteration to expand the tree and update $Q$ values. The process continues until a specified computational budget (e.g., the number of iterations) is reached, and the resulting traces are acquired from the tree. More details and the pseudo-code of the planning algorithm are given in Appendix A and Algorithm 1.

Selection. The first phase selects a portion of the existing tree that is most promising for further expansion in the next phase. Starting from the root node (i.e., initial state $s_0$), at each level of the tree, the algorithm selects a child node as the next node. The phase finishes when a leaf node of the current tree is reached. Figure 3(a) highlights the selected path in red. To balance between exploration (of less-visited nodes) and exploitation (of high-value nodes), we use the well-known Upper Confidence bounds applied to Trees (UCT) Kocsis and Szepesvári (2006) to select each child node. Specifically, at node $s$, we select the action in the tree by considering both the $Q$ value (for exploitation) and the uncertainty (for exploration):

$$a^{\ast} = \arg\max_{a \in A(s)} \left[ Q(s, a) + w \sqrt{\frac{\ln N(s)}{N(c(s, a))}} \right], \qquad (1)$$

where $N(s)$ is the number of times node $s$ has been visited in previous iterations, and $c(s, a)$ is the child node reached by applying $a$ in state $s$. The less a child node has been visited before (i.e., the more uncertain we are about it), the larger the second term becomes. The weight $w$ controls the balance between exploration and exploitation.
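The selection rule in Equation (1) takes only a few lines to implement. The sketch below assumes each tree node stores a visit count `N`, a value estimate `Q`, and a `children` map from actions to child nodes; these field names are our own, not the paper's.

```python
import math

def uct_select(node, w: float = 1.0):
    # Pick the child maximizing Q(s, a) + w * sqrt(ln N(s) / N(c(s, a))).
    def uct_score(child):
        if child.N == 0:
            return float("inf")  # prioritize unvisited children
        return child.Q + w * math.sqrt(math.log(max(node.N, 1)) / child.N)
    return max(node.children.values(), key=uct_score)
```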

Expansion. This phase expands the tree by adding new child nodes to the leaf node selected above. Given the state of the leaf node, we use the LLM (as agent) to sample $d$ possible actions (e.g., subquestions in math reasoning), and then use the LLM (as world model) to predict the respective next states, resulting in $d$ child nodes. Note that if the leaf node selected above is already a terminal node (the end of a reasoning chain), we skip expansion and jump to back-propagation.

Simulation. To estimate the expected future rewards ($Q$ values), this phase simulates the future situations of the current node using the world model. Starting from the current node as above, at each node $s$, we create an action following a roll-out policy and use the world model to predict the next state. The roll-out process continues until a terminal state is reached. There could be many ways to define the roll-out policy (e.g., by adding different randomness). In our experiments, for simplicity and reduced noise, we follow a process similar to the expansion above, i.e., generating $d$ candidate actions and picking the one with the largest local reward, $a' = \arg\max_{a} r(s, a)$. In practice, for efficiency, we discard the computationally costly components in $r$ (e.g., the reward from the confidence of state requires sampling the answer multiple times), and use the resulting lightweight reward function for selecting actions during simulation.
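A rough sketch of this roll-out, reusing the agent and world-model helpers from Section 3.1; `is_terminal`, the prompts, and the `cheap_reward` function (the lightweight reward) are assumptions of this illustration.

```python
def simulate(llm, state: str, cheap_reward, is_terminal,
             agent_prompt: str, world_prompt: str, d: int = 4, max_depth: int = 10):
    # Roll out with the world model, greedily picking the candidate action with
    # the largest lightweight local reward at each step, until a terminal state.
    rewards = []
    for _ in range(max_depth):
        if is_terminal(state):
            break
        candidates = [propose_action(llm, state, agent_prompt) for _ in range(d)]
        action = max(candidates, key=lambda a: cheap_reward(state, a))
        rewards.append(cheap_reward(state, action))
        state = predict_next_state(llm, state, action, world_prompt)
    return rewards  # per-step rewards along the simulated future
```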

Back-propagation. Once we reach a terminal state in the above phases, we obtain a reasoning path from the root node to the terminal node. We now back-propagate the rewards on the path to update the $Q$ value of each state-action pair along the path. Specifically, we update $Q(s, a)$ by aggregating the rewards in all future steps of node $s$.
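One simple way to realize this update, consistent with the node fields assumed in the UCT sketch above, is an incremental mean of cumulative future rewards; the exact aggregation (sum, mean, or the max of averages used in Sections 4.2 and 4.3) is a design choice.

```python
def backpropagate(path, rewards):
    # `path` is the list of visited nodes from the root to the terminal node;
    # rewards[t] is the reward of the step taken at path[t].
    for t, node in enumerate(path):
        future_return = sum(rewards[t:])             # rewards of all future steps
        node.N += 1
        node.Q += (future_return - node.Q) / node.N  # incremental mean update
```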

Once a predetermined number of MCTS iterations is reached, we terminate the algorithm and select the final reasoning trace from the constructed tree for evaluation. There are various ways to make the selection. One is to start from the root node and iteratively choose the action with the highest $Q$ value until reaching a terminal node. Alternatively, one can directly select the path from the iteration that yielded the highest reward, or choose the leaf node (and the respective root-to-leaf path) that has been visited the most. In practice, we observed that the second strategy often yields the best results.

3.4 RAP-Aggregation

For problems such as math reasoning (Section 4.2), where only the final answer is required, RAP can produce multiple traces and answers from different MCTS iterations, which are aggregated to produce the final answer. We refer to this mechanism as RAP-Aggregation. Note that problems like plan generation or logical inference require a complete reasoning trace as output; thus, RAP-Aggregation is not applied to them.
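A minimal sketch of one plausible aggregation scheme, weighting each answer by the reward of the trace that produced it; the exact weighting is our assumption and is not pinned down by the text here.

```python
from collections import defaultdict

def rap_aggregate(iterations):
    # `iterations` is a list of (final_answer, trace_reward) pairs collected
    # from different MCTS iterations; return the answer with the largest
    # accumulated reward across traces.
    votes = defaultdict(float)
    for answer, reward in iterations:
        votes[answer] += reward
    return max(votes, key=votes.get)
```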

4 Experiments

In this section, we demonstrate the flexibility and effectiveness of our RAP framework by applying it to a wide range of problems, including plan generation in an embodied environment (4.1), mathematical reasoning for solving math word problems (4.2), and logical reasoning for verifying hypotheses (4.3). The subsequent sections demonstrate how the world model formulation in RAP enables a versatile design of the state and action, catering to various reasoning contexts.

We primarily compare RAP with chain-of-thought (CoT) Wei et al. (2022), and its variants like least-to-most prompting Zhou et al. (2022) as baselines. We also consider ensembling multiple reasoning paths if applicable (also known as self-consistency Wang et al. (2022)). Moreover, we compare RAP with GPT-4 OpenAI (2023) when computation resources allow. By default, we use the LLaMA-33B model Touvron et al. (2023a) as the base LLM for both our methods and baselines, with a sampling temperature of 0.8. All prompts are listed in Appendix C.

4.1 Plan Generation

Figure 4: Comparing reasoning traces in Blocksworld from CoT (left) and RAP (right).

The plan generation task aims to produce a sequence of actions to achieve a given goal, possibly with additional constraints. The ability to generate plans is important for intelligent embodied agents, e.g. household robots Puig et al. (2018).

Task setup. To explore the viability of the RAP framework for plan generation tasks, we adapt and evaluate RAP on the Blocksworld benchmark Valmeekam et al. (2022), where an agent is asked to rearrange the blocks into stacks in a particular order. We define a state as the current orientation of the blocks and an action as an instruction that moves blocks. Specifically, an action is composed of one of the 4 verbs (i.e., Stack, Unstack, Put, and Pickup) and manipulated objects. For the action space, we generate the currently valid actions given the domain restrictions on actions and the current orientation of the blocks. To transit between states, we take the current action and query the LLM to predict the state changes to the relevant blocks. We then update the current state by adding the new block conditions and removing the conditions that are no longer true. Once a state has met all conditions in the goal or the depth limit of the tree is reached, we terminate the associated node.
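The state-transition step can be sketched as below, keeping the state as a set of block conditions; the prompt wording and the `parse_effects` helper (which extracts added and removed conditions from the model's text output) are hypothetical stand-ins.

```python
def update_blocksworld_state(llm, conditions: set[str], action: str,
                             world_prompt: str, parse_effects) -> set[str]:
    # Query the LLM (as world model) for the effect of the action, then add the
    # conditions that become true and remove those that no longer hold.
    response = llm.sample(
        f"{world_prompt}\nState: {'. '.join(sorted(conditions))}\n"
        f"Action: {action}\nChanges:", n=1
    )[0]
    added, removed = parse_effects(response)
    return (conditions - removed) | added
```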

To assess the quality of actions within this domain, we use two separate rewards. First, we prompt the LLM with some example test cases along with their solutions, and then calculate the log probability of the action given the current state (the “Likelihood of the action” reward in Section 3.2), denoted as $r_1$. This reward reflects the intuition of the LLM as the reasoning agent. It is typically informative when only a few steps remain to reach the goal, but less reliable for a distant goal. Additionally, we compare the new state after performing an action with the goal and provide a reward, $r_2$, scaling with the number of goal conditions met (the “Task-specific heuristics” reward). Specifically, when all the conditions are met, we assign a very large reward to make sure this plan will be selected as the solution.
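Putting the two rewards together might look like the sketch below; the simple sum of $r_1$ and $r_2$, the substring check for conditions, and the size of the goal bonus are illustrative choices, not the paper's exact settings.

```python
def blocksworld_reward(llm, state: str, action: str, next_state: str,
                       goal_conditions: list[str], agent_prompt: str,
                       goal_bonus: float = 100.0) -> float:
    # r1: "likelihood of the action" reward (log-probability under the LLM agent).
    r1 = action_likelihood_reward(llm, state, action, agent_prompt)
    # r2: "task-specific heuristic" reward, scaling with satisfied goal
    # conditions; a naive substring check stands in for proper condition matching.
    met = sum(cond in next_state for cond in goal_conditions)
    r2 = met / len(goal_conditions)
    if met == len(goal_conditions):
        r2 += goal_bonus  # very large reward so a complete plan is always selected
    return r1 + r2
```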

Method           2-step   4-step   6-step
CoT              0.17     0.02     0.00
CoT - pass@10    0.23     0.07     0.00
CoT (GPT-4)      0.50     0.63     0.40
RAP(10)          1.00     0.86     0.26
RAP(20)          1.00     0.88     0.42

Table 1: Results on Blocksworld. RAP(10) and RAP(20) refer to our method where the number of iterations is set to 10 and 20, respectively. “pass@10” means 10 plans are sampled for each test case, and the test case is regarded as solved if at least one plan is correct. All other settings, including RAP, evaluate only a single plan.

Results. We use test cases from the Blocksworld dataset Valmeekam et al. (2023) and group them by the minimum number of actions required, resulting in 30 cases solvable within 2 steps, 57 cases within 4 steps, and 114 cases within 6 steps. There are at most 5 blocks in each test case. As the baseline method, we prompt the LLM with 4 test cases with corresponding solutions, and ask it to generate a plan for a new question. This setting is the same as the one described in Valmeekam et al. (2022), and we denote it as Chain-of-Thought (CoT) as the solution is generated step by step. For RAP, the same prompt is shown to help LLMs calculate $r_1$.

As shown in Table 1, CoT with LLaMA-33B can only generate successful plans for a few 2-step cases, and completely fails on harder problems. RAP substantially improves over CoT, solving nearly all problems within 4 steps and a portion of the 6-step problems, achieving an average success rate of 64%. It is worth noting that the search space of the 6-step problems can be as large as $5^6$, while our algorithm can find a successful plan 42% of the time within 20 iterations. Moreover, our framework allows LLaMA-33B to outperform GPT-4, which is known to have much stronger reasoning ability Bubeck et al. (2023), by a 33% relative gain.

Case study. We compare the reasoning paths from CoT and RAP in Figure 4. We summarize the reasons accounting for the improvement: (1) By maintaining the world state during reasoning, RAP can recognize valid actions for the current state, avoiding generating illegal plans. (2) RAP is capable of backtracking and trying out other solutions when the first intuition from the LLM does not work. Specifically, CoT attempts to achieve the second goal, i.e., “orange on red”, and achieves it with the first two steps. However, accomplishing the second goal first would prevent the first goal from being satisfied. On the contrary, even though RAP makes the same mistake in the first iterations, our framework drives the agent to explore other possible paths (as described in Section 3.3) and finally generate a successful plan. (3) When calculating $r_t$, we can feed only the current state to the LLM and hide the history. E.g., in the case of Figure 4, to calculate the reward for $a_2$, the LLM is provided with a “new” test case in which $s_2$ is the initial state. This significantly lowers the difficulty of the last few steps, and saves more iterations for the harder decisions of the first few steps.

4.2 Math Reasoning

Task setup. Math reasoning tasks, such as GSM8k Cobbe et al. (2021), often include a description and a final question. To arrive at the answer to the final question, it is necessary to undertake multi-step mathematical calculations based on the problem’s context. It is thus natural to decompose the final question into a sequence of smaller sub-questions (Figure 2, middle). We define a state as the values of intermediate variables, and an action as proposing an incremental sub-question about an unknown intermediate variable. The world model then responds to the sub-question using the intermediate variables and the problem description, adding the new intermediate variable value into the next state. We combine the self-evaluation of helpfulness by the LLM, $r_{t,1}$, and the confidence of the state, $r_{t,2}$, using the weighted geometric mean $r_t = r_{t,1}^{\alpha} \cdot r_{t,2}^{1-\alpha}$ as the reward function. This reward encourages more relevant and useful sub-questions. To account for the impact of the reasoning path’s length on the reward, we compute the $Q$ value using the maximum of the average rewards in future steps:

$$Q^{\ast}(s_t, a_t) = \max_{s_t, a_t, r_t, \dots, s_l, a_l, r_l, s_{l+1}} \operatorname{avg}(r_t, \dots, r_l). \qquad (2)$$
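In code, the reward combination and the Q estimate of Equation (2) might be written as follows; the value of α and the handling of empty paths are illustrative choices of this sketch.

```python
def math_step_reward(r_self_eval: float, r_confidence: float, alpha: float = 0.5) -> float:
    # r_t = r_{t,1}^alpha * r_{t,2}^(1 - alpha): weighted geometric mean of the
    # self-evaluation reward and the state-confidence reward.
    return (r_self_eval ** alpha) * (r_confidence ** (1 - alpha))

def q_estimate(future_reward_paths: list[list[float]]) -> float:
    # Equation (2): the maximum, over the future reasoning paths seen from this
    # step, of the average reward collected along each path.
    return max((sum(rs) / len(rs) for rs in future_reward_paths if rs), default=0.0)
```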
Method              Accuracy (%)
Chain-of-Thought    29.4
  + SC(10)          46.8
Least-to-Most       25.5
  + SC(10)          42.5
RAP(1)              40.0
RAP(10)             48.6
  + aggr            51.6

Table 2: Results on GSM8k. The numbers in parentheses indicate the number of samples or iterations.

As a related work, Least-to-Most prompting Zhou et al. (2022) shares a similar idea with ours in sub-question decomposition, but it generates all sub-questions at once. In contrast, RAP considers each action $a_t$ based on the current state $s_t$, which enables more informed decisions.

Figure 5: Results on GSM-8K, with different numbers of sampled paths or iterations.

Results. We evaluate our framework on GSM8k, a dataset of grade school math word problems. We also evaluate the base model with CoT prompting Wei et al. (2022), Least-to-Most prompting Zhou et al. (2022), and their self-consistency Wang et al. (2022) variants as the baselines. We use the same 4-shot demonstrations for all methods.

As shown in Table 2, our RAP framework answers 48.8% of the problems correctly, outperforming both Chain-of-Thought and Least-to-Most prompting with Self-Consistency. Notably, this result is achieved when RAP selects only one reasoning trace based on the reward. The introduction of RAP-Aggregation further improves the accuracy by $\sim$3%. We also calculate the accuracy with different numbers of iterations in MCTS and different numbers of self-consistency samples in the baselines, as illustrated in Figure 5. We find that across all numbers of iterations/samples, RAP-Aggregation outperforms the baselines consistently, which indicates that when only a few iterations/samples are allowed, our framework is significantly better at finding reliable reasoning paths with the guidance of rewards.

4.3 Logical Reasoning

Task setup. A logical reasoning task (e.g., PrOntoQA Saparov and He (2022)) typically provides a set of facts and logical rules, and a model is required to verify whether a hypothesis fact is true or false by applying the logical rules to the given facts, as illustrated in Figure 2. These tasks not only require the correct final answer (true/false), but also a detailed proof demonstrating the result. To apply our framework, we define the state as a fact we are focusing on, analogous to human working memory Baddeley (1992) for inference. An action is defined as selecting a rule from the fact set. The world model performs a one-hop reasoning step to derive a new fact as the next state. The reward is calculated with self-evaluation (Section 3.2). Specifically, we prompt the LLM with a few examples with their labels to help it better understand the quality of reasoning steps. We use the average reward of future steps to update the $Q$ function, the same as Equation (2) for GSM8k.

Method        Pred Acc    Proof Acc
CoT           87.8        64.8
CoT + SC      89.8        -
RAP (Ours)    94.2        78.8

Table 3: Results on PrOntoQA.
Setting   Method     2-step   4-step   6-step   8-step   10-step   12-step   All
Easy      CoT        0.49     0.18     0.06     0.01     0.01      0.00      0.08
          RAP(10)    1.00     0.99     0.75     0.61     0.32      0.32      0.65
Hard      CoT        0.22     0.14     0.02     0.02     0.00      0.00      0.05
          RAP(10)    0.67     0.76     0.74     0.48     0.17      0.09      0.51

Table 4: Results on the full Blocksworld with Llama-2 70B.

Results. We assess the performance of our RAP framework on PrOntoQA Saparov and He (2022) and adopt their settings of the “true” ontology (using real-world knowledge) and “random” ordering of rules. We mix examples requiring 3, 4, and 5 reasoning hops in a correct proof to prevent the LLM from memorizing when to finish the reasoning. We sample 500 examples from the generation script released by Saparov and He (2022). We compare both the prediction accuracy of the final answer and the accuracy of the entire proof. We run 20 iterations for MCTS and draw 20 samples for self-consistency.

As presented in Table 3, our framework achieves a correct answer rate of 94.2% and a proof accuracy of 78.8%, surpassing the CoT baseline by 14% in proof accuracy and the self-consistency CoT baseline by 4.4% in prediction accuracy. Such substantial improvements clearly demonstrate the effectiveness of RAP in solving the logical reasoning problems in PrOntoQA. Also, as the case illustrated in Figure 2 shows, RAP can effectively recognize when a reasoning chain reaches a dead end, and propagate the signal back to earlier reasoning steps, with the planning algorithm allowing it to explore alternatives to the previous steps. The self-evaluation reward further helps RAP to recognize potentially incorrect reasoning steps, encouraging the agent to avoid them in future iterations.

5 Analysis

5.1 Complex problems

To further study whether RAP can help stronger LLMs to solve more complex problems, we conduct experiments on the full Blocksworld Valmeekam et al. (2023) dataset using a more capable LLM, Llama-2 70B Touvron et al. (2023b).

The full Blocksworld Valmeekam et al. (2023) comprises 602 test cases. We group them based on the minimum number of actions required for each test case. Our experiments are conducted in two distinct settings: Easy and Hard. In the Easy setting, we assume prior knowledge of the minimum number of actions for each case. Leveraging this information, we use demonstration cases that share the same minimum number of actions as the test case. For each group of cases, we randomly select 10 cases to create a pool of demonstration cases, leaving the remaining cases as the test set. During inference, we randomly sample 4-shot demonstration cases from this pool and utilize them to formulate prompts. In the Hard setting, we randomly select 10 cases from the full dataset to form a demonstration pool and subsequently exclude these cases from the test set. During inference, we randomly sample 4-shot demonstration cases from this global pool, irrespective of the minimum number of actions required for the test case.

We employ chain-of-thought prompting Wei et al. (2022) as a baseline, and evaluate our RAP(10) (with 10 iterations) with an improved prompting technique (Appendix E). Our experimental results are summarized in Table 4. In both the Easy and Hard settings, RAP demonstrates superior performance over CoT by a substantial margin. Notably, when the test case necessitates a larger number of steps (six or more) to solve, CoT exhibits a severe drop in success rates, whereas RAP maintains a relatively high success rate. Comparing these results with Section 4.1, we additionally conclude that RAP is a general framework able to enhance the reasoning abilities of LLMs, regardless of their intrinsic capabilities.

5.2 Reward Choice

In our main experiments, we choose the combination of rewards based on heuristics and our exploratory experiments. To understand the effects of the reward choice on LLM reasoning, we supplement comprehensive experiments on rewards for plan generation (Table 5) and math reasoning (Table 6).

Generally, the combination of multiple rewards contributes to the performance. However, the effect of a reward depends on the nature of the task. For example, the action likelihood reward is essential for plan generation, but not very helpful for mathematical reasoning. More discussions are in Appendix F.

6 Conclusion

In this paper, we present Reasoning via Planning (RAP), a novel LLM reasoning framework that equips LLMs with an ability to reason akin to human-like strategic planning. Our framework, which repurposes the LLM to act as both a world model and a reasoning agent, enables the LLM to simulate states of the world and anticipate action outcomes, and achieve an effective balance between exploration and exploitation via Monte-Carlo Tree Search. Extensive experiments on a variety of challenging reasoning problems demonstrate RAP’s superiority over several contemporary CoT-based reasoning approaches, and even the advanced GPT-4 in certain settings.

Limitations

In this work, we mainly focus on utilizing frozen LLMs, whose abilities might be bounded by the pre-training. In the future, it is worth exploring how to fine-tune LLMs to better reason and serve as a world model Xiang et al. (2023), as well as how to combine external tools Hao et al. (2023a); Schick et al. (2023) with RAP to solve more complex real-world problems.

Ethics Statement

In this paper, we primarily focus on applications to plan generation, mathematical reasoning, and logical reasoning, posing no significant ethical concerns. We recognize that future research on broader applications of reasoning with LLMs may pose a risk of misuse, and we recommend careful consideration of all aspects of safety before relevant techniques are applied to the real world.

References

  • Baddeley (1992) Alan Baddeley. 1992. Working memory. Science, 255(5044):556–559.
  • Briscoe (2011) Robert Eamon Briscoe. 2011. Mental imagery and the varieties of amodal perception. Pacific Philosophical Quarterly, 92(2):153–173.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  • Bylander (1994) Tom Bylander. 1994. The computational complexity of propositional strips planning. Artificial Intelligence, 69(1-2):165–204.
  • Camacho and Alba (2013) Eduardo F Camacho and Carlos Bordons Alba. 2013. Model predictive control. Springer science & business media.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • Coulom (2007) Rémi Coulom. 2007. Efficient selectivity and backup operators in monte-carlo tree search. In Computers and Games: 5th International Conference, CG 2006, Turin, Italy, May 29-31, 2006. Revised Papers 5, pages 72–83. Springer.
  • Ding et al. (2023) Yan Ding, Xiaohan Zhang, Chris Paxton, and Shiqi Zhang. 2023. Task and motion planning with large language models for object rearrangement. arXiv preprint arXiv:2303.06247.
  • Gasparski and Orel (2014) Wojciech W Gasparski and Tufan Orel. 2014. Designology: Studies on Planning for Action, volume 1. Transaction Publishers.
  • Gentner and Stevens (2014) Dedre Gentner and Albert L Stevens. 2014. Mental models. Psychology Press.
  • Ha and Schmidhuber (2018a) David Ha and Jürgen Schmidhuber. 2018a. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31.
  • Ha and Schmidhuber (2018b) David Ha and Jürgen Schmidhuber. 2018b. World models. arXiv preprint arXiv:1803.10122.
  • Hafner et al. (2019) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. 2019. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
  • Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. 2020. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193.
  • Hao et al. (2023a) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023a. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36.
  • Hao et al. (2023b) Shibo Hao, Bowen Tan, Kaiwen Tang, Bin Ni, Xiyan Shao, Hengzhe Zhang, Eric Xing, and Zhiting Hu. 2023b. Bertnet: Harvesting knowledge graphs with arbitrary relations from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5000–5015.
  • Ho et al. (2021) Mark K Ho, David Abel, Carlos G Correa, Michael L Littman, Jonathan D Cohen, and Thomas L Griffiths. 2021. Control of mental representations in human planning. arXiv e-prints, pages arXiv–2105.
  • Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
  • Huang et al. (2022) Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
  • Huys et al. (2012) Quentin JM Huys, Neir Eshel, Elizabeth O’Nions, Luke Sheridan, Peter Dayan, and Jonathan P Roiser. 2012. Bonsai trees in your head: how the pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS computational biology, 8(3):e1002410.
  • Jiang et al. (2019) Yu-qian Jiang, Shi-qi Zhang, Piyush Khandelwal, and Peter Stone. 2019. Task planning in robotics: an empirical comparison of pddl-and asp-based systems. Frontiers of Information Technology & Electronic Engineering, 20:363–373.
  • Johnson-Laird (2010) Philip N Johnson-Laird. 2010. Mental models and human reasoning. Proceedings of the National Academy of Sciences, 107(43):18243–18250.
  • Johnson-Laird (1983) Philip Nicholas Johnson-Laird. 1983. Mental models: Towards a cognitive science of language, inference, and consciousness. 6. Harvard University Press.
  • Jojic et al. (2023) Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. Gpt is becoming a turing machine: Here are some ways to program it. arXiv preprint arXiv:2303.14310.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings 17, pages 282–293. Springer.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
  • LeCun (2022) Yann LeCun. 2022. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62.
  • Li et al. (2022) Belinda Z Li, Maxwell Nye, and Jacob Andreas. 2022. Language modeling with latent situations. arXiv preprint arXiv:2212.10012.
  • Liu et al. (2023) Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. Llm+ p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.
  • Lyu et al. (2023) Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.
  • Matsuo et al. (2022) Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. 2022. Deep learning, reinforcement learning, and world models. Neural Networks.
  • McCarthy (1963) John McCarthy. 1963. Situations, actions, and causal laws. Technical report, Stanford University, Department of Computer Science.
  • Mialon et al. (2023) Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
  • Paul et al. (2023) Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.
  • Puig et al. (2018) Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. Virtualhome: Simulating household activities via programs.
  • Saparov and He (2022) Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240.
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
  • Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. 2020. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609.
  • Schulkin (2012) Jay Schulkin. 2012. Action, perception and the brain: Adaptation and cephalic expression. Springer.
  • Sekar et al. (2020) Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. 2020. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pages 8583–8592. PMLR.
  • Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv, abs/2303.11366.
  • Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
  • Singh et al. (2022) Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. 2022. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302.
  • Tolman (1948) Edward C Tolman. 1948. Cognitive maps in rats and men. Psychological review, 55(4):189.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Valmeekam et al. (2022) Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.
  • Valmeekam et al. (2023) Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, and Subbarao Kambhampati. 2023. On the planning abilities of large language models (a critical investigation with a proposed benchmark). arXiv preprint arXiv:2302.06706.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  • Welleck et al. (2022) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053.
  • Wu et al. (2023) Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. 2023. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, pages 2226–2240. PMLR.
  • Xiang et al. (2023) Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2023. Language models meet world models: Embodied experiences enhance language models. Advances in neural information processing systems, 36.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
  • Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257.

Appendix A MCTS Planning

We adapt MCTS to search for the optimal reasoning path (Algorithm 1). Compared with traditional applications of MCTS, we face a vast reasoning space and the heavy computational cost of LLMs. Thus, we make several modifications to classic MCTS in our implementation: (1) For open-domain problems, e.g., math problems, it is impossible to enumerate all actions (subquestions), so we reduce the action space by sampling a fixed number of potential actions from the LLM, conditioned on a prompt containing the current state and in-context demonstrations. (2) In the selection phase, if there are actions that have not been visited before, we estimate their Q values with lightweight local rewards, e.g., the self-evaluation reward, and then select the action with UCT. This provides prior knowledge for exploration, which is crucial given the limited iteration budget. A Python sketch of the full procedure is given after Algorithm 1.

Algorithm 1 RAP-MCTS
1: Input: initial state $s_0$, state transition probability function $p_\theta$, reward function $r_\theta$, action generator $p_\phi$, number of generated actions $d$, depth limit $L$, number of roll-outs $N$, and exploration weight $w$
2: Initialize memory of actions $A: \mathcal{S} \mapsto \mathcal{A}$, children $c: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ and rewards $r: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$
3: Initialize the state-action value function $Q: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ and visit counter $N: \mathcal{S} \mapsto \mathbb{N}$
4: for $n \leftarrow 0, \dots, N-1$ do
5:     $t \leftarrow 0$
6:     while $N(s_t) > 0$ do   ▷ Selection
7:         $N(s_t) \leftarrow N(s_t) + 1$
8:         $a_t \leftarrow \arg\max_{a \in A(s_t)} \left[ Q(s_t, a) + w \sqrt{\frac{\ln N(s_t)}{N(c(s_t, a))}} \right]$
9:         $r_t \leftarrow r(s_t, a_t)$, $s_{t+1} \leftarrow c(s_t, a_t)$
10:        $t \leftarrow t + 1$
11:    end while
12:    while $s_t$ is not a terminal state $\wedge$ $t \leq L$ do
13:        for $i \leftarrow 1, \dots, d$ do   ▷ Expansion
14:            Sample $a_t^{(i)} \sim p_\phi(a \mid s_t)$, $s_{t+1}^{(i)} \sim p_\theta(s_t, a_t^{(i)})$, and $r_t^{(i)} \sim r_\theta(s_t, a_t^{(i)})$
15:            Update $A(s_t) \leftarrow \{a_t^{(i)}\}_{i=1}^{d}$, $c(s_t, a_t^{(i)}) \leftarrow s_{t+1}^{(i)}$, and $r(s_t, a_t^{(i)}) \leftarrow r_t^{(i)}$
16:        end for
17:        $a_t \leftarrow \arg\max_{a \in A(s_t)} r(s_t, a)$   ▷ Simulation
18:        $r_t \leftarrow r(s_t, a_t)$, $s_{t+1} \leftarrow c(s_t, a_t)$
19:        $t \leftarrow t + 1$
20:    end while
21:    for $t' \leftarrow t, \dots, 0$ do   ▷ Back propagation
22:        Update $Q(s_{t'}, a_{t'})$ with $\{r_{t'}, r_{t'+1}, \dots, r_t\}$
23:    end for
24: end for
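For concreteness, the following is a compact Python sketch of Algorithm 1. It is an illustrative re-implementation under simplifying assumptions (the LLM-based action proposal, world model, and reward are passed in as callables; Q is aggregated with a maximum over accumulated future rewards), not the released llm-reasoners code.

```python
# Sketch of RAP-MCTS. `propose_actions` samples d candidate actions (p_phi),
# `next_state` predicts the new state with the world model (p_theta), and
# `reward` scores a (state, action) pair (r_theta). States and actions are
# assumed to be hashable (e.g., strings).
import math
from collections import defaultdict

def rap_mcts(s0, propose_actions, next_state, reward, is_terminal,
             d=4, depth_limit=6, n_rollouts=10, w=1.0):
    A = {}                              # state -> list of candidate actions
    children = {}                       # (state, action) -> next state
    r = {}                              # (state, action) -> reward
    Q = defaultdict(float)              # (state, action) -> value estimate
    N = defaultdict(int)                # state -> visit count

    for _ in range(n_rollouts):
        s, t, path = s0, 0, []
        # Selection: descend through already-expanded states with UCT.
        while N[s] > 0 and s in A:
            N[s] += 1
            a = max(A[s], key=lambda a: Q[(s, a)]
                    + w * math.sqrt(math.log(N[s]) / max(1, N[children[(s, a)]])))
            path.append((s, a, r[(s, a)]))
            s, t = children[(s, a)], t + 1
        # Expansion + simulation: grow the tree greedily by local reward.
        while not is_terminal(s) and t <= depth_limit:
            candidates = propose_actions(s, d)             # sample d actions from the LLM
            A[s] = candidates
            for a in candidates:
                children[(s, a)] = next_state(s, a)        # world-model prediction
                r[(s, a)] = reward(s, a)
            a = max(candidates, key=lambda a: r[(s, a)])   # pick the best local reward
            N[s] += 1
            path.append((s, a, r[(s, a)]))
            s, t = children[(s, a)], t + 1
        # Back propagation: update Q along the path with accumulated future rewards.
        future = 0.0
        for s_t, a_t, r_t in reversed(path):
            future += r_t                                  # other aggregations are possible
            Q[(s_t, a_t)] = max(Q[(s_t, a_t)], future)
    return Q, A, children
```

After the roll-outs, a high-reward reasoning path can be read off by greedily following the argmax of Q from the root.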

Appendix B Experiment Settings

B.1 Language Model Decoding

We use random sampling with a temperature of 0.8. Generation is cut off at a maximum length of 2048 tokens or at a newline token.
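As a concrete illustration, this decoding setup can be reproduced roughly as follows with Hugging Face transformers; the checkpoint name and the newline-stopping helper are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the decoding configuration described above.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class StopOnNewline(StoppingCriteria):
    """Stop as soon as the newly generated text contains a newline."""
    def __init__(self, tokenizer, prompt_len):
        self.tokenizer, self.prompt_len = tokenizer, prompt_len
    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return "\n" in new_text

model_name = "huggyllama/llama-7b"                 # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "..."                                     # an instantiated RAP prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True, temperature=0.8,               # random sampling, temperature 0.8
    max_length=2048,                               # cut off at a maximum length of 2048
    stopping_criteria=StoppingCriteriaList(
        [StopOnNewline(tokenizer, inputs.input_ids.shape[1])]),
)
completion = tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:],
                              skip_special_tokens=True)
```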

B.2 Computing Resources 

All of our experiments run on 4 × NVIDIA A5000 GPUs with 24GB memory.

Appendix C Prompt 

C.1 Plan Generation 

We show the prompt used to calculate the action likelihood for RAP below. The same prompt is also applied in the CoT baseline. <initial_state> and <goals> are instantiated by the problem to solve. A minimal sketch of computing this likelihood is given after the prompt.

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do
Pick up a block
Unstack a block from on top of another block
Put down a block
Stack a block on top of another block
I have the following restrictions on my actions:
I can only pick up or unstack one block at a time.
I can only pick up or unstack a block if my hand is empty.
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a block on top of another block if the block onto which I am stacking the block is clear.
Once I put down or stack a block, my hand becomes empty.
[STATEMENT]
As initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table.
My goal is to have that the orange block is on top of the red block.
My plan is as follows:
[PLAN]
unstack the yellow block from on top of the orange block
put down the yellow block 
pick up the orange block 
stack the orange block on top of the red block 
[PLAN END] 
[STATEMENT] 
As initial conditions I have that, the orange block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the red block, the orange block is on top of the blue block, the red block is on the table and the yellow block is on the table. 
My goal is to have that the blue block is on top of the red block and the yellow block is on top of the orange block. 
My plan is as follows: 
[PLAN] 
pick up the yellow block 
stack the yellow block on top of the orange block 
[PLAN END] 
[STATEMENT] 
As initial conditions I have that, the red block is clear, the blue block is clear, the orange block is clear, the hand is empty, the blue block is on top of the yellow block, the red block is on the table, the orange block is on the table and the yellow block is on the table. 
My goal is to have that the blue block is on top of the orange block and the yellow block is on top of the red block. 
My plan is as follows:
[PLAN]
unstack the blue block from on top of the yellow block
stack the blue block on top of the orange block
pick up the yellow block
stack the yellow block on top of the red block
[PLAN END]
[STATEMENT]
As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the yellow block is on top of the orange block, the red block is on the table, the blue block is on the table and the orange block is on the table.
My goal is to have that the orange block is on top of the blue block and the yellow block is on top of the red block.
My plan is as follows:
[PLAN]
unstack the yellow block from on top of the orange block
stack the yellow block on top of the red block
pick up the orange block
stack the orange block on top of the blue block
[PLAN END]
[STATEMENT]
As initial conditions I have that, <initial_state>
My goal is to have that <goals>.
My plan is as follows:
[PLAN] 
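Once the prompt above is instantiated, the action likelihood can be computed roughly as follows. The scoring helper below is a minimal sketch under simplifying assumptions (for example, it tokenizes the prompt and the prompt plus the action separately), not the released implementation.

```python
# Score a candidate action by its average token log-probability under the LLM,
# conditioned on the instantiated plan-generation prompt above.
import torch

def action_log_likelihood(model, tokenizer, prompt, action):
    """Average per-token log-probability of `action` conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + action, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                     # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[0, 1:]
    start = prompt_len - 1                                  # first predicted action token
    idx = torch.arange(start, targets.shape[0], device=targets.device)
    return log_probs[idx, targets[start:]].mean().item()

# Usage (hypothetical): fill in the template and compare candidate actions.
# prompt = plan_prompt.replace("<initial_state>", init_state).replace("<goals>", goals)
# score = action_log_likelihood(model, tokenizer, prompt,
#                               "\nunstack the yellow block from on top of the orange block")
```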

For next-state prediction with the world model, we apply a prompt conditioned on the last action. Here we show, as an example, the prompt used to update the state after a "pick up" action. Again, <state> and <action> are instantiated with the current state and action. A short sketch of applying this prompt follows the example.

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do
Pick up a block
Unstack a block from on top of another block 
Put down a block 
Stack a block on top of another block 
I have the following restrictions on my actions: 
I can only pick up or unstack one block at a time. 
I can only pick up or unstack a block if my hand is empty. 
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. 
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. 
I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. 
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty.
After being given an initial state and an action, give the new state after performing the action.
[SCENARIO 1]
[STATE 0] I have that, the white block is clear, the cyan block is clear, the brown block is clear, the hand is empty, the white block is on top of the purple block, the purple block is on the table, the cyan block is on the table and the brown block is on the table.
[ACTION] Pick up the brown block.
[CHANGE] The hand was empty and is now holding the brown block, the brown block was on the table and is now in the hand, and the brown block is no longer clear.
[STATE 1] I have that, the white block is clear, the cyan block is clear, the brown block is in the hand, the hand is holding the brown block, the white block is on top of the purple block, the purple block is on the table and the cyan block is on the table.
[SCENARIO 2]
[STATE 0] I have that, the purple block is clear, the cyan block is clear, the white block is clear, the hand is empty, the white block is on top of the brown block, the purple block is on the table, the cyan block is on the table and the brown block is on the table.
[ACTION] Pick up the cyan block.
[CHANGE] The hand was empty and is now holding the cyan block, the cyan block was on the table and is now in the hand, and the cyan block is no longer clear.
[STATE 1] I have that, the cyan block is in the hand, the white block is clear, the purple block is clear, the hand is holding the cyan block, the white block is on top of the brown block, the purple block is on the table and the brown block is on the table.
[SCENARIO 3]
[STATE 0] <state>
[ACTION] <action>
[CHANGE]
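Below is a minimal sketch of one world-model step with this prompt: instantiate it with the current state and action, generate a completion, and read off the predicted new state. The helper names are illustrative assumptions, not the released code.

```python
# One world-model step: the model continues the prompt with a [CHANGE] description
# followed by the new [STATE 1]; we keep the latter as the predicted next state.
# `generate_text` stands in for the decoding setup of Appendix B.1 (here generation
# is not cut at the first newline, since the completion spans two lines).

def predict_next_state(generate_text, world_model_prompt, state, action):
    """Return the predicted next-state description for (state, action)."""
    prompt = (world_model_prompt
              .replace("<state>", state)
              .replace("<action>", action))
    completion = generate_text(prompt)
    for line in completion.splitlines():
        if line.startswith("[STATE 1]"):
            return line[len("[STATE 1]"):].strip()
    return completion.strip()                   # fall back to the raw completion

# Usage (hypothetical):
# next_state = predict_next_state(generate_text, prompt_template,
#                                 state="the red block is clear, ...",
#                                 action="Pick up the red block.")
```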

C.2 Math Reasoning

We show the prompt of RAP for math reasoning below. The prompt is used for both action proposal and next-state prediction. After instantiating <question>, we append the prefix "Question 5.1" to the prompt so that we can sample the first action with the LLM. Subsequent actions are sampled similarly, except that all previous sub-questions and sub-answers are appended to the prompt, following the format of the in-context demonstrations. The next-state prediction, i.e., answering the sub-question, works in the same way. A sketch of this procedure is given after the prompt.

Given a question, please decompose it into sub-questions. For each sub-question, please answer it in a complete sentence, ending with "The answer is". When the original question is answerable, please start the subquestion with "Now we can answer the question: ". 
Question 1: Four years ago, Kody was only half as old as Mohamed. If Mohamed is currently twice as 30 years old, how old is Kody? 
Question 1.1: How old is Mohamed? 
Answer 1.1: He is currently 30 * 2 = 60 years old. The answer is 60. 
Question 1.2: How old was Mohamed four years ago? 
Answer 1.2: Four years ago, he must have been 60 - 4 = 56 years old. The answer is 56. 
Question 1.3: How old was Kody four years ago? 
Answer 1.3: Kody was half as old as Mohamed four years ago. Thus, Kody was 56 / 2 = 28 years old. The answer is 28. 
Question 1.4: Now we can answer the question: How old is Kody? 
Answer 1.4: She is currently 28 + 4 = 32 years old. The answer is 32. 
Question 2: On a moonless night, three fireflies danced in the evening breeze. They were joined by four less than a dozen more fireflies before two of the fireflies flew away. How many fireflies remained? 
Question 2.1: How many fireflies joined? 
Answer 2.1: The fireflies were joined by four less than a dozen more fireflies, which are 12 - 4 = 8 fireflies. The answer is 8. 
Question 2.2: Now we can answer the question: How many fireflies remained? 
Answer 2.2: Three fireflies were dancing originally. They were joined by 8 fireflies before two of them flew away. So there were 3 + 8 - 2 = 9 remaining. The answer is 9. 
Question 3: Ali has four $10 bills and six $20 bills that he saved after working for Mr. James on his farm. Ali gives her sister half of the total money he has and uses 3/5 of the remaining amount of money to buy dinner. Calculate the amount of money he has after buying the dinner. 
Question 3.1: How much money does Ali have in total? 
Answer 3.1: Ali has four $10 bills and six $20 bills. So he has 4 * 10 + 6 * 20 = 160 dollars. The answer is 160. 
Question 3.2: How much money does Ali give to his sister? 
Answer 3.2: Ali gives half of the total money he has to his sister. So he gives 160 / 2 = 80 dollars to his sister. The answer is 80. 
Question 3.3: How much money does Ali have after giving his sister the money? 
Answer 3.3: After giving his sister the money, Ali has 160 - 80 = 80 dollars left. The answer is 80. 
Question 3.4: How much money does Ali use to buy dinner? 
Answer 3.4: Ali uses 3/5 of the remaining amount of money to buy dinner. So he uses 80 * 3/5 = 48 dollars to buy dinner. The answer is 48. 
Question 3.5: Now we can answer the question: How much money does Ali have after buying the dinner? 
Answer 3.5: After buying the dinner, Ali has 80 - 48 = 32 dollars left. The answer is 32. 
Question 4: A car is driving through a tunnel with many turns. After a while, the car must travel through a ring that requires a total of 4 right-hand turns. After the 1st turn, it travels 5 meters. After the 2nd turn, it travels 8 meters. After the 3rd turn, it travels a little further and at the 4th turn, it immediately exits the tunnel. If the car has driven a total of 23 meters around the ring, how far did it have to travel after the 3rd turn? 
Question 4.1: How far did the car travel except for the 3rd turn? 
Answer 4.1: It travels 5 meters after the 1st, 8 meters after the 2nd, and 0 meters after the 4th turn. Its a total of 5 + 8 + 0 = 13 meters. The answer is 13. 
Question 4.2: Now we can answer the question: How far did the car have to travel after the 3rd turn? 
Answer 4.2: The car has driven a total of 23 meters around the ring. It travels 13 meters except for the 3rd turn. So it has to travel 23 - 13 = 10 meters after the 3rd turn. The answer is 10. 
Question 5: <question> 
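The sketch below illustrates how the prompt above can be extended during search: each action is a new sub-question "Question 5.k", and the next state is obtained by answering it. `sample_text` stands in for the LLM sampler of Appendix B.1; the helper names and parsing are illustrative assumptions, not the released implementation.

```python
# The base prompt contains demonstrations (Questions 1-4) and ends with
# "Question 5: <question>"; history holds previous (sub_question, sub_answer) pairs.

def propose_action(sample_text, base_prompt, question, history, k):
    """Sample the k-th sub-question given previous sub-questions and sub-answers."""
    prompt = base_prompt.replace("<question>", question)
    for i, (sub_q, sub_a) in enumerate(history, start=1):
        prompt += f"\nQuestion 5.{i}: {sub_q}\nAnswer 5.{i}: {sub_a}"
    prompt += f"\nQuestion 5.{k}:"
    return sample_text(prompt).strip()          # the sampled sub-question (action)

def predict_state(sample_text, base_prompt, question, history, k, sub_question):
    """Answer the k-th sub-question, producing the next state (sub-answer)."""
    prompt = base_prompt.replace("<question>", question)
    for i, (sub_q, sub_a) in enumerate(history, start=1):
        prompt += f"\nQuestion 5.{i}: {sub_q}\nAnswer 5.{i}: {sub_a}"
    prompt += f"\nQuestion 5.{k}: {sub_question}\nAnswer 5.{k}:"
    return sample_text(prompt).strip()          # ends with "The answer is ..."
```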

C.3 Logical Reasoning

We show the prompt for action proposal, action likelihood calculation, and next-state prediction. <fact> and <query> are instantiated with the problem. A short usage sketch follows the example.

Given a list of facts, and a current claim, output one possible fact as the next step. Be sure to copy the exact sentences in the facts. Do not change any wording. Do not create your own words.
Facts 1: Each lepidopteran is an insect. Each arthropod is a protostome. Every animal is multicellular. Protostomes are invertebrates. Each whale is bony. Each painted lady is a butterfly. Invertebrates are animals. Butterflies are lepidopterans. Each insect is six-legged. Every insect is an arthropod. Arthropods are not bony.
Query 1: True or false: Sally is not bony.
Claim 1.1: Sally is an insect.
Next 1.1: Each insect is six-legged.
Claim 1.2: Sally is a butterfly.
Next 1.2: Butterflies are lepidopterans.
Claim 1.3: Sally is a lepidopteran.
Next 1.3: Each lepidopteran is an insect.
Claim 1.4: Sally is not bony.
Next 1.4: Finish.
Claim 1.5: Sally is an arthropod.
Next 1.5: Arthropods are not bony.
Claim 1.6: Sally is a painted lady.
Next 1.6: Each painted lady is a butterfly.
Facts 2: Prime numbers are natural numbers. Every Mersenne prime is not composite. Imaginary numbers are not real. Every real number is a number. Natural numbers are integers. Every real number is real. Every Mersenne prime is a prime number. Natural numbers are positive. Prime numbers are not composite. Integers are real numbers.
Query 2: True or false: 127 is not real.
Claim 2.1: 127 is real.
Next 2.1: Finish. 
Claim 2.1: 127 is a natural number. 
Next 2.1: Natural numbers are integers. 
Claim 2.2: 127 is a prime number. 
Next 2.2: Prime numbers are natural numbers. 
Claim 2.3: 127 is a real number. 
Next 2.3: Every real number is real. 
Claim 2.4: 127 is a Mersenne prime. 
Next 2.4: Every Mersenne prime is a prime number. 
Claim 2.5: 127 is an integer. 
Next 2.5: Integers are real numbers. 
Facts 3: Lepidopterans are insects. Every animal is multicellular. Each insect is an arthropod. Each invertebrate is an animal. Insects are six-legged. Arthropods are small. Arthropods are invertebrates. Each butterfly is a lepidopteran. Whales are not small. 
Query 3: True or false: Polly is not small. 
Claim 3.1: Polly is an arthropod. 
Next 3.1: Arthropods are small. 
Claim 3.2: Polly is an insect. 
Next 3.2: Each insect is an arthropod. 
Claim 3.3: Polly is small. 
Next 3.3: Finish. 
Claim 3.4: Polly is a lepidopteran.
Next 3.4: Lepidopterans are insects.
Facts 4: Every cat is a feline. Mammals are vertebrates. Bilaterians are animals. Vertebrates are chordates. Carnivores are mammals. Mammals are not cold-blooded. Each chordate is a bilaterian. Every feline is a carnivore. Snakes are cold-blooded. Animals are not unicellular. Every carnivore is not herbivorous.
Query 4: True or false: Fae is not cold-blooded.
Claim 4.1: Fae is a feline.
Next 4.1: Every feline is a carnivore.
Claim 4.2: Fae is not cold-blooded.
Next 4.2: Finish.
Claim 4.2: Fae is a mammal.
Next 4.2: Mammals are not cold-blooded.
Claim 4.3: Fae is a cat.
Next 4.3: Every cat is a feline.
Claim 4.4: Fae is a carnivore.
Next 4.4: Carnivores are mammals.
Facts 5: Prime numbers are prime. Real numbers are numbers. Every integer is a real number. Real numbers are not imaginary. Mersenne primes are prime numbers. Complex numbers are imaginary. Each prime number is a natural number. Natural numbers are positive. Each Mersenne prime is prime. Each natural number is an integer.
Query 5: True or false: 7 is imaginary.
Claim 5.1: 7 is not imaginary.
Next 5.1: Finish.
Claim 5.1: 7 is a natural number. 
Next 5.1: Each natural number is an integer. 
Claim 5.2: 7 is a prime number. 
Next 5.2: Each prime number is a natural number. 
Claim 5.3: 7 is a real number. 
Next 5.3: Real numbers are not imaginary. 
Claim 5.4: 7 is an integer. 
Next 5.4: Every integer is a real number. 
Facts 6: Spiders are not six-legged. Insects are six-legged. Insects are arthropods. Every animal is not unicellular. Invertebrates are animals. Lepidopterans are insects. Every arthropod is segmented. Arthropods are invertebrates. Every butterfly is a lepidopteran. Stella is a butterfly. 
Query 6: True or false: Stella is six-legged. 
Claim 6.1: Stella is an insect. 
Next 6.1: Insects are six-legged. 
Claim 6.2: Stella is a lepidopteran. 
Next 6.2: Lepidopterans are insects. 
Claim 6.3: Stella is a butterfly. 
Next 6.3: Every butterfly is a lepidopteran. 
Claim 6.4: Stella is six-legged. 
Next 6.4: Finish. 
Facts 7: <fact> 
Query 7: <query> 
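The sketch below shows one step with this prompt: given the facts, the query, and the current claim, sample a candidate next fact (or "Finish."). `sample_text` stands in for the decoding setup of Appendix B.1, and candidates can additionally be scored by their likelihood as described for RAP; the helper names and filtering are illustrative assumptions, not the released implementation.

```python
# Propose a next-step fact for the current claim at position 7.<step>.

def propose_next_fact(sample_text, base_prompt, facts, query, claim, step=1):
    """Sample a candidate next fact for the current claim."""
    prompt = (base_prompt
              .replace("<fact>", facts)
              .replace("<query>", query)
              + f"\nClaim 7.{step}: {claim}\nNext 7.{step}:")
    candidate = sample_text(prompt).strip()
    # Keep only candidates copied verbatim from the facts (or the special "Finish.").
    allowed = {s.strip() + "." for s in facts.split(".") if s.strip()} | {"Finish."}
    return candidate if candidate in allowed else None
```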

Appendix D Related work: world model and planning 

Recent years have witnessed successful applications of planning algorithms Sekar et al. (2020), such as AlphaZero Silver et al. (2017) and MuZero Schrittwieser et al. (2020). These algorithms are typically based on tree-structured search and are designed to maintain an effective balance between exploration and exploitation. Knowledge of the transition dynamics is a prerequisite for planning, and recent research on model-based reinforcement learning proposes to learn a world model (or dynamics model) to plan or to assist policy learning. To improve sample efficiency, previous research attempts to learn a world model from offline trajectories and to learn a policy directly within the world model Ha and Schmidhuber (2018a, b). With latent imagination in a world model, RL agents can be trained to solve long-horizon tasks Hafner et al. (2019, 2020). World models have also been shown to be helpful for physical robot learning Wu et al. (2023). In this paper, we use LLMs as world models and apply a planning algorithm to search for a reasoning path. This is similar in spirit to model predictive control Camacho and Alba (2013). Compared with previous works, our framework uses general LLMs as the world model and can be adapted to a wide range of open-domain reasoning tasks. Xiang et al. (2023) propose to train LLMs with an external world model to gain embodied experience, while RAP focuses on the inference stage and is compatible with any training method.

Appendix E Adaptive Prompting

Through our preliminary experiments, we observed that the performance of LLMs is impacted by the discrepancy in difficulty between the demonstration cases and the test cases. In the case of RAP, when a new state is predicted, we reformulate the remaining task as a new test case, initialized with the predicted new state. This new test case requires a smaller minimum number of actions, leading to a disparity between the distribution of the demonstration cases and that of the new cases. To mitigate this issue, we pre-compute the intermediate states of the demonstration cases beforehand. During inference, for each new state in an iteration, we truncate the demonstration traces from the beginning, which reduces the minimum number of actions in the demonstration cases as the search tree deepens. This technique significantly enhances the performance of RAP, especially for more intricate problems, which are more susceptible to distribution mismatches. A sketch of this truncation is shown below.
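The following is a minimal sketch of the adaptive-prompting trick under assumed data structures: each demonstration is stored with its pre-computed intermediate states, and at search depth `depth` the first `depth` steps are dropped so the demonstrations match the shorter remaining task. The layout and helper names are illustrative, not the released code.

```python
# Each demo is (states, actions, goal), where states[i] is the pre-computed state
# before actions[i].

def adapt_demonstrations(demos, depth):
    """Truncate each demonstration from the beginning by `depth` steps."""
    adapted = []
    for states, actions, goal in demos:
        d = min(depth, len(actions) - 1)          # keep at least one action
        adapted.append((states[d], actions[d:], goal))
    return adapted

def render_prompt(adapted_demos, current_state, current_goal):
    """Render the truncated demonstrations plus the new test case (cf. Appendix C.1)."""
    blocks = []
    for init_state, actions, goal in adapted_demos:
        plan = "\n".join(actions)
        blocks.append("[STATEMENT]\nAs initial conditions I have that, " + init_state
                      + "\nMy goal is to have that " + goal
                      + ".\nMy plan is as follows:\n[PLAN]\n" + plan + "\n[PLAN END]")
    blocks.append("[STATEMENT]\nAs initial conditions I have that, " + current_state
                  + "\nMy goal is to have that " + current_goal
                  + ".\nMy plan is as follows:\n[PLAN]")
    return "\n".join(blocks)
```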

Appendix F Reward Choice

Table 5: Ablation study on Blocksworld. $R_1$ is the action likelihood reward, $R_2$ is the task-specific reward, and $R_3$ is the self-evaluation reward. Success rates by row: (1) $R_1$ + $R_2$: 0.88; (2) $R_1$ + $R_2$ + $R_3$: 0.91; remaining rows (3)-(6): 0.46, 0.21, 0.14, 0.02.
Table 6: Ablation study on GSM8k (first 300 examples). $R_1$ is the state transition confidence reward, $R_2$ is the action likelihood reward, and $R_3$ is the self-evaluation reward. Accuracy is reported as RAP(1) / RAP(10) / +aggr: (1) $R_1$ + $R_3$: 0.410 / 0.450 / 0.503; (2) $R_1$ only: 0.350 / 0.447 / 0.490; (3) $R_1$ + $R_2$: 0.373 / 0.423 / 0.443.

Results. We conduct comprehensive experiments on rewards for plan generation (Table 5) and math reasoning (Table 6). Note that, in both tables, the first row indicates the setting we use in the main experiments. As shown in Table 5, the combination of the action likelihood and task-specific rewards (row 1) significantly outperforms the single-reward baselines (rows 3, 4, 5). Interestingly, adding the self-evaluation reward further improves the performance slightly (row 2). Furthermore, as shown by the results on the first 300 samples of GSM8k in Table 6, adding either the action likelihood (row 3) or self-evaluation (row 1) reward on top of the confidence reward can boost the RAP performance over using only the confidence reward (row 2) with one iteration, but the action likelihood reward degrades the accuracy with more iterations. The self-evaluation reward leads to the best performance overall. This indicates the importance of the self-evaluation reward in guiding reasoning, serving as an effective and computationally efficient prior for exploration.

Self-evaluation and action likelihood. The self-evaluation and action likelihood rewards are of particular interest, as they can be applied to a wide range of reasoning tasks. Generally, their best usage and combination with other rewards require empirical design and an understanding of the nature of the task, and their effectiveness can vary significantly across different tasks. Here, we provide some intuitions behind the reward choices:

(a) For problems in which a reasoning step is short and structured, the action likelihood can be very indicative. Otherwise, it may be disturbed by unimportant tokens and become unreliable. For instance, a single step within the Blocksworld domain typically adheres to specific patterns (e.g., pick/put/stack a block…), rendering the action likelihood indicative. However, in the math domain, a reasoning step is expressed in natural language sentences, allowing for greater freedom and potentially introducing noise.

(b) For problems where it is easier to recognize errors after the fact than to avoid them during generation, self-evaluation emerges as a helpful mechanism for enhancing reasoning accuracy. In mathematical reasoning, LLMs may struggle to generate a correct reasoning step in the first place, but detecting calculation or logic errors is more feasible. In Blocksworld, however, assessing the quality of a candidate action is not straightforward and still requires multi-step reasoning. This characteristic diminishes the accuracy of the self-evaluation reward, making it less helpful, especially given that the action likelihood already provides a good signal for search.