Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
反思、重试、奖励：通过强化学习实现自我改进的大语言模型

Shelly Bensal* Umar Jamil* Christopher Bryant Melisa Russak Kiran Kamble Dmytro Mozolevskyi Muayad Ali Waseem AlShikh Writer, Inc.
谢莉·本萨尔* 乌马尔·贾米尔* 克里斯托弗·布莱恩特梅丽莎·鲁萨克基兰·坎布尔德米特罗·莫佐列夫斯基穆亚德·阿里瓦西姆·阿尔谢赫 Writer, Inc.{shelly, ..., waseem}@writer.com

Abstract 摘要

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model’s ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as $34.7 %$ improvement at math equation writing and $18.1 %$ improvement at function calling. Notably, smaller fine-tuned models ( 1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
我们探索了一种通过自我反思和强化学习来改进大型语言模型性能的方法。通过激励模型在回答错误时生成更好的自我反思，我们证明了即使在难以生成合成数据且仅有二元反馈的情况下，模型解决复杂可验证任务的能力也可以得到增强。我们的框架分为两个阶段：首先，在未能完成给定任务时，模型生成一个自我反思性评论，分析其先前的尝试；其次，模型在上下文中包含自我反思的情况下再次尝试该任务。如果后续尝试成功，则奖励自我反思阶段生成的标记。我们的实验结果显示，在各种模型架构中都取得了显著的性能提升，数学方程写作最高可达 $34.7 %$ 的改进，函数调用最高可达 $18.1 %$ 的改进。值得注意的是，较小的微调模型（1.5 亿到 70 亿参数）的表现超过了同一家族中大 10 倍的模型。因此，我们这种新颖的范式是通向更有用、更可靠的语言模型的令人兴奋的途径，这些模型能够在有限的外部反馈下自我改进复杂任务。

1 Introduction 1 引言

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks (Zhao et al., 2025), as well as in mathematics (Ahn et al., 2024), coding (Jiang et al., 2024), and reasoning (Huang and Chang, 2023). Despite these advancements however, models still have blind spots, and there is no guarantee that a model that succeeds at one task will succeed at another, even if the task is of a similar type (Asher et al., 2023; Huckle and Williams, 2025). The most direct way to address this problem is to retrain or fine-tune a model on data that represents the failed task, however this may not be possible if no such dataset exists. Furthermore, if the largest state-of-the-art models also struggle to complete the task, we similarly cannot use them to generate synthetic training data (Liu et al., 2024a).
大型语言模型（LLMs）已在广泛的自然语言处理任务（Zhao 等，2025）中展现出令人印象深刻的能力，同时在数学（Ahn 等，2024）、编程（Jiang 等，2024）和推理（Huang 和 Chang，2023）领域也表现出色。尽管取得了这些进步，但模型仍然存在盲点，并且无法保证在一个任务中成功的模型就能在另一个类似的任务中同样成功（Asher 等，2023；Huckle 和 Williams，2025）。解决这个问题最直接的方法是在代表失败任务的数据上重新训练或微调模型，但如果没有这样的数据集，这种方法可能行不通。此外，如果最大的先进模型也难以完成该任务，我们同样无法使用它们生成合成训练数据（Liu 等，2024a）。

An alternative solution is to prompt the model to explain its reasoning or self-reflect on why it failed. For example, the popular Chain-of-Thought (CoT) paradigm (Wei et al., 2022) showed that models performed significantly better at arithmetic, commonsense, and reasoning tasks if they were prompted to show their reasoning in addition to simply providing a response. Self-reflection operates on a similar principle, in that if we can detect when a LLM provides an incorrect response, we can prompt it to reflect on any flaws in its reasoning and perhaps try again (Ji et al., 2023; Renze and Guven, 2024). The main advantage of these approaches is that they do not require any additional training data, however their effectiveness is directly tied to the effectiveness of the reasoning/reflection prompt.
另一种解决方案是提示模型解释其推理过程或对失败原因进行自我反思。例如，流行的思维链（Chain-of-Thought，CoT）范式（Wei 等，2022）表明，如果模型不仅仅给出响应，还被要求展示推理过程，那么它们在算术、常识和推理任务中的表现会显著提高。自我反思遵循类似的原则，即当我们能够检测到 LLM 给出了不正确的响应时，可以提示它反思推理中的缺陷，并尝试重新解决问题（Ji 等，2023；Renze 和 Guven，2024）。这些方法的主要优势在于不需要额外的训练数据，但其有效性直接取决于推理/反思提示的有效性。

In this paper, we investigate the extent to which LLMs can learn to generate better self-reflections in order to self-improve on downstream tasks. More specifically, if a model fails to complete a task on its first attempt, it generates a self-reflection which it uses to make a second attempt. If the model then succeeds on its second attempt, we use reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO) (Shao et al., 2024), to reward the tokens in the self-reflection, such that future self-reflections will be more effective. In this way, models can learn how to improve upon all kinds of tasks without requiring any task-specific data; they instead just optimize how to reflect on mistakes.
在本文中，我们研究了 LLMs 能在多大程度上学习生成更好的自我反思，从而在下游任务上实现自我改进。更具体地说，如果模型在第一次尝试中无法完成任务，它会生成一个自我反思，并用这个反思进行第二次尝试。如果模型在第二次尝试中成功，我们将使用强化学习（RL），特别是群体相对策略优化（GRPO）（Shao 等，2024），来奖励自我反思中的词元，以使未来的自我反思更加有效。通过这种方式，模型可以学习如何改进各种任务，而无需任何特定任务的数据；相反，它们只是优化如何反思错误。

Our main contribution is thus a novel methodology for training a model to generate better selfreflections to improve on challenging tasks in a task-agnostic way. Crucially, this method only requires a binary success/failure signal from a response verifier, which makes it well-suited to tasks where success can be easily verified. To demonstrate the efficacy of our approach, we carry out experiments on both the APIGen function calling dataset (Liu et al., 2024b) and the Countdown equation task introduced by Pan et al. (2025b).
我们的主要贡献是一种新颖的方法，用于训练模型以任务无关的方式生成更好的自我反思，从而改进具有挑战性的任务。关键是，该方法仅需要来自响应验证器的二元成功/失败信号，这使其非常适合可以轻松验证成功的任务。为了证明我们方法的有效性，我们在 APIGen 函数调用数据集（Liu 等，2024b）和 Pan 等（2025b）引入的倒数方程任务上进行了实验。

2.1 Self-Reflection 2.1 自我反思

Self-reflection in LLMs Self-reflection, also referred to as introspection, is a metaprompting strategy in which a language model analyzes its own reasoning in order to identify and correct potential mistakes. This paradigm has gained momentum in large language model (LLM) research as a means to boost multi-step reasoning and problem-solving performance, especially in domains such as arithmetic, commonsense reasoning, and question answering (Wei et al., 2022; Madaan et al. 2023; Renze and Guven, 2024, Shinn et al., 2023). Typically, self-reflection involves generating an initial answer, producing natural language feedback to critique that answer, and then refining the response based on this critique. This process can be applied iteratively, often using the same model to both generate and evaluate solutions, and may include modules such as memory buffers or explicit meta-instruction guides (Liu et al., 2025; Wu et al., 2025).
LLM 的自我反思自我反思，也称为内省，是一种元提示策略，语言模型通过这种策略分析自身的推理过程，以识别和纠正潜在的错误。这一范式在大型语言模型（LLM）研究中越来越受到重视，作为提升多步推理和问题解决性能的一种手段，尤其是在算术、常识推理和问答等领域（Wei 等，2022；Madaan 等，2023；Renze 和 Guven，2024，Shinn 等，2023）。通常，自我反思涉及生成初始答案，产生自然语言反馈来批评这个答案，然后根据这个批评改进响应。这个过程可以迭代进行，通常使用同一个模型来生成和评估解决方案，并可能包括内存缓冲区或明确的元指令指南等模块（Liu 等，2025；Wu 等，2025）。
Approaches and Limitations The methodology for self-reflection in LLMs varies along several axes. Some methods apply self-correction only to failed or low-confidence queries, while others use it for every response; feedback can be provided in the form of scalar scores, external annotations, or natural language, and may be generated by humans, external models, or the LLM itself (Bai et al., 2022; Peng et al., 2023, Yang et al., 2022, Pan et al., 2025c). While prompting LLMs to self-reflect does improve accuracy in many settings, recent work has shown that the effectiveness depends strongly on the context: challenges include the inability to reliably identify self-errors without ground-truth oracles, diminishing returns from repeated reflection, and risks of performance deterioration for easier prompts or high-performing base models (Huang et al., 2024, Zhang et al., 2024; Kim et al., 2023). In particular, self-reflection is most effective when initial accuracy is low, question difficulty is high, and external verification is available. Conversely, LLMs may sometimes fail to recognize their own mistakes but can still benefit from external feedback when such supervision exists (Pan et al., 2025c, Shinn et al., 2023).
方法及局限性 LLMs 中的自我反思方法在几个维度上存在差异。一些方法仅对失败或低置信度的查询应用自我纠正，而另一些则对每个响应都使用；反馈可以是标量得分、外部标注或自然语言的形式，并可能由人类、外部模型或 LLM 本身生成（Bai 等，2022；Peng 等，2023，Yang 等，2022，Pan 等，2025c）。虽然提示 LLMs 进行自我反思确实在许多场景中提高了准确性，但最近的研究表明，其有效性强烈依赖于上下文：挑战包括在没有地面真值参考的情况下可靠地识别自身错误、重复反思带来的收益递减，以及对于较简单的提示或高性能基础模型可能导致性能下降的风险（Huang 等，2024，Zhang 等，2024；Kim 等，2023）。尤其是，自我反思在初始准确性低、问题难度高且有外部验证可用的情况下最为有效。相反，LLMs 有时可能无法识别自身错误，但仍可在存在外部反馈的情况下受益（Pan 等，2025c，Shinn 等，2023）。
Training-Based Methods Recent directions focus on incorporating self-improvement capabilities during model training, either by fine-tuning on self-correction trajectories or by formulating the process as a multi-turn reinforcement learning problem (Kumar et al., 2024; Qu et al., 2024; Wu et al., 2025). These training-based methods suggest that leveraging the model’s own critiques during learning yields persistent improvements-even when no test-time self-reflection is performed. However, these approaches typically rely on larger teacher models for data generation or supervision, which can be seen as a form of knowledge distillation (Hinton et al., 2015).
基于训练的方法最近的研究方向侧重于在模型训练期间融入自我改进能力，方式包括在自我纠正轨迹上进行微调，或将过程构建为多轮强化学习问题（Kumar 等，2024；Qu 等，2024；Wu 等，2025）。这些基于训练的方法表明，在学习过程中利用模型自身的批评会带来持续的改进，即使在没有测试时自我反思的情况下也是如此。然而，这些方法通常依赖于更大的教师模型进行数据生成或监督，这可以被视为一种知识蒸馏的形式（Hinton 等，2015）。
Our Approach Building on insights from prior research, we propose correcting only failed cases identified by an external verifier, converting its binary feedback into self-reflective prompts, and training the model to use the self-reflection to succeed at the second attempt. This oracle-grounded conditional computation leverages training-time benefits to reduce test-time overhead and is guaranteed to improve or maintain performance, since corrections are applied only to initially incorrect examples. For training, we employ Group Relative Policy Optimization (GRPO), introduced in the next section. Notably, this approach bootstraps solely from the model’s own outputs, without relying on external LLMs.
我们的方法基于先前研究的洞察，我们建议仅纠正由外部验证器识别的失败案例，将其二进制反馈转换为自我反思提示，并训练模型使用自我反思在第二次尝试中成功。这种基于 Oracle 的条件计算利用训练时的优势来减少测试时的开销，并保证提高或维持性能，因为仅对最初不正确的示例应用修正。对于训练，我们采用下一节将介绍的群组相对策略优化（GRPO）。值得注意的是，这种方法完全依靠模型自身的输出，不依赖外部 LLMs。

Figure 1: Reflect, Retry, Reward Mechanism The model is first prompted to complete a task based on a user query. If the initial response is correct, the process stops. If not, the model is prompted to generate a self-reflection on how to improve. The model then retries the same task, this time with its self-reflection included, and the new answer is evaluated. If the second attempt succeeds, the model learns that it generated an effective self-reflection.
图 1：反思、重试、奖励机制模型首先根据用户查询被提示完成一项任务。如果初始响应正确，流程即停止。如果不正确，模型被提示生成关于如何改进的自我反思。随后模型使用其自我反思重新尝试同一任务，新的答案将被评估。如果第二次尝试成功，模型将学习到它生成了有效的自我反思。

2.2 Reinforcement Learning for Language Models
2.2 语言模型的强化学习

GRPO Group Relative Policy Optimization (GRPO) is an outcome-based reinforcement learning method proposed to address the unique challenges faced when fine-tuning LLMs, such as those encountered in complex mathematical reasoning tasks (Shao et al., 2024). Unlike conventional approaches like Proximal Policy Optimization (PPO) (Schulman et al., 2017), GRPO dispenses with a separate value (critic) network and instead estimates advantages directly by comparing outcomes from a group of sampled completions. This makes GRPO particularly well-suited to settings where supervision is sparse and only available at the conclusion of a generation-for example, whether a completed math solution is correct. In such environments, the model must generate an entire sequence before receiving any feedback, typically in the form of a scalar reward reflecting the quality or correctness of the output.
GRPO（组相对策略优化）是一种基于结果的强化学习方法，旨在解决微调 LLM 时遇到的独特挑战，例如在复杂数学推理任务中遇到的挑战（Shao 等，2024）。与传统方法（如近端策略优化（PPO）（Schulman 等，2017））不同，GRPO 不使用单独的价值（评论家）网络，而是通过比较一组采样完成结果直接估算优势。这使得 GRPO 特别适合监督稀疏且仅在生成结束时可用的场景，例如，已完成的数学解决方案是否正确。在这样的环境中，模型必须在生成整个序列后才能收到反馈，通常以反映输出质量或正确性的标量奖励的形式。

Our Approach In this work, we adopt GRPO as the sole mechanism for reinforcement learning, without involving additional supervised fine-tuning stages. Recent research has demonstrated that modifying GRPO’s reward structure can effectively encourage models to persist through failure, for instance by rewarding retries after unsuccessful attempts, thereby promoting self-correction and robustness (Dao and Le, 2025). GRPO has further shown promise in related domains requiring complex, outcome-supervised behaviors-including tool use and advanced mathematical problem solving-offering a flexible and efficient optimization strategy in diverse LLM applications (Qian et al., 2025; Li et al., 2025).
我们的方法在这项工作中，我们采用 GRPO 作为强化学习的唯一机制，不涉及额外的监督微调阶段。最近的研究表明，修改 GRPO 的奖励结构可以有效地鼓励模型在失败后保持尝试，例如通过在不成功的尝试后奖励重试，从而促进自我纠正和鲁棒性（Dao 和 Le，2025）。GRPO 在需要复杂、基于结果的行为的相关领域也显示出前景，包括工具使用和高级数学问题求解，在各种 LLM 应用中提供了灵活高效的优化策略（Qian 等，2025；Li 等，2025）。

3 Reflect, Retry, Reward
3 反思、重试、奖励

Our novel Reflect, Retry, Reward methodology operates as follows, and is illustrated in Figure 1.
我们的新颖的反思、重试、奖励方法的运作如下，并在图 1 中进行了说明。
First, a model is prompted to complete a task. If it succeeds, we do nothing as the model already meets our needs. If it fails however, we prompt it to generate a self-reflection on what might have gone wrong. Note that this presupposes a validator that automatically evaluates whether a response was a success or failure (binary). While it is sometimes possible to define a task-dependent validator that meets this criteria without ground-truth labels, such as in basic API function calling (Did the API call return a valid response?), mathematical equations (Does the equation evaluate to the target answer?), or code (Does the generated code execute?), some task types may require gold-standard target answers.
首先，提示模型完成一项任务。如果成功，我们不做任何处理，因为模型已经满足了我们的需求。如果失败，我们提示它生成关于可能出错原因的自我反思。请注意，这预设了一个自动评估响应是成功还是失败（二元）的验证器。虽然有时可以定义满足此标准的特定任务的验证器，且无需真值标签，比如在基本 API 函数调用（API 调用是否返回有效响应？）、数学方程（方程是否求解得到目标答案？）或代码（生成的代码是否可执行？），但某些任务类型可能需要黄金标准目标答案。

Having generated a self-reflection, the model then makes a second attempt to complete the task, making use of the self-reflection in the conversation history. If it still fails, we do nothing; the self-reflection was insufficient to turn a failure into a success. If it succeeds however, we use GRPO to reward only the tokens that were generated in the self-reflection. This is possible by setting the advantage terms for all other generated tokens to zero. We do this because we want the model to learn how to self-reflect more generally rather than specialize for a particular task. In other words, we do not reward the correct answer, we only reward the self-reflection.
生成自我反思后，模型会再次尝试完成任务，并在对话历史中利用这个自我反思。如果仍然失败，我们不做任何处理；自我反思不足以将失败转变为成功。但如果成功，我们使用 GRPO 仅奖励在自我反思中生成的标记。这可以通过将所有其他生成的标记的优势项设置为零来实现。我们这样做是因为我们希望模型学习更普遍的自我反思方式，而不是专门针对某个特定任务。换句话说，我们不奖励正确答案，只奖励自我反思。

4 Experiments 4 实验

We demonstrate the effectiveness of our approach through experiments on two different tasks: function calling and math equations.
我们通过在两种不同任务上的实验，证明了我们方法的有效性：函数调用和数学方程。

4.1 Function Calling 4.1 函数调用

We use the APIGen dataset (Liu et al., 2024b) for our function calling experiments. APIGen is a dataset of 60,000 high quality function calls that consist of a user query (plain text), a list of possible tools that can answer that query plus their parameters (JSON) and the correctly formatted function call with the correct parameters and values (JSON). There are a total of 4,211 unique tools in the dataset, with an average of 2.3 parameters per tool, and each user query has an average of 2.8 tools to choose from (

min 1, max 8

). A model is only considered to be correct if it not only selects the right tool, but also generates the correct parameters and values. A sample datapoint with a choice of two different tools is shown below (formatted to be more human readable).
我们使用 APIGen 数据集（Liu 等，2024b）进行函数调用实验。APIGen 是一个包含 60,000 个高质量函数调用的数据集，其中包括用户查询（纯文本）、可以回答该查询的可能工具及其参数列表（JSON），以及正确格式的函数调用，包括正确的参数和值（JSON）。数据集中共有 4,211 个唯一工具，每个工具平均有 2.3 个参数，每个用户查询平均有 2.8 个可供选择的工具（

min 1, max 8

）。只有在选择正确的工具并生成正确的参数和值时，模型才被认为是正确的。下面展示了一个包含两个不同工具选择的数据点示例（为了更便于阅读，进行了格式化）。

USER QUERY:
Check if the Vimeo username 'john_doe_artist' is available.
TOOLS PROVIDED:
[{ "name": "vimeo",
    "description": "Checks if a given Vimeo username is available using the
         Toolbench RapidAPI service.",
    "parameters": {"username": {"description": "The Vimeo username to check for
         availability.", "type": "str", "default": "username"}}
    },
{ "name": "get_user_pins",
    "description": "Retrieves the Pinterest pins of a specified user.",
    "parameters": {"username": {"description": "The Pinterest username whose pins
         are to be fetched.", "type": "str", "default": "0869178429hau"}}
}]
CORRECT ANSWER:
[{"name": "vimeo", "arguments": {"username": "john_doe_artist"}}]

To preserve the integrity of our experiments, we only evaluate models that were released before the APIGen dataset was released (June 2024). This ensures it is impossible that any of these models could have been trained on the dataset to obtain an unfair advantage. Specifically, we report results for Qwen2 (1.5B/7B Instruct) (Yang et al., 2024), Llama3.1 (8B Instruct) (Grattafiori et al., 2024), and Phi3.5-mini Instruct (Abdin et al., 2024). We also report the vanilla performance of Qwen2-72B Instruct, Llama3.1-70B Instruct, and Writer’s Palmyra X4 (Writer.com, 2024) as a baseline.
为了保证实验的完整性，我们仅评估在 APIGen 数据集发布（2024 年 6 月）之前发布的模型。这确保了这些模型不可能通过训练该数据集获得不公平的优势。具体而言，我们报告了 Qwen2（1.5B/7B Instruct）（Yang 等，2024）、Llama3.1（8B Instruct）（Grattafiori 等，2024）和 Phi3.5-mini Instruct（Abdin 等，2024）的结果。我们还报告了 Qwen2-72B Instruct、Llama3.1-70B Instruct 和 Writer 的 Palmyra X4（Writer.com，2024）作为基准。
Since different model families also have different suggested tool-calling approaches, we tested different templates for each model family and ultimately chose the prompt formats that provided the strongest baselines. For our function calling validator, we require the model output to exactly match the correct answer in the dataset (i.e. based on the ground-truth labels). We used the following prompt to generate self-reflections for failed function calling attempts:
由于不同的模型家族有不同的建议工具调用方法，我们为每个模型家族测试了不同的模板，并最终选择了提供最强基准的提示格式。对于我们的函数调用验证器，我们要求模型输出必须与数据集中的正确答案完全匹配（即基于地面真相标签）。我们使用以下提示来为失败的函数调用尝试生成自我反思：

You tried performing the task, but failed in generating the correct tool call.
     Reflect on what went wrong and write a short explanation that will help you
     do better next time.

4.2 Countdown Math Equations
4.2 倒计时数学方程

We use the Countdown dataset introduced by the TinyZero project for our math equation experiments (Pan et al., 2025a b). The Countdown dataset consists of 450 k lists of

3 - 4

numbers along with a target number. The goal is to apply basic arithmetic operations to the numbers such that the equation evaluates to the target number. A model is only considered to be correct if it uses all the numbers once (in any order) and if the final equation successfully evaluates to the target number. A sample datapoint is shown below.
我们使用 TinyZero 项目引入的倒计时数据集进行数学方程实验（Pan 等，2025a b）。倒计时数据集由 45 万个包含

3 - 4

个数字的列表以及一个目标数字组成。目标是对这些数字应用基本算术运算，使得方程能够计算得到目标数字。只有在使用所有数字恰好一次（可以是任意顺序）且最终方程成功计算出目标数字时，模型才被认为是正确的。下面展示了一个示例数据点。

Using the numbers [4, 73, 4, 23], create an equation that equals 76. You can use
     basic arithmetic operations (+, -, *, /) and each number can only be used
     once.

As with function calling, to preserve the integrity of our experiments, we only evaluate models that were released or have a knowledge cutoff before the Countdown dataset was made publicly available (January 2025). Specifically, we report results for Qwen2.5 (1.5B/3B/7B Instruct) (Yang et al., 2025), Llama3.1 (8B Instruct), Llama3.2 (3B Instruct), and Writer’s Palmyra 1.7B. We also report the vanilla performance of Qwen2.5-32B Instruct, Qwen2.5-72B Instruct, Llama3.1-70B Instruct, and Writer’s Palmyra X4 (Writer.com, 2024) as a baseline.
与函数调用一样，为了保持实验的完整性，我们仅评估在倒计时数据集公开发布（2025 年 1 月）之前已发布或具有知识截止点的模型。具体来说，我们报告了 Qwen2.5（1.5B/3B/7B Instruct）（Yang 等，2025）、Llama3.1（8B Instruct）、Llama3.2（3B Instruct）和 Writer 的 Palmyra 1.7B 的结果。我们还报告了 Qwen2.5-32B Instruct、Qwen2.5-72B Instruct、Llama3.1-70B Instruct 以及 Writer 的 Palmyra X4（Writer.com，2024）的基准性能。

We once again tried several different prompt formats for each model family, and ultimately chose to use the format that provided the strongest baseline. For our math equation validator, we required the generated equation to match the target answer in the prompt (i.e. no need for ground-truth labels). We used the following prompt to generate self-reflections for failed Countdown math equations:
我们再次尝试了每个模型系列的几种不同提示格式，最终选择了提供最强基线的格式。对于我们的数学方程验证器，我们要求生成的方程与提示中的目标答案匹配（即不需要 ground-truth 标签）。我们使用以下提示来为失败的倒计时数学方程生成自我反思：

You tried solving the problem and got the wrong answer. Reflect on what went wrong
     and write a short explanation that will help you do better next time.

4.3 A Dataset of Failures
4.3 失败数据集

For reasons of efficiency, and to facilitate a more intuitive analysis, we did not train our models on the full function calling and math equation training sets, but instead opted to first create a dataset of failures for each task. More specifically, we prompted each model for each task to generate up to 64 responses (depending on model size) to each user query and preserved only those queries where the model failed (based on each task-dependent verifier). We typically generated more responses for larger models because they failed less frequently than smaller models, and so would otherwise yield fewer training samples. To accelerate the rejection sampling process, we used vLLM (Kwon et al., 2023) with prefix caching.
出于效率和便于更直观分析的原因，我们没有在完整的函数调用和数学方程训练集上训练模型，而是选择首先为每个任务创建一个失败数据集。更具体地说，我们提示每个模型为每个任务生成最多 64 个响应（取决于模型大小）来回应每个用户查询，并仅保留模型失败的那些查询（基于每个任务相关的验证器）。通常我们为较大的模型生成更多响应，因为它们比小型模型失败的频率更低，否则将得到较少的训练样本。为加速拒绝采样过程，我们使用了带有前缀缓存的 vLLM（Kwon 等，2023）。

This approach has several advantages. First and foremost, it saves time because there is no point training our self-reflection model on queries it already handles successfully and hence cannot learn from. Second, by generating several responses per query, we make the data more robust; for example, if a base model generates a correct response to the same query

80 %

of the time, we can still learn from the remaining

20 %

since responses are not deterministic. Finally, by only having failure cases in our dataset, we can precisely determine how many samples the model needed to train on before it converged on the optimum self-reflection.
这种方法有几个优点。首先，它节省时间，因为在模型已经成功处理的查询上训练自我反思模型是没有意义的，因此无法学习。其次，通过为每个查询生成多个响应，我们使数据更加健壮；例如，如果基础模型对同一查询正确响应了

80 %

，我们仍然可以从剩余的

20 %

学习，因为响应并非确定性的。最后，通过数据集中仅包含失败案例，我们可以精确地确定模型在收敛到最佳自我反思之前需要训练多少样本。
We must emphasize that we took this approach purely for reasons of efficiency and analysis, and it is otherwise functionally equivalent to learning from a real-world scenario where we receive both successful and failed responses.
我们必须强调，我们采用这种方法纯粹是出于效率和分析的原因，在功能上等同于在现实场景中同时接收成功和失败的响应的学习方式。

4.4 Multi-Step GRPO 4.4 多步骤 GRPO

We used the TRL framework (von Werra et al., 2020) as a starting base to implement our multi-step GRPO algorithm (i.e., to learn from the second attempt after self-reflection). In particular, we extend the GRPOTrainer and alter its _prepare_inputs function to call a second_step function that, given the completions generated by the GRPOTrainer, will perform another step of completion generations, without affecting the mask already computed by the GRPOTrainer. As we operate on the dataset of failures, prompting the model to generate its self-reflection commentary, the mask corresponds to the tokens of the self-reflection text. This way, we can perform as many secondary steps on the initial completions as necessary and only reward the tokens (through the mask generated by the GRPOTrainer) corresponding to the initial completions. The second_step function also adds
我们使用 TRL 框架（von Werra 等，2020）作为基础来实现我们的多步骤 GRPO 算法（即从第二次尝试中学习自我反思）。具体而言，我们扩展了 GRPOTrainer 并修改其_prepare_inputs 函数，调用 second_step 函数，该函数在 GRPOTrainer 生成完成内容后，执行另一轮完成内容生成，且不会影响 GRPOTrainer 已计算的掩码。由于我们在失败数据集上操作，提示模型生成自我反思评论，掩码对应于自我反思文本的标记。通过这种方式，我们可以对初始完成内容进行任意次数的次要步骤，并仅奖励（通过 GRPOTrainer 生成的掩码）对应于初始完成内容的标记。second_step 函数还添加了

APIGen	Vanilla 1st Try 普通首次尝试	+ Reflection 2nd Try + 反思第二次尝试	Trained 1st Try 首次训练	+ Reflection 2nd Try + 反思第二次尝试
Qwen-2-1.5B Instruct	32.6%	34.8%	48.6%	52.9%
Qwen-2-7B Instruct Qwen-2-7B 指令模式	66.4%	69.4%	72.2%	77.3%
Llama-3.1-8B Instruct Llama-3.1-8B 指令模式	64.9%	70.9%	68.7%	74.9%
Phi-3.5-mini Instruct (3.8B) Phi-3.5-mini 指令模式（3.8B）	47.5%	50.2%	52.9%	56.0%
Qwen-2-72B Instruct Qwen-2-72B 指令模式	73.7%	76.6%	-	-
Llama-3.1-70B Instruct Llama-3.1-70B 指令模型	66.8%	76.9%	-	-
Palmyra-X4 (73B) Palmyra-X4（73B）	79.9%	83.5%	-	-

Table 1: APIGen Results This table shows model performance in terms of accuracy on our APIGen test set ( 12,000 samples) both on the first and second attempt, and with/without our GRPO selfreflection training.
表 1：APIGen 结果本表展示了在我们的 APIGen 测试集（12,000 个样本）上，模型在首次和二次尝试中的性能，并对比了使用和不使用我们的 GRPO 自我反思训练的情况。
a data structure to the inputs sent to the reward function that helps in understanding the performance of the initial completion on the successive steps. This multi-step approach allows us to integrate any complex downstream reward mechanism instead of only rewarding the initial completions.
一种数据结构，添加到发送至奖励函数的输入中，有助于理解初始完成在后续步骤中的性能。这种多步骤方法允许我们整合任何复杂的下游奖励机制，而不仅仅是奖励初始完成。

We trained our models on the respective failure datasets for up to 1,750 steps with an effective batch size of 256 failures (though in practice most models converged significantly faster) and we evaluated them at their convergence point. For example, the function calling experiment on Llama-3.1-8B Instruct required only 100 training steps and utilized less than 2,000 unique queries. Only one function calling experiment saw the entire dataset of 48,000 queries; the average across all function calling experiments was less than 25,000 unique queries. The most any math equation writing experiments used was less than 25,000 unique problems; the average of all math equation writing experiments was around 15,000 unique problems.
我们在各自的失败数据集上训练模型，最多进行 1,750 步，有效批量大小为 256 个失败案例（尽管实际上大多数模型收敛速度更快），并在其收敛点对它们进行评估。例如，Llama-3.1-8B Instruct 上的函数调用实验仅需要 100 个训练步骤，并利用了不到 2,000 个唯一查询。只有一个函数调用实验使用了全部 48,000 个查询的数据集；所有函数调用实验的平均数少于 25,000 个唯一查询。数学方程写作实验使用的最多查询数不超过 25,000 个唯一问题；所有数学方程写作实验的平均数约为 15,000 个唯一问题。
We used standard GRPO training parameters as described in the original DeepSeek implementation (Shao et al., 2024), and conducted some hyperparameter experimentation. In our final experiments, we set the KL divergence coefficient to 0.001 , and used a learning rate of

5 e - 7

with a cosine annealing schedule and a warmup ratio of 0.03 . To train each model, we used between 4 and 8 H100 GPUs. We limit our experiments to models between 1.5 billion and 8 billion parameters due to known computational efficiency and scalability concerns with GRPO (Zhang and Zuo, 2025).
我们使用了原始 DeepSeek 实现（Shao 等，2024）中描述的标准 GRPO 训练参数，并进行了一些超参数实验。在最终实验中，我们将 KL 散度系数设置为 0.001，并使用了

5 e - 7

的学习率，配合余弦退火调度和 0.03 的预热比率。为了训练每个模型，我们使用了 4 到 8 块 H100 GPU。由于已知 GRPO 在计算效率和可扩展性方面存在问题（Zhang 和 Zuo，2025），我们将实验限制在了 15 亿到 80 亿参数的模型范围内。

In addition to the experimental results reported here, we also carried out experiments with some smaller models. We quickly discovered, however, that these models had a very limited capacity to answer accurately and self-reflect; e.g. Qwen2/Qwen2.5 0.5B Instruct and Llama3.2-1B Instruct. Similarly, while Microsoft’s Phi 3.5 mini model was able to handle function calling, it struggled significantly with equation writing. We do not report results for these models.
除了此处报告的实验结果，我们还对一些较小的模型进行了实验。然而，我们很快发现这些模型准确回答和自我反思的能力非常有限；例如 Qwen2/Qwen2.5 0.5B Instruct 和 Llama3.2-1B Instruct。同样，尽管 Microsoft 的 Phi 3.5 mini 模型能够处理函数调用，但它在方程写作方面表现极其困难。我们没有报告这些模型的结果。

5 Experimental Results 5 实验结果

Our main experimental results are shown in Table 1 and Table 2. Specifically, Table 1 shows model performance for each model’s first and second attempts on the APIGen test set (12,000 samples) both before and after our multi-step GRPO training, while Table 2 shows the same but for the Countdown test set (15,000 samples).
我们的主要实验结果如表 1 和表 2 所示。具体而言，表 1 展示了在 APIGen 测试集（12,000 个样本）上，每个模型在进行多步骤 GRPO 训练前后的首次和第二次尝试的性能，而表 2 则针对倒计时测试集（15,000 个样本）显示了相同的结果。

In terms of APIGen, we first note that model size correlates perfectly with model performance after one attempt (as expected). We also note that performance increased by an average of

4.5 %

after a second attempt using a self-reflection, which again is in line with previous work. We see the biggest increase after our GRPO training however, where although we only reward self-reflection tokens, almost all models are able to outperform even the two-attempt vanilla models after just a single attempt. We hypothesize this is because the self-reflection tokens help with model reasoning in general, so the model benefits even if it does not need to generate an explicit self-reflection. Nevertheless, self-reflection still helps after our training, and performance increases a further

4.7 %

(on average) when models can self-reflect for their second attempt. Most strikingly, we observe that our Qwen-2-7B model after GRPO training is able to outperform a vanilla Qwen-2-72B model when both models are given two attempts, even though the latter model is 10x bigger than the first.
就 APIGen 而言，我们首先注意到模型大小与一次尝试后的模型性能完全相关（这是意料之中的）。我们还观察到，通过自我反思进行第二次尝试后，性能平均提高了

4.5 %

，这与之前的研究结果一致。然而，在我们的 GRPO 训练之后，性能提升最为显著。尽管我们仅奖励自我反思标记，但几乎所有模型在单次尝试后就能超越两次尝试的原始模型。我们推测，这是因为自我反思标记有助于模型整体推理，即使模型不需要生成显式的自我反思，它仍能受益。尽管如此，自我反思在训练后仍然有帮助，当模型可以在第二次尝试时进行自我反思时，性能平均再提高

4.7 %

。最引人注目的是，经过 GRPO 训练后，我们的 Qwen-2-7B 模型能够在两次尝试中胜过一个原始的 Qwen-2-72B 模型，尽管后者的模型规模比前者大 10 倍。

Countdown 倒计时	Vanilla 1st Try 原始首次尝试	+ Reflection 2nd Try + 第二次反思	Trained 1st Try 第一次训练	+ Reflection 2nd Try + 第二次反思
Qwen-2.5-1.5B Instruct Qwen-2.5-1.5B 指令模型	6.0%	10.2%	34.9%	45.0%
Qwen-2.5-3B Instruct	18.8%	29.0%	33.9%	47.3%
Qwen-2.5-7B Instruct	31.7%	38.0%	41.6%	50.3%
Llama-3.1-8B Instruct	2.2%	4.6%	8.8%	17.8%
Llama-3.2-3B Instruct	2.1%	3.0%	8.8%	13.8%
Palmyra 1.7B	26.8%	31.8%	33.3%	38.6%
Qwen-2.5-32B Instruct Qwen-2.5-32B 指令模型	38.6%	45.1%	-	-
Qwen-2.5-72B Instruct Qwen-2.5-72B 指令模型	45.2%	49.9%	-	-
Llama-3.1-70B Instruct Llama-3.1-70B 指令模型	17.3%	25.5%	-	-
Palmyra-X4 (73B)	46.8%	51.6%	-	-

Table 2: Countdown Results This table shows model performance in terms of accuracy on the Countdown test set (15,000 samples) both on the first and second attempt, and with/without our GRPO self-reflection training.
表 2：倒计时结果该表展示了在倒计时测试集（15,000 个样本）上，模型在首次和二次尝试中的准确性表现，并对比了是否使用我们的 GRPO 自我反思训练。

Transitioning from Task-Specific to Generalisable Self-Reflections
从任务特定到可泛化的自我反思

Figure 2: Better Self-Reflections We observe that reflections generated by vanilla models tend to be long, confusing, and redundant, whereas GRPO fine-tuned models produce much shorter, clearer, and more generalisable reflections.
图 2：更好的自我反思我们观察到，原始模型生成的反思倾向于冗长、混乱且重复，而经过 GRPO 微调的模型可以生成更加简洁、清晰和更具普适性的反思。

In terms of Countdown, it is first worth noting that performance was lower across the board, and the vanilla Llama models in particular (both Llama-3.1 and Llama-3.2) really struggled to complete the task; for example, the Llama-3.1.70B model was outclassed by even the Qwen-2.5-3B model, which is more than 20x smaller. Otherwise, the pattern of improvement is similar to the APIGen experiments, albeit at a slightly higher magnitude: self-reflection increased performance by an average of

5.3 %

and

8.6 %

respectively before and after our GRPO training. We hypothesize that these larger gains come from the fact the models started from a lower baseline and hence had a greater opportunity to learn.
就倒计时而言，首先值得注意的是，整体性能较低，特别是原始的 Llama 模型（无论是 Llama-3.1 还是 Llama-3.2）确实在完成任务方面举步维艰；例如，Llama-3.1.70B 模型甚至被比它小 20 多倍的 Qwen-2.5-3B 模型击败。不过，性能提升的模式与 APIGen 实验类似，只是幅度略大：自我反思分别在我们的 GRPO 训练前后平均提高了

5.3 %

和

8.6 %

的性能。我们推测，这些更大的收益来自于模型起点较低，因此有更大的学习空间。
Ultimately, our findings not only reinforce previous work on the benefits of self-reflection, they also demonstrate how learning to optimize for self-reflection with GRPO can improve performance further still.
最终，我们的发现不仅强化了之前关于自我反思益处的研究，还证明了通过 GRPO 学习优化自我反思可以进一步提升性能。

Model 模型	MMLU-Pro		GSM8K		HellaSwag		MATH
Model 模型	Vanilla 原味的	Trained 经过训练的	Vanilla 原味的	Trained 经过训练的	Vanilla 原味的	Trained 训练的	Vanilla 原味的	Trained 训练的
Function calling models 函数调用模型
Qwen2-1.5B Instruct	22.1%	21.9%	59.6%	59.5%	66.0%	66.1%	21.2%	20.9%
Qwen2-7B Instruct	43.4%	42.8%	77.6%	77.0%	80.7%	80.5%	34.0%	33.7%
Llama3.1-8B Instruct	41.0%	40.9%	78.0%	77.7%	79.3%	79.3%	31.1%	31.5%
Phi3.5-mini Phi3.5-迷你	44.6%	44.5%	79.2%	79.7%	76.8%	76.8%	31.6%	31.0%
Math equation writing models 数学方程写作模型
Qwen2.5-1.5B Instruct Qwen2.5-1.5B 指令调优版	30.9%	31.5%	59.4%	58.0%	68.3%	68.3%	25.5%	26.3%
Qwen2.5-3B Instruct Qwen2.5-3B 指令调优版	33.4%	32.7%	65.3%	65.1%	74.9%	74.9%	32.2%	31.6%
Qwen2.5-7B Instruct 千问 2.5-7B 指令模型	44.8%	44.2%	84.2%	82.1%	80.5%	80.5%	42.3%	40.3%
Llama3.1-8B Instruct 羊驼 3.1-8B 指令模型	41.0%	41.1%	78.0%	77.3%	79.3%	79.6%	31.1%	31.0%
Llama3.2-3B Instruct 羊驼 3.2-3B 指令模型	31.5%	31.7%	65.8%	65.0%	70.4%	71.0%	28.3%	28.2%
Palmyra 1.7B 帕尔米拉 1.7B 模型	31.7%	31.7%	76.2%	76.3%	50.0%	49.9%	43.0%	43.0%

Table 3: Catastrophic Forgetting Analysis Comparison between vanilla and GRPO fine-tuned models on common LLM benchmarks shows that despite fine-tuning, we observe minimal catastrophic forgetting, with fine-tuned models maintaining strong performance on these standard benchmarks.
表 3：在常见的 LLM 基准测试中，比较原始模型和使用 GRPO 微调的模型的灾难性遗忘分析，表明尽管经过微调，我们观察到极小的灾难性遗忘，微调后的模型在这些标准基准测试中仍保持强劲的性能。

5.1 Better Self-Reflections
5.1 更好的自我反思

To provide an insight into how self-reflections improve after self-reflection training, we present a qualitative example of a self-reflection generated by a vanilla model alongside a self-reflection generated by the same model after GRPO training in Figure 2. It is immediately obvious that vanilla self-reflections are much longer, more verbose, and repetitive compared to the more concise, optimized self-reflections after training. While this intuitively makes sense - humans likewise prefer short, simple instructions - this finding contrasts with chain-of-thought-style outputs, which are believed to perform better precisely because they are more verbose. We leave it as an open question as to when it may be more beneficial for a model to generate concise vs. verbose output.
为了洞察自我反思训练如何改进自我反思，我们在图 2 中展示了一个定性示例，对比了原始模型生成的自我反思和经过 GRPO 训练后同一模型生成的自我反思。立即可以看出，原始模型的自我反思更长、更冗长、更重复，而训练后的自我反思则更加简洁和优化。尽管直觉上这是有道理的——人类也更喜欢简短、简单的指令——但这一发现与连锁思考风格的输出形成对比，后者恰恰被认为因其冗长性而表现更好。我们将何时为模型生成简洁输出还是冗长输出更有益这一问题留作开放性探讨。

5.2 Low Catastrophic Forgetting
5.2 低灾难性遗忘

A common concern when fine-tuning models is catastrophic forgetting, i.e. when a model learns to specialize on one task at the expense of others (Li and Hoiem, 2016, Lopez-Paz and Ranzato, 2017, Kotha et al., 2024). Since our self-reflection training is designed to improve performance in a task agnostic way, we evaluate our models on several diverse benchmarks (MMLU-Pro (Wang et al., 2024), GSM8K (Cobbe et al., 2021), HellaSwag (Zellers et al., 2019), and MATH (Hendrycks et al., 2021) in order to assess their capacity for language understanding, mathematical problem solving, and commonsense reasoning both before and after self-reflection training. We do this using the common evaluation benchmark framework lm-eval (Gao et al., 2024). Our hypothesis is that performance should remain relatively unchanged, since we never optimize for a specific task, but instead optimize self-reflection reasoning in general.
在微调模型时，一个常见的担忧是灾难性遗忘，即模型在专注于一个任务时会牺牲其他任务的性能（Li and Hoiem，2016，Lopez-Paz and Ranzato，2017，Kotha 等，2024）。由于我们的自我反思训练旨在以任务无关的方式提高性能，我们在多个不同的基准测试中评估了我们的模型（MMLU-Pro（Wang 等，2024），GSM8K（Cobbe 等，2021），HellaSwag（Zellers 等，2019）和 MATH（Hendrycks 等，2021），以评估它们在自我反思训练前后在语言理解、数学问题求解和常识推理方面的能力。我们使用通用的评估基准框架 lm-eval（Gao 等，2024）进行这项评估。我们的假设是性能应该保持相对稳定，因为我们从未针对特定任务进行优化，而是普遍优化自我反思推理。
We present our results in Table 3, and find that performance does indeed remain stable after selfinflection training. In most cases, there is less than

1 %

degradation compared to the base model, and some models even improve; e.g. Qwen-2.5-1.5B performance increases by

0.6 %

and

0.8 %

respectively on MMLU-Pro and MATH after self-reflection training on the Countdown dataset. We treat this as evidence our approach is robust to catastrophic forgetting.
我们在表 3 中展示了结果，发现性能确实在自我反思训练后保持稳定。在大多数情况下，与基础模型相比性能下降不到

1 %

，有些模型甚至有所改进；例如，在 Countdown 数据集上进行自我反思训练后，Qwen-2.5-1.5B 在 MMLU-Pro 和 MATH 上的性能分别提高了

0.6 %

和

0.8 %

。我们将此视为我们的方法能够抵抗灾难性遗忘的证据。

6 Conclusion 6 结论

In this paper, we have shown that it is possible to significantly improve LLM performance by training a model to improve at self-reflection rather than at a particular task. This indirect approach depends only on a validator that can detect whether a model response is correct or incorrect, and so is particularly well-suited to tasks where responses can be easily verified; e.g. whether JSON output is formatted correctly, whether generated code is actually executable, or whether all the constraints of an equation are satisfied.
在本文中，我们展示了通过训练模型改进自我反思能力，而非针对特定任务，可以显著提高 LLM 的性能。这种间接方法仅依赖于能够检测模型响应是否正确的验证器，因此特别适合那些响应可以轻松验证的任务；例如，JSON 输出是否格式正确，生成的代码是否可实际执行，或者方程的所有约束是否得到满足。

We demonstrated the efficacy of our approach through experiments on the APIGen function calling and Countdown math equation solving datasets, and found that models trained for self-reflection using GRPO improved performance by an average of

9.0 %

on the function calling test set (12,000 samples) and

16.0 %

on the Countdown match equation dataset ( 15,000 samples). We furthermore found that smaller self-reflection trained models could outperform larger untrained models on both tasks, despite their size difference; e.g. Qwen-2-7B Instruct (trained) outperformed Qwen2-72B Instruct (untrained) on function calling, and Qwen2.5-7B Instruct (trained) outperformed Qwen2.572B Instruct (untrained) on Countdown math equations. Our models were also robust to catastrophic forgetting.
我们通过在 APIGen 函数调用和倒计时数学方程求解数据集上的实验证明了我们方法的有效性，发现使用 GRPO 进行自我反思训练的模型在函数调用测试集（12,000 个样本）上平均性能提高了

9.0 %

，在倒计时数学方程数据集（15,000 个样本）上提高了

16.0 %

。我们进一步发现，经过自我反思训练的较小模型可以在这两个任务上超越更大的未训练模型，尽管存在大小差异；例如，Qwen-2-7B Instruct（经过训练）在函数调用上优于 Qwen2-72B Instruct（未训练），Qwen2.5-7B Instruct（经过训练）在倒计时数学方程上优于 Qwen2.572B Instruct（未训练）。我们的模型还对灾难性遗忘具有鲁棒性。
Although we only trained models to improve at self-reflection, we found they also performed significantly better even when they did not need to self-reflect; i.e. they succeeded on the first attempt so there was no need to reflect and try again. We hypothesize that this is because by focusing on self-reflection rather than a particular task, models may have improved their reasoning skills more generally. In future work, we hope to investigate whether self-reflection training generalizes across different tasks.
尽管我们仅训练模型提高自我反思能力，但我们发现即使在模型不需要自我反思的情况下，它们的表现也显著更好；即在第一次尝试时就成功，因此无需反思和重试。我们假设这是因为通过关注自我反思而非特定任务，模型可能更广泛地提高了推理能力。在未来的工作中，我们希望研究自我反思训练是否可以跨不同任务泛化。

7 Limitations 7 局限性

It may not always be straightforward to define a binary success/fail validator for every task. We developed our method with the view that labeled training data may be scarce, but recognize that ground-truth labels could be used as a validator if available. Alternatively, it may also be possible to use a larger model as a judge (Zheng et al., 2023).
对于每个任务来说，定义一个二元成功/失败验证器并不总是直接了当。我们开发这种方法的出发点是标注训练数据可能稀缺，但也认识到如果有可用的，基础事实标签可以用作验证器。或者，使用更大的模型作为裁判也是可能的（Zheng 等，2023）。
We also find that our approach does not work for all models and all tasks; the model must have some basic ability to perform the task, self-reflect, and learn in order for boosting self-correction ability to work. For example, Llama3.2-3B Instruct was unable to learn to self-correct on the function calling task.
我们还发现，我们的方法并不适用于所有模型和所有任务；模型必须具备执行任务、自我反思和学习的基本能力，才能提高自我纠正的能力。例如，Llama3.2-3B Instruct 无法学习在函数调用任务中进行自我纠正。

References 参考文献

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, and 110 others. 2024. Phi-3 technical report: A highly capable language model locally on your phone Preprint, arXiv:2404.14219.
玛拉·阿布丁、乔蒂·安杰、哈尼·阿瓦达拉、艾哈迈德·阿瓦达拉、阿米尔·艾哈迈德·阿旺、阮·巴赫、阿米特·巴赫里、阿拉什·巴赫蒂亚里、贾明·鲍、哈基拉特·贝赫尔、阿隆·本海姆、米沙·比连科、约翰·比约克、塞巴斯蒂安·泡布克、马丁·蔡、蔡钦、维什拉夫·乔杜里、董辰、董冬冬，以及另外 110 位作者。2024. Phi-3 技术报告：一个在您手机上本地运行的高性能语言模型预印本，arXiv:2404.14219。

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. Large language models for mathematical reasoning: Progresses and challenges. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225-237, St. Julian’s, Malta. Association for Computational Linguistics.
珍妮斯·安、里舒·维尔马、伦泽·娄、迪·刘、瑞·张和温鹏·尹。2024. 大型语言模型用于数学推理：进展与挑战。在《第 18 届欧洲计算语言学协会会议学生研究研讨会论文集》，第 225-237 页，圣朱利安斯，马耳他。计算语言学协会。

Nicholas Asher, Swarnadeep Bhar, Akshay Chaturvedi, Julie Hunter, and Soumya Paul. 2023. Limits for learning with language models. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 236-248, Toronto, Canada. Association for Computational Linguistics.
Nicholas Asher, Swarnadeep Bhar, Akshay Chaturvedi, Julie Hunter 和 Soumya Paul. 2023. 语言模型学习的局限性。载于《第 12 届词汇与计算语义联合会议论文集》(*SEM 2023)，第 236-248 页，加拿大多伦多。计算语言学协会。

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, and 12 others. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv, abs/2204.05862.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume 和其他 12 位作者. 2022. 通过人类反馈强化学习训练有用且无害的助手。ArXiv, abs/2204.05862.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse 和 John Schulman. 2021. 训练验证器以解决数学应用题。预印本，arXiv:2110.14168。

Alan Dao and Thinh Le. 2025. Rezero: Enhancing llm search ability by trying one-more-time. Preprint, arXiv:2504.11001.
Alan Dao 和 Thinh Le. 2025. Rezero：通过再次尝试增强 LLM 搜索能力。预印本，arXiv:2504.11001。

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. The language model evaluation harness.
Leo Gao、Jonathan Tow、Baber Abbasi、Stella Biderman、Sid Black、Anthony DiPofi、Charles Foster、Laurence Golding、Jeffrey Hsu、Alain Le Noac'h、Haonan Li、Kyle McDonell、Niklas Muennighoff、Chris Ociepa、Jason Phang、Laria Reynolds、Hailey Schoelkopf、Aviya Skowron、Lintang Sutawika 及其他 5 位。2024 年。语言模型评测工具集。

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models Preprint, arXiv:2407.21783.
Aaron Grattafiori、Abhimanyu Dubey、Abhinav Jauhri、Abhinav Pandey、Abhishek Kadian、Ahmad Al-Dahle、Aiesha Letman、Akhil Mathur、Alan Schelten、Alex Vaughan、Amy Yang、Angela Fan、Anirudh Goyal、Anthony Hartshorn、Aobo Yang、Archi Mitra、Archie Sravankumar、Artem Korenev、Arthur Hinsvark 及其他 542 位。2024 年。Llama 3 模型群预印本，arXiv:2407.21783。

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS.
Dan Hendrycks、Collin Burns、Saurav Kadavath、Akul Arora、Steven Basart、Eric Tang、Dawn Song 和 Jacob Steinhardt。2021 年。使用数学数据集测量数学问题求解能力。神经信息处理系统大会（NeurIPS）。

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531.
Geoffrey Hinton、Oriol Vinyals 和 Jeff Dean。2015 年。提取神经网络中的知识。预印本，arXiv:1503.02531。

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049-1065, Toronto, Canada. Association for Computational Linguistics.
黄洁和陈建传. 2023. 迈向大型语言模型中的推理：一项调查. 在计算语言学协会大会成果集: ACL 2023, 第 1049-1065 页, 多伦多, 加拿大. 计算语言学协会.

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations.
黄洁, 陈新云, 斯瓦鲁普·米什拉, 郑晓旭, 于德伟, 宋新颖, 和周大伟. 2024. 大型语言模型尚不能自我纠正推理. 在第十二届学习表征国际会议.

James Huckle and Sean Williams. 2025. Easy Problems that LLMs Get Wrong. In Advances in Information and Communication, pages 313-332, Cham. Springer Nature Switzerland.
詹姆斯·赫克尔和肖恩·威廉姆斯. 2025. LLMs 错误解决的简单问题. 载于信息与通信进展, 第 313-332 页, 瑞士钦. 施普林格出版社.

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating LLM hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827-1843, Singapore. Association for Computational Linguistics.
纪子伟, 余铁政, 徐妍, 李娜延, 石枝惠, 和方静. 2023. 通过自我反思减轻 LLM 幻觉. 在计算语言学协会大会成果集: EMNLP 2023, 第 1827-1843 页, 新加坡. 计算语言学协会.

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation Preprint, arXiv:2406.00515.
姜巨勇、王帆、谢佳斯、金圣珠和金圣恒。2024 年。关于代码生成大语言模型的调查预印本，arXiv:2406.00515。

Geunwoo Kim, Pierre Baldi, and Stephen Marcus McAleer. 2023. Language models can solve computer tasks. In Thirty-seventh Conference on Neural Information Processing Systems.
金根宇、皮埃尔·巴尔迪和斯蒂芬·马库斯·麦卡利尔。2023 年。语言模型可以解决计算机任务。载于第三十七届神经信息处理系统大会。

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2024. Understanding catastrophic forgetting in language models via implicit inference Preprint, arXiv:2309.10105.
苏哈斯·科塔、雅各布·米切尔·斯普林格和阿迪蒂·拉古纳桑。2024 年。通过隐式推理理解语言模型中的灾难性遗忘预印本，arXiv:2309.10105。

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. 2024. Training language models to self-correct via reinforcement learning Preprint, arXiv:2409.12917.
阿维拉尔·库马尔、文森特·庄、里沙布·阿加瓦尔、伊·苏、约翰·D·科雷耶斯、阿维·辛格、凯特·鲍姆利、沙里克·伊克巴尔、科尔顿·主教、丽贝卡·罗尔夫斯、雷·M·张、凯·麦金尼、迪莎·什里瓦斯塔瓦、科斯敏·帕杜拉鲁、乔治·塔克、多伊娜·普雷库普、费里亚尔·贝哈巴尼和亚历桑德拉·福斯特。2024 年。通过强化学习训练语言模型自我纠正预印本，arXiv:2409.12917。

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang 和 Ion Stoica. 2023. 使用分页注意力进行大型语言模型服务的高效内存管理. 载于第 29 届操作系统原理学会（ACM SIGOPS）研讨会论文集.

Xuefeng Li, Haoyang Zou, and Pengfei Liu. 2025. Torl: Scaling tool-integrated rl. Preprint, arXiv:2503.23383.
Xuefeng Li, Haoyang Zou 和 Pengfei Liu. 2025. Torl：扩展工具集成强化学习. 预印本，arXiv:2503.23383.

Zhizhong Li and Derek Hoiem. 2016. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2935-2947.
Zhizhong Li 和 Derek Hoiem. 2016. 无遗忘学习. IEEE 模式分析与机器智能汇刊, 40:2935-2947.

Liping Liu, Chunhong Zhang, Likang Wu, Chuang Zhao, Zheng Hu, Ming He, and Jianping Fan. 2025. Instruct-of-reflection: Enhancing large language models iterative reflection capabilities via dynamic-meta instruction. Preprint, arXiv:2503.00902.
Liping Liu, Chunhong Zhang, Likang Wu, Chuang Zhao, Zheng Hu, Ming He 和 Jianping Fan. 2025. 反思指令：通过动态元指令增强大型语言模型的迭代反思能力. 预印本，arXiv:2503.00902.

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. 2024a. Best practices and lessons learned on synthetic data. In First Conference on Language Modeling.
Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou 和 Andrew M. Dai. 2024a. 合成数据的最佳实践和经验教训。载于语言建模第一次会议。

Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh RN, and 1 others. 2024b. Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets. Advances in Neural Information Processing Systems, 37:54463-54482.
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh RN 和其他 1 人. 2024b. Apigen：用于生成可验证和多样化函数调用数据集的自动化流水线。神经信息处理系统进展，37:54463-54482。

David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6470-6479, Red Hook, NY, USA. Curran Associates Inc.
David Lopez-Paz 和 Marc'Aurelio Ranzato. 2017. 持续学习的梯度片段记忆。载于第 31 届神经信息处理系统国际会议论文集，NIPS'17，第 6470-6479 页，纽约州红钩市。Curran Associates Inc.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:4653446594.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang 和其他 1 人. 2023. 自我改进：带有自反馈的迭代改进。神经信息处理系统进展，36:4653-4465-94。

Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. 2025a. Learning adaptive parallel reasoning with language models Preprint, arXiv:2504.15466.
Jiayi Pan、Xiuyu Li、Long Lian、Charlie Snell、Yifei Zhou、Adam Yala、Trevor Darrell、Kurt Keutzer 和 Alane Suhr. 2025a. 基于语言模型学习自适应并行推理预印本，arXiv:2504.15466。

Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. 2025b. Tinyzero. https://github.com/Jiayi-Pan/TinyZero. Accessed: 2025-05-05.
Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng 和 Alane Suhr. 2025b. Tinyzero. https://github.com/Jiayi-Pan/TinyZero. 访问时间：2025-05-05。

Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, and Lijun Wu. 2025c. Lemma: Learning from errors for mathematical advancement in llms, Preprint, arXiv:2503.17439.
Zhuoshi Pan、Yu Li、Honglin Lin、Qizhi Pei、Zinan Tang、Wei Wu、Chenlin Ming、H. Vicky Zhao、Conghui He 和 Lijun Wu. 2025c. Lemma：从错误中学习以促进 LLMs 的数学进步，预印本，arXiv:2503.17439。

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. Preprint, arXiv:2302.12813.
Baolin Peng、Michel Galley、Pengcheng He、Hao Cheng、Yujia Xie、Yu Hu、Qiuyuan Huang、Lars Liden、Zhou Yu、Weizhu Chen 和 Jianfeng Gao. 2023. 检查你的事实并重试：利用外部知识和自动反馈改进大型语言模型。预印本，arXiv:2302.12813。

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. 2025. Toolrl: Reward is all tool learning needs, Preprint, arXiv:2504.13958.
Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur 和 Heng Ji. 2025. Toolrl: 学习工具需要的只是奖励，预印本，arXiv:2504.13958。

Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. 2024. Recursive introspection: Teaching language model agents how to self-improve Preprint, arXiv:2407.18219.
Yuxiao Qu, Tianjun Zhang, Naman Garg 和 Aviral Kumar. 2024. 递归内省：教授语言模型代理如何自我改进，预印本，arXiv:2407.18219。

Matthew Renze and Erhan Guven. 2024. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682.
Matthew Renze 和 Erhan Guven. 2024. LLM 代理中的自我反思：对问题求解性能的影响。arXiv 预印本 arXiv:2405.06682。

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms Preprint, arXiv:1707.06347.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford 和 Oleg Klimov. 2017. 近端策略优化算法，预印本，arXiv:1707.06347。

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models Preprint, arXiv:2402.03300.
邵智宏、王佩仪、朱启豪、徐润新、宋俊骁、毕小、张浩伟、张明川、李永坤、吴毅、郭大亚. 2024. DeepSeekMath：推进开放语言模型数学推理的极限. 预印本, arXiv:2402.03300.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634-8652.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan 和 Shunyu Yao. 2023. Reflexion：具有语言强化学习的智能体. 神经信息处理系统进展, 36:8634-8652.

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl
Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, 黄盛义, Kashif Rasul 和 Quentin Gallouédec. 2020. TRL：Transformer 强化学习. https://github.com/huggingface/trl

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark Preprint, arXiv:2406.01574.
王宇博、马旭光、张戈、倪元生、Abhranil Chandra、郭世光、任伟明、Aaran Arulraj、何轩、蒋子岩、李天乐、Max Ku、王凯、Alex Zhuang、范荣琦、岳翔和陈文虎. 2024. MMLU-Pro：一个更加健壮和具有挑战性的多任务语言理解基准. 预印本, arXiv:2406.01574.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824-24837. Curran Associates, Inc.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, 和 Denny Zhou. 2022. 思维链提示引发大型语言模型的推理能力. 载于《神经信息处理系统进展》，第 35 卷，第 24824-24837 页。Curran Associates, Inc.

Writer.com. 2024. Palmyra x4 I tool calling llm. https://writer.com/llms/palmyra-x4/. Accessed: 2025-05-29.
Writer.com. 2024. Palmyra x4 I 工具调用 LLM. https://writer.com/llms/palmyra-x4/. 访问时间：2025-05-29.

Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, and Lei Feng. 2025. Rethinking chain-of-thought from the perspective of self-training Preprint, arXiv:2412.10827.
吴宗谦、徐保铎、崔若辰、詹梦梦、朱小峰和冯雷. 2025. 从自训练的角度重新思考思维链. 预印本，arXiv:2412.10827.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and 43 others. 2024. Qwen2 technical report Preprint, arXiv:2407.10671.
安杨、杨宝松、惠斌元、郑博、于伯文、周昌、李成鹏、李陈元、刘德一恒、黄飞、董冠廷、魏浩然、林欢、唐嘉龙、王佳琳、杨建、涂建宏、张建伟、马建新，以及其他 43 位作者. 2024. Qwen2 技术报告. 预印本，arXiv:2407.10671.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
An Yang、Baosong Yang、Beichen Zhang、Binyuan Hui、Bo Zheng、Bowen Yu、Chengyuan Li、Dayiheng Liu、Fei Huang、Haoran Wei、Huan Lin、Jian Yang、Jianhong Tu、Jianwei Zhang、Jianxin Yang、Jiaxi Yang、Jingren Zhou、Junyang Lin、Kai Dang 等 23 人。2025 年。Qwen2.5 技术报告。预印本，arXiv:2412.15115。

Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4393-4479, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Kevin Yang、Yuandong Tian、Nanyun Peng 和 Dan Klein。2022 年。Re3：通过递归重新提示和修订生成更长的故事。在《2022 年自然语言处理实证方法会议论文集》中，第 4393-4479 页，阿联酋阿布扎比。计算语言学协会。

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? Preprint, arXiv:1905.07830.
Rowan Zellers、Ari Holtzman、Yonatan Bisk、Ali Farhadi 和 Yejin Choi。2019 年。Hellaswag：机器真的能完成你的句子吗？预印本，arXiv:1905.07830。

Jixiao Zhang and Chunsheng Zuo. 2025. Grpo-lead: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models Preprint, arXiv:2504.09696.
Jixiao Zhang 和 Chunsheng Zuo。2025 年。GRPO-Lead：一种面向语言模型数学推理的难度感知强化学习方法。预印本，arXiv:2504.09696。

Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, and Weiming Lu. 2024. Self-contrast: Better reflection through inconsistent solving perspectives. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3602-3622, Bangkok, Thailand. Association for Computational Linguistics.
张文琦、沈永亮、吴丽娟、彭秋英、王军、庄跃廷和陆卫明. 2024. 自我对比：通过不一致的求解视角进行更好的反思. 载于《计算语言学协会第 62 届年会论文集》(第 1 卷：长论文)，第 3602-3622 页，泰国曼谷. 计算语言学协会.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, and 3 others. 2025. A survey of large language models. Preprint, arXiv:2303.18223.
韦恩·鑫·赵、周坤、李俊义、唐天一、王晓雷、侯宇鹏、闵颖芊、张蓓琛、张俊杰、董子灿、杜一帆、杨晨、陈宇硕、陈志鹏、姜锦豪、任瑞阳、李一帆、唐新宇、刘子康等. 2025. 大语言模型调查. 预印本，arXiv:2303.18223.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.
郑联民、蒋伟林、盛颖、庄思远、吴长豪、庄永豪、林梓、李卓涵、李大成、谢平·艾克辛、张浩、约瑟夫·E·冈萨雷斯和伊翁·斯托伊卡. 2023. 用 MT-Bench 和聊天机器人竞技场评判 LLM 作为评判者. 载于《第 37 届神经信息处理系统国际会议论文集》，NIPS '23，美国纽约州红钩市. Curran Associates Inc.

A Prompt Templates 附录 A 提示模板

For reproducibility and clarity, we provide details on the prompt templates used during training. To the best of our ability, we followed model provider recommendations for prompting. We iterated on prompts to achieve reasonable baselines for each model on each task.
为了保证可重复性和清晰度，我们提供了训练期间使用的提示模板的详细信息。在尽可能的范围内，我们遵循了模型提供者关于提示的建议。我们对提示进行了迭代，以在每个任务上为每个模型实现合理的基准。

A. 1 Function Calling
A. 1 函数调用

Qwen 2 models Our prompting style for Qwen 2 models for function calling is as follows. First, we provide the following system prompt:
Qwen 2 模型我们在进行 Qwen 2 模型的函数调用时采用的提示方式如下。首先，我们给出以下系统提示：

You are a helpful assistant that can answer questions and help with tasks.
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{List of tools, each on a new line}
</tools>
For each function call, return a json object with function name and arguments within
     <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": <function-name>, \"arguments\": <args-json-object>}
</tool_call>

This is followed by the user query from the dataset, as role user. The model then replies with its first attempt at the task. If the attempt is incorrect, we prompt for a self-reflection as follows:
随后是来自数据集的用户查询，使用用户角色。然后模型对任务进行首次尝试。如果尝试不正确，我们会提示进行自我反思，具体如下：

You tried performing the task, but failed in generating the correct tool call.
     Reflect on what went wrong and write a short explanation that will help you
     do better next time.

After the model generates a self-reflection, we again prompt with the system prompt and user query to set up the model for its second attempt at the task.
在模型生成自我反思后，我们再次使用系统提示和用户查询，为模型的第二次任务尝试做好准备。

Llama 3.1 and Phi 3.5 models We follow the recommended Llama 3.1 tool calling format. We found that Phi performs better following the Llama tool-calling template than the Qwen 2 template. First, we provide the following system prompt:
Llama 3.1 和 Phi 3.5 模型我们遵循推荐的 Llama 3.1 工具调用格式。我们发现 Phi 遵循 Llama 工具调用模板比遵循 Qwen 2 模板效果更好。首先，我们提供以下系统提示：

When you receive a tool call response, use the output to format an answer to the
     original user question.
You are a helpful assistant with tool calling capabilities.

Then, as role user, we provide the tools and user query from the dataset as follows:
然后，以用户角色，我们按照如下方式提供数据集中的工具和用户查询：

Given the following functions, please respond with a JSON for a function call with
     its proper arguments that best answers the given prompt.
Respond in the format {\"name\": function name, \"parameters\": dictionary of
     argument name and its value}. Do not use variables.
{List of tools, each on a new line}
Question:

This is followed by the user query from the dataset, as role user. The model then replies with its first attempt at the task. We then prompt for a self-reflection:
接下来是数据集中的用户查询，以用户角色呈现。模型随后会对任务进行第一次尝试。然后我们提示进行自我反思：

You tried performing the task, but failed in generating the correct tool call.
     Reflect on what went wrong and write a short explanation that will help you
     do better next time.

After the model generates a self-reflection, we prompt with just the user query to set up the model for its second attempt at the task.
在模型生成自我反思后，我们仅使用用户查询重新提示模型，为其第二次尝试任务做准备。

A. 2 Countdown Math Equations
A. 2 倒数数学方程

We provide the following system prompt:
我们提供以下系统提示：
Please reason step by step, and put your final answer within

∖ ∖

boxed{}.
请逐步推理，并将你的最终答案放在

∖ ∖

标签中。
Then, as role user, we provide the main problem as follows:
然后，作为用户角色，我们提出主要问题如下：

Using the numbers \{nums, in list format\} create an equation that equals \{target\}.
    \(\hookrightarrow\) You can use basic arithmetic operations (+, -, *, /) and each number can
    \(\hookrightarrow\) only be used once.
Please reason step by step, and put your final answer within \(\backslash \backslash\) boxed\{\}.

The model then replies with its first attempt at the task. Given a failure, we then prompt for a self-reflection:
模型随后尝试解决任务。如果失败，我们会提示模型进行自我反思：

You tried solving the problem and got the wrong answer. Reflect on what went wrong
    \(\hookrightarrow\) and write a short explanation that will help you do better next time.

After the model generates a self-reflection, we repeat the user message from above to set the model up for its second attempt at the task:
在模型生成自我反思后，我们重复之前的用户消息，为模型的第二次任务尝试做准备：

Using the numbers \{nums, in list format\} create an equation that equals \{target\}.
    \(\hookrightarrow\) You can use basic arithmetic operations (+, -, *, /) and each number can
    \(\hookrightarrow\) only be used once.
Please reason step by step, and put your final answer within \(\backslash \backslash\) boxed\{\}.

APIGen	Tool choice error 工具选择错误	Parameter error 参数错误	Format error 格式错误
Vanilla Qwen-2-1.5B Instruct 原始 Qwen-2-1.5B Instruct	33.9%	25.3%	8.2%
Trained Qwen-2-1.5B Instruct 经过训练的 Qwen-2-1.5B 指令模型	21.3%	25.3%	4.8%
Vanilla Qwen-2-7B Instruct 原始 Qwen-2-7B 指令模型	5.2%	27.2%	1.2%
Trained Qwen-2-7B Instruct 经过训练的 Qwen-2-7B 指令模型	3.5%	22.7%	1.6%
Vanilla Llama-3.1-8B Instruct 原始 Llama-3.1-8B 指令模型	4.6%	28.5%	2.0%
Trained Llama-3.1-8B Instruct 训练后的 Llama-3.1-8B Instruct	3.7%	25.6%	2.0%
Vanilla Phi-3.5-mini Instruct (3.8B) 原始 Phi-3.5-mini Instruct (3.8B)	18.8%	24.8%	8.9%
Trained Phi-3.5-mini Instruct (3.8B) 训练后的 Phi-3.5-mini Instruct (3.8B)	17.6%	23.9%	5.7%

Table 4: APIGen Error Analysis This table categorises model errors on the first attempt at the task with and without GRPO self-reflection training (12,000 sample test set).
表 4：APIGen 错误分析本表对具有和不具有 GRPO 自我反思训练的模型在首次尝试任务时的错误进行了分类（12,000 个样本的测试集）。

Countdown 倒计时	Invalid equation 无效方程	Wrong numbers 错误数字	Missed target 未达目标
Vanilla Qwen-2.5-1.5B Instruct 原始 Qwen-2.5-1.5B 指令模型	5.9%	73.7%	14.4%
Trained Qwen-2.5-1.5B Instruct 训练后的 Qwen-2.5-1.5B 指令模型	0.6%	34.3%	30.2%
Vanilla Qwen-2.5-3B Instruct 原始 Qwen-2.5-3B 指令模型	4.9%	52.5%	23.9%
Trained Qwen-2.5-3B Instruct 训练后的 Qwen-2.5-3B 指令模型	2.1%	33.6%	30.4%
Vanilla Qwen-2.5-7B Instruct 原始的 Qwen-2.5-7B Instruct	3.8%	39.1%	25.4%
Trained Qwen-2.5-7B Instruct 训练后的 Qwen-2.5-7B Instruct	0.1%	55.6%	2.7%
Vanilla Llama-3.1-8B Instruct 原始的 Llama-3.1-8B Instruct	0.1%	97.1%	0.6%
Trained Llama-3.1-8B Instruct 训练后的 Llama-3.1-8B Instruct	0.0%	90.4%	0.7%
Vanilla Llama-3.2-3B Instruct 原始的 Llama-3.2-3B Instruct	0.1%	97.1%	0.7%
Trained Llama-3.2-3B Instruct 经过训练的 Llama-3.2-3B Instruct	0.0%	90.4%	0.7%
Vanilla Palmyra 1.7B 原始的 Palmyra 1.7B	4.1%	57.5%	11.6%
Trained Palmyra 1.7B 经过训练的 Palmyra 1.7B	3.0%	44.6%	19.1%

Table 5: Countdown Error Analysis This table categorises model errors on the first attempt at the task with and without GRPO self-reflection training (15,000 sample test set).
表 5：倒计时错误分析本表对具有和不具有 GRPO 自我反思训练的模型在第一次尝试任务时的错误进行了分类（测试集样本数为 15,000）。

B Error Analysis B 错误分析

We categorise the errors of our models before and after training in an attempt to better understand what types of errors models are prone to on these tasks, and what types of errors can be mitigated by self-reflection training. We look exclusively at errors made on the first attempt at the task (pass@1).
我们对模型在训练前后的错误进行了分类，旨在更好地理解模型在这些任务中容易出现什么类型的错误，以及通过自我反思训练可以缓解哪些类型的错误。我们仅关注了在第一次尝试任务时（pass@1）犯的错误。

B. 1 Function Calling
B.1 函数调用

For function calling, we categorise errors into three types: errors in tool choice, errors in parameter names or values, and errors in format. We consider parameter choice to be much more difficult than tool choice.
对于函数调用，我们将错误分为三类：工具选择错误、参数名称或值错误，以及格式错误。我们认为参数选择比工具选择要困难得多。

The two smallest models (Qwen-2-1.5B Instruct and Phi-3.5-mini Instruct) struggle significantly with tool choice without training, and don’t improve much, if at all, on parameter values through training. Conversely, the larger models (7-8 billion parameters) are already quite good at tool choice without training, and training primarily seems to teach parameter selection.
两个最小的模型（Qwen-2-1.5B Instruct 和 Phi-3.5-mini Instruct）在没有训练的情况下很难进行工具选择，并且通过训练对参数值的改进不大。相比之下，较大的模型（70-80 亿参数）在没有训练的情况下已经能很好地进行工具选择，训练似乎主要是教授参数选择。

B. 2 Math Countdown Equations
B. 2 数学倒计时方程

For math countdown equations, we categorise errors into three types: an invalid equation (or one that uses disallowed characters), an equation that uses numbers outside of the provided ones (wrong numbers), and an equation that does not evaluate to the provided target (missed target).
对于数学倒计时方程，我们将错误分为三类：无效方程（或使用不允许的字符）、使用提供范围之外的数字（错误数字）以及无法得出给定目标的方程（未达到目标）。

All models struggled primarily with outputting equations that used only allowed numbers. Training significantly decreased this error for all models except for Qwen-2.5-7B Instruct. Put another way, all models except the largest Qwen model primarily learned to use the correct numbers in the equation through training, even if it resulted in missing the target, whereas Qwen-2.5-7B Instruct learned to hit the target even if it meant using the wrong numbers.
所有模型主要在输出仅使用允许的数字的方程式方面遇到困难。通过训练，除 Qwen-2.5-7B Instruct 外的所有模型的这类错误显著减少。换句话说，除最大的 Qwen 模型外，所有模型主要通过训练学会在方程式中使用正确的数字，即使这意味着未命中目标，而 Qwen-2.5-7B Instruct 则学会命中目标，即使这意味着使用错误的数字。

*Equal contribution *同等贡献

Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning 反思、重试、奖励：通过强化学习实现自我改进的大语言模型

Abstract 摘要

1 Introduction 1 引言

2 Related Work 2 相关工作

2.1 Self-Reflection 2.1 自我反思

2.2 Reinforcement Learning for Language Models2.2 语言模型的强化学习

3 Reflect, Retry, Reward3 反思、重试、奖励