- Scaling Up Training, New and Old Paradigms Continue
- Scaling Sings Odes to the Greatest Scaling Law of Computing, Moore’s Law
- Challenges in Scaling Pre-training – Data wall, fault tolerance
- Newer, Harder Evals to Climb
- Post-training: a new scaling domain
- Supervised Fine-Tuning
- Synthetic Data’s Integral Role in Post-training
- Synthetic Data Examples
- Rejection Sampling
- Judgement by Model
- Long Context Datasets
- Reinforcement Learning
- Proximal Policy Optimization (PPO)
- RLHF
- RLAIF
- Reasoning Models and Chain of Thought (CoT)
- Inference-time Scaling
- Scaling Inference Compute Through Search
- o1: Navigating Strawberry Fields
- Backtracking: Emergent or Trained?
- o1 ’Berry Training Infrastructure
- Incredible Amounts of Forward Passes During Training
- Post-Training FLOPS Exceed Pre-Training
- Rapid Iteration as Another Form of Scaling
- o1 Inference Architecture: Tokenomics
- Reasoning models face reliability issues
- o1 Pro Innovations & Cost
- Beyond o1
- Scaling Training Is Cheaper Than Scaling Inference Time Compute
In our pursuit of becoming a better full service research firm, we’ve moved off Substack. For any questions please read https://semianalysis.com/faq/#substack
There has been an increasing amount of fear, uncertainty and doubt (FUD) regarding AI scaling laws. A cavalcade of part-time AI industry prognosticators have latched on to any bearish narrative they can find, declaring the end of the scaling laws that have driven the rapid improvement in Large Language Model (LLM) capabilities over the last few years. Journalists have joined the dogpile and supported these narratives, armed with noisy leaks full of vague claims that models have failed to scale successfully due to alleged underperformance. Other skeptics point to saturated benchmarks, with newer models showing little sign of improvement on said benchmarks. Critics also point to the exhaustion of available training data and slowing hardware scaling for training.

Despite this angst, large AI Labs’ and hyperscalers’ accelerating datacenter buildouts and capital expenditures speak for themselves. From Amazon investing considerable sums to accelerate its Trainium2 custom silicon and preparing 400k chips for Anthropic at an estimated cost of $6.5B in total IT and datacenter investment, to Meta’s 2GW datacenter plans for 2026 in Indiana, to OpenAI and Google’s aggressive multi-datacenter training plans to overcome single-site power limitations – key decision makers appear to be unwavering in their conviction that scaling laws are alive and well. Why?
Scaling Up Training, New and Old Paradigms Continue
The reality is that there are more dimensions for scaling beyond simply focusing on pre-training, which has been the sole focus of most of the part-time prognosticators. OpenAI’s o1 release has proved the utility and potential of reasoning models, opening a new unexplored dimension for scaling. This is not the only technique, however, that delivers meaningful improvements in model performance as compute is scaled up. Other areas that deliver model improvements with more compute include Synthetic Data Generation, Proximal Policy Optimization (PPO), Functional Verifiers, and other training infrastructure for reasoning. The sands of scaling are still shifting and evolving, and, with it, the entire AI development process has continued to accelerate.
Shifting from faulty benchmarks to more challenging ones will enable better measures of progress. In this report we will outline the old pre-training scaling trend as well as the new scaling trends for post-training and inference time. This includes how new methods will push the frontier – and will require even more training-time compute scaling than previously thought.
We will cover OpenAI o1 and o1 Pro’s architecture from both a training infrastructure and inference tokenomics perspective including cost, KVCache scaling, batching, and more. We will also dive into leading AI Lab synthetic data and RL infrastructure. Lastly, we want to set the record straight on Anthropic’s Claude 3.5 Opus and OpenAI’s Orion’s “failures,” and what scaling plans are going forward.
Scaling Sings Odes to the Greatest Scaling Law of Computing, Moore’s Law
Today’s debate on AI scaling laws is not dissimilar to the decades-long debate around compute scaling and Moore’s law. Anyone who tries to measure CPU compute primarily by clock speed – a common metric used before the late 2000s around the time of the end of Dennard Scaling – would argue that we have not made any progress at all since then. In reality, compute has been advancing all along – when we hit a wall on processor clock speed, the focus shifted to multi-core architectures and other methods to drive performance, despite power density and cooling constraints.

Source: CPU Transistor Density, Clock Speed, Power and Performance 1970-2015 – Charles Leggett
The end of Moore’s Law is another wall with which the semiconductor industry has contended, but this debate has been quieter lately as AI pioneers like Nvidia have delivered massive compute gains by scaling along a few entirely new dimensions. Advanced packaging has enabled continued advances in compute by scaling input/output (I/O) counts and allowing chips to harness a total silicon area beyond the reticle size limit. Parallel computing within and across chips, and building larger high-bandwidth networking domains, has enabled chips to work better together at scale, especially for inference.

As with computer enthusiasts in 2004, mainstream analysts and journalists are missing the forest for the trees: despite the slowing of one trend, the industry collectively continues to move forward at a breakneck pace thanks to other newly emerging paradigms that are ripe for scaling and expansion. It is possible to stack “scaling laws” – pre-training will become just one of the vectors of improvement, and the aggregate “scaling law” will continue scaling just as Moore’s Law has over the last 50+ years.
Challenges in Scaling Pre-training – Data wall, fault tolerance
Scaling pre-training has provided significant gains in model performance, but there are a few speed bumps that the industry is currently focusing on overcoming.
One obvious speed bump is that data is increasingly difficult to collect – while data on the internet is expanding quickly, it is not expanding at a rate proportional to compute. This is why today’s trillion-parameter mega-models have been less than Chinchilla optimal, with a much lower number of training tokens relative to model parameters.
Chinchilla scaling refers to the optimal increase in data versus parameter count relative to increases in compute. Not enough data causes the model to generalize poorly, while too much data results in overtraining, which wastes compute resources. There are some instances where deviating from the optimal ratio makes sense: over-training models (e.g. GPT-4o and Llama) can decrease inference costs significantly and is preferable for providers that have a larger user base to serve said model to.
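To make the trade-off concrete, below is a minimal sketch of the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter, combined with the standard approximation that training compute is about 6 × parameters × tokens. The exact coefficients fitted in the Chinchilla paper differ slightly, so treat the numbers as directional only.

```python
# Sketch: approximate compute-optimal model size and token count for a given
# training budget, using the ~20 tokens-per-parameter rule of thumb and the
# FLOPs ~= 6 * params * tokens approximation. Directional only.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    # FLOPs ~= 6 * N * D with D ~= tokens_per_param * N
    # => N ~= sqrt(FLOPs / (6 * tokens_per_param))
    params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens

if __name__ == "__main__":
    for budget in (1e24, 1e25, 1e26):  # training budgets in FLOPs
        n, d = chinchilla_optimal(budget)
        print(f"{budget:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```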
In January of 2023, before the launch of GPT-4, we wrote about the practical limits for scaling and how GPT-4 planned to break through them. Since then, models have ping-ponged from being more than Chinchilla Optimal (much greater data than model parameters) to less than Chinchilla Optimal (when data became constrained). The compute availability speedbump was overcome in the past when improvements in training and inference hardware alleviated constraints.
With respect to today’s narrative around speed bumps – useful data sources such as textbooks and documentation are exhausted, and what remains is mostly lower-quality text data sources. Furthermore, web data is still a narrow distribution of data and models need more out of distribution data to continue to generalize. With models harder to scale in a way that is optimal, pre-training is becoming more challenging.
Also, if labs train models with an insufficient amount of data as they keep scaling, the models become over-parametrized, becoming inefficient and leading to heavy amounts of memorization rather than generalization. Labs have instead been turning to an increasing use of synthetic data to alleviate this problem.
This issue applies less to the major AI Labs, though. Meta alone has approximately 100x more data available to it than is on the public internet (if it can harness this data in a compliant manner). This may give it an edge in continuing to scale with fewer issues than others. YouTube has 720,000 new hours of video uploaded every day – and we think that AI Labs have only begun to contemplate training on the vast amount of data contained within video. This is in addition to their ability to generate high-quality synthetic data, the architecture for which we discuss later.
Training on the quadrillions of alternative tokens available from video requires a huge continuation of scaling overall training FLOPs, which will be delivered by hardware innovation and systems engineering. For instance, scaling another order of magnitude on training FLOPs will require multi-datacenter training, as the number of accelerators needed can no longer fit inside a single datacenter site. Project Rainier has Amazon providing Anthropic with 400k Trainium2 chips, but, in raw FLOPs, that is less than 100k GB200s. Anthropic will have to produce significant engineering achievements to pull off training on such a cluster. Spreading accelerators across a large campus, or multiple campuses, itself leads to significant challenges posed by Amdahl’s law, though there are already more than a few posited solutions to address this challenge.
The other constraint with respect to scaling parameters is inference economics. AI Labs can capitalize vast sums of investment into training large models and amortize the model’s use both over a large and growing userbase, as well as for internal use cases, to develop further model iterations. When it comes to inference, they must be careful not to bring to market models that are too costly or uneconomical to serve.
Evals are also not comprehensive; there are many capabilities or properties of models that existing evals do not cover well. Transfer learning, where the model gets better at a domain through learning about something else, and in-context learning are both areas where more evals need to be developed. Finally, there will always be end use cases that may be hard to predict in advance but provide an immense benefit to the end user.
That which gets measured, improves.
Newer, Harder Evals to Climb
Newer evaluations have sprung up that aim to better differentiate models and focus on directly addressing specific useful applications. SWE-Bench is one of the most important evaluations today, aiming to have models solve human-reviewed GitHub issues from open-source Python repositories. The new Claude 3.5 Sonnet currently holds state of the art on SWE-Bench Verified at 49%, but most models score much lower.
Another example is a benchmark investigating AI R&D capabilities, which some describe as “the most important capability to track.” The Research Engineering Benchmark (RE-Bench) consists of seven challenging and open-ended ML research environments. Humans generally perform better on evals over longer time horizons, but, on a 2-hour time horizon, the best AI agents achieved a score 4x higher than humans. Important tasks such as the above, in which humans currently dominate, are the perfect ground for scaling inference-time compute. We expect that models that better leverage this form of scaling will outperform humans in the future.

Source: RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
Yet another trend is for evaluations to include extremely difficult expert-level questions. Two prominent examples are Graduate-Level Google-Proof Q&A Benchmark (GPQA) and Frontier Math. GPQA is made up of 448 multiple choice questions across chemistry, biology, and physics. For context, OpenAI found that expert-level humans (i.e. people with PhDs) scored ~70% on GPQA Diamond, with o1 scoring 78% on the same set. Last year, GPT-4 with search (and CoT on abstention) scored 39% on GPQA Diamond.
Another example of the trend towards using extremely tough questions is FrontierMath (FM). FM is a benchmark of hundreds of original math questions that can take humans hours and even up to days to solve. It covers a broad range of mathematical topics, including number theory, real analysis, etc. The special sauce with this eval is that it is not published, minimizing the risk of data contamination, and can be graded via an automated verifier – simplifying the evaluation process.

Source: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
The best performing model on this benchmark comes in at 2%, but the labs expect this to dramatically improve. Anthropic has line of sight to hit 80% on FrontierMath over the medium term.
Post-training: a new scaling domain
Pre-training tends to be the focus of debates regarding scaling laws because it is easy to understand, but it is only one part of the AI lifecycle. Once a model is pre-trained, there is still considerable work to be done on getting it ready for use. The objective during pre-training is, very narrowly, to “predict the next token correctly.” Accomplishing this still leaves us well short of the end-goal of LLM development which is to “answer user prompts” or “do a task.”
We will give an overview of Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Synthetic Data, before diving into how OpenAI’s o1 Pro model works and was created.
Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) is the most well-known type of post-training. A curated dataset of input and output pairs is shown to the model, with the “demonstration data” covering a specific domain (e.g. code, math, instruction following, etc.). Unlike with pre-training, the quality of fine-tuning data is much more important here than quantity. Given the lower quantity of data, it is also much less compute-intensive.
The magic of GPT originally was using heavily curated samples of human generated and labeled data from firms like Scale AI. As time goes on, however, human generated data is struggling to scale.
Synthetic Data’s Integral Role in Post-training
The most important challenge within SFT is constructing sufficiently large, high-quality datasets in the desired domains. This allows the model to operate better in specific areas like code, math, and reasoning, and, due to transfer learning, has spillover effects that make the model better in other domains too. Obviously, models with strong math and coding skills are better at general reasoning, but this extends to other areas – models trained on Chinese and English are better at English than those trained on English alone. Synthetic data has opened a dimension where high-quality data can be generated using a controlled, highly scalable methodology to fine-tune models on any subject matter for which there exists a will to create it.
The heavy use of synthetic data also incentivizes a push toward better models. For example, OpenAI had GPT-4 before anyone else and could use it to generate better synthetic data sets than other model providers – until other providers had a model to match. One of the primary reasons that many models in open source and at Chinese labs caught up so fast was that they were trained on synthetic data from GPT-4.
The better the underlying model is at judging tasks, the better the dataset for training. Inherent in this are scaling laws of their own. This is how we got the “new Claude 3.5 Sonnet”. Anthropic finished training Claude 3.5 Opus and it performed well, with it scaling appropriately (ignore the scaling deniers who claim otherwise – this is FUD).
Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly, alongside user data. Inference costs did not change drastically, but the model’s performance did. Why release 3.5 Opus when, on a cost basis, it does not make economic sense to do so, relative to releasing a 3.5 Sonnet with further post-training from said 3.5 Opus?
With more synthetic data comes better models. Better models provide better synthetic data and act as better judges for filtering or scoring preferences. Inherent in the use of synthetic data are many smaller scaling laws that, collectively, push toward developing better models faster.
Synthetic Data Examples
Rejection Sampling
An example of an area where synthetic data is heavily used is in generating datasets of code. This is typically done through designating a variety of programming tasks or prompts as seeds and prompting a model to generate questions relating to those tasks.
The model is then asked to generate a set of potential solutions. Solutions which pass the corresponding tests, or can execute correctly, are appended to the training dataset, effectively filtering out poor-quality samples in a process referred to as Rejection Sampling. Rejection Sampling is an instrumental part of the synthetic data generation process as it ensures that the dataset is of a sufficient quality to be valuable during Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, as a result, many of the generated tokens are thrown out – synthetic data generation takes a lot of compute.
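A minimal sketch of what rejection sampling for code data can look like in practice is shown below. The `generate_solutions` and `run_tests` hooks are hypothetical stand-ins for a model API and a sandboxed test harness; the point is simply that only candidates which execute correctly survive into the fine-tuning dataset.

```python
# Sketch of rejection sampling for synthetic code data: sample many candidate
# solutions per prompt, keep only those that pass their tests, discard the rest.
# `generate_solutions` and `run_tests` are hypothetical stand-ins for a model
# API call and a sandboxed test harness.
from typing import Callable, Dict, List

def rejection_sample(
    prompts: List[str],
    generate_solutions: Callable[[str, int], List[str]],  # (prompt, n) -> candidates
    run_tests: Callable[[str, str], bool],                 # (prompt, candidate) -> pass?
    n_samples: int = 16,
) -> List[Dict[str, str]]:
    dataset = []
    for prompt in prompts:
        candidates = generate_solutions(prompt, n_samples)
        # Keep only candidates that execute correctly; most generated tokens are
        # thrown away, which is why synthetic data generation is so compute-hungry.
        passing = [c for c in candidates if run_tests(prompt, c)]
        dataset.extend({"prompt": prompt, "completion": c} for c in passing)
    return dataset
```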
This methodology for building a synthetic dataset for use in fine-tuning has been adopted by many of the large AI labs, and it is used for fine-tuning Gemini, GPT, Llama, and Claude.
But Rejection Sampling can be more complicated than it appears. In Llama’s case, the model was prompted to revise its answer if the initial response was incorrect, and the model got the answer right on its second try 20% of the time. In another illustration of the usefulness of synthetic data, the Meta team translated Python code into PHP, ensuring quality via syntax parsing and execution, and fed this additional data into the SFT data set to account for the lack of public PHP code. This effectively demonstrates synthetic data being used to generate useful data reliably and predictably for underrepresented areas.

Judgement by Model
Another trend is to use another LLM as a judge. Meta used another, earlier version of Llama 3 as the rejection sampler, acting as the judge for code that was not strictly executable (i.e. pseudocode) and grading the output ‘pass’ or ‘fail’ on code correctness and style. In some instances, rejection sampling is done via a variety of models running concurrently to grade outputs. Although on net this is cheaper than human data, it is difficult to pull off such a chorus of automated judges.
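Where no test harness exists (for pseudocode, style, or open-ended answers), a grader model can stand in for execution. The sketch below assumes hypothetical `judge` callables wrapping one or more grader LLMs that return “pass” or “fail”; it illustrates the pattern rather than any lab’s actual pipeline.

```python
# Sketch of model-as-judge filtering: judge LLMs grade each candidate on
# correctness and style, and only candidates with enough "pass" votes survive.
# `judge_fns` are hypothetical wrappers around one or more grader models.
from typing import Callable, List

def judge_filter(
    prompt: str,
    candidates: List[str],
    judge_fns: List[Callable[[str, str], str]],  # (prompt, candidate) -> "pass" | "fail"
    min_votes: int = 2,
) -> List[str]:
    kept = []
    for cand in candidates:
        votes = sum(1 for judge in judge_fns if judge(prompt, cand) == "pass")
        if votes >= min_votes:  # require agreement among the chorus of judges
            kept.append(cand)
    return kept
```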
What is important to note here is that, across all methods of rejection sampling, code or not, the better the “judge” model, the higher the quality of the resulting data set. This feedback loop, while only just introduced into production at Meta this year, had been in use by Anthropic and OpenAI for a year or two prior to that.
Long Context Datasets
Another example of synthetic data use is for long context lengths. Models are pre-trained with capped context lengths, partly because most of the data is of a short context length already, but also because longer sequence lengths mean a larger KV Cache to keep in memory – making the deployment of training infrastructure even harder than it already is. Models such as Gemini, GPT, and Claude are originally pre-trained with lower sequence lengths and then subsequently post-trained to add longer context lengths.
It is generally difficult for humans to annotate long-context examples in SFT data, as there is a limited supply of annotators with a sufficient talent level to provide quality annotation. Reading lengthy pieces of text is time-consuming and tedious. Synthetic data has emerged as a useful, reliable way to ameliorate this problem.
One method to generate long context-length synthetic data is to use a model from an earlier checkpoint and have it summarize large pieces of text chunked into the size of its (currently small) context length. These summaries, or in other occasions, chats including simulated questions and answers, can then be used to help generate a body of synthetic data that can then be used in SFT.
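A minimal sketch of that chunk-and-summarize pattern is below, assuming a hypothetical `summarize` wrapper around an earlier, short-context checkpoint: the stitched summary is paired with the full document to form one long-context SFT example.

```python
# Sketch of long-context synthetic data generation: split a long document into
# chunks that fit an earlier, short-context checkpoint, summarize each chunk,
# then pair the full document with the stitched summary as an SFT example.
# `summarize` is a hypothetical wrapper around the short-context model.
from typing import Callable, Dict, List

def make_long_context_example(
    document: str,
    summarize: Callable[[str], str],
    chunk_chars: int = 8000,  # crude proxy for the short model's context window
) -> Dict[str, str]:
    chunks: List[str] = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partial_summaries = [summarize(chunk) for chunk in chunks]
    target = summarize("\n".join(partial_summaries))  # summary of summaries
    return {"prompt": f"Summarize the following document:\n{document}", "completion": target}
```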
Other examples include generating synthetic data so that models pass evals such as needle-in-a-haystack benchmarks. There are many more complex types of synthetic data used to train models to generalize and understand data in various parts of the extended context length.
Reinforcement Learning
Reinforcement Learning (RL) is a leading method for alignment and model improvements.
Reinforcement Learning (RL) is when an Agent (for example, a Large Language Model) is taught to perform specific actions and seek certain outcomes by maximizing rewards that are given either for those specific actions or for achieving a given outcome. There are two axes to think about when it comes to RL: the source of the feedback, and how feedback is incorporated. The former is about how to source the signals, and the latter is about how to use those signals to update the model.
With reinforcement learning – the Large Language Model we are trying to optimize plays the role of an agent that can take a set of actions given an input or state and receive different rewards depending on the action it takes. We optimize this agent’s behavior with respect to our reinforcement learning goals by having the Agent learn the actions that can maximize the expected cumulative reward.
There are a few main approaches to incorporating feedback and determining the action that an Agent takes: Value-based methods, Policy-based methods such as Direct Preference Optimization (DPO) and Trust Region Policy Optimization (TRPO), and Actor-Critic methods that combine policy- and value-based methods. Proximal Policy Optimization (PPO) is a prominent example of an actor-critic method, and more complex variations of it are the primary RL method at all major AI labs.
Value-based methods, in contrast, determine the value of getting to a given state and define values for each possible state. Each state is assigned a value based on the expected discounted return the agent can obtain if it starts in that state, and the agent then determines its action at each step based on the value of each action available to it. Historically, value-based methods were more commonly used in RL, but modern applications are much better served by Policy-based methods.
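In symbols, the value a value-based method assigns to a state is the expected discounted return from that state under the current policy, with the discount factor γ (between 0 and 1) weighting near-term rewards more heavily than distant ones:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\;\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0}=s\right], \qquad 0 \le \gamma < 1
```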

In Policy-based methods, the Agent is driven by a policy function that identifies a set of actions that can be taken for a given state and assigns a probability distribution over those set of actions. Actions to be performed at a given state can be deterministic, meaning that being in each state will always lead to the same action, or stochastic, where a probability distribution instead describes potential actions at that given state. The policy function is then trained to direct the Agent towards actions that maximize expected reward.

When employing policy-based methods during RL, a model can either evaluate the final result of a given task to determine the reward in the case of an Outcome Reward Model (ORM) or it can determine the reward by evaluating each individual step in a given process in the case of a Process Reward Model (PRM). Using a PRM can be particularly helpful when training reasoning models as while an ORM can detect that a chain of reasoning led to an incorrect answer, a PRM can tell you which step of the chain had the mistake.
Because the policy function directs what the agent does at any given step – it is also an especially useful framework for optimizing the behavior of agents/models at intermediate steps of an inference process.
Outcome Reward Models and Process Reward Models are often used in Proximal Policy Optimization (PPO), an algorithm commonly used in reinforcement learning that iteratively improves a policy model to maximize cumulative rewards and optimize an LLM towards a given objective. Using ORMs and PRMs with PPO is particularly important when training multi-step reasoning models that are currently a key focus in the community. We will describe how this is done for o1 Pro below.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) can be used for both Alignment and Fine-Tuning, but it is much better suited to, and used more often for, the Reinforcement Learning carried out during Alignment.
For PPO, Policy refers to the abovementioned use of a policy model to dictate the actions of an agent or model, Proximal refers to the algorithm’s methodology of only gradually updating the policy, and Optimization refers to the process of iteratively improving the policy by providing feedback from a reward model to improve the policy model, thereby optimizing the expected cumulative reward.
We have mainly discussed Policy-based methods above, but PPO incorporates both Policy-based methods and Value-based methods in its implementation. As such, PPO can be said to use the Actor Critic method. An Actor is driven by a policy-based model that determines which action to take for a given state (i.e. Policy-based method) and there is a Critic that evaluates the action taken according to a value function (Value-based method). The Actor and Critic thus work together in an iterative fashion.
Maximizing the PPO objective function will therefore push the policy in the direction of favoring actions that correspond to a higher value of the Advantage Function.
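For reference, the Advantage Function measures how much better a given action is than the policy’s average action in that state, and PPO’s standard clipped surrogate objective (as introduced in the original PPO paper) keeps updates “proximal” by clipping the probability ratio between the new and old policies:

```latex
A(s, a) = Q(s, a) - V(s), \qquad
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]
```

Here \hat{A}_t is an estimate of the advantage at step t, and ε (often around 0.2) bounds how far a single update can move the policy.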
RLHF
Reinforcement Learning with Human Feedback (RLHF) has been a primary technique for aligning LLMs and making them useful, and was a leading factor in ChatGPT’s explosive growth. It typically utilizes policy-based learning, in which a reward model that learns from human feedback is used to update the policy that drives how a model behaves.
With RLHF, human annotators review a sample of responses to prompts and rank their preference for one response over the other. The goal here is to amass significant data on which responses humans would prefer. This preference data is then used to train a reward model, which attempts to guess the average labeler’s preference for a given output from a model. In other words, the trained reward model acts as the Critic in the Actor-Critic framework.
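The standard way to turn such pairwise preference data into a reward model is a Bradley-Terry-style loss, in which the reward model r_φ is trained so that the preferred response y_w scores higher than the rejected response y_l for the same prompt x:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_{w},\, y_{l}) \sim \mathcal{D}}
\Big[\log \sigma\big(r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l})\big)\Big]
```

Here σ is the logistic function; maximizing the margin between chosen and rejected responses is what lets the reward model later act as the Critic.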
The trained reward model evaluates this action against the human preferences it is trained on, and how much better or worse the action is compared to the average action. The feedback from this reward model then acts to align the Actor model, ensuring that it takes actions (generates tokens) in accordance with the desired policy.
As discussed above, PPO is used to iteratively update the policy function of the language model, allowing for stable learning while preventing drastic changes in policy. Large-scale PPO at AI labs utilizes multiple weighted reward models for specific aspects like helpfulness, truthfulness, and safety.
Broadly speaking, RLHF allows models to perform better on tasks that real end users care about and have provided preference data on. Meta’s Llama 2-Chat achieved much better performance on factors such as helpfulness and harmlessness after rounds of RLHF. The paper demonstrates that the additional compute used to scale models during RL delivers clear results. The potential benefits from using synthetic data as opposed to human-generated feedback, and relying more heavily on AI for feedback, can also justify the use of even more compute.

However, there are significant limitations to RLHF. First, carrying out the entire lifecycle of RLHF can be very slow, as one must take time to expose the various generated responses to human raters, usually either by having an AI company insert such prompts for feedback while serving its models or by using human labelers.
Even with a large userbase, collecting a large amount of preference data is difficult and expensive – Meta spent $10-20 million on preference data for Llama 2, more than on the compute time itself.
RLHF is inherently difficult to scale, especially in areas where there is not a huge amount of existing data. Human annotation is also expensive. This is why many AI companies are pivoting towards Reinforcement Learning with AI Feedback (RLAIF) during training.
The larger AI companies have a clear advantage here. Claude, Gemini, and ChatGPT all ask users to provide feedback on responses from the models they host. For instance, on occasion, ChatGPT will explicitly ask you to select which of two responses you prefer. This effectively gathers the best source of feedback (directly from users) for free. Because OpenAI has a huge customer base of more than 300M users, it can gather a lot of feedback for improving models.
Providers with fewer users or that operate a platform that is less conducive towards users providing feedback need to resort to other methods such as DPO instead of PPO. Direct Preference Optimization (DPO) is another technique often discussed with RLHF, though most do not technically categorize it as a Reinforcement Learning technique.
DPO entirely forgoes training a reward model and instead uses optimization to directly adjust the policy to maximize the probability that the policy drives the model to produce the preferred outputs as based on the human preference data. The optimization works by using a binary cross-entropy loss that compares probability ratios between the current model and a reference model (generally the same model before fine tuning). DPO ensures the model learns to favor preferred responses while staying close to the reference model’s behavior.
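Written out, the DPO objective is exactly that binary cross-entropy over scaled log-probability ratios against the frozen reference model, where β controls how far the policy may drift from the reference:

```latex
\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) =
-\,\mathbb{E}_{(x,\, y_{w},\, y_{l}) \sim \mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\text{ref}}(y_{w} \mid x)}
\;-\;
\beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\text{ref}}(y_{l} \mid x)}
\right)\right]
```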
The simpler approach used in DPO can achieve comparable or better results than RLHF using a full reward model, while being less prone to crashes and easier to implement. A prominent example of this approach’s merits is that Llama 3 did not undergo RLHF and went through DPO. Meta found that in the case of Llama 3, DPO was more effective and stable than PPO and used less compute. However – using DPO means that the quality of the preference data set is paramount, meriting extra care and attention on how this data is gathered and processed.

Meta eventually discovered the lesson the other labs already knew: DPO does not scale as well as PPO, and they must turn to RLAIF to continue to improve their post-training. This was shown in the release of the newest Llama 3.3.
RLAIF
Instead of relying on human feedback to train a reward model, Reinforcement Learning with AI Feedback (RLAIF) replaces human feedback with another model. The reward model is trained based on AI-generated feedback – usually some form of scoring model or algorithm that will evaluate given completions and determine the reward accordingly.

Source: RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Broadly, not much else is inherently different from RLHF, but RLAIF makes a dramatic difference. Annotations can be made quickly, and prompts can be generated synthetically to prompt the model undergoing reinforcement learning in areas where additional data or training is needed.
In addition to providing feedback on typical math, science and general knowledge tasks, RLAIF also means that feedback to tackle more nuanced circumstances like ethical dilemmas, cultural norms, and social interactions can be generated quickly and ranked by another LLM. This enables more coverage in terms of topics to align the model over and also allows model trainers to quickly ramp training on those topics without waiting to gather human feedback.
A unique use of RLAIF is Anthropic’s constitutional AI. Constitutional AI works in two stages. In the first stage, a base model critiques and revises its own outputs in accordance with a set of constitutional principles written by humans. These initial responses that are evaluated can be toxic or unhelpful. The responses are then revised continuously using a variety of principles from the constitution. This creates a data set of revision and prompt pairs that are then used to fine tune a model through supervised fine-tuning (SFT).
The second stage of the process for Constitutional AI is similar to RLHF, but without the human preference data providing feedback regarding harmlessness. The AI evaluates pairs of responses from the previous stage’s model in accordance with constitutional principles which in effect are like multiple reward models. AI-generated preferences for harmlessness are combined with human feedback data for helpfulness to train a hybrid preference model (hybrid meaning it includes human data). Finally, the model from the first stage is fine-tuned using RL with this preference model as the reward signal.
The most notable observation of this approach is that it’s scalable across many different domains – if there is a model that is good at ranking responses based on which one is more scientifically accurate in addition to being able to identify harmlessness, the model can be used to optimize for scientifically accurate responses as well.

Source: Anthropic, Constitutional AI: Harmlessness from AI Feedback
RL is also a key part of developing reasoning models that use Chain of Thought (CoT).
Reasoning Models and Chain of Thought (CoT)
Math is the fundamental logic and reasoning of engineering, construction, and system design. Math stands out as a focus discipline for fine tuning models as model trainers lack sufficiently complex prompts at advanced difficulty levels. One way to overcome this problem is to pay highly skilled humans to craft prompts or generate them in house. Solving Math problems effectively through reasoning requires a clearly articulated and correct chain of thought that the model can learn from.
While some math capabilities can improve through tools like code interpreter access, allowing models to generate and execute code in languages like Python which can assist solving some math problems, code is not enough to solve many problems – particularly the most difficult math problems. A huge amount of effort is currently targeted at training reasoning models to solve complex math problems.
Models can be prompted to generate chains of thought out of the box, but results can be unreliable, since an error in one step of the chain compounds into a wrong final solution (though o1 Pro has multiple safeguards to prevent this). Another challenge is that even the latest models can hallucinate and fabricate information when there is uncertainty, which can easily compound an error in one of the reasoning steps.
A model that has been aligned to conduct reasoning using Chain of Thought can address many of the challenges above. In this approach, reinforcement learning is used to align the model’s behavior towards this Chain of Thought approach.
This process applies reinforcement learning to align a base LLM’s behavior towards the Chain of Thought approach and improve its accuracy using several other separate models and LLMs.
The first independent LLM to discuss is the Generator, which is trained to produce solutions that are reasoned out across multiple steps. The generator is typically separate from the base LLM as it is fine-tuned specifically for the task of generating these reasoning steps while the base LLM is usually fine-tuned for general tasks.
Second is the Verifier Model, which is responsible for evaluating whether the solutions produced by the Generator are correct and provides a corresponding reward.
Verifier Models can be trained using human annotation, automatic process annotation, or automatic verifiers. In OpenAI’s paper, Let’s Verify Step by Step, researchers introduced the PRM800K process supervision dataset, in which human data-labelers annotated 800,000 process steps that form part of 75,000 solutions to 12,000 questions from the MATH dataset, output by a Generator as discussed in the paper.

Source: Let’s Verify Step by Step
The cost of gathering these annotations is not trivial. In the original MATH paper, a few university students who were given an hour to complete 20 problems scored between 40% and 90%, with the 90% scorer being a three-time IMO gold medalist. The OpenAI paper cited cost as a reason why it would be impractical to build a human-annotated PRM-oriented dataset large enough to match the order-of-magnitude larger ORM-oriented dataset for apples-to-apples comparisons.
The alternatives are to use automatic process annotation, or to find automatic verifiers.
Automatic verifiers are a system or model that can ideally quickly and easily verify whether the solution to a given problem is correct. For code, this could simply be actual execution of the code to test that it produces the desired results, while for math it could be evaluating a given function or using a prover like LEAN to check for correctness. However, using automatic verifiers might not be as “automatic” as it sounds – creating dependencies on external systems can add overhead that detracts from good training performance, and automatic verifiers can sometimes take time to run.
Automatic process annotation can generate these step-by-step labels instead. Rather than having a human evaluate an intermediate step, the Completer is used to create multiple different paths of reasoning steps. The Math-Shepherd paper uses automatic process annotation – generating a number of paths, then evaluating these paths by either marking a step as a good reasoning step if it leads to a correct final answer (i.e. Hard Estimation) or by assigning a score based on the frequency with which the step leads to the correct solution (i.e. Soft Estimation).
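Concretely, as we read the Math-Shepherd setup, if N completions are rolled out from an intermediate step s_i, a_j denotes the final answer of the j-th rollout, and a* is the gold answer, the two labeling schemes can be written as:

```latex
y_{s_i}^{\text{HE}} =
\begin{cases}
1 & \exists\, j:\; a_{j} = a^{*} \\
0 & \text{otherwise}
\end{cases}
\qquad\qquad
y_{s_i}^{\text{SE}} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\!\left[\,a_{j} = a^{*}\,\right]
```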

Source: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
The fourth model is the Reward Model, which is trained from the process annotation labels.
To recap our earlier explanation, there are two types of reward models: ones which provide a reward based on the outcome, an Outcome Reward Model (ORM), and ones which provide a reward based on the process, Process Reward Models (PRM). ORMs typically work by ranking a variety of different answers that a model provides and then selecting the highest ranked one. In contrast, PRMs evaluate and assign a score to each step of the reasoning chain of thought and provide a reward based on this score and for this reason are generally preferred when training Chain of Thought models. The Let’s Verify Step by Step paper showcased stronger results for PRMs over ORMs. With that said, OpenAI relies more heavily on ORMs still.

Source: Let’s Verify Step by Step
In Math-Shepherd, reinforcement learning via step-by-step Proximal Policy Optimization (PPO) is used to reinforce the final LLM and teach it the desired reasoning chain-of-thought behavior.
Inference-time Scaling
The release of OpenAI o1 preview has brought the industry’s attention to the rise of a new scaling law – the greater the test-time compute (i.e. compute at inference time), the better the answer, and efforts to exploit this scaling dimension are at a major inflection point.
When presented with queries, whether for simple or difficult questions, traditional LLMs will generate tokens continuously, without tracking intermediate steps, until they think they have reached the answer.
In contrast, as explained above, Reasoning Models break the response into a discrete number of reasoning steps, called a Chain of Thought, before delivering a response to the user. Reasoning models can backtrack if they reach an illogical conclusion, recognizing that a mistake has been made or that a certain approach has reached a dead end, and revisiting earlier steps to put the chain of reasoning back on the right path.
There are two profound implications from the release of reasoning models: first, a meaningful improvement in model performance on challenging evaluations such as those oriented around coding, math, and science; and second, the realization that this scaling of model performance with test-time compute extends robustly to LLMs.

Test-time scaling is not a new concept. In board games and poker, the idea of expanding test-time compute has been around for some time. For example, AlphaGo, which is DeepMind’s system for playing Go, uses Monte Carlo Tree Search during test time to decide which moves to play. If stripped of its capabilities of searching during inference, it drops in Elo from ~5,200 to 3,000 (top humans are around ~3,800). Inference time compute allowed for superhuman achievements in Go.
With greater compute, reasoning models can think through more steps and increase the likelihood of reaching the right answer. Today, reasoning capabilities are bottlenecked by inference system capabilities as the long context lengths required for reasoning models significantly increase memory and compute requirements.
This means operators of inference systems for reasoning models are limiting the length of reasoning chains of thought to keep context lengths reasonable and prices down, so as to serve an economical number of users at a reasonable token-to-token latency. It follows that today’s reasoning models are performing with one arm tied behind their back and could scale very significantly in performance as more capable inference systems such as the GB200 NVL72 come to market. Once economical, allowing o1 to adjust the length of its reasoning chain and the compute employed will be a technique to harness test-time compute scaling.

As we see from evals and from the graph further down below, with one attempt, GPT-4o beats other models. The most naïve way to scale test-time compute is to simply increase the number of samples being run concurrently, effectively channeling the infinite monkey theorem. The paper Large Language Monkeys demonstrates that simple repeated sampling can scale inference-time compute and yield much better results.

Source: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
This is arguably one of the most basic ways of doing search. Generating more samples allows for greater coverage, which is defined as any of the samples getting the correct answer (i.e. pass@k). One could argue that simply enabling these smaller models to think over a problem many times may be more accurate and cheaper, though we will need an effective verifier to identify when we have successfully generated the metaphorical complete works of Shakespeare.
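Coverage here is just pass@k, and the standard unbiased estimator (originally introduced for code models and reused in the repeated-sampling literature) is computed per problem from n samples, of which c are correct:

```latex
\text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]
```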

“It was the best of times, it was the worst of times”
Source: The Simpsons
Scaling Inference Compute Through Search
Search is another dimension of scaling that goes unharnessed with OpenAI o1 but is utilized in o1 Pro. o1 does not evaluate multiple paths of reasoning during test-time (i.e. during inference) or conduct any search at all. Sasha Rush’s video on Speculations on Test-Time Scaling (o1) provides a useful discussion and illustration of Search and other topics related to reasoning models.
Self-Consistency / Majority Vote is one such search methodology in which we simply run the prompt through the model multiple times, thereby generating multiple responses, and then we pick the correct answer by choosing the response that appears most often among a given number of samples.
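A minimal sketch of self-consistency, assuming a hypothetical `sample_answer` hook that runs the prompt through the model once (with its chain of thought) and extracts the final answer:

```python
# Sketch of self-consistency / majority vote: sample the model n times and
# return the most common final answer. `sample_answer` is a hypothetical hook
# that performs one sampled generation and extracts the final answer string.
from collections import Counter
from typing import Callable

def majority_vote(prompt: str, sample_answer: Callable[[str], str], n: int = 16) -> str:
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```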

Best-of-N Sampling is another idea, in which we generate N solutions for a specific prompt and then use a verifier model to identify the chains of thought that led to the correct answer. This method is generally restricted to areas amenable to verification (e.g., sudoku but not essays) and is limited by the effectiveness of the verifier model.
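Best-of-N looks similar in a sketch, except that a verifier or reward model scores each candidate and the highest-scoring one wins; `sample_solution` and `verifier_score` are hypothetical hooks around the generator and the verifier.

```python
# Sketch of Best-of-N sampling: generate N candidate solutions and let a
# verifier / reward model pick the best one. `sample_solution` and
# `verifier_score` are hypothetical hooks around the generator and verifier.
from typing import Callable

def best_of_n(
    prompt: str,
    sample_solution: Callable[[str], str],
    verifier_score: Callable[[str, str], float],  # (prompt, solution) -> score
    n: int = 8,
) -> str:
    candidates = [sample_solution(prompt) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_score(prompt, sol))
```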

Monte Carlo roll-outs are a technique that builds on Best-of-N. Here we evaluate a given intermediate step by generating multiple paths to complete the chain-of-thought starting from that intermediate step. This evaluation can help us decide whether to proceed with that step or pivot to a prospective alternative step, improving our overall chain of thought.
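A minimal sketch of how Monte Carlo roll-outs can score an intermediate step, with hypothetical `continue_chain` and `is_correct` stand-ins, could look like this:

```python
import random

def continue_chain(prefix: list[str]) -> str:
    """Stand-in: roll out the rest of the chain of thought from a partial prefix
    and return the final answer it reaches."""
    return random.choice(["479", "481"])

def is_correct(answer: str) -> bool:
    return answer == "479"

def value_of_step(prefix: list[str], candidate_step: str, rollouts: int = 32) -> float:
    """Estimate how promising a candidate intermediate step is by completing the
    chain of thought many times from that step and measuring the success rate."""
    wins = sum(is_correct(continue_chain(prefix + [candidate_step])) for _ in range(rollouts))
    return wins / rollouts

steps = ["try (1 + 2) * 3 ...", "try 4 * (5 + 6) ..."]
scores = {step: value_of_step(["start"], step) for step in steps}
print(max(scores, key=scores.get), scores)
```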
Now that we have discussed the basis of RL, Synthetic Data, Chain-of-Thought, Inference Time Compute and other concepts, let us go through what OpenAI has done with o1 and o1 Pro both during training and during inference. The construction of o1 is unique and doesn’t mirror the papers above. We will also discuss the tokenomics of inference time compute including cost, KV Cache scaling, batching, and more. Lastly, we will explain what OpenAI is doing next with Orion and why the narrative around it being a failure isn’t accurate.
o1: Navigating Strawberry Fields
o1 at inference time uses a Chain of Thought approach, breaking reasoning into multiple discrete steps. The o1 model can plan its various reasoning steps, evaluate intermediate steps, and backtrack if the step is incorrect or reaches a dead end.
The community has proposed many theories for how o1 works, such as the idea that it explores a tree of potential reasoning paths or chains of thought at inference time, but this is not true. OpenAI o1 follows only a single chain of thought within this tree before arriving at an answer. OpenAI o1 doesn't use search at test-time, forgoing exploration of a tree of potential reasoning paths during inference. This means that it can only utilize a pass@1 approach at inference. o1 Pro does use self-consistency / majority vote – more on that later in the tokenomics section.
There are a few theories on how o1 generates its singular chain of thought. One widely held theory is that it uses a Process Reward Model during reinforcement learning to drive reasoning steps, with similar reward models used to switch between verifying and generating. By using the same model to act as both the generator and the verifier, the model can effectively switch between the two and continuously iterate on its thinking.
Backtracking: Emergent or Trained?
As mentioned above, another core ability of o1 is its ability to self-correct and backtrack on its single chain of thought.
There is one thing to highlight: these abilities emerged from increasing inference-time compute. They did not appear because they were specifically engineered to do so, but rather as a consequence of scaling inference time compute.
There are some caveats, however, to the idea that better results will always come from the model thinking for longer. One is with respect to the type of questions that benefit from a longer time spent thinking. For example, a question such as “What is the capital of x” does not benefit from increasingly longer time spent thinking, but difficult math or coding questions do. It is also easier to verify math and coding problems than English essays. For now, it remains ambiguous exactly how more test-time compute is brought to bear – all we know is that there is some sort of setting on OpenAI's backend that they can control.
As we can see from the subject-matter wise win-rate graph below, o1 and reasoning models in general do better vs non-reasoning models in subjects that are easier to verify relative to how hard answers are to generate, and worse in areas where it is both hard to verify and generate.

This is because OpenAI’s o1 training process heavily relies on functional verifiers to provide feedback for the model during training.
o1 ’Berry Training Infrastructure
OpenAI generates vast amounts of data for training o1. The entire system for training reasoning models is known as berry training. This data is generated in a Monte Carlo tree with many concurrent rollouts. The model will generate many different variations and branch off at many different points, guided by the PRM, for each of their current ~10 million problems. These problems each have thousands of different answer “trajectories” generated. Some trajectories may have shared prefixes due to branching off partway through the answer. The reason it is called a trajectory is because, in isolation, it is a chain of thought working towards an answer. Each of these trajectories contains thousands of tokens. This adds up to hundreds of trillions of tokens generated for training a strawberry model such as o1.
These trajectories are then pruned using functional verifiers and ORMs (Outcome Reward Models). The PRM is not efficient and the majority of data selection is done through the ORM, so there are many concurrent rollouts per problem that get completed and don't get pruned until the end. If PRMs were good, then the ratio of generated trajectories to good ones worth retaining would be much better, but unfortunately the ORM dominates and culls the majority of data. These functional verifiers differ in many ways, but they can be thought of as independent “sandboxes” that check the math or run code to verify the generated data for correctness.
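The details of OpenAI's verifier stack are not public, but an outcome-style pruning loop with a sandboxed functional check can be sketched roughly as follows; the sandbox here is deliberately simplistic and purely illustrative:

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(candidate_code: str, test: str, timeout_s: float = 5.0) -> bool:
    """Toy 'functional verifier': execute generated code plus a test in a separate
    Python process and treat a zero exit code as a pass. A real system would use
    much stronger isolation (containers, resource limits, syscall filtering, etc.)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def prune_trajectories(trajectories: list[dict], test: str) -> list[dict]:
    """Outcome-style pruning: keep only trajectories whose final program passes the check."""
    return [t for t in trajectories if run_in_sandbox(t["final_code"], test)]

trajectories = [
    {"id": 0, "final_code": "def add(a, b):\n    return a + b"},
    {"id": 1, "final_code": "def add(a, b):\n    return a - b"},  # wrong answer, gets culled
]
print([t["id"] for t in prune_trajectories(trajectories, "assert add(2, 3) == 5")])
```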
Running all these models at once and properly parallelizing them is an insanely difficult systems and infrastructure problem. For example, all the different models must run on various GPUs, and the results must be routed to the appropriate next stage in the pipeline, while also updating multiple models' weights and making sure the workload is load balanced appropriately.
Furthermore, the functional verifier “sandboxes” often don't run well on GPUs, which means they often get offloaded to the CPU. One interesting dynamic here is that while standard Nvidia systems today have 8 GPUs and 2 x86 CPUs, i.e. a 4:1 ratio, Nvidia's next generation GPU system, GB200 NVL72, has 72 GPUs and 36 CPUs, a 2:1 ratio. On the flip side, Anthropic's next generation system from Amazon, codenamed Project Rainier, has 16 Trainium2 chips but only 2 CPUs, an 8:1 ratio. There is a vast difference in CPU to GPU resources across next generation training systems, which could lead to OpenAI being able to run more complex functional verifiers, while Anthropic has a raw cost per FLOP and cost per memory bandwidth/capacity advantage. It's possible Anthropic's fewer CPU resources may even make it more difficult to run comparably complex functional verification systems.
Incredible Amounts of Forward Passes During Training
Now one can start to understand that reasoning training is extremely compute intensive. You are already generating hundreds of trillions of tokens for the hundreds of billions of trajectories for the 10 million problems you are training reasoning on. Imagine what happens as the problem set continues to scale and more domains are tackled. The amount of data being generated will be absurd, and it doesn't entirely overlap with customer requests, so this is more inference-generated tokens than your entire pre-training dataset contains.
Furthermore, due to the way PPO and PRMs work, you must run multiple forward passes (running models) per backwards pass (updating the model). This is because, in addition to the absurdly compute-intensive generator models, you also have policy models, multiple reward models, and other model-based verifiers all running to validate every backwards pass during post-training.
In many cases, these models will run multiple times per backwards pass depending on how much data needs to be pruned or rejected. This leads to an extremely high ratio of forward passes versus backwards passes for post-training whereas in pre-training the ratio is 1 to 1.
This changes the infrastructure requirements heavily for training. For example, having a single large fully connected scale out fabric may not be as necessary as it was in the past. One big positive is that training across geographically distributed datacenters is easier because they can focus purely on data generation and pruning rather than updating the model.
Post-Training FLOPS Exceed Pre-Training
Prior reasoning model post-training runs took nearly as much compute as pre-training, and in many cases, the current ones already exceed pre-training FLOPS. This is because post-training usually involves using multiple copies of the largest / best possible model, at least on the forward pass, for the generators, reward models, policy models, and the various verifiers.
Take OpenAI's next model for example. They are now training a model that is between GPT-4o and Orion in pre-training scale. They will pretrain a base model, then make two models from there. One will be a traditional chat model, and the other a true reasoning model. This transformation from base model to reasoning model will cost more post-training FLOPs than even pre-training did. This is because Orion is going to be used for generating much of the ’berry training data, and it is also heavily used in the various verifier and reward models as well.
Pre-training will keep scaling up due to new architectures, the need to swallow up growing amounts of synthetic data, and video data. More importantly, the advent of reasoning training means that there needs to be even more training compute for post training too. Compute scaling laws for training are alive and well.
Rapid Iteration as Another Form of Scaling
The breakneck pace of the industry incentivizes iteration speed and shorter training times. Algorithms and data are advancing at a pace that allows the physical compute required for a given model to decrease by a third each year, while other architectural advancements allow for better models to be developed. As such, training runs rarely exceed ~3 months, and most major pre-training runs are usually 1-2 months by the time of release.
OpenAI's Orion broke these norms and trained for longer than 3 months. With the feedback cycle of reasoning models, however, this flips. They are now focusing on much faster feedback loops of training runs alongside ever larger clusters to keep iterating on models. Huge runs like Orion are still needed to help train smaller models, but until Blackwell, models of this size are not economical to serve.
o1 Inference Architecture: Tokenomics
Even small reasoning models are going to see a huge serving efficiency gain with Blackwell. Despite the fact that GPT-4o and o1 are the same architecture and size, the pricing difference is 6x per token. The same applies to GPT-4o mini and o1 mini, where there is an even larger pricing difference of 20x per token. Part of this is OpenAI charging more margin because they have unique capabilities, especially on o1 mini, but the major reason is simply that the cost is much higher.
We can conduct a simple experiment to quickly illustrate the vast difference in token pricing for reasoning models from a first principles basis. We use the first example logical reasoning prompt featured in the recent Qwen QwQ release blog post and feed it into a few models:
Please add a pair of parentheses to the incorrect equation: 1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9 = 479, to make the equation true.

As we can see from the Qwen release blog, this problem required ~2166 words to generate an answer.
As expected, reasoning models such as o1-preview and o1-mini generate far more output tokens than their non-reasoning counterparts of similar size. Note that reasoning tokens are included within the chargeable output tokens, even if they are never displayed or provided to the operator. Compounded with the notably high cost per token for reasoning models, the cost per query is 24x higher in the case of o1-mini and 57x higher for o1-preview.

This cost per query difference is shocking but the important part to think about is the sequence length and KVCache. In the chart below, we illustrate how the larger sequence lengths necessitate using smaller batch sizes when targeting a reasonable interactivity of 30 tokens per second per user. If we were to take the 7,661 output tokens for o1-preview and then run a query that resulted in that same sequence length of 7,661 tokens on Llama 3.1 405B, this would limit us to a maximum batch size of 72 given that we want to target an interactivity of 30 tokens per second per user based on a pure roofline model.

For this simplified analysis, we do not incorporate any impact on memory bandwidth utilization or model flops utilization from different batch sizes.
The same problem run on GPT-4o only resulted in a sequence length of 775 tokens, and if the same number of tokens were run on Llama 3.1 405B, this would correspond to a maximum batch size of 368 when targeting an interactivity of 30 tokens per second per user based on roofline modeling of the KV Cache requirements. Because the inference system cost must be amortized across far fewer users, the cost per token is over 5x higher for the longer sequence length query, with the KV Cache limiting the maximum batch size. This is only a first principles framework, but it can help give a directional sense of how costs scale with respect to context length. There are other factors at play that also drive this large pricing gap.
But what is the nature of the increased compute intensity and greater memory requirements of reasoning models that leads to lower batch sizes and lower throughput per GPU?
The answer is twofold. The main driver of greater memory requirements is the larger KV Cache required to handle longer sequence lengths. The total KV Cache Size when using GQA can be calculated as per below:
Total GQA KV Cache Size in Bytes = Batch Size × Sequence Length × 2 × Number of Layers × (Hidden Size / Number of Attention Heads) × Number of KV Heads × Precision in Bytes
KV Cache size scales linearly with sequence length, but it also increases linearly with batch size, so having a large number of users all generating long sequences leads to very large KV Cache requirements.
In the below illustration, we show that for Llama 3.1 405B, a sequence length of 39,000 tokens would mean that the KV Cache requirements would completely fill up all the 640GB of total HBM capacity of an 8xH100 node, and we have not even factored in the 405GB required just to load the model parameters. If we factor in parameters, memory available for KV Cache drops down to 235GB (the red line in the below chart), and in reality we reach memory limits at sequence lengths shorter than about 16k tokens.
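The numbers above can be sanity-checked with a quick back-of-the-envelope script using the KV Cache formula, plugging in commonly cited Llama 3.1 405B configuration values (126 layers, hidden size 16,384, 128 attention heads, 8 KV heads); the FP16 KV cache and the batch size of 32 are our own illustrative assumptions, not figures taken from the chart:

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   hidden_size: int, num_attn_heads: int, num_kv_heads: int,
                   bytes_per_value: int = 2) -> int:
    """Total GQA KV cache = B * S * 2 * layers * head_dim * kv_heads * precision."""
    head_dim = hidden_size // num_attn_heads
    return batch_size * seq_len * 2 * num_layers * head_dim * num_kv_heads * bytes_per_value

# Assumed Llama 3.1 405B config: 126 layers, hidden 16384, 128 attention heads, 8 KV heads.
# FP16 KV cache and batch size 32 are assumed for illustration only.
gb = kv_cache_bytes(batch_size=32, seq_len=39_000, num_layers=126,
                    hidden_size=16_384, num_attn_heads=128, num_kv_heads=8) / 1e9
print(f"KV cache ≈ {gb:.0f} GB vs ~640 GB of HBM on an 8xH100 node")
```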

Because increases in KV Cache size lead directly to greater memory capacity and bandwidth requirements, they also reduce interactivity when batch size is held constant, or restrict to 16 the maximum batch size for which a minimum interactivity can still be delivered.

The other key factor is how FLOP requirements scale up with respect to longer sequence lengths:
Scaled Dot Product Attention (SDPA) FLOP required per token = 4 × Number of Heads × Number of Layers × Head Dimension × Sequence Length in Tokens
FLOP required per token scales linearly with sequence length; however, because this is FLOP per token, the total FLOP for a given sequence is multiplied by the sequence length again, meaning that FLOP requirements scale quadratically with sequence length.
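Plugging the same assumed Llama 3.1 405B-class attention shape into the SDPA formula shows the per-token cost growing linearly and the total cost growing quadratically with sequence length; this is an illustrative calculation only:

```python
def sdpa_flop_per_token(num_heads: int, num_layers: int, head_dim: int, seq_len: int) -> float:
    """SDPA FLOP per token = 4 * heads * layers * head_dim * sequence length."""
    return 4 * num_heads * num_layers * head_dim * seq_len

# Assumed 405B-class attention shape: 128 heads, 126 layers, head dimension 128.
for seq_len in (1_024, 2_048, 4_096, 8_192):
    per_token = sdpa_flop_per_token(128, 126, 128, seq_len)
    total = per_token * seq_len  # total attention FLOP grows quadratically with length
    print(f"S={seq_len:>5}: {per_token / 1e9:6.1f} GFLOP/token, {total / 1e12:8.1f} TFLOP total")
```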
In the illustration below, the inference system hits the FLOPS constraint very quickly as context length increases – at a sequence length of approximately 4,096 in the below example.

Increasing sequence lengths drastically increases memory and FLOP requirements, linearly and quadratically respectively, which results in much smaller batch sizes over which to amortize the cluster's total cost of ownership. This, in turn, makes each token served significantly more expensive.
Note that OpenAI makes heavy use of attention modifications such as local-global attention and others, which help mitigate these issues, but these only change the constant factor in transformer attention, slowing the quadratic scaling rather than solving it. Long-context architectures that address this while retaining quality are sorely needed; otherwise, reasoning models will forever have much higher costs per token, on top of generating many more tokens in general.
Reasoning models face reliability issues
Another challenge caused by increased sequence length, in addition to the much greater memory and FLOPS requirements at inference time, is reliability.
We have covered fault tolerance before and how it can enable multi-datacenter training, and it is a critical part of the hyperscaler infrastructure tool set across all applications.
Checkpointing during training runs is ubiquitously adopted by AI labs to allow training runs to quickly restart after faults, significantly reducing the disruption from these faults.
However, silent data corruption errors and other faults still occur at inference. Despite low rates of occurrence of many of these faults, the sheer number of users that hyperscalers serve and the large numbers of accelerators serving this inference means this issue must be tackled.
In the transformer architecture, each additional token generated is appended to all prior tokens generated before being passed through the model again. If an error occurs during the generation of any given token, this corrupted token becomes part of the context window of the conversation, potentially causing grammatical, contextual, or formatting errors.
This is true for all long-context models but is especially true for reasoning models, as the long sequence lengths lead to a compounding of errors. Many of these errors can also simply be innate to the model, or arise because the chain of thought started off on the wrong trajectory during inference.
O1 Pro Innovations & Cost
For this reason, OpenAI's o1 Pro implements Self-Consistency / Majority Vote at inference time. It is the exact same model and weights as the normal o1. On the surface, Self-Consistency / Majority Vote is incredibly costly because you generate 5x as many tokens if there are 5 streams for the vote. That would seem to justify OpenAI's price increase from $20 to $200 for ChatGPT Pro subscriptions.
In reality, the cost increase for OpenAI is nowhere near the price increase. This is because when running longer average sequence lengths and increasing the ratio of decode tokens versus prefill tokens, inference systems are generally much more bandwidth and capacity limited than they are FLOPs limited. They have spare FLOPs but nowhere to spend them. On the flip side, because Self-Consistency / Majority Vote uses a shared prefix for most of the sequence length, there is no need to spend additional bandwidth or memory on KV Cache.
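A toy calculation illustrates why prefix sharing matters; the per-token KV size reuses the 405B-class configuration assumed earlier, and the prefix and per-stream token counts are invented purely for illustration:

```python
# Per-sequence KV bytes per token for the same assumed 405B-class config as above (FP16).
KV_BYTES_PER_TOKEN = 2 * 126 * 128 * 8 * 2

def kv_for_votes(prefix_tokens: int, unique_tokens: int, streams: int, share_prefix: bool) -> float:
    """KV cache in GB for `streams` majority-vote samples, with or without prefix sharing."""
    if share_prefix:
        total_tokens = prefix_tokens + streams * unique_tokens
    else:
        total_tokens = streams * (prefix_tokens + unique_tokens)
    return total_tokens * KV_BYTES_PER_TOKEN / 1e9

# Illustrative numbers only: a 6,000-token shared prefix and 1,500 unique tokens per stream.
print(f"independent streams: {kv_for_votes(6_000, 1_500, 5, share_prefix=False):.1f} GB")
print(f"shared prefix      : {kv_for_votes(6_000, 1_500, 5, share_prefix=True):.1f} GB")
```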
Beyond o1
OpenAI o1 today is believed to focus on only one chain of thought, but a few other models released in just the last few weeks are already pushing the envelope on sampling multiple chains of thought and utilizing multiple agent approaches.
Deepseek R1, a Chinese reasoning model, also follows a single CoT. Deepseek was the first company to publicly match OpenAI's capability of developing and deploying a reasoning model. R1 not only matches o1, but also beats o1-preview on a math benchmark while showcasing better improvements via inference-time scaling.

Deepseek no doubt has the talent, the ability to execute well on infrastructure management, and the capability to train large models. They have the data, funding, and skill to match Western labs. Deepseek has more GPUs than you'd think: 50k Hopper GPUs – more than all but a few of the leading Western AI Labs.
There are two other currently available Chinese reasoning models. Alibaba's QwQ is a 32B parameter model that also relies on test-time compute and is suspected to follow multiple chains of thought. It outperforms o1-preview on some math-related benchmarks.

The model is not perfect, as it may switch between languages unexpectedly and enter recursive loops with no conclusive answer. Still, it is an important step in showcasing China’s capabilities in developing reasoning models. The relatively modest size (32B parameters) showcases how reasoning models can utilize a significantly smaller base model and get much better performance by increasing test-time compute.
Another notable recently announced Chinese reasoning model is Alibaba's Marco-o1. This model is more interesting than the rest – it took a base model (Qwen2-7B-Instruct) and conducted full-parameter fine-tuning (via SFT) using publicly available CoT datasets combined with synthetic data generated in house, likely following methods we outline above. The publicly available CoT datasets include OpenAI's CoT dataset, which the Alibaba team filtered according to their own heuristics before use.
Marco-o1 also incorporates Monte Carlo Tree Search (MCTS), a heuristic search algorithm that considers multiple paths and was used to aid AlphaGo's search capabilities. MCTS allows exploration of multiple reasoning paths using confidence scores derived from softmax-applied log probabilities of the top-k alternative tokens, guiding the model to optimal solutions. In this case, MCTS was also used to generate a high-quality synthetic CoT dataset that helped the model develop reasoning capabilities.
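A minimal sketch of this kind of confidence scoring, with made-up log-probability values, might look like the following; it is not Alibaba's implementation, just an illustration of softmax over top-k alternatives averaged across a rollout:

```python
import math

def step_confidence(chosen_logprob: float, topk_logprobs: list[float]) -> float:
    """Confidence of one generated token: softmax of the chosen token's log probability
    against the top-k alternatives (higher means the model was more certain)."""
    denom = sum(math.exp(lp) for lp in topk_logprobs)
    return math.exp(chosen_logprob) / denom

def rollout_confidence(token_records: list[tuple[float, list[float]]]) -> float:
    """Average per-token confidence over a reasoning rollout, used to score search branches."""
    scores = [step_confidence(chosen, topk) for chosen, topk in token_records]
    return sum(scores) / len(scores)

# Hypothetical logprobs for a 3-token rollout: (chosen token, top-5 alternatives incl. chosen).
rollout = [
    (-0.2, [-0.2, -2.1, -3.0, -3.5, -4.0]),
    (-0.9, [-0.9, -1.1, -2.5, -3.2, -4.4]),
    (-0.1, [-0.1, -2.8, -3.1, -3.9, -4.2]),
]
print(f"branch confidence ≈ {rollout_confidence(rollout):.2f}")
```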

Both of Alibaba’s reasoning models go further by exploring and searching many branches of this tree of potential reasoning paths at inference time, opening the door to achieving pass@n>1 accuracy. They take a multi-agent approach that is often used during post-training and implement it at inference time. In this approach, an actor model would generate the next reasoning step, while a critic model would evaluate this step and determine whether the actor model should continue with the chain of reasoning or backtrack.
Considering using more than one chain of thought to arrive at the final answer opens yet another dimension of scaling. Not only can more sophisticated tree search algorithms fed by greater amounts of compute lead to better results, but a larger sample size of results can also lead to greater accuracy. This is especially true if it is coupled with strong verifiers, as noted above, which means training better and larger verifiers leads to better results.
Nous Research takes an entirely different approach: instead of producing a reasoning model, they are producing a reasoning API. This API works across different models (e.g. Nous Hermes, Claude, GPT-4) and gives them reasoning capabilities through MCTS and other methods. These other methods include a Mixture of Agents approach where multiple models independently analyze the same prompt and then “confer” together. These models then collectively judge the best answer, and that is what is showcased to the user. In Nous' case, they provide ~3 responses for the user to pick from, though that is likely to improve with scale.
The scaling laws playing out in test-time compute include the time given to the model, the number of generations a model can do, the strength of the verifier(s) used in checking the generations, and the sophistication of the search algorithms.
Scaling Training Is Cheaper Than Scaling Inference Time Compute
The more expensive nature of reasoning models combined with their heavy token usage drives inference costs up considerably. Stopping this upward spiral in deployment costs is of paramount importance to model providers if they want to serve these reasoning models economically. The major labs don't have enough capacity to serve their models as broadly as they'd like. Microsoft still can't even roll out the full Copilot feature set. Sora cannot be used widely, and signing up has been impossible for two days (and possibly longer). Compute is still very limited, for both pre-training and inference.
To that end, scaling pre-training can still make a huge difference in reducing costs. Concretely, by overtraining with two orders of magnitude of extra FLOPs, the same performance can be achieved as with a Chinchilla-optimal model, resulting in a reduction in inference costs of one order of magnitude.

Scaling pre-training two additional orders of magnitude will be more expensive than ever, but it can still be justified, and hyperscalers continue to build out larger clusters, with Elon Musk aiming for a 1 million GPU cluster. Given that OpenAI and Microsoft are running inference of GPTs on roughly a couple hundred thousand GPUs, scaling training looks like it can still deliver the required cost savings.