Exploring OpenAI O1 Model Replication

Author: Jian Hu

First published on: 2024/11/21

The release of models like Kimi K0-Math, DeepSeek R1 Lite and Qwen QwQ has brought the replication of OpenAI’s O1 models into the spotlight, igniting fervent discussions across the AI community.

Two months ago, I launched an open-source project called Awesome-LLM-Strawberry, a curated collection of research papers, blogs, and projects focusing on OpenAI O1 model replication strategies and reasoning techniques. The repository has garnered over 5,000 stars on GitHub.

Awesome LLM Strawberry (OpenAI o1) - GitHub "A collection of LLM papers, blogs, and projects, with a focus on OpenAI O1 and reasoning techniques."

By diving deep into relevant research and collaborating with experts, I’ve compiled and hypothesized several potential strategies for replicating O1 models. This post outlines these findings for further exploration.

DeepSeek R1 Lite & Kimi K0-Math & Qwen QwQ

The recent releases of DeepSeek R1 Lite, Kimi K0-Math and Qwen QwQ provide valuable insights into potential approaches for O1 model replication.

[Figure: Evaluation results for DeepSeek R1 Lite and Kimi K0-Math]

[Figure: Evaluation results for Qwen QwQ]

Training Phase

Stage 0: Continued Pretraining

Objective: Enhance the base model’s reasoning capabilities using large-scale datasets such as CoT (Chain-of-Thought), code, and mathematics.

Stage 1: Supervised Fine-Tuning (SFT)

Objective: Train the model to generate ultra-long CoT reasoning chains and reflective instruction formats, laying the groundwork for subsequent reinforcement learning training.

For open-source models, such as o1-journey-part2, distilling from o1-preview has achieved good results.

Stage 2: Reinforcement Learning for Advanced Reasoning

Objective: Enhance the long-reasoning and reflection abilities of large language models using RL.

Option 1: Large-Scale RLHF (PPO)

Datasets and Feedback: High-quality mathematical and code datasets, with feedback from reward models (RM), rule-based checks, or compilers.

Advantages: Highly scalable, well-suited for large-scale training pipelines.
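To make "rule-based feedback" concrete, here is a minimal sketch of a reward function for math problems. It assumes the model is prompted to put its final answer in \boxed{...}; that convention, the function names, and the reward values (1.0, 0.0, -1.0) are illustrative assumptions, not details of any released system.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def normalize(answer: str):
    """Parse as an exact rational number when possible, else compare as text."""
    cleaned = answer.replace(",", "").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()


def rule_based_reward(response: str, reference: str) -> float:
    """Assumed scheme: 1.0 if correct, 0.0 if wrong, -1.0 if no boxed answer."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return -1.0  # format penalty (an assumption of this sketch)
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

A reward like this can be computed for every sampled response without a learned reward model, which is one reason rule-based feedback fits large-scale pipelines.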

An example of this RL-based approach is Self-Correction via Reinforcement Learning (SCoRe), developed by Google DeepMind.

Option 2: MCTS-Based Strategy

Approach: Utilize Monte Carlo Tree Search (MCTS) to generate complex reasoning samples, combined with high-quality datasets, RM, rule-based and compiler feedback, and off-policy RL or SFT techniques.

Advantages: Allows customization of CoT formats and may achieve higher performance ceilings.

Challenges: Training pipelines are complex, making large-scale training more difficult.
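The MCTS loop itself can be sketched on a toy problem. In the sketch below a "reasoning step" is just one of the numbers 1, 2, or 3, and a chain of four steps is rewarded when it sums to 10. In a real pipeline the steps would be proposed by the LLM and the reward would come from an RM, rules, or a compiler; the toy task and every name here are hypothetical.

```python
import math
import random


class Node:
    """One partial reasoning chain in the search tree."""

    def __init__(self, state, parent=None):
        self.state = state        # tuple of steps taken so far
        self.parent = parent
        self.children = {}        # action -> child Node
        self.visits = 0
        self.value_sum = 0.0

    def uct(self, c=1.4):
        """Mean reward plus an exploration bonus (UCB1 applied to trees)."""
        if self.visits == 0:
            return float("inf")
        mean = self.value_sum / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def mcts(root_state, actions, is_terminal, reward, n_iter=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while every action has already been tried.
        while not is_terminal(node.state) and len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda child: child.uct())
        # 2. Expansion: add one untried child.
        if not is_terminal(node.state):
            untried = [a for a in actions if a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(node.state + (action,), parent=node)
            node = node.children[action]
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(actions),)
        value = reward(state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    return root


def rewarded_chains(node, reward, is_terminal):
    """Collect complete chains in the tree that earned a positive reward."""
    if is_terminal(node.state):
        return [node.state] if reward(node.state) > 0 else []
    found = []
    for child in node.children.values():
        found.extend(rewarded_chains(child, reward, is_terminal))
    return found


# Toy task (an assumption of this sketch): four steps that sum to 10.
ACTIONS = (1, 2, 3)
done = lambda s: len(s) == 4
hits_target = lambda s: 1.0 if sum(s) == 10 else 0.0

root = mcts((), ACTIONS, done, hits_target)
samples = rewarded_chains(root, hits_target, done)
```

The chains returned by rewarded_chains play the role of the "complex reasoning samples" that would then feed SFT or off-policy RL.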

An example of the MCTS-based strategy is AlphaZero-like training:

Paper: AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training

[Figure: AlphaZero-like training, from the paper above]

The training method shown in the figure can be replaced with DPO or even off-policy RL.
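For reference, the standard DPO objective for a single preference pair can be written in a few lines. The sketch assumes that the "chosen" and "rejected" responses are a higher-reward and a lower-reward chain from the tree search; that pairing rule is an assumption here, not something stated in the paper. Inputs are summed token log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin = (log pi(chosen) - log ref(chosen))
           - (log pi(rejected) - log ref(rejected))
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); guard against overflow.
    return math.log1p(math.exp(-x)) if x > -30 else -x


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen chain's log-probability lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Because the loss needs only log-probabilities of stored responses, it can be trained on search outputs collected offline.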

DeepSeek R1 Lite, Kimi K0-Math and Qwen QwQ are likely built upon these strategies. One possible approach is to first use MCTS to generate SFT data to finetune the LLMs and then apply PPO training.

Inference Phase

Option 1: Ultra-Long CoT + Reflection Chains

Approach:

Combines long reasoning chains with reflection mechanisms.

Implements Best-of-N or Majority Voting for inference scaling.

Advantages:

Simple to implement and scale.

Fast output, particularly suited for streaming inference.
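Both inference-scaling schemes named above are short to express. The sketch below assumes the N sampled responses have already been reduced to final answers (for Majority Voting) or can be scored by some function such as a reward model (for Best-of-N); the function names are illustrative.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among N samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, score):
    """Return the candidate with the highest score (e.g. from a reward model)."""
    return max(candidates, key=score)


voted = majority_vote(["42", "41", "42", "7"])
# A stand-in scorer: here, simply prefer the longer response.
picked = best_of_n(["short", "a longer chain"], score=len)
```

Majority Voting needs only answer extraction, while Best-of-N depends on the quality of the scorer.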

Case Study: DeepSeek R1 Lite

In tests involving simple problems like "1+1," DeepSeek R1 Lite demonstrates its ultra-long reasoning process and efficiently outputs answers in a streaming format (20 seconds).

Fun Fact: DeepSeek's API suffers from higher latency compared to competitors due to architectural constraints.

Insight: The Inference Scaling Law trends shown by DeepSeek R1 Lite suggest that the model improves performance by increasing inference length rather than width. This pattern was confirmed in earlier research by Google DeepMind.

[Figure: DeepSeek R1 Lite demonstrates its Inference Scaling Law, showing that increasing inference length is more effective than increasing width.]

DeepSeek R1 Lite appears to control its reasoning length. One potentially effective strategy for doing so is multiple rounds of reflective dialogue.
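One speculative way to picture this: a fixed number of reflection rounds bounds how long the chain can grow. In the sketch below, generate stands in for a model call and the reflection prompt is invented for illustration; nothing here is known about how DeepSeek R1 Lite actually works.

```python
def reflective_rounds(question, generate, max_rounds=3,
                      reflect_prompt="Wait, let me double-check the reasoning above."):
    """Alternate model attempts with a reflection prompt, up to max_rounds attempts."""
    transcript = [question]
    for round_idx in range(max_rounds):
        # The model sees everything so far, including earlier reflections.
        transcript.append(generate("\n".join(transcript)))
        if round_idx < max_rounds - 1:
            transcript.append(reflect_prompt)
    return transcript


# Stub model (hypothetical): reports how many reflections it has seen.
def stub_generate(context):
    return f"attempt after {context.count('Wait')} reflections"


transcript = reflective_rounds("What is 1+1?", stub_generate)
```

Here max_rounds is the knob that controls reasoning length: more rounds give a longer chain and more chances to self-correct.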

These signs (efficient streaming output and the Inference Scaling Law) indicate that DeepSeek R1 Lite is very likely using Ultra-Long CoT, without even enabling Majority Voting or Best-of-N.

Case Study: Kimi K0-Math & QwQ

We recently (2024/11/26) tested Kimi K0-Math and found that it exhibits behavioral patterns similar to DeepSeek R1 Lite, such as rapid streaming output without hidden reasoning chains. The difference is that DeepSeek's reasoning chains are somewhat longer and end with a summary of the reasoning. We therefore speculate that Kimi K0-Math is also based on Long CoT and reflection mechanisms.

In our recent tests of the open-source Qwen QwQ, we found that its impressive performance on evaluation datasets like AIME is achieved solely through Long CoT, further validating this view.

Option 2: MCTS-Based Inference

Advantages: Potentially superior performance ceilings.

Challenges:

High implementation complexity.

Expensive and computationally inefficient.

Difficult to deploy at scale in the short term.

[Figure: rStar (MCTS)]

Case Study: rStar

Below are some experimental results for MCTS (rStar) and Self-Consistency on the MATH500 dataset.

H100 Hours (Qwen2.5-7B-Instruct)

Method	Time (hours)
Self-Consistency (N = 64)	20
MCTS (rStar: 16 rollouts, depth = 8)	58

Llama3.1-8B-Instruct

Self-Consistency without CoT

Number of responses	Accuracy (%)
1	44.84
2	44.84
4	48.58
8	51.12
16	52.08
32	52.84
64	53.16

MCTS (rStar: 16 rollouts, depth = 8)

Metric	Value
Majority Vote Accuracy (%)	59.60
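The trade-off in these tables can be made explicit with a little arithmetic. Note that the H100 hours were measured with Qwen2.5-7B-Instruct and the accuracies with Llama3.1-8B-Instruct, so putting them side by side is only indicative. The variable names are illustrative.

```python
# Accuracy (%) of Self-Consistency by number of responses (Llama3.1-8B-Instruct).
sc_accuracy = {1: 44.84, 2: 44.84, 4: 48.58, 8: 51.12, 16: 52.08, 32: 52.84, 64: 53.16}
mcts_accuracy = 59.60              # rStar, 16 rollouts, depth 8
sc_hours, mcts_hours = 20, 58      # H100 hours (Qwen2.5-7B-Instruct)

# Gain from each doubling of the number of responses: returns diminish quickly.
ns = sorted(sc_accuracy)
doubling_gains = [round(sc_accuracy[b] - sc_accuracy[a], 2) for a, b in zip(ns, ns[1:])]

sc_total_gain = round(sc_accuracy[64] - sc_accuracy[1], 2)       # 1 -> 64 responses
mcts_extra_gain = round(mcts_accuracy - sc_accuracy[64], 2)      # MCTS over SC at N = 64
cost_ratio = mcts_hours / sc_hours                               # relative compute
```

On these numbers, MCTS adds 6.44 points over Self-Consistency at N = 64 for about 2.9 times the compute, while the last doubling of Self-Consistency samples adds only 0.32 points.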

Conclusion

The journey toward replicating OpenAI’s O1 models is well underway. From Kimi K0-Math to DeepSeek R1 Lite and Qwen QwQ, the community is actively exploring diverse training and inference strategies. Each approach presents unique strengths and challenges, whether it’s leveraging large-scale data during training or employing innovative techniques like reflective CoT chains during inference.

A key open question is how to efficiently construct long-chain CoT data that is logically coherent and natural, has an appropriate level of reflection, and features a reasonable triggering mechanism. Once the method for synthesizing this data is determined, how to scale up training and inference will become clearer.
