
ARC Prize remains undefeated.
New ideas still needed.

By François Chollet
Published 20 Dec 2024

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

OpenAI's new o3 system – trained on the ARC-AGI-1 Public Training set – has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.

o Series Performance

This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.

The mission of ARC Prize goes beyond our first benchmark: to be a North Star towards AGI. And we're excited to be working with the OpenAI team and others next year to continue to design next-gen, enduring AGI benchmarks.

ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025. We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% is created.

Read on for the full testing report.


OpenAI o3 ARC-AGI Results

We tested o3 against two ARC-AGI datasets:

- Semi-Private Evaluation: 100 private tasks, used to assess overfitting
- Public Evaluation: 400 public tasks

At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).

Here are the results.

Set           Tasks   Efficiency   Score   Retail Cost   Samples   Tokens   Cost/Task   Time/Task (mins)
Semi-Private  100     High         75.7%   $2,012        6         33M      $20         1.3
Semi-Private  100     Low          87.5%   -             1024      5.7B     -           13.8
Public        400     High         82.8%   $6,677        6         111M     $17         N/A
Public        400     Low          91.5%   -             1024      9.5B     -           N/A

Note: o3 high-compute costs are not listed because pricing and feature availability are still TBD. The amount of compute was roughly 172x the low-compute configuration.

Due to variable inference budget, efficiency (e.g., compute cost) is now a required metric when reporting performance. We've documented both the total costs and the cost per task as an initial proxy for efficiency. As an industry, we'll need to figure out what metric best tracks efficiency, but directionally, cost is a solid starting point.
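To make the cost proxy concrete, the high-efficiency Semi-Private row above reduces to simple arithmetic (a sketch using only the figures reported in the table):

```python
# Cost-per-task as a first-order efficiency proxy, using the
# high-efficiency Semi-Private figures reported above.
retail_cost_usd = 2012      # total retail cost for the run
num_tasks = 100             # Semi-Private Evaluation set size
total_tokens = 33_000_000   # ~33M tokens across all tasks

cost_per_task = retail_cost_usd / num_tasks       # ≈ $20/task
tokens_per_task = total_tokens / num_tasks        # ≈ 330k tokens/task

print(f"~${cost_per_task:.0f}/task, ~{tokens_per_task / 1e6:.2f}M tokens/task")
```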

The high-efficiency score of 75.7% is within the budget rules of ARC-AGI-Pub (costs <$10k) and therefore qualifies as 1st place on the public leaderboard!

The low-efficiency score of 87.5% is quite expensive, but still shows that performance on novel tasks does improve with increased compute (at least up to this level).

Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.

Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode. But cost-performance will likely improve quite dramatically over the next few months and years, so you should plan for these capabilities to become competitive with human work within a fairly short timeline.

o3's improvement over the GPT series proves that architecture is everything. You couldn't throw more compute at GPT-4 and get these results. Simply scaling up the things we were doing from 2019 to 2023 – take the same architecture, train a bigger version on more data – is not enough. Further progress is about new ideas.


So is it AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

What's different about o3 compared to older models?

Why does o3 score so much higher than o1? And why did o1 score so much higher than GPT-4o in the first place? I think this series of results provides invaluable data points for the ongoing pursuit of AGI.

My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.

This "memorize, fetch, apply" paradigm can achieve arbitrary levels of skill at arbitrary tasks given appropriate training data, but it cannot adapt to novelty or pick up new skills on the fly (which is to say, there is no fluid intelligence at play here). This has been exemplified by the low performance of LLMs on ARC-AGI, the only benchmark specifically designed to measure adaptability to novelty – GPT-3 scored 0, GPT-4 scored near 0, GPT-4o got to 5%. Scaling up these models to the limits of what's possible wasn't getting ARC-AGI numbers anywhere near what basic brute enumeration could achieve years ago (up to 50%).
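The "memorize, fetch, apply" paradigm can be caricatured in a few lines (a deliberately simplified sketch; the repository and task descriptions are invented for illustration, not drawn from any real model):

```python
# Toy caricature of "memorize, fetch, apply": a fixed repository of
# stored programs keyed by task description. Anything outside the
# repository is unsolvable -- there is no recombination step, and
# hence no adaptation to novelty.
repository = {
    "reverse a list": lambda xs: xs[::-1],
    "sum a list": lambda xs: sum(xs),
}

def fetch_and_apply(task_description, task_input):
    program = repository.get(task_description)   # fetch
    if program is None:
        raise LookupError("novel task: no stored program applies")
    return program(task_input)                   # apply

print(fetch_and_apply("reverse a list", [1, 2, 3]))  # [3, 2, 1]
```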

To adapt to novelty, you need two things. First, you need knowledge – a set of reusable functions or programs to draw upon. LLMs have more than enough of that. Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis. LLMs have long lacked this feature. The o series of models fixes that.
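In the same toy setting, adapting to novelty would mean searching over combinations of the stored functions to synthesize a new program that fits the task's demonstration pairs. This is a hypothetical enumeration sketch, not o3's actual mechanism; the primitives and demos are invented:

```python
from itertools import product

# Reusable primitives -- the stored "knowledge".
primitives = {
    "reverse": lambda xs: xs[::-1],
    "double": lambda xs: [2 * x for x in xs],
    "drop_first": lambda xs: xs[1:],
}

def synthesize(demos, max_depth=2):
    """Enumerate short compositions of primitives and return the first
    one consistent with every (input, output) demonstration pair."""
    for depth in range(1, max_depth + 1):
        for names in product(primitives, repeat=depth):
            def program(xs, names=names):
                for name in names:
                    xs = primitives[name](xs)
                return xs
            if all(program(inp) == out for inp, out in demos):
                return names
    return None

# Novel task: "drop the first element, then reverse the rest".
demos = [([1, 2, 3], [3, 2]), ([5, 6], [6])]
print(synthesize(demos))  # ('drop_first', 'reverse')
```

Brute enumeration like this is exactly what scored up to 50% on ARC-AGI years ago; the point of a learned prior is to guide the search instead of exhausting it.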

For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.
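A minimal sketch of what evaluator-guided search over candidate chains of thought might look like in outline (every name here is hypothetical; o3's actual mechanism is unpublished, and `propose`/`evaluate` stand in for learned models):

```python
import heapq

def guided_search(task, propose, evaluate, budget=32, beam=4):
    """Best-first search over candidate chains of thought (CoTs).
    `propose(task, cot)` returns extensions of a partial CoT;
    `evaluate(task, cot)` scores a CoT (higher = more promising)."""
    frontier = [(-evaluate(task, ""), "")]  # max-heap via negated scores
    best_cot, best_score = "", float("-inf")
    for _ in range(budget):
        if not frontier:
            break
        neg_score, cot = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_cot, best_score = cot, -neg_score
        for ext in propose(task, cot)[:beam]:
            heapq.heappush(frontier, (-evaluate(task, ext), ext))
    return best_cot

# Toy stand-ins: build a target string character by character.
def propose(task, cot):
    return [cot + c for c in "abc"]

def evaluate(task, cot):
    matches = sum(1 for a, b in zip(cot, task) if a == b)
    return matches - (len(cot) > len(task))  # penalize overlong CoTs

print(guided_search("abc", propose, evaluate))  # 'abc'
```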

So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.

Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.

There are however two significant differences between what's happening here and what I meant when I previously described "deep learning-guided program search" as the best path to get to AGI. Crucially, the programs generated by o3 are natural language instructions (to be "executed" by an LLM) rather than executable symbolic programs. This means two things. First, that they cannot make contact with reality via execution and direct evaluation on the task – instead, they must be evaluated for fitness via another model, and the evaluation, lacking such grounding, might go wrong when operating out of distribution. Second, the system cannot autonomously acquire the ability to generate and evaluate these programs (the way a system like AlphaZero can learn to play a board game on its own). Instead, it is reliant on expert-labeled, human-generated CoT data.
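The grounding gap is easy to make concrete: a candidate *symbolic* program can be falsified directly against a task's demonstration pairs, with no learned judge involved, whereas a natural-language program must be scored by another model. A toy sketch of the symbolic case (the demos and candidates are invented):

```python
def is_consistent(program, demos):
    """Ground a candidate symbolic program in reality: run it on every
    demonstration input and require exact output matches. The verdict
    comes from execution, not from a model, so it cannot drift out of
    distribution."""
    try:
        return all(program(inp) == out for inp, out in demos)
    except Exception:
        return False  # a crashing candidate is simply rejected

demos = [([1, 2, 3], [3, 2, 1]), ([4], [4])]
print(is_consistent(lambda xs: xs[::-1], demos))    # True
print(is_consistent(lambda xs: sorted(xs), demos))  # False
```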

It's not yet clear what the exact limitations of the new system are and how far it might scale. We'll need further testing to find out. Regardless, the current performance represents a remarkable achievement, and a clear confirmation that intuition-guided test-time search over program space is a powerful paradigm to build AI systems that can adapt to arbitrary tasks.

What comes next?

First of all, open-source replication of o3, facilitated by the ARC Prize competition in 2025, will be crucial to move the research community forward. A thorough analysis of o3's strengths and limitations is necessary to understand its scaling behavior, the nature of its potential bottlenecks, and anticipate what abilities further developments might unlock.

Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

We're going to be raising the bar with a new version – ARC-AGI-2 – which has been in the works since 2022. It promises a major reset of the state-of-the-art. We want it to push the boundaries of AGI research with hard, high-signal evals that highlight current AI limitations.

Our early ARC-AGI-2 testing suggests it will be useful and extremely challenging, even for o3. And, of course, ARC Prize's objective is to produce a high-efficiency and open-source solution in order to win the Grand Prize. We currently intend to launch ARC-AGI-2 alongside ARC Prize 2025 (estimated launch: late Q1).

Going forward, the ARC Prize Foundation will continue to create new benchmarks to focus the attention of researchers on the hardest unsolved problems on the way to AGI. We've started work on a third-generation benchmark which departs completely from the 2019 ARC-AGI format and incorporates some exciting new ideas.


Get Involved: Open-Source Analysis

Today, we're also releasing data (results, attempts, and prompt) from our high-compute o3 testing and would like your help to analyze the results. In particular, we are very curious about the ~9% set of Public Eval tasks o3 was unable to solve, even with lots of compute, yet are straightforward for humans.

We invite the community to help us assess the characteristics of both solved and unsolved tasks.

To get your ideas flowing, here are 3 examples of tasks unsolved by high-compute o3.

ARC-AGI Task ID: c6e1b8da
ARC-AGI Task ID: 0d87d2a6
ARC-AGI Task ID: b457fec5

See our full set of o3 testing data.

Here's the prompt that was used in testing.

We've also created a new channel in our Discord named oai-analysis and we'd love to hear your analysis and insights there. Or tag us on X/Twitter @arcprize.


Conclusions

To sum up – o3 represents a significant leap forward. Its performance on ARC-AGI highlights a genuine breakthrough in adaptability and generalization, in a way that no other benchmark could have made as explicit.

o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search. This is not just incremental progress; it is new territory, and it demands serious scientific attention.
