AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years. Tracing trendlines in compute (~0.5 orders of magnitude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year), and “unhobbling” gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.
Look. The models, they just want to learn. You have to understand this. The models, they just want to learn.
Ilya Sutskever (circa 2015, via Dario Amodei)
GPT-4’s capabilities came as a shock to many: an AI system that could write code and essays, could reason through difficult math problems, and ace college exams. A few years ago, most thought these were impenetrable walls.
But GPT-4 was merely the continuation of a decade of breakneck progress in deep learning. A decade earlier, models could barely identify simple images of cats and dogs; four years earlier, GPT-2 could barely string together semi-plausible sentences. Now we are rapidly saturating all the benchmarks we can come up with. And yet this dramatic progress has merely been the result of consistent trends in scaling up deep learning.
There have been people who have seen this for far longer. They were scoffed at, but all they did was trust the trendlines. The trendlines are intense, and they were right. The models, they just want to learn; you scale them up, and they learn more.
I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn’t require believing in sci-fi; it just requires believing in straight lines on a graph.

Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies), based on the public estimates discussed in this piece. As models are scaled up, they consistently get smarter; by “counting the OOMs” we get a rough sense of what model intelligence to expect in the (near) future. (This graph shows only the scaleup of base models; “unhobbling” gains are not pictured.)
In this piece, I will simply “count the OOMs” (OOM = order of magnitude, 10x = 1 order of magnitude): look at the trends in 1) compute, 2) algorithmic efficiencies (algorithmic progress that we can think of as growing “effective compute”), and 3) “unhobbling” gains (fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness). We trace the growth in each over four years before GPT-4, and what we should expect in the four years after, through the end of 2027. Given deep learning’s consistent improvements for every OOM of effective compute, we can use this to project future progress.
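To spell out the bookkeeping (my own notation, not from the original): the drivers compose multiplicatively, so they add in log space, which is what makes a single “effective compute” scale possible:

```latex
\mathrm{OOMs}(x) = \log_{10}(x), \qquad
\Delta_{\mathrm{effective}} = \Delta_{\mathrm{compute}} + \Delta_{\mathrm{algorithmic}} \quad \text{(in OOMs)}
```

“Unhobbling” gains are harder to place on this compute-equivalent scale, which is why they are tracked separately.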
Publicly, things have been quiet for a year since the GPT-4 release, as the next generation of models has been in the oven—leading some to proclaim stagnation and that deep learning is hitting a wall. But by counting the OOMs, we get a peek at what we should actually expect.
The upshot is pretty simple. GPT-2 to GPT-4—from models that were impressive for sometimes managing to string together a few coherent sentences, to models that ace high-school exams—was not a one-time gain. We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup—resulting in another GPT-2-to-GPT-4-sized qualitative jump—over four years. Moreover, and critically, that doesn’t just mean a better chatbot; picking the many obvious low-hanging fruit on “unhobbling” gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.
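As a quick sanity check on that ~100,000x figure, here is the back-of-envelope arithmetic in code form (a sketch using the rough ~0.5 OOM/year rates quoted at the top of this piece; the piece’s own accounting allows ranges on each component):

```python
# Trend rates quoted in this piece (rough estimates):
compute_ooms_per_year = 0.5   # physical compute
algo_ooms_per_year = 0.5      # algorithmic efficiencies

for years in (4, 5):
    ooms = (compute_ooms_per_year + algo_ooms_per_year) * years
    print(f"{years} years: ~{ooms:.0f} OOMs = ~{10 ** ooms:,.0f}x")
# 4 years: ~4 OOMs = ~10,000x;  5 years: ~5 OOMs = ~100,000x
```

100,000x is exactly 5 OOMs; at roughly one OOM of effective compute per year, that is about what compounds from GPT-4 (trained in 2022) through the end of 2027.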
While the inference is simple, the implication is striking. Another jump like that very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers. Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops—the topic of the next piece in the series.
Even now, barely anyone is pricing all this in. But situational awareness on AI isn’t actually that hard, once you step back and look at the trends. If you keep being surprised by AI capabilities, just start counting the OOMs.
The last four years
We have machines now that we can basically talk to like humans. It’s a remarkable testament to the human capacity to adjust that this seems normal, that we’ve become inured to the pace of progress. But it’s worth stepping back and looking at the progress of just the last few years.
GPT-2 to GPT-4
Let me remind you of how far we came in just the ~4 (!) years leading up to GPT-4.
GPT-2 (2019) ~ preschooler: “Wow, it can string together a few plausible sentences.” A very-cherry-picked example of a semi-coherent story about unicorns in the Andes it generated was incredibly impressive at the time. And yet GPT-2 could barely count to 5 without getting tripped up; when asked to summarize an article, it did only barely better than picking 3 random sentences from it.

Some examples of what people found impressive about GPT-2 at the time. Left: GPT-2 does an ok job on very basic reading comprehension questions. Right: In a cherry-picked sample (best of 10 tries), GPT-2 can write a semi-coherent paragraph that says some semi-relevant things about the Civil War.
Comparing AI capabilities with human intelligence is difficult and flawed, but I think it’s informative to consider the analogy here, even if it’s highly imperfect. GPT-2 was shocking for its command of language, and its ability to occasionally generate a semi-cohesive paragraph, or occasionally answer simple factual questions correctly. It’s what would have been impressive for a preschooler.
GPT-3 (2020) ~ elementary schooler: “Wow, with just some few-shot examples it can do simple useful tasks.” It became much more consistently coherent across multiple paragraphs, could correct grammar, and could do some very basic arithmetic. For the first time, it was also commercially useful in a few narrow ways: for example, GPT-3 could generate simple copy for SEO and marketing.

Some examples of what people found impressive about GPT-3 at the time. Top: After a simple instruction, GPT-3 can use a made-up word in a new sentence. Bottom-left: GPT-3 can engage in rich interactive storytelling. Bottom-right: GPT-3 can generate some very simple code.
Again, the comparison is imperfect, but what impressed people about GPT-3 is perhaps what would have been impressive for an elementary schooler: it wrote some basic poetry, could tell richer and coherent stories, could start to do rudimentary coding, could fairly reliably learn from simple instructions and demonstrations, and so on.
GPT-4 (2023) ~ smart high schooler: “Wow, it can write pretty sophisticated code and iteratively debug, it can write intelligently and sophisticatedly about complicated subjects, it can reason through difficult high-school competition math, it’s beating the vast majority of high schoolers on whatever tests we can give it, etc.” From code to math to Fermi estimates, it can think and reason. GPT-4 is now useful in my daily tasks, from helping write code to revising drafts.

Some of what people found impressive about GPT-4 when it was released, from the “Sparks of AGI” paper. Top: It can write very complicated code (producing the plots shown in the middle) and can reason through nontrivial math problems. Bottom-left: Solving an AP math problem. Bottom-right: Solving a fairly complex coding problem. More interesting excerpts from that exploration of GPT-4’s capabilities can be found in the paper.
On everything from AP exams to the SAT, GPT-4 scores better than the vast majority of high schoolers.
Of course, even GPT-4 is still somewhat uneven; for some tasks it’s much better than smart high-schoolers, while there are other tasks it can’t yet do. That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I’ll discuss in-depth later. The raw intelligence is (mostly) there, even if the models are still artificially constrained; it’ll take extra work to unlock models being able to fully apply that raw intelligence across applications.

Progress over just four years. Where are you on this line?
The trends in deep learning
The pace of deep learning progress in the last decade has simply been extraordinary. A mere decade ago it was revolutionary for a deep learning system to identify simple images. Today, we keep trying to come up with novel, ever harder tests, and yet each new benchmark is quickly cracked. It used to take decades to crack widely-used benchmarks; now it feels like mere months.

Deep learning systems are rapidly reaching and even surpassing human-level performance in many domains. Source: Our World in Data.
We’re literally running out of benchmarks. As an anecdote, my friends Dan and Collin made a benchmark called MMLU a few years ago, in 2020. They hoped to finally make a benchmark that would stand the test of time, equivalent to all the hardest exams we give high school and college students. Just three years later, it’s basically solved: models like GPT-4 and Gemini get ~90%.
More broadly, GPT-4 mostly cracks all the standard high school and college aptitude tests. (And even the one year from GPT-3.5 to GPT-4 often took us from well below median human performance to the top of the human range.)

GPT-4 scores on standardized tests. Note also the large jump from GPT-3.5 to GPT-4 in human percentile on these tests, often going from well below the human median to the very top of the human range. (And this is GPT-3.5, a fairly recent model released less than a year before GPT-4, not the clunky, elementary-school-level GPT-3 we discussed earlier!)

Gray: Professional forecasts, made in August 2021, of June 2022 performance on the MATH benchmark (difficult problems from high-school math competitions). Red star: actual state-of-the-art performance by June 2022, far exceeding even the upper range the forecasters gave; the median ML researcher was even more pessimistic.
Or consider the MATH benchmark, a set of difficult mathematics problems from high-school math competitions. When the benchmark was released in 2021, the best models could only solve ~5% of problems. And the original paper noted: “Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue […]. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community”—in other words, they thought solving MATH would require fundamentally new breakthroughs. A survey of ML researchers predicted minimal progress in the coming years; and yet within just a year (by mid-2022), the best models went from ~5% to 50% accuracy. Today, MATH is basically solved, with recent performance over 90%.
Over and over again, year after year, skeptics have claimed “deep learning won’t be able to do X” and have been quickly proven wrong. If there’s one lesson we’ve learned from the past decade of AI, it’s that you should never bet against deep learning.
Now the hardest unsolved benchmarks are tests like GPQA, a set of PhD-level biology, chemistry, and physics questions. Many of the questions read like gibberish to me, and even PhDs in other scientific fields spending 30+ minutes with Google barely score above random chance. Claude 3 Opus currently gets ~60%, compared to in-domain PhDs who get ~80%—and I expect this benchmark to fall, too, within a generation or two of models.

Example GPQA questions. Models are already better than me at these, and we’ll likely reach expert-PhD-level performance soon…
Counting the OOMs
How did this happen? The magic of deep learning is that it just works—and the trendlines have been astonishingly consistent, despite naysayers at every turn.

The effects of scaling compute, in the example of OpenAI’s Sora.
With each OOM of effective compute, models predictably, reliably get better. If we can count the OOMs, we can (roughly, qualitatively) extrapolate capability improvements. That’s how a few prescient individuals saw GPT-4 coming.
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
- Compute: We’re using much bigger computers to train these models.
- Algorithmic efficiencies: There’s a continuous trend of algorithmic progress. Many of these act as “compute multipliers,” and we can put them on a unified scale of growing effective compute.
- “Unhobbling” gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.
We can “count the OOMs” of improvement along these axes: that is, trace the scaleup for each in units of effective compute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on. We can also look at what we should expect on top of GPT-4, from 2023 to 2027.
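Those conversions are just base-10 logarithms; as a quick illustrative check:

```python
import math

for multiplier in (3, 10, 30, 100):
    # log10: 3x ~ 0.5 OOMs, 10x = 1 OOM, 30x ~ 1.5 OOMs, 100x = 2 OOMs
    print(f"{multiplier}x = {math.log10(multiplier):.2f} OOMs")
```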
I’ll go through each one-by-one, but the upshot is clear: we are rapidly racing through the OOMs. There are potential headwinds in the data wall, which I’ll address—but overall, it seems likely that we should expect another GPT-2-to-GPT-4-sized jump, on top of GPT-4, by 2027.
Compute
I’ll start with the most commonly-discussed driver of recent progress: throwing (a lot) more compute at models.
Many people assume that this is simply due to Moore’s Law. But even in the old days when Moore’s Law was in its heyday, it was comparatively glacial—perhaps 1-1.5 OOMs per decade. We are seeing much more rapid scaleups in compute—close to 5x the speed of Moore’s law—instead because of mammoth investment. (Spending even a million dollars on a single model used to be an outrageous thought nobody would entertain, and now that’s pocket change!)
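For concreteness, here is that comparison as a sketch (assuming the classic ~2x-every-2-years form of Moore’s law and the ~0.5 OOM/year compute trend described above):

```python
import math

moore_classic = math.log10(2) * (10 / 2)  # ~2x every 2 years -> ~1.5 OOMs per decade
ai_ooms_per_decade = 0.5 * 10             # ~0.5 OOMs/year -> ~5 OOMs per decade

# Compare against both ends of the 1-1.5 OOMs/decade range for Moore's law:
for moore in (1.0, moore_classic):
    ratio = ai_ooms_per_decade / moore
    print(f"vs Moore's law at {moore:.1f} OOMs/decade: {ratio:.1f}x faster")
```

Against the low end of that range the ratio is ~5x; against the classic ~1.5 OOMs/decade it is ~3x.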
| Model | Estimated compute | Growth |
|---|---|---|
| GPT-2 (2019) | ~4e21 FLOP | |
| GPT-3 (2020) | ~3e23 FLOP | + ~2 OOMs |
| GPT-4 (2023) | 8e24 to 4e25 FLOP | + ~1.5–2 OOMs |
GPT-2 to GPT-4 compute, as estimated by Epoch AI.
We can use public estimates from Epoch AI (a source widely respected for its excellent analysis of AI trends) to trace the compute scaleup from 2019 to 2023.
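The “Growth” column follows directly from the FLOP estimates; checking the arithmetic with values copied from the table above:

```python
import math

gpt2, gpt3 = 4e21, 3e23            # estimated training FLOP
gpt4_low, gpt4_high = 8e24, 4e25

print(f"GPT-2 -> GPT-3: +{math.log10(gpt3 / gpt2):.1f} OOMs")   # ~+1.9, i.e. "+ ~2 OOMs"
print(f"GPT-3 -> GPT-4: +{math.log10(gpt4_low / gpt3):.1f} to "
      f"+{math.log10(gpt4_high / gpt3):.1f} OOMs")               # ~+1.4 to ~+2.1
```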