IIIa. Racing to the Trillion-Dollar Cluster

The most extraordinary techno-capital acceleration has been set in motion. As AI revenue grows rapidly, many trillions of dollars will go into GPU, datacenter, and power buildout before the end of the decade. The industrial mobilization, including growing US electricity production by 10s of percent, will be intense. 

You see, I told you it couldn’t be done without turning the whole country into a factory. You have done just that.

Niels Bohr (to Edward Teller, upon learning of the scale of the Manhattan Project in 1944)
The trillion-dollar cluster. Credit: DALLE. 

The race to AGI won’t just play out in code and behind laptops—it’ll be a race to mobilize America’s industrial might. Unlike anything else we’ve recently seen come out of Silicon Valley, AI is a massive industrial process: each new model requires a giant new cluster, soon giant new power plants, and eventually giant new chip fabs. The investments involved are staggering. But behind the scenes, they are already in motion.

In this post, I’ll walk you through numbers to give you a sense of what this will mean:

  • As revenue from AI products grows rapidly—plausibly hitting a $100B annual run rate for companies like Google or Microsoft by ~2026, with powerful but pre-AGI systems—that will motivate ever-greater capital mobilization, and total AI investment could be north of $1T annually by 2027.
  • We’re on the path to individual training clusters costing $100s of billions by 2028—clusters requiring power equivalent to a small/medium US state and more expensive than the International Space Station. 
  • By the end of the decade, we are headed to $1T+ individual training clusters, requiring power equivalent to >20% of US electricity production. Trillions of dollars of capex will churn out 100s of millions of GPUs per year overall.

Nvidia shocked the world as its datacenter sales exploded from about $14B annualized to about $90B annualized in the last year. But that’s still just the very beginning. 

Training compute

Earlier, we found a trend of roughly 0.5 OOMs/year growth in AI training compute.[1] If this trend were to continue for the rest of the decade, what would that mean for the largest training clusters?

Year  | OOMs           | # of H100s-equivalent | Cost              | Power   | Power reference class
2022  | ~GPT-4 cluster | ~10k                  | ~$500M            | ~10 MW  | ~10,000 average homes
~2024 | +1 OOM         | ~100k                 | $billions         | ~100 MW | ~100,000 homes
~2026 | +2 OOMs        | ~1M                   | $10s of billions  | ~1 GW   | The Hoover Dam, or a large nuclear reactor
~2028 | +3 OOMs        | ~10M                  | $100s of billions | ~10 GW  | A small/medium US state
~2030 | +4 OOMs        | ~100M                 | $1T+              | ~100 GW | >20% of US electricity production
Scaling the largest training clusters, rough back-of-the-envelope calculations.

The OpenAI GPT-4 tech report stated that GPT-4 finished training in August 2022. Thereafter we play forward the rough ~0.5 OOMs/year trend.


Semianalysis, JP Morgan, and others estimate GPT-4 was trained on ~25k A100s, and H100s are 2-3x the performance of A100s.


Often people cite numbers like “$100M for GPT-4 training,” using just the rental cost of the GPUs (i.e., something like “how much would it cost to rent this size cluster for 3 months of training”). But that’s a mistake. What matters is something more like ~the actual cost to build the cluster. If you want one of the largest clusters in the world, you can’t just rent it for 3 months! And moreover, you need the compute for more than just the flagship training run: there will be lots of derisking experiments, failed runs, other models, etc.

To approximate the GPT-4 cluster cost:

  • Public estimates put the GPT-4 cluster at around 25k A100s.
  • Assuming $1/A100-hour for 2-3 years gives roughly a $500M cost.
  • Alternatively, you can estimate it as ~$25k cost per H100, 10k H100s-equivalent, and Nvidia GPUs being around ~half the cost of a cluster (the rest being power, the physical datacenter, cooling, networking, maintenance personnel, etc.).
    • (For example, this total-cost-of-ownership analysis estimates that around 40% of a large cluster’s cost is the H100 GPUs themselves, and another 13% goes to Nvidia for Infiniband networking. That said, excluding cost of capital in that calculation would mean the GPUs are about 50% of the cost, and with networking Nvidia gets a bit over 60% of the cost of the cluster.)
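The two estimates above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the list; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def rental_based_cost(num_gpus, dollars_per_gpu_hour, years):
    """Cost of keeping `num_gpus` busy at an hourly rate for `years` years."""
    return num_gpus * dollars_per_gpu_hour * HOURS_PER_YEAR * years

def build_based_cost(h100_equivalents, dollars_per_h100, gpu_share_of_cost):
    """Cluster cost when GPUs make up `gpu_share_of_cost` of the total."""
    return h100_equivalents * dollars_per_h100 / gpu_share_of_cost

# Method 1: 25k A100s at $1/A100-hour for 2-3 years.
low = rental_based_cost(25_000, 1.0, 2)
high = rental_based_cost(25_000, 1.0, 3)

# Method 2: 10k H100-equivalents at ~$25k each, GPUs ~half of cluster cost.
build = build_based_cost(10_000, 25_000, 0.5)

print(f"Method 1: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
print(f"Method 2: ${build / 1e6:.0f}M")
```

Both methods land in the same few-hundred-million-dollar range, which is where the rough $500M figure comes from.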

FLOP/$ is improving somewhat for each Nvidia generation, but not a ton. E.g., the H100 -> B100 is likely something like 1.5x FLOP/$: B100s are effectively two H100s stapled together, but retailing for <2x the cost. But the B100 was somewhat of an exception. They are surprisingly cheap, likely because Nvidia wants to crush competition. By contrast, A100s -> H100s weren’t much of a FLOP/$ improvement (2x better chip without fp8, roughly 2x the cost), maybe 1.5x if we count fp8 improvements—and that was for a two-year generation.

While I think there are some tailwinds to further FLOP/$ improvements from margin compression, GPUs might also get more expensive as they become massively constrained. Gains from AI chip specialization will continue, but it’s not clear to me there will still be game-changing technical improvements to FLOP/$ coming, given chips are already pretty specialized for AI (e.g. specialized for Transformers, and already at fp8/fp4 precision), Moore’s Law is glacial these days, and other bottleneck components like memory and interconnect are improving more slowly. If you look at Epoch’s data, there seems to be less than a 10x in FLOP/$ over the past decade for top ML GPUs, and for the aforementioned reasons if anything I’d expect this to slow down.

Something like a 35%/year improvement in FLOP/$ would give us the 1T cost for the +4 OOM cluster. Maybe FLOP/$ improves faster, but also datacenter capex is going to get more expensive—simply because you’ll need to actually build new power, doing a lot of capex up front, rather than just renting existing depreciated power plants.

These are just very rough numbers anyway. It would be very much within error bars if e.g. the 1T cluster can be done more efficiently and actually yields more like +4.5 OOMs on compute.
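As a rough check on the FLOP/$ point, here is a minimal sketch. It reuses the ~$25k-per-H100 and GPUs-are-about-half-the-cost figures from above. How many years the 35%/year improvement compounds over is an assumption on my part, since the text does not pin it down, so two values are shown.

```python
def cluster_cost(h100_equivalents, cost_per_h100e_today,
                 flop_per_dollar_growth, years):
    """Cluster cost after FLOP/$ improves by a fixed rate for `years` years."""
    improvement = (1 + flop_per_dollar_growth) ** years
    return h100_equivalents * cost_per_h100e_today / improvement

# ~$25k per H100, with GPUs about half of all-in cluster cost -> ~$50k each.
COST_PER_H100E = 25_000 / 0.5

H100E_2030 = 100e6  # the +4 OOM cluster, ~100M H100-equivalents

at_todays_prices = cluster_cost(H100E_2030, COST_PER_H100E, 0.0, 0)
six_years = cluster_cost(H100E_2030, COST_PER_H100E, 0.35, 6)    # assumed span
eight_years = cluster_cost(H100E_2030, COST_PER_H100E, 0.35, 8)  # assumed span

for label, cost in [("no FLOP/$ improvement", at_todays_prices),
                    ("35%/yr for 6 years", six_years),
                    ("35%/yr for 8 years", eight_years)]:
    print(f"{label}: ${cost / 1e12:.2f}T")
```

Without any FLOP/$ improvement the cluster would cost several trillion dollars. A 35%/year improvement brings it to roughly the trillion-dollar range, though the exact figure depends on the assumed time span.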


Power requirements
An H100 is 700W, but there’s a bunch of datacenter power you need (cooling, networking, storage); Semianalysis estimates ~1,400W per H100.

There are some gains to be had on FLOP/Watt, though once we’ve exhausted the gains from AI chip specialization (see previous footnote), e.g. gone down to the lowest possible precision, these seem somewhat limited (mostly just chip process improvements, which are slow). That said, as power becomes more of a constraint (and thus a larger fraction of costs), chip designs might specialize toward power efficiency at the cost of FLOPs. Even so, there is still the power demand for cooling, networking, storage, and so on (which in the H100 numbers above was already roughly half the power demand).

For these numbers, let’s work with 1kW per H100-equivalent; again, these are just rough back-of-the-envelope calculations. (And if there were an unexpected FLOP/Watt breakthrough, I’d expect the same power expenditure, just bigger OOM compute gains.)
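To see how the H100-equivalent and power columns of the scaling table follow from these assumptions, here is a minimal sketch. It plays forward the ~0.5 OOMs/year trend from the ~10k H100-equivalent 2022 cluster at 1kW per H100-equivalent. The function and parameter names are illustrative only.

```python
def project_cluster(year, base_year=2022, base_h100e=10_000,
                    ooms_per_year=0.5, kw_per_h100e=1.0):
    """Return (OOMs over the base cluster, H100-equivalents, power in GW)."""
    ooms = (year - base_year) * ooms_per_year
    h100e = base_h100e * 10 ** ooms
    power_gw = h100e * kw_per_h100e / 1e6  # kW -> GW
    return ooms, h100e, power_gw

for year in (2022, 2024, 2026, 2028, 2030):
    ooms, h100e, power_gw = project_cluster(year)
    print(f"{year}: +{ooms:.0f} OOMs, ~{h100e:,.0f} H100e, ~{power_gw:g} GW")
```

Changing `kw_per_h100e` to the ~1.4kW Semianalysis figure scales every power number up by 40% without changing the overall picture.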


Power reference classes
A 10GW cluster run continuously for a year uses 87.6 TWh. By comparison, Oregon consumes about 27 TWh of electricity annually, and Washington state about 92 TWh.

A 100GW cluster run continuously for a year is 876 TWh, while total annual US electricity production is about 4,250 TWh.
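The conversion from continuous power draw to annual energy is simple enough to check directly. A small sketch, using only the figures quoted above:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_twh(power_gw):
    """Energy in TWh used by a load of `power_gw` running all year."""
    return power_gw * HOURS_PER_YEAR / 1000  # GWh -> TWh

US_ANNUAL_PRODUCTION_TWH = 4_250  # figure quoted in the text

twh_10gw = annual_twh(10)
twh_100gw = annual_twh(100)
share_100gw = twh_100gw / US_ANNUAL_PRODUCTION_TWH

print(f"10 GW  -> {twh_10gw:.1f} TWh/year")
print(f"100 GW -> {twh_100gw:.0f} TWh/year ({share_100gw:.1%} of US production)")
```

The 100GW case comes out just above one fifth of US production, which is the basis for the ">20%" entry in the table.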

This may seem hard to believe—but it appears to be happening. Zuck bought 350k H100s. Amazon bought a 1GW datacenter campus next to a nuclear power plant. Rumors suggest a 1GW, 1.4M H100-equivalent cluster (~2026-cluster) is being built in Kuwait. Media report that Microsoft and OpenAI are rumored to be working on a $100B cluster, slated for 2028 (a cost comparable to the International Space Station!). And as each generation of models shocks the world, further acceleration may yet be in store. 

Perhaps the wildest part is that willingness-to-spend doesn’t even seem to be the binding constraint at the moment, at least for training clusters. It’s finding the infrastructure itself: “Where do I find 10GW?” (power for the $100B+, trend 2028 cluster) is a favorite topic of conversation in SF. What any compute guy is thinking about is securing power, land, permitting, and datacenter construction.[2] While it may take you a year of waiting to get the GPUs, the lead times for these are much longer still.

The trillion-dollar cluster—+4 OOMs from the GPT-4 cluster, the ~2030 training cluster on the current trend—will be a truly extraordinary effort. The 100GW of power it’ll require is equivalent to >20% of US electricity production; imagine not just a simple warehouse with GPUs, but hundreds of power plants. Perhaps it will take a national consortium. 

(Note that I think it’s pretty likely we’ll only need a ~$100B cluster, or less, for AGI. The $1T cluster might be what we’ll train and run superintelligence on, or what we’ll use for AGI if AGI is harder than expected. In any case, in a post-AGI world, having the most compute will probably still really matter.)

Overall compute

The above are just rough numbers for the largest training clusters. Overall investment is likely to be much larger still: a large fraction of GPUs will probably be used for inference[3] (GPUs to actually run the AI systems for products), and there could be multiple players with giant clusters in the race.

My rough estimate is that 2024 will already feature $100B-$200B of AI investment:

  • Nvidia datacenter revenue will hit a ~$25B/quarter run rate soon, i.e. ~$100B of capex flowing via Nvidia alone. But of course, Nvidia isn’t the only player (Google’s TPUs are great too!), and close to half of datacenter capex is on things other than the chips (site, building, cooling, power, etc.)
Quarterly Nvidia datacenter revenue. Plot by Thomas Woodside
  • Big tech has been dramatically ramping their capex numbers: Microsoft and Google will likely do $50B+,[5] AWS and Meta $40B+, in capex this year. Not all of this is AI, but combined their capex will have grown $50B-100B year-over-year because of the AI boom, and even then they are still cutting back on other capex to shift even more spending to AI. Moreover, other cloud providers, companies (e.g., Tesla is spending $10B on AI this year), and nation-states are investing in AI as well.
Big tech capex is growing extremely rapidly since ChatGPT unleashed the AI boom. Graphic source

Let’s play this forward. My best guess is overall compute investments will grow more slowly than the 3x/year largest training clusters, let’s say 2x/year.


Year  | Annual investment | AI accelerator shipments (in H100s-equivalent) | Power as % of US electricity production | Chips as % of current leading-edge TSMC wafer production
2024  | ~$150B            | ~5-10M                                         |                                         |
~2026 | ~$500B            | ~10s of millions                               | 5%                                      | ~25%
~2028 | ~$2T              | ~100M                                          | 20%                                     | ~100%
~2030 | ~$8T              | Many 100s of millions                          | 100%                                    | 4x current capacity
Playing forward trends on total world AI investment. Rough back-of-the-envelope calculation. 
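The investment column is just the 2x/year assumption compounded from the 2024 estimate. A minimal sketch, taking ~$150B as the 2024 base (the midpoint of the $100B-$200B estimate above):

```python
def project_investment(base_year, base_usd, growth_per_year, years):
    """Compound a base-year investment figure forward at a fixed yearly multiple."""
    return {y: base_usd * growth_per_year ** (y - base_year) for y in years}

projection = project_investment(2024, 150e9, 2.0, [2024, 2026, 2028, 2030])

for year, usd in projection.items():
    print(f"{year}: ~${usd / 1e9:,.0f}B")
```

Exact doubling from $150B gives slightly higher values than the rounded table entries (~$500B, ~$2T, ~$8T), which correspond to a base nearer $125B. At this level of precision the difference is well within the error bars.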

And these aren’t just my idiosyncratic numbers. AMD forecasted a $400B AI accelerator market by 2027, implying $700B+ of total AI spending, pretty close to my numbers (and they are surely much less “AGI-pilled” than I am). Sam Altman is reported to be in talks to raise funds for a project of “up to $7T” in capex to build out AI compute capacity (the number was widely mocked, but it seems less crazy if you run the numbers here…). One way or another, this massive scaleup is happening. 

Will it be done? Can it be done?

The scale of investment postulated here may seem fantastical. But both the demand-side and the supply-side seem like they could support the above trajectory. The economic returns justify the investment, the scale of expenditures is not unprecedented for a new general-purpose technology, and the industrial mobilization for power and chips is doable.

AI revenue

Companies will make large AI investments if they expect the economic returns to justify it. 

Reports suggest OpenAI was at a $1B revenue run rate in August 2023, and a $2B revenue run rate in February 2024. That’s roughly a doubling every 6 months. If that trend holds, we should see a ~$10B annual run rate by late 2024/early 2025, even without pricing in a massive surge from any next-generation model. One estimate puts Microsoft at ~$5B of incremental AI revenue already. 

So far, every 10x scaleup in AI investment seems to yield the necessary returns. GPT-3.5 unleashed the ChatGPT mania. The estimated $500M cost for the GPT-4 cluster would have been paid off by the reported billions of annual revenue for Microsoft and OpenAI (see above calculations), and a “2024-class” training cluster in the billions will easily pay off if Microsoft/OpenAI AI revenue continues on track to a $10B+ revenue run rate. The boom is investment-led: it takes time from a huge order of GPUs to build the clusters, build the models, and roll them out, and the clusters being planned today are many years out. But if the returns on the last GPU order keep materializing, investment will continue to skyrocket (and outpace revenue), plowing in even more capital in a bet that the next 10x will keep paying off. 

A key milestone for AI revenue that I like to think about is: when will a big tech company (Google, Microsoft, Meta, etc.) hit a $100B revenue run rate from AI (products and API)? These companies have on the order of $100B-$300B of revenue today; $100B would thus start representing a very substantial fraction of their business. Very naively extrapolating out the doubling every 6 months, supposing we hit a $10B revenue run rate in early 2025, suggests this would happen mid-2026. 
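The naive extrapolation is a one-line calculation. A small sketch, assuming a constant 6-month doubling time (the helper name is illustrative):

```python
import math

def years_to_reach(start_run_rate, target_run_rate, doubling_time_years=0.5):
    """Years needed to grow from start to target at a fixed doubling time."""
    doublings = math.log2(target_run_rate / start_run_rate)
    return doublings * doubling_time_years

# From a $10B run rate to a $100B run rate.
years_needed = years_to_reach(10e9, 100e9)
print(f"{years_needed:.2f} years")
```

A 10x increase takes a little over three doublings, or roughly 1.7 years. Counting from a $10B run rate reached in late 2024 or early 2025, that puts the $100B milestone sometime in 2026.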

That may seem like a stretch, but it seems to me to require surprisingly little imagination to reach that milestone. For example, there are around 350 million paid subscribers to Microsoft Office—could you get a third of these to be willing to pay $100/month for an AI add-on? For an average worker, that’s only a few hours a month of productivity gained; models powerful enough to make that justifiable seem very doable in the next couple years.
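The Office example works out as follows, using only the numbers in the paragraph above:

```python
office_subscribers = 350e6    # paid Microsoft Office subscribers
adoption_share = 1 / 3        # fraction willing to pay for the add-on
price_per_month_usd = 100     # hypothetical AI add-on price

annual_revenue = office_subscribers * adoption_share * price_per_month_usd * 12
print(f"~${annual_revenue / 1e9:.0f}B per year")
```

That single product line would already clear the $100B run-rate milestone.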


It’s hard to overstate the ensuing reverberations. This would make AI products the biggest revenue driver for America’s largest corporations, and by far their biggest area of growth. Forecasts of overall revenue growth for these companies would skyrocket. Stock markets would follow; we might see our first $10T company soon thereafter. Big tech at this point would be willing to go all out, each investing many hundreds of billions (at least) into further AI scaleout. We would probably see our first many-hundred-billion-dollar corporate bond sale then.


Beyond $100B, it gets harder to see the contours. But if we are truly on the path to AGI, the returns will be there. White-collar workers are paid tens of trillions of dollars in wages annually worldwide; a drop-in remote worker that automates even a fraction of white-collar/cognitive jobs (imagine, say, a truly automated AI coder) would pay for the trillion-dollar cluster. If nothing else, the national security import could well motivate a government project, bundling the nation’s resources in the race to AGI (more later). 

Historical precedents

$1T/year of total AI investment by 2027 seems outrageous. But it’s worth taking a look at other historical reference classes:

  • In their peak years of funding, the Manhattan and Apollo programs reached 0.4% of GDP, or ~$100 billion annually today (surprisingly small!). At $1T/year, AI investment would be about 3% of GDP. 
  • Between 1996–2001, telecoms invested nearly $1 trillion in today’s dollars in building out internet infrastructure.
  • From 1841 to 1850, private British railway investments totaled a cumulative ~40% of British GDP at the time. A similar fraction of US GDP would be equivalent to ~$11T over a decade. 
  • Many trillions are being spent on the green transition. 
  • Rapidly-growing economies often spend a high fraction of their GDP on investment; for example, China has spent more than 40% of its GDP on investment for two decades (equivalent to $11T annually given US GDP).
  • In the historically most exigent national security circumstances—wartime—borrowing to finance the national effort has often comprised enormous fractions of GDP. During WWI, the UK, France, and Germany borrowed over 100% of their GDPs while the US borrowed over 20%; during WWII, the UK and Japan borrowed over 100% of their GDPs while the US borrowed over 60% of GDP (equivalent to over $17T today).

$1T/year of total AI investment by 2027 would be dramatic—among the very largest capital buildouts ever—but would not be unprecedented. And a trillion-dollar individual training cluster by the end of the decade seems on the table.


Power

Probably the single biggest constraint on the supply-side will be power. Already, at nearer-term scales (1GW/2026 and especially 10GW/2028), power has become the binding constraint: there simply isn’t much spare capacity, and power contracts are usually long-term locked-in. And building, say, a new gigawatt-class nuclear power plant takes a decade. (I wonder when we’ll start seeing things like tech companies buying aluminum smelting companies for their gigawatt-class power contracts.)

Comparing trends on total US electricity production to our rough back of the envelope estimates on AI electricity demands.

Total US electricity generation has barely grown 5% in the last decade.[14] Utilities are starting to get excited about AI (instead of 2.6% growth over the next 5 years, they now estimate 4.7%!). But they’re barely pricing in what’s coming. The trillion-dollar, 100GW cluster alone would require ~20% of current US electricity generation in 6 years; together with large inference capacity, demand will be multiples higher.

To most, this seems completely out of the question. Some are betting on Middle Eastern autocracies, who have been going around offering boundless power and giant clusters to get their rulers a seat at the AGI-table. 

But it’s totally possible to do this in the United States: we have abundant natural gas.


  • Powering a 10GW cluster would take only a few percent of US natural gas production and could be done rapidly. 
  • Even the 100GW cluster is surprisingly doable.
    • Right now the Marcellus/Utica shale (around Pennsylvania) alone is producing around 36 billion cubic feet a day of gas; that would be enough to generate just under 150GW continuously with generators (and combined cycle power plants could output 250 GW due to their higher efficiency). 
    • It would take about 1,200 new wells for the 100GW cluster.[16] Each rig can drill roughly 3 wells per month, so 40 rigs (the current rig count in the Marcellus) could build up the production base for 100GW in less than a year.[17] The Marcellus had a rig count of ~80 as recently as 2019, so it would not be taxing to add 40 rigs to build up the production base.
    • More generally, US natural gas production has more than doubled in a decade; simply continuing that trend could power multiple trillion-dollar datacenters.
    • The harder part would be building enough generators/turbines; this wouldn’t be trivial, but it seems doable with about $100B of capex[20] for 100GW of natural gas power plants. Combined cycle plants can be built in about two years; the timeline for generators would be shorter still.
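The gas-to-power and drilling figures in the list above can be sanity-checked with a short sketch. The heat content of gas (~1,037 BTU per cubic foot) and the two plant efficiencies (~32% for simple generators, ~55% for combined cycle) are my own assumed round numbers, not values from the text. The production, well, and rig counts are the ones quoted above.

```python
BTU_PER_CUBIC_FOOT = 1_037   # assumed typical heat content of pipeline gas
JOULES_PER_BTU = 1_055.06
SECONDS_PER_DAY = 86_400

def continuous_gw(cubic_feet_per_day, efficiency):
    """Continuous electrical output (GW) from burning a daily gas flow."""
    joules_per_day = cubic_feet_per_day * BTU_PER_CUBIC_FOOT * JOULES_PER_BTU
    thermal_watts = joules_per_day / SECONDS_PER_DAY
    return thermal_watts * efficiency / 1e9

MARCELLUS_CF_PER_DAY = 36e9  # ~36 billion cubic feet/day, from the text

generator_gw = continuous_gw(MARCELLUS_CF_PER_DAY, 0.32)       # assumed efficiency
combined_cycle_gw = continuous_gw(MARCELLUS_CF_PER_DAY, 0.55)  # assumed efficiency

# Drilling timeline: ~1,200 wells, 40 rigs, ~3 wells per rig per month.
months_to_drill = 1_200 / (40 * 3)

print(f"Generators:     ~{generator_gw:.0f} GW")
print(f"Combined cycle: ~{combined_cycle_gw:.0f} GW")
print(f"Drilling time:  ~{months_to_drill:.0f} months")
```

Under these assumptions the results line up with the text: a bit under 150GW from generators, about 250GW from combined cycle plants, and a drilling campaign of under a year.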

The barriers to even trillions of dollars of datacenter buildout in the US are entirely self-made. Well-intentioned but rigid climate commitments (not just by the government, but green datacenter commitments by Microsoft, Google, Amazon, and so on) stand in the way of the obvious, fast solution. At the very least, even if we won’t do natural gas, a broad deregulatory agenda would unlock the solar/batteries/SMR/geothermal megaprojects. Permitting, utility regulation, FERC regulation of transmission lines, and NEPA environmental review make things that should take a few years take a decade or more. We don’t have that kind of time.

We’re going to drive the AGI datacenters to the Middle East, under the thumb of brutal, capricious autocrats. I’d prefer clean energy too—but this is simply too important for US national security. We will need a new level of determination to make this happen. The power constraint can, must, and will be solved. 

Chips

While chips are usually what comes to mind when people think about AI-supply-constraints, they’re likely a smaller constraint than power. Global production of AI chips is still a pretty small percent of TSMC-leading-edge production, likely less than 10%. There’s a lot of room to grow via AI becoming a larger share of TSMC production.

Indeed, 2024 production of AI chips (~5-10M H100-equivalents) would already be almost enough for the $100s-of-billions cluster (if they were all diverted to one cluster). From a pure logic fab standpoint, ~100% of TSMC’s output for a year could already support the trillion-dollar cluster (again, if all the chips went to one datacenter).[22] Of course, not all of TSMC will be able to be diverted to AI, and not all of AI chip production for a year will be for one training cluster. Total AI chip demand (including inference and multiple players) by 2030 will be a multiple of TSMC’s current total leading-edge logic chip capacity, just for AI. TSMC ~doubled[23] in the past 5 years; they’d likely need to go at least twice as fast on their pace of expansion to meet AI chip demand. Massive new fab investments would be necessary.

Even if raw logic fabs won’t be the constraint, chip-on-wafer-on-substrate (CoWoS) advanced packaging (connecting chips to memory, also made by TSMC, Intel, and others) and HBM memory (for which demand is enormous) are already key bottlenecks for the current AI GPU scaleup; these are more specialized to AI, unlike the pure logic chips, so there’s less pre-existing capacity. In the near term, these will be the primary constraint on churning out more GPUs, and these will be the huge constraints as AI scales. Still, these are comparatively “easy” to scale; it’s been incredible watching TSMC literally build “greenfield” fabs (i.e. entirely new facilities from scratch) to massively scale up CoWoS production this year (and Nvidia is even starting to find CoWoS alternatives to work around the shortage). 

A new TSMC Gigafab (a technological marvel) costs around $20B in capex and produces 100k wafer-starts a month. For hundreds of millions of AI GPUs a year by the end of the decade, TSMC would need to build dozens of these—as well as a huge buildout for memory, advanced packaging, networking, etc., which will be a major fraction of capex. It could add up to over $1T of capex. It will be intense, but doable. (Perhaps the biggest roadblock will not be feasibility, but TSMC not even trying—TSMC does not yet seem AI-scaling-pilled! They think AI will “only” grow at a glacial 50% CAGR.)

Recent USG efforts like the CHIPS Act have been trying to onshore more AI chip production to the US (as insurance in case of the Taiwan contingency). While onshoring more of AI chip production to the US would be nice, it’s less critical than having the actual datacenter (on which the AGI lives) in the US. If having chip production abroad is like having uranium deposits abroad, having the AGI datacenter abroad is like having the literal nukes be built and stored abroad. Given the dysfunction and cost we’ve seen from building fabs in the US in practice, my guess is we should prioritize datacenters in the US while betting more heavily on democratic allies like Japan and South Korea for fab projects—fab buildouts there seem much more functional.

The Clusters of Democracy

Before the decade is out, many trillions of dollars of compute clusters will have been built. The only question is whether they will be built in America. Some are rumored to be betting on building them elsewhere, especially in the Middle East. Do we really want the infrastructure for the Manhattan Project to be controlled by some capricious Middle Eastern dictatorship?

The clusters that are being planned today may well be the clusters AGI and superintelligence are trained and run on, not just the “cool-big-tech-product clusters.” The national interest demands that these are built in America (or close democratic allies). Anything else creates an irreversible security risk: it risks the AGI weights getting stolen[24] (and perhaps being shipped to China) (more later); it risks these dictatorships physically seizing the datacenters (to build and run AGI themselves) when the AGI race gets hot; or even if these threats are only wielded implicitly, it puts AGI and superintelligence at the whims of unsavory dictators. America sorely regretted her energy dependence on the Middle East in the 70s, and we worked so hard to get out from under their thumbs. We cannot make the same mistake again.

The clusters can be built in the US, and we have to get our act together to make sure it happens in the US. American national security must come first, before the allure of free-flowing Middle Eastern cash, arcane regulation, or even, yes, admirable climate commitments. We face a real system competition—can the requisite industrial mobilization only be done in “top-down” autocracies? If American business is unshackled, America can build like none other (at least in red states). Being willing to use natural gas, or at the very least a broad-based deregulatory agenda—NEPA exemptions, fixing FERC and transmission permitting at the federal level, overriding utility regulation, using federal authorities to unlock land and rights of way—is a national security priority. 

In any case—the exponential is in full swing now. 

In the “old days,” when AGI was still a dirty word, some colleagues and I used to make theoretical economic models of what the path to AGI might look like. One feature of these models used to be a hypothetical “AI wakeup” moment, when the world started realizing how powerful these models could be and began rapidly ramping up their investments—culminating in multiple % of GDP towards the largest training runs. 

It seemed far-off then, but that time has come. 2023 was “AI wakeup.”[25] Behind the scenes, the most staggering techno-capital acceleration has been put into motion.

Brace for the G-forces.

Next post in series:
IIIb. Lock Down the Labs: Security for AGI

(What all of this means for NVDA/TSM/etc. I leave as an exercise for the reader. Hint: Those with situational awareness bought much lower than you, but it’s still not even close to fully priced in.)

  1. As mentioned,  OOM = order of magnitude, 10x = 1 order of magnitude↩

  2. One key uncertainty is how distributed training will be—if instead of needing that amount of power in one location, we could spread it among 100 locations, it’d be a lot easier.↩

  3. See, for example, Zuck here; only ~45k of his H100s are in his largest training clusters, with the vast majority of his 350k H100s used for inference. Meta likely has heavier inference needs than other players, who serve fewer customers so far, but as everyone else’s AI products scale, I expect inference to become a strong majority of the GPUs.↩

  4. For example, this total-cost-of-ownership analysis estimates that around 40% of a large cluster’s cost is the H100 GPUs themselves, and another 13% goes to Nvidia for InfiniBand networking. That said, excluding cost of capital in that calculation would mean the GPUs are about 50% of the cost, and with networking would mean Nvidia gets a bit over 60% of the cost of the cluster.↩

  5. And apparently, despite Microsoft growing capex by 79% compared to a year ago in a recent quarter, their AI cloud demand still exceeds supply!↩

  6. A larger fraction of global GPU production will probably be going to the largest training cluster in the future than today, e.g. because of a consolidation to just a few leading labs, rather than many companies having frontier-model-scale clusters.↩

  7. Of course, not all of this will be in the US, but to give a reference class.↩

  8. I estimate Nvidia is going to ship on the order of 5M datacenter GPUs in 2024. A minority of those are B100s, which we’ll count as 2x+ H100s. Then there’s the other AI chips: TPUs, Trainium, Meta’s custom silicon, AMD GPUs, etc.↩

  9. TSMC has capacity for over 150k 5nm wafers per month, is ramping to 100k 3nm wafers per month, and likely another 150k or so 7nm wafers per month; let’s call it 400k wafers per month in total. 

    Let’s say roughly 35 H100s per wafer (H100s are made on 5nm). At 5-10 million H100-equivalents in 2024, that’s 150k-300k wafers per year for annual AI chip production in 2024. 

    Depending on where in that range and whether we want to count 7nm production, that’s about 3-10% of annual leading-edge wafer production.↩

  10. A big uncertainty for me is what the lags are for the technology to diffuse and be adopted. I think it’s plausible revenue is slowed because intermediate, pre-AGI models take a lot of “schlep” to properly integrate into company workflows; historically, it’s taken a while to fully harvest the productivity gains from new general purpose technologies. This is where the “sonic boom” discussion earlier comes in: as we “unhobble” models and they start looking more like agents/drop-in remote workers, deploying them becomes much easier. Rather than having to completely remake some workflow to harvest a 25% productivity gain from a GPT-chatbot, you’ll get models that you can onboard and work with as you would a new coworker (e.g., just directly substitute for an engineer, rather than needing to train up engineers to use some new tool). Or, in the extreme, and later on: you won’t need to completely redesign a factory to work with some new tool, you’ll just bring in the humanoid robots.

    That said, this may lead to some discontinuity in economic value and revenue generated, depending on how quickly we can “unhobble” models.↩

  11. What will happen to interest rates will be interesting… see Tyler Cowen here; Chow, Mazlish and Halperin here.↩

  12. And, farther out, but if AGI truly led to substantial increases in economic growth, $10T+ annually would start being plausible—the reference class being investment rates of countries during high-growth periods.↩

  13.  “Since 2011, the Alouette Smelter uses 930 MW electricity at maximum production capacity.”↩

  14. That said, this is “net-new” capacity: some of that is building new renewables and taking old fossil fuel plants off the grid. Maybe it’s closer to a percent or two a year of gross-new capacity.↩

  15. Thanks to Austin Vernon (private correspondence) for helping with these estimates.↩

  16. New wells produce around 0.01 BCF per day.↩

  17.  Each well produces ~20 BCF over its lifetime, meaning two new wells a month would replace the depleted reserves, i.e. it would need only one rig to maintain the production.↩

  18. Though it would be more efficient to add fewer rigs and build up over a longer time frame than 10 months.↩

  19. A cubic foot of natural gas generates about 0.13 kWh. Shale gas production in the US was about 70 billion cubic feet per day in 2020. Suppose we doubled production again, and the extra capacity all went to compute clusters. That’s 3322 TWh/year of electricity, or enough for almost four 100GW clusters running continuously (876 TWh/year each).↩

  20. The capex costs for natural gas power plants seem to be under $1000 per kW, meaning the capex for 100GW of natural gas power plants would be about $100 billion.↩

  21. Solar and batteries aren’t a totally crazy alternative, but it does just seem rougher than natural gas. I did appreciate Casey Handmer’s calculation of tiling the Earth in solar panels:
    “With current GPUs, the global solar datacenter’s compute is equivalent to ~150 billion humans, though if our computers can eventually match [human brain] efficiency, we could support more like 5 quadrillion AI souls.”↩

  22.  This poses the interesting question of why power requirements are going up so much before chip fab production starts being really constrained. A simple answer is while datacenters run continuously at close to max power, most chips currently produced are idle a lot of the time. Currently, smartphones are close to half of leading chip demand, but use a lot less energy per wafer area (trading transistors for serial operations and energy efficiency) and have low utilization as smartphones are mostly idle. The AI revolution means working our transistors way harder, dedicating them all to constantly-running, high-performance AI datacenters instead of idle, battery-powered/energy-saving devices. HT Carl Shulman for this point.↩

  23. (Using revenue as a proxy.)↩

  24.  It’s a lot easier to do side-channel attacks to exfiltrate weights with physical access!↩

  25. I distinctly remember writing “THE TAKEOFF HAS STARTED” on my whiteboard in March of 2023.↩

  26. Mainstream sell-side analysts seem to assume only 10-20% year-over-year growth in Nvidia revenue from CY24 to CY25, maybe $120B-$130B in CY25 (or at least did until very recently). Insane! It’s been pretty obvious for a while that Nvidia is going to do over $200B of revenue in CY25.↩
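
For readers who want to check the back-of-envelope arithmetic in footnotes 9, 19, and 20, here is a rough Python sketch. The inputs are just the approximate figures quoted in those footnotes; the function names and the choice to model a cluster as running flat-out all year are my own illustrative assumptions, not anything from the underlying sources.

```python
# Back-of-envelope checks for footnotes 9, 19, and 20.
# Inputs are the rough figures quoted in the footnotes; all names are illustrative.

HOURS_PER_YEAR = 24 * 365  # 8760


def wafer_share(h100_equivalents, chips_per_wafer=35, wafers_per_month=400_000):
    """Footnote 9: fraction of annual wafer capacity needed for that many chips."""
    wafers_needed = h100_equivalents / chips_per_wafer
    return wafers_needed / (wafers_per_month * 12)


def gas_to_twh_per_year(cubic_feet_per_day, kwh_per_cubic_foot=0.13):
    """Footnote 19: annual electricity (TWh) from a daily natural gas flow."""
    return cubic_feet_per_day * kwh_per_cubic_foot * 365 / 1e9


def cluster_twh_per_year(gigawatts):
    """Annual energy (TWh) of a cluster, assuming it runs continuously at full power."""
    return gigawatts * HOURS_PER_YEAR / 1000


def gas_plant_capex_usd(gigawatts, usd_per_kw=1000):
    """Footnote 20: capex for gas plants at a given cost per kW (1 GW = 1e6 kW)."""
    return gigawatts * 1e6 * usd_per_kw


if __name__ == "__main__":
    # Footnote 9: low end counts 7nm capacity, high end excludes it (250k wafers/month).
    print(f"wafer share, low:  {wafer_share(5e6):.1%}")
    print(f"wafer share, high: {wafer_share(10e6, wafers_per_month=250_000):.1%}")

    # Footnote 19: an extra 70 billion cubic feet per day, all going to clusters.
    extra_twh = gas_to_twh_per_year(70e9)
    print(f"extra electricity: {extra_twh:.0f} TWh/year")
    print(f"100GW clusters supported: {extra_twh / cluster_twh_per_year(100):.2f}")

    # Footnote 20: 100GW of gas plants at $1000 per kW.
    print(f"gas plant capex: ${gas_plant_capex_usd(100) / 1e9:.0f}B")
```

The wafer share lands at roughly 3% on the low end and a bit under 10% on the high end, matching the range in footnote 9; the spread comes mostly from whether 7nm capacity counts in the denominator.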