IIIa. Racing to the Trillion-Dollar Cluster

The most extraordinary techno-capital acceleration has been set in motion. As AI revenue grows rapidly, many trillions of dollars will go into GPU, datacenter, and power buildout before the end of the decade. The industrial mobilization, including growing US electricity production by 10s of percent, will be intense. 


You see, I told you it couldn’t be done without turning the whole country into a factory. You have done just that.

Niels Bohr (to Edward Teller, upon learning of the scale of the Manhattan Project in 1944)
The trillion-dollar cluster. Credit: DALLE. 

The race to AGI won’t just play out in code and behind laptops—it’ll be a race to mobilize America’s industrial might. Unlike anything else we’ve recently seen come out of Silicon Valley, AI is a massive industrial process: each new model requires a giant new cluster, soon giant new power plants, and eventually giant new chip fabs. The investments involved are staggering. But behind the scenes, they are already in motion.

In this post, I’ll walk you through numbers to give you a sense of what this will mean:

  • As revenue from AI products grows rapidly—plausibly hitting a $100B annual run rate for companies like Google or Microsoft by ~2026, with powerful but pre-AGI systems—that will motivate ever-greater capital mobilization, and total AI investment could be north of $1T annually by 2027.
  • We’re on the path to individual training clusters costing $100s of billions by 2028—clusters requiring power equivalent to a small/medium US state and more expensive than the International Space Station. 
  • By the end of the decade, we are headed to $1T+ individual training clusters, requiring power equivalent to >20% of US electricity production. Trillions of dollars of capex will churn out 100s of millions of GPUs per year overall.

Nvidia shocked the world as its datacenter sales exploded from about $14B annualized to about $90B annualized in the last year. But that’s still just the very beginning. 

Training compute

Earlier, we found a roughly ~0.5 OOMs/year trend growth of AI training compute.[1] If this trend were to continue for the rest of the decade, what would that mean for the largest training clusters?

| Year  | OOMs           | # of H100s-equivalent | Cost              | Power   | Power reference class                      |
|-------|----------------|-----------------------|-------------------|---------|--------------------------------------------|
| 2022  | ~GPT-4 cluster | ~10k                  | ~$500M            | ~10 MW  | ~10,000 average homes                      |
| ~2024 | +1 OOM         | ~100k                 | $billions         | ~100 MW | ~100,000 homes                             |
| ~2026 | +2 OOMs        | ~1M                   | $10s of billions  | ~1 GW   | The Hoover Dam, or a large nuclear reactor |
| ~2028 | +3 OOMs        | ~10M                  | $100s of billions | ~10 GW  | A small/medium US state                    |
| ~2030 | +4 OOMs        | ~100M                 | $1T+              | ~100 GW | >20% of US electricity production          |

Scaling the largest training clusters, rough back-of-the-envelope calculations.
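To make the trend arithmetic concrete, here is a minimal sketch (the only inputs are the ~10k-H100e GPT-4 baseline and the ~0.5 OOMs/year trend from above) of how the H100s-equivalent column falls out:

```python
# Minimal sketch: +1 OOM (10x) every two years from the ~10k-H100e GPT-4 cluster of 2022.

base_year, base_h100e = 2022, 10_000

for year in (2022, 2024, 2026, 2028, 2030):
    ooms = 0.5 * (year - base_year)       # ~0.5 OOMs/year trend
    h100e = base_h100e * 10 ** ooms
    print(f"{year}: +{ooms:g} OOMs -> ~{h100e:,.0f} H100-equivalents")
```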

Year
The OpenAI GPT-4 tech report stated that GPT-4 finished training in August 2022. Thereafter we play forward the rough ~0.5 OOMs/year trend.

 

H100s-equivalent
Semianalysis, JP Morgan, and others estimate GPT-4 was trained on ~25k A100s, and H100s are 2-3x the performance of A100s.

 

Cost
Often people cite numbers like “$100M for GPT-4 training,” using just the rental cost of the GPUs (i.e., something like “how much would it cost to rent this size cluster for 3 months of training?”). But that’s a mistake. What matters is something more like the actual cost to build the cluster. If you want one of the largest clusters in the world, you can’t just rent it for 3 months! And moreover, you need the compute for more than just the flagship training run: there will be lots of derisking experiments, failed runs, other models, etc.

To approximate the GPT-4 cluster cost (a small sanity check is sketched after this list):

  • The public estimates suggest the GPT-4 cluster being around 25k A100s.
  • Assuming $1/A100-hour for 2-3 years gives roughly a $500M cost.
  • Alternatively, you can estimate it as ~$25k cost per H100, 10k H100s-equivalent, and Nvidia GPUs being around ~half the cost of a cluster (the rest being power, the physical datacenter, cooling, networking, maintenance personnel, etc.).
    • (For example, this total-cost-of-ownership analysis estimates that around 40% of a large cluster cost is the H100 GPUs themselves, and another 13% goes to Nvidia for Infiniband networking. That said, excluding cost of capital in that calculation would mean the GPUs are about 50% of the cost, and with networking Nvidia gets a bit over 60% of the cost of the cluster.)
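As a quick sanity check, both estimation routes above land near the same ~$500M figure; here is a minimal sketch using only the rough public numbers cited in the bullets:

```python
# Two routes to the ~$500M GPT-4 cluster estimate (illustrative only; all inputs
# are the approximate public figures cited above, not a real cost model).

# Route 1: rental-style cost over the cluster's useful life.
a100_count = 25_000              # estimated GPT-4 cluster size
dollars_per_a100_hour = 1.0
years = 2.3                      # a 2-3 year window
rental_estimate = a100_count * dollars_per_a100_hour * years * 365 * 24
print(f"Rental-style estimate: ~${rental_estimate/1e6:.0f}M")

# Route 2: hardware build-out cost.
h100_equivalents = 10_000        # ~25k A100s, with an H100 at ~2.5x an A100
dollars_per_h100 = 25_000
gpu_share_of_cluster_cost = 0.5  # GPUs are roughly half of total cluster cost
buildout_estimate = h100_equivalents * dollars_per_h100 / gpu_share_of_cluster_cost
print(f"Build-out estimate: ~${buildout_estimate/1e6:.0f}M")
```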

FLOP/$ is improving somewhat for each Nvidia generation, but not a ton. E.g., the H100 -> B100 is likely something like 1.5x FLOP/$: B100s are effectively two H100s stapled together, but retailing for <2x the cost. But the B100 is somewhat of an exception: B100s are surprisingly cheap, likely because Nvidia wants to crush competition. By contrast, A100s -> H100s weren’t much of a FLOP/$ improvement (2x better chip without fp8, roughly 2x the cost), maybe 1.5x if we count fp8 improvements—and that was for a two-year generation.

While I think there are some tailwinds to further FLOP/$ improvements from margin compression, GPUs might also get more expensive as they become massively constrained. Gains from AI chip specialization will continue, but it’s not clear to me there will still be game-changing technical improvements to FLOP/$ coming, given chips are already pretty specialized for AI (e.g. specialized for Transformers, and already at fp8/fp4 precision), Moore’s Law is glacial these days, and other bottleneck components like memory and interconnect are improving more slowly. If you look at Epoch’s data, there seems to be less than a 10x in FLOP/$ over the past decade for top ML GPUs, and for the aforementioned reasons if anything I’d expect this to slow down.

Something like a 35%/year improvement in FLOP/$ would give us the 1T cost for the +4 OOM cluster. Maybe FLOP/$ improves faster, but also datacenter capex is going to get more expensive—simply because you’ll need to actually build new power, doing a lot of capex up front, rather than just renting existing depreciated power plants.

These are just very rough numbers anyway. It would be very much within error bars if e.g. the 1T cluster can be done more efficiently and actually yields more like +4.5 OOMs on compute.

 

Power requirements
An H100 is 700W, but there’s a bunch of datacenter power you need (cooling, networking, storage); Semianalysis estimates ~1,400W per H100.

There are some gains to be had on FLOP/Watt, though once we’ve exhausted the gains from AI chip specialization (see previous footnote), e.g. gone down to the lowest possible precision, these seem somewhat limited (mostly just chip process improvements, which are slow). That said, as power becomes more of a constraint (and thus a larger fraction of costs), chip designs might specialize for more power-efficiency at the cost of FLOPs. Even so, there is still the power demand for cooling, networking, storage, and so on (which in the H100 numbers above was already roughly half the power demand).

For these back of the envelope numbers, let’s work with 1kW per H100-equivalent here; again these are just rough back-of-the-envelope calculations. (And if there were an unexpected FLOP/Watt breakthrough, I’d expect the same power expenditure, just bigger OOM compute gains.)
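A minimal sketch of that arithmetic, applying the ~1kW-per-H100-equivalent assumption to the cluster sizes from the table:

```python
# Back-of-the-envelope cluster power at ~1 kW per H100-equivalent
# (GPU plus its share of cooling, networking, and storage).

KW_PER_H100E = 1.0

clusters = {            # H100-equivalents from the training-cluster table
    "2022 (~GPT-4)": 10_000,
    "~2024": 100_000,
    "~2026": 1_000_000,
    "~2028": 10_000_000,
    "~2030": 100_000_000,
}

for label, h100e in clusters.items():
    megawatts = h100e * KW_PER_H100E / 1_000   # kW -> MW (1,000 MW = 1 GW)
    print(f"{label}: ~{megawatts:,.0f} MW")
```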

 

Power reference classes
A 10GW cluster run continuously for a year is 87.6 TWh. By comparison, Oregon consumes about 27 TWh of electricity annually, while Washington state consumes about 92 TWh annually.

A 100GW cluster run continuously for a year is 876 TWh, while total annual US electricity production is about 4,250 TWh.
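The reference classes follow from simple arithmetic: continuous power times hours in a year, compared against roughly 4,250 TWh of annual US electricity production. A quick sketch:

```python
# Continuous power -> annual energy, as a share of total US electricity production.

US_PRODUCTION_TWH = 4_250
HOURS_PER_YEAR = 8_760

for gigawatts in (10, 100):
    twh = gigawatts * HOURS_PER_YEAR / 1_000   # GW running all year -> TWh
    print(f"{gigawatts} GW: ~{twh:,.1f} TWh/year "
          f"(~{twh / US_PRODUCTION_TWH:.0%} of US production)")
```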

This may seem hard to believe—but it appears to be happening. Zuck bought 350k H100s. Amazon bought a 1GW datacenter campus next to a nuclear power plant. Rumors suggest a 1GW, 1.4M H100-equivalent cluster (~2026-cluster) is being built in Kuwait. Media reports suggest Microsoft and OpenAI are working on a $100B cluster, slated for 2028 (a cost comparable to the International Space Station!). And as each generation of models shocks the world, further acceleration may yet be in store.

Perhaps the wildest part is that willingness-to-spend doesn’t even seem to be the binding constraint at the moment, at least for training clusters. It’s finding the infrastructure itself: “Where do I find 10GW?” (power for the $100B+, trend 2028 cluster) is a favorite topic of conversation in SF. What any compute guy is thinking about is securing power, land, permitting, and datacenter construction.[2] While it may take you a year of waiting to get the GPUs, the lead times for these are much longer still.

The trillion-dollar cluster—+4 OOMs from the GPT-4 cluster, the ~2030 training cluster on the current trend—will be a truly extraordinary effort. The 100GW of power it’ll require is equivalent to >20% of US electricity production; imagine not just a simple warehouse with GPUs, but hundreds of power plants. Perhaps it will take a national consortium. 

(Note that I think it’s pretty likely we’ll only need a ~$100B cluster, or less, for AGI. The $1T cluster might be what we’ll train and run superintelligence on, or what we’ll use for AGI if AGI is harder than expected. In any case, in a post-AGI world, having the most compute will probably still really matter.)

Overall compute

The above are just rough numbers for the largest training clusters. Overall investment is likely to be much larger still: a large fraction of GPUs will probably be used for inference[3] (GPUs to actually run the AI systems for products), and there could be multiple players with giant clusters in the race.

My rough estimate is that 2024 will already feature $100B-$200B of AI investment:

  • Nvidia datacenter revenue will hit a ~$25B/quarter run rate soon, i.e. ~$100B of capex flowing via Nvidia alone. But of course, Nvidia isn’t the only player (Google’s TPUs are great too!), and close to half of datacenter capex is on things other than the chips (site, building, cooling, power, etc.).[4]
Quarterly Nvidia datacenter revenue. Plot by Thomas Woodside
  • Big tech has been dramatically ramping their capex numbers: Microsoft and Google will likely do $50B+,[5] AWS and Meta $40B+, in capex this year. Not all of this is AI, but combined their capex will have grown $50B-$100B year-over-year because of the AI boom, and even then they are still cutting back on other capex to shift even more spending to AI. Moreover, other cloud providers, companies (e.g., Tesla is spending $10B on AI this year), and nation-states are investing in AI as well.
Big tech capex is growing extremely rapidly since ChatGPT unleashed the AI boom. Graphic source.

Let’s play this forward. My best guess is that overall compute investment will grow more slowly than the largest training clusters (which grow ~3x/year); let’s say 2x/year.[6]

| Year  | Annual investment | AI accelerator shipments (in H100s-equivalent) | Power as % of US electricity production[7] | Chips as % of current leading-edge TSMC wafer production |
|-------|-------------------|------------------------------------------------|--------------------------------------------|----------------------------------------------------------|
| 2024  | ~$150B            | ~5-10M[8]                                      | 1-2%                                       | 5-10%[9]                                                 |
| ~2026 | ~$500B            | ~10s of millions                               | 5%                                         | ~25%                                                     |
| ~2028 | ~$2T              | ~100M                                          | 20%                                        | ~100%                                                    |
| ~2030 | ~$8T              | Many 100s of millions                          | 100%                                       | 4x current capacity                                      |

Playing forward trends on total world AI investment. Rough back-of-the-envelope calculation.
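The table above is essentially the 2x/year assumption compounded from the ~$150B 2024 estimate, with the entries rounded; a minimal sketch of that extrapolation:

```python
# Total AI investment under an assumed 2x/year growth rate from a ~$150B 2024 base.
# The table above rounds these rough figures.

base_year, base_investment = 2024, 150e9
growth_per_year = 2.0

for year in (2024, 2026, 2028, 2030):
    investment = base_investment * growth_per_year ** (year - base_year)
    print(f"{year}: ~${investment/1e9:,.0f}B")
```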

And these aren’t just my idiosyncratic numbers. AMD forecasted a $400B AI accelerator market by 2027, implying $700B+ of total AI spending, pretty close to my numbers (and they are surely much less “AGI-pilled” than I am). Sam Altman is reported to be in talks to raise funds for a project of “up to $7T” in capex to build out AI compute capacity (the number was widely mocked, but it seems less crazy if you run the numbers here…). One way or another, this massive scaleup is happening. 

Will it be done? Can it be done?

The scale of investment postulated here may seem fantastical. But both the demand-side and the supply-side seem like they could support the above trajectory. The economic returns justify the investment, the scale of expenditures is not unprecedented for a new general-purpose technology, and the industrial mobilization for power and chips is doable.

AI revenue 

Companies will make large AI investments if they expect the economic returns to justify it. 

Reports suggest OpenAI was at a $1B revenue run rate in August 2023, and a $2B revenue run rate in February 2024. That’s roughly a doubling every 6 months. If that trend holds, we should see a ~$10B annual run rate by late 2024/early 2025, even without pricing in a massive surge from any next-generation model. One estimate puts Microsoft at ~$5B of incremental AI revenue already. 

So far, every 10x scaleup in AI investment seems to yield the necessary returns. GPT-3.5 unleashed the ChatGPT mania. The estimated $500M cost for the GPT-4 cluster would have been paid off by the reported billions of annual revenue for Microsoft and OpenAI (see above calculations), and a “2024-class” training cluster in the billions will easily pay off if Microsoft/OpenAI AI revenue continues on track to a $10B+ revenue run rate. The boom is investment-led: it takes time from a huge order of GPUs to build the clusters, build the models, and roll them out, and the clusters being planned today are many years out. But if the returns on the last GPU order keep materializing, investment will continue to skyrocket (and outpace revenue), plowing in even more capital in a bet that the next 10x will keep paying off. 

A key milestone for AI revenue that I like to think about is: when will a big tech company (Google, Microsoft, Meta, etc.) hit a $100B revenue run rate from AI (products and API)? These companies have on the order of $100B-$300B of revenue today; $100B would thus start representing a very substantial fraction of their business. Very naively extrapolating out the doubling every 6 months, supposing we hit a $10B revenue run rate in early 2025, suggests this would happen mid-2026. 
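A quick check on that naive extrapolation (the early-2025 $10B starting point is the assumption stated above, not a reported figure):

```python
# How many 6-month doublings to go from a $10B to a $100B annual run rate?

import math

doublings = math.log2(100 / 10)   # ~3.3 doublings for a 10x
months = doublings * 6            # ~20 months
print(f"~{doublings:.1f} doublings, ~{months:.0f} months from early 2025 -> sometime in 2026")
```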

That may seem like a stretch, but it seems to me to require surprisingly little imagination to reach that milestone. For example, there are around 350 million paid subscribers to Microsoft Office—could you get a third of these to be willing to pay $100/month for an AI add-on? For an average worker, that’s only a few hours a month of productivity gained; models powerful enough to make that justifiable seem very doable in the next couple years.[10]
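A minimal sketch of that arithmetic, using the rough figures from the example above:

```python
# A third of ~350M paid Microsoft Office subscribers paying $100/month for an AI add-on.

subscribers = 350e6
adopting_fraction = 1 / 3
dollars_per_month = 100

annual_revenue = subscribers * adopting_fraction * dollars_per_month * 12
print(f"~${annual_revenue/1e9:.0f}B/year")   # ~$140B/year, clearing the $100B milestone
```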

It’s hard to overstate the ensuing reverberations. This would make AI products the biggest revenue driver for America’s largest corporations, and by far their biggest area of growth. Forecasts of overall revenue growth for these companies would skyrocket. Stock markets would follow; we might see our first $10T company soon thereafter. Big tech at this point would be willing to go all out, each investing many hundreds of billions (at least) into further AI scaleout. We’d probably see our first many-hundred-billion-dollar corporate bond sale then.[11]

Beyond $100B, it gets harder to see the contours. But if we are truly on the path to AGI, the returns will be there. White-collar workers are paid tens of trillions of dollars in wages annually worldwide; a drop-in remote worker that automates even a fraction of white-collar/cognitive jobs (imagine, say, a truly automated AI coder) would pay for the trillion-dollar cluster. If nothing else, the national security import could well motivate a government project, bundling the nation’s resources in the race to AGI (more later). 

Historical precedents

$1T/year of total AI investment by 2027 seems outrageous. But it’s worth taking a look at other historical reference classes:

  • In their peak years of funding, the Manhattan and Apollo programs reached 0.4% of GDP, or ~$100 billion annually today (surprisingly small!). At $1T/year, AI investment would be about 3% of GDP. 
  • Between 1996–2001, telecoms invested nearly $1 trillion in today’s dollars in building out internet infrastructure.
  • From 1841 to 1850, private British railway investments totaled a cumulative ~40% of British GDP at the time. A similar fraction of US GDP would be equivalent to ~$11T over a decade. 
  • Many trillions are being spent on the green transition. 
  • Rapidly-growing economies often spend a high fraction of their GDP on investment; for example, China has spent more than 40% of its GDP on investment for two decades (equivalent to $11T annually given US GDP).
  • In the historically most exigent national security circumstances—wartime—borrowing to finance the national effort has often comprised enormous fractions of GDP. During WWI, the UK, France, and Germany borrowed over 100% of their GDPs while the US borrowed over 20%; during WWII, the UK and Japan borrowed over 100% of their GDPs while the US borrowed over 60% of GDP (equivalent to over $17T today).

$1T/year of total AI investment by 2027 would be dramatic—among the very largest capital buildouts ever—but would not be unprecedented. And a trillion-dollar individual training cluster by the end of the decade seems on the table.[12]

Power

Probably the single biggest constraint on the supply-side will be power. Already, at nearer-term scales (1GW/2026 and especially 10GW/2028), power has become the binding constraint: there simply isn’t much spare capacity, and power contracts are usually long-term locked-in. And building, say, a new gigawatt-class nuclear power plant takes a decade. (I wonder when we’ll start seeing things like tech companies buying aluminum smelting companies for their gigawatt-class power contracts.[13])

Comparing trends on total US electricity production to our rough back of the envelope estimates on AI electricity demands.

Total US electricity generation has barely grown 5% in the last decade.[14] Utilities are starting to get excited about AI (instead of 2.6% growth over the next 5 years, they now estimate 4.7%!). But they’re barely pricing in what’s coming. The trillion-dollar, 100GW cluster alone would require ~20% of current US electricity generation in 6 years; together with large inference capacity, demand will be multiples higher.

To most, this seems completely out of the question. Some are betting on Middle Eastern autocracies, which have been going around offering boundless power and giant clusters to get their rulers a seat at the AGI table.

But it’s totally possible to do this in the United States: we have abundant natural gas.[15]

  • Powering a 10GW cluster would take only a few percent of US natural gas production and could be done rapidly.
  • Even the 100GW cluster is surprisingly doable (the drilling arithmetic is sketched after this list).
    • Right now the Marcellus/Utica shale (around Pennsylvania) alone is producing around 36 billion cubic feet a day of gas; that would be enough to generate just under 150GW continuously with generators (and combined cycle power plants could output 250GW due to their higher efficiency).
    • It would take about ~1,200 new wells for the 100GW cluster.[16] Each rig can drill roughly 3 wells per month, so 40 rigs (the current rig count in the Marcellus) could build up the production base for 100GW in less than a year.[17] The Marcellus had a rig count of ~80 as recently as 2019, so it would not be taxing to add 40 rigs to build up the production base.[18]
    • More generally, US natural gas production has more than doubled in a decade; simply continuing that trend could power multiple trillion-dollar datacenters.[19]
    • The harder part would be building enough generators/turbines; this wouldn’t be trivial, but it seems doable with about $100B of capex[20] for 100GW of natural gas power plants. Combined cycle plants can be built in about two years; the timeline for generators would be even shorter still.[21]
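A minimal sketch of the drilling arithmetic referenced above, using the per-well and per-rig figures from the text and footnotes:

```python
# Wells, gas, and rigs for the 100GW cluster (rough figures from the text/footnotes).

wells_needed = 1_200            # new wells for the 100GW cluster
bcf_per_well_per_day = 0.01     # output of a new well, BCF/day
rigs = 40                       # current Marcellus rig count
wells_per_rig_per_month = 3

gas_needed_bcf_per_day = wells_needed * bcf_per_well_per_day
months_to_drill = wells_needed / (rigs * wells_per_rig_per_month)

print(f"Gas required: ~{gas_needed_bcf_per_day:.0f} BCF/day "
      "(vs ~36 BCF/day of current Marcellus/Utica output)")
print(f"Drilling time with {rigs} rigs: ~{months_to_drill:.0f} months")
```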

The barriers to even trillions of dollars of datacenter buildout in the US are entirely self-made. Well-intentioned but rigid climate commitments (not just by the government, but green datacenter commitments by Microsoft, Google, Amazon, and so on) stand in the way of the obvious, fast solution. At the very least, even if we won’t do natural gas, a broad deregulatory agenda would unlock the solar/batteries/SMR/geothermal megaprojects. Permitting, utility regulation, FERC regulation of transmission lines, and NEPA environmental review make things that should take a few years take a decade or more. We don’t have that kind of time.

At this rate, we’re going to drive the AGI datacenters to the Middle East, under the thumb of brutal, capricious autocrats. I’d prefer clean energy too—but this is simply too important for US national security. We will need a new level of determination to make this happen. The power constraint can, must, and will be solved.

Chips

While chips are usually what comes to mind when people think about AI supply constraints, they’re likely a smaller constraint than power. Global production of AI chips is still a pretty small percent of TSMC’s leading-edge production, likely less than 10%. There’s a lot of room to grow via AI becoming a larger share of TSMC production.

Indeed, 2024 production of AI chips (~5-10M H100-equivalents) would already be almost enough for the $100s-of-billions cluster (if they were all diverted to one cluster). From a pure logic fab standpoint, ~100% of TSMC’s output for a year could already support the trillion-dollar cluster (again, if all the chips went to one datacenter).[22]

Of course, not all of TSMC will be able to be diverted to AI, and not all of a year’s AI chip production will go to one training cluster. Total AI chip demand (including inference and multiple players) by 2030 will be a multiple of TSMC’s current total leading-edge logic chip capacity, just for AI. TSMC ~doubled[23] in the past 5 years; they’d likely need to go at least twice as fast on their pace of expansion to meet AI chip demand. Massive new fab investments would be necessary.
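A minimal sketch of that fab-capacity comparison, using the rough figures from footnote 9 (~400k leading-edge wafer starts per month at TSMC, ~35 H100s per wafer):

```python
# If all of TSMC's leading-edge logic output went to AI chips for one year.

wafer_starts_per_month = 400_000   # rough total leading-edge capacity (footnote 9)
h100s_per_wafer = 35

annual_h100e = wafer_starts_per_month * 12 * h100s_per_wafer
print(f"~{annual_h100e/1e6:.0f}M H100-equivalents/year")  # comfortably covers a ~100M-GPU cluster
```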

Even if raw logic fabs won’t be the constraint, chip-on-wafer-on-substrate (CoWoS) advanced packaging (connecting chips to memory, also made by TSMC, Intel, and others) and HBM memory (for which demand is enormous) are already key bottlenecks for the current AI GPU scaleup; these are more specialized to AI, unlike the pure logic chips, so there’s less pre-existing capacity. In the near term, these will be the primary constraint on churning out more GPUs, and they will remain major constraints as AI scales. Still, these are comparatively “easy” to scale; it’s been incredible watching TSMC literally build “greenfield” fabs (i.e. entirely new facilities from scratch) to massively scale up CoWoS production this year (and Nvidia is even starting to find CoWoS alternatives to work around the shortage).

A new TSMC Gigafab (a technological marvel) costs around $20B in capex and produces 100k wafer-starts a month. For hundreds of millions of AI GPUs a year by the end of the decade, TSMC would need to build dozens of these—as well as a huge buildout for memory, advanced packaging, networking, etc., which will be a major fraction of capex. It could add up to over $1T of capex. It will be intense, but doable. (Perhaps the biggest roadblock will not be feasibility, but TSMC not even trying—TSMC does not yet seem AI-scaling-pilled! They think AI will “only” grow at a glacial 50% CAGR.)

Recent USG efforts like the CHIPS Act have been trying to onshore more AI chip production to the US (as insurance in case of the Taiwan contingency). While onshoring more AI chip production to the US would be nice, it’s less critical than having the actual datacenter (on which the AGI lives) in the US. If having chip production abroad is like having uranium deposits abroad, having the AGI datacenter abroad is like having the literal nukes be built and stored abroad. Given the dysfunction and cost we’ve seen from building fabs in the US in practice, my guess is we should prioritize datacenters in the US while betting more heavily on democratic allies like Japan and South Korea for fab projects—fab buildouts there seem much more functional.

The Clusters of Democracy

Before the decade is out, many trillions of dollars of compute clusters will have been built. The only question is whether they will be built in America. Some are rumored to be betting on building them elsewhere, especially in the Middle East. Do we really want the infrastructure for the Manhattan Project to be controlled by some capricious Middle Eastern dictatorship?

The clusters that are being planned today may well be the clusters AGI and superintelligence are trained and run on, not just the “cool-big-tech-product clusters.” The national interest demands that these are built in America (or close democratic allies). Anything else creates an irreversible security risk: it risks the AGI weights getting stolen[24] (and perhaps shipped to China) (more later); it risks these dictatorships physically seizing the datacenters (to build and run AGI themselves) when the AGI race gets hot; or, even if these threats are only wielded implicitly, it puts AGI and superintelligence at unsavory dictators’ whims. America sorely regretted her energy dependence on the Middle East in the 70s, and we worked so hard to get out from under their thumbs. We cannot make the same mistake again.

The clusters can be built in the US, and we have to get our act together to make sure it happens in the US. American national security must come first, before the allure of free-flowing Middle Eastern cash, arcane regulation, or even, yes, admirable climate commitments. We face a real system competition—can the requisite industrial mobilization only be done in “top-down” autocracies? If American business is unshackled, America can build like none other (at least in red states). Being willing to use natural gas, or at the very least a broad-based deregulatory agenda—NEPA exemptions, fixing FERC and transmission permitting at the federal level, overriding utility regulation, using federal authorities to unlock land and rights of way—is a national security priority. 


In any case—the exponential is in full swing now. 

In the “old days,” when AGI was still a dirty word, some colleagues and I used to make theoretical economic models of what the path to AGI might look like. One feature of these models used to be a hypothetical “AI wakeup” moment, when the world started realizing how powerful these models could be and began rapidly ramping up their investments—culminating in multiple % of GDP towards the largest training runs. 

It seemed far-off then, but that time has come. 2023 was “AI wakeup.”[25] Behind the scenes, the most staggering techno-capital acceleration has been put into motion.

Brace for the G-forces.

Next post in series:
IIIb. Lock Down the Labs: Security for AGI

(What all of this means for NVDA/TSM/etc. I leave as an exercise for the reader. Hint: Those with situational awareness bought much lower than you, but it’s still not even close to fully priced in.[26])


  1. As mentioned,  OOM = order of magnitude, 10x = 1 order of magnitude↩

  2. One key uncertainty is how distributed training will be—if instead of needing that amount of power in one location, we could spread it among 100 locations, it’d be a lot easier.↩

  3. See, for example, Zuck here; only ~45k of his H100s are in his largest training clusters, with the vast majority of his 350k H100s used for inference. Meta likely has heavier inference needs than other players, who serve fewer customers so far, but as everyone else’s AI products scale, I expect inference to become a strong majority of the GPUs.↩

  4. For example, this total-cost-of-ownership analysis estimates that around 40% of a large cluster cost is the H100 GPUs themselves, and another 13% goes to Nvidia for Infiniband networking. That said, excluding cost of capital in that calculation would mean the GPUs are about 50% of the cost, and with networking Nvidia gets a bit over 60% of the cost of the cluster.↩

  5. And apparently, despite Microsoft growing capex by 79% compared to a year ago in a recent quarter, their AI cloud demand still exceeds supply!↩

  6. A larger fraction of global GPU production will probably be going to the largest training cluster in the future than today, e.g. because of a consolidation to just a few leading labs, rather than many companies having frontier-model-scale clusters.↩

  7. Of course, not all of this will be in the US, but to give a reference class.↩

  8. I estimate Nvidia is going to ship on the order of 5M datacenter GPUs in 2024. A minority of those are B100s, which we’ll count as 2x+ H100s. Then there’s the other AI chips: TPUs, Trainium, Meta’s custom silicon, AMD GPUs, etc.↩

  9. TSMC has capacity for over 150k 5nm wafers per month, is ramping to 100k 3nm wafers per month, and likely another 150k or so 7nm wafers per month; let’s call it 400k wafers per month in total. 

    Let’s say roughly 35 H100s per wafer (H100s are made on 5nm). At 5-10 million H100-equivalents in 2024, that’s 150k-300k wafers per year for annual AI chip production in 2024. 

    Depending on where in that range and whether we want to count 7nm production, that’s about 3-10% of annual leading-edge wafer production.↩

  10. A big uncertainty for me is what the lags are for the technology to diffuse and be adopted. I think it’s plausible revenue is slowed because intermediate, pre-AGI models take a lot of “schlep” to properly integrate into company workflows; historically, it’s taken a while to fully harvest the productivity gains from new general purpose technologies. This is where the “sonic boom” discussion earlier comes in: as we “unhobble” models and they start looking more like agents/drop-in remote workers, deploying them becomes much easier. Rather than having to completely remake some workflow to harvest a 25% productivity gain from a GPT-chatbot, instead you’ll get models that you can onboard and work with as you would a new coworker (e.g., just directly substitute for an engineer, rather than needing to train up engineers to use some new tool). Or, in the extreme, and later on: you won’t need to completely redesign a factory to work with some new tool, you’ll just bring in the humanoid robots. 

    That said, this may lead to some discontinuity in economic value and revenue generated, depending on how quickly we can “unhobble” models.↩

  11. What will happen to interest rates will be interesting… see Tyler Cowen here; Chow, Mazlish and Halperin here.↩

  12. And, farther out, but if AGI truly led to substantial increases in economic growth, $10T+ annually would start being plausible—the reference class being investment rates of countries during high-growth periods.↩

  13.  “Since 2011, the Alouette Smelter uses 930 MW electricity at maximum production capacity.”↩

  14. That said, this is “net-new” capacity: some of that is building new renewables and taking old fossil fuel plants off the grid. Maybe it’s closer to a percent or two a year of gross-new capacity.↩

  15. Thanks to Austin Vernon (private correspondence) for helping with these estimates.↩

  16. New wells produce around 0.01 BCF per day.↩

  17.  Each well produces ~20 BCF over its lifetime, meaning two new wells a month would replace the depleted reserves, i.e. it would need only one rig to maintain the production.↩

  18. Though it would be more efficient to add fewer rigs and build up over a longer time frame than 10 months.↩

  19. A cubic foot of natural gas generates about 0.13 kWh. Shale gas production was about ~70 billion cubic feet per day in the US in 2020. Suppose we doubled production again, and the extra capacity all went to compute clusters. That’s 3322 TWh/year of electricity, or enough for almost 4 100GW clusters.↩

  20. The capex costs for natural gas power plants seem to be under $1000 per kW, meaning the capex for 100GW of natural gas power plants would be about $100 billion.↩

  21. Solar and batteries aren’t a totally crazy alternative, but it does just seem rougher than natural gas. I did appreciate Casey Handmer’s calculation of tiling the Earth in solar panels:
    “With current GPUs, the global solar datacenter’s compute is equivalent to ~150 billion humans, though if our computers can eventually match [human brain] efficiency, we could support more like 5 quadrillion AI souls.”↩

  22.  This poses the interesting question of why power requirements are going up so much before chip fab production starts being really constrained. A simple answer is while datacenters run continuously at close to max power, most chips currently produced are idle a lot of the time. Currently, smartphones are close to half of leading chip demand, but use a lot less energy per wafer area (trading transistors for serial operations and energy efficiency) and have low utilization as smartphones are mostly idle. The AI revolution means working our transistors way harder, dedicating them all to constantly-running, high-performance AI datacenters instead of idle, battery-powered/energy-saving devices. HT Carl Shulman for this point.↩

  23. (Using revenue as a proxy.)↩

  24.  It’s a lot easier to do side-channel attacks to exfiltrate weights with physical access!↩

  25. I distinctly remember writing “THE TAKEOFF HAS STARTED” on my whiteboard in March of 2023.↩

  26. Mainstream sell-side analysts seem to assume only 10-20% year-over-year growth in Nvidia revenue from CY24 to CY25, maybe $120B-$130B in CY25 (or at least did until very recently). Insane! It’s been pretty obvious for a while that Nvidia is going to do over $200B of revenue in CY25.↩