
Nvidia’s Optical Boogeyman – NVL72, Infiniband Scale Out, 800G & 1.6T Ramp

Transceiver to GPU Ratio, DSP Growth, Revealing The Real Boogeyman


At GTC, Nvidia announced 8+ different SKUs and configurations of the Blackwell architecture. While there are some chip level differences such as memory and CUDA core counts, most of the configurations are system level such as form factor, networking, CPU, and power consumption. Nvidia is offering multiple 8 GPU baseboard style configurations, but the main focus for Nvidia at GTC was their vertically integrated DGX GB200 NVL72.

Rather than the typical 8 GPU server we are accustomed to, it is a single integrated rack with 72 GPUs, 36 CPUs, 18 NVSwitches, 72 InfiniBand NICs for the back end network, and 36 Bluefield 3 Ethernet NICs for the front end network.

Nvidia

Optics and the GB200 NVL72 Panic

When Nvidia’s DGX GB200 NVL72 was announced in the keynote speech last week – with the capability of linking 72 GPUs in the same rack with 900GB/s per GPU NVLink 5 connections – panic set in. Tourists in various optical markets began to run for the hills after reacting quite violently to Jensen’s words.

And so we have 5,000 cables, 5,000 NVLink cables. In total, 2 miles. Now this is the amazing thing. If we had to use optics, we would have had to use transceivers and retimers. And those transceivers and retimers alone would have cost 20,000 watts, 20 kilowatts of just transceivers alone, just to drive the NVLink spine. As a result, we did it completely for free over NVLink switch and we were able to save the 20 kilowatts for computation.

Jensen Huang

NVSwitch With 288 Copper Cables Per Switch

These skittish observers saw the NVLink scale up to 72 GPUs over 5,184 direct drive copper cables as the optical boogeyman come to spoil the party for the various optical supply chain players. Many have argued that because the NVLink network connects all 72 GPUs in the rack, you need fewer optics to achieve connectivity for GPUs within a cluster. These optical tourists thought the optical intensity, i.e. the number of optical transceivers required per GPU in Nvidia clusters, would go down meaningfully.

This is false, and they misunderstood Jensen’s words. The number of optics does not go down.

Nvidia

The DGX H100 and DGX GB200 NVL utilize three distinct networks: a front end Ethernet network with a ratio of 2 or 4 GPUs per NIC; a back end scale out InfiniBand or Ethernet network running at 400G or 800G depending on the configuration, but always with a ratio of 1 NIC per GPU; and a back end scale up NVLink network linking all 8 or 72 GPUs together.
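As a rough illustration of these ratios, the sketch below tallies NIC counts per system. The NVL72 figures follow from the 72 InfiniBand NICs and 36 Bluefield 3 NICs quoted earlier; the 4-GPUs-per-front-end-NIC figure for the DGX H100 is an assumption for illustration.

```python
# Sketch of NIC counts implied by the GPU-to-NIC ratios described above.
# The DGX H100 front end ratio (4 GPUs per NIC) is assumed for illustration;
# the NVL72 numbers follow from the 72 InfiniBand NICs and 36 Bluefield 3 NICs
# quoted earlier for 72 GPUs.

def nic_counts(num_gpus: int, gpus_per_backend_nic: int, gpus_per_frontend_nic: int):
    """Return (back end NICs, front end NICs) for a system of num_gpus GPUs."""
    return num_gpus // gpus_per_backend_nic, num_gpus // gpus_per_frontend_nic

print(nic_counts(8, 1, 4))    # DGX H100: (8, 2)
print(nic_counts(72, 1, 2))   # GB200 NVL72: (72, 36)
```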

Nvidia

For the back end scale out network – the NVL72 rack showcased at GTC still has 72 OSFP ports at 400G / 800G – one for each GPU – which is exactly the same connectivity implemented in the H100 – i.e. the same ratio of optical transceivers to GPUs. As GPU network sizes scale, the number of optical transceivers required also scales.

Top layer is assumed fully populated for simplicity, the full optics model goes out to 100k GPU

The only scenario in which you would not populate transceivers into the 72 OSFP ports on the GB200 NVL72 is if you plan to buy only a single rack. To be clear, no one is buying only one rack, as they would be better off buying 8 GPU baseboards instead. Furthermore, deployment flexibility is everything: while you may intend a server or rack for one purpose in the short term, that purpose will change over time, and the ratios change with it.

The NVL72 is not the optical boogeyman people should worry about; there is another, much more real optical boogeyman.

This optical boogeyman reduces the number of transceivers dramatically, and it's shipping in volume next year. Below we will explain what, how, and by how much.

The Clos Non-blocking Fat Tree Network

Nvidia’s H100 reference architecture uses a Clos non-blocking fat tree network in order to deliver 400G all-to-all bandwidth to each node (i.e. each end device on the network). A key benefit of the Clos network design is that it is easily scalable to accommodate an increasing number of nodes without significant additional complexity, and it creates multiple connections between leaf and spine switches so as to allow many paths for a given node to reach another node on the network. Non-blocking means a pair of nodes are able to connect to each other without having to block or disconnect an existing connection.

We will work through an example of building a network for 512 GPUs to visualize how such a network is built up, but first we must clarify the NIC and port configuration within a typical DGX H100 or HGX H100. For the back end scale out network, there are eight 400G NICs that provide connectivity. In the reference architecture, each pair of NICs is served by one 800G OSFP (2x400G) multimode SR8 twin-port optical transceiver.

Nvidia DGX H100 Superpod Reference Architecture

The next step is to calculate how many leaf switches are required. Because a non-blocking network is required, the number of uplink ports at each leaf switch must equal the number of downlink ports.

The reference architecture uses Nvidia Quantum-2 QM9700 switches, which have a total radix of 25.6Tbps across 32 OSFP cages, i.e. 800G per cage. The twin-port 800G transceiver comes into play here as it is used in the switches to provide 64 ports of 400G (32 ports of 800G). With 512 nodes in the network, each connecting to a leaf switch, we therefore need 16 leaf switches (512 nodes divided by 32 downlink ports per switch).
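A minimal sketch of that leaf-layer arithmetic, assuming 64 ports of 400G per switch with half reserved for downlinks to keep the network non-blocking:

```python
# Leaf layer arithmetic for the 512 GPU example, assuming a 64-port (400G)
# switch with half of its ports reserved for downlinks to GPU nodes.

SWITCH_PORTS = 64                         # Quantum-2 QM9700: 32 OSFP cages x 2 ports
DOWNLINKS_PER_LEAF = SWITCH_PORTS // 2    # non-blocking: uplinks == downlinks

num_gpus = 512
num_leaf_switches = num_gpus // DOWNLINKS_PER_LEAF
print(num_leaf_switches)                  # -> 16
```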

But which leaf switch should we connect each GPU to? We hear a lot of analysts still using the term “Top of Rack” to refer to the leaf switches that the GPU nodes connect to, implying that all the GPUs in a given rack will connect to the same leaf switch.

Nvidia’s rail-optimized architecture calls for the opposite – with GPUs intentionally connected to different leaf switches.

Nvidia

This is because the NVLink within the H100 server itself provides an alternative path through the network with fewer hops. The diagram above shows how, if a given GPU is sending data to a GPU attached to a different leaf switch, it need not send the message up to the spine level in order to reach that distant leaf switch. Instead, it can send the message over NVLink to the GPU within the same H100 server that is connected to the correct leaf switch.

If all the GPUs in the server were connected to the same leaf switch, this capability would not be available. One key implication of rail-optimized design is that we can expect distances from the node to the leaf switch to be greater than in a typical intra-rack configuration, making it difficult to use Passive Direct Attach Copper cables, which have a reach of ~3 meters, or Active Electrical Cables, which have a reach of ~7 meters.

Nvidia

Turning to the spine layer, each leaf switch will be connected to every spine switch. The 16 leaf switches will take up 16 ports out of 64 on the first spine switch. Instead of letting the other ports go unutilized, we use these other ports to connect the leaf and spine switch again with another link until there are no more ports remaining. In this case, we end up with a bundle of 4 links between the leaf and spine switch and each spine switch (the top layer in the diagram below) will be connected to all 16 leaf switches (bottom layer in the diagram below). Note also that the spine switches do not connect to each other – only to the leaf switches.

Given each leaf switch has 32 uplink ports of 400G each, using a bundle size of 4 links means each leaf switch will connect to 8 spine switches (32 divided by 4).
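Continuing the same sketch, the spine-layer arithmetic works out as follows, under the same non-blocking, 64-port-switch assumptions:

```python
# Spine layer arithmetic for the 512 GPU example.

SWITCH_PORTS = 64
UPLINKS_PER_LEAF = SWITCH_PORTS // 2                  # 32 uplink ports per leaf

num_leaf_switches = 16
total_uplinks = num_leaf_switches * UPLINKS_PER_LEAF  # 512 leaf-to-spine links
num_spine_switches = total_uplinks // SWITCH_PORTS    # 512 / 64 = 8
bundle_size = UPLINKS_PER_LEAF // num_spine_switches  # 32 / 8 = 4 links per leaf-spine pair
print(num_spine_switches, bundle_size)                # -> 8 4
```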

Nvidia

Running this algorithm for larger clusters of GPUs, we can see that a cluster size of 2048 is the largest we can build using a two-layer network of leaf switches and spine switches. A 2048 node cluster requires 64 leaf switches, and connecting all of these to one spine switch will fill up all 64 ports on the spine switch, meaning that if we were to go to a 4096 GPU cluster, its 128 leaf switches would no longer be able to connect to the same spine. This necessitates the use of another layer of core switches to connect the spines together, leading to a step function increase in networking complexity and cost. We will leave the construction of the core layer to another time and cut to the chase, with the below table outlining the number of switches at each layer and the total number of transceivers required to build up a network of a given size.

SemiAnalysis Estimates
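As a rough cross-check, the generalized two-layer calculation can be sketched as below. This is a simplification, not the full SemiAnalysis optics model: it assumes a fully populated top layer and that every 400G link end, on both the node and switch side, is served by half of a twin-port 800G transceiver.

```python
# Simplified two-layer fat tree sizing, not the full optics model.

def two_layer_network(num_gpus: int, ports: int = 64):
    max_gpus = ports * ports // 2                # e.g. 64 * 64 / 2 = 2048
    assert num_gpus <= max_gpus, "needs a third (core) layer"
    leaves = num_gpus // (ports // 2)            # half of each leaf's ports face GPUs
    spines = (leaves * (ports // 2)) // ports    # leaf uplinks spread across spines
    links = 2 * num_gpus                         # node-to-leaf plus leaf-to-spine links
    link_ends = 2 * links                        # one 400G end at each side of every link
    twin_port_transceivers = link_ends // 2      # each twin-port optic serves two 400G ends
    return leaves, spines, twin_port_transceivers

print(two_layer_network(512))    # -> (16, 8, 1024)
print(two_layer_network(2048))   # -> (64, 32, 4096)
```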

With an assumption in hand of the share of clusters that will require a 3-layer network (i.e. clusters larger than 2048 GPUs), it is straightforward to solve for the 800G twin-port transceiver TAM. We do exactly this in our upcoming optical model which we will launch this week at OFC.

The Real Optical Boogeyman – The 144 Port Quantum-X800 Q3400-RA 4U Switch

However, while everyone’s attention was on figuring out the implications of the NVL72 on optics, they have completely missed the real optical boogeyman: Nvidia’s new 144 port 800G Quantum-X800 Q3400-RA 4U switch. It achieves 144 ports of 800G across 72 OSFP cages through the use of 1.6T twin-port transceivers, for a total radix of 115.2T, i.e. 4.5 times that of its 25.6T predecessor, the 32 OSFP cage, 64 port 400G Quantum-2 QM9700.
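The radix comparison itself is simple arithmetic:

```python
# Radix comparison between the two switch generations.

quantum_x800_radix = 144 * 800   # 115,200 Gbps = 115.2 Tbps
quantum_2_radix = 64 * 400       # 25,600 Gbps = 25.6 Tbps
print(quantum_x800_radix / quantum_2_radix)   # -> 4.5
```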

SemiAnalysis, Nvidia

While Nvidia did grow the switch IO significantly generation over generation by moving to 200G SerDes, they also put in multiple pieces of switch silicon. To the outside world, this looks like one big switch, but inside, there are 4 electrically connected switch packages.

GB200 NVL72 has multiple SKUs, and one of those delineations is the use of ConnectX-7 (400G) versus ConnectX-8 (800G). This difference requires the use of either the Quantum-2 switches or the Quantum-X800 switches, respectively.

A fat tree network using a 144 port switch can include up to 10,368 GPU nodes while still staying on a 2 layer network topology, more than 5x the nodes possible with a 2 layer network based on the old 64 port switch.
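A quick sanity check on that scaling claim, under the same simplifying assumption that half of each leaf switch’s ports face the GPUs:

```python
# Maximum GPUs reachable on a two-layer non-blocking fat tree: ports^2 / 2.

def max_two_layer_gpus(ports: int) -> int:
    return ports * ports // 2

print(max_two_layer_gpus(64))    # Quantum-2 QM9700:    2,048 GPUs
print(max_two_layer_gpus(144))   # Quantum-X800 Q3400:  10,368 GPUs
print(max_two_layer_gpus(144) / max_two_layer_gpus(64))   # -> ~5.06x
```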

SemiAnalysis Estimates

The benefit of using a 144 port switch shines when we look at a 9k GPU cluster, where the 144 port switch allows the network to stay on 2 layers as opposed to the 3 layers required to build this network using a 64 port switch. This dramatically simplifies the network, requiring 70% fewer switches and reducing total transceiver count by 27%. The transceiver to GPU ratio will fall considerably if a meaningful number of AI networks adopt 144 port switches.

SemiAnalysis Estimates

In this reference architecture network with 4,608 GPU nodes built up using 144 port switches, each leaf switch has 72 ports of downlink, with 9 connections to each of 8 racks. In this rail-optimized topology, the compute nodes connect to different leaf switches rather than all to one single leaf switch. In the below example of a three-layer network, each pod of 8 racks (a total of 576 GPUs per pod) requires 8 leaf switches and 8 spine switches.

These leaf and spine switches are grouped into four rails of 2 leaf switches and 2 spine switches each – with each of the 2 leaf switches connected to each of the 2 spine switches with very thick 36 port bundles each. The spine switches across the entire cluster are then connected using a total of 8 core switches – with two 6-port bundles connecting to two spine switches in each of 8 pods.
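The pod-level arithmetic described above can be checked with a short sketch; the core layer is left out here, and the rail grouping is taken as described:

```python
# Pod level arithmetic for the 4,608 GPU reference design using 144 port switches.

SWITCH_PORTS = 144
DOWNLINKS_PER_LEAF = SWITCH_PORTS // 2              # 72 downlinks, 72 uplinks per leaf

gpus_per_rack = 72
racks_per_pod = 8
gpus_per_pod = gpus_per_rack * racks_per_pod                     # 576
leaves_per_pod = gpus_per_pod // DOWNLINKS_PER_LEAF              # 8
spines_per_pod = leaves_per_pod                                  # 8, mirroring the leaf layer
links_per_rack_per_leaf = DOWNLINKS_PER_LEAF // racks_per_pod    # 9 connections per rack
leaf_to_spine_bundle = DOWNLINKS_PER_LEAF // 2                   # 72 uplinks over 2 spines = 36

num_pods = 4608 // gpus_per_pod                                  # 8 pods
print(gpus_per_pod, leaves_per_pod, links_per_rack_per_leaf, leaf_to_spine_bundle, num_pods)
# -> 576 8 9 36 8
```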

Top layer is assumed fully populated for simplicity, the full optics model goes out to 100k GPU

Oddly, firms that decide to go with the more expensive ConnectX-8 and Quantum-X800 will actually dramatically reduce their optics volumes relative to the ConnectX-7 and Quantum-2 variants. Note this isn't all bad though. The move from 400G to 800G for NIC ports and 800G to 1.6T for switch ports will enable ASP increases for certain sub-components, but not all. Is it enough to offset the unit declines though?

In our optics model that we are launching this week at OFC, we have the ASP, volume estimates, and downstream subcomponent BOM and share for various players in the industry across 400G, 800G, and 1.6T. We offer shipments by quarter through 2027.