
Google AI Infrastructure Supremacy: Systems Matter More Than Microarchitecture

From DLRM to LLM, internal workloads win, but how does Google fare in external workloads?


The dawn of the AI era is here, and it is crucial to understand that the cost structure of AI-driven software deviates considerably from traditional software. Chip microarchitecture and system architecture play a vital role in the development and scalability of these innovative new forms of software. The hardware infrastructure on which AI software runs has a notably larger impact on Capex and Opex, and consequently gross margins, than in earlier generations of software, where developer costs were relatively larger. It is therefore even more important to devote considerable attention to optimizing AI infrastructure before deploying AI software. Firms that have an advantage in infrastructure will also have an advantage in the ability to deploy and scale AI applications.

Google had floated the idea of building AI-specific infrastructure as far back as 2006, but the problem came to a boiling point in 2013. They realized they needed to double the number of datacenters they had if they wanted to deploy AI at any scale. As such, they started laying the groundwork for their TPU chips, which were put into production in 2016. It's interesting to compare this to Amazon, who came to a similar realization in the same year: in 2013, they started the Nitro Program, which focused on developing silicon to optimize general-purpose CPU computing and storage. Two very different companies optimized their infrastructure efforts for different eras of computing and software paradigms.

Since 2016, Google has built six different AI-focused chips: TPU, TPUv2, TPUv3, TPUv4i, TPUv4, and TPUv5. Google primarily designed these chips, with varying amounts of mid- and back-end collaboration from Broadcom. These chips were all fabricated by TSMC. Since TPUv2, the chips have also utilized HBM memory from Samsung and SK Hynix. While Google's chip architecture is interesting and something we will dive into later in this report, there is a far more important topic at play.

Google has a near-unmatched ability to deploy AI at scale reliably with low cost and high performance. With that said, let’s bring some rationality to the argument, as Google has also made disingenuous claims related to chip-level performance, which need to be corrected. We believe Google has a performance/total cost of ownership (perf/TCO) advantage in AI workloads versus Microsoft and Amazon due to their holistic approach from microarchitecture to system architecture. The ability to commercialize generative AI to enterprises and consumers is a different discussion.

The realm of technology is a perpetual arms race, with AI being the swiftest-moving battlefield. The model architectures being trained and deployed have shifted significantly over time. A case in point is Google's internal data. There was a swift rise in CNN models from 2016 to 2019, but then they fell again. CNNs have a very different profile of computation, memory accesses, networking, etc. versus DLRMs versus Transformers versus RNNs. The same happened with RNNs, which were completely displaced by transformers.

As such, hardware must be flexible enough to support wherever the industry heads. The underlying hardware cannot over-specialize on any specific model architecture, or it risks becoming obsolete as model architectures change. Chip development to large-scale volume deployment generally takes 4 years, and as such, the hardware can be left behind by what software wants to do on it. This can already be seen with certain AI accelerator architectures from startups that used a specific model type as their optimization point. This is one of the many reasons why most AI hardware startups have failed or will fail.

The point is especially clear with Google’s own TPUv4i chip, which was designed for inference, yet cannot run inference on Google’s best models such as PaLM. The last-generation Google TPUv4 and Nvidia A100 could not have possibly been designed with large language models in mind. Similarly, the recently deployed Google TPUv5 and Nvidia H100 could not have been designed with the AI Brick Wall in mind, nor the new model architecture strategies that have been developed to address it. These strategies are a core part of GPT-4’s model architecture.

Hardware architects have to make their best guess about the direction in which machine learning is headed when designing their chips. This includes memory access patterns, tensor sizes, data reuse structures, arithmetic density vs networking overhead, and more.

Furthermore, the chip microarchitecture is a fraction of the true cost of AI infrastructure. System-level architecture and deployment flexibility are far more important factors. Today we want to dive into Google's TPU microarchitecture, system architecture, deployment slicing, scalability, and their tremendous advantage in infrastructure versus the other tech titans. This includes our TCO model comparing the cost of Google's AI infrastructure versus that of Microsoft, Amazon, and Meta.

We will also be directly comparing Google’s architecture to Nvidia’s, which is top of mind, especially from a performance and networking standpoint. We will also briefly compare this to AI hardware from other firms, including AMD, Intel, Graphcore, Amazon, Sambanova, Cerebras, Enflame, Groq, Biren, Iluvatar, and Preferred Networks.

We will also examine this from a practitioner's lens for large model research, training, and deployment. We also want to dive into DLRM models, which are often under-discussed despite currently being the largest at-scale AI model architecture. Furthermore, we will discuss the infrastructure differences between DLRM and LLM model types. Lastly, we will discuss Google's ability to succeed with TPU for external cloud customers. At the end, there is also an easter egg: an anomaly with Google's TPU that we believe is an error.

Google’s System Infrastructure Advantage

Part of Google's advantage in infrastructure is that they have always designed TPUs from a system-level perspective. This means the individual chip is important, but how it can be used together in a system in the real world is far more important. As such, we will go layer by layer from system architecture to deployment use to chip level in our analysis.

While Nvidia also thinks from a systems perspective, their systems have been smaller in scale and narrower in scope than Google's. Furthermore, until recently, Nvidia had no experience with cloud deployments. One of Google's biggest innovations in its AI infrastructure is ICI, a custom networking stack between TPUs. This link is low latency and high performance relative to costly Ethernet and InfiniBand deployments. It is more analogous to Nvidia's NVLink.

Google's TPUv2 could scale to 256 TPU chips, the same number as Nvidia's current-generation H100 GPU. They increased this number to 1024 with TPUv3 and to 4096 with TPUv4. Based on the trendline, we would assume that the current-generation TPUv5 can scale up to 16,384 chips without going through inefficient Ethernet. While this is important from the perspective of performance for large-scale model training, more important is their ability to divide this up for real use.

Google's TPUv4 servers have 8 TPUv4 chips and 2 CPUs each. This configuration mirrors Nvidia's GPUs, which come in servers of 8 A100s or H100s with 2 CPUs per server. A single server is generally the unit of compute for GPU deployments, but for the TPU, the unit of deployment is a larger "slice" of 64 TPU chips and 16 CPUs. These 64 chips connect internally over the ICI network in a 4^3 cube, using direct-attached copper cabling.

Beyond this unit of 64 chips, communications transfer over to the optical realm instead. These optical transceivers cost more than 10x as much as passive copper cables, so Google settled on a 64-chip slice size to minimize system-level cost from a networking standpoint.

Compare this to a 2023 Nvidia SuperPod deployment, which maxes out at 256 GPUs with NVLink, 16 times smaller than the 2020 TPUv4 pod of 4096 chips. Furthermore, Nvidia clearly pays significantly less attention to density and networking costs based on Nvidia's first-party renders and DGX SuperPod systems. Nvidia's deployments are generally 4 servers per rack.

Beyond the realm of 4 servers with 32 total GPUs, generally, the communications must go optical. As such, Nvidia requires significantly more optical transceivers for large-scale deployments.

Google OCS

Google deployed its custom optical switch, which uses MEMS-based micro-mirror arrays for switching between 64-TPU slices. The quick summary is that Google claims their custom network improves throughput by 30%, uses 40% less power, incurs 30% less Capex, reduces flow completion time by 10%, and delivers 50x less downtime across their network. For a more detailed look at why and how, see this report.

Google uses these OCS to make its datacenter spine. They also use them to connect TPU pods both internally and to one another. The big advantage of this OCS is that signals remain in the optical domain all the way from any 64-TPU slice to any other slice within the 4096-TPU pod.

Compare this to an Nvidia GPU deployment of 4,096 GPUs with multiple Nvidia SuperPods. This system would require multiple layers of switching between these GPUs, and a total of ~568 InfiniBand switches. Google only requires 48 of their optical switches for their 4096 TPU deployment.

It should be noted that Google's OCS are also about 3.2x to 3.5x more expensive per switch when purchased directly from Google's contract manufacturer than Nvidia's InfiniBand switches are when purchased from Nvidia by a third party. This is not a fair comparison, though, given it includes Nvidia's ~75% datacenter gross margin.

If we compare contract manufacturing costs alone, i.e., the cost to Google versus the cost to Nvidia, then the cost differential rises to 12.8x to 14x that of Nvidia InfiniBand switches. The number of switches required for a deployment of 4096 chips is 48 vs 568, i.e., an 11.8x difference. Nvidia's solution is cheaper to manufacture on a switch basis. When the cost of the additional optical transceivers is included, this equation equalizes or shifts in favor of Google.
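A quick back-of-envelope using only the figures above shows how the two switch bills compare before transceivers are counted (the per-switch cost multipliers are the rough ranges cited here, not quoted prices):

```python
ocs_switches = 48          # Google OCS switches for a 4,096-TPU pod
ib_switches = 568          # InfiniBand switches for a 4,096-GPU deployment

switch_count_ratio = ib_switches / ocs_switches       # ~11.8x more switches for Nvidia
cost_per_switch_ratio = (12.8, 14.0)                  # OCS vs IB at contract-manufacturing cost

# Relative total switch spend (Google OCS / Nvidia InfiniBand), switches only:
low, high = (r / switch_count_ratio for r in cost_per_switch_ratio)
print(f"InfiniBand needs {switch_count_ratio:.1f}x more switches")
print(f"Google's switch spend is {low:.2f}x to {high:.2f}x Nvidia's")   # ~1.08x to 1.18x
```

In other words, on switches alone it is roughly a wash with a slight edge to Nvidia; the extra optical transceivers required by the deeper electrically switched network are what tilt the total toward Google.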

Each connection between each layer of switching is another point that necessitates more cabling. While some of this can be done over direct-attached copper cables, there are still multiple points where the signal must travel over optics. Each of those hops converts from electrical to optical and back to electrical between layers of switching. This drives the power consumption of a large-scale electrical switching system much higher than that of Google's OCS.

Google claims all these power and cost savings are so large that their networking cost is <5% of the total TPU v4 supercomputer capital costs and <3% of total power. This isn’t done just by moving from electrical to in-house optical switches.

Minimizing Network Cost Through Topology

While Google pushes this viewpoint heavily, it is important to recognize that the topologies of the Google and Nvidia networks are entirely different. Nvidia systems deploy "Clos networks," which are "non-blocking." This means they can establish full-bandwidth connections between all input and output pairs simultaneously without any conflicts or blocking. This design provides a scalable approach for connecting many devices in a data center, minimizing latency, and increasing redundancy.

Google’s TPU networking forgoes this. They use a 3D torus topology to connect nodes in a three-dimensional grid-like structure. Each node is connected to its six neighboring nodes in a grid (up, down, left, right, front, and back), forming a closed loop in each of the three dimensions (X, Y, and Z). This creates a highly interconnected structure, where the nodes form a continuous loop in all three dimensions.
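A minimal sketch of that adjacency rule, for illustration only: each node wraps around at the edges, so even a corner chip has exactly six neighbors. With N = 4 this gives the 64-chip slice; a 16 × 16 × 16 arrangement gives the 4096-chip pod.

```python
def torus_neighbors(node, n):
    """Return the six neighbors of (x, y, z) in an n x n x n 3D torus."""
    x, y, z = node
    return [
        ((x + 1) % n, y, z), ((x - 1) % n, y, z),  # +/- X, wrapping at the edges
        (x, (y + 1) % n, z), (x, (y - 1) % n, z),  # +/- Y
        (x, y, (z + 1) % n), (x, y, (z - 1) % n),  # +/- Z
    ]

print(torus_neighbors((0, 0, 0), 4))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```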

The first image is more logical, but if you think about it for a while and are a bit hungry, this network topology is literally a doughnut!

The torus topology has several advantages versus the Clos topology that Nvidia utilizes:

  1. Lower latency: The 3D torus topology can provide lower latency due to its short, direct links between neighboring nodes. This is particularly useful when running tightly-coupled, parallel applications that require frequent communication between nodes, such as some types of AI models.

  2. Better locality: In a 3D torus network, nodes that are physically close to each other are also logically close, which can lead to better data locality and reduced communication overhead. While latency is one aspect, power is also a tremendous benefit.

  3. Lower network diameter: The 3D torus topology has a lower network diameter than Clos networks for the same number of nodes. There are tremendous cost savings due to requiring significantly fewer switches relative to Clos networks.

On the flip side of the coin, there are many disadvantages to the 3D torus network.

  1. Predictable performance: Clos networks, especially in data center environments, can provide predictable and consistent performance due to their non-blocking nature. They ensure that all input-output pairs can be connected simultaneously at full bandwidth without conflicts or blocking, which is not guaranteed in a 3D torus network.

  2. Easier to scale: In a spine-leaf architecture, adding new leaf switches to the network (to accommodate more servers, for example) is relatively simple and does not require major changes to the existing infrastructure. In contrast, scaling a 3D torus network may involve reconfiguring the entire topology, which can be more complex and time-consuming.

  3. Load balancing: Clos networks offer more paths between any two nodes, which allows for better load balancing and redundancy. While 3D torus networks also provide multiple paths, the number of alternative paths in Clos networks can be higher, depending on the network’s configuration.

Overall, while Clos has advantages, Google’s OCS mitigates many of these. OCS enables simple scaling between multiple slices and multiple pods.

The biggest issue facing 3D torus topologies is that errors hurt more. Errors can and do crop up. Even with 99% host availability, a slice of 2,048 TPUs would almost never have every node working properly. Even at 99.9%, a training run with 2,000 TPUs has 50% goodput without Google's OCS.
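To illustrate the scale of the problem, here is a simplified snapshot calculation assuming 4 TPUs per host (two CPU hosts per 8-chip server) and independent failures; it is not the same as goodput over a full run, but it shows why a statically wired torus struggles without OCS rerouting:

```python
def all_hosts_up(availability, num_tpus, tpus_per_host=4):
    """Probability every host backing a slice is healthy at the same time."""
    hosts = num_tpus // tpus_per_host
    return availability ** hosts

print(f"{all_hosts_up(0.99, 2048):.3f}")    # ~0.006 -> effectively never all-up
print(f"{all_hosts_up(0.999, 2048):.2f}")   # ~0.60  -> frequent interruptions
```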

The beauty of OCS is that it enables routing to be reconfigured on the fly.

Spares are needed to allow scheduling jobs despite some failed nodes. An operator cannot realistically schedule two 2k-node slices from a 4k-node pod without risking failures. Nvidia-based training runs often require excessive overhead dedicated to checkpointing, pulling failed nodes, and restarting. Google simplifies this to some extent by simply routing around failed nodes instead.

One other benefit of the OCS is that slices can be used as soon as they are deployed rather than waiting for the full network.

Deploying Infrastructure – A User’s Perspective

The infrastructure efficiencies are nice from a cost and power perspective, allowing Google to deploy more TPUs per dollar than other firms can deploy GPUs, but this means nothing if the hardware cannot actually be used well. One of the biggest advantages Google's internal users get to experience is that they can tailor their infrastructure demands to their model.

No chip or system is ever going to match the memory, network, and compute profile that every user wants. Chips have to generalize, but at the same time, users want flexibility, and they don't want a one-size-fits-all solution. Nvidia addresses this by offering many different SKU variations. Furthermore, they offer different memory capacity tiers as well as tighter integration options such as Grace + Hopper and NVLink Network for SuperPods.

Google cannot afford this luxury. Each additional SKU means that the total deployed volume per individual SKU is lower. This, in turn, reduces utilization rates across their infrastructure. More SKUs would also make it harder for users to get the type of compute they want when they want it, because certain options will inevitably be oversubscribed. Those users would then be forced to use a suboptimal configuration.

As such, Google has a tough problem feeding their researchers the exact products they want while also minimizing SKU variation. Google has exactly one TPUv4 deployment configuration of 4,096 TPUs, in comparison to the hundreds of different deployment sizes and SKUs that Nvidia must support for their larger, more varied customer base. Despite this, Google is still able to slice and dice this single configuration in a unique way that gives internal users the flexibility of infrastructure they desire.

Google's OCS also enables the creation of custom network topologies such as twisted torus networks. These are 3D torus networks where some dimensions are twisted, meaning that nodes at the edges of the network are connected in a non-trivial, non-linear manner, creating additional shortcuts between nodes. This further improves network diameter, load balancing, and performance.

Google's teams take advantage of this heavily to assist with certain model architectures. Below is a snapshot of the popularity of various TPU configurations by number of chips and network topology for just one day in November 2022. There are more than 30 different configurations, despite many having the same number of chips in the system, to suit the variety of model architectures being developed. This is a tremendously powerful insight from Google into their use of TPUs and their flexibility. Furthermore, they also have many less-used topologies that are not even pictured.

To take full advantage of the bandwidth available, users map data parallelism along one dimension of the 3D torus and the two model-parallel dimensions along the other two. Google claims optimal topology selection enables 1.2x to 2.3x higher performance.
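As a rough, hypothetical sketch of what this mapping looks like in practice (JAX-style, with illustrative axis names and a 4 × 4 × 4 slice rather than Google's internal setup), it might resemble the following:

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the 64 chips of a 4^3 slice into a logical 3D mesh.
devices = mesh_utils.create_device_mesh((4, 4, 4))
mesh = Mesh(devices, axis_names=("data", "model_x", "model_y"))

# Batches are split along one torus dimension; weight matrices are split
# along the other two, so most collectives travel over short ICI hops.
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P("model_x", "model_y"))
```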

We will discuss the software stack and external users later in this report.

The Largest At Scale AI Model Architecture: DLRM

Any discussion of AI infrastructure is incomplete without discussing Deep Learning Recommendation Models (DLRMs). These DLRMs are the backbone of companies like Baidu, Meta, ByteDance, Netflix, and Google. They are the engine of over a trillion dollars of annual revenue in advertising, search ranking, social media feed ordering, etc. These models consist of billions of weights, train on more than a trillion examples, and handle inference at over 300,000 queries per second. The size of these models (10TB+) far exceeds that of even the largest transformer models, such as GPT-4, which is on the order of 1TB+ (due to model architecture differences).

The common thread between all of the firms mentioned above is that they rely on constantly updated DLRMs to drive their businesses of personalizing content, products, or services across industries such as e-commerce, search, social media, and streaming. The cost of these models is tremendous, and hardware must be co-optimized for them. DLRMs have not been static; they have been constantly improving over time. But let's explain the general model architecture before moving forward. We will try to keep it simple.

DLRM aims to learn meaningful representations of user-item interactions by modeling both categorical and numerical features. The architecture comprises two main components: the Embedding Component (dealing with categorical features) and the Multilayer Perceptron (MLP) Component (handling numerical features).

In the most simplified terms, the multilayer perceptron component is dense. The features are fed into a series of fully connected layers. This is similar to older, pre-GPT-4 transformer architectures, which were also dense. The dense layers map very well to the massive matrix multiply units found in hardware.

The embedding component is unique to DLRMs and is what makes their computational profile so distinctive. DLRM inputs are categorical features represented as discrete, sparse vectors. A simple Google search contains only a few words out of the entire language. These sparse inputs do not map well to the massive matrix multiply units found in hardware, since they are fundamentally more akin to hash tables than tensors. Since neural networks usually perform better on dense vectors, embeddings are employed to convert categorical features into dense vectors.

  • Sparse input: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

  • Dense vector: [0.3261477, 0.4263801, 0.5121493]

Embedding functions map the categorical space (words in the English language, engagement with a social media post, behavior towards a type of post) to a smaller, dense space (100-vectors representing each word). These functions are implemented using lookup tables, which are an essential part of DLRMs and often form the first layer of DLRM models. The size of embedding tables can vary significantly, ranging from tens of megabytes to hundreds of gigabytes or even terabytes each.
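A minimal sketch of this lookup, with illustrative sizes, shows why it is a memory operation rather than a matrix multiply:

```python
import numpy as np

vocab_size, embedding_dim = 22, 3
table = np.random.rand(vocab_size, embedding_dim).astype(np.float32)

# Sparse input: only categories 3 and 16 are active (the two 1s in the
# sparse vector above).
active_ids = np.array([3, 16])

# Multiplying by a one-/multi-hot vector reduces to a gather plus a pooling
# op (sum or mean), which is why embedding lookups stress memory bandwidth
# rather than FLOPS.
dense = table[active_ids].sum(axis=0)
print(dense)   # a small dense vector, e.g. [0.33, 0.43, 0.51]
```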

Meta's 2-year-old DLRM was over 12 trillion parameters and required 128 GPUs to run inference. Nowadays, the largest production DLRM models are at least several times larger and consume over 30TB of memory just to hold the model embeddings. Expectations are that this increases to over 70TB of embeddings over the next year! As a result, these tables need to be partitioned across the memory of many chips. There are three primary partitioning methods: column sharding, row sharding, and table sharding.

The performance of DLRMs is largely gated by the memory bandwidth, memory capacity, vector processing performance, and the networking/interconnect between chips. Embedding lookup operations primarily consist of small gather or scatter memory accesses, which have low arithmetic intensity (FLOPS do not matter at all). The accesses to the embedding tables are fundamentally unstructured sparsity. Every query must pull data from part of the 30TB+ of embeddings, sharded across hundreds or thousands of chips. This can lead to imbalances in computation, memory, and communication loads across a supercomputer for DLRM inference.
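As a simple illustration of one of the partitioning methods above, here is a hypothetical row-sharded lookup, where each chip owns a contiguous block of rows and each query id is routed to whichever shard holds it:

```python
import numpy as np

num_chips, rows_per_chip, dim = 4, 1000, 8
shards = [np.random.rand(rows_per_chip, dim).astype(np.float32)
          for _ in range(num_chips)]   # stand-ins for per-chip HBM

def sharded_lookup(row_id):
    chip = row_id // rows_per_chip     # which chip owns this row
    local = row_id % rows_per_chip     # offset within that chip's shard
    return shards[chip][local]         # a small gather, low arithmetic intensity

# A single query touches a handful of scattered rows on different chips,
# which is what makes the load on memory and interconnect so uneven.
vectors = [sharded_lookup(i) for i in (42, 1733, 2048, 3999)]
```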

These sparse access patterns differ greatly from the dense operations in MLPs and GPT-3-like transformers, where chip FLOPS/sec remain one of the primary performance drivers. Of course, there are a variety of factors holding performance back beyond FLOPS, but GPUs can still achieve over 71% hardware FLOPS utilization in Chinchilla-style LLMs.

Google’s TPU Architecture

Google's TPU introduces some key innovations to the architecture, which set it apart from other processors. Unlike traditional processors, the TPU v4 doesn't have a dedicated instruction cache. Instead, it employs a Direct Memory Access (DMA) mechanism, similar to the Cell processor. The vector caches in TPU v4 are not part of a standard cache hierarchy but are utilized as scratchpads. Scratchpads differ from standard caches in that they require manual data management, while standard caches handle data automatically. Google can use this more efficient arrangement because it does not need to serve as broad a general-purpose compute market. This does affect the programming model somewhat, although Google engineers believe the XLA compiler stack handles it well. The same cannot be said for external users.

The TPU v4 boasts 160MB of SRAM for the scratchpad along with 2 TensorCores, each of which has 1 Vector Unit, 4 Matrix Multiply Units (MXUs), and 16MB of Vector Memory (VMEM). The two TensorCores share 128MB of memory. The chip supports 275 TFLOPS of BF16 and also supports INT8 data types. The memory bandwidth of the TPU v4 is 1200GB/s. The Inter-Chip Interconnect (ICI) provides a data transfer rate of 300GB/s via six 50GB/s links.
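A quick back-of-envelope from those figures gives the arithmetic intensity the chip needs to stay compute-bound, which helps explain why dense matrix math thrives here while sparse embedding lookups do not:

```python
peak_bf16_flops = 275e12      # 275 TFLOPS of BF16
hbm_bandwidth = 1200e9        # 1200 GB/s of HBM bandwidth
ici_bandwidth = 6 * 50e9      # six 50 GB/s ICI links = 300 GB/s

print(peak_bf16_flops / hbm_bandwidth)   # ~229 FLOP needed per HBM byte
print(peak_bf16_flops / ici_bandwidth)   # ~917 FLOP needed per ICI byte
```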

A 322-bit Very Long Instruction Word (VLIW) scalar computation unit is included in the TPU v4. In VLIW architectures, instructions are grouped together into a single, long instruction word, which is then dispatched to the processor for execution. These grouped instructions, also known as bundles, are explicitly defined by the compiler during program compilation. The VLIW bundle comprises up to 2 scalar instructions, 2 vector ALU instructions, 1 vector load and 1 vector store instruction, and 2 slots for transferring data to and from the MXUs.

The Vector Processing Unit (VPU) is equipped with 32 2D registers, each containing 128 × 8 32-bit elements, making it a 2D vector ALU. The Matrix Multiply Units (MXUs) are 128×128 on v2, v3, and v4, with the v1 version featuring a 256×256 configuration. The reason for this change was that Google's simulations showed four 128×128 MXUs have 60% higher utilization than one 256×256 MXU, yet the four 128×128 MXUs take up the same amount of area as the single 256×256 MXU. The MXUs take 16-bit floating point (FP) inputs and accumulate in 32-bit floating point.

These larger units allow more efficient data reuse to break through the memory wall.
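The area side of that trade-off is simple arithmetic: four 128×128 arrays hold exactly as many multiply-accumulate cells as one 256×256 array, so the split costs no extra silicon while, per Google's simulations, raising utilization.

```python
one_large_mxu = 256 * 256        # 65,536 multiply-accumulate cells
four_small_mxus = 4 * 128 * 128  # also 65,536 cells, i.e. area-neutral
assert one_large_mxu == four_small_mxus
```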

Google DLRM Optimizations

Google was one of the first to use DLRMs at scale with their search product. This unique need led to a novel solution. The architecture described above has a major deficiency in that it cannot effectively handle the embeddings of a DLRM. Google's main TensorCore is very large and does not match the computational profile of these embeddings. Google had to develop an entirely new type of "SparseCore" in their TPU, distinct from the "TensorCore" for dense layers described above.

The SparseCore (SC) provides the hardware support for embeddings in Google's TPU. Since as early as TPU v2, these domain-specific processors have had tiles directly tied to each HBM channel/sub-channel. They accelerate the most memory bandwidth-intensive part of training Deep Learning Recommendation Models (DLRMs) while taking up only about 5% of die area and power. By using the fast HBM2 on each TPU v4 chip for embeddings rather than CPU memory, Google showed a 7x speedup of their internal production DLRM compared to leaving embeddings in the host CPU's main memory (TPU v4 SparseCore vs TPU v4 with embeddings on Skylake-SP).

The SparseCore enables fast memory access from HBM, with dedicated fetch, processing, and flush units that move data into banks of Sparse Vector Memory (Spmem), which are updated by a programmable 8-wide SIMD Vector Processing Unit (scVPU). Sixteen compute tiles of these units make up a SparseCore.

Additional cross-channel units perform specific embedding operations (DMA, Sort, Sparse Reduce, Fork, Concatenate). There are 4 SparseCores per TPU v4 chip, each with 2.5MB of Spmem. Going forward, we speculate that the number of SparseCores increases to 6 for TPUv5 and that the number of tiles increases to 32 due to the increased number of sub-channels on HBM3.

While the performance gain from moving to HBM is massive, performance scaling is still affected by interconnect bisection bandwidth. The new 3D torus of the ICI in TPU v4 helps scale embedding lookup performance further. However, the improvement drops off when scaling up to 1024 chips as SparseCore overheads become the bottleneck.

This bottleneck likely results in Spmem per tile also increasing with TPUv5 if Google feels their DLRMs need to increase in size and capacity beyond that of ~512 chips.

The rest of this report will compare the Google TPU to Nvidia GPUs with real-world data for large language model training, not just the typical comparisons of small models that are irrelevant to real training budgets.

It will also compare the microarchitecture to Nvidia GPUs, as well as to other AI hardware from AMD, Intel, Graphcore, Amazon, Sambanova, Cerebras, Enflame, Groq, Biren, Iluvatar, and Preferred Networks.

We will also compare other tech titans’ infrastructure costs for AI versus Google’s. Lastly, there’s also a weird anomaly with Google’s TPU that we have to assume is an error.

GPU Architecture Comparison

It is important to note that the TPU operates in a very different way from the GPU. A GPU offers many more threads than a TPU, which uses very few threads that each do much more work. As such, the GPU's smaller threads can operate more effectively on smaller vectors. Google's SparseCores are, in part, a way of making up for this deficiency around waiting for data movement. The GPU's threads can afford to idle while waiting for memory, but a TPU's larger TensorCore cannot without significantly impacting hardware utilization.

A big disadvantage of GPUs having many more threads is that this results in a ~100x larger register file (27 MB versus 0.25 MB). However, it's important to note that TPUv4 employs a large scratchpad instead of the traditional cache hierarchy found in the A100. This makes programming for TPUv4 more challenging than for the A100, as developers (or their cumbersome compiler stack) must manually manage data storage and retrieval within the scratchpad.

The larger sizes of Google’s matrix units help enable more efficient data reuse. While Google absolutely has an advantage with regards to TPUv4 vs A100, this won’t remain the case. With the H100, Nvidia both increased the size of their Tensor Core and brought new features such as distributed shared memory and L2 multicast to Hopper. The idea is that different SMs (think cores) can write directly to another SM’s SRAM (shared memory/L1 Cache). This effectively increases the size of the cache and reduces the required bandwidth of read/writes.

While TPUv4 may offer some power efficiency advantages, its architectural differences also introduce certain challenges for programmers.

GPU vs TPU Performance

So how does it all stack up? Performance-wise, we are not fans of using MLPerf, given it lags behind what the industry is building with the bulk of its compute budget. A great example is its DLRM benchmark. The DLRM model implemented by MLPerf in no way resembles the DLRMs that firms like Meta, Google, and Baidu deploy. It has less than 2 million FP32 weights versus hundreds of millions or billions at those firms, which also use FP16, BF16, or INT8. Unfortunately, there is no cut-and-dried solution to this.

The same applies to LLMs. The best MLPerf has is BERT, which even Google doesn't call an LLM anymore. The next MLPerf will have a true LLM, but it is already obsolete. The investment in LLMs by the leaders is no longer in dense LLMs; the battleground has already shifted away. Google was quite disingenuous in its TPU paper with regard to some performance comparisons. This was especially the case with Google picking on Graphcore, who, besides already being nearly down for the count, has always sold 2x the chips for the same price due to their lack of HBM.

The true measure of performance is that of LLMs. Google claims they can get much better performance out of their TPUs, but they won't prove it. For example, Google can only achieve 46.2% MFU and 57.8% HFU in PaLM on TPUv4. In comparison, Nvidia A100 GPUs can still achieve over 53% MFU and 71% HFU in Chinchilla-style LLMs. Do note these models are not identical. They are similar, though, in that they are dense transformers. The comparison shows higher utilization can be achieved on GPUs. Note these numbers come from MosaicML, whose scaling team is orders of magnitude smaller than Google's. We'd be willing to bet that OpenAI gets even better utilization rates on their fleet of 10k A100 GPUs, making the gap even larger.
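For reference, model FLOPS utilization (MFU) for a dense transformer is conventionally computed from the 6 × parameters × tokens estimate of training FLOPs; hardware FLOPS utilization (HFU) additionally counts recomputation work. A sketch with purely hypothetical numbers:

```python
def mfu(params, tokens, wall_clock_seconds, num_chips, peak_flops_per_chip):
    model_flops = 6 * params * tokens            # standard dense-transformer estimate
    achieved_flops_per_sec = model_flops / wall_clock_seconds
    return achieved_flops_per_sec / (num_chips * peak_flops_per_chip)

# Hypothetical run: a 70B-parameter model on 1.4T tokens, 40 days on
# 1,024 A100s (312 TFLOPS peak BF16) -> roughly 53% MFU.
print(f"{mfu(70e9, 1.4e12, 40 * 24 * 3600, 1024, 312e12):.1%}")
```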

Nvidia A100 has a 31% performance advantage versus TPUv4. To be clear, this is for 3-year-old chips, and the real battleground is with TPUv5 and Nvidia H100. It should be noted that this last-generation hardware from Google performs better in LLM inference versus Nvidia’s last-generation hardware.

Other AI Hardware

To be frank, the other AI hardware pictured below, outside of Google TPUs and Nvidia GPUs, will not see large commercial success. AMD's MI300 has a shot, but it isn't pictured, as that is a topic for another day. Amazon's Trainium 1, while it is getting ~$400M of incremental deployments this year, only sees any use because of the GPU shortage plus Amazon's willingness to discount heavily from list price to incredibly low figures that likely result in negative gross margins.

We didn't include a couple of startups, such as Cerebras, who have a sliver of a chance at actually coming to market and selling decent volumes of product, but even there, we are very skeptical.

Google Cloud Success With TPUs

While we are more positive about the refocused Google shipping AI-based services that can grab some share, we are not confident in Google's ability to succeed in the cloud infrastructure business with TPU. Many of Google's best customers in the infrastructure space, such as ByteDance, want GPUs, not TPUs. The software stack of XLA with varying front ends of TensorFlow vs JAX is not a good messaging strategy for customers, even if it does make sense to have multiple.

Even the customers that Google bought, such as Anthropic, still require Google Cloud to provide them with significant H100 credits.

The biggest roadblock is that Google has to open up its programming model and hardware roadmap a year before deployment, as Nvidia does. They have to put all their cards on the table and enable developers to use the TPU from day 1. Locking the best documentation and details of how the system works behind NDAs will never work. The experience of using TPUs as an internal Google user is vastly different from that of an external user. External users are treated as second-class citizens.

Google has had the SparseCore for 5 years, yet only disclosed it in 2023. They have multiple hardware features, including the SparseCore and the reconfigurable network stack, that are not marketed publicly or even accessible for an ordinary person to experiment with on the cloud, despite being big avenues for performance and power efficiency. Google would need to open up all its internal best practices if it wanted to succeed in public cloud infrastructure with TPU.

Frankly, it doesn’t make sense for Google to share all their cards, either. In our opinion, from a business strategy perspective, keeping their cards close is a good idea. In fact, they should keep them closer and have never published the switch transformer, Chinchilla, PaLM inference, or many other groundbreaking papers. Of course, we thank them for it because these are what enable the industry to understand what’s happening and where to focus development efforts. Their competitors leap forward months or even years every time Google publishes one of these notable papers publicly.

Google’s Infrastructure Advantage Versus Tech Titans

Google has an absolute cost advantage over Microsoft, Meta, and Amazon in AI hardware. These companies' internal AI silicon efforts have varying degrees of design and productization, but there is a common thread: they are not competitive with Nvidia's or Google's. Their best option now is to go with Nvidia-based infrastructure or take a swing with the MI300 in late 2023 / early 2024. Internal silicon efforts are not going to be competitive until 2025 at minimum. Furthermore, the combination of networking, compute, and software competence required makes it hard to compete.

When looking at these firms deploying infrastructure, the only realistic choice is to deploy Nvidia’s H100. Even when accounting for Nvidia maintaining a 30% performance advantage with H100 vs TPUv5 as they did with A100 vs TPUv4, the total system level Capex and Opex required means Google has a ~3x cost advantage versus Amazon, Microsoft, and Meta for running the same model deployed at scale.

This advantage is tremendous, and the real question is if Google can develop applications to deploy AI at scale. In search, it is likely Google beats Microsoft Bing, while maintaining a favorable cost model. This won’t be the case in all applications. In fact, most new applications will be adopted by non-Google firms, including potentially Nvidia Cloud, which we will discuss in detail in a future report.

Google’s Error

The fun easter egg we found relates to Google's TPU paper containing a potential error. Google released this photo of TPUv4.

They denoted the die size as <600mm^2

But the image Google provided does not show this. If you measure the image using the industry-standard size of HBM2 as a reference point, the die size of the chip is actually closer to 617 mm^2. This is either a mislabeled chip in Google's picture, a thermal/physical mockup, or an error in communications. The design is likely <600mm^2, but the scribe lines make the physical die larger than 600mm^2.
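The measurement itself is simple scaling: the HBM2 die footprint (commonly cited at roughly 7.75 mm × 11.87 mm, about 92 mm^2) serves as the ruler for the photo. The pixel counts below are hypothetical, purely to show the method:

```python
hbm2_long_edge_mm = 11.87        # commonly cited HBM2 die long edge
hbm2_long_edge_px = 415          # hypothetical measurement from the photo
mm_per_px = hbm2_long_edge_mm / hbm2_long_edge_px

die_width_px, die_height_px = 868, 870   # hypothetical measurements of the compute die
die_area_mm2 = (die_width_px * mm_per_px) * (die_height_px * mm_per_px)
print(f"{die_area_mm2:.0f} mm^2")        # ~618 mm^2 with these example pixel counts
```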

For those that don’t know, in semiconductor manufacturing, scribe lines (also known as streets or saw lanes) are the narrow spaces or lines that separate individual die on a silicon wafer. These lines are reserved for the cutting or dicing process, which separates the individual dies from each other, after all the necessary fabrication steps have been completed. Firms also often put test structures in these scribe lines. Chip design teams usually refer to the smaller area (without scribe), while packaging teams are more concerned with the final singulated chips (with scribe).