
TPUv5e: The New Benchmark in Cost-Efficient Inference and Training for <200B Parameter Models

Latency, Performance, Fine-tuning, Scaling, and Networking


During its Cloud Next 2023 event, Google announced general availability of its latest AI chip, the TPUv5e (TPUv5 lite), and it is a game changer due to the performance/TCO it brings for both Google and new Cloud TPU customers. It is straight up a massive cost advantage for many external parties to train and inference models with less than 200 billion parameters.

TPUv5e also enables Google to inference models that are larger than OpenAI's at the same cost OpenAI pays to inference its smaller model. This will massively help Google level the playing field, because they can play the brute force game that no one else can. OpenAI will have to rely on being much smarter with their chips and algorithms due to the massive compute deficit versus Google. AI chips from Amazon (Trainium/Inferentia), Meta (MTIA), and Microsoft (Athena) are all nowhere close to where Google is.

Today we want to detail this game-changing chip by demonstrating its performance/TCO advantage with data on the cost of training GPT-3 and the cost of inference for LLAMA-65B. Furthermore, we want to discuss list price as well as the discounted cloud pricing being offered by Google, and how that compares to various GPU pricing.

Hilariously, it makes economic sense for OpenAI to use Google Cloud with the TPUv5e to inference some models, rather than A100 and H100 through Microsoft Azure, despite their favorable deal. Of course, there’s a whole host of political/business reasons this probably never happens.

Before we dive in, we just want to assuage Sam Altman's concerns. This is in no way Google marketing, and we have no contacts at Google's marketing or HR departments. They don't pay us outside of subscribing to the newsletter. This is analysis based on factual data from Google and a 3rd party AI startup using TPUv5e. The prior figures he's referring to are shipment numbers for the full-size TPUv5, from the supply chain.

To answer Elon, they are not wrong.

Now that we are done being cheeky, let's discuss the chip and system before moving on to real performance. The TPUv5e (TPUv5 lite) is the successor to the TPUv4i (TPUv4 lite), and should not be confused with the main line of TPUv4 (Pufferfish) and TPUv5 (Viperfish). The TPUv4 lite was externally given the i suffix for being an inference chip; the TPUv5 lite now gets the e suffix for efficiency. In the past, most of our focus has been on the full-scale chips, despite the lite chips being used heavily in Google's internal inference workloads. From TPUv4i to TPUv5e, this changes, because the small chip now actually makes sense to use externally.

The TPUv5 and its smaller sibling, the TPUv5e, are clearly not designed for peak performance at the cost of everything else. They both have significantly lower power, memory bandwidth, and FLOPS than Nvidia's H100. This is a conscious decision by Google, not just an indicator of worse chip design. Google, because it designs its own chips and acquires them through Broadcom, pays significantly lower margins for them. As such, power consumption, networking cost, system cost, and deployment flexibility are much larger contributors to the total cost of ownership (TCO) of the chip over the course of 4+ years.

In Nvidia's model, due to their massive gross margins on the hardware, their customers' TCO equation is dominated by Capex; the Opex costs are relatively much smaller. Therefore, it is more logical to push the H100 to 2x the power consumption of a TPUv5 and ~5x that of a TPUv5e to squeeze out far more performance. Furthermore, differences in Nvidia's architecture and SKU lineup make it much more conducive to massive chips. Google's lack of SKUing and its massive tensor units mean it cannot yield-harvest or approach Nvidia's >90% parametric yield on its AI chips. For these reasons, Google goes for a lower-power, smaller chip, not only with the TPUv5e, but also with the TPUv5. The TPUv5e is ~325mm^2.

Google's TPUs have either one or two Tensor Cores inside them. This was true of the TPUv4 and the TPUv4i (lite), and the TPUv5e (lite) likewise takes a step back from the unannounced TPUv5 (Viperfish). The TPUv5e has only a single Tensor Core, unlike the TPUv5, which includes two. Furthermore, it has half the HBM stacks, running at lower speeds. Lastly, the networking is neutered. Each Tensor Core has 4 Matrix Multiply Units (MXUs), a vector unit, and a scalar unit. The MXU is based on a 128 x 128 systolic array of multiply/accumulators. MXUs provide the bulk of the compute power in a Tensor Core; each MXU can perform 16,000 multiply-accumulate operations per cycle. The TPUv5e delivers 197 BF16 TFLOPS and 393 Int8 TOPS.
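As a quick sanity check, the quoted BF16 figure is consistent with four MXUs per Tensor Core (128 x 128 = 16,384 MACs each) running at roughly 1.5 GHz; the clock here is our own back-of-envelope inference, not a number Google has published.

```python
# Back-of-envelope check of the quoted 197 BF16 TFLOPS (implied clock, not official).
mxus_per_core = 4
macs_per_mxu = 128 * 128          # 16,384 multiply-accumulates per cycle per MXU
flops_per_mac = 2                 # one multiply + one add
quoted_flops = 197e12

flops_per_cycle = mxus_per_core * macs_per_mxu * flops_per_mac   # 131,072
implied_clock_ghz = quoted_flops / flops_per_cycle / 1e9
print(f"Implied clock: ~{implied_clock_ghz:.2f} GHz")            # ~1.50 GHz
```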

The Tensor Core communicates with 16 GB of HBM2E memory running at 3200 MT/s, for a total memory bandwidth of 819.2 GB/s. There are up to 256 TPUv5e chips in a pod, which spans 4 dual-sided racks with 8 TPUv5e sleds per side. Each sled holds four TPU chips, along with a CPU and a 100G NIC. Each set of 4 TPUs shares 112 vCPUs. These are actually 64-core AMD chips, so it appears that Google still requires CPU cores for the hypervisor and is unable to run it on their NICs.
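The 819.2 GB/s figure also lines up with two HBM2E stacks at 3200 MT/s, assuming the standard 1024-bit interface per stack; the stack count is our inference rather than an official spec.

```python
# Inferring the HBM configuration from the quoted bandwidth (1024-bit/stack assumed).
transfer_rate = 3200e6            # transfers per second (3200 MT/s)
bits_per_stack = 1024             # standard HBM2E interface width per stack
quoted_bw = 819.2e9               # bytes per second

bw_per_stack = transfer_rate * bits_per_stack / 8     # 409.6 GB/s per stack
stacks = quoted_bw / bw_per_stack                     # 2 stacks of 8 GB each
print(f"{bw_per_stack/1e9:.1f} GB/s per stack -> {stacks:.0f} stacks")
```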

Google lets you rent from 1 to 256 TPUv5e chips, with linear cost scaling as you add chips.

Each TPU connects to 4 other TPUs, to the north, south, east, and west, at 400Gbps (400G Tx, 400G Rx) via the inter-chip interconnect (ICI). This gives each TPU a staggering 1.6T of aggregate bandwidth, which is very high relative to the compute and memory bandwidth of the TPUv5e. Google paid special attention to minimizing the number of optics, in a way others don't, to further reduce costs. Unlike the TPUv4 and TPUv5, there is no OCS in the ICI inside the pod. The topology is flat. No twisted torus or anything fancy. This saves a lot at the system level.
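To put the ICI in perspective, here is the per-chip aggregate bandwidth compared against the HBM bandwidth, using the figures above; the ratio is our own arithmetic.

```python
# Per-chip ICI aggregate bandwidth vs HBM bandwidth (derived ratio).
ici_links = 4                     # north, south, east, west
ici_gbps_per_link = 400           # per direction
hbm_gbs = 819.2                   # GB/s

ici_aggregate_tbps = ici_links * ici_gbps_per_link / 1000      # 1.6 Tbps
ici_gbs = ici_aggregate_tbps * 1000 / 8                        # 200 GB/s
print(f"ICI: {ici_gbs:.0f} GB/s, ~{ici_gbs / hbm_gbs:.0%} of HBM bandwidth")  # ~24%
```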

Multiple pods can be connected over the datacenter spine network. The 100G NIC per TPUv5e sled means there is 6.4T of pod-to-pod Ethernet-based interconnect. In addition, Google has multi-pod configurations available. These inter-pod connections go through the OCS.
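The 6.4T number follows directly from the sled count; a quick check using the 4-chips-per-sled layout described above:

```python
# Pod-to-pod Ethernet bandwidth from one 100G NIC per 4-chip sled.
chips_per_pod = 256
chips_per_sled = 4
nic_gbps = 100

sleds = chips_per_pod // chips_per_sled          # 64 sleds
pod_to_pod_tbps = sleds * nic_gbps / 1000        # 6.4 Tbps
print(f"{sleds} sleds x {nic_gbps}G = {pod_to_pod_tbps} Tbps pod-to-pod")
```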

Google shared figures for performance scaling all the way up to 4096 TPUv5e's, which is 16 TPUv5e pods. While that indicates Google has 16 of these pods in one datacenter, we believe they have more than 128 TPUv5e pods (32k TPUv5e) in just one datacenter, based on the video they released.

On the software side, Google has built a lot of tooling to make this easy to use, covering everything from compilers to software that makes batching easier. While Jax+XLA works best, the Pytorch+XLA backend still delivers pretty good performance, meaning many can get away with little to no code changes. For most users, taking an existing LLM and running inference is as easy as on a GPU, and it is perhaps easier to get high utilization than on an Nvidia GPU, because good GPU inference requires a lot of manual work. This is mostly due to the closed nature of TensorRT, which makes it unusable for anything outside cookie-cutter models or further optimizations such as speculative decoding, as well as the deprecation of / lack of effort on FasterTransformer.
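To give a flavor of how little user-side code changes, below is a minimal PyTorch/XLA sketch of moving a model onto a TPU device; the tiny model is a placeholder for whatever LLM you already have, and real serving would add batching, KV caching, and quantization on top.

```python
# Minimal PyTorch/XLA inference sketch. The model is a stand-in; swap in your own
# checkpoint. Requires the torch_xla package on a Cloud TPU VM.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # the TPU device, same role as cuda:0

model = nn.Sequential(                        # placeholder for a real LLM
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 32000),
).to(device).eval()

x = torch.randn(32, 4096).to(device)          # batch of 32 "hidden states"
with torch.no_grad():
    logits = model(x)
xm.mark_step()                                # flush the lazily traced XLA graph
print(logits.shape)
```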

Below, let's share training costs for a TPUv5e pod vs A100s and H100s for GPT-3. Furthermore, let's also share the inference costs for LLAMA-65B. Inference latency will also be shared.

GPT-3 training costs are quite interesting. Even if you assume the price paid is $1.1/hr per SXM A100 and $2/hr per H100 (basically the best deals we have seen since the shortages), the two chips fall behind the TPUv5e for models under 200B parameters. All figures are BF16.

We have seen prices as low as a third of a dollar per hour for the TPUv5e, versus a list price of $0.40, so we use $0.33/hr for the TPUv5e. With those inputs, given training time numbers shared by various firms such as MosaicML on GPUs and Google on TPUv5e, we see the cost of pre-training as $514k on A100, $393k on H100, and $222k on TPUv5e.
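Backing the implied chip-hours out of those costs and hourly rates is straightforward; these numbers are derived from the figures quoted above, not from independently measured training runs.

```python
# Implied chip-hours for pre-training GPT-3, derived from the quoted costs and rates.
rate_per_hour = {"A100": 1.10, "H100": 2.00, "TPUv5e": 0.33}             # $/chip-hour
training_cost = {"A100": 514_000, "H100": 393_000, "TPUv5e": 222_000}    # $ total

for chip, cost in training_cost.items():
    chip_hours = cost / rate_per_hour[chip]
    print(f"{chip}: ~{chip_hours:,.0f} chip-hours")

# Wall-clock time on a single 256-chip TPUv5e pod:
pod_days = (training_cost["TPUv5e"] / rate_per_hour["TPUv5e"]) / 256 / 24
print(f"Single TPUv5e pod: ~{pod_days:.0f} days")   # ~110 days, in line with the ~100 days below
```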

Mind you, on a single pod of 256 TPUv5e's this would take ~100 days to train the full GPT-3, so there would need to be more testing of multi-pod performance, but Google claims the multi-pod performance scaling is near perfect. Even a full TPUv5e pod, without multi-pod, would still be great for fine-tuning 175B-class GPT-3-size models. It seems pretty clear that if you're fine-tuning LLAMA-65B models, you should use TPUv5e given the pricing.

The above is mostly based on Google's claims. Below, we also have figures from an AI startup that has run LLAMA-65B on TPUv5e with int8. It should be noted that the parallelism strategies are quite different.

The startup tells us they utilized an 8xTPUv5e slice and were able to achieve ~35ms per token at a batch size of 32 with int8 quantization! That is quite usable and very strong performance.

Assuming the $0.33 per TPUv5e-hour cost mentioned earlier, that is less than $0.0007 per 1k tokens for LLAMA-65B. This number could be brought down further if you let latency per token creep up. That smashes GPUs and gets better latency than many folks' implementations too. There is a big limitation in that the TPUv5e only has 16GB per chip, but this is a manageable problem, at least for smaller models (less than 200B parameters), thanks to the TPU's more flexible parallelism enabled by the interconnect.
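Working that through with the figures above (the rate and latency are as quoted; the exact per-1k-token number shifts a bit depending on where the ~35ms actually lands):

```python
# Cost per 1k tokens for LLAMA-65B on an 8-chip TPUv5e slice, from the quoted figures.
chips = 8
rate_per_chip_hour = 0.33        # $/chip-hour, the discounted rate mentioned earlier
batch = 32
ms_per_token = 35                # ~35 ms per token at batch size 32, int8

tokens_per_hour = batch * (1000 / ms_per_token) * 3600      # ~3.3M tokens/hour
cost_per_hour = chips * rate_per_chip_hour                  # $2.64/hour
cost_per_1k = cost_per_hour / (tokens_per_hour / 1000)
print(f"${cost_per_1k:.4f} per 1k tokens")                  # ~$0.0008 at exactly 35 ms;
                                                            # a few ms faster lands under $0.0007
```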

In comparison, MosaicML is charging $0.002 per 1k tokens and Together is charging $0.003 per 1k tokens on GPUs. Together's latency is much worse too, which hurts user experience, while also costing 3x the TPUv5e cost. Of course, the above assumes you can feed the batch sizes, which is difficult for most smaller players but not for the giants. API access is probably still better on cost when you don't care about your prompts containing sensitive info, or when you have extremely bursty usage instead of the more consistent usage of the giants. These APIs also have some margin embedded.

It seems pretty clear that GPT-3.5 turbo should be inferenced on TPUv5e given these costs. We believe GPT-4 is probably too large to fit properly across TPUv5e’s and would require full scale TPUv5’s.

Nonetheless, OpenAI would probably net win on cost if they could access TPUv5e at fair prices (<$0.35) and move all their GPT-3.5 Turbo inference over from GPUs. Plus, the GPUs they currently have could then be redirected to other tasks. The cost advantage of TPUv5e exists even with OpenAI's special advantaged pricing from Azure for A100s and H100s.

Just to be clear, we don’t think OpenAI will use it, but all the other AI startups and enterprises should look at it very seriously.