Faster than Nvidia? Dissecting the economics
Groq, an AI hardware startup, has been making the rounds recently because of their extremely impressive demos showcasing the leading open-source model, Mistral's Mixtral 8x7B, on their inference API. They are achieving up to 4x the throughput of other inference services while charging less than a third of what Mistral themselves charge.

Groq has a genuinely amazing performance advantage for an individual sequence. This could enable techniques such as chain of thought to be far more usable in the real world. Furthermore, as AI systems become more autonomous, LLM output speeds need to rise for applications such as agents. Likewise, codegen needs significantly lower token output latency. Real-time Sora-style models could be an incredible avenue for entertainment. These services may not even be viable or usable for end-market customers if latency is too high.
This has led to an immense amount of hype regarding Groq’s hardware and inference service being revolutionary for the AI industry. While it certainly is a game changer for certain markets and applications, speed is only one part of the equation. Supply chain diversification is another one that lands in Groq’s favor. Their chips are entirely fabricated and packaged in the United States. Nvidia, Google, AMD, and other AI chips require memory from South Korea, and chips/advanced packaging from Taiwan.
These are positives for Groq, but the primary formula for evaluating if hardware is revolutionary is performance / total cost of ownership. This is something Google understands intimately.
The dawn of the AI era is here, and it is crucial to understand that the cost structure of AI-driven software deviates considerably from that of traditional software. Chip microarchitecture and system architecture play a vital role in the development and scalability of these innovative new forms of software. The hardware infrastructure on which AI software runs has a notably larger impact on capex, opex, and ultimately gross margins than in earlier generations of software, where developer costs loomed relatively larger. Consequently, optimizing AI infrastructure deserves considerable attention from anyone deploying AI software. Firms with an infrastructure advantage will also have an advantage in their ability to deploy and scale AI applications.
Google AI Infrastructure Supremacy: Systems Matter More Than Microarchitecture
Google’s infrastructure supremacy is why Gemini 1.5 is significantly cheaper to serve for Google vs OpenAI GPT-4 Turbo while performing better in many tasks, especially long sequence code. Google uses far more chips for an individual inference system, but they do it with better performance / TCO.
Performance in this context isn't just the raw tokens per second for a single user, i.e. latency optimized. When evaluating TCO, one must account for the number of users being served concurrently on the hardware. This is the primary reason why improving edge hardware for LLM inference has a very tenuous or unattractive tradeoff. Most edge systems won't make up for the increased hardware costs required to properly run LLMs, because edge systems cannot be amortized across massive numbers of users. As for serving many users at extremely high batch sizes, i.e. throughput and cost optimized, GPUs are king.
As we discussed in the Inference Race to the Bottom analysis, many firms are genuinely losing money on their Mixtral API inference service. Some also impose very low rate limits to cap their losses. We dove deeper into quantization and other GPU hardware options such as the MI300X in the report, but the key takeaway is that those serving an unmodified model (FP16) required batch sizes of 64+ to turn a profit. We believe that Mistral, Together, and Fireworks are serving Mixtral at breakeven to slight profit margins.

The same cannot be said for others offering Mixtral APIs. They are either lying about quantization, or lighting VC money on fire to acquire a customer base. Groq, in a bold move, is matching these folks with their extremely low $0.27 per million token pricing.
Is their pricing because of a performance/TCO calculation like Together and Fireworks?
Or is it subsidized to drive hype? Note that Groq’s last round was in 2021, with a $50M SAFE last year, and they are currently raising.
Let’s walk through Groq’s chip, system, a costing analysis, and how they achieve this performance.

Groq's chip has a fully deterministic VLIW architecture, with no buffers, and it reaches a ~725mm2 die size on Global Foundries' 14nm process node. It has no external memory, and it keeps weights, KV cache, activations, etc. all on-chip during processing. Because each chip only has 230MB of SRAM, no useful model can actually fit on a single chip. Instead, many chips must be networked together to fit the model.

In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model. Compare that to Nvidia where a single H100 can fit the model at low batch sizes, and two chips have enough memory to support large batch sizes.
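To make the scale concrete, here is a quick back-of-the-envelope sketch. The ~46.7 billion parameter count for Mixtral and the FP16 weight format are our own assumptions for illustration, not figures Groq has disclosed.

```python
# Back-of-the-envelope: why serving Mixtral on Groq takes hundreds of chips.
# Assumptions (ours): Mixtral 8x7B has ~46.7B total parameters, weights held
# in FP16 (2 bytes each), and each LPU provides 230 MB of on-chip SRAM.

racks, servers_per_rack, chips_per_server = 8, 9, 8
total_chips = racks * servers_per_rack * chips_per_server        # 576 chips
total_sram_gb = total_chips * 230 / 1024                         # ~129 GB of aggregate SRAM

params_billion = 46.7                                            # assumed parameter count
weights_gb = params_billion * 2                                  # ~93 GB of FP16 weights

print(f"Chips: {total_chips}, aggregate SRAM: {total_sram_gb:.0f} GB")
print(f"FP16 weights alone: {weights_gb:.0f} GB, leaving roughly "
      f"{total_sram_gb - weights_gb:.0f} GB for KV cache, activations, and per-stage duplication")
```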
The wafers used to fabricate Groq's chip likely cost less than $6,000 each. Compare this to Nvidia's H100, an 814mm2 die on a custom variant of TSMC's 5nm called 4N, where wafer cost is closer to $16,000. On the flip side, Groq's architecture seems less amenable to yield harvesting than Nvidia's, which achieves extremely high parametric yield by disabling ~15% of the die on most H100 SKUs.
Furthermore, Nvidia buys 80GB of HBM from SK Hynix for ~$1,150 for each H100 chip. Nvidia also has to pay for TSMC's CoWoS packaging and take the yield hit there, whereas Groq has no off-chip memory at all. Groq's raw bill of materials for their chip is significantly lower. Groq is also a startup, so they have much lower volume and higher relative fixed costs per chip, which includes having to pay Marvell a hefty margin for their custom ASIC services.
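To put rough numbers on the silicon gap, the sketch below uses a classic gross-dies-per-wafer approximation together with the wafer price estimates above. It ignores defect yield, scribe lines, and partial edge dies, so the resulting per-chip figures are illustrative, not actual pricing.

```python
import math

def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Classic approximation: wafer area over die area, minus an edge-loss term."""
    d = wafer_diameter_mm
    return int(math.pi * (d / 2) ** 2 / die_area_mm2 - math.pi * d / math.sqrt(2 * die_area_mm2))

# Wafer prices are the estimates quoted above; yield harvesting is ignored.
groq_dies = gross_dies_per_wafer(725)            # GF 14nm, ~$6,000/wafer
h100_dies = gross_dies_per_wafer(814)            # TSMC 4N, ~$16,000/wafer

groq_silicon = 6_000 / groq_dies
h100_silicon = 16_000 / h100_dies + 1_150        # plus ~$1,150 of HBM; CoWoS cost excluded

print(f"Groq: ~{groq_dies} gross dies/wafer, ~${groq_silicon:,.0f} of silicon per chip")
print(f"H100: ~{h100_dies} gross dies/wafer, ~${h100_silicon:,.0f} of silicon + HBM per chip")
```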
The table below presents three deployments: one for Groq, with their current pipeline parallelism and batch size 3 (which we hear they will implement in production next week), alongside a latency optimized H100 inference deployment with speculative decoding and a throughput optimized H100 inference deployment.

The table above greatly simplifies the economics, ignoring significant system level costs (which we will dive into later) as well as Nvidia's massive margin. The point here is to show that Groq has a chip architectural advantage in terms of dollars of silicon bill of materials per token of output versus a latency optimized Nvidia system.
8xA100s can serve Mixtral at a throughput of ~220 tokens per second per user, and 8xH100s can hit ~280 tokens per second per user without speculative decoding. With speculative decoding, the 8xH100 inference unit can achieve throughputs approaching 420 tokens per second per user. The throughput could exceed this figure, but implementing speculative decoding on MoE models is challenging.
Latency optimized API services currently do not exist because the economics are so bad. The API providers currently do not see a market for charging 10x more for lower latencies. Once agents and other extremely low latency tasks become more popular, GPU based API providers will likely spin up latency optimized APIs alongside their current throughput optimized APIs.
Latency optimized Nvidia systems with speculative decoding are still quite far behind on throughput and cost versus Groq without speculative decoding, once Groq implements its batching system next week. Furthermore, Groq is using a much older 14nm process technology and paying a sizable chip margin to Marvell. If Groq gets more funding and can ramp production of their next generation 4nm chip, coming in ~H2 2025, the economics could begin to change significantly. Note that Nvidia is far from a sitting duck, as we think they are going to announce their next generation B100 in less than a month.
In a throughput optimized system, the economics change significantly. Nvidia systems attain an order of magnitude better performance per dollar on a BOM basis, albeit with lower throughput per user. Groq is not architecturally competitive at all for throughput optimized scenarios.
However, the simplified analysis presented above isn’t the right way to look at the business case for people that are buying systems and deploying them, as that analysis ignores system costs, margins, power consumption, and more. Below, we present a performance / total cost of ownership analysis instead.
Once we account for these factors, the Tokenomics (cred to swyx for the swanky new word) look very different. On the Nvidia side we will use the GPU cloud economics explained here and shown below.

Nvidia applies a huge gross margin to their GPU baseboards. Furthermore, the $350,000 price charged for the server, which is well above the hyperscaler cost for an H100 server, also includes significant costs for memory, 8 InfiniBand NICs with an aggregate bandwidth of 3.2Tbps (not needed for this inference application), and a decent OEM margin stacked on top of Nvidia's margin.
For Groq, we are estimating system costs, factoring in details regarding the chip, package, networking, CPUs, and memory, while assuming a lower overall ODM margin. We are not including the margin Groq would charge for selling hardware either, so while it may seem apples vs. oranges, it's a fair comparison of Groq's cost vs an inference API provider's costs, as both are serving up the same product/model.

It's noteworthy that 8 Nvidia GPUs only need 2 CPUs, but Groq's 576 chip system currently has 144 CPUs and 144 TB of RAM.
Adding up these component costs, we arrive at $35,000 per Groq LPU server, which includes 8 Groq LPUs and all of the other hardware listed above. The Mixtral Groq inference deployment uses 8 racks of 9 servers each, or $2,520,000 for an inference deployment with 576 LPU chips in total. By comparison, a typical H100 HGX system costs $350,000 of upfront capex and includes 8 H100s. Most H100-based Mixtral inference instances only use 2 H100 chips, so 4 inference deployments can be had per H100 server.

The total amortized monthly capital cost of an H100 system (i.e. H100 HGX with 8 H100 GPUs) is $8,888 USD, assuming an 18% cost of capital/hurdle rate and a 5-year useful life, while the monthly hosting cost is $2,586 USD, for a total cost of ownership of $11,474 per month. The total cost of ownership for the much larger Groq system comes out to $122,400 USD per month.
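For those who want to reproduce the amortization, the figures above are consistent with treating the capex as a standard annuity at the 18% hurdle rate over 60 months; a minimal sketch under that assumption:

```python
# Monthly amortized capital cost treated as a standard annuity payment.
# Assumption: 18% annual cost of capital, 5-year (60-month) useful life.

def monthly_capital_cost(capex: float, annual_rate: float = 0.18, months: int = 60) -> float:
    r = annual_rate / 12
    return capex * r / (1 - (1 + r) ** -months)

h100_hgx_capex = 350_000                 # 8x H100 HGX server
groq_capex = 8 * 9 * 35_000              # 72 Groq servers at $35,000 each = $2,520,000

print(f"H100 HGX: ${monthly_capital_cost(h100_hgx_capex):,.0f}/month")               # ~$8,888
print(f"Groq 576-chip deployment: ${monthly_capital_cost(groq_capex):,.0f}/month")    # ~$63,991
```

Adding the respective hosting and power costs on top of these amortized capital costs yields the monthly total cost of ownership figures quoted above.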
While the entire Groq system (9 servers times 8 racks) carries 7.2x the monthly amortized capital cost ($63,991 for Groq vs $8,888 for the H100 HGX), the Groq system delivers 13.7x the FP16 FLOPS (108,000 TFLOPS for Groq vs 7,912 TFLOPS for the H100 HGX). Due to the memory wall, an H100-based inference system typically has low FLOPS utilization, while Groq's architecture sidesteps the memory wall with on-chip SRAM. With that said, for some reason, whether it be the lack of buffers or the VLIW architecture, Groq's FLOPS utilization is lower than Nvidia's even with next week's push to batch size 3.
Unlike the inference API providers that are purchasing systems with an 80%+ gross margin (Nvidia plus OEM stacking), Groq effectively purchases its systems at cost. They do have to pay margin to Supermicro and Marvell for the system and chips respectively, but nowhere close to that of the API providers and their GPU cloud providers. The result is that Groq's total cost of ownership is far less dominated by capital cost, with capital cost standing at 52% of total cost of ownership for Groq compared to almost 80% for an H100 system.

The 8xH100 latency optimized inference deployment has a cost of $5.20 USD per million tokens. The 2xH100 throughput optimized inference deployment has a cost of $0.57 USD per million tokens. By comparison, Groq achieves a cost of $1.94 USD per million tokens, so it is faster and cheaper than the 8xH100.
Like many inference providers, Groq is currently operating with negative gross margins and will need to boost throughput by over 7x in order to break even. This is far closer than a latency optimized inference deployment like the 8xH100 unit, which is nearly 20x off from breakeven if it used the same pricing.
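Those breakeven multiples fall straight out of dividing the estimated serving cost per million tokens by Groq's $0.27 price; a minimal sketch using our cost estimates from above:

```python
# Breakeven multiple = estimated serving cost per million tokens / API price.
# Cost figures are our TCO-based estimates above; the price is Groq's public $0.27/M tokens.

api_price_per_mtok = 0.27

estimated_cost_per_mtok = {
    "Groq 576-chip deployment (batch 3)": 1.94,
    "8x H100, latency optimized": 5.20,
}

for system, cost in estimated_cost_per_mtok.items():
    multiple = cost / api_price_per_mtok
    print(f"{system}: {multiple:.1f}x cost/throughput improvement needed to break even")
```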
In addition to selling an inference API service, Groq's business model also includes selling its systems outright. If Groq sold its systems at a 60% gross margin to a 3rd party operator, this would roughly match the capital intensity of the H100 HGX's total cost of ownership and would work out to a system price of about $6,350,000.
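That implied sale price is simply the estimated deployment cost grossed up by the margin, i.e. price = cost / (1 - margin):

```python
# Implied system sale price at a 60% gross margin, using our deployment cost estimate.
deployment_cost = 2_520_000
gross_margin = 0.60
implied_price = deployment_cost / (1 - gross_margin)
print(f"Implied sale price: ${implied_price:,.0f}")   # ~$6.3M, in line with the ~$6,350,000 above
```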
Groq claims to have a power advantage, but we can't see it. Even with the most pessimistic assumption for H100 servers, 10kW, which would include the CPUs and all 8 NICs running full blast, the H100 server is more efficient than the 576 chip Groq deployment, which requires 230kW in total, or 3.2kW per 8 chip server. Groq claimed a performance per watt advantage, but we do not see how that is calculated.

We should note that while Groq is currently losing money on their API and needs more than a 7.2x improvement to break even (even after implementing batch size 3 next week), they claim to have a roadmap of improvements that will get them past breakeven over the next few quarters. This is generally through three vectors of improvement:
Continuing compiler work to improve throughput,
A new server architecture which dramatically reduces non-chip costs, including the card, and uses fewer CPUs and less DRAM,
Deploying larger systems, which enable above-linear performance scaling because more pipelines allow much higher batching, and which ultimately support larger models as well.
While each of these items is logical in isolation, and we are hopeful they can achieve this, a 7x improvement is quite sizable.
There are a few major challenges we see.
Currently the largest MoE models sit in the 1-2 trillion parameter range, but we expect Google and OpenAI to launch >10 trillion parameter models over the next year that will require inference systems of hundreds of GPUs and tens of TB of memory. While Mixtral is the most important model for fine tuning, API services, and on-premises deployments today, this will not be the case in ~3 months. LLAMA 3 and larger Mistral models are coming.
Though Groq has demonstrated that they can build systems for serving <100 billion parameter models, even LLAMA 3 sized models will struggle to fit on a theoretical future thousand-chip system. Groq plans to have 1 million chips deployed in two years, with each individual inference system planned to be larger than the current 576 chip deployment.
This also flows into a similarly difficult challenge we see with extremely large context lengths. Google showed off a 10,000,000 token context length on Gemini 1.5 Pro, and it's amazing. That's enough for 10 hours of video, 110 hours of audio, 300k lines of code, or 7 million words. We expect many firms and services to offer models with massive sequence lengths that fit entire codebases and document libraries into the prompt, due to far superior performance vs RAG, which usually falls on its face in the real world.

Sure, prefill will take a long time initially (Google showed a full minute to get the first token output), but after that, the prefill costs are amortized across many requests. The massive customer-specific prompt doesn't need to be recomputed frequently. While Google isn't doing standard attention, which scales as O(n²), Gemini 1.5 Pro still requires hundreds of GB if not TBs of memory to hold the KV cache.
We struggle to see how Groq could ever implement extremely large context lengths given the KV cache size requirements. This would require systems of tens of thousands of chips, instead of the tens or hundreds of chips used in Google, Nvidia, and AMD based inference solutions.
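To illustrate why, here is a hedged KV cache sizing sketch for a hypothetical dense-attention model at a 10 million token context. The layer count, grouped-query KV head count, and head dimension below are illustrative assumptions, not Gemini 1.5 Pro's actual configuration.

```python
# Rough KV cache sizing at very long context. All model dimensions are illustrative
# assumptions, not the actual configuration of Gemini 1.5 Pro or any other model.

layers = 64              # hypothetical transformer layer count
kv_heads = 8             # hypothetical grouped-query KV head count
head_dim = 128           # hypothetical per-head dimension
bytes_per_value = 2      # FP16

# Keys and values: two tensors per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # ~0.26 MB/token

context_tokens = 10_000_000
kv_cache_tb = kv_bytes_per_token * context_tokens / 1e12                   # ~2.6 TB

sram_per_lpu_gb = 0.23                                                     # 230 MB per Groq chip
chips_for_kv_only = kv_cache_tb * 1000 / sram_per_lpu_gb

print(f"KV cache at 10M tokens: ~{kv_cache_tb:.1f} TB")
print(f"Groq chips needed just to hold that KV cache: ~{chips_for_kv_only:,.0f}")
```

Under these assumptions the KV cache alone lands in the multi-terabyte range, which translates to on the order of ten thousand Groq chips before a single weight is stored.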
Groq's ability to network chips at low latency is impressive, but it would be extremely difficult to scale that to the tens of thousands of chips required for ultra long context with moderately sized models like Gemini 1.5 Pro or extremely large models like GPT-5 and Gemini Ultra 2.
This brings into question the useful life of these gut wrenchingly big AI buildouts. For GPUs, due to their flexibility, it's easy to see them still being useful for new models in 4 years. Due to Groq's lack of DRAM, it is much tougher to see that kind of flexibility as model sizes continue to soar. If this becomes an issue, then the depreciable life of their systems would no longer be 5 years, but much shorter, which would massively increase costs.
Another challenge for Groq is that speculative decoding and techniques such as Medusa are getting better at a rapid clip. Tree/branch speculation approaches are leading to upwards of 3x speedups with speculative decoding. If these can be deployed efficiently on production grade systems, then an 8x H100 system could achieve over 600 tokens per second. That alone would blow away Groq’s advantage in speed.
Groq also says they have plans to implement speculative decoding in the future, but we don’t understand how that will work with their deterministic architecture. Generally, speculative decoding entails trading FLOPS for bandwidth efficiencies achieved through higher batch sizes. Groq is mostly limited by FLOPS and networking, not SRAM bandwidth. Groq would need to massively grow their batching capabilities beyond 3 to implement speculative decoding in an effective manner.
Lastly, the B100 is being announced next month and shipping in the 2nd half, and rumors point to perf/TCO improvements of over 2x vs the H100. Furthermore, Nvidia is moving extremely fast, with the B200 to be launched two quarters later, and X/R100 another two quarters after that. Nvidia is not a static target.
With that said, if Groq can efficiently scale out to systems of thousands of chips, the number of pipelines would increase massively, and with that increase in pipelines would come extra SRAM for more KVCache per pipeline stage. That in turn would enable large batch sizes >10 and potentially bring down costs massively. We see it as a possibility, but maybe not a high probability one. It’s up to Groq to prove us wrong and massively increase their throughput.
The question that really matters, though, is whether low latency small model inference is a large enough market on its own, and if it is, whether it is worth building specialized infrastructure when flexible GPU infrastructure can get close to the same cost and be redeployed fairly easily for throughput or large model applications.