- Intro
- Key Findings
- Executive Recommendation to AMD
- A Summary of the AMD vs Nvidia Narrative
- General Matrix Multiply (GEMM) Performance
- Popular GEMM Benchmark Isn't Accurate
- HBM Memory Bandwidth Performance
- AMD Hand-Crafted VIP Custom Builds and WIP Development Builds
- Dec 21st AMD Development Builds
- Training Testing Methodology (GPT1.5B, Llama 8B, Llama 70B, Mistral)
- Single Node Training Performance
- Multi-Node Training Performance
- AMD PYTORCH_TUNABLE_OPS FLAG is a Bad User Experience
- Scale Up NVLink/xGMI Topology
- All Reduce/All to All/Reduce Scatter/All Gather Collectives Overview
- Single Node NCCL Collective
- Multi Node RCCL/NCCL Collectives and Scale Out Network Benchmarks
- AMD's User Experience is Suboptimal and the MI300X is Not Usable Out of the Box
- Exploring Ideas for Better Performance on AMD
- AMD’s Forked Libraries
- Detailed Recommendations to AMD on How to Fix Their Software
- H100/H200/MI300X Networking BoM Analysis and Performance per TCO
- Further Experiments
- Benchmarking Warmup/Repeats Effects
- VBoost Power Shifting
- BF16 vs FP16
- Input Distribution Affects Performance
- FLOP per GPU PicoJoule
- PyTorch PyPi Distribution vs. Nvidia NGC Stable PyTorch Images
Intro
SemiAnalysis has been on a five-month-long quest to settle the reality of the MI300X. In theory, the MI300X should be at a huge advantage over Nvidia’s H100 and H200 in terms of specifications and Total Cost of Ownership (TCO). However, the reality is that the on-paper specs given below are not representative of the performance that can be expected in a real-world environment. If AMD could deliver the marketed performance below with this memory, it would be a very strong competitor in the market.

Source: SemiAnalysis, Nvidia, AMD
Today we are going to talk through our five-month journey conducting independent analysis and training-focused benchmarking of the MI300X, the H100 and the H200, engaging with both NVIDIA and AMD. We will give a detailed overview of the numerous low-level benchmarks that we ran; see the table of contents for a summary. Furthermore, we will compare the total cost of ownership of Nvidia and AMD GPUs and factor in performance. Ultimately, much of what we are doing amounts to an open, comprehensive public recommendation to AMD on what they need to do to be competitive and fix their software issues after five months of submitting and squashing bugs. It’s not just that the software is immature; AMD needs to change how it does development.
In short, when comparing Nvidia’s GPUs to AMD’s MI300X, we found that the potential on-paper advantage of the MI300X was not realized due to deficiencies in AMD’s public release software stack and a lack of testing from AMD.
AMD’s software experience is riddled with bugs, rendering out-of-the-box training with AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience. As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates.
We shared benchmark source code and intermediate test results for the GEMM benchmark and single-node training with both Nvidia and AMD, held calls and discussions to solicit feedback and implement improvements to the benchmarks, and worked with AMD to implement bug fixes for its software stack.
Our goal with this highly iterative interaction was to ensure that our tests are an unbiased evaluation of what real-world users would experience.
We initially planned to publish this article a few months ago but wanted to take the extra time to engage with the AMD team and explore possible fixes or development work. We spent considerable time identifying and fixing AMD software bugs so that we could give AMD every chance to show the MI300X unhindered by AMD software stack bugs, as opposed to only showing problematic out-of-the-box performance. To give a fair impression, we also explain the considerable amount of tuning and bug-squashing work that it took to get there. We think this approach provides users with the best possible level of transparency.
We wanted to contribute in any way we could to try to improve the AMD ecosystem. Though AMD software is much better now due to our bug reports and tire-kicking, its public software stack still falls short. We have open-sourced many of the benchmarks and created simple one-liner commands to reproduce them.
If Lisa Su and the AMD Leadership redouble their investment with a focus on their software and testing stack, they have a chance to be competitive with Nvidia on training. We think the engineers at AMD are extremely capable and are doing their best to advance the AMD ecosystem – and indeed support from these engineers in the form of bug fixes, configuration help and custom images improved the results we were able to get from the MI300X.
To bring our benchmarking process to a coda, on November 15th, 2024 we sent Nvidia and AMD a draft of most of our major GEMM and single node benchmarking code and results for comments, verification, and fine-tuning. We asked that any final comments, fixes, feedback and any performance improvements be submitted by November 25th. We set this time frame to crystallize test results to allow time to write an in-depth analysis and commentary and carry out multiple rounds of internal and external reviews, all steps that can take a variable and often unknowable amount of time, typically from 2-4 weeks.
A few days ago, after we informed both parties that we had confirmed an article publication date of December 20th, AMD requested that we delay publication to include results based on a beta WIP development build on an AMD developer’s branch. All of our benchmarking on Nvidia was conducted on publicly available stable release builds. In the spirit of transparency and fairness, we include those results, as well as updated testing-harness results for both the original November 25th deadline image and the latest publicly available software. However, we believe that the correct way to interpret the results is to look at the performance of the public stable releases of AMD and Nvidia software.
Below is the list of software builds that we used for benchmarking:
- H100 Public Stable Release – Out-of-box experience for the Nvidia H100.
- H200 Public Stable Release – Out-of-box experience for the Nvidia H200.
- MI300X Nov 25th Custom Build – A hand-crafted custom VIP docker image, written by AMD principal engineers, that builds all dependencies from source code.
- MI300X Stable Public Release PyTorch 2.5.1 – Out-of-box experience for the AMD MI300X.
- MI300X Public Nightly Dec 19th – This can indicate where AMD performance could be by January 2025, when PyTorch 2.6 is released, over a year after the MI300X’s launch.
- MI300X Dec 21st WIP dev build – The image that AMD submitted to us after we agreed to delay publication of the article. It is an experimental development build that has not yet been merged into AMD’s internal main branch, and it does not use the native PyTorch flash attention API. Performance with this image can indicate where AMD public stable release performance will be 1-2 quarters from now.
We are very thankful for the technical support provided by AMD and Nvidia throughout this process, but we maintain our independence in the results we publish. We want to shout out to and thank our AMD counterparts, Anush Elangovan (AMD VP of AI), Hui Liu, and the many dozens of amazing AMD Principal/Senior Engineers, VPs of Engineering, Engineering Fellows, CVPs of Engineering, Directors of Engineering, and Software Library Leads for triaging and fixing our various bug reports. On the Nvidia side, we are grateful to Kedar Potdar, Ian Buck, Sylvain Jeaugey and the NCCL team from NVIDIA for their amazing support.
Thank you to Crusoe, TensorWave (AMD Ventures Portco), Nebius, Lambda, Hot Aisle and Sustainable Metal Cloud (SMC) / Firmus for the compute and for being supporters of open-source benchmarking. Crusoe, Nebius, SMC / Firmus and Lambda support managed SLURM and shared home directories out of the box. TensorWave currently has managed SLURM in beta and this feature will come to general availability (GA) at the start of next year. Sustainable Metal Cloud is one of the few neoclouds that has official MLPerf GPT-3 175B Training results.
We will be releasing a follow-up article on inference for the H100, H200 and MI300X. We may also release a follow-up article in a few months to revisit AMD training performance, see whether the out-of-box experience has improved, and test other models such as LLaVA and Mamba.

Key Findings
- Comparing on-paper FLOP/s and HBM bandwidth/capacity is akin to comparing cameras by merely examining megapixel count. The only way to tell the actual performance is to run benchmarks.
- Nvidia’s out-of-the-box performance and experience is amazing, and we did not run into any Nvidia-specific bugs during our benchmarks. Nvidia assigned a single engineer to us for technical support, but we didn’t run into any Nvidia software bugs, so we didn’t need much support.
- AMD’s out-of-the-box experience is very difficult to work with and can require considerable patience and elbow grease to move towards a usable state. On most of our benchmarks, public stable releases of AMD PyTorch are still broken and we needed workarounds.
- If we weren’t supported by multiple teams of AMD engineers triaging and fixing the bugs in AMD software that we ran into, AMD’s results would have been much lower than Nvidia’s.
- We ran unofficial MLPerf Training GPT-3 175B on 256 H100s in collaboration with Sustainable Metal Cloud to test the effects of different VBoost settings.
- For AMD, real-world performance on publicly released stable software is nowhere close to its on-paper marketed TFLOP/s. Nvidia’s real-world performance also undershoots its marketed TFLOP/s, but not by nearly as much.
- The MI300X has a lower total cost of ownership (TCO) compared to the H100/H200, but training performance per TCO is worse on the MI300X on public stable releases of AMD software. This changes if one uses custom development builds of AMD software.
- Training performance is weaker, as demonstrated by the MI300X’s matrix multiplication micro-benchmarks, and single-node training throughput on AMD public release software still lags that of Nvidia’s H100 and H200.
- MI300X performance is held back by AMD software. AMD MI300X software on BF16 development branches has better performance, but it has not yet been merged into the main branch of AMD’s internal repos. By the time it gets merged into the main branch and into the PyTorch stable release, Nvidia Blackwell will have already been available to everyone.
- AMD’s training performance is also held back because the MI300X does not deliver strong scale-out performance. This is due to its weaker ROCm Compute Communication Library (RCCL) and AMD’s lower degree of vertical integration with networking and switching hardware, compared to Nvidia’s strong integration of its Nvidia Collective Communications Library (NCCL), InfiniBand/Spectrum-X network fabric and switches.
- Many AMD AI libraries are forks of NVIDIA AI libraries, leading to suboptimal outcomes and compatibility issues.
- AMD customers tend to use hand-crafted kernels only for inference, which means their performance outside of very narrow, well-defined use cases is poor, and their flexibility to adapt to rapidly shifting workloads is non-existent.
Executive Recommendation to AMD
We genuinely want to see another effective competitor to Nvidia and want to help AMD get to that spot, but, unfortunately, there is still much work to be done on that front. At the bottom of this article, we have a detailed list of feedback for Lisa Su and the AMD Leadership Team, but provide a summary here:
- Give AMD engineers more compute and engineering resources to fix and improve the AMD ecosystem; they have very few internal GPU boxes relative to what Nvidia provides to its engineers. TensorWave, the largest AMD GPU cloud, has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs.
- AMD needs to hook up thousands more MI300X and MI325X GPUs to PyTorch CI/CD for automated testing to ensure there are no AMD performance regressions or functional AMD bugs. Nvidia has given thousands of GPUs for PyTorch CI/CD to ensure an amazing out-of-box experience.
- The AMD executive team should personally and intensively internally test (i.e., “dogfood”) products that are being shipped to the public rather than focus on testing internal builds. Preferably dogfood during a livestream (twitch.tv) to show the authentic out-of-box experience, similar to how geohotz livestreams.
- AMD should collaborate with Meta to get production LLM training workloads working as soon as possible on PyTorch ROCm, AMD’s answer to CUDA, as the PyTorch code paths that Meta isn’t using commonly have numerous bugs.
- Move away from over-reliance on properly setting numerous environment flags (up to dozens) to make an AMD deployment usable. Instead, bake these settings into the default configuration. Make the out-of-the-box experience usable!
- Focus on making the out-of-box experience good instead of over-relying on custom VIP images that build all dependencies from source code at main@specificcommit and take 5 hours to build.
- Stop expecting end users to use PYTORCH_TUNABLE_OPS, which is a buggy prototype feature and is not respectful of end users’ time, as it takes ~1 hour to re-tune every time an end user wants to make any changes to their code.
- AMD should submit MLPerf Training GPT-3 175B results. MLPerf is an apples-to-apples benchmarking methodology that uses time to convergence as the north star.
- We want AMD to be competitive and are open to meeting to provide more detailed feedback on how to fix the AMD Datacenter GPU Ecosystem for the better.
A Summary of the AMD vs Nvidia Narrative
Before we dive into various facets of AMD’s software stack that hold AMD back, we will discuss the MI300X’s basic specifications, its comparative total cost of ownership, and how most analysts and investors have evaluated its competitiveness.
The MI300X launched in late 2023 with an exciting set of on-paper specifications: 1,307 TFLOP/s of FP16 compute (stronger than the H100’s 989 TFLOP/s), 5.3 TB/s of memory bandwidth, and 192GB of HBM3, versus the H100’s 3.35 TB/s of memory bandwidth and 80GB of HBM3. These specs also outstrip those of the H200, which itself is, effectively, a memory-spec-bumped version of the H100, delivering 4.8 TB/s of memory bandwidth and 141GB of HBM3e.

Source: SemiAnalysis, Nvidia, AMD
On paper total cost of ownership for an MI300X deployment is extremely compelling, not only due to the lower ASP of the MI300X, but also because it is typically deployed using cheaper Ethernet networking. Comparing a cluster of 16k H200s vs a 16k MI300X ethernet cluster leads to nearly 40% of the cost savings coming from networking alone, with the remainder of the savings from a lower accelerator cost. The use of Whitebox Ethernet switches is a substantial cost savings compared to using Nvidia’s Quantum-2 switches, but the real difference is cheaper transceivers, as Nvidia branded transceivers cost as much as 2-3x over what a typical transceiver OEM charges.
At face value, the MI300X seems the best of both worlds: higher performance and lower total cost of ownership. At the time of its launch, it was logical to expect share gains to the underdog AMD from this compelling combination. The table below shows total upfront cluster capex – we present a more detailed breakdown of cluster capex components as well as a detailed networking BoM analysis in the sections at near the bottom of the article.

Source: SemiAnalysis AI TCO Model
As orders solidified, excitement built up over the potential of the MI300X, helped along by bullish commentary and guidance from AMD. With a compelling spec advantage, it was easy to argue for further upside to AMD’s guidance, which most investors assumed management was sandbagging. AMD had a strong hand, in theory. After all, they have mid-single-digit market share in datacenter GPUs for 2024, and, logically, a glide path towards even 10-12% market share by 2027 could be conservative while offering considerable earnings upside for AMD.
However, from late 2023 through most of 2024, guidance for full-year 2024 datacenter GPU sales repeatedly underperformed those lofty expectations. From its 1Q24 earnings through its 3Q24 earnings, AMD only raised guidance from $4B to $5B, well under the $6-8B investor bogey based on CoWoS and HBM supply agreements. Our demand view in the Accelerator Model tracked Microsoft’s disappointment early in the year and lack of follow-on orders.
The earlier bullish line of reasoning was like purchasing a certain car model from a magazine without a test drive or soliciting feedback from owners of that model or reading any reviews. But fear not – SemiAnalysis has put the MI300X, H100, and H200 through their paces at scale and can show why AMD’s current software stack issues decisively disprove this line of reasoning.
General Matrix Multiply (GEMM) Performance
Most FLOPS in a transformer-based architecture (i.e. ChatGPT, Llama, etc.) go towards matrix multiplication, also known as GEMMs. For this reason, GEMM performance is a good proxy for how well frontier transformers, such as ChatGPT, Llama, Claude, Grok, etc. will train on the hardware.
GEMMs take two input matrices, Matrix A and Matrix B, with Matrix A having shape (M,K) (M rows and K columns) and Matrix B having shape (K,N), to produce an output matrix of shape (M,N).

Conceptually, each element of the resulting matrix is a sum of element-wise multiplications along the “K” dimension of the inputs. For this reason, the K dimension is also known as the reduction dimension.
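Written out explicitly, output element (m, n) is: C[m, n] = sum over k of A[m, k] * B[k, n], with k running over the K (reduction) dimension.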

Below, we have tested the following real-world shapes, given in the form (M,N,K)—which is short for multiplying a matrix of dimensions (M,K) and (K,N) together.
These following matrix shapes were actually used in Meta’s Llama 70B production training:
- (16384, 8192, 1280) – Fused QKV Projection GEMM shape
- (16384, 1024, 8192) – Attention Output Projection shape
- (16384, 8192, 7168) – FFN GEMM shape
- (16384, 3584, 8192) – FFN GEMM shape
- (8192, 8192, 8192) – Standard GEMM shape for benchmarking
We used OpenAI’s do_bench function for the benchmark setup, an industry-standard method of benchmarking PyTorch. The do_bench function provides cache clearing between runs by default and provides ways to warm up and execute the benchmark multiple times, taking the median result. We used warmup=30 and rep=200 for these tests. Both input tensors A and B were randomly initialized with a normal distribution with mean 0 and variance 1, because a normal distribution comes the closest to matching the actual distribution of weights and activations in modern neural networks. The distribution of the input tensors affects the results of the TFLOP/s performance benchmark. We will discuss the reasons why the input distribution affects TFLOP/s performance later in the article.
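For reference, a minimal sketch along these lines, using triton.testing’s do_bench (the do_bench referenced above), looks like the following; this is a simplified illustration rather than our exact open-sourced harness:

```python
# Times BF16 matmuls for the Llama 70B shapes listed above; do_bench handles
# warmup, repeated execution, and cache clearing between iterations.
import torch
from triton.testing import do_bench

shapes = [  # (M, N, K)
    (16384, 8192, 1280),   # fused QKV projection
    (16384, 1024, 8192),   # attention output projection
    (16384, 8192, 7168),   # FFN
    (16384, 3584, 8192),   # FFN
    (8192, 8192, 8192),    # standard benchmarking shape
]

for M, N, K in shapes:
    A = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)  # N(0, 1) inputs
    B = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)
    ms = do_bench(lambda: A @ B, warmup=30, rep=200, return_mode="median")
    tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
    print(f"({M}, {N}, {K}): {tflops:.0f} TFLOP/s")
```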
For BF16, we can see that the H100 and H200 achieve roughly 720 TFLOP/s against their marketed 989.5 TFLOP/s, while the MI300X reaches a mere ~620 TFLOP/s compared with its marketed 1,307 TFLOP/s.
This means that, despite a much higher marketed BF16 TFLOP/s, the MI300X is 14% slower than the H100 and H200. This AMD result used a custom docker image that was hand-crafted by an AMD principal engineer, yet it still achieved slower performance than Nvidia’s GPUs. In our out-of-the-box testing of the MI300X, the TFLOP/s throughput was even slower than this! In addition to a custom image, AMD also requires the user to set numerous environment flags that aren’t set by default to reach these performance results.

Unfortunately, the story is worse for FP8. The H100/H200 achieves ~1,280 TFLOP/s out of the marketed 1,979 TFLOP/s. The MI300X, in comparison, only reaches ~990 TFLOP/s. Thus, for FP8, the MI300X is 22% slower than the H100. This is with both inputs being of the e4m3 FP8 (i.e. 4 exponent bits and 3 mantissa bits) datatype.

It is important to note that calling a GEMM is a simple task, and we shouldn’t expect to run into AMD software bugs. Unfortunately, a major bug that we encountered is that the torch.matmul and F.linear APIs delivered different performance on AMD for a couple of months during the summer. One would expect the torch.matmul and F.linear APIs to have the same performance, but, surprisingly, F.linear was much slower!
This is a strange bug, as torch.matmul and F.linear are both wrappers around the hardware vendor’s GEMM libraries, so they should achieve the same level of performance. F.linear, in particular, is important, as this is the way most end users in PyTorch launch GEMM kernels.
When we started testing AMD five months ago, public AMD PyTorch still had this bug. The root cause was that AMD in fact has two different underlying GEMM libraries, rocBLAS and hipBLASLt, with hipBLASLt being more optimized for the MI300X. The bug was that torch.matmul used the optimized hipBLASLt, but AMD had not changed F.linear’s default, leaving it to use the unoptimized rocBLAS library.
This major bug was ultimately fixed by AMD a few months ago after our bug reports, and we hope it doesn’t reappear due to a lack of proper regression testing. AMD’s usability could improve considerably if it boosted its testing efforts instead of waiting for users to discover these critical issues.
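A quick check of the kind that would have caught this regression looks roughly like the following sketch (this is not AMD’s or our exact test); both paths should report similar TFLOP/s once they dispatch to the same tuned GEMM backend:

```python
import torch
import torch.nn.functional as F
from triton.testing import do_bench

M, N, K = 16384, 8192, 7168
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)  # F.linear expects (out_features, in_features)

# torch.matmul and F.linear compute the same GEMM, so a large throughput gap
# between the two indicates a dispatch/regression problem rather than hardware.
for name, fn in [("torch.matmul", lambda: x @ w.t()),
                 ("F.linear", lambda: F.linear(x, w))]:
    ms = do_bench(fn, warmup=30, rep=200, return_mode="median")
    print(f"{name}: {2 * M * N * K / (ms * 1e-3) / 1e12:.0f} TFLOP/s")
```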
We have open-sourced the GEMM benchmark used in our tests into a simple three-liner that anyone can easily run:

Source: SemiAnalysis
Popular GEMM Benchmark Isn’t Accurate
Recently, a benchmark has been floating around the internet that claims that, on GEMMs, AMD MI300X’s performance is close to that of the H100.

There are two main issues with the benchmark: it does not properly carry out L2 cache clearing, and it simply takes the max performance, instead of the median/mean TFLOP/s, over the course of the iterations for a specific shape. Without L2 cache clearing between iterations, the benchmark does not accurately reflect real-world GEMM performance. Furthermore, since the TFLOP/s change based on which iteration it is on, you need to use a mean/median over at least 100 iterations as the basis for an accurate GEMM benchmark. OpenAI’s do_bench provides L2 cache clearing and mean/median reporting out of the box by default, so we recommend that engineers use it for micro-benchmarking. Below, we have simplified the benchmark into pseudocode and have commented on the issues mentioned above.
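The original pseudocode figure is not reproduced here; the sketch below is our own illustration of the two pitfalls described above, with assumed shapes and iteration counts:

```python
import torch
from triton.testing import do_bench

A = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
B = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)

# Flawed approach: no L2 cache clearing between iterations, and only the
# *fastest* iteration is reported, overstating real-world GEMM throughput.
best_ms = float("inf")
for _ in range(10):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    A @ B
    end.record()
    torch.cuda.synchronize()
    best_ms = min(best_ms, start.elapsed_time(end))

# Preferred approach: do_bench clears the cache between iterations and reports
# the median over many repetitions.
median_ms = do_bench(lambda: A @ B, warmup=30, rep=200, return_mode="median")
```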

HBM Memory Bandwidth Performance
It is widely known that the AMD MI300X has better memory bandwidth than the Nvidia H100 and H200, offering 5.3 TB/s of bandwidth vs 4.8 TB/s for the H200 and 3.35 TB/s for the H100. Improved HBM memory bandwidth is very useful in inference and is sometimes useful in training. In training, users can set a larger batch size if they have more HBM memory capacity and memory bandwidth. However, if a larger global batch size is used, then beyond a certain size the model will take longer to converge. It is easy to run fast with a big global batch size, but past a point it hurts time to convergence.
From our HBM memory bandwidth benchmarking, we see that the MI300X indeed has far better memory bandwidth than both the H200 and the H100. We tested memory bandwidth in PyTorch with Tensor.copy_ and used the industry-standard OpenAI do_bench to ensure accuracy.
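A minimal sketch of this style of copy-based bandwidth test follows (the tensor size here is an assumption; the open-sourced harness may differ):

```python
import torch
from triton.testing import do_bench

n_bytes = 8 * 1024**3                               # 8 GiB per tensor
src = torch.empty(n_bytes // 2, device="cuda", dtype=torch.bfloat16)
dst = torch.empty_like(src)

ms = do_bench(lambda: dst.copy_(src), warmup=30, rep=200, return_mode="median")
# copy_ reads the source and writes the destination, so total HBM traffic is 2x.
print(f"{2 * n_bytes / (ms * 1e-3) / 1e12:.2f} TB/s")
```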
As you will see in our upcoming H100 vs H200 vs MI300X inference article, memory bandwidth is very important for inferencing.


Source: SemiAnalysis
AMD Hand-Crafted VIP Custom Builds and WIP Development Builds
The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60-command Dockerfile that builds dependencies from source, hand-crafted by an AMD principal engineer, was specifically provided for us, since the PyTorch Nightly and public PyTorch AMD images functioned poorly and had version differences. This docker image requires ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience that takes but a single line of code. Most users do not build PyTorch and hipBLASLt from source code but instead use the stable release.
When using public PyTorch, users have the choice of working with the latest stable images or a nightly PyTorch upload. Although a nightly PyTorch upload may have the latest commits that could potentially lead to better performance or fix some bugs, users must accept that the upload may not be fully tested and could contain new, not-yet-discovered bugs from Meta/AMD/Nvidia or other PyTorch contributors. Note that most end users are using the stable release of PyTorch.

Source: SemiAnalysis, AMD

Delightfully, Nvidia’s Docker images contain the complete set of developer tools needed for profiling and debugging, like Nsight Compute and Nsight Systems. AMD, in contrast, does not include their OmniTrace developer tool out of the box.
Until a couple of weeks ago, the AMD docker images only supported PyTorch 2.3, which was released 8 months ago. Mainline PyTorch 2.4 and PyTorch 2.5 have since been released, and PyTorch 2.6 is about to come out in Q1 2025. We recommended to an AMD Principal Engineer and to AMD’s VP of AI that AMD should have the latest AMD PyTorch version, and AMD has since started publishing containers for some of these AMD PyTorch versions. A Docker image for AMD PyTorch 2.5 is still missing.

Dec 21st AMD Development Builds
Below is AMD’s December 21st development build docker image. As you can see, it uses a number of non-stable development branches for dependencies such as hipBLASLt, AOTriton and ROCm Attention, and it installs everything, including PyTorch, from source code, taking upwards of 5 hours to build. These versions of the dependencies haven’t even been merged into AMD’s own main branch yet. 99.9% of users will not install PyTorch and all of its dependencies from source code on development branches; they will instead use the public stable PyPI PyTorch.
Furthermore, instead of using Flash Attention through the PyTorch-native, user-friendly torch.scaled_dot_product_attention API, this AMD development build imports an attention implementation from another library (also on a development branch). We have seen more users use Flash Attention through the PyTorch-native torch.scaled_dot_product_attention API since it is more user-friendly and bundled into out-of-box PyTorch. Even AMD’s own public documentation recommends using Flash Attention through the torch.scaled_dot_product_attention API. We hope that these kernels get merged into PyTorch flash attention instead of making the end user install a separate library that takes hours of their time to build. This is not a user-friendly experience. Furthermore, AMD must support FlexAttention, as it has quickly become the go-to in the industry.
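For reference, the PyTorch-native path referenced above is a single call; the shapes here are illustrative:

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 32, 4096, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch dispatches to a flash-attention backend when one is available,
# with no separate attention library to install or build.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```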
AMD’s December 21st dev build is on a hanging development branch, meaning it is a branch that has not been fully QA’ed and is a use-at-your-own-risk branch. There are many concerns about the validity of results that come from using development builds and branches and building from source code, as most users are not doing this in real life. Most users will be installing AMD/Nvidia PyTorch from the PyPI stable release, so we recommend readers keep this in mind when analyzing these results.
That being said, we are including these development build results as an indication of where AMD public stable release software will be 1-2 quarters from now. At the same time, 1-2 quarters from now, Nvidia Blackwell will already be widely deployed, while the AMD MI355X will not commence shipments until H2 2025.

Source: SemiAnalysis, AMD
Training Testing Methodology (GPT1.5B, Llama 8B, Llama 70B, Mistral)
There are many ways to test training performance. The most accurate way is to take a medium-sized AI startup model’s internal codebases and run them on a 512-1024 GPU cluster. This way, the test run has all the optimizations that a typical user would have. Everything else is just a proxy for the performance of these training runs. Training performance takes into account HBM bandwidth, HBM capacity, TFLOP/s, networking, and system architecture. Comparing on paper HBM bandwidth/capacity is just like comparing on paper camera megapixels.
MLPerf GPT-3 175B Training is also a good proxy to measure the time it takes to train to a specific convergence. The MLPerf benchmark considers global batch sizes and whether a mixed-precision implementation incurs a convergence penalty. Unfortunately, MLPerf is quite difficult to run due to a lack of user-friendly documentation and instructions, and the performance is often min-maxed via a custom-tuned configuration specifically concocted for MLPerf that an average user would not adopt. Note that Nvidia has submitted MLPerf Training results with over 11k H100s, while AMD runs MLPerf Training only internally. AMD’s results are likely weak, as they have never submitted any MLPerf Training results, let alone the MLPerf GPT-3 175B benchmark.
When designing our SemiAnalysis benchmark, we wanted to reflect the average user’s model implementation, and so opted for the torch.scaled_dot_product_attention API (which uses the flash attention backend), PyTorch Distributed Data Parallel (DDP) and/or Fully Sharded Data Parallel (FSDP), with torch.compile. Also note that AMD recommends users use torch.scaled_dot_product_attention in its own documentation. We believe this is the most representative of a typical user workload. Further, we used a generic PyTorch-native implementation of these models to keep it close to a typical ML scientist user and make it easy to run with a single line of code. In contrast to MLPerf, the goal of our benchmark is to be as simple to run as possible while still being a good proxy for performance. Note that, since we don’t take into account time to convergence, this benchmark has a slight bias towards AMD, as we set the micro batch size higher on AMD than on Nvidia. When taking time to convergence into account, AMD’s results would be worse than what is stated.
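To make the setup concrete, here is a condensed sketch of this style of benchmark, using a toy stand-in model rather than the actual GPT/Llama/Mistral implementations from our open-sourced harness; it would be launched with torchrun --nproc_per_node=8:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

# A toy stand-in model; the real benchmark uses PyTorch-native transformer
# implementations built on scaled_dot_product_attention.
model = torch.nn.Sequential(
    torch.nn.Embedding(32000, 4096),
    torch.nn.TransformerEncoderLayer(4096, 32, 14336, batch_first=True),
    torch.nn.Linear(4096, 32000),
).cuda().to(torch.bfloat16)
model = torch.compile(DDP(model, device_ids=[rank]))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, 32000, (1, 2048), device="cuda")
for _ in range(10):                      # a few timed steps per configuration
    logits = model(tokens)
    loss = logits.float().mean()         # placeholder loss for the sketch
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```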
As an aside, many AI practitioners have said they are not using Megatron or NeMo or 3D parallelism due to the high level of complexity and lack of flexibility associated with those libraries, whose rigidity and complexity make their usage for ML research effectively impossible. Note that, in terms of 3D parallelism, both Nvidia and AMD can get higher performance, assuming their software stack works, which is a big assumption for AMD. AMD Megatron is a fork of Nvidia Megatron and has fewer than 10 stars, which means that it is probably not dogfooded well; getting AMD Megatron working even for simple models would likely take extra months of submitting bug reports.
For our SemiAnalysis model training benchmark, we will test four models, with the first being a simple GPT 1.5B DDP, as we believe this is representative of what small-scale experiments/ablations would look like before scaling out to bigger model sizes. DDP is a much simpler and less network-intensive form of parallelism. Next, we tested the standard Llama3 8B and Llama3 70B 4 Layer Proxy as a baseline for a popular model’s performance. Third, we tested Mistral 7B v0.1, which evaluates whether hardware will perform well when adding a bit of complexity, as Mistral uses sliding window attention instead of standard causal attention. Modern models such as ChatGPT, Claude, Gemini, o1 and o3 do not use standard causal attention and instead use more complex attention mechanisms.
A Modern GPT/Llama/Transformer model is built by stacking the same transformer layer over & over again. As such, measuring the performance of just 4 layers is a great proxy for the overall performance of the model.

Furthermore, in modern LLM training for all frontier LLM models, pipeline parallelism is used, which means that a couple of transformer layers are placed in each GPU server. Never in modern pretraining is a whole model placed on a single node.

The model FLOP for each token trained is defined by the following formula:
6 * non_input_embedding_params + 12 * num_layers * num_heads * head_dim * max_seq_len * density
Here, density is the fraction of the full attention mask that is actually computed. Causal attention, for example, has a density of roughly 50%, while sliding window attention has an even lower density.
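As a worked example, the formula can be wrapped in a small helper; the parameter values below are illustrative rather than tied to a specific model:

```python
def model_flop_per_token(non_input_embedding_params: int, num_layers: int,
                         num_heads: int, head_dim: int, max_seq_len: int,
                         density: float = 0.5) -> float:
    # 6 * non_input_embedding_params covers the dense matmuls;
    # the second term covers attention score/value FLOP, scaled by mask density.
    return (6 * non_input_embedding_params
            + 12 * num_layers * num_heads * head_dim * max_seq_len * density)

# e.g. an 8B-parameter-class config: 32 layers, 32 heads of dim 128, 8192 seq len,
# ~7.5B non-input-embedding params, causal attention (density = 0.5)
print(f"{model_flop_per_token(7.5e9, 32, 32, 128, 8192):.3e} FLOP per token")
```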
Note that originally our testing harness used 6 * params instead of 6 * non_input_embedding_params, which is the wrong way of calculating model FLOP per token. Furthermore, there was another bug regarding the way we used FSDP. We have since updated our testing harness, retroactively retested, and updated all benchmark results across all versions of software for the H100, H200 and MI300X: public stable, public nightly, VIP images and AMD development builds. All results listed below are with the updated testing harness.
Single Node Training Performance
Note that the H100/H200 performance we present in this report reflects out-of-the-box performance without any hand-crafted tuning from Nvidia engineers, while the results for the MI300X come after many months of tuning and bug fixes from AMD’s engineers. We did not run into any Nvidia-specific bugs, whereas AMD training was comparatively bug-filled. Five months ago, many models couldn’t run at more than 150 TFLOP/s on the AMD MI300X due to an AMD software bug in the attention backwards pass and torch.compile, which forced the user to manually mark a region of the model as non-compilable instead of having a full graph compile.
We see that, for all models, the H100/H200 wins relative to the MI300X public release, public nightly release, and Nov 25th build-from-source VIP image. It is interesting that the MI300X does not perform well on smaller models such as GPT 1.5B or on any model that uses a non-causal attention layer, like Mistral 7B v0.1. This is due to FlexAttention not being fully operational at the time of the deadline, while, on Nvidia GPUs, it has been working since August 2024. As such, the H100/H200 beats the MI300X by more than 2.5x in terms of TFLOP/s for the MI300X public release, public nightly release and Nov 25th VIP build.
For the Dec 21st MI300X internal WIP development-branch build, we still see it perform worse than the H100/H200 on GPT 1.5B. Furthermore, it performs slightly worse than the H100 on Mistral 7B. For Llama3 8B and the Llama3 70B Proxy, the Dec 21st MI300X WIP development build performs better than the H100/H200, but note that this is because the MI300X WIP development build uses an AMD engineer’s development branch that has not even been merged into the AMD main branch.

Three months ago, attempting to do FP8 training on AMD led to segfaults and hard errors. On the off chance it did work, it was, in fact, slower than the same run using BF16. We worked with AMD’s FP8 team to fix this issue, as well as the AMD hipBLASLt team, which created tuning to fix MI300X FP8 performance. FP8 training is important as it speeds up training compared to BF16, and most frontier labs use FP8 training.
After many fixes, we can see that the MI300X’s Nov 25th throughput for Llama3 8B and GPT 1.5B is somewhat competitive with H100’s. As usual, H200 wins in this category. However, for Llama3 70B 4 Layer Proxy, AMD Nov 25th’s results are sorely beaten.
For Mistral 7B which has a non-causal attention layer, AMD Nov 25th performance is close to half that of an H100. This shows that, for anything that isn’t a simple model, even after months of tuning, AMD is still not competitive due to a slight tweak in the model structure. Many frontier models and AI training startups are using complex attention layers for long context spans and efficient attention, but, AMD is still far behind on those.
Unfortunately, FP8 training on AMD only works on custom images such as our November 25th VIP image and December 21st WIP development branch image. When we first started trying AMD FP8 Training, it was slower than AMD BF16 Training on public releases.

For AMD’s WIP development builds, we see that on Llama3 8B, it wins against H100 but is still slower than H200’s public stable software release. H200 performance completely beats MI300X even on their Dec 21st WIP development branches.
It is interesting that the MI300X does not perform well on models with a non-causal attention layer, like Mistral 7B v0.1, even on AMD’s internal builds. Mistral uses sliding window attention, which some frontier models also use. It seems that if you want to train a model that doesn’t use standard causal attention, the AMD MI300X automatically loses.
While a lot of people put out performance comparisons between hardware, most do not open-source their testing code or make it easily reproducible. We took an open-source approach: we have open-sourced our single-node training benchmark and made it easy to run with only a couple of lines:

Source: SemiAnalysis
Multi-Node Training Performance
For multi-node, we benchmarked two nodes of H100 and two nodes of MI300X. Unfortunately, we didn’t get access to a multi-node H200 deployment in time for the article.
The H100 wins again by a big margin in this benchmark compared to the MI300X, ranging from 10-25% faster. This gap widens as you add more nodes working together on a single training workload. This is a known problem, which AMD is attempting to fix next year by deploying its new in-house 400G AI-focused NIC.
AMD PYTORCH_TUNABLE_OPS FLAG is a Bad User Experience
In order to get AMD training working decently, users need to use PYTORCH_TUNABLE_OPS, an AMD-specific prototype flag for the end user to tune GEMMs. Since this is a prototype feature (i.e. not stable), a lot of bugs with this feature cropped up in the past, including but not limited to seg faults, HBM memory leaks, and a whole host of other issues such as many unit tests being disabled. These known tunable ops bugs have been fixed now, but there are likely many more unknown AMD software bugs.
Furthermore, even if users do not encounter any bugs and the runway is clear for this prototype AMD flag to work, it still takes users anywhere from 1-2 hours to tune any modern LLM model. Although these GEMM tunings can be cached by the end user, any minor changes to the end user’s code result in the need to spend another 1-2 hours tuning. As you can imagine, this slows down an ML scientist’s iteration cycle speed when trying to conduct model R&D and ablation experiments.
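For reference, the workflow looks roughly like the sketch below; the exact environment variable names are taken from PyTorch’s TunableOp documentation as we understand it and may differ across PyTorch/ROCm versions:

```python
import os

# These must be set before the first GEMM is dispatched (assumed variable names).
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # use tuned GEMM algorithms
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # search for them: the slow, ~1-2 hour step
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # cache of tuned solutions

import torch  # noqa: E402
# ... run enough training steps that every GEMM shape in the model gets tuned ...
# The cached CSV is only valid until a code change alters the GEMM shapes,
# at which point the tuning step has to be repeated.
```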
On Nvidia, this flag isn’t needed, as their GEMM library (cuBLASLt) comes tuned out of the box, and cuBLASLt’s heuristic model picks the correct algorithm for most shapes on the H100/H200. In contrast, AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required of the end user.
We recommend that AMD fix their GEMM libraries’ heuristic model so that it picks the correct algorithm out of the box instead of wasting end users’ time on tuning. Users often iterate quickly when doing research, and therefore rerunning tunable ops will slow down research velocity significantly.
Scale Up NVLink/xGMI Topology
Scale up fabric is extremely important for GPU Clusters, as it provides an extremely fast path for tensor and expert parallelism used in frontier model training. For this reason, we have conducted benchmarks to measure scale up fabric performance.
The scale up fabric on H100 and H200 is called NVLink and provides 450GByte/s of bandwidth per GPU and connects 8 GPUs together. On the MI300X, the scale up fabric is called xGMI and, on paper, it connects 8 GPUs, providing 448GByte/s of bandwidth per GPU. On the surface, MI300X’s scale up network is extremely similar and close in performance to that of the H100/H200, providing just 0.5% less on paper bandwidth. Unfortunately, the reality of the situation differs sharply.
First, MI300X's xGMI is a point-to-point fabric, which means that it isn't actually providing 448GByte/s of bandwidth between GPU pairs. Instead, each GPU can only talk to any one other GPU at 64GByte/s. A GPU can only reach the stated 448GByte/s if it addresses all 7 other GPUs simultaneously. That means that the maximum bandwidth is 64GByte/s for Tensor Parallelism TP=2 and 189GByte/s for TP=4.
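The arithmetic below illustrates the point; it is napkin math over the paper link specs, not a benchmark (the 189GByte/s figure above reflects what is realizable in practice rather than the raw sum of link speeds).

```python
# Napkin math: per-GPU scale-up bandwidth available to a tensor-parallel group on a
# point-to-point fabric (xGMI) vs. a switched fabric (NVLink). Paper specs, not measurements.
XGMI_LINK_GBPS = 64      # each MI300X GPU-pair link
NVLINK_GBPS = 450        # any GPU pair through the NVSwitches on H100/H200

def xgmi_tp_bandwidth(tp_degree: int) -> int:
    # Only the links to the (tp_degree - 1) peers in the TP group are usable.
    return XGMI_LINK_GBPS * (tp_degree - 1)

for tp in (2, 4, 8):
    print(f"TP={tp}: xGMI ~{xgmi_tp_bandwidth(tp)} GB/s per GPU vs NVLink ~{NVLINK_GBPS} GB/s")
```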

In contrast, since Nvidia's NVLink uses a switched topology, one GPU can talk to another GPU at the full 450GByte/s. Furthermore, the four NVSwitches in H100/H200 support in-network reduction (referred to as NVLink SHARP (NVLS), enabled by default), a technique to reduce data movement by carrying out collectives/reductions inside the switch itself.

All Reduce/All to All/Reduce Scatter/All Gather Collectives Overview
We will showcase benchmarks across scale-up and scale-out networks for both the Nvidia H100/H200 and AMD’s MI300. The collectives that we will be testing are the main set of collectives used in frontier LLM training: all_reduce, all_gather, reduce_scatter, and all to all. All reduce is for data parallelism and tensor parallelism, all gather is used for ZeRO/FSDP parallelism (as well as for tensor parallelism), and Reduce Scatter is used for ZeRO/FSDP parallelism.
Due to the way that compute-communication overlapping works, real-world message sizes range from 16MiB to 256MiB, with the default PyTorch DDP size being 25MiB (NVIDIA’s MLPerf 11,000 H100 GPT-3 175B run used a message size of max 200MiB). We also test 8GiB and 16GiB just to see what the peak bus bandwidth is, though these message sizes are not used in the real world. All these collectives discussed above are used during 3D Parallelism and FSDP/ZeRO Parallelism, which are common techniques for training frontier models.
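As a concrete example of where these message sizes come from, the sketch below shows the DDP gradient bucket size, which sets the all_reduce message size that overlaps with the backward pass; bucket_cap_mb is a standard DistributedDataParallel argument, while the model and process-group setup are placeholders.

```python
# Minimal sketch: DDP gradient buckets determine real-world all_reduce message sizes.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")       # the "nccl" backend maps to RCCL on ROCm builds
model = torch.nn.Linear(4096, 4096).cuda()

# Default bucket is 25 MiB; large frontier runs push this up towards ~200 MiB.
ddp_model = DDP(model, bucket_cap_mb=25)
```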


Single Node NCCL Collective
We see that Nvidia does much better than AMD across all the real-world message sizes for every single collective. This is not surprising given the H100/H200's superior 450GByte/s NVLink switched topology with in-network reduction (NVLS), compared to MI300X's 7x64GByte/s xGMI point-to-point topology.




To reproduce this test, you can use our open source ClusterMax-NCCL/RCCL benchmark, which we developed to be easily run with one line of Bash. ClusterMax is our upcoming evaluation framework that ranks H100/B200/GB200/MI300X Neocloud clusters on quantitative performance and qualitative user experience. Look forward to our upcoming "ClusterMax Neocloud Evaluation | How to Rent GPUs" article.
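For readers who cannot wait for that release, a bare-bones version of such a measurement can be sketched with torch.distributed alone; the sizes, iteration counts, and bus-bandwidth formula below mirror the nccl-tests/rccl-tests convention and are meant as an illustration, not as our exact harness.

```python
# Hedged sketch of an all_reduce bus-bandwidth measurement; launch with torchrun.
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

for size_mib in (16, 64, 256):
    numel = size_mib * 1024 * 1024 // 2                      # bf16 = 2 bytes per element
    x = torch.randn(numel, device="cuda", dtype=torch.bfloat16)
    for _ in range(5):                                        # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()
    t0, iters = time.perf_counter(), 20
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    # nccl-tests "busbw": algorithm bandwidth scaled by the 2*(n-1)/n data movement of all_reduce.
    busbw = (size_mib / 1024) / dt * 2 * (world - 1) / world
    if rank == 0:
        print(f"{size_mib} MiB: ~{busbw:.1f} GB/s bus bandwidth")
```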

Multi Node RCCL/NCCL Collectives and Scale Out Network Benchmarks
On both Nvidia's H100/H200 and the MI300X, each GPU connects to other nodes over the scale-out network through a 400G Network Interface Card (NIC) attached directly to that GPU. The H100/H200 reference design typically uses ConnectX-7 NICs for InfiniBand NDR or BlueField-3 for Spectrum-X Ethernet. Spectrum-X is NVIDIA's custom Ethernet solution purpose-built for AI workloads. On the MI300X, the reference design recommends RoCEv2 Ethernet with the Broadcom Thor-2 NIC.

A typical GPU cluster almost always requires more than a single network tier, as a single-tier network can only support 128 GPUs (in the case of Broadcom Ethernet or Nvidia Spectrum-X Ethernet) or 64 GPUs (for H100/H200 InfiniBand). In such a multi-tier network, deployments typically use an 8-rail optimized fat tree, where each of the 8 GPUs in a server is connected to a separate switch (such a connection is called a "rail"). In our AI Neocloud Playbook and Anatomy article, we explained in detail how a rail-optimized network works.

Just as Nvidia’s NVLink offers NVLS for its scale-up network, Nvidia’s H100/H200 InfiniBand scale out network also offers InfiniBand SHARP In-network Reduction which is, again, exclusive to Nvidia. AMD does not have an analogous product for the MI300X. InfiniBand SHARP works similarly to NVLink SHARP In-network Reduction as they both provide a way to reduce the amount of traffic going through the network, with the reductions carried out inside of Quantum-2 InfiniBand switches in the case of InfiniBand SHARP.
Unfortunately, unlike NVLink SHARP, which is enabled by default, InfiniBand SHARP is not enabled by default in the UFM/IB subnet manager. We have spoken to many Neoclouds, H100 cluster operators, and AI frontier labs, and most have said that they have not enabled SHARP due to increased NCCL_TIMEOUT rates and difficulties installing and configuring the network. We asked NVIDIA which AI customers use InfiniBand SHARP, but they declined to answer in specifics. One could speculate that if InfiniBand SHARP was useful in AI production workloads, NVIDIA marketing would shout at the top of their lungs to promote its successful deployment. Given the apparently limited adoption of InfiniBand SHARP for now, we show here collective performance for Nvidia both when SHARP is and is not enabled.
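For reference, the sketch below shows the kind of NCCL environment knobs involved in these comparisons. NCCL_ALGO and NCCL_COLLNET_ENABLE are documented NCCL variables, but whether IB SHARP actually engages also depends on the UFM/subnet manager configuration, so treat this as a hedged illustration rather than a turnkey recipe.

```python
# Hedged illustration: toggling in-network reduction paths for an NCCL/RCCL job.
import os

os.environ["NCCL_COLLNET_ENABLE"] = "1"   # allow IB SHARP (CollNet) offload if the fabric is configured for it
# os.environ["NCCL_ALGO"] = "NVLS"        # prefer NVLink SHARP within a node
# os.environ["NCCL_ALGO"] = "Ring"        # baseline with no in-network reduction

import torch.distributed as dist
dist.init_process_group("nccl")           # run the same collective benchmark under each setting
```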
For some of the benchmarks, we have also collected Nvidia Spectrum-X Ethernet data on an Nvidia internal cluster called Israel-1. Nvidia Spectrum-X is used in xAI’s 200k H100/H200 cluster and can support clusters up to 100k GPUs in the Spectrum-X reference architecture version 1.2, but could potentially support up to 512k GPUs with a non-reference custom design.
We are also in the process of testing Google Cloud (GCP) H100’s in-house ethernet, as well as AWS’ H100 and H200s that are deployed on AWS’s in-house Ethernet (called EFAv2/EFAv3). We will be sharing the results in our upcoming “Collective Deep Dive” article, which will provide visualizations of the different types of collectives, explain the different NCCL protocols (SIMPLE, LL, LL128), different NCCL algorithms (NVLS, NVLSTREE, RING, TREE, COLNETDIRECT, COLNETCHAIN, PAT), and how collectives run on GCP H100 Ethernet, AWS H100/H200 EFA, InfiniBand H100, Spectrum-X, etc.
Below we show a 32 GPU all reduce collective test. You can see that MI300X RoCEv2 is in last place compared to normal InfiniBand H100 and InfiniBand H100 with SHARP enabled. Simply put, poor all reduce performance leads to poor scale-out training.

The MI300X’s performance decreases if you scale out (i.e. increase) the number of GPUs participating in a collective. As you can imagine, modern frontier training is carried out on clusters of at least 100,000 GPUs. MI300X RoCEv2 runs at half the speed for all the real-world message sizes of 16MiB to 256MiB when compared to the baseline of InfiniBand Non-SHARP. As per the chart below, Nvidia Spectrum-X Ethernet performance is quite close to InfiniBand Non-SHARP’s performance, due to Spectrum-X’s vertical integration with the NCCL collective library as well as its use of good congestion control and adaptive routing. AMD is attempting to vertically integrate next year with their upcoming Pollara 400G NIC, which supports Ultra Ethernet, hopefully making AMD competitive with Nvidia. As always, Nvidia is not standing still and by late next year, it will be ready to go into production with its 800G ConnectX-8 NICs, which provide a line rate twice as fast as AMD’s Pollara NIC.
AMD RCCL is a fork of Nvidia NCCL. AMD's RCCL Team, like many other teams at AMD, is resource limited and doesn't have enough compute or headcount to improve the AMD ecosystem. AMD's RCCL Team currently has stable access to fewer than 32 MI300Xs for R&D, which is ironic, as improving collective operations is all about having access to many GPUs. This is frankly silly; AMD should spend more so that its software teams have access to more GPUs.
This contrasts with Nvidia’s NCCL team, which has access to R&D resources on Nvidia’s 11,000 H100 internal EOS cluster. Furthermore, Nvidia has Sylvain Jeaugey, who is the subject matter expert on collective communication. There are a lot of other world class collective experts working at Nvidia as well, and, unfortunately, AMD has largely failed to attract collective library talent due to less attractive compensation and resources – as opposed to engineers at Nvidia, where it is not uncommon to see engineers make greater than a million dollars per year thanks to appreciation in the value of RSUs.
To help alleviate these issues, TensorWave and SemiAnalysis are currently working with the AMD RCCL Team to improve collective performance. TensorWave has generously sponsored a medium-sized cluster for AMD in order to help the RCCL Team have greater resources to do their jobs. The fact that TensorWave, after buying many AMD GPUs, has to give AMD access to those GPUs so that AMD can fix its own software is insane.
Another trend to notice is that, for non-SHARP networks, the all reduce collective slows down roughly logarithmically as you double the number of GPUs. In contrast, with SHARP, the speed/completion time stays the same. We have results for up to 1,024 H100s showing that IB SHARP all reduce is constant time across any number of GPUs in a collective. We will publish this in our upcoming "Collective Deep Dive" article.
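A toy alpha-beta model makes the intuition concrete; the latency and bandwidth constants below are placeholders, not measured values.

```python
# Toy model: why non-SHARP all_reduce slows as GPU count doubles while in-network
# reduction stays roughly flat. Constants are illustrative placeholders.
import math

def allreduce_time_ring(nbytes, n, alpha_us=15.0, link_GBps=50.0):
    # Latency grows with tree/ring depth (~log2(n)); each GPU also moves 2*(n-1)/n of the buffer.
    return alpha_us * 1e-6 * math.log2(n) + nbytes * 2 * (n - 1) / n / (link_GBps * 1e9)

def allreduce_time_sharp(nbytes, n, alpha_us=15.0, link_GBps=50.0):
    # With in-network reduction, each GPU injects the buffer roughly once, independent of n.
    return alpha_us * 1e-6 + nbytes / (link_GBps * 1e9)

msg = 256 * 1024 * 1024
for n in (32, 128, 512, 1024):
    print(n, f"no SHARP ~{allreduce_time_ring(msg, n) * 1e3:.2f} ms",
          f"SHARP ~{allreduce_time_sharp(msg, n) * 1e3:.2f} ms")
```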

For all gather, all to all, and reduce scatter collectives, MI300X is anywhere from 2-4 times slower than InfiniBand. Unfortunately, we did not have access to Spectrum-X or InfiniBand SHARP benchmark data for all gather or reduce scatter.



Below, we provide our nccl/rccl benchmarking script. Unfortunately, due to the nature of cluster-specific setups, it is not as simple as a one-liner. It does require you to follow the README.md of nccl/rccl and nccl-tests/rccl-tests to run properly. On AWS and Google Cloud, there may also be custom nccl adapters that you will need to install.

Source: SemiAnalysis
AMD’s User Experience is Suboptimal and the MI300X is Not Usable Out of the Box
Due to poor internal testing (i.e. "dogfooding") and a lack of automated testing on AMD's part, the MI300 is not usable out of the box and requires considerable amounts of work and tuning. In November 2024 at AMD's "Advancing AI" event, AMD's SVP of AI stated that there are over 200k tests running every evening internally at AMD. However, this seems to have done little to ameliorate the many AMD software bugs we ran into, and we doubt AMD is doing proper CI/CD testing that includes performance regression, functional, and convergence/numerics tests. We will outline a few examples here for readers to understand the nature of the AMD software bugs we have encountered and why we feel they have been very obstructive to a good user experience on AMD.
Although AMD’s own documentation recommends using PyTorch native Flash Attention, for a couple months this summer, AMD’s PyTorch native Flash Attention kernel ran at less than 20 TFLOP/s, meaning that a modern CPU would have calculated the attention backwards layer faster than an MI300X GPU. For a time, basically all Transformer/GPT model training using PyTorch on the MI300X ran at a turtle’s pace. Nobody at AMD noticed this until a bug report was filed following deep PyTorch/Perfetto profiling showing the backwards pass (purple/brown kernels) took up far more time than the forward pass (dark green section). Normally, the backwards section should take up just ~2x as much time as the forward pass (slightly more if using activation checkpointing).
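The sketch below shows the kind of profiling that surfaces this class of bug: time the forward and backward of the PyTorch-native attention path and check that the backward is only roughly 2x the forward. The shapes here are illustrative.

```python
# Minimal profiling sketch: compare forward vs. backward CUDA time for PyTorch-native attention.
import torch
from torch.profiler import profile, ProfilerActivity

B, H, S, D = 1, 16, 4096, 128
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    out.sum().backward()     # backward kernels should take only ~2x the forward time

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```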

Another issue we encountered was that the AMD PyTorch attention layer led to a hard error when used with torch.compile because the rank of the logsumexp Tensor was incorrect. What was frustrating is that this had already been fixed in internal builds of AMD PyTorch on May 30th, but did not reach any AMD PyTorch distributions or even any PyTorch nightly builds until October, when it was pointed out to them that there was a bug. This demonstrates a lack of testing and dogfooding of the packages AMD puts out to the public. Another core reason for this problem is that the lead maintainer of PyTorch (Meta) does not currently use the MI300X internally for production LLM training, so code paths not used internally at Meta are buggy and not dogfooded properly. We believe AMD should partner with Meta to get their internal LLM training working on MI300X.

On August 8th, Horace He and the Meta PyTorch Team released FlexAttention, a critical API for creating non-causal attention layers without losing speed. Previously, to use attention variants like document masking, sliding window attention, softcap, and Alibi, a user would need to spend weeks handcrafting their own kernel in the CUDA/HIP language and then binding it to PyTorch via pybind. With FlexAttention, a user can quickly generate all of these attention variants through the API. FlexAttention achieves great performance through block sparsity, only calculating the blocks of the mask that are needed and ignoring the rest.
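To give a sense of how little user code this takes, here is a minimal sliding-window-attention sketch against the public FlexAttention API (torch.nn.attention.flex_attention, PyTorch 2.5+); the shapes and window size are illustrative only.

```python
# Sketch: sliding-window causal attention via FlexAttention's block mask.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

WINDOW = 1024

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Keep only keys within WINDOW positions behind the query (and enforce causality).
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

B, H, S, D = 1, 16, 8192, 128
block_mask = create_block_mask(sliding_window_causal, B, H, S, S)
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

compiled_flex_attention = torch.compile(flex_attention)   # compilation generates the fused kernel
out = compiled_flex_attention(q, k, v, block_mask=block_mask)
```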


With sliding window attention, FlexAttention can improve performance by 10-20x! This is amazing for the end user, but unfortunately MI300X FlexAttention was in a poor state and suffered from numerous AMD software bugs (including convergence issues) until just a couple of days ago. While the latest PyTorch nightly now fixes the convergence issues, this contrasts starkly with FlexAttention on Nvidia, which has been usable since August. That means a ~6 month gap exists between the availability of these fantastic PyTorch features on Nvidia's and AMD's platforms. For frontier AI labs, six months is a lifetime, with OpenAI, Anthropic, and Google having released numerous models in such a span.

Exploring Ideas for Better Performance on AMD
AMD recommended we try PYTORCH_TUNABLE_OPS to improve GEMM performance by sweeping through GEMM algorithms at runtime. However, as we mentioned earlier, this API works poorly because GEMMs should be tuned when hipBLASLt/rocBLAS/cuBLASLt are built, not during the user's runtime. Users of Nvidia H100s do not need to use PYTORCH_TUNABLE_OPS for most shapes because the cuBLAS heuristic model will pick the correct algorithm. This contrasts with AMD's heuristic model, which never seems to pick the correct algorithm for most shapes. We recommend that AMD stop suggesting that users try tunable ops and instead focus on properly tuning their GEMM libraries internally.
When we tried PYTORCH_TUNABLE_OPS on AMD, it led to an HBM memory leak of over 25GBytes out of the total MI300X capacity of 192GBytes, essentially wiping out the MI300X's HBM capacity advantage over the H100. The fix for this is to set a default hipBLASLt and rocBLAS workspace to prevent memory leaks.
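A hedged way to catch this class of leak is to watch device memory versus the PyTorch allocator across tuning iterations; if device usage grows while the allocator's reserved pool stays flat, the leak is on the library side. The sketch below uses only standard torch.cuda memory APIs.

```python
# Sketch: detect library-side HBM leaks by comparing allocator vs. device memory usage.
import torch

def hbm_report(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: allocator reserved {torch.cuda.memory_reserved() / 2**30:.1f} GiB, "
          f"device used {(total - free) / 2**30:.1f} GiB of {total / 2**30:.0f} GiB")

a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
for step in range(3):
    hbm_report(f"before GEMM {step}")
    c = a @ a                          # the tuned GEMM path under test
    torch.cuda.synchronize()
    hbm_report(f"after GEMM {step}")   # device usage growing while reserved stays flat suggests a leak outside PyTorch
```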

As we mentioned earlier in this article, another issue we ran into was the plethora of environment flags needed on the MI300X to make it actually usable. We recommend that AMD stop putting users in the position of having to set these environment flags themselves and, instead, ship default flags that lead to a usable environment. It is not simply the number of flags, but also the complex interactions between them, that makes troubleshooting difficult. Getting reasonable training performance out of the AMD MI300X is an NP-Hard problem.
Another issue is that certain AMD ROCm libraries could not be installed inside Docker due to AMD software CMake bugs leading to hard errors. This has since been fixed. On AMD GPUs, you need to pass in a convoluted set of flags to get the GPUs to work inside a container, whereas on Nvidia GPUs, getting Docker to see them is as simple as passing in "--gpus=all". We recommend that AMD partner with Docker and ensure that Docker can autodetect AMD GPUs as well, making the workflow as streamlined as when working with Nvidia GPUs.

AMD’s Forked Libraries
Many of AMD's libraries are forked from Nvidia's open-source or ecosystem libraries. AMD uses a tool called Hipify to carry out source-to-source translation of Nvidia CUDA to AMD HIP. While the motivation is understandable, they are nevertheless building on top of their competitor's platform and cannot expect to match or surpass Nvidia's user experience with this software development strategy. They need to contribute their software to the AMD ecosystem. For example, instead of supporting FP8 training by forking Nvidia/TransformerEngine and doing source-to-source translation, they should make PyTorch native FP8 training work well on their own hardware. Currently, PyTorch native FP8 training recipes don't work on AMD, the unit tests don't even pass yet, and there is no CI/CD for AMD PyTorch native FP8 training.

Detailed Recommendations to AMD on How to Fix Their Software
First, AMD needs to focus on attracting more software engineering resources and improving compensation for current engineers. The current compensation gap between AMD and Nvidia means that top talent is lured to Nvidia over AMD. This top talent is also attracted to Nvidia as it has far more compute/resources for engineers. AMD should procure more GPUs for their in-house development work and submit an MLPerf GPT3 175B result as soon as possible. Even if the result is not competitive with Nvidia right now, submitting such a benchmark will kick off the process for iterative improvement.
We also notice that AMD frequently gives their customers custom images, and, in fact, AMD developers themselves often work on top of such bespoke images. This is not best practice, as this means that AMD engineers have a different experience vs. images available to the public. AMD should instead lift the standard of public images by using these images internally and with its customers, and the AMD executive team should personally internally test (i.e. “dogfood”) what is getting shipped publicly.
We recommend that AMD create a public dashboard that runs every night, showing the performance of their hardware on benchmarks such as MLPerf or TorchBench. This dashboard should also include H100/H200 performance as a baseline.
Finally, AMD needs to completely transform its approach to environment flags. Instead of requiring users to set a myriad of flags just to get running, AMD should ship recommended defaults so users can get started quickly out of the box.
AMD should collaborate with Meta to get production training workloads working on ROCm, as it is well-known amongst PyTorch users that PyTorch code paths tend to have tons of bugs unless Meta uses it internally. Meta currently hand writes HIP Kernels for their production MI300X inferencing but does not use MI300X for real training. It would be a fantastic improvement for the AMD ecosystem, and a marketing victory, if a smaller version of the next Llama is trained on AMD. Not to mention that this would open the door to AMD progressively moving towards larger models/clusters with Meta. Meta using AMD GPUs for actual model training would be a win-win for both companies as Meta is also looking for alternative training chips to Nvidia.
Currently, Nvidia offers well over 1,000 GPUs for continuous improvement and development of PyTorch externally, and many more internally. AMD doesn't. AMD needs to work with an AMD-focused GPU Neocloud to have ~10,000 GPUs of each generation for internal development purposes and for PyTorch. This will still be 1/8th of what Nvidia will have with their coming huge Blackwell clusters, but it's a start. These can be dedicated to internal development and CI/CD for PyTorch.
Lisa, we are open to a meeting on how to fix AMD’s Datacenter GPU User Experience for the better!
H100/H200/MI300X Networking BoM Analysis and Performance per TCO
In addition to our benchmarking of collectives and GEMM throughput, we have conducted several experiments exploring insightful topics for conducting further benchmarks and running real-world workloads on clusters. These experiments cover benchmarking warmup and repeat effects, VBoost Power Shifting, MLPerf Training GPT-3, BF16 vs FP16 throughput, throughput by GEMM input distribution, power per FLOP, and throughput for the PyTorch PyPi distribution vs Nvidia NGC Stable PyTorch images.
We also present a detailed networking bill of materials (BoM) analysis for the 1k GPU Ethernet, 1k GPU InfiniBand, 16k GPU Ethernet, and 16k GPU InfiniBand clusters. We also discuss the impact of using 51.2T Radix vs. 25.6T Radix switches for back-end networking.
Lastly, we present a performance per TCO analysis that shows how the H100/H200/MI300X stack up in terms of $/hr per effective training petaflop. These items are available below to all SemiAnalysis subscribers and will be of great interest to datacenter operators, ML scientists, and investors.
H100/H200/MI300X Networking BoM Analysis and Performance per TCO
Using higher-radix 51.2T switches can lead to a simplified network deployment, as they can support a larger GPU cluster size for a given number of network layers. Whereas a 25.6T radix switch such as the Nvidia Quantum-2 QM9700 can connect up to 2,048 GPUs in a 2-layer network (i.e. a leaf layer connecting to GPUs and a spine layer connecting the leaf switches), a 51.2T radix switch based on a Tomahawk 5 ASIC or the Spectrum-X SN5600 can connect up to 8,192 GPUs.

For the same 8,192 GPU deployment, using a 25.6T radix switch would require a 3-layer network and 640 switches, while using a 51.2T radix switch would only require a 2-layer network and 192 switches, as well as fewer transceivers. An advantage of Ethernet-based deployments for back-end networking over InfiniBand-based deployments is the fact that the highest-radix switch widely deployed on the market today for InfiniBand is only 25.6T, while Ethernet has the option of using a 51.2T radix switch.
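The cluster sizes quoted above follow directly from fat-tree arithmetic; a quick back-of-the-envelope check (assuming non-blocking rail-optimized trees and 400G ports) is shown below.

```python
# Back-of-the-envelope fat-tree sizing: a non-blocking 2-tier tree of R-port switches
# tops out at R^2/2 endpoints, and a 3-tier tree at R^3/4.
def max_gpus(ports_per_switch: int, tiers: int) -> int:
    if tiers == 2:
        return ports_per_switch ** 2 // 2
    if tiers == 3:
        return ports_per_switch ** 3 // 4
    raise ValueError("only 2- and 3-tier fat trees modeled here")

print(max_gpus(64, 2))    # 25.6T switch, 64x400G ports  -> 2,048 GPUs
print(max_gpus(128, 2))   # 51.2T switch, 128x400G ports -> 8,192 GPUs
```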

We start by analyzing the networking BoM for a 1,024 GPU InfiniBand deployment at a Neocloud Giant, as this is a typical cluster size found at many Neoclouds. Assuming a non-blocking network for the backend fabric, but a 2:1 oversubscribed frontend fabric that also handles storage networking (i.e., a converged fabric), we estimate a total cluster networking BoM of $5.2M, or $40,705 per 8-GPU server.

Turning to a deployment of the same size but using Ethernet for back-end fabric, we estimate much lower costs of $3.0M for total cluster networking cost, or $23,816 per server. This is lower than for the InfiniBand-based deployment due to the lower cost of switches and transceivers. A cluster size of 1,024 GPUs requires a two-layer network whether using 25.6T or 51.2T radix switches, so using an Arista Tomahawk 5-based 51.2T radix switch for the Ethernet deployment does not save costs from a lower transceiver count or a simplified network topology.

A cluster size of 16,384 GPUs exceeds the maximum two-layer node count of 8,192 GPUs when using 51.2T radix switches, necessitating a shift towards using a three-layer network for the backend fabric. This substantially increases the cost per server to $34,214 given a greater switch and transceiver count. We note that not many Neocloud giants will opt to deploy on Ethernet given the extra work required to tune RoCEv2 to work well on collectives.

A Hyperscaler can utilize much stronger bargaining power to deploy a network at a far cheaper cost than a Neocloud Giant can. Using Whitebox Tomahawk 5-based switches instead of branded Ethernet switches from providers such as Arista can save 40-50% on switch ASPs. Hyperscalers can also purchase transceivers for far less, saving 40-50% on unit pricing here. We calculate the total networking capex at $18,677 per server.

The highest-price option is for Neocloud Giants that are deploying InfiniBand-based networking. Here, they will be using Nvidia’s switches, which have the highest ASP, and purchasing Nvidia-branded transceivers—also at the highest ASP for a given transceiver speed.

Finally, we analyze a 16k InfiniBand-based cluster deployed by a Hyperscaler. Using the same network deployment as used by a Neocloud giant and only changing the ASP of the networking equipment, we see that the total networking cost is $47,470 per server.

Looking at total upfront cluster capex, we can see that the much lower cost of Ethernet-based networking is as much a factor in lower total cost of ownership as the lower cost of an AMD MI300X server vs. an Nvidia H100 or H200 server.

Putting everything together, a Neocloud Giant can deploy the MI300X for a much lower total cost per hour of $1.23 vs $1.65 for the H100 and $1.66 for the H200, while a Hyperscaler can deploy the MI300X for $1.09 per hour vs $1.50 for the H100 and $1.51 for the H200.
However, if we divide the cost per hour by the training effective FLOP/s delivered (as determined by our FP8 Single-node training benchmark) to get the cost of training compute in units of $/hr per effective PFLOP/s, we find that the cost advantage for the MI300X vanishes. For a Neocloud giant this MI300X cost of compute is at $2.47/hr per effective PFLOP/s, higher than that of the H200, and for a Hyperscaler, the MI300X cost of compute is $2.10/hr per effective PFLOP, higher than the $1.85 for the H200.
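The metric itself is simple division; back-solving from the figures above, the MI300X at a Neocloud Giant is delivering roughly 1.23/2.47, or about 0.5 effective PFLOP/s, in our FP8 single-node benchmark. A sketch of the calculation:

```python
# Illustrative arithmetic for the $/hr per effective PFLOP/s metric used above.
def cost_per_effective_pflop(cost_per_gpu_hour: float, effective_pflops: float) -> float:
    return cost_per_gpu_hour / effective_pflops

# Back-solved example from the published Neocloud Giant figures: $1.23/hr at ~0.50 effective PFLOP/s.
print(f"${cost_per_effective_pflop(1.23, 0.50):.2f}/hr per effective PFLOP/s")   # ~ $2.46-2.47
```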

Further Experiments
Beyond the headline GEMM throughput in TFLOP/s and the collective performance, we conducted several further experiments to explore interesting concepts and hypotheses and further explore GPU architecture and limitations.
Benchmarking Warmup/Repeats Effects
GPUs are power-limited. This means that they will never be able to sustain their max clock frequency within the allowed Thermal Design Power (TDP). As a result, GPUs are always faster in earlier iterations than in later iterations because, over time, the GPU needs to throttle down into a stable frequency. Interestingly, AMD tends to reach a stable state much earlier than H100/H200.
That said, we believe that warmup=30, repeat=200 is the best setting for benchmarking because, in real workloads, a user would not hit a compute-bound GEMM 1,000 times in a row. In real-world usage, GEMMs are usually followed by a layernorm or softmax, or potentially an exposed communication. All these kernels give the GPU a chance to cool down.
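A minimal GEMM harness using these settings is sketched below; the matrix shape is illustrative and the timing uses ordinary CUDA events.

```python
# Sketch: GEMM throughput with the recommended warmup=30, repeat=200 settings.
import torch

def gemm_tflops(m=8192, n=8192, k=8192, dtype=torch.bfloat16, warmup=30, repeat=200) -> float:
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(warmup):
        a @ b
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeat):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds_per_gemm = start.elapsed_time(end) / 1e3 / repeat
    return 2 * m * n * k / seconds_per_gemm / 1e12

print(f"{gemm_tflops():.0f} TFLOP/s")
```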

A benchmark hack is to forego a warmup and carry out only a handful of iterations so that power throttling never kicks in. Another method is to lock the GPU clock so that it starts immediately at the highest realized frequency rather than taking a few iterations to ramp up to it. This, however, doesn't result in an accurate lock and counterproductively leads to a longer time for the GPU to reach a stable frequency. We forgo these tricks when benchmarking because, in production training workloads, users do not lock GPU frequency, as doing so can result in the clock speed being throttled down even more.

VBoost Power Shifting
GPUs are power-limited; thus, effective use of power on the chip is required. Nvidia GPUs include a setting to shift power away from the L2 cache and towards the GPC units (which contain the tensor cores). Currently, AMD doesn't have a similar setting, but they might release one in ROCm 6.5. Note that for all of our benchmarks above, we did not use vboost, as it is an advanced flag with no documentation or recommendations from Nvidia on what it does and what scenarios to use it for.
Training is typically compute bound, meaning a high L2 cache clock frequency is unnecessary.
Vboost=0 is the default profile out of the box. Nvidia has not publicly released the vboost clock speed ratios, but from our experiments, we believe that vboost=1 shifts the most power towards the tensor cores, and vboost=2 shifts the second-most power towards the tensor cores and away from the L2 cache.
For BF16 compute-bound kernels, vboost=1 provides a small 2-4% boost in performance. For FP8 compute-bound kernels, vboost=2 is the best setting. We couldn't figure out what the vboost=3 and vboost=4 relative ratios are, but we believe they shift power away from the GPCs and towards the L2 cache. The use cases would be memory-bound operations and non-tensor-core SIMT operations.



Also, in collaboration with Sustainable Metal Cloud/Firmus, we tested the real-world impact of vboost on an unofficial MLPerf GPT-3 175B training run on 256 H100s with 200G HDR InfiniBand networking. We see that vboost delivers better performance on real-world training models in addition to GEMM microbenchmarks.

Source: SemiAnalysis, Sustainable Metal Cloud
BF16 vs FP16
Another observation from our experiments is that BF16 tends to have a slightly higher TFLOP/s than FP16, even though practitioners have come to expect very similar performance. The mental mapping of interchangeable formats is also reflected in the H100’s marketed throughput, with both the BF16 and FP16 achieving 989.5 TFLOP/s. Our experiments show a slight edge for BF16 over FP16 for each of the H100/H200/MI300X, which makes sense as GPUs are power limited.

Why the difference in performance? BF16 is E8M7 (8 exponent bits and 7 mantissa bits) while FP16 is E5M10 (5 exponent bits and 10 mantissa bits), meaning BF16 has fewer mantissa bits than FP16. For floating point multiplication, the exponents are added and the mantissas are multiplied. Simplistically, adder circuits generally use O(n) transistors, whereas multiplier circuits use O(n log n) to O(n^2) transistors, where n is the number of bits.
As stated previously, GPUs are power-limited. BF16 has fewer mantissa bits than FP16, so it uses fewer transistors. This leads to a higher clock frequency and a higher TFLOP/s. The logic extends to comparing FP8 e4m3 vs. FP8 e5m2 but with a less significant difference.
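The bit layouts are easy to confirm from torch.finfo (mantissa bits can be recovered from machine epsilon); the throughput gap itself has to be measured with a GEMM benchmark such as the one in the warmup section above.

```python
# Quick check of BF16 vs FP16 bit layouts; eps = 2^-(mantissa bits).
import math
import torch

for dt in (torch.bfloat16, torch.float16):
    fi = torch.finfo(dt)
    mantissa_bits = int(round(-math.log2(fi.eps)))
    exponent_bits = fi.bits - 1 - mantissa_bits          # 1 sign bit
    print(dt, f"{fi.bits} bits: {exponent_bits} exponent, {mantissa_bits} mantissa")
# bfloat16 -> 8 exponent / 7 mantissa; float16 -> 5 exponent / 10 mantissa
```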

Input Distribution Affects Performance
As first mentioned in Horace He's blog post, the input distribution to your kernels matters for benchmarking. When benchmarking kernels, practitioners should use a normal distribution with a mean and variance that matches actual production workloads.
One of the main sources of power draw in a GPU is transistors switching state from 0 -> 1 and from 1 -> 0. Transistors that stay at the same value (i.e. 1 -> 1 or 0 -> 0) use less power than those that are switching states.
As Horace stated in his blog post, power ~= clock speed * “transistor flips per clock”.
As such, multiplying two zero-filled tensors together achieves higher performance than multiplying two normally distributed tensors. Are zero-filled tensors multiplied together in real workloads? DEFINITELY NOT.
Even though there are no custom circuits for unstructured sparsity in Nvidia hardware, multiplying sparse (zero-heavy) tensors is still faster than multiplying normally distributed tensors. Multiplying uniform distributions is also faster than multiplying normal distributions, as there are fewer transistor flips. Our experiments show similar results for the H100/H200/MI300X, given that these GPUs are power-limited. We also have experiments for 2:4 structured sparsity. 2:4 is never used in current training recipes but is somewhat used for inferencing. We will talk about 2:4 structured sparsity in our "MI300X vs H100 vs H200 inferencing" article.
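A simple version of this experiment is sketched below: time the same GEMM shape with zero-filled, uniform, and normally distributed inputs; on power-limited GPUs the zero-filled case will typically report the highest throughput.

```python
# Sketch: same GEMM, different input distributions; fewer transistor flips -> higher clocks.
import torch

def time_matmul_ms(a, b, warmup=30, repeat=200) -> float:
    for _ in range(warmup):
        a @ b
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeat):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeat

N = 8192
inputs = {
    "zeros":   torch.zeros(N, N, device="cuda", dtype=torch.bfloat16),
    "uniform": torch.rand(N, N, device="cuda", dtype=torch.bfloat16),
    "normal":  torch.randn(N, N, device="cuda", dtype=torch.bfloat16),
}
for name, x in inputs.items():
    print(f"{name}: {time_matmul_ms(x, x):.3f} ms per GEMM")
```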

FLOP per GPU PicoJoule
Power constraints have forced many clusters to power limit their GPUs to 400W to 550W instead of achieving maximum performance by running GPUs at full power.

For the H100, limiting power to 450W will yield the lowest picoJoule per FLOP. While for the MI300X, the lowest picojoule per FLOP is attained by setting power to 550W. We note that since electricity cost is a small portion of the TCO, limiting power is wasteful in terms of TCO but must be done if the goal is to fit more GPUs within a defined power constraint such as a 40MW datacenter.
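The metric reduces to simple unit arithmetic: watts divided by TFLOP/s delivered is already picojoules per FLOP. The numbers below are placeholders for illustration, not our measured results.

```python
# Unit arithmetic for the pJ/FLOP metric: 1 W / (1e12 FLOP/s) = 1e-12 J/FLOP = 1 pJ/FLOP.
def picojoule_per_flop(gpu_power_watts: float, tflops_delivered: float) -> float:
    return gpu_power_watts / tflops_delivered

# Placeholder example: a 450W-capped GPU sustaining 700 TFLOP/s -> ~0.64 pJ/FLOP.
print(f"{picojoule_per_flop(450, 700):.2f} pJ/FLOP")
```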

For this article, we only focused on the GPU Direct Current (DC) picoJoule per FLOP, only factoring in the power draw from the GPU itself. In an upcoming article, we will do a deep dive into all-in AC picoJoule per FLOP as measured at the utility transformer level for air cooled in Virginia, Iceland, Singapore Immersion, and Direct to Chip (DLC) liquid cooling, factoring in power budgets for the overall server as well required networking, storage and management systems, & PUE i.e. cooling systems & power distribution. This will provide a real-world measure of the economics of deploying GPU clusters for different cooling solutions.
PyTorch PyPi Distribution vs. Nvidia NGC Stable PyTorch Images
Another observation from our experiments is that the Nvidia NGC official stable PyTorch image is slightly faster than PyTorch installed through the typical "pip install torch", as the PyPI distribution of PyTorch does not use the latest Nvidia libraries but instead ships libraries that are 4-6 months old (CUDA 12.4), compared to Nvidia NGC PyTorch 24.09+.
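A quick way to see which stack a given build ships is to inspect the versions PyTorch was compiled against; the snippet below uses only standard torch version attributes.

```python
# Check which CUDA/cuDNN stack a PyTorch build was compiled against (PyPI wheel vs. NGC image).
import torch

print(torch.__version__)                 # e.g. the PyPI wheel vs. the NGC container build
print(torch.version.cuda)                # CUDA toolkit the build targets (12.4 for the current PyPI wheel)
print(torch.backends.cudnn.version())    # bundled cuDNN version
```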
Meta and Nvidia are in the process of upgrading to cuBLASLt 12.6.2, but Nvidia's NGC stable PyTorch images will always be ahead and have the latest versions of the Nvidia cuBLASLt/CUTLASS/NCCL/cuDNN libraries.
