Nowadays, artificial intelligence (AI) technology with large models plays an increasingly important role in both academia and industry. It also brings a rapidly growing demand for hardware computing power. As the computing demand of AI continues to grow, the growth of hardware computing power has failed to keep up, which has become a significant factor restricting the development of AI. The growth of hardware computing power is mainly propelled by increases in transistor density and chip area. However, the former is impeded by the end of Moore's Law and Dennard scaling, and the latter is significantly restricted by the challenge of disrupting legacy fabrication equipment and processes.
In recent years, advanced packaging technologies that have gradually matured are increasingly used to implement bigger chips that integrate multiple chiplets while still providing interconnections with chip-level density and bandwidth. This technique points out a new path for continuing the increase of computing power while leveraging the current fabrication process without significant disruption. Enabled by this technique, a chip can extend to a wafer-scale size (over ), provisioning orders of magnitude more computing capability (several POPS within just one monolithic chip) and die-to-die bandwidth density (over ) than regular chips, giving rise to a new Wafer-scale Computing paradigm. Compared to conventional high-performance computing paradigms such as multi-accelerator and datacenter-scale computing, Wafer-scale Computing shows remarkable advantages in communication bandwidth, integration density, and programmability potential. Not surprisingly, disruptive Wafer-scale Computing also brings unprecedented design challenges for hardware architecture, design-system-technology co-optimization, power and cooling systems, and the compiler tool chain. At present, there are no comprehensive surveys summarizing the current state and design insights of Wafer-scale Computing. This paper aims to take the first step to help academia and industry review existing wafer-scale chips and essential technologies in a one-stop manner, so that people can conveniently grasp the basic knowledge and key points, understand the achievements and shortcomings of existing research, and contribute to this promising research direction.
Index Terms - AI chip, wafer-scale computing, hardware architecture, software system.
Fig. 1. Scissors gap between the required and provided computing power.
I. INTRODUCTION
In today's world, artificial intelligence (AI) technology with large models plays an increasingly important role in promoting scientific progress and is becoming a basic tool for human beings to explore the world. For example, AlphaFold2 can predict of human protein sequences with high confidence while the traditional method can only cover 17% [1], AlphaTensor discovers fast matrix multiplication algorithms and surpasses the classic algorithm discovered 50 years ago for the first time [2], DGMR beats competitive methods in of the cases of two-hour-ahead high-resolution weather forecasting [3], and DM21 performs better than traditional functionals for describing matter at the quantum level [4]. As an AI language model developed by OpenAI, ChatGPT [5] has gained significant popularity and recognition in the field of natural language processing (NLP) since its release.
Although large AI models offer numerous benefits to research, production, and daily life, they also impose an immense demand on hardware computing power. With the emergence of the transformer model [6] in recent years, the demand for computing power required by large models has experienced explosive growth, increasing by a factor of 1,000 within a span of two years, as shown in Figure 1. By contrast, we observe that the growth rate of hardware computing power utilized to train large models only doubles over a period of two years. As a result, a significant gap exists between the computing power demanded by large models and that which is currently available from chips, which serves as a key limiting factor for the advancement of AI.
Computing Power of Chip = Transistor Density (@Process) × Chip Area (@Exposure Area)
Fig. 2. Limiting factors of computing power.
The computing power of a chip can be represented by the total number of transistors integrated within it, which is the product of transistor density (i.e., the number of transistors per unit area) and chip area. Transistor density is mainly determined by the advancement of the chip process, whereas chip area is predominantly determined by the capability of the photolithography process and is limited by a maximum reticle size. Unfortunately, both paths are presently encountering significant obstacles that impede their continuation.
Firstly, the increasing difficulty of improving transistor density due to the slowing or termination of Moore's Law and Dennard scaling has become a significant issue after 3 nm [7]. Secondly, the chip area is generally restricted by reticle size (i.e., the largest area of the wafer that can be patterned using a lithography stepper system), which is challenging to enlarge while maintaining the most up-to-date chip process and preserving yield [8].
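The density-times-area relationship above can be written as a one-line model. The numbers below are purely illustrative placeholders (not figures from any specific process node), chosen only to show how area scaling raises the transistor budget when density is held fixed:

```python
def total_transistors(density_per_mm2, area_mm2):
    """Transistor budget = density (transistors/mm^2) x die area (mm^2)."""
    return density_per_mm2 * area_mm2

# Hypothetical density of 100 M transistors/mm^2 for both designs:
reticle_die = total_transistors(100e6, 800)       # roughly reticle-sized die
wafer_scale = total_transistors(100e6, 25 * 645)  # 25 chiplets of 645 mm^2 each
print(wafer_scale / reticle_die)                  # ~20x the transistor budget
```

At equal density, growing the area by roughly 20x grows the transistor budget by the same factor, which is exactly the lever that wafer-scale integration pulls.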
In recent years, advanced packaging technologies that have gradually matured are increasingly used to implement bigger chips that integrate multiple chiplets/dielets, while still providing interconnections with chip-level density and bandwidth [9-11]. This points out a new path for continuing the increase of computing power while leveraging the current fabrication process without significant disruption, giving rise to a new and exciting Wafer-scale Computing paradigm. In this paper, Wafer-scale Computing is defined as the computing paradigm which extends the chip area to a wafer-scale size (over ) by leveraging advanced packaging [12] or field stitching [13] techniques to tightly integrate multiple chiplets/dielets, provisioning orders of magnitude more computing capability (several POPS within just one monolithic chip) and die-to-die bandwidth density (over ) than regular chips, while leveraging the current fabrication process without significant disruption.
One might question why not construct an accelerator cluster (such as NVIDIA's GPU cluster [14]) from multiple chips to achieve computing power that surpasses that of a single conventional chip. In other words, what benefits do wafer-scale chips have over accelerator clusters? The answers mainly include the following aspects:
One of the primary advantages of wafer-level integration is the significant enhancement of die-to-die bandwidth. This benefit is straightforward and evident. For instance, NVIDIA's interconnect provides a bandwidth of for H100 GPUs [14], whereas every D1 die edge in Tesla Dojo delivers a bandwidth of 2TB/s [15]. This high bandwidth breaks the previous memory limitations and opens up new design spaces for exploring more aggressive application solutions that can fully utilize the computing power available.
Moreover, wafer-scale chips offer better integration density, which implies higher size and form factor efficiencies. For instance, a Tesla Dojo training tile can tightly integrate 25 regular-sized dies, whereas 25 NVIDIA H100s require a complete package for each GPU and take up more than ten times the total area. This size advantage may be preferred in supercomputing applications with limited space, such as aerospace and military applications.
Lastly, wafer-scale chips offer significant potential for programmability. Compared to GPU clusters, wafer-scale chips have considerably less overhead when it comes to inter-die and intra-die data communication, and current wafer-scale chip designs [15-17] strive to minimize this gap even further. This means that programmers do not have to be overly concerned about data access across multiple dies, and compilers can generate more flexible and fine-grained hardware resource partitioning and task mapping to more effectively utilize the computing power available.
Not surprisingly, this disruptive wafer-scale computing paradigm also brings unprecedented challenges in nearly every design aspect.
To start, the hardware footprint of a Wafer-scale Computing system is orders of magnitude larger than that of traditional chips, bringing significantly larger design spaces, which nullifies conventional architectural design methodologies.
From an architectural perspective, the current hardware execution model exhibits inadequate scalability. To enable the efficient operation of Wafer-scale Computing systems, it is imperative to develop new computational and execution models.
From an integration/packaging perspective, the current advanced packaging technologies are being implemented in Wafer-scale Computing only as experimental runs and in an ad hoc manner. There is still a need for systematic design principles to determine the appropriate wafer-scale substrate techniques and layouts, to address yield issues, and to co-optimize the packaging and system design.
From a systems perspective, there is currently no cross-stack system design methodology for Wafer-scale Computing systems. Such systems operate as a tightly coupled whole, comprising computing dies, wafer-scale substrates, power and cooling systems, and mechanical parts. The performance and availability of any single component can significantly affect the other parts.
From a software perspective, the current software stack has not encountered computing resources on the scale of Wafer-scale Computing and lacks efficient mechanisms to fully leverage the power of such systems. There is a high demand for compiling and execution mechanisms to map big AI models to Wafer-scale Computing resources and run them efficiently.
At present, there are no comprehensive surveys summarizing the current states and design insights of Wafer-scale Computing. The objective of this paper is to provide a comprehensive overview of existing wafer-scale chips and their essential technologies in a single source. This will enable individuals in academia and industry to easily comprehend the fundamental concepts and key aspects, assess the accomplishments and limitations of existing research, and make contributions to this promising research field.
The rest of this paper is organized as follows: Section II introduces the background technologies. Section III discusses the key architecture design points of wafer-scale chips based on existing works. Section IV reviews commonly used interconnect interfaces and protocols which may be used in wafer-scale chips. Section V reviews typical compiler tools for driving large-scale acceleration platforms (including traditional ones and Wafer-scale Computing ones), and discusses what makes the compilation for Wafer-scale Computing different from traditional compilation. Section VI reviews how existing works integrate wafer-scale chips. Section VII discusses how to deliver power and clock to the whole wafer-scale chip, and how to solve the cooling problem under such high integration density. Section VIII provides a summary of fault tolerance designs. Section IX presents a brief introduction to the important scientific computing applications of Wafer-scale Computing outside the AI computing domain. Section X concludes the paper.
II. BACKGROUND
In this section, we will introduce the efforts for improving the computing power of AI acceleration.
A. Conventional Accelerators for AI Tasks
As Moore's Law and Dennard scaling are slowing down or even terminating, various accelerators different from classic Von Neumann architectures have been proposed to meet the ever-increasing user demand of AI applications. These accelerators mainly include general-purpose graphics processing units (GPGPUs) [18], field-programmable gate arrays (FPGAs) [19], application-specific integrated circuits (ASICs) [20] and coarse-grained reconfigurable architectures (CGRAs) [21]. GPGPUs, such as NVIDIA's A100 [22] and H100 [14], are designed for scientific computation, encryption/decryption, and, of course, AI computation. GPGPUs have many more computing units than Von Neumann CPUs and adopt supporting execution mechanisms (e.g., single-instruction multiple-thread, SIMT [23]), so they can surpass CPUs in the performance of highly parallel computing. Since GPGPUs usually have large volume and power consumption, they are mainly used in data centers. With the popularization of AI and increasingly diversified user needs, researchers have also developed more domain-specific accelerators on platforms other than GPGPUs, to meet demands such as energy efficiency and portability. These accelerators, including FPGAs, ASICs and CGRAs, simplify the computing units, data paths and memories to exactly match the application scenarios, so they usually have energy-efficiency and area-efficiency advantages over GPGPUs. FPGAs offer fine-grained programmable logic, computing and storage units, allowing users to fully customize the computing path structure according to algorithm requirements. CGRAs offer coarse-grained reconfigurability, providing flexibility that is limited but sufficient for the target application scenarios. ASICs cannot change their functions in the original sense; however, for AI accelerators the boundary between ASIC and CGRA is blurred, and many so-called ASIC-based AI accelerators (e.g., [24-26]) also have coarse-grained reconfigurability.
The accelerators mentioned above have in common that they adopt 1) large numbers of parallel computing units to provide high computing power, 2) hierarchical memory systems to reuse data and reduce data movement overhead, 3) convenient networks-on-chip (NoCs) to connect the distributed computing units and memories, and 4) elaborate dataflows to utilize as much computing power as possible. Designers make decisions on these issues to strive for the best performance, efficiency or flexibility.
B. Accelerator Clusters
Due to the slowdown of CMOS scaling and the limit on chip area, the number of available transistors on a monolithic chip has become stagnant nowadays, which seriously restricts the processor design optimization space. The gains from specialized architecture design will gradually diminish and ultimately hit an upper bound, which is called the accelerator wall [27]. To address the gap between the diminishing gains of accelerator performance and the growing demand of computing power for AI tasks, a straightforward way is to group multiple individual accelerators into a cluster, so that much more computing power can be exploited.
For running on an accelerator cluster, AI tasks are usually executed under the data-parallel (DP) [28] or model-parallel (MP) [29] strategy, as shown in Figure 3. The accelerator cluster can be seen as a graph where each node represents one accelerator (or a group of tightly linked accelerators) and each edge represents the data path between nodes. Under the DP strategy, samples are distributed across nodes (i.e., different nodes hold different subsets of the samples), while weights are broadcast to all nodes (i.e., each node works on a copy of all weights). In contrast, under the MP strategy, weights are divided along the input/output channel dimension (tensor MP) or the layer dimension (pipeline MP) and distributed across nodes, and each node processes all samples.
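The DP/MP split described above can be sketched in a few lines of NumPy; the shapes and node count here are arbitrary toy values, not taken from any real system:

```python
import numpy as np

n_nodes = 4
batch, d_in, d_out = 8, 16, 16
x = np.random.randn(batch, d_in)    # input samples
w = np.random.randn(d_in, d_out)    # weights of one layer

# Data parallel: samples are sharded across nodes, weights are replicated.
dp_samples = np.array_split(x, n_nodes, axis=0)   # each node: (2, 16)
dp_weights = [w.copy() for _ in range(n_nodes)]   # full weight copy per node

# Tensor model parallel: weights are sharded along the output-channel
# dimension, and every node sees the full batch.
mp_weights = np.array_split(w, n_nodes, axis=1)   # each node: (16, 4)
partials = [x @ w_i for w_i in mp_weights]        # each node: (8, 4)

# Concatenating the per-node partial outputs recovers the full-layer result.
assert np.allclose(np.concatenate(partials, axis=1), x @ w)
```

Note the communication patterns the two strategies imply: DP must reconcile per-node gradients (an all-reduce), while tensor MP must gather or concatenate per-node activations.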
Since the training of neural networks on multiple nodes usually requires an all-reduce operation (i.e., collecting partial results from all nodes and distributing the updated data to all nodes) [30-33], a powerful and stable network with high bandwidth should be built to link all the accelerators. Taking the NVIDIA GPU cluster as an example, the nodes are usually linked with a fat-tree network [34], ensuring that each node can access any other one at full bandwidth. Based on InfiniBand Quantum-2 switches [35-37] or Ethernet data center switches of equivalent performance, and employing a three-layer fat-tree network architecture, a single cluster can support a maximum of 100,000 GPU cards.
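As a toy illustration of the all-reduce semantics mentioned above, the following sketch implements the classic reduce-scatter + all-gather ring schedule on plain NumPy arrays (production collectives such as NCCL use the same schedule, with communication overlapped in hardware):

```python
import numpy as np

def ring_allreduce(grads):
    """Sum per-node gradient vectors so every node ends with the total,
    using the classic reduce-scatter + all-gather ring schedule."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float).copy(), n) for g in grads]
    # Reduce-scatter: after n-1 steps, node i holds the fully reduced
    # chunk (i+1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data
    return [np.concatenate(c) for c in chunks]

grads = [np.full(8, float(i)) for i in range(4)]  # node i holds an all-i vector
out = ring_allreduce(grads)
# Every node now holds the elementwise sum 0 + 1 + 2 + 3 = 6.
assert all(np.allclose(o, 6.0) for o in out)
```

Each of the 2(n-1) steps moves only 1/n of the gradient over every link, which is why the ring schedule is bandwidth-optimal and why link bandwidth directly bounds the cost of the collective.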
Many works have pointed out that bandwidth is a major bottleneck for GPUs [38-40]. These works conduct experiments on GPUs and Wafer-scale Computing systems, scale the die-to-die bandwidth to up to and demonstrate that larger bandwidth can effectively reduce the exposed communication time, thereby reducing the overall execution time. Currently, NVIDIA's NDR InfiniBand supports interconnection between every two cards in different nodes (the receiving and sending bandwidths are added together) [35, 41]. Furthermore, NVIDIA has proposed NVLink [42] to provide bandwidth between any two cards in the same node. This is already a quite high number, but we have found that further increasing the bandwidth can lead to even greater benefits. Based on Megatron-LM [43, 44], we built a performance model for the scenario of training GPT-3 [45] on a GPU cluster with 16,384 H100 GPUs (each node has 256 GPUs). We found that if the NVLink bandwidth is increased from to , the theoretical training time can be reduced by , resulting in a speedup of . Therefore, it is highly valuable to continue increasing the bandwidth between accelerators.
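The shape of this argument can be sketched with a simple per-step time model in which only the non-overlapped part of communication adds to the critical path; all the numbers below are hypothetical placeholders, not the figures from the Megatron-LM model referenced above:

```python
def step_time(compute_s, comm_bytes, bw_bytes_per_s, overlap=0.5):
    """Per-step training time: compute plus the exposed (non-overlapped)
    fraction of communication time."""
    comm_s = comm_bytes / bw_bytes_per_s
    return compute_s + (1.0 - overlap) * comm_s

# Hypothetical numbers: 1 s of compute and 100 GB of gradient/activation
# traffic per step, with half of the communication hidden behind compute.
base = step_time(1.0, 100e9, 100e9)   # 100 GB/s links -> 1.5 s per step
fast = step_time(1.0, 100e9, 400e9)   # 4x the bandwidth -> 1.125 s per step
print(base / fast)                    # ~1.33x end-to-end speedup
```

The speedup saturates once the exposed communication time approaches zero, which is why the marginal benefit of extra bandwidth depends on how much communication is already hidden behind compute.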
C. Advanced Packaging and Chiplet
To increase the bandwidth between accelerators, a fundamental way is to reduce the physical distance between dies. This is also one of the primary goals of the advanced packaging technologies [12, 46, 47] that have garnered significant attention in recent years. Unlike traditional accelerator architectures, which package every single die into an individual device, advanced packaging architectures integrate multiple bare dies together (side by side or stacked vertically) and package them as a whole. As shown in Figure 4, the various advanced packaging technologies can be generally classified into three categories in terms of the carrier type: substrate-based, silicon-based, and redistribution layer (RDL)-based packaging technologies. The main advantages and disadvantages of each category are as follows:
Substrate-based packaging technology uses organic substrate materials to complete the wiring connections between dies with an etching process, which does not rely on the chip foundry process, so the related material and production costs are low. However, the density of IO pins is low and the transmission capability per pin is affected by the crosstalk effect, so the bandwidth of die-to-die connections is limited.
Silicon-based packaging technology implements the interconnection between dies by placing an extra silicon layer between the substrate and the dies. The connection between die and substrate is achieved with through-silicon vias (TSVs) and micro-bumps, which have smaller bump pitch and trace distance, so the IO density is improved and transmission delay and power consumption are reduced. However, the silicon interposer relies on the chip foundry process, so its cost is much higher than that of an organic substrate. To alleviate this, researchers have proposed some variations of the original silicon interposer technology, such as silicon bridge technology [48], which only integrates small thin silicon layers on the substrate for inter-die interconnection, and silicon interconnect fabric (Si-IF) technology [49], which removes the organic substrate and assembles the dies at small spacing using fine-pitch interconnects.
RDL-based packaging technology deposits metal and dielectric layers on the surface of the wafer to form a redistribution layer that carries the metal wiring pattern and rearranges the IO ports. At present, the fan-out style [50] is commonly used, which rearranges the IO ports in the loose area outside the die to shorten the circuit length and enhance signal quality. Compared to silicon interposer-based packaging, RDL-based fan-out packaging has lower cost but fewer wiring resources.
The bare dies integrated by advanced packaging are also called dielets or chiplets [53], which introduces another well-known concept: chiplet technology [51, 54, 55]. Advanced packaging is a packaging technology aimed at integrating multiple functional components, while chiplet technology is a design methodology that divides an integrated circuit into multiple independent chips (chiplets), each containing specific functions or modules. The former serves as the infrastructure for the latter, and the latter is the driving force for the development of the former.

Fig. 4. Advanced packaging categories. (a) Substrate-based packaging. (b) Silicon interposer-based packaging. (c) Redistribution layer (RDL)-based packaging. Blue font indicates the bump type and pitch for die-to-die interconnection. Adapted from [12, 51, 52].
Originally, the principle of chiplet technology was based on the concept of modular design and integration. By dividing a complex integrated circuit into smaller chiplets, each of which can be designed, manufactured, and tested independently, we can achieve easier customization, better reusability and scalability of chip designs, faster time-to-market, and improved yield and cost efficiency. Therefore, chiplet technology does not necessarily produce big chips. However, with the increasing demand for computing power, many works have emerged that integrate multiple chiplets into large chips to break through the upper limits of single-chip computing power while keeping high internal bandwidth [15, 16, 41, 56, 57]. Some of them even increase the total area to the wafer level; for example, a Tesla Dojo tile integrates twenty-five D1 dies with an area of 645 [15], resulting in a total area exceeding , which surpasses the area of the square inscribed in a 6-inch wafer. Such works can be referred to as wafer-scale chips.
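As a quick sanity check of the tile-level arithmetic above (assuming the quoted 645 figure for the D1 die is in mm², which is an assumption here), the total silicon area of a 25-die tile indeed exceeds the square inscribed in a 6-inch wafer:

```python
dies_per_tile = 25
die_area_mm2 = 645                             # per-die area quoted for D1 (assumed mm^2)
tile_area_mm2 = dies_per_tile * die_area_mm2   # 16,125 mm^2 across the tile

# The largest square inscribed in a circle of diameter d has side d/sqrt(2),
# hence area d^2 / 2. A 6-inch wafer has d = 6 * 25.4 mm = 152.4 mm.
wafer_diameter_mm = 6 * 25.4
inscribed_square_mm2 = wafer_diameter_mm ** 2 / 2   # ~11,613 mm^2

assert tile_area_mm2 > inscribed_square_mm2
```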
D. Wafer-scale Computing Systems
It should be noted that chiplet technology is just one way to achieve wafer-scale chips. Other technological approaches, such as the field stitching [13] used by Cerebras [17], can also accomplish this. This computing paradigm, which scales the size of a single chip to the wafer level to achieve high computing power and large bandwidth benefits, is referred to as Wafer-scale Computing, regardless of the approach adopted.
More is different. To extend the chip to the wafer level, many design issues need to be reconsidered. The major differences in design considerations among various chip scales are shown in Table I. While providing significant advantages in die-to-die bandwidth, integration density and programmability, wafer-scale chips also pose significant challenges in design space exploration, system implementation and tool chain development.
Figure 5 shows an overview of a typical Wafer-scale Computing system, as well as the main contents of this paper. The wafer-scale compute plane is the core component of the whole system. A lot of effort needs to be put into designing its architecture elements, including the hierarchy, microarchitecture, execution framework, NoC and on-chip memory system. The interconnection interface and protocol design (including intra-die, inter-die and off-chip interconnections) is also extremely critical to the performance of wafer-scale chips. To integrate the dies and interconnections together to form the physical wafer-scale chip, advanced packaging [12] or field stitching [13] technologies are required. Then, to drive the wafer-scale chip, specially designed power and clock delivery mechanisms are required, and cooling modules are also necessary because of the high density of heat power. Finally, to run diverse applications on a Wafer-scale Computing system, a dedicated compiler tool chain for mapping the workloads onto the chip is indispensable. In addition, fault tolerance designs relate to almost all of these aspects. In the rest of this paper, we will discuss the key points, challenges and possible future research directions from these aspects.
III. ARCHITECTURE
Like that of traditional chips, the architecture design of wafer-scale chips is aimed at improving performance, efficiency and flexibility. The difference is that designers should not only consider the trade-offs within a single, conventionally sized die, but also the co-design of a system tens of times larger. In this section, we will introduce the architecture design of existing typical wafer-scale works from the aspects of overall architecture, microarchitecture, dataflow, NoC, and memory system.
A. Hierarchy
First, let us review the overall architectures of three typical wafer-scale systems: UCLA&UIUC's work [16], Tesla Dojo [15] and Cerebras CS-2 [17], as shown in Figure 6, Figure 7 and Figure 8.
UCLA&UIUC's wafer-scale processor is comprised of 32 tiles; each tile heterogeneously integrates a compute chiplet and a memory chiplet. The compute chiplet contains 14 ARM cores. The total number of cores in the wafer-scale processor is 14,336. Any core on any tile can directly access the globally shared memory across the entire wafer-scale system using the wafer-scale interconnect network. Tesla's wafer-scale chip, called the Dojo training tile, is comprised of 25 D1 dies and 40 I/O dies, and each D1 die contains 354 nodes. Dojo nodes are full-fledged computers, and the total number of them in the Dojo training tile is 8,850. Cerebras
TABLE I
DESIGN CONSIDERATION DIFFERENCES OF DIFFERENT CHIP SCALES

| | Monolithic Conventional Chip | GPU Cluster (One Node) | Wafer-scale Computing System (One Chip) |
| Computing Power (FP16 Dense) | POPS | POPS | POPS |
| Monolithic Chip Area | | | |
| Common-Used Memory System Pattern | on-Chip Shared SRAM + off-Chip DRAM | on-Chip Shared SRAM + off-Chip DRAM | on-Chip Distributed SRAM |
| Inter-Die Interconnection | - | Network Cable or NVLink | Advanced Packaging or Field Stitching |
| Inter-Die Bandwidth | - | | |
| Inter-Die Bandwidth Density | - | | |
| Inter-Die Data Transfer Energy | - | | bit |
| Fault Tolerance Design for Computation | Not Required | Not Required | Redundant Process Elements, Bypass Routing, etc. |
| Fault Tolerance Design for Data Transfer * | Not Required | Ethernet Protocol | Redundant Physical Connections, Redundant Routing, NoC Protocol, etc. |
| Power Delivery | Conventional | Conventional | Edge or Vertical |
| Clock Distribution | Conventional | Conventional | Edge or Vertical |
| Cooling Solution | Conventional Air/Liquid | Conventional Air/Liquid | Conventional Air/Liquid or Microfluidics |
| Compilation | Single-chip Mapping | Multi-chip Mapping | Multi-chip Mapping with finer grain and 2D-constraint |

* Besides conventional ECC for DRAM.
Fig. 5. Overview of Wafer-scale Computing system and main contents of this paper.
Fig. 6. Overall architecture and microarchitecture of UCLA&UIUC's work. Adapted from [16].
Fig. 7. Overall architecture and microarchitecture of Tesla Dojo. Adapted from [15].
TABLE II
CHARACTERISTICS OF TYPICAL WAFER-SCALE SYSTEMS

| | UCLA&UIUC's work [16] | Tesla Dojo [15] | Cerebras CS-2 [17] |
| Name of Wafer / Base Core | Array / ARM Core | Dojo Tile / Node | WSE-2 / Core |
| Area (Wafer / Base Core) | / | / | / |
| Comp. Power | 4.4 POPS | 9.1 POPS | 7.5 POPS |
| SRAM (Wafer / Base Core) | / 64 KB | 11 GB / 1.25 MB | 40 GB / 48 KB |
| SRAM BW | 6.144 TBps | - | - |
| Network BW | 9.83 TBps | 9 TBps | - |
| Power | - | 15 kW | 30 mW |

* Estimated by dividing the total amount by the number of base cores.
** Estimated by multiplying the per-base-core value by the number of base cores.
CS-2 is comprised of dies, and each die is comprised of cores. Cerebras adopts more granular cores than UCLA&UIUC and Tesla; the total number of cores in Cerebras CS-2 is as high as 853,104.
The characteristics of the three wafer-scale designs are listed in Table II. In summary, all three integrate a large number of cores to form a wafer-scale chip, aggressively use on-wafer distributed memories, and adopt a mesh/torus topology at the top level. The computing power of each of the three wafers reaches several POPS.
B. Microarchitecture
Now let us look at the microarchitectures of the aforementioned wafer-scale designs, as shown on the right side of Figure 6, Figure 7, and Figure 8. The three designs employ different strategies for the base core. UCLA&UIUC's work [16] adopts a standard general-purpose processor (ARM Cortex-M3) as the base core, Cerebras CS-2 [17] builds the base core around tensor acceleration, and Tesla Dojo [15] designs a large full-fledged core that combines vector computing units and scalar general-purpose processing modules. AI acceleration is one of the primary target applications of
Fig. 8. Overall architecture and microarchitecture of Cerebras CS-2. Adapted from [17].
Tesla Dojo and Cerebras CS-2, so both devote substantial resources to vector units for tensor acceleration. The scalar general-purpose processing modules in Tesla Dojo aim to provide flexibility support, such as branch jumping and sparse computing; Cerebras CS-2 instead proposes a dataflow trigger mechanism to accomplish this.
The base cores of Tesla Dojo and Cerebras CS-2 are both homogeneously integrated to form a 2D mesh/torus grid. In UCLA&UIUC's work, the base cores are heterogeneously integrated with private SRAMs, shared SRAMs, and intra-tile networks to form a tile, and the tiles then form a homogeneous 2D mesh grid.
C. Execution Framework
To execute tasks, a wafer-scale system can exploit multiple granularities of parallelism, similar to GPU clusters [11, 58, 59]. As shown in Figure 9, the wafer-scale chip can be divided into multiple partitions; each partition, containing several cores, works like a device in a GPU cluster. At the macro level, tasks (e.g., training of neural networks) are executed under a data-parallel (DP) [28] or model-parallel (MP) [29] strategy. At the micro level, each node processes the sub-task assigned to it as traditional accelerators do: the vector units perform parallel computing on the partitioned tensors, and the scalar units provide flexibility support such as conditional branch jumping and sparse computing.
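The two levels of parallelism described above can be sketched in a few lines. This is an illustrative toy (the grid and partition sizes, and the function names, are hypothetical, not taken from any of the surveyed systems): a wafer-scale 2D core grid is carved into rectangular partitions, each acting like one "device" of a GPU cluster, and mini-batches are then assigned to partitions under a simple data-parallel strategy.

```python
# Toy sketch of wafer partitioning plus macro-level data parallelism.
# All sizes are made-up examples.

def make_partitions(grid_h, grid_w, part_h, part_w):
    """Map partition id -> list of (row, col) core coordinates."""
    assert grid_h % part_h == 0 and grid_w % part_w == 0
    parts, pid = {}, 0
    for r0 in range(0, grid_h, part_h):
        for c0 in range(0, grid_w, part_w):
            parts[pid] = [(r, c)
                          for r in range(r0, r0 + part_h)
                          for c in range(c0, c0 + part_w)]
            pid += 1
    return parts

def assign_data_parallel(parts, num_batches):
    """Round-robin mini-batches over partitions (pure DP, no MP)."""
    return {b: b % len(parts) for b in range(num_batches)}

parts = make_partitions(12, 12, 4, 4)       # 9 partitions of 16 cores each
schedule = assign_data_parallel(parts, 18)  # two mini-batches per partition
```

Within each partition, the model-parallel and per-node tensor work would then proceed as in a conventional accelerator.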
Compared with traditional accelerators, the differences include:
The much higher bandwidth of the network on chip (NoC) and the smaller gap between intra- and inter-die bandwidth broaden the parallelization design space.
2D-only interconnection imposes extra constraints on the design space. Specifically, long-distance data transfers on a 2D mesh NoC seriously waste bandwidth, so data locality must be maintained, as emphasized by Tesla. Usually, it is better to adopt a dataflow-based execution framework that, whenever possible, passes data only from one die to its neighbors.
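The neighbor-only dataflow idea can be made concrete with a toy placement check (a sketch; the stage count and the column-per-stage mapping are illustrative assumptions): pipeline stages are placed on consecutive mesh columns, so every inter-stage transfer crosses exactly one die-to-die link rather than spanning the wafer.

```python
# Toy sketch of neighbor-only dataflow on a 2D mesh: pipeline stage i is
# placed on mesh column i, so activations always move to an adjacent die.

def map_stages_to_columns(num_stages):
    placements = {i: i for i in range(num_stages)}  # stage -> mesh column
    hops = [abs(placements[i + 1] - placements[i])
            for i in range(num_stages - 1)]
    return placements, hops

placements, hops = map_stages_to_columns(8)
assert all(h == 1 for h in hops)  # no long-distance transfer on the mesh
```

Any placement whose hop list contains values greater than 1 would consume bandwidth on intermediate routers, which is exactly what the locality constraint tries to avoid.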
D. Network on Chip (NoC)
Topology is the first major design decision for a network on chip (NoC). Commonly used topologies include mesh, torus, binary tree, and butterfly tree [61], as shown in Figure 10. Topologies differ in the number of required routing units and in the minimum and maximum number of connections or hops between any two compute units, as shown in Table III.
Generally speaking, tree topologies tend to have a smaller number of hops than mesh and torus, especially in large networks, as shown in Figure 11. However, mesh and torus topologies are more popular in existing wafer-scale systems: the networks of UCLA&UIUC's work [16] (inter-tile), Tesla Dojo [15], and Cerebras CS-2 [17] all adopt a mesh or torus topology. The main reasons for this preference include:
TABLE III
CHARACTERISTICS OF COMMON-USED NOC TOPOLOGIES

| Topology | # Router Ports | # Routers | Min # Hops | Max # Hops |
| Mesh | 5 | N | 3 | Height+Width+2 |
| Torus | 5 | N | 3 | Height/2+Width/2+2 |
| Binary Tree | 3 | | 2 | |
| Butterfly Tree | 6 | | 2 | |

Note: The number of nodes is N. The data come from thesis [61].
Fig. 11. Maximum number of hops vs. number of nodes.
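The maximum-hop expressions from Table III can be written out directly, which reproduces the mesh-vs-torus trend of Figure 11 for a square grid (the grid sizes below are arbitrary examples):

```python
# Worst-case hop counts per Table III, for a Height x Width grid.

def mesh_max_hops(height, width):
    return height + width + 2            # mesh row of Table III

def torus_max_hops(height, width):
    return height // 2 + width // 2 + 2  # wrap-around links halve the span

# The torus' wrap-around links roughly halve the worst-case hop count.
for n in (8, 16, 32):
    assert torus_max_hops(n, n) < mesh_max_hops(n, n)
```

For a 16x16 grid, for instance, the mesh worst case is 34 hops while the torus worst case is 18, which illustrates why torus links are attractive when they can be physically routed.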
Less physical implementation difficulty: in mesh and torus topologies, the compute and routing unit pairs are arranged in a grid, which makes it easier to integrate them in a wafer-scale chip and to scale out.
Better fault tolerance: when a router breaks down, it is easier to find alternate paths in mesh and torus topologies than in tree topologies.
E. Memory System
In Cerebras CS-2 [17] and UCLA&UIUC's work [16], all memories are SRAMs. Cerebras CS-2 fully distributes the SRAMs among the cores. Although UCLA&UIUC's work adopts both private SRAMs for cores and shared SRAMs for global access, the shared SRAMs are likewise distributed among the tiles. Tesla Dojo [15] adopts both on-wafer SRAMs and off-wafer DRAMs (HBMs), and its SRAMs are also distributed among the cores (nodes).
Generally speaking, existing wafer-scale systems [15-17] tend to adopt distributed memory structures, for the following main reasons:
Attaching large DRAMs to each node of a traditional accelerator cluster is common practice, but it is difficult to integrate large DRAMs with each die on wafer-scale chips because of the limited area and the process differences between DRAM units and compute logic. As a result, the heavy storage burden of wafer-scale chips falls mainly on the distributed SRAMs. One might notice that Tesla also uses HBMs on the sides of Dojo tiles [15, 62], but the cost of data transfer between the HBMs and the central dies is high, so the locality of data access must be enhanced, as reported by Dojo developers on AI Day [60].
In a wafer-scale chip, centralized memory may cause severely non-uniform access latency between near and far cores. A distributed memory structure with an elaborated parallel strategy helps keep data local and avoids most long-distance memory accesses, which reduces the difficulty of scaling out.
The NoC bandwidth of wafer-scale chips is far more abundant than that of traditional accelerator clusters, which supports the use of distributed memory. As Cerebras pointed out at Hot Chips 2022 [17], traditional centralized memory with low bandwidth requires high data reuse and caching to be efficient, while distributed memory with high bandwidth can achieve full datapath performance without caching.
F. Discussion
1) Design space exploration for wafer-scale chips: Designing and optimizing a wafer-scale chip is a formidable task. The design elements are closely related and should be considered jointly. Many of them do not appear in traditional chip design, so there is no prior experience to draw on.
The design space of wafer-scale systems comprises three main layers: workload, system, and network [38]. In the workload layer, superficial requirements (e.g., the target application tasks and the expected performance) are transformed into high-level requirements (e.g., operator types, parallelization strategies, data communication and reuse patterns). At the system level, the microarchitecture and overall architecture (e.g., dataflow design, compute units, and memory organization) are decided to meet those requirements. At the network level, the network framework and implementation (e.g., hierarchical fabric design and topology, interconnect interfaces and protocols) are decided to ensure the system runs as expected. In addition, the wafer-scale setting introduces many extra variables, e.g., various implementations of die-to-die interconnection and fault-tolerance mechanisms.
The design choices at the three levels interact with each other, so deciding the design elements separately, step by step, may cause two problems: 1) cumulative effect of minor losses [63]: the accumulation of small losses in the parts may cause a large loss in the whole system, resulting in unsatisfactory global solutions; 2) ignorance of the system bottleneck: the optimizer at one level cannot get exact information about the other levels, so it does not know what the real system bottleneck is and may waste much effort on non-critical paths, leading to non-optimal choices. However, exploring the whole design space of wafer-scale chips across all three levels at once suffers from tremendous time complexity.
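The cost gap between joint and layer-by-layer exploration is easy to see with toy numbers (the per-layer option counts below are hypothetical): a joint search evaluates the product of the per-layer spaces, while a greedy layer-by-layer search evaluates only their sum, which is far cheaper but blind to cross-layer interactions and the real system bottleneck.

```python
import math

# Hypothetical numbers of candidate choices per design-space layer.
layer_options = {"workload": 12, "system": 40, "network": 25}

joint_points = math.prod(layer_options.values())   # exhaustive co-design
sequential_points = sum(layer_options.values())    # greedy, layer by layer
```

Even for these small counts the joint space has 12,000 points against 77 for the sequential search; realistic option counts make exhaustive joint exploration intractable, which is why pruning and cross-layer modeling are needed.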
Although Cerebras and Tesla have disclosed most of the detailed parameters of their wafer-scale chips, they have not disclosed their design space exploration strategies. The group from UCLA&UIUC has presented a design space exploration framework [63], the first optimization framework for the multiple chiplet/system selection problem. From the view of wafer-scale chip design, this framework has the advantage of supporting heterogeneous integration, but it does not cover important integration models such as Cerebras CS-2's reticle stitching technology [8, 13]. Moreover, the reported results lack cases of training large neural network models on real wafer-scale chips. How to abstract the requirements of such workloads, how to decompose the tasks and map them to the cores, how to load data from off-wafer storage, and how to design the dataflow and organize data communication across multiple cores all remain unclear. To fully exploit the advantages of wafer-scale integration, where the design choices at different levels are tightly coupled, cross-layer hardware and software co-design is required. Therefore, a more comprehensive and mature methodology for wafer-scale chip design space exploration has yet to be developed.
2) Design considerations related to the yield problem: Compared with traditional chip manufacturing, building wafer-scale systems faces a greater yield challenge, so designers propose fault-tolerance mechanisms to address it. Cerebras proposes a fault-tolerance mechanism with redundant cores and redundant fabric links [9, 64]: the redundant cores can replace defective cores, and the extra links can reconnect the fabric to restore the logical 2D mesh. Unlike UCLA&UIUC's work and Tesla Dojo, which assemble individually pre-tested chiplets (i.e., known-good dies, KGDs) into a wafer-scale chip, Cerebras directly produces an integrated wafer-scale chip, so its yield challenge is more significant. Therefore, Cerebras' cores are designed to be quite small (much smaller than the cores of UCLA&UIUC's work and Tesla Dojo) in order to address yield at low cost. UCLA&UIUC's fault-tolerance mechanism [16] mainly focuses on die-to-die interconnection: two independent networks across the wafer are designed, one with X-Y dimension-ordered routing and the other with Y-X dimension-ordered routing, to ensure access between any two tiles. There is still much room for improving existing fault-tolerance mechanisms. A fault can be caused by defective links, chiplets, cores, or partial logic within a core. Different problems should be solved in different ways, and their interactions need to be considered. A complete set of fault-tolerance technologies for wafer-scale systems has yet to be developed.
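The value of keeping two dimension-ordered networks can be illustrated with a short routing sketch (coordinates and grid positions are hypothetical): an X-Y path (horizontal first, then vertical) and a Y-X path between the same pair of tiles share only their endpoints on a 2D mesh, so a defect along one path still leaves the other available.

```python
# Toy dimension-ordered routing on a 2D mesh. Tiles are (x, y) coordinates.

def xy_route(src, dst):
    """Route along X first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path = [(x, sy) for x in range(sx, dx, 1 if dx > sx else -1)]
    path += [(dx, y) for y in range(sy, dy, 1 if dy > sy else -1)]
    return path + [dst]

def yx_route(src, dst):
    """Same idea with the dimension order swapped."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, y) for y in range(sy, dy, 1 if dy > sy else -1)]
    path += [(x, dy) for x in range(sx, dx, 1 if dx > sx else -1)]
    return path + [dst]

p1 = xy_route((0, 0), (3, 2))
p2 = yx_route((0, 0), (3, 2))
# The two routes intersect only at the source and destination tiles.
assert set(p1) & set(p2) == {(0, 0), (3, 2)}
```

This endpoint-only overlap is what makes the pair of networks an effective redundancy scheme: any single defective intermediate tile or link can block at most one of the two routes.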
The yield problem is also related to other design points. In existing wafer-scale systems [15-17], the NoC router and compute part of each core are designed to be highly decoupled, which facilitates the implementation of fault-tolerance mechanisms (and the utilization of distributed memories). Besides, existing wafer-scale system designs [15-17] integrate only SRAMs on-chip, which are low-density and expensive, because wafer-scale integration technology is not yet mature enough to resolve the process differences between DRAMs and compute logic while maintaining yield. This may be a research direction in the future.
IV. INTERCONNECTION INTERFACES AND PROTOCOLS
The interconnection interface and protocol design (including intra-die, inter-die, and off-chip interconnections) is extremely critical to the performance of wafer-scale chips. Compared with traditional chips, designers must consider not only the unit transmission bandwidth and the power consumption per bit, but also the requirements of advanced manufacturing processes, packaging technologies, and system integration.
In this section, we first review common interconnect interfaces and protocols that may be used in wafer-scale chips, and then discuss which characteristics of existing interfaces and protocols are especially important for building Wafer-scale Computing systems, and how they can be tailored to fit such systems.
A. Intra-Die Interconnection
The intra-die interconnection is responsible for communication among the processing elements and memories within each die. For wafer-scale chips, the optional styles of intra-die
interconnection include bus, crossbar, ring, network-on-chip (NoC), and so on, which are similar to those for conventional chips. The bus is the simplest to design but hard to scale with the number of processing elements, while the NoC is the opposite [65]. There are several typical, commonly used bus protocols, such as the Advanced Microcontroller Bus Architecture (AMBA) Advanced High-Performance Bus (AHB) and Advanced eXtensible Interface (AXI), the Wishbone Bus, the Open Core Protocol (OCP), and the CoreConnect Bus [66]. In contrast, NoC protocols are more complex and usually need to be designed specifically for the application scenario. Designers must decide on many design elements to balance performance, cost, and robustness, such as the number of destinations (unicast or multicast), routing decisions (centralized, source, distributed, or multiphase), adaptability (deterministic or adaptive), progressiveness (progressive or backtracking), minimality (profitable or misrouting), and the number of paths (complete or partial) [67].
B. Inter-Die Interconnection
The inter-die interconnection interfaces and protocols are used to interconnect multiple dies in the same package. Serial and parallel interfaces are the two options for the die-to-die physical-layer interface in wafer-scale chips.
1) Inter-Die Serial Interfaces: Serial interfaces can be implemented over long interconnection distances, so they are the mainstream in the field of wafer-scale chips.
USR SerDes: SerDes [51, 68] encompasses a series of serial interfaces ranging from long reach to short reach, including USR (Ultra-Short Reach), which targets high-speed die-to-die connection at ultra-short distances (10 mm level) through 2.5D/3D heterogeneous integration. Because of its short interconnection distance, USR can provide low power consumption, nanosecond-level latency, and a low error rate via advanced coding, multi-bit transmission, and other techniques. However, due to its short transmission-distance requirement, implementing USR in large-scale integration such as wafer-scale chips can be challenging.
Apple UltraFusion: Apple's M1 Ultra chip [69] uses TSMC's 5 nm process. It includes 20 CPU cores, 64 GPU cores, 32 neural-engine (NPU) cores, 128 GB of memory, and 800 GBps of memory bandwidth, and it solves the interconnection of very-large-area chiplets. The UltraFusion die-to-die interconnection design is particularly impressive. According to currently published papers and patents, UltraFusion should be an interconnection architecture based on TSMC's CoWoS-S5 technology, using silicon interposer and micro-bump technologies. UltraFusion provides more than 10,000 die-to-die connection signal lines with ultra-high interconnection bandwidth. Compared with other multi-chip module (MCM) packaging technologies, UltraFusion's interposer provides dense, short metal interconnects between logic dies or between logic dies and memory stacks; it offers better inter-chip integrity, lower power consumption, and higher clock frequencies. In addition, UltraFusion effectively improves packaging yield and reduces cost through die-stitching technology. In UltraFusion, only the Known Good
Die (KGD) is bonded, which avoids the problem in traditional Wafer-on-Wafer (WoW) or Chip-on-Wafer processes where failed chips are encapsulated, thereby increasing post-package yield and reducing the overall average cost.
AMD Infinity Fabric: Infinity Fabric (IF) [70] is AMD's proprietary system interconnect, consisting of two systems, the Infinity Scalable Data Fabric (SDF) and the Infinity Scalable Control Fabric (SCF), responsible for data transmission and control, respectively. The SDF has two distinct variants, on-die and off-die. The on-die SDF connects the CPU cores, GPU cores, IMC, and other parts within the chip, while the off-die SDF is responsible for die-to-die interconnection on a package or across multiple sockets.
2) Inter-Die Parallel Interfaces: Compared with serial interfaces, parallel interfaces can deliver higher bandwidth, lower latency, and lower IO power consumption, which is attractive for die-to-die interconnection in wafer-scale chips. However, the requirement for massive transmission wires and extremely short interconnection distances has made them hard to apply in practice so far. In the future, when the process can achieve relatively high wire density and close inter-die distances, parallel interfaces can be taken into consideration.
Intel AIB/MDIO: Intel's AIB/MDIO [71] provides a parallel interconnection standard for the physical layer. MDIO technology provides higher transmission efficiency, and its response speed and bandwidth density are more than twice those of AIB. AIB and MDIO are applied in advanced 2.5D/3D packaging technologies with short interconnection distances and high interconnection density, such as EMIB [72] and Foveros [73].
TSMC LIPINCON: TSMC's LIPINCON [74] is a high-performance parallel interconnection interface technology developed for the specific case of chiplet memory interfaces. Through advanced packaging technologies such as InFO and CoWoS, it can use a simpler "clock forwarding" circuit, which greatly reduces the area overhead of the I/O circuit and its parasitic effects.
OCP ODSA BoW: The BoW interface proposed by the OCP ODSA group focuses on parallel interconnections based on organic substrates [75]. There are three types of BoW: BoW-Base, BoW-Fast, and BoW-Turbo. BoW-Base is designed for transmission distances of less than 10 mm, using an unterminated unidirectional interface with a data rate of up to 4 Gbps per line. BoW-Fast uses a terminated interface for trace lengths up to 50 mm, with a transmission rate of 16 Gbps per line. Compared with BoW-Fast, BoW-Turbo uses two wires per lane and supports bidirectional 16 Gbps transmission bandwidth.
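The per-line rates quoted above translate directly into aggregate link bandwidth. The following back-of-the-envelope sketch uses the stated rates (4 Gbps for BoW-Base, 16 Gbps for BoW-Fast/Turbo); the line count and the simple 2x model for Turbo's bidirectional operation are illustrative assumptions, not part of the specification.

```python
# Aggregate bandwidth estimate for a BoW link: per-line rate x line count,
# doubled when both directions run simultaneously (BoW-Turbo).

BOW_RATE_GBPS = {"base": 4, "fast": 16, "turbo": 16}

def aggregate_gbps(flavor, lines, bidirectional=False):
    return BOW_RATE_GBPS[flavor] * lines * (2 if bidirectional else 1)

# A hypothetical 64-line link in each flavor:
base_link = aggregate_gbps("base", 64)           # 256 Gbps one way
fast_link = aggregate_gbps("fast", 64)           # 1024 Gbps one way
turbo_link = aggregate_gbps("turbo", 64, True)   # both directions at once
```

Such quick estimates are useful when sizing the number of lines a die edge must expose to reach a target die-to-die bandwidth.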
NVLink-C2C: NVIDIA's NVLink-C2C [76] is a chip interconnection technology extended from NVLink [42] for custom silicon integration. It is extensible from PCB-level integration, multi-chip modules (MCM), and silicon interposer or wafer-level connections, allowing NVIDIA GPUs, DPUs, and CPUs to be coherently interconnected with custom silicon. With advanced packaging, the NVLink-C2C interconnect delivers up to 25x the energy efficiency and 90x the area efficiency of a PCIe Gen 5 PHY on NVIDIA chips.
3) Inter-Die Interconnection Protocols: A variety of die-to-die interconnection protocols can be adapted to the physical-layer interfaces. When choosing a suitable die-to-die interconnection protocol for wafer-scale chips, programmability, scalability, and fault tolerance should be considered.
Intel PCIe Gen 4, 5, and 6: PCI Express (PCIe) [77] is a general-purpose serial interconnection technology suitable not only for interconnecting GPUs, network adapters, and other peripheral devices, but also for die-to-die communication. Since PCIe Gen 3 was widely adopted, the standard has developed rapidly. PCIe Gen 5, finalized in 2019, doubles the transfer rate of Gen 4 to 32 GT/s. The latest PCIe Gen 6 standard has now been released, doubling the data transfer rate again. In addition, PCIe Gen 6 adopts PAM4 signaling in place of the NRZ used by PCIe 5.0, encapsulating more data in a single channel in the same time. PCIe 6.0 [78] also introduces low-latency forward error correction (FEC) and related mechanisms to improve bandwidth efficiency and reliability.
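The per-generation doubling can be sanity-checked with a rough throughput model: raw rate (GT/s) times lane count times encoding efficiency. The GT/s figures below are the publicly known per-generation values (16, 32, 64 GT/s); treating Gen 6's PAM4/FLIT mode as efficiency ~1.0 is a simplifying assumption rather than a figure from the standard.

```python
# Rough per-direction throughput of an x16 PCIe link, in GB/s.

def x16_throughput_gbps(rate_gts, efficiency):
    lanes = 16
    return rate_gts * lanes * efficiency / 8   # GT/s -> GB/s per direction

gen4 = x16_throughput_gbps(16, 128 / 130)   # 128b/130b line code, ~31.5 GB/s
gen5 = x16_throughput_gbps(32, 128 / 130)   # ~63.0 GB/s
gen6 = x16_throughput_gbps(64, 1.0)         # PAM4/FLIT approximated as 1.0
assert gen4 < gen5 < gen6
```

The model ignores packet and FLIT framing overheads, so real achievable throughput is somewhat lower, but the generation-over-generation doubling is clearly visible.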
Intel UCIe 1.0: Intel, AMD, Arm, Qualcomm, TSMC, and other industry giants jointly established a chiplet standards alliance and officially launched the Universal Chiplet Interconnect Express (UCIe) standard, aiming to define an open, interoperable chiplet ecosystem [79]. The UCIe 1.0 protocol layer builds on Intel's mature PCIe and CXL (Compute Express Link) interconnection technologies: PCIe provides broad interoperability and flexibility, while CXL can be used for more advanced low-latency/high-throughput connections. The establishment of the UCIe standard combines the advantages of various fabs and technology companies, which is conducive to the development of integrated-circuit interconnection technology.
C. Off-Chip Interconnection
Due to their high-density integration, wafer-scale chips require external IO modules for chip-to-chip interconnection. PCIe is usually used for the external communication of the IO modules; for example, in Dojo's design the wafer-scale chip is connected to the IO module through a PCIe PHY. Besides PCIe, many other high-speed protocols can also be applied to chip-to-chip interconnection, with different tradeoffs taken into account.
1) Off-Chip Interconnection Interface:
LR/MR/VSR SerDes: For chip interconnection over PCB boards, SerDes links can be divided into long-reach (LR), medium-reach (MR), and very-short-reach (VSR) classes according to the transmission distance between chips [51, 68]. Although this interconnection technology offers high reliability, low cost, and easy integration, it struggles to meet the latency, power, and density requirements of high-performance communication between wafer dies.
2) Off-Chip Interconnection Protocol:
Ethernet: EtherNet/IP [80] is an industrial communication network managed and published under the ODVA specification, realized by combining the CIP protocol with TCP/IP and Ethernet. Currently, the mainstream Ethernet variants include 100/25 Gigabit Ethernet, 200 and 400 Gigabit Ethernet, and RoCE. 100/25 Gigabit Ethernet makes it possible to obtain a large number of 25 Gbps ports from a single 100 Gbps switch using fanout cables. 200 Gbps Ethernet uses four 50 Gbps lanes per port, whereas the original 400 Gbps standard uses eight 50 Gbps lanes, simply doubling the number of lanes per port. RoCE is a protocol for realizing RDMA access/transmission over conventional Ethernet; its basic idea is to encapsulate an InfiniBand transport packet into a normal Ethernet packet at the link layer. It relies on direct memory access (DMA): the network adapter reads and writes host memory directly, bypassing the CPU cores and reducing CPU load.
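The lane arithmetic behind these Ethernet speed grades is simple multiplication, as the following minimal sketch illustrates (the function name is our own):

```python
# Port speed = number of electrical lanes x per-lane signaling rate.

def port_speed_gbps(lanes: int, lane_rate_gbps: int) -> int:
    """Aggregate Ethernet port speed in Gbps."""
    return lanes * lane_rate_gbps

print(port_speed_gbps(4, 25))  # 100: 100GbE as 4 x 25G (fanout to 25G ports)
print(port_speed_gbps(4, 50))  # 200: 200GbE uses four 50G lanes
print(port_speed_gbps(8, 50))  # 400: original 400GbE doubles the lane count
```

The same pattern explains the fanout cables mentioned above: a 100 Gbps switch port can be split into its four underlying 25 Gbps lanes and presented as four independent ports.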
NVIDIA NVLink: NVIDIA's NVLink [42] is the first high-speed inter-GPU interconnection technology, designed to improve communication between GPUs in multi-GPU systems. NVLink allows multiple GPUs to share a unified memory space, so that one GPU can work on the local memory of another GPU, with support for atomic operations. The NVIDIA P100 was the first product equipped with NVLink 1.0; a single GPU gets 160 GB/s of bandwidth, equivalent to about 5 times the bandwidth of a PCIe 3.0 x16 link. The NVIDIA V100 is equipped with NVLink 2.0, which increases the per-GPU bandwidth to 300 GB/s, almost 10 times that of PCIe 3.0. Currently, the NVIDIA A100 integrates the latest NVLink 3.0: a single NVIDIA A100 Tensor Core GPU supports up to 12 NVLink 3.0 links, with a total bandwidth of 600 GB/s, almost 10 times the bandwidth of PCIe 4.0. In addition, NVIDIA has developed a dedicated NVLink switch chip, NVSwitch. NVSwitch has multiple ports that aggregate multiple NVLinks and can be used to interconnect many GPUs; the second-generation NVSwitch enables simultaneous high-speed communication between all GPUs at the full 600 GB/s per GPU.
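The A100 figures above can be checked with the same kind of arithmetic: total NVLink bandwidth is simply the per-link bidirectional rate times the link count, compared against a PCIe 4.0 x16 reference. The constants below restate numbers from the text; the PCIe reference figure (64 GB/s bidirectional for x16) is our assumption for the comparison.

```python
# NVLink 3.0 on A100: 12 links x 50 GB/s (bidirectional) per link.
NVLINK3_LINK_GB_S = 50
A100_NVLINK_COUNT = 12
PCIE4_X16_GB_S = 64  # assumed bidirectional reference for PCIe 4.0 x16

total_gb_s = NVLINK3_LINK_GB_S * A100_NVLINK_COUNT
print(total_gb_s)                                  # 600
print(round(total_gb_s / PCIE4_X16_GB_S, 1))       # 9.4, i.e. "almost 10x"
```
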
ARM CCIX: Cache Coherent Interconnect for Accelerators (CCIX) [81] is an open standard interconnection designed to provide a cache-coherent interconnection between CPUs and accelerators. CCIX mainly comprises a protocol specification and a transport specification, and supports cache coherence by extending the transaction and protocol layers on top of the standard PCIe data link layer. CCIX usually uses two mechanisms to improve performance and reduce latency. One is cache coherence itself, which automatically keeps the caches of processors and accelerators coherent. The other is to increase the raw bandwidth of the CCIX link beyond the standard PCIe 4.0 transfer rate, up to a maximum rate of 25 GT/s.