Nowadays, artificial intelligence (AI) technology with large models plays an increasingly important role in both academia and industry, bringing a rapidly growing demand for hardware computing power. As the computing demand of AI continues to grow, the growth of hardware computing power has failed to keep up, which has become a significant factor restricting the development of AI. The growth of hardware computing power is mainly propelled by increases in transistor density and chip area. However, the former is impeded by the end of Moore's Law and Dennard scaling, and the latter is severely restricted by the challenge of disrupting legacy fabrication equipment and processes.
In recent years, advanced packaging technologies that have gradually matured are increasingly used to implement bigger chips that integrate multiple chiplets, while still providing interconnections with chip-level density and bandwidth. This technique points out a new path for continuing to increase computing power while leveraging the current fabrication process without significant disruption. Enabled by this technique, a chip can extend to wafer scale (over ), provisioning orders of magnitude more computing capability (several POPS within just one monolithic chip) and die-to-die bandwidth density (over ) than regular chips, giving rise to a new Wafer-scale Computing paradigm. Compared to conventional high-performance computing paradigms such as multi-accelerator and datacenter-scale computing, Wafer-scale Computing shows remarkable advantages in communication bandwidth, integration density, and programmability potential. Not surprisingly, disruptive Wafer-scale Computing also brings unprecedented design challenges for hardware architecture, design-system-technology co-optimization, power and cooling systems, and the compiler tool chain. At present, there are no comprehensive surveys summarizing the current state and design insights of Wafer-scale Computing. This paper aims to take the first step to help academia and industry review existing wafer-scale chips and essential technologies in a one-stop manner, so that people can conveniently grasp the basic knowledge and key points, understand the achievements and shortcomings of existing research, and contribute to this promising research direction.
Index Terms: AI chip, wafer-scale computing, hardware architecture, software system.
Fig. 1. The scissors difference between required and provided computing power.
I. INTRODUCTION
In today's world, artificial intelligence (AI) technology with large models plays an increasingly important role in promoting scientific progress and is becoming a basic tool for human beings to explore the world. For example, AlphaFold2 can predict of human protein sequences with high confidence while the traditional method can only cover 17% [1], AlphaTensor discovers fast matrix multiplication algorithms and surpasses the classic algorithm discovered 50 years ago for the first time [2], DGMR beats competitive methods in of the cases of two-hour-ahead high-resolution weather forecasting [3], and DM21 performs better than traditional functionals for describing matter at the quantum level [4]. As an AI language model developed by OpenAI, ChatGPT [5] has gained significant popularity and recognition in the field of natural language processing (NLP) since its release.
Although large AI models offer numerous benefits to research, production, and daily life, they also impose an immense demand on hardware computing power. With the emergence of the transformer model [6] in recent years, the demand for computing power required by large models has experienced explosive growth, increasing by a factor of 1,000 within a span of two years, as shown in Figure 1. By contrast, we observe that the hardware computing power utilized to train large models only doubles over a period of two years. As a result, a significant gap exists between the computing power demanded by large models and that which is currently available from chips, which serves as a key limiting factor for the advancement of AI.
Fig. 2. Limiting factors of computing power: Computing Power of Chip = Transistor Density (@ Process) × Chip Area (@ Exposure Area).
The computing power of a chip can be represented as the total number of transistors integrated within it, which is the product of transistor density (i.e., the number of transistors per unit area) and area. Transistor density is mainly determined by the advancement of the chip process, whereas chip area is predominantly determined by the caliber of the chip photolithography process, and is limited by a maximum reticle size. Unfortunately, both paths are presently encountering significant obstacles that impede their continuation.
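This relationship can be written compactly as follows (our notation, formalizing Figure 2):

\[
P_{\text{chip}} \propto D_{\text{transistor}} \times A_{\text{chip}}, \qquad A_{\text{chip}} \le A_{\text{reticle}},
\]

where \(D_{\text{transistor}}\) is set by the process node and \(A_{\text{chip}}\) is capped by the maximum reticle (exposure) size.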
Firstly, the increasing difficulty of improving transistor density due to the slowing or termination of Moore's Law and Dennard scaling has become a significant issue after 3 nm [7]. Secondly, the chip area is generally restricted by reticle size (i.e., the largest area of the wafer that can be patterned using a lithography stepper system), which is challenging to enlarge while maintaining the most up-to-date chip process and preserving yield [8].
In recent years, advanced packaging technologies that have gradually matured are increasingly used to implement bigger chips that integrate multiple chiplets/dielets, while still providing interconnections with chip-level density and bandwidth [9-11]. This points out a new path for continuing to increase computing power while leveraging the current fabrication process without significant disruption, and gives rise to a new and exciting Wafer-scale Computing paradigm. In this paper, Wafer-scale Computing is defined as the computing paradigm which extends the chip area to wafer scale (over ) by leveraging advanced packaging [12] or field stitching [13] techniques to tightly integrate multiple chiplets/dielets, provisioning orders of magnitude more computing capability (several POPS within just one monolithic chip) and die-to-die bandwidth density (over ) than regular chips, while leveraging the current fabrication process without significant disruption.
One might question why not construct an accelerator cluster (such as NVIDIA's GPU cluster [14]) using multiple chips to achieve computing power that surpasses that of a single conventional chip. Alternatively, what benefits do wafer-scale chips have over accelerator clusters? The answers mainly include the following aspects:
One of the primary advantages of wafer-level integration is the significant enhancement of die-to-die bandwidth. This benefit is straightforward and evident. For instance, NVIDIA's NVLink interconnect provides a bandwidth of 900 GB/s for H100 GPUs [14], whereas every D1 die edge in Tesla Dojo delivers a bandwidth of 2 TB/s [15]. This high bandwidth breaks the previous memory limitations and opens up new design spaces for exploring more aggressive application solutions that can fully utilize the available computing power.
Moreover, wafer-scale chips offer better integration density, which implies higher size and form factor efficiencies. For instance, a Tesla Dojo training tile can tightly integrate 25 regular-sized dies, whereas 25 NVIDIA H100s require a complete package for each GPU and take up more than ten times the total area. This size advantage may be preferred in supercomputing applications with limited space, such as aerospace and military applications.
Lastly, wafer-scale chips offer significant potential for programmability. Compared to GPU clusters, wafer-scale chips have considerably less overhead for inter-die and intra-die data communication, and current wafer-scale chip designs [15-17] strive to minimize this gap even further. This means that programmers do not have to be overly concerned about data access across multiple dies, and compilers can generate more flexible and fine-grained hardware resource partitioning and task mapping to more effectively utilize the available computing power.
Not surprisingly, this disruptive wafer-scale computing paradigm also brings unprecedented challenges in nearly every design aspect.
To start, the hardware footprint of a Wafer-scale Computing system is orders of magnitude larger than that of traditional chips, bringing significantly larger design spaces, which nullifies conventional architectural design methodologies.
From an architectural perspective, the current hardware execution model exhibits inadequate scalability. To enable the efficient operation of Wafer-scale Computing systems, it is imperative to develop new computational and execution models.
From an integration/packaging perspective, the current advanced packaging technologies are being implemented in Wafer-scale Computing only as experimental runs and in an ad hoc manner. There is still a need for systematic design principles to determine the appropriate wafer-scale substrate techniques and layout, address yield issues, and co-optimize the packaging and system design.
From a systems perspective, there is currently no cross-stack system design methodology for Wafer-scale Computing systems. Such systems operate as a tightly coupled whole, with computing dies, wafer-scale substrates, power and cooling systems, and mechanical parts. The performance and availability of any single component can significantly affect the other parts.
From a software perspective, the current software stack has not encountered computing resources on the scale of Wafer-scale Computing and lacks efficient mechanisms to fully leverage the power of such systems. There is a high demand for compiling and execution mechanisms to map big AI models to Wafer-scale Computing resources and run them efficiently.
At present, there are no comprehensive surveys summarizing the current state and design insights of Wafer-scale Computing. The objective of this paper is to provide a comprehensive overview of existing wafer-scale chips and their essential technologies in a single source. This will enable individuals in academia and industry to easily comprehend the fundamental concepts and key aspects, assess the accomplishments and limitations of existing research, and make contributions to this promising research field.
The rest of this paper is organized as follows: Section II introduces the background technologies. Section III discusses the key architecture design points of wafer-scale chips based on existing works. Section IV reviews commonly used interconnect interfaces and protocols which may be used in wafer-scale chips. Section V reviews typical compiler tools for driving large-scale acceleration platforms (including traditional ones and Wafer-scale Computing ones), and discusses what makes the compilation for Wafer-scale Computing different from the traditional one. Section VI reviews how existing works integrate wafer-scale chips. Section VII discusses how to deliver power and clock to the whole wafer-scale chip, and how to solve the cooling problem under such high integration density. Section VIII provides a summary of fault tolerance designs. Section IX presents a brief introduction to the important scientific computing applications of Wafer-scale Computing outside of the AI computing domain. Section X concludes the paper.
II. BACKGROUND
In this section, we will introduce the efforts to improve the computing power of AI acceleration.
A. Conventional Accelerators for AI Tasks
As Moore's Law and Dennard scaling are slowing down or even terminating, various accelerators different from classic Von Neumann architectures have been proposed to meet the ever-increasing user demand of AI applications. These accelerators mainly include general-purpose graphics processing units (GPGPUs) [18], field-programmable gate arrays (FPGAs) [19], application-specific integrated circuits (ASICs) [20] and coarse-grained reconfigurable architectures (CGRAs) [21]. GPGPUs, such as NVIDIA's A100 [22] and H100 [14], are designed for scientific computation, encryption/decryption, and, of course, AI computation. GPGPUs have many more computing units than Von Neumann CPUs and adopt supporting execution mechanisms (e.g., single instruction multiple thread, SIMT [23]), so they can surpass CPUs in the performance of highly parallel computing. Since GPGPUs usually have large volume and power consumption, they are mainly used in data centers. With the popularization of AI and increasingly diversified user needs, researchers also develop more domain-specific accelerators on platforms other than GPGPUs, to meet demands such as energy efficiency and portability. These accelerators, including FPGAs, ASICs and CGRAs, simplify the computing units, datapaths and memories to exactly match the application scenarios, so they usually have advantages in energy efficiency and area efficiency over GPGPUs. FPGAs offer fine-grained programmable logic, computing and storage units, allowing users to fully customize the computing path structure according to algorithm requirements. CGRAs offer coarse-grained reconfigurability, providing flexibility that is limited but sufficient for the target application scenarios. ASICs cannot change their functions in the original sense; however, for AI accelerators the boundary between ASIC and CGRA is blurred, and many so-called ASIC-based AI accelerators (e.g., [24-26]) also have coarse-grained reconfigurability.
The accelerators mentioned above have in common that they adopt 1) large numbers of parallel computing units to provide high computing power, 2) hierarchical memory systems to reuse data and reduce data movement overhead, 3) convenient networks on chip (NoCs) to connect the distributed computing units and memories, and 4) elaborate dataflows to utilize as much computing power as possible. Designers make decisions on these issues to strive for the best performance, efficiency or flexibility.
B. Accelerator Clusters
Due to the slowdown of CMOS scaling and the limit of chip area, the number of available transistors on a monolithic chip has become stagnant, which seriously restricts the processor design optimization space. The gains from specialized architecture design will gradually diminish and ultimately hit an upper bound, which is called the accelerator wall [27]. To address the gap between the diminishing gains of accelerator performance and the growing demand of computing power for AI tasks, a straightforward way is to group multiple individual accelerators into a cluster, so that much more computing power can be exploited.
For running on an accelerator cluster, AI tasks are usually executed under the data parallel (DP) [28] or model parallel (MP) [29] strategy, as shown in Figure 3. The accelerator cluster can be seen as a graph where each node represents one accelerator (or some tightly linked accelerators), and each edge represents the data path between nodes. Under the DP strategy, samples are distributed to nodes (i.e., different nodes hold different parts of the samples), while weights are broadcast to all nodes (i.e., each node processes a copy of all weights). In contrast, under the MP strategy, weights are divided along the dimension of input/output channels (tensor MP) or layers (pipeline MP) and distributed to nodes, and each node processes all samples. A minimal sketch of the two partitioning schemes is shown below.
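The following numpy sketch contrasts DP and tensor-MP partitioning for a single linear layer; the node count and tensor shapes are illustrative, not taken from any system described in this paper:

```python
import numpy as np

num_nodes = 4
X = np.random.randn(64, 512)    # a batch of 64 samples (illustrative shapes)
W = np.random.randn(512, 1024)  # weights of one linear layer

# Data parallelism (DP): each node gets a slice of the samples and a full copy of W.
x_shards = np.array_split(X, num_nodes, axis=0)
dp_out = np.concatenate([x @ W for x in x_shards], axis=0)   # gather samples

# Tensor model parallelism (MP): each node gets all samples and a slice of the
# output channels of W; partial outputs are concatenated along the channel axis.
w_shards = np.array_split(W, num_nodes, axis=1)
mp_out = np.concatenate([X @ w for w in w_shards], axis=1)   # gather channels

# Both schemes reproduce the unpartitioned result.
assert np.allclose(dp_out, X @ W) and np.allclose(mp_out, X @ W)
```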
Since the training of neural networks on multiple nodes usually requires an all-reduce operation (i.e., collecting partial results from all nodes and distributing the updated data to all nodes) [30-33], a powerful and stable network with high bandwidth should be built to link all the accelerators. Taking an NVIDIA GPU cluster as an example, the nodes are usually linked with a Fat-tree network [34], ensuring that each node can access any other in full bandwidth. Based on InfiniBand Quantum-2 switches [35-37] or Ethernet data center switches of equivalent performance, and employing a three-layer Fat-tree network architecture, a single cluster can support a maximum of 100,000 GPU cards.
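To make the all-reduce pattern concrete, here is a minimal single-process simulation of the classic ring all-reduce (a hedged sketch: node count and array sizes are toy values, and real implementations overlap these transfers with computation):

```python
import numpy as np

def ring_all_reduce(arrays):
    """Ring all-reduce: every node ends up with the element-wise sum of all
    nodes' arrays after 2*(N-1) neighbor-to-neighbor steps (a reduce-scatter
    phase followed by an all-gather phase)."""
    n = len(arrays)
    chunks = [list(np.array_split(a.astype(float), n)) for a in arrays]
    for t in range(n - 1):                      # reduce-scatter phase
        for i in range(n):
            c = (i - t) % n                     # chunk that node i forwards
            chunks[(i + 1) % n][c] += chunks[i][c]
    for t in range(n - 1):                      # all-gather phase
        for i in range(n):
            c = (i + 1 - t) % n                 # completed chunk node i forwards
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return [np.concatenate(ch) for ch in chunks]

grads = [np.random.randn(10) for _ in range(4)]  # 4 nodes, toy gradients
synced = ring_all_reduce(grads)
assert all(np.allclose(g, np.sum(grads, axis=0)) for g in synced)
```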
Many works have pointed out that bandwidth is a major bottleneck for GPUs [38-40]. These works conduct experiments on GPUs and Wafer-scale Computing systems, scale the die-to-die bandwidth up to , and demonstrate that larger bandwidth can effectively reduce the exposed communication time, thereby reducing the overall execution time. Currently, NVIDIA's NDR InfiniBand supports interconnection between every two cards in different nodes (the receiving and sending bandwidths are added together) [35, 41]. Furthermore, NVIDIA has proposed NVLink [42] to provide bandwidth between any two cards in the same node. This is already a quite high number, but we have found that further increasing the bandwidth can lead to even greater benefits. Based on Megatron-LM [43, 44], we built a performance model for the scenario of training GPT-3 [45] on a GPU cluster with 16,384 H100 GPUs (each node has 256 GPUs). We found that if the NVLink bandwidth is increased from to , the theoretical training time can be reduced by , resulting in a speedup of . Therefore, it is highly valuable to continue increasing the bandwidth between accelerators. A toy version of such a bandwidth model is sketched below.
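The following toy analytical model illustrates the kind of estimate described above. All constants (compute time, communication volume, overlap fraction) are illustrative placeholders, not the parameters of the Megatron-LM-based model in the text:

```python
def step_time(compute_s, comm_bytes, bandwidth_Bps, overlap=0.5):
    """Per-step time when a fraction `overlap` of communication can be hidden
    behind computation; the rest stays exposed on the critical path."""
    comm_s = comm_bytes / bandwidth_Bps
    exposed = comm_s * (1.0 - overlap)
    return compute_s + exposed

# Illustrative numbers only: 1 s of compute, 400 GB communicated per step.
base = step_time(compute_s=1.0, comm_bytes=400e9, bandwidth_Bps=900e9)
fast = step_time(compute_s=1.0, comm_bytes=400e9, bandwidth_Bps=9e12)
print(f"speedup from 10x bandwidth: {base / fast:.2f}x")
```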
C. Advanced Packaging and Chiplet
To increase the bandwidth between accelerators, a fundamental way is to reduce the physical distance between dies. This is also one of the primary goals of the advanced packaging technologies [12, 46, 47] that have garnered significant attention in recent years. Unlike traditional accelerator architectures, which package every single die into an individual device, advanced packaging architectures integrate multiple bare dies together (side by side or stacked vertically) and package them as a whole. As shown in Figure 4, the various advanced packaging technologies can be generally classified into three categories in terms of carrier type: substrate-based, silicon-based, and redistribution layer (RDL)-based packaging technologies. The main advantages and disadvantages of each category are as follows:
Substrate-based packaging technology uses organic substrate materials to complete the wiring connections between dies with an etching process, which does not rely on the chip foundry process, so the related material and production costs are low. However, the density of IO pins is low and the transmission capability per pin is affected by crosstalk, so the bandwidth of die-to-die connections is limited.
Silicon-based packaging technology implements the interconnection between dies by placing an extra silicon layer between the substrate and the dies. The connection between die and substrate is achieved with through-silicon vias (TSVs) and micro-bumps, which have smaller bump pitch and trace distance, so the IO density is improved and transmission delay and power consumption are reduced. However, the silicon interposer relies on the chip foundry process, so the cost is much higher than an organic substrate. To alleviate this, researchers have proposed some variations of the original silicon interposer technology, such as silicon bridge technology [48], which only integrates small thin silicon layers on the substrate for inter-die interconnection, and silicon interconnect fabric (Si-IF) technology [49], which removes the organic substrate and assembles the dies at small spacing using fine-pitch interconnects.
RDL-based packaging technology deposits metal and dielectric layers on the surface of the wafer to form a redistribution layer that carries the metal wiring pattern and rearranges the IO ports. At present, the fan-out style [50] is commonly used, which rearranges the IO ports in the loose area outside the die to shorten the circuit length and enhance signal quality. Compared to silicon interposer-based packaging, RDL-based fan-out packaging has lower cost but fewer wiring resources.
The bare dies integrated by advanced packaging are also called dielets or chiplets [53], which introduces another well-known concept: chiplet technology [51, 54, 55]. Advanced packaging is a packaging technology aimed at integrating multiple functional components, while chiplet technology is a design methodology that divides an integrated circuit into multiple independent chips (chiplets), each containing specific functions or modules. The former serves as the infrastructure for the latter, and the latter is the driving force for the development of the former.

Fig. 4. Advanced packaging categories. (a) Substrate-based packaging. (b) Silicon interposer-based packaging. (c) Redistribution layer (RDL)-based packaging. Blue font indicates the bump type and pitch for die-to-die interconnection. Adapted from [12, 51, 52].
Originally, the principle of chiplet technology is based on the concept of modular design and integration. By dividing a complex integrated circuit into smaller chiplets, each of which can be designed, manufactured, and tested independently, we can achieve easier customization, better reusability and scalability of chip designs, faster time-to-market, and improved yield and cost efficiency. Therefore, chiplet technology does not necessarily produce big chips. However, with the increasing demand for computing power, many works have emerged that integrate large chiplets to break through the upper limits of single-chip computing power while keeping high internal bandwidth [15, 16, 41, 56, 57]. Some of them even increase the total area to wafer level; for example, a Tesla Dojo tile integrates twenty-five D1 dies, each with an area of 645 mm² [15], resulting in a total area exceeding 16,000 mm², which surpasses the area of the square inscribed in a 6-inch wafer. Such works can be referred to as wafer-scale chips.
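As a quick sanity check of this claim (our arithmetic, assuming the conventional 150 mm diameter for a 6-inch wafer):

\[
A_{\text{tile}} = 25 \times 645\,\text{mm}^2 = 16{,}125\,\text{mm}^2
\;>\;
\Bigl(\tfrac{150\,\text{mm}}{\sqrt{2}}\Bigr)^{2} = 11{,}250\,\text{mm}^2 .
\]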
D. Wafer-scale Computing Systems
It should be noted that chiplet technology is just one way to achieve wafer-scale chips. There are other technological approaches, such as the field stitching [13] used by Cerebras [17], that can also accomplish this. This computing paradigm, which scales the size of a single chip to the wafer level to achieve high computing power and large bandwidth benefits, is referred to as Wafer-scale Computing, regardless of the approach adopted.
More is different. To extend the chip to the wafer level, many design issues need to be reconsidered. The major differences in design considerations among various chip scales are shown in Table I. While providing significant advantages in die-to-die bandwidth, integration density and programmability, wafer-scale chips also pose significant challenges in design space exploration, system implementation and tool chain development.
Figure 5 shows an overview of a typical Wafer-scale Computing system, as well as the main contents of this paper. The wafer-scale compute plane is the core component of the whole system. A lot of effort needs to be put into designing its architecture elements, including the hierarchy, microarchitecture, execution framework, NoC and on-chip memory system. The interconnection interface and protocol design (including intra-die, inter-die and off-chip interconnections) is also extremely critical to the performance of wafer-scale chips. To integrate the dies and interconnections together to form the physical wafer-scale chip, advanced packaging [12] or field stitching [13] technologies are required. Then, to drive the wafer-scale chip, specially designed power and clock delivery mechanisms are required, and cooling modules are also necessary because of the high heat density. Finally, to run diverse applications on such a Wafer-scale Computing system, a dedicated compiler tool chain for mapping the workloads onto the chip is indispensable. In addition, fault tolerance designs relate to almost all aspects. In the rest of this paper, we will discuss the key points, challenges and possible future research directions from these aspects.
III. ARCHITECTURE
Like traditional chips, the architecture design of wafer-scale chips aims to improve performance, efficiency and flexibility. The difference is that designers should consider not only the trade-offs within a single, conventionally sized die, but also the co-design of a system tens of times larger. In this section, we introduce the architecture design of existing typical wafer-scale works from the aspects of overall architecture, microarchitecture, dataflow, NoC, and memory system.
A. Hierarchy
First, let's review the overall architectures of three typical wafer-scale systems: UCLA&UIUC's work [16], Tesla Dojo [15] and Cerebras CS-2 [17], as shown in Figure 6, Figure 7 and Figure 8.
UCLA&UIUC's wafer-scale processor is comprised of 32 tiles; each tile heterogeneously integrates a compute chiplet and a memory chiplet. The compute chiplet contains 14 ARM cores. The total number of cores in the wafer-scale processor is 14,336. Any core on any tile can directly access the globally shared memory across the entire wafer-scale system using the wafer-scale interconnect network. Tesla's wafer-scale chip, called the Dojo training tile, is comprised of 25 D1 dies and 40 I/O dies, and each D1 die contains 354 nodes. Dojo nodes are full-fledged computers, and the total number of them in the Dojo training tile is 8,850.
TABLE I
DESIGN CONSIDERATION DIFFERENCES OF DIFFERENT CHIP SCALES

|                                            | Monolithic Conventional Chip        | GPU Cluster (One Node)              | Wafer-scale Computing System (One Chip)                                |
| Computing Power (FP16 Dense)               | POPS                                | POPS                                | POPS                                                                   |
| Monolithic Chip Area                       |                                     |                                     |                                                                        |
| Common-Used Memory System Pattern          | on-Chip Shared SRAM + off-Chip DRAM | on-Chip Shared SRAM + off-Chip DRAM | on-Chip Distributed SRAM                                               |
| Inter-Die Interconnection                  | -                                   | Network Cable or NVLink             | Advanced Packaging or Field Stitching                                  |
| Inter-Die Bandwidth                        | -                                   |                                     |                                                                        |
| Inter-Die Bandwidth Density                | -                                   |                                     |                                                                        |
| Inter-Die Data Transfer Energy             | -                                   |                                     | /bit                                                                   |
| Fault Tolerance Design for Computation     | Not Required                        | Not Required                        | Redundant Process Elements, Bypass Routing, etc.                       |
| Fault Tolerance Design for Data Transfer * | Not Required                        | Ethernet Protocol                   | Redundant Physical Connections, Redundant Routing, NoC Protocol, etc.  |
| Power Delivery                             | Conventional                        | Conventional                        | Edge or Vertical                                                       |
| Clock Distribution                         | Conventional                        | Conventional                        | Edge or Vertical                                                       |
| Cooling Solution                           | Conventional Air/Liquid             | Conventional Air/Liquid             | Conventional Air/Liquid or Microfluidics                               |
| Compilation                                | Single-chip Mapping                 | Multi-chip Mapping                  | Multi-chip Mapping with Finer Grain and 2D-Constraint                  |

* Besides conventional ECC for DRAM.
Fig. 5. Overview of Wafer-scale Computing system and main contents of this paper.
Fig. 6. Overall architecture and microarchitecture of UCLA&UIUC's work. Adapted from [16].
Fig. 7. Overall architecture and microarchitecture of Tesla Dojo. Adapted from [15].
TABLE II
CHARACTERISTICS OF TYPICAL WAFER-SCALE SYSTEMS

|                           | UCLA&UIUC's work [16] | Tesla Dojo [15]   | Cerebras CS-2 [17] |
| Name of Wafer / Base Core | Array / ARM Core      | Dojo Tile / Node  | WSE-2 / Core       |
| Area                      |                       |                   |                    |
| Comp. Power               | 4.4 POPS              | 9.1 POPS          | 7.5 POPS           |
| SRAM                      | 64 KB                 | 11 GB / 1.25 MB   | 40 GB / 48 KB      |
| SRAM BW                   | 6.144 TBps            | -                 | -                  |
| Network BW                | 9.83 TBps             | 9 TBps            | -                  |
| Power                     | -                     | 15 kW             | 30 mW              |

* Estimated through dividing the total amount by the number of base cores.
** Estimated through multiplying the value of base cores by the number of base cores.
Cerebras CS-2 is comprised of dies, and each die is comprised of cores. Cerebras adopts more granular cores than UCLA&UIUC and Tesla; the total number of cores in Cerebras CS-2 is as high as 853,104.
The characteristics of the three wafer-scale works are listed in Table II. In summary, all three wafer-scale works integrate a large number of cores to form an over wafer, aggressively use on-wafer distributed memories, and adopt a mesh/torus topology at the top level. The computing power of all three wafers reaches several POPS.
B. Microarchitecture
Now let's look at the microarchitectures of the aforementioned typical wafer-scale works, as shown on the right side of Figure 6, Figure 7 and Figure 8. The three wafer-scale works employ different strategies in base core design. UCLA&UIUC's work [16] adopts a standard general-purpose processor (ARM Cortex-M3) as the base core, Cerebras CS-2 [17] builds the base core around tensor acceleration, and Tesla Dojo [15] designs a large full-fledged core which combines vector computing units and scalar general-purpose processing modules.
Fig. 8. Overall architecture and microarchitecture of Cerebras CS-2. Adapted from [17].
AI acceleration is one of the primary target applications of Tesla Dojo and Cerebras CS-2, so they both devote many resources to vector units for tensor acceleration. The scalar general-purpose processing modules in Tesla Dojo are aimed at providing flexibility support, such as branch jumping and sparse computing; Cerebras CS-2 proposes a dataflow trigger mechanism to accomplish this.
The base cores of Tesla Dojo and Cerebras CS-2 are both homogeneously integrated to form a 2D mesh/torus grid, while the base cores in UCLA&UIUC's work are heterogeneously integrated with private SRAMs, shared SRAMs, and intra-tile networks to form a tile, and the tiles then form a homogeneous 2D mesh grid.
C. Execution Framework
To execute tasks, a wafer-scale system can utilize multiple granularities of parallelism, similar to GPU clusters [11, 58, 59]. As shown in Figure 9, the wafer-scale chip can be divided into multiple partitions; each partition, containing several cores, works like a device in a GPU cluster. At the macro level, tasks (e.g., training of neural networks) are executed under the data parallel (DP) [28] or model parallel (MP) [29] strategy. At the micro level, each node processes the sub-task assigned to it like traditional accelerators do. The vector units perform parallel computing on the partitioned tensors, and scalar units provide flexibility support such as conditional branch jumping and sparse computing. A minimal sketch of such a partitioning is given below.
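As a concrete illustration of this partition-as-device view, here is a minimal sketch (grid size, stage count and names are illustrative, not Dojo's or Cerebras' actual parameters) that carves a wafer's 2D core grid into strips and assigns consecutive pipeline stages to adjacent strips, so that stage-to-stage traffic only crosses one partition boundary:

```python
import numpy as np

GRID_H, GRID_W = 12, 20          # cores on the wafer (toy numbers)
NUM_STAGES = 4                   # pipeline-parallel model stages

partition = np.zeros((GRID_H, GRID_W), dtype=int)
for stage in range(NUM_STAGES):
    x0 = stage * GRID_W // NUM_STAGES
    x1 = (stage + 1) * GRID_W // NUM_STAGES
    partition[:, x0:x1] = stage  # stage `stage` owns this column strip of cores

# Neighboring pipeline stages occupy physically adjacent strips, so activations
# flow only across one partition boundary between consecutive stages.
assert all(
    abs(int(partition[0, x + 1]) - int(partition[0, x])) <= 1
    for x in range(GRID_W - 1)
)
```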
Compared with traditional accelerators, the differences include:
The much higher bandwidth of the network on chip (NoC) and the smaller gap between intra- and inter-die bandwidth broaden the design space of parallelization.
2D-only interconnection imposes extra constraints on the design space. Specifically, long-distance data transfers on a 2D mesh NoC would seriously waste bandwidth, so it is necessary to maintain data locality, as emphasized by Tesla. Usually, it is better to adopt a dataflow-based execution framework which, where possible, only passes data from one die to its neighbors.
D. Network on Chip (NoC)
Topology is the first major design decision for a network on chip (NoC). Commonly used topologies include mesh, torus, binary tree, and butterfly tree [61], as shown in Figure 10. Topologies differ in the number of required routing units and in the minimum and maximum number of connections or hops between any two compute units, as shown in Table III.
Generally speaking, tree topologies tend to have a smaller number of hops than mesh and torus, especially in large networks, as shown in Figure 11. However, mesh and torus topologies are more popular in existing wafer-scale systems. The networks of UCLA&UIUC's work [16] (inter-tile), Tesla Dojo [15] and Cerebras CS-2 [17] all adopt a mesh or torus topology. The main reasons why they prefer mesh or torus may include:
TABLE III
CHARACTERISTICS OF COMMONLY USED NOC TOPOLOGIES

| Topology       | # Router Ports | # Routers | Min # Hops | Max # Hops          |
| Mesh           | 5              | N         | 3          | Height+Width+2      |
| Torus          | 5              | N         | 3          | Height/2+Width/2+2  |
| Binary Tree    | 3              |           | 2          |                     |
| Butterfly Tree | 6              |           | 2          |                     |

Note: The number of nodes is N. The data come from [61].
Fig. 11. Maximum number of hops vs. number of nodes.
Less physical implementation difficulty: In mesh and torus topologies, the compute and routing unit pairs are arranged in a grid, which makes it easier to integrate them in a wafer-scale chip and to scale out.
Better fault tolerance: When a router breaks down, it is easier to find alternate paths in mesh and torus topologies than in tree topologies.
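To make the trend in Figure 11 concrete, the following sketch evaluates maximum hop counts as the node count N grows. The mesh and torus formulas follow Table III (assuming a square grid); the binary-tree bound of 2·⌈log₂ N⌉ (a round trip through the root) is our assumption:

```python
import math

def max_hops(topology, n):
    """Maximum hop count between any two of n nodes."""
    side = math.isqrt(n)                    # assume a square side x side grid
    if topology == "mesh":
        return side + side + 2              # Height + Width + 2 (Table III)
    if topology == "torus":
        return side // 2 + side // 2 + 2    # Height/2 + Width/2 + 2 (Table III)
    if topology == "binary_tree":
        return 2 * math.ceil(math.log2(n))  # up to the root and back (assumed)
    raise ValueError(f"unknown topology: {topology}")

for n in (64, 1024, 16384):
    print(n, {t: max_hops(t, n) for t in ("mesh", "torus", "binary_tree")})
```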
E. Memory System
In Cerebras CS-2 [17] and UCLA&UIUC's work [16], all memories are SRAMs. Cerebras CS-2 fully distributes the SRAMs with cores. Although UCLA&UIUC's work adopts both private SRAMs for cores and shared SRAMs for global access, the shared SRAMs are also distributed across tiles. Tesla Dojo [15] adopts both on-wafer SRAMs and off-wafer DRAMs (HBMs), and the SRAMs are also distributed with cores (nodes).
Generally speaking, existing wafer-scale systems [15-17] tend to adopt distributed memory structures, for the following main reasons:
Attaching large DRAMs to each node of traditional accelerator clusters is a common practice, but it is difficult to integrate large DRAMs with each die on wafer-scale chips because of the limited area and the process differences between DRAM units and compute logic. As a result, the heavy storage burden for wafer-scale chips mainly falls on the distributed SRAMs. One might notice that Tesla also uses HBMs on the sides of Dojo tiles [15, 62], but the cost of data transfer between the HBMs and the central dies is high, so the locality of data access should be enhanced, as reported by Dojo developers on AI Day [60].
In a wafer-scale chip, centralized memory may cause seriously non-uniform access latency between near and far cores. A distributed memory structure and an elaborate parallel strategy help keep data locality and avoid most long-distance memory accesses, reducing the difficulty of scaling out.
The NoC bandwidth in wafer-scale chips is far more abundant than that in traditional accelerator clusters, which supports the use of distributed memory. As Cerebras points out at Hot Chips 2022 [17], traditional centralized memory with low bandwidth requires high data reuse and caching to be efficient, while distributed memory with high bandwidth can realize full datapath performance without caching.
F. Discussion
1) Design space exploration for wafer-scale chips: It is a formidable task to design a wafer-scale chip and achieve optimization. The design elements are closely related and should be considered jointly. Many of them are not present in traditional chip design, and there is no prior experience to learn from.
The design space of wafer-scale systems comprises three main layers: workload, system and network [38]. In the workload layer, superficial requirements (e.g., the target application tasks and the expected performance) should be transformed into high-level requirements (e.g., operator types, parallelization strategies, data communication and reuse patterns). In the system layer, the microarchitecture and overall architecture (e.g., dataflow design, compute units and memory organization) should be decided to meet the requirements. In the network layer, the network framework and implementation (e.g., hierarchical fabric design and topology, interconnect interfaces and protocols) should be decided to ensure the system runs as expected. In addition, many extra variables are introduced by the wafer-scale setting, e.g., various implementations of die-to-die interconnection and fault tolerance mechanisms.
The design choices in the three levels interact with each other, so deciding the design elements separately, step by step, may cause two problems. 1) Cumulative effect of minor losses [63]: the accumulation of small losses in the parts may cause a large loss in the whole system, resulting in unsatisfactory global solutions. 2) Ignorance of the system bottleneck: the optimizer in one level cannot get exact information from other levels, so it does not know what the real system bottleneck is and may do much useless work on non-critical paths, leading to non-optimal choices. However, exploring the whole design space of wafer-scale chips, including all three levels at once, would suffer from tremendous time complexity.
Although Cerebras and Tesla have disclosed most of the detailed parameters of their wafer-scale chips, they have not disclosed their design space exploration strategies. The group from UCLA&UIUC has presented their design space exploration framework [63]. It is the first optimization framework for solving the multiple chiplet/system selection problem. From the view of wafer-scale chip design, this framework has the advantage of supporting heterogeneous integration, but it does not cover important integration models such as Cerebras CS-2's reticle stitching technology [8, 13]. Moreover, the reported results lack cases of training large neural network models on real wafer-scale chips. How to abstract the requirements of such workloads, how to decompose the tasks and map them to the cores, how to load data from off-wafer storage, and how to design the dataflow and organize data communication across multiple cores: these key points remain unclear. To fully exploit the advantages of wafer-scale integration, where the design choices at different levels are tightly coupled, cross-layer hardware and software co-design is required. Therefore, a more comprehensive and mature methodology for wafer-scale chip design space exploration is yet to be developed.
2) Design considerations related to the yield problem: Compared with traditional chip manufacturing, building wafer-scale systems faces a greater yield challenge, so designers propose fault tolerance mechanisms to address it. Cerebras proposes a fault tolerance mechanism with redundant cores and redundant fabric links [9, 64]. The redundant cores can replace defective cores, and the extra links can reconnect the fabric to restore the logical 2D mesh. Unlike UCLA&UIUC's work and Tesla Dojo, which use individually pre-tested chiplets (i.e., known-good dies, KGDs) for assembly into a wafer-scale chip, Cerebras directly produces an integrated wafer-scale chip, so the yield challenge is more significant. Therefore, Cerebras' cores are designed to be quite small (much smaller than the cores of UCLA&UIUC's work and Tesla Dojo), in order to address yield at low cost. UCLA&UIUC's fault tolerance mechanism [16] mainly focuses on die-to-die interconnection. Two independent networks across the wafer are designed, one with X-Y dimension-ordered routing and the other with Y-X dimension-ordered routing, to ensure access between any two tiles. There is still much room for existing fault tolerance mechanisms to be improved. Faults can be caused by defective links, chiplets, cores, or partial logic within a core. Different problems should be solved in different ways, and their interactions need to be considered. A complete set of fault tolerance technologies for wafer-scale systems has yet to be developed. A sketch of the two-network routing idea follows.
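The following sketch illustrates, in simplified form, the two-network idea described above: route with X-Y dimension order first and fall back to the independent Y-X network when the first path is blocked by a fault. The tile coordinates, tile-granularity fault model and function names are our simplifications; the real mechanism operates at the link and router level:

```python
def xy_route(src, dst):
    """Dimension-ordered (X-then-Y) path between tiles on a 2D mesh."""
    (x, y), path = src, []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """Y-then-X ordered path, used as the independent fallback network."""
    return [(px, py) for (py, px) in xy_route(src[::-1], dst[::-1])]

def route(src, dst, faulty_tiles):
    """Try the X-Y network first; if its path crosses a faulty tile,
    fall back to the Y-X network."""
    for plan in (xy_route, yx_route):
        path = plan(src, dst)
        if not any(t in faulty_tiles for t in path):
            return path
    return None  # both paths blocked

print(route((0, 0), (3, 2), faulty_tiles={(2, 0)}))  # detours via Y-X network
```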
The yield problem is also related to other design points. In existing wafer-scale systems [15-17], the NoC router and compute part of each core are designed to be highly decoupled, which facilitates the implementation of fault tolerance mechanisms (and the utilization of distributed memories). Besides, existing wafer-scale system designs [15-17] only integrate SRAMs on-chip, which are low-density and expensive, because wafer-scale integration technology is not yet mature enough to solve the process differences between DRAMs and compute logic while maintaining yield. This may be a research direction in the future.
IV. INTERCONNECTION INTERFACES AND PROTOCOLS
The interconnection interface and protocol design (including intra-die, inter-die and off-chip interconnections) is extremely critical to the performance of wafer-scale chips. Compared with traditional chips, we should consider not only the unit transmission bandwidth and the power consumption per bit, but also the requirements of the advanced manufacturing process, packaging technology and system integration.
In this section, we first review commonly used interconnect interfaces and protocols which may be used in wafer-scale chips, and then discuss which characteristics of existing interfaces and protocols are especially important for building Wafer-scale Computing systems, or how they can be tailored to fit Wafer-scale Computing systems.
A. Intra-Die Interconnection
The intra-die interconnection is responsible for communication among processing elements and memories within each die. For wafer-scale chips, the optional styles of intra-die interconnection include bus, crossbar, ring, network-on-chip (NoC), and so on, which are similar to those for conventional chips. The bus is the simplest to design but hard to scale with the number of processing elements, while the NoC is the opposite [65]. There are some typical, commonly used bus protocols, such as the Advanced Microcontroller Bus Architecture (AMBA) Advanced High-Performance Bus (AHB) and Advanced eXtensible Interface (AXI), the Wishbone Bus, the Open Core Protocol (OCP) and the CoreConnect Bus [66]. In contrast, the protocol for a NoC is more complex and usually needs to be specifically designed according to the application scenario. Designers should make decisions on many design elements to balance performance, cost and robustness, such as the number of destinations (unicast or multicast), routing decisions (centralized, source, distributed or multiphase), adaptability (deterministic or adaptive), progressiveness (progressive or backtracking), minimality (profitable or misrouting) and number of paths (complete or partial) [67].
B. Inter-Die Interconnection
The inter-die interconnection interfaces and protocols are used for interconnecting multiple dies in the same package. Serial and parallel interfaces are the two options for the die-to-die physical-layer interface used in wafer-scale chips.
1) Inter-Die Serial Interfaces: Serial interfaces can be implemented over longer interconnection distances, so they are mainstream in the field of wafer-scale chips.
USR SerDes: SerDes [51, 68] encompasses a series of serial interfaces ranging from long reach to short reach, including USR (Ultra-Short Reach), which targets high-speed die-to-die connection at ultra-short distances (10 mm level) via 2.5D/3D heterogeneous integration. Because of its short interconnection distance, USR is able to provide low power consumption, nanosecond-level latency, and low error rates via advanced coding, multi-bit transmission and other technologies. However, due to its short transmission distance requirement, implementing USR in large-scale integration such as wafer-scale chips could be challenging.
Apple UltraFusion: Apple's M1 Ultra chip [69] uses TSMC's 5 nm process. It includes 20 CPU cores, 64 GPU cores, a 32-core neural engine (NPU), 128 GB of memory, and 800 GBps of memory bandwidth. It also solves the interconnection of very-large-area chiplets. In particular, the UltraFusion die-to-die interconnection technology is extremely impressive. According to currently published papers and patents, UltraFusion should be an interconnection architecture based on TSMC's CoWoS-S5 interconnection technology, using silicon interposer and micro-bump technology. UltraFusion provides more than 10,000 die-to-die connection signal lines, with an ultra-high interconnection bandwidth of . Compared to other multi-chip module (MCM) packaging technologies, UltraFusion's interposer provides dense and short metal interconnects between logic dies or between logic dies and memory stacks. It has better inter-chip integrity and lower power consumption, and can run at higher clock frequencies. In addition, UltraFusion effectively improves packaging yield and reduces costs through die-stitching technology. In UltraFusion, only the Known Good Die (KGD) is bonded, which avoids the problem in traditional Wafer on Wafer (WoW) or Chip on Wafer processes where failed chips are encapsulated, thereby increasing post-package yield and reducing overall average cost.
AMD Infinity Fabric: Infinity Fabric (IF) [70] is AMD's proprietary system interconnect, consisting of two systems, the Infinity Scalable Data Fabric (SDF) and the Infinity Scalable Control Fabric (SCF), which are responsible for data transmission and control, respectively. The SDF has two distinct variants, on-die and off-die. The on-die SDF connects the CPU cores, GPU cores, IMC and other parts within the chip, while the off-die SDF is responsible for die-to-die interconnection on the package or across multiple sockets.
Inter-Die Parallel Interfaces: Compared to serial interfaces, parallel interfaces can deliver higher bandwidth, lower latency and lower IO power consumption, which is attractive for die-to-die interconnection in wafer-scale chips. However, the requirement for massive numbers of transmission wires and extremely short interconnection distances has made them hard to apply in practice so far. In the future, when the process can achieve relatively high wire density and close inter-die distances, parallel interfaces can be taken into consideration.
Intel AIB/MDIO: Intel's AIB/MDIO [71] provides a parallel interconnection standard for the physical layer. MDIO technology provides higher transmission efficiency, and its response speed and bandwidth density are more than twice those of AIB. AIB and MDIO are applied to advanced 2.5D/3D packaging technologies with short interconnection distances and high interconnection density, such as EMIB [72] and Foveros [73].
TSMC LIPINCON: TSMC's LIPINCON [74] is a high-performance parallel interconnection interface technology developed for the specific case of chiplet memory interfaces. Through advanced packaging technologies such as InFO and CoWoS, it can use a simpler "clock forwarding" circuit, which greatly reduces the area overhead of the I/O circuit and its parasitic effects.
OCP ODSA BoW: The BoW interface proposed by the OCP ODSA group focuses on solving the problem of parallel interconnection based on organic substrates [75]. There are three types of BoW, namely BoW-Base, BoW-Fast and BoW-Turbo. BoW-Base is designed for transmission distances less than 10 mm, using an unterminated unidirectional interface with a data rate of up to 4 Gbps per line. BoW-Fast uses a terminated interface for trace lengths up to 50 mm, with a transmission rate per line of 16 Gbps. Compared with BoW-Fast, BoW-Turbo uses two wires and supports bidirectional 16 Gbps transmission bandwidth.
NVLink-C2C: NVIDIA's NVLink-C2C [76] is a chip interconnection technology extended from NVLink [42] for custom silicon integration. It is extensible from PCB-level integration, multi-chip modules (MCM), and silicon interposer or wafer-level connections, allowing NVIDIA GPUs, DPUs, and CPUs to be coherently interconnected with custom silicon. With advanced packaging, the NVLink-C2C interconnect delivers up to 25x more energy efficiency and 90x more area efficiency than a PCIe Gen 5 PHY on NVIDIA chips.
Inter-Die Interconnection Protocols: A variety of die-to-die interconnection protocols can be adapted to the physical layer interfaces above. When choosing a suitable die-to-die interconnection protocol for wafer-scale chips, programmability, scalability and fault tolerance should be considered.
Intel PCIe Gen 4, 5, and 6: PCI Express (PCI-e) [77] is a general-purpose serial interconnection technology that is suitable not only for interconnecting GPUs, network adapters, and other additional devices, but also for die-to-die interconnection. Since PCI-e Gen 3 was widely adopted, the standard has been developing rapidly. The transfer rate of PCI-e Gen 4 is 16 GT/s, and the theoretical throughput is about 64 GB/s (bidirectional) for 16x ports. PCI-e Gen 5, which was finalized in 2019, has twice the transfer rate of Gen 4 at 32 GT/s, and the 16x port has a bandwidth of about 128 GB/s. Currently, the latest PCI-e Gen 6 standard has been released, which doubles the data transfer rate from 32 GT/s to 64 GT/s, providing approximately 256 GB/s of bandwidth for 16x ports. In addition, PCI-e Gen 6 uses a new PAM4 modulation to replace the NRZ of PCIe 5.0; since PAM4 carries two bits per symbol, more data can be packed into a single channel in the same time. PCIe 6.0 [78] also introduces low-latency forward error correction (FEC) and related mechanisms to improve bandwidth efficiency and reliability.
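The quoted throughputs follow from simple arithmetic over transfer rate, lane count and encoding efficiency. The short sketch below reproduces them; treating the Gen 6 FLIT/FEC overhead as negligible is an approximation, and the commonly quoted 64/128/256 GB/s figures round away the 128b/130b encoding loss.

```python
# Back-of-the-envelope PCIe throughput: bytes/s per direction equals
# transfer_rate (GT/s) * lanes * encoding_efficiency / 8 bits-per-byte.

def pcie_bw(rate_gt_s, lanes=16, efficiency=128 / 130):
    """Approximate per-direction bandwidth in GB/s."""
    return rate_gt_s * lanes * efficiency / 8

# Gen 4/5 use 128b/130b encoding; Gen 6 moves to PAM4 with FLIT-based
# encoding, approximated here as efficiency 1.0 for illustration.
for gen, rate, eff in [(4, 16, 128 / 130), (5, 32, 128 / 130), (6, 64, 1.0)]:
    one_way = pcie_bw(rate, efficiency=eff)
    print(f"PCIe Gen {gen} x16: ~{one_way:.0f} GB/s per direction, "
          f"~{2 * one_way:.0f} GB/s bidirectional")
```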
Intel UCIe 1.0: Intel, AMD, Arm, Qualcomm, TSMC and other industry giants jointly established a chiplet standard alliance and officially launched the universal chiplet high-speed interconnection standard Universal Chiplet Interconnect Express (UCIe), aiming to define an open, interoperable chiplet ecosystem standard [79]. The UCIe 1.0 protocol uniformly adopts Intel's mature PCIe and CXL (Compute Express Link) interconnection bus technologies. PCIe provides broad interoperability and flexibility, while CXL can be used for more advanced low-latency/high-throughput connections. The establishment of the UCIe standard combines the advantages of various fabs and technology companies, which is conducive to the development of integrated-circuit interconnection technology.
C. Off-Chip Interconnection
Due to high-density integration, wafer-scale chips require external IO modules for inter-chip interconnection. PCIe is usually used for the external communication of the IO modules; for example, in Dojo's design a wafer-scale chip is connected to the IO module through a PCIe PHY. Besides PCIe, many other high-speed protocols can also be applied to chip-to-chip interconnection, with different tradeoffs taken into account.
1) Off-Chip Interconnection Interfaces:
LR/MR/VSR SerDes: For chip interconnection based on PCB boards, SerDes can be divided into long-reach/medium-reach/very-short-reach variants according to the transmission distance between chips [51, 68]. Although this interconnection technology has the advantages of high reliability, low cost and easy integration, it is difficult for it to meet the high-performance interconnection requirements between wafer dies in terms of delay, power consumption and density.
2) Off-Chip Interconnection Protocols:
Ethernet: EtherNet/IP [80] is an industrial communication network managed and published under the ODVA specification, realized by combining the CIP protocol, TCP/IP, and Ethernet. Currently, the mainstream Ethernet protocols include 100/25 Gigabit Ethernet, 200 and 400 Gigabit Ethernet, and RoCE. 100/25 Gigabit Ethernet provides the ability to obtain a large number of 25 Gbps ports from a single switch using 100 Gbps switches and fanout cables. 200 Gbps Ethernet uses four 50 Gbps lanes per port, whereas the original 400 Gbps standard uses eight 50 Gbps lanes, simply doubling the number of lanes in a port. RoCE is a protocol for realizing RDMA access/transmission over conventional Ethernet. Its basic idea is to encapsulate an InfiniBand transport packet into a normal Ethernet packet at the link layer. It uses Direct Memory Access (DMA), where the network adapter can read and write host memory directly, bypassing the CPU core and reducing CPU load.
NVIDIA NVLink: NVIDIA's NVLink [42] is the first high-speed inter-GPU interconnection technology designed to improve communication between GPUs in multi-GPU systems. NVLink allows multiple GPUs to form a unified memory space, letting one GPU work on the local memory of another GPU with support for atomic operations. The NVIDIA P100 is the first product equipped with NVLink 1.0; a single GPU has a bandwidth of 160 GB/s, roughly 5 times the bandwidth of a PCIe 3 x16 link. The NVIDIA V100 is equipped with NVLink 2.0, which increases the per-GPU bandwidth to 300 GB/s, almost 10 times that of PCIe 3. Currently, the NVIDIA A100 integrates the latest NVLink 3.0. A single NVIDIA A100 Tensor Core GPU supports up to 12 NVLink 3.0 connections, with a total bandwidth of 600 GB/s, almost 10 times the bandwidth of PCIe 4. In addition, NVIDIA has developed a separate NVLink switch chip, NVSwitch. NVSwitch has multiple ports, can integrate multiple NVLinks, and can be used to interconnect multiple GPUs. NVSwitch 2.0 enables high-speed communication between all GPUs simultaneously at full NVLink bandwidth.
ARM CCIX: Cache Coherent Interconnect for Accelerators (CCIX) [81] is an open-standard interconnect designed to provide a cache-coherent connection between CPUs and accelerators. CCIX mainly includes a protocol specification and a transport specification, and it supports cache coherence by extending the functions of the transaction layer and protocol layer on top of the standard PCIe data link layer. CCIX uses two mechanisms to improve performance and reduce latency. One is cache coherence itself, which automatically keeps the caches of processors and accelerators coherent. The other is increasing the raw bandwidth of the CCIX link, building on the standard transfer rate of PCIe 4 with an extended speed mode of up to 25 GT/s.
OpenCAPI: To improve on PCIe, IBM launched the Coherent Accelerator Processor Interface (CAPI) protocol [82]. CAPI 1.0 multiplexes the physical, link and transaction layers of PCIe, and uses the payload field of PCIe data packets to tunnel cache coherence and CAPI control transactions. In later versions, CAPI gradually evolved into OpenCAPI, which has its own physical, link and transaction layers and independent processing blocks. OpenCAPI is an interface between processors and accelerators aimed at low latency and high bandwidth. It allows an accelerator (which could be an FPGA, an ASIC, ...) to access the host memory coherently, using virtual addresses. An OpenCAPI device can also host its own memory, which can be accessed from the host.
Intel CXL: Compute Express Link (CXL) [83] is an open standard for high-speed interconnects designed to address the interconnection gap between CPUs and devices and between CPUs and memory. Intel pointed out that PCIe is adequate for client machines without too many devices or too much memory, but when dealing with multiple devices that require high bandwidth and huge shared memory pools, PCIe suffers from large delays and inefficient access mechanisms. CXL is designed to overcome these problems without giving up the best parts of PCIe: the simplicity and adaptability of its physical layer. CXL is based on the physical layer of PCI-e Gen 5 and uses the same physical and electrical interface. CXL provides protocols in three areas: CXL.io, CXL.cache, and CXL.memory. CXL.io is the simplest, very similar to PCI-e and supporting the same features; CXL.cache handles device access to local processor memory, for example allowing network devices to access their own caches; CXL.memory provides processor access to attached device memory (non-native memory).
D. Discussion
Uniformity: Traditional HPC systems employ multiple existing protocols that match the bandwidth at different hardware scales to access varying levels of memory, aiming to attain data communication and scalability across hardware dimensions. However, in emerging wafer-scale chips where intra- and inter-die bandwidths converge, the inherent rate mismatch across protocols, and the complex protocol hierarchy mandating replicated data motion across memory and protocol packets, will incur significant communication performance penalties. To avoid these pitfalls and fully harness the bandwidth of wafer-scale chips, adopting a uniform cross-scale communication protocol offers an effective approach. That is what Tesla does: proposing the Tesla Transport Protocol (TTP) to cover the interconnections of intra-D1-die, inter-D1-die, and inter-Dojo-tile [15, 62].
Compensating for 2D NoC: Central congestion is a key challenge when designing the interconnection protocol of wafer-scale chips. Specifically, the interconnection of data centers can be physically implemented in 3D space, which means freer topologies (e.g., fat-tree) can be adopted and the central congestion problem can be naturally avoided. In contrast, the NoCs of wafer-scale chips are usually limited to a 2D topology, so the closer a link is to the center, the larger the traffic it carries; solutions to the central congestion problem should therefore be considered when designing the protocols.
To this end, Tesla proposes the Z-Plane Topology and the corresponding protocol named "Tesla Transport Protocol over Ethernet" (TTPoE) to enable "shortcuts" across large network hops.
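The central congestion effect itself is easy to reproduce. The toy Monte Carlo experiment below, in which the mesh size, traffic pattern and routing policy are all assumptions chosen for illustration, counts how many uniformly random flows traverse a central versus an edge link of a 2D mesh under XY routing; the imbalance grows with the mesh size.

```python
# Illustrative experiment: under uniform random traffic with XY routing,
# links near the center of a 2D mesh carry a multiple of the load carried
# by links near the edge, which is the central-congestion problem above.

import random
from collections import Counter

N = 8                      # N x N mesh (assumed size for illustration)
load = Counter()

def xy_links(src, dst):
    """Yield the directed links of the XY path from src to dst."""
    (x, y), (dx, dy) = src, dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield ((x, y), (nx, y)); x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield ((x, y), (x, ny)); y = ny

random.seed(0)
for _ in range(20000):
    s = (random.randrange(N), random.randrange(N))
    d = (random.randrange(N), random.randrange(N))
    for link in xy_links(s, d):
        load[link] += 1

center = load[((3, 3), (4, 3))]   # a link crossing the middle of a row
edge = load[((0, 0), (1, 0))]     # a link at the mesh corner
print(f"central link flows ~{center}, edge link flows ~{edge}")
# On an 8x8 mesh the central link carries roughly twice the edge load;
# analytically the ratio scales linearly with the mesh dimension N.
```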
Fault Tolerance Support: Fault tolerance is another core issue for wafer-scale chips. Since physical failures of dies and interconnects may appear during production and use, the interfaces and protocols for wafer-scale chips should be specially designed to support fault tolerance functions, e.g., switching to a reserved datapath when the active one is broken, or routing around faulty dies.
V. COMPILER TOOL CHAIN
Like the traditional platforms aimed at large-scale deep learning acceleration, Wafer-scale Computing systems also need a compiler tool chain to meet the user requirements of extreme computing efficiency and ease of use. The former requirement means that application workloads should be perfectly optimized, scheduled and mapped, and the hardware resources fully utilized. The latter means that the compile procedure should be highly automatic, flexible and end-to-end, without too much manual work. These two requirements are fairly contradictory, and researchers have devoted much effort to them for traditional deep learning acceleration platforms [84-86]. Compared with the compiler tool chain of traditional platforms, that of Wafer-scale Computing systems can be generally similar in overall framework, but significantly different in detailed design considerations and strategies.
In this section, we will first give an overview of the common end-to-end compilation flow for driving large-scale neural network acceleration platforms, and then review typical compiler tools for traditional platforms and Wafer-scale Computing platforms. Finally, we will discuss what makes the compilation for Wafer-scale Computing different from the traditional one.
A. Common End-to-End Compilation Flow of Large-Scale Deep Learning Acceleration
The common end-to-end compilation flow of deep learning acceleration is shown in Figure 12. First, the compiler transforms the inputted application specification into a high-level intermediate representation (IR). Since there are various deep learning frameworks (such as TensorFlow [87], PyTorch [88], PaddlePaddle [89], and so on), the compiler usually needs the ability of framework adaptation, i.e., to be flexible enough for different forms of inputs. The high-level IR typically takes the form of a directed acyclic graph that represents the operations as nodes and the data dependencies as edges [85]. Then, the compiler performs graph optimizations on the high-level IR, such as operator fusion, data layout transformation and rematerialization [90-92]. After that, if the target hardware platform is an accelerator cluster with multiple devices, or a single device with multiple partitions, task mapping is performed, i.e., the task is divided into multiple sub-tasks, scheduled and mapped to the devices or partitions. Parallel and pipelining strategies are usually used to increase the hardware utilization and execution performance [28, 29].
Fig. 12. Overview of end-to-end compilation flow.
Usually, a high-level cost model is built to help decide which mapping solution to use. After these stages, the low-level IR is generated, which describes the operations in a more fine-grained representation than the high-level IR, reflects the hardware characteristics and expresses the hardware-specific optimizations. Based on it, operator optimization (e.g., affine transformation and kernel library optimization) is performed to tune the computation and memory access of each operator. After completing the optimizations, the compiler generates machine code from the optimized IR for driving the hardware.
The whole compiler flow can be divided into three parts: the frontend, the optimizer (middle-end) and the backend. We are chiefly concerned with the optimizer in this section. Generally speaking, graph optimization belongs to high-level optimization, which takes multiple operators into consideration at a time, while operator optimization belongs to low-level optimization, which focuses on how to execute a single operator more efficiently. Graph optimization can also be regarded as part of the frontend, and operator optimization as part of the backend. A toy example of a graph-optimization pass is sketched below.
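To make the graph-optimization stage concrete, the following miniature sketch shows a toy high-level IR and a single operator-fusion pass; the node kinds and the fusion rule (folding elementwise operators into their producers) are illustrative assumptions, not any specific compiler's implementation.

```python
# A toy high-level IR: nodes are operations, edges are data dependencies.
# The pass fuses each single-input elementwise node into its producer,
# e.g. turning conv -> relu -> pool into (conv+relu) -> pool.

class Node:
    def __init__(self, name, kind, inputs=()):
        self.name, self.kind, self.inputs = name, kind, list(inputs)

ELEMENTWISE = {"relu", "add_bias"}      # assumed fusible op kinds

def fuse_elementwise(nodes):
    """Fuse elementwise nodes into their producers (topological order in)."""
    fused, out = {}, []
    for n in nodes:
        if n.kind in ELEMENTWISE and len(n.inputs) == 1:
            producer = fused.get(n.inputs[0], n.inputs[0])
            producer.kind += "+" + n.kind   # merge into the producer node
            fused[n] = producer             # redirect n's consumers
        else:
            n.inputs = [fused.get(i, i) for i in n.inputs]
            out.append(n)
    return out

conv = Node("c0", "conv")
relu = Node("r0", "relu", [conv])
pool = Node("p0", "pool", [relu])
graph = fuse_elementwise([conv, relu, pool])
print([(n.name, n.kind) for n in graph])
# [('c0', 'conv+relu'), ('p0', 'pool')]
```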
For Wafer-scale Computing, what are the differences and why? They lie at three levels: (1) the physical level, (2) the architecture level, and (3) the system level, as detailed in Section V-C.
B. Typical Compiler Tools for Large-Scale Deep Learning Acceleration
In 2022, Zheng et al. proposed a compiler named Alpa, which can automatically generate model-mapping plans by hierarchically optimizing intra-operator and inter-operator parallelism [93]. The key insight is based on the observation that different parallelism schemes require different communication bandwidths, while typical computing clusters exhibit a corresponding structure: closely located cores communicate with sufficient bandwidth, while distant cores have limited bandwidth.
Leveraging this asymmetric property of a computing cluster, Alpa maps intra-operator parallelism onto cores with high communication bandwidth, while allocating inter-operator parallelism to those with limited bandwidth. Additionally, Alpa utilizes a hierarchical architecture to express the plan in each parallelization category, as depicted in Fig. 13. Given the computational graph and hardware configuration, the inter-operator compilation process separates the graph into a number of stages, while dividing the cluster into several device meshes. Specifically, the inter-operator process employs a dynamic programming (DP) algorithm to allocate stages to corresponding meshes and activates the intra-operator process on each stage-mesh pair, in order to obtain the execution cost of this assignment. Next, the intra-operator process optimizes the execution efficiency of the stage running on its assigned device mesh by minimizing the corresponding execution cost with an integer linear programming algorithm, and then returns the cost report to the inter-operator process. By repeatedly calling the intra-operator process for each stage-mesh pair, the inter-operator process uses the DP to minimize the execution latency of inter-operator parallelism while acquiring the optimal partition scheme for stages and meshes.
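The following highly simplified sketch captures the shape of this two-level search: an outer dynamic program slices the operator graph into stages and assigns device meshes, querying an inner intra-operator optimizer for each stage-mesh pair. The cost model, the use of bottleneck latency as the pipeline objective, and all numbers are placeholder assumptions; real Alpa uses an ILP for the inner level and a more refined latency objective.

```python
# Outer DP over stage/mesh assignments, inner (stubbed) intra-op optimizer.

import functools

ops = [4, 3, 5, 2]      # toy per-operator workloads
meshes = [2, 2, 4]      # toy device meshes (number of devices in each)

def intra_op_cost(stage_ops, mesh_size):
    # Stand-in for the ILP-based intra-operator pass: assume perfect scaling.
    return sum(stage_ops) / mesh_size

@functools.lru_cache(maxsize=None)
def best(op_idx, mesh_idx):
    """Min bottleneck latency assigning ops[op_idx:] to meshes[mesh_idx:]."""
    if op_idx == len(ops):
        return 0.0 if mesh_idx == len(meshes) else float("inf")
    if mesh_idx == len(meshes):
        return float("inf")
    result = float("inf")
    for end in range(op_idx + 1, len(ops) + 1):   # stage = ops[op_idx:end]
        cost = intra_op_cost(tuple(ops[op_idx:end]), meshes[mesh_idx])
        result = min(result, max(cost, best(end, mesh_idx + 1)))
    return result

print(best(0, 0))   # 2.0: split [4] | [3] | [5, 2] across the three meshes
```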
Fig. 13. The compiler flow and runtime architecture of Alpa. Adapted from [93].
In 2021, Tan et al. presented an automatic tool named NN-Baton, which focuses on chiplet granularity exploration and workload orchestration at different computation levels [94]. The high-level flow of NN-Baton consists of three main components: hardware design space exploration, a mapping analysis engine and a cost evaluation module. The inputs include resource constraints, NN model descriptions and hardware descriptions, where the resource constraints are primarily composed of the number of MAC units and the area budget. The model description is parsed from the PyTorch model using the torch.jit facility.
The ISPD 2020 competition delivered a unique challenge targeting the allocation of neural network workloads to the Cerebras CS-1 WSE [59]. Since a vital feature of the WSE lies in its sufficiently large capacity to run every layer of an NN simultaneously, how to assign the workload to achieve a substantial increase in hardware efficiency is a critical question worth solving.
Fig. 14 illustrates the corresponding compilation stage of the WSE. As can be seen, the input NN models, usually expressed in ML frameworks, are converted into a graph representation using a set of predetermined kernels provided in the given kernel library. Specifically, the layers in the NN are mapped to kernels that each perform an individual computational task. For example, one kernel could compute a convolution, while the next kernel may execute a fully connected layer.
Next, the mapping algorithm is employed to place and route the kernels of an NN on the computation fabric with specific objectives and constraints. Specifically, the solutions should satisfy the following constraints [59]:
All kernels must fit within the fabric's tile array.
Fig. 14. The overview of WSE compilation flow. Adapted from [95].
Kernels may not overlap.
No kernel's memory exceeds the tile's memory limit.
While keeping the runtime of placement within a specific time threshold, the quality of a feasible solution is evaluated by a weighted summation of the following objectives [59]:
The maximum execution time (MET) among all kernels.
The overall L1 distance of all connected kernels.
The total adapter overhead of all connected kernels.
Since the compiled model on the WSE is executed in a pipelined fashion, the kernel with the slowest throughput limits the overall performance of the system. Additionally, the L1 distance provides a simplified evaluation of the routing cost, and the adapter overhead reflects the cost required to unify the I/O protocols among kernels in practical systems.
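A minimal evaluation of such a weighted objective might look like the sketch below; the field names, weights and example values are assumptions for illustration, not the contest's exact scoring script.

```python
# Weighted placement objective in the spirit of the ISPD 2020 metric:
# max execution time + total L1 wirelength + total adapter overhead.

def placement_cost(kernels, edges, w_time=1.0, w_wire=1.0, w_adapter=1.0):
    # kernels: name -> dict(center=(x, y), exec_time=..., adapter=...)
    met = max(k["exec_time"] for k in kernels.values())
    wire = sum(
        abs(kernels[a]["center"][0] - kernels[b]["center"][0])
        + abs(kernels[a]["center"][1] - kernels[b]["center"][1])
        for a, b in edges
    )
    adapter = sum(k["adapter"] for k in kernels.values())
    return w_time * met + w_wire * wire + w_adapter * adapter

ks = {
    "conv1": dict(center=(10, 10), exec_time=120, adapter=2),
    "fc1":   dict(center=(40, 25), exec_time=90,  adapter=1),
}
print(placement_cost(ks, [("conv1", "fc1")]))   # 120 + (30 + 15) + 3 = 168
```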
Formally, a kernel represents a parametric program that performs specific tensor operations. Specifically, it is composed of a set of formal arguments that specify the shapes of the tensor operations to be executed, and a set of execution arguments that define how the operations are parallelized across the tiles [59]. For given kernels, the formal arguments are specified by the input neural network specification and remain unchanged during the compilation procedure, while the execution arguments are configurable, their values being variables to be optimized by the mapping algorithm. Taking the convolution kernel as an example, it consists of eight formal arguments (H, W, R, S, C, K, T, U) and four execution arguments. Specifically, (H, W) specify the two-dimensional size of the input image, (R, S) represents the two-dimensional size of the convolution core, (C, K) denotes the numbers of input and output features, and (T, U) refer to the horizontal and vertical striding of the convolution operation. Moreover, the execution arguments express the unrolling of computations that can be performed in parallel.
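A compact way to express this split between formal and execution arguments is sketched below; the footprint and execution-time formulas are toy stand-ins, not the contest's actual performance model.

```python
# Formal arguments fix the tensor shapes; execution arguments control how
# the work is unrolled in parallel, trading tile area for execution time.

from dataclasses import dataclass

@dataclass
class ConvKernel:
    H: int; W: int; R: int; S: int; C: int; K: int; T: int; U: int  # formal
    h: int = 1; w: int = 1; c: int = 1; k: int = 1                  # execution

    def footprint(self):
        """Toy tile footprint: more unrolling occupies more tiles."""
        return (self.h * self.w, self.c * self.k)

    def exec_time(self):
        """Toy latency: total MACs divided by the unrolled parallelism."""
        macs = (self.H // self.T) * (self.W // self.U) \
               * self.R * self.S * self.C * self.K
        return macs / (self.h * self.w * self.c * self.k)

kern = ConvKernel(H=56, W=56, R=3, S=3, C=64, K=64, T=1, U=1,
                  h=4, w=4, c=2, k=2)
print(kern.footprint(), f"{kern.exec_time():.3e}")
```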
Jiang et al. [95] proposed a high-performance engine named CU.POKer to fulfill neural network workload placement tasks on the WSE. First, the engine adopts a binary search and a neighbor-range search to find a target MET. Then, for each given target, kernel candidates with optimal shapes are generated. Next, a data-path-aware placement algorithm is employed to place the generated kernels, aiming at minimizing the total wirelength while guaranteeing that no kernel's execution time exceeds the target. Considering that the recursion employed in this step may result in computational explosion, some pruning techniques are utilized. For example, a threshold is set to adjust the maximum number of rows, which is initialized to one. If and only if no legal solution can be found under the current threshold, the placement process is invoked again with an increased threshold. Finally, a post-refinement is executed to lower the overall adapter cost of the current solution.
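The outer loop of this strategy reduces to a standard binary search over the MET target, as the condensed sketch below shows; the `try_place` stub and its assumed feasibility threshold stand in for the actual data-path-aware placement algorithm.

```python
# Binary-search the smallest MET target that still admits a legal placement.

def try_place(met_target):
    # Placeholder feasibility check: the real engine generates kernel
    # candidates under met_target and attempts a legal placement.
    FEASIBLE_FROM = 1375          # assumed ground truth for illustration
    return met_target >= FEASIBLE_FROM

def search_met(lo, hi):
    """Smallest MET target in [lo, hi] for which placement succeeds."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if try_place(mid):
            best, hi = mid, mid - 1   # feasible: try a tighter target
        else:
            lo = mid + 1              # infeasible: relax the target
    return best

print(search_met(1, 10000))   # -> 1375
```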
However, the ISPD 2020 contest lacks consideration of internal wiring when evaluating performance. Specifically, the wirelength is the sum of the L1 distances between the centers of all connected kernels. Accordingly, there is no penalty for kernel implementations that span the entire height of the wafer. The drawback of this cost function in the ISPD metric is depicted in Fig. 15. As can be seen, compared to Fig. 15(b), better internal wiring is obtained in Fig. 15(a), but the center-to-center inter-kernel distance metric penalizes Fig. 15(a) severely. Moreover, considering that CU.POKer utilizes a threshold to control the number of rows, more practical modeling would push CU.POKer into computational explosion.
(a) Low wirelengths in each kernel with high inter-kernel wirelengths. (b) High wirelengths in each kernel with low inter-kernel wirelengths.
Fig. 15. Illustration of the shortcoming of the cost function in the ISPD 2020 contest. Adapted from [96].
To this end, an improved work over CU.POKer is presented in [96]. The improved approach utilizes a similar initial topological structure, but employs dynamic programming to execute kernel selection and mapping. The corresponding experiments on the benchmarks of the ISPD 2020 contest suite reveal that, compared to CU.POKer, the improved approach [96] exhibits smaller execution time in all but two cases. Although CU.POKer performs better on wirelength, the authors argue that this is not aligned with realistic system performance considerations.
Additionally, Li et al. present another mapping engine named GigaPlacer [97], which integrates binary search and dynamic programming algorithms to reduce the compilation time. Peng et al. combine graph theory and combinatorial optimization techniques to devise a fast floorplanning approach [98].
Although the ISPD contest successfully provides a usable cost model for the wafer-scale chip, some limitations remain to be further considered. For example, since diverse data access modes lead to different computing behaviors, the relationship among computation latency, data access latency and data transfer latency should be considered more holistically. Further, the existing considerations only cover inter-operator optimization; intra-operator optimization needs to be further explored. Specifically, for a PE array with a given shape, how to orchestrate the dataflow to minimize the cost of data transfer is a huge design space and needs to be considered in more detail.
C. The Key Differences of Wafer-Scale Computing Compilers from Traditional Ones
Wafer-scale Computing relies heavily on software-hardware co-design, so the platform characteristics decide what is different about Wafer-scale Computing compilers.
Physical Level: Being tightly integrated by advanced packaging, wafer-scale chips have much larger die-to-die bandwidth than traditional accelerator clusters (e.g., the die-to-die bandwidth of each edge of a Tesla Dojo D1 die reaches the TB/s level, while the NVLink bandwidth of an NVIDIA H100 is only 900 GB/s), which breaks the previous memory wall and opens up new design spaces for exploring more aggressive application solutions that can fully utilize the available computing power. On the other hand, wafer-scale chips limit the NoC topologies to 2D space (usually a 2D mesh or 2D torus), which is the cost of better die-to-die bandwidth and integration density. As a result, extra constraints should be set in the mapping. Enhancing data locality can help avoid long-distance data transfer, which assists in satisfying these constraints.
Architecture Level: Traditional large-scale deep learning acceleration platforms usually have clear control and communication boundaries, so scaling out on them suffers from frequent host-to-host and host-to-device communication and from the significant gap between intra-device and inter-device bandwidth, leading to overly coarse-grained task mapping and underutilization of hardware resources. On the contrary, wafer-scale chips adopt a seamless architecture and basically eliminate the bandwidth gap between intra-die and inter-die, so they support uniform fine-grained mapping. Owing to this, wafer-scale chips have much more potential in mapping flexibility and scalability to achieve better hardware utilization and overall performance, at the cost, however, of much higher problem complexity of compilation. Take the compilation in Figure 14 as an example: the task can be divided and mapped to kernels with any shapes and executed in any regions of the wafer (as long as the three constraints are satisfied). If the target platform is a traditional accelerator cluster, the task division is far more constrained by device boundaries. To achieve such fine-grained search, traditional deterministic search methods such as dynamic programming [99] may result in excessively long search times. To improve efficiency, heuristic search methods such as simulated annealing [100], ant colony optimization [101] and genetic algorithms [102], or even reinforcement learning [103]-based search, may need to be developed, as the sketch below illustrates.
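As one example of such a heuristic, the bare-bones simulated-annealing mapper below swaps kernels between positions on a 2D grid to reduce total L1 wirelength; the kernel graph, move set, cost model and annealing schedule are all minimal assumptions chosen for illustration.

```python
# Simulated annealing for kernel placement on a 2D tile grid: propose a
# swap, accept improvements always and regressions with a probability
# that shrinks as the temperature cools.

import math
import random

random.seed(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # toy kernel graph
pos = {k: (random.randrange(16), random.randrange(16)) for k in range(4)}

def cost(p):
    """Total L1 wirelength over all connected kernel pairs."""
    return sum(abs(p[a][0] - p[b][0]) + abs(p[a][1] - p[b][1])
               for a, b in edges)

temp, cur = 10.0, cost(pos)
for _ in range(5000):
    a, b = random.sample(list(pos), 2)
    pos[a], pos[b] = pos[b], pos[a]                # propose a swap
    new = cost(pos)
    if new <= cur or random.random() < math.exp((cur - new) / temp):
        cur = new                                  # accept the move
    else:
        pos[a], pos[b] = pos[b], pos[a]            # revert the swap
    temp *= 0.999                                  # cool down

print(cur, pos)
```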
System Level: The disruptive Wafer-scale Computing paradigm brings unprecedented challenges in building a computing system. To address these challenges, researchers have to propose targeted hardware design strategies. The compilers should also be aware of the hardware details and cooperate with them to achieve a satisfactory system-level optimization result. For example, redundant processing elements and data paths are designed to solve the yield problem, so the compilers should know when and where to invoke the redundancies. As another example, wafer-scale chips suffer from more serious thermal problems than traditional accelerators, so the compilers can integrate hotspot-aware strategies into the original load-balancing procedure to complement the hardware cooling design.
VI. INTEGRATION
As mentioned in Section II, chiplet technology integrates multiple small dies into large computing systems through three main advanced packaging types: substrate-based, silicon-based, and redistribution layer (RDL)-based packaging technologies. To achieve the integration of wafer-scale chips, researchers have optimized these original packaging technologies, or even taken a more aggressive approach to directly produce an integrated wafer-scale chip. In the rest of this section, we introduce the different integration types for wafer-scale chips and give some discussion.
A. Silicon-Based Integration
The improvements in integration density and performance of large-scale integration devices are mainly based on increasing the total number of input/output and power/ground terminals, which leads to shrinking design rules for wiring and bump pitch. It is difficult to obtain highly reliable connections between a chip and an organic substrate with smaller bumps, due to the mismatch of the coefficients of thermal expansion [104]. To overcome these problems, silicon interposer-based packaging technology [104] was proposed, where the interconnection between dies is implemented by an extra silicon layer between the substrate and the dies, and the connection between die and substrate is implemented by through-silicon vias (TSVs) and micro-bumps. Since micro-bumps and TSVs have smaller bump pitch and trace distance, the silicon interposer provides higher IO density, lower transmission delay and lower power consumption than organic substrates [51].
However, silicon interposers are ultimately connected to organic substrates, adding an extra level in the packaging hierarchy, which leads to a limited size and high overall packaging cost [105]. To support wafer-scale integration, Bajwa et al. developed a package-less silicon-based integration platform named silicon interconnect fabric (Si-IF) [49, 105, 106]. Si-IF supports heterogeneous, reliable integration of chiplets with high interconnect density, low interconnect resistivity, close inter-chiplet spacing, high adhesion strength (150 MPa), and more uniform heat spreading [105, 106]. Si-IF does not require organic substrates, so the packaging cost is reduced. Based on the Si-IF platform, Pal et al. from UCLA and UIUC propose a wafer-scale processor system architecture which can integrate at most 2048 pre-tested chiplets on a passive silicon-interconnect wafer.
B. Redistribution Layer (RDL)-Based Integration
Besides silicon-based integration, another key technology for the physical implementation of die interconnection in chiplets is redistribution layer (RDL)-based fan-out packaging [107, 108]. This technology eliminates wire bonding or wafer bumping and the leadframe or package substrate. Instead, it requires an RDL to carry the corresponding metal wiring
Fig. 16. Schematic cross section of InFO_SoW structure. Adapted from [109].
pattern. Here the "fan-out" indicates that the IO ports of the chips are rearranged in the loose area outside the die. Since the IO ports are not confined to the area of the die, the average circuit length can be reduced to improve signal quality, and the die area can be reduced to improve integration density without worrying about the space for placing IO ports [51].
TSMC proposes the industry's first wafer-scale system integration package with InFO technology, called InFO_SoW [109]. As shown in Figure 16, an InFO_SoW-based wafer-scale system integrates InFO, power and thermal modules. Connectors and power modules are solder-joined to the InFO wafer, followed by the assembly of the thermal module. InFO_SoW eliminates the use of a PCB substrate by serving as the carrier itself. It has exhibited good process uniformity across the super-large package. The lower surface roughness of the InFO RDL saves interconnect power for interconnects with a length of 30 mm. The 2-by-5 array dummy heater can dissipate 7000 W while keeping the temperature within the design limit. Moreover, InFO_SoW carries relatively low risk in structural robustness when compared to a qualified flip-chip package. Tesla Dojo [15] is the first commercial product using InFO_SoW technology, in which the area of a single chip (training tile) goes far beyond the conventional reticle limit.
C. Field Stitching-Based Integration
Unlike the above types, which utilize advanced packaging technology [12] to integrate regular-sized dies together, Cerebras takes a more aggressive approach. It uses field stitching technology to connect the reticles, so as to directly produce an integrated wafer-scale chip whose area reaches as high as 46,225 mm². This kind of wafer-scale integration provides high-capacity, efficient SRAM distributed across the wafer, and ultra-short inter-die links leading to uniform bandwidth across the entire wafer.
In 1980, Trilogy Systems attempted to produce a single chip 2.5 inches on a side [8]. Since the reticle limit constrained chip size to a maximum of 0.25 inch on a side at that time, Trilogy Systems had to use transistor geometries larger than the minimum available, which sacrificed transistor density and impeded performance improvement [8]. To avoid this problem, Cerebras adopts the standard exposure process and field stitching (also called reticle stitching) technology [13]. First, a standard reticle is repeated to
Fig. 17. Chiplet heterogeneous integration on organic substrate with silicon bridge. Adapted from [12].
traverse and expose the full wafer. Then, an offset reticle containing the wiring between adjacent reticles is used to "stitch" the standard reticles together [8]. The process of field stitching was published decades before [13], but this is the first attempt at applying it to a commercial wafer-scale integration product.
D. Discussion
Since advanced packaging-based integrations allow using individually pre-tested chiplets (i.e., known-good-dies, KGDs) for assembly into a wafer-scale chip, they can mitigate the die yield problem to a large extent. On the contrary, Cerebras directly produces an integrated wafer-scale chip, so the die yield challenge is more significant. To address it, Cerebras implements a homogeneous array of small cores and reserves a proportion of redundant cores in order to "repair" defective ones.
Compared to the organic substrate, the implementation of a silicon interposer brings higher cost in materials and process. The cost of RDL-based fan-out integration falls in between the organic substrate and the silicon interposer, but the wiring resources of fan-out packaging are limited by the RDL wiring level. Another path to reduce the cost of the silicon interposer is silicon bridge technology [48, 72]. As shown in Figure 17, it integrates small thin silicon layers on the substrate (instead of a complete silicon interposer) for inter-die interconnection. Since silicon bridge technology strikes a good balance between performance and cost, several companies have studied it and proposed corresponding chiplet products, such as Intel [72], Apple [69] and TSMC [110]. However, the sizes of these products are far from reaching the scale of a wafer.
Field stitching (or reticle stitching) is a fundamental technology proposed in the last century [13]. One main disadvantage of reticle stitching is that the mask cannot be changed during a complete exposure of the full wafer, otherwise the accuracy would seriously decrease. That poses a great challenge for heterogeneous integration by reticle stitching. Contact printing [111-114] supports freer designs, and may become an inspiring research direction for wafer-scale integration if its defect problem can be solved.
VII. SYSTEM
In this section, we will introduce three key design issues of building a wafer-scale system: power delivery, clock delivery and heat dissipation.
A. Power Delivery
For wafer-scale chips, stable power delivery is a prerequisite for high computing power, and how to stably deliver the supply voltage to each die on the wafer is a question worth considering. Currently, there are two main styles of power delivery for wafer-scale chips: 1) the edge (lateral) style, delivering power through the edge of the wafer, and 2) the vertical style, using the third dimension to power the entire wafer.
Edge style: UCLA&UIUC's work [16] adopts edge power delivery, because the integration of through-wafer-via technology on a Si-IF wafer [115] is still under development and not yet mature. According to different actual needs, two strategies are provided [16]: 1) Delivering high-voltage (say 12 V) power at the edge and using down-conversion near the die [58]. This method causes a huge area overhead due to the need to place large components such as inductors and capacitors on the wafer. In addition, these extra components break the chain of the die-array circuit and increase the distance between dies, making the overall design more complex. 2) Delivering a moderately higher voltage supply (say 2.5 V) at the edge and using low-dropout (LDO) based regulation in the die, which also means some efficiency loss in the power delivery plane due to resistive power loss and poor LDO efficiency. In addition, since the power is delivered from edge to center, one fault in the chain may cause a large-scale failure.
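A back-of-the-envelope calculation shows why the delivered voltage matters so much at the edge: for a fixed per-die power, the resistive loss on the delivery path scales with the square of the current. The numbers below (per-die power, path resistance) are assumptions chosen purely for illustration.

```python
# Resistive loss for edge power delivery: I = P/V, loss = I^2 * R.
# Raising the delivered voltage cuts the current and hence the loss
# quadratically, which motivates 12 V delivery with local down-conversion.

def edge_delivery_loss(p_die_w, v_supply, r_path_ohm):
    i = p_die_w / v_supply                 # current drawn by one die
    return i * i * r_path_ohm              # resistive loss on the path

for v in (0.8, 2.5, 12.0):                 # assumed supply voltages
    loss = edge_delivery_loss(p_die_w=30, v_supply=v, r_path_ohm=0.01)
    print(f"{v:>5} V supply: {loss:7.2f} W lost per 30 W die")
# 0.8 V: ~14 W lost; 2.5 V: ~1.4 W; 12 V: ~0.06 W.
```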
Vertical style: Different from UCLA&UIUC's work, a vertical power-supply scheme is provided by Tesla in [116]. A two-layer voltage regulating module (VRM) is designed: the first layer is configured to output a regulated voltage based on a stepped-down voltage, and the second layer includes a plurality of active components configured to provide the stepped-down voltage to the first layer. The vertical structure brings a better thermal budget. To enhance the stability of the hardware architecture, BGA is used to connect the layers, and its clearance can be designed appropriately depending on the implementation. Delivering about 600 watts of power and a range of output voltages depending on the embodiment, this module offers sufficient power and voltage for high-performance computation.
Cerebras provides another vertical architecture [117]. An integrated circuit has a first set of conductive pads arranged along the second surface of the circuit opposing the first surface, while a power converter is electrically connected to each of the first set of conductive pads, playing the role of adjusting the voltage within a suitable range. Different from BGA, the conductive pads perform better under varying working conditions, since a suitable Young's modulus can be chosen after preload compression. By addressing the problem of hardware mismatch, this flexible property benefits stability under different working situations and different hardware parameters. The power components can accept any suitable AC and/or DC input voltage and can output any suitable voltage, such as 0.9 VDC or 0.8 VDC.
In conclusion, when compared to edge power supply, the vertical architecture showcases several advantages in terms of output power, stability, and flexibility. Nevertheless, it entails higher costs and necessitates a higher level of technical maturity.
B. Clock Delivery
For a wafer-scale chip, how to reliably distribute the clock across such a large area is a challenge. Similar to power delivery, there are also edge and vertical styles of clock distribution for wafer-scale chips.
Edge style: UCLA&UIUC's work [16, 63] presents an edge-style solution, which builds a clock distribution network, generates a fast clock (up to 350 MHz) in one edge die from a slower system clock provided by an off-die crystal oscillator source, and then forwards it throughout the die array via forwarding circuitry built inside every die. To avoid pull-up/pull-down imbalances in the I/O drivers between various components and chips, the design also propagates an inverted version of the clock, ensuring that any distortion alternates every half clock cycle. A faulty die can disrupt the clock propagation mechanism, so both the generation and the propagation of the clock require a fault-tolerant design. In UCLA&UIUC's work, each die receives the rebroadcast clock of the adjacent die, and each non-edge block receives the propagated clock from all four directions, which reduces the impact of individual die failures and improves the resiliency of clock propagation.
Vertical style: Edge-style clock distribution is a reasonable solution for TSV-less structures like UCLA&UIUC's Si-IF-based chip. However, long-distance clock transmission can lead to extra clock jitter, energy consumption, heat generation and parasitic inductance. To avoid these problems, Tesla proposed a vertical clock scheme [118], transmitting the clock and power to the die from the vertical direction so that long-distance transmission is avoided. However, due to the limited PCB area shared by the clock and power, the VRM is close to the clock circuit, and the clock may be contaminated by power noise. To address this issue, Tesla suggests a series of techniques, including differential clock signaling, filter circuits, isolated ground wires, and so on. Overall, the vertical clock scheme presents higher technical hurdles and increased costs; however, it holds significant potential for future development.
C. Heat Dissipation
Heat dissipation is extremely important for wafer-scale chips, as it directly affects the overall performance of the chip. Although the thermal design power (TDP) of a single system-on-wafer (SoW) can reach an astonishing magnitude, the huge area of the SoW keeps its heat power density at a level comparable to the most advanced GPUs and TPUs; according to our investigation and projections, the heat power density of an SoW falls in a range comparable to these devices. However, the large area and the tremendous total power of an SoW make it impossible to directly apply traditional chip cooling methods. Therefore, customized cooling solutions are essential, and emerging cooling technologies such as microfluidics [119] may also become one of the options for future SoW cooling solutions.
Cerebras WSE: the WSE series of Cerebras adopts an air-cooled system with an internal coolant loop [120]. The coolant is a water/propylene glycol mixture, which is transported through the loop at low pressure and high flow rate while remaining single-phase, and carries heat away from the surface of the WSE through connectors. The heat carried by the coolant is then dissipated through a set of fans and a heat exchanger. Although this method successfully cools a single WSE, the huge heat dissipation system occupies a large fraction of the chassis volume, which reduces the possibility of WSE chips forming a high-density computing system in units of wafers.
UCLA&UIUC's work: UCLA&UIUC's wafer-scale system is cooled by forced air convection using one or two square radiators [58]. Two heat sinks cover the wafer: one sits directly on top of the dies and the other on the backside of the wafer. This structure not only provides mechanical support for the wafer but also helps improve heat dissipation efficiency. In the extreme case where factors such as power supply are not considered, the dual-heat-sink arrangement can support a system TDP of 9300 W over a 50,000 mm² area while keeping the junction temperature within limits. However, in the face of higher heat dissipation requirements, this method based on traditional air cooling will encounter a bottleneck.
Tesla Dojo: In [116], the thermal solution of Tesla Dojo consists of a heat-dissipation structure and a cooling system, which are placed on the two sides of the SoW and connected together by conductive frames distributed around the SoW; the frames provide heat dissipation as well as structural support and EMI shielding. The heat-dissipation structure, attached to the SoW either directly or through a thermal interface material with low thermal resistance, dissipates heat from the SoW and may consist of a metal platform and/or a heat sink. The cooling system is installed above the VRMs that supply power to the SoW, providing active cooling for the VRM nodes and the SoW. It can take the form of customized liquid cooling, e.g., metal with internal flow paths, or include brazed fin arrays to increase cooling efficiency. For a concrete implementation of the cooling system, Shu-Rong Chun et al. proposed an expandable liquid-cooling platform in [109], which consists of multiple sub-unit cold plates, each responsible for cooling one row of dies of the SoW. The coolant is deionized water, optimally distributed across each sub-unit cold plate. The liquid-cooling platform keeps a high-heat-density dummy heater array at a stable operating temperature.
Microfluidics: Several current cooling technologies with practical applications, such as heat pipes and heat sinks (the mainstream cooling technologies of server companies), single-phase liquid cooling applied to high-power servers, and two-phase immersion cooling used in supercomputers, can meet the cooling needs of the SoW level. However, the massive volume overhead of these technologies is not what SoW designers expect. In this context, we turn our attention to emerging microfluidics, also called microchannels, which offers small device size, high heat transfer efficiency, low thermal resistance, and fast heat transfer, and is recognized as a promising heat dissipation method well matched to the SoW. More recently, a mainstream microfluidic technique
is realized by etching microchannels or micropillars into a silicon interposer layer, combining different silicon interposer layers through silicon-silicon bonding, silicon-metal bonding, or other processes, and using the forced flow of a liquid or two-phase coolant to carry away the heat of the chip [119, 121, 122]. Owing to its sufficiently small size and the high efficiency of active cooling, microfluidics has been considered promising for 3D ICs with high power density [119]. The experimental results in [121] show that the temperature of a 3D chip stack with a heating density of 97 can be controlled below a target temperature at moderate total flow rates. In addition, microchannels can provide very high heat dissipation at sufficiently high flow rates and pressure drops [123]. It is also important to note that microfluidics is suitable for almost any IC with a silicon interposer structure, including the SoW. Compared with other heat dissipation methods, the sufficiently small geometry, outstanding heat dissipation performance, and wide range of application scenarios of microfluidic technology make it highly likely to become the next generation of disruptive IC cooling technology. However, the complex manufacturing process and high cost have always been the most difficult obstacles on the road to its commercialization.
VIII. FAULT TOLERANCE
In previous sections, we have interspersed fault tolerance designs from different perspectives. As fault tolerance is crucial for Wafer-scale Computing, we now summarize these designs in this section.
A. The Necessity of Fault Tolerance Designs
The yield of chip production refers to the proportion of chips that meet the specification and quality requirements out of the total output of the production process [124-127]. It is primarily influenced by die area, process, defect density, and other practical factors. Yield is an important indicator in chip production, as it directly affects the cost and performance of chips; it has long been a key factor constraining the development of the chip industry.
Wafer-scale chip production typically faces significant yield challenges. Taking Cerebras' WSE-2 [17] as an example, each die is fabricated in TSMC's N7 7 nm process, which has a defect density of 0.09 defects per square centimetre [128]. According to yield models such as Murphy's yield model [129-131] or the Chiplet Actuary model [132], the estimated die yield is far from perfect; a non-negligible fraction of the dies have at least one defect and cannot function properly without specialized optimization.
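To see the scale of the problem, Murphy's yield model [129] can be evaluated directly: Y = ((1 - e^{-A D_0}) / (A D_0))^2 for die area A and defect density D_0. In the sketch below, only the 0.09 defects/cm² figure comes from [128]; the two areas are assumed for illustration:

```python
import math

def murphy_yield(area_cm2, d0):
    """Murphy's yield model [129]: Y = ((1 - exp(-A*D0)) / (A*D0))**2."""
    ad = area_cm2 * d0
    return ((1.0 - math.exp(-ad)) / ad) ** 2

D0 = 0.09                         # defects/cm^2, TSMC N7 figure cited in [128]
print(murphy_yield(5.5, D0))      # ~0.62 for a hypothetical ~550 mm^2 die
print(murphy_yield(462.25, D0))   # ~0.0006 for a monolithic 46,225 mm^2 "die"
```

Under these assumptions a regular-sized die yields roughly 60%, while a hypothetical monolithic die the size of a full wafer would yield essentially zero, which is why redundancy and die-level fault tolerance are indispensable at wafer scale.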
Advanced packaging allows assembling individually pre-tested chiplets (known-good dies, KGDs) into a wafer-scale chip, which mitigates the die yield problem to a large extent. However, defects may also occur in the interconnections, and bonding defects waste KGDs [132]. Research by UCLA&UIUC [16] reveals that although the bonding yield per I/O is very high, the overall bonding yield across the more than 2000 I/Os per chiplet is considerably lower. Moreover, even with an improved per-chiplet bonding yield, the overall system of 2048 chiplets may still contain a few faulty chiplets. Therefore, fault tolerance design is essential for Wafer-scale Computing systems.
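Under the common independent-failure assumption, the per-chiplet bonding yield is simply the per-I/O yield raised to the number of I/Os. The per-I/O yields below are hypothetical placeholders (the exact figures are not reproduced here); only the 2000-I/O and 2048-chiplet scale comes from [16]:

```python
def chiplet_bonding_yield(per_io_yield, n_ios=2000):
    # All I/Os on a chiplet must bond correctly (independence assumed).
    return per_io_yield ** n_ios

for y_io in (0.9999, 0.99999):    # hypothetical per-I/O yields
    y = chiplet_bonding_yield(y_io)
    print(f"per-I/O yield {y_io}: chiplet yield {y:.3f}, "
          f"expected faulty chiplets in 2048: {2048 * (1 - y):.0f}")
```

Even at the higher assumed per-I/O yield, tens of the 2048 chiplets are expected to carry at least one bad bond, so redundant links remain necessary.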
B. Summary of Fault Tolerance Designs
Computation Core: To deal with the yield problem of computation cores, Cerebras adopts a small-core design and reserves redundant cores to replace faulty ones [9, 64]. Tesla, on the other hand, integrates KGDs into a wafer-scale chip [15], so the die yield problem is under control, at the cost of a larger bandwidth gap across die boundaries than Cerebras' stitched wafer (but still much smaller than in traditional accelerator clusters). These designs are discussed in Section III-F.
Interconnection: Besides the computation cores, faults can also occur in the interconnections. To improve the I/O bonding yield, UCLA&UIUC's work lands two copper pillars on each I/O pad [16]. Cerebras reserves redundant fabric links for reconnecting the fabric and restoring the logical 2D mesh [9, 64]; naturally, appropriate protocols must be designed to activate the backup links. To reduce the impact of faulty computation cores on the interconnections, the NoC in a wafer-scale chip is usually designed to be highly decoupled from the computation cores [9, 64]. Moreover, the topology of the NoC also influences fault tolerance: 2D mesh and torus are commonly used in existing wafer-scale chips because they support fault tolerance designs well (e.g., X-Y/Y-X dimension-ordered routing [16]) while being easy to implement physically. These designs are discussed in Sections III-D, III-F, and IV-D.
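The X-Y/Y-X scheme mentioned above is straightforward to express in code. The following is a minimal sketch of dimension-ordered routing on a 2D mesh, falling back to the other dimension order when the primary path crosses a faulty node; the coordinates and fault set are illustrative, not taken from [16]:

```python
def walk(src, dst, x_first=True):
    """Dimension-ordered path on a 2D mesh: traverse X fully, then Y
    (or Y then X when x_first is False)."""
    x, y = src
    path = [src]
    for axis in ("xy" if x_first else "yx"):
        if axis == "x":
            step = 1 if dst[0] >= x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] >= y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path

def route(src, dst, faulty=frozenset()):
    """Try X-Y first; fall back to Y-X if an intermediate hop is faulty."""
    for x_first in (True, False):
        path = walk(src, dst, x_first)
        if not any(hop in faulty for hop in path[1:-1]):
            return path
    return None  # both dimension orders blocked; needs adaptive routing

# (2, 0) is faulty, so the X-Y path is blocked and Y-X is used instead:
print(route((0, 0), (2, 2), faulty={(2, 0)}))
# -> [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
```

The appeal of this scheme is that it is deadlock-free and needs no routing tables, yet still offers a second path around many single-node faults.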
Power and Clock: For power and clock supply, edge-style schemes require a chain or network that delivers the power/clock from the edge to the center, so a single fault can lead to widespread failure, and redundant paths must be designed to limit the affected area. In contrast, vertical-style schemes inherently offer fault tolerance, but they currently come with higher technical hurdles and costs. These designs are discussed in Sections VII-A and VII-B.
Compilation: If the local controllers and protocols are properly designed, local fault tolerance operations (e.g., activating the backup logic/link within a core when the main logic/link is broken) can be performed automatically, without notifying the upper-layer applications. However, to achieve a better fault tolerance effect, the compiler should perform fault tolerance optimizations from a global perspective. One possible solution is to run a test program when the hardware system starts, characterizing the hardware so that the compiler knows which parts are broken and where the backup redundancies can be used. Taking this information into account, the compiler maps the workload accordingly and is prepared to adjust the mapping if new hardware failures appear at runtime. These designs are discussed in Section V-C.
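A minimal sketch of this idea is shown below, with hypothetical kernel names and a toy 4x4 core grid; it does not reflect any vendor's actual compiler interface. The start-up test program yields a fault set, and the placement pass simply skips those cores:

```python
def place_kernels(kernels, cores, faulty):
    """Greedy fault-aware placement: assign each kernel to the next
    core that the start-up test program reported as healthy."""
    healthy = [c for c in cores if c not in faulty]
    if len(kernels) > len(healthy):
        raise RuntimeError("not enough healthy cores for this workload")
    return dict(zip(kernels, healthy))

cores = [(r, c) for r in range(4) for c in range(4)]
mapping = place_kernels(["conv1", "conv2", "fc"], cores, faulty={(0, 1)})
print(mapping)  # {'conv1': (0, 0), 'conv2': (0, 2), 'fc': (0, 3)}

# On a new runtime failure, grow the fault set and re-run placement:
mapping = place_kernels(["conv1", "conv2", "fc"], cores,
                        faulty={(0, 1), (0, 0)})
```

A production compiler would of course also weigh communication distance and load balance, but the essential interface, a fault set feeding the placement pass, is captured by the sketch.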
IX. APPLICATIONS
In the previous sections, we have used AI computation tasks, specifically neural-network-based deep learning acceleration, as the typical application scenario. However, Wafer-scale Computing is not limited to deep learning and can be applied to various other tasks. In this section, we introduce important scientific computing applications [133] of Wafer-scale Computing systems outside the AI computing domain. These applications leverage the architecture's high on-chip bandwidth and fast memory access, which are difficult to achieve with traditional accelerator clusters.
A. Linear Algebra
Linear algebra [134] is a fundamental application in the field of High-Performance Computing (HPC), as many engineering and scientific problems must be transformed into linear algebra problems to be solved; even neural network computation can be seen as basic matrix calculation. Linear algebra applications require large-scale parallel computing, both dense and sparse. Wafer-scale Computing systems boast high computing power density, and the on-chip distributed SRAM can efficiently serve data access requests, resulting in significantly improved parallel performance.
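As a toy illustration of this execution style (a NumPy simulation, not code for any real wafer-scale device; the function and parameter names are ours), a matrix-vector product can be row-partitioned so that each simulated PE works only on the slice held in its local memory:

```python
import numpy as np

def mesh_matvec(A, x, n_pe=4):
    """Simulate row-partitioned y = A @ x across n_pe PEs: each PE keeps
    its block of rows in local 'SRAM' and computes its slice independently;
    only x needs to be broadcast, mirroring distributed on-chip memory."""
    row_blocks = np.array_split(np.arange(A.shape[0]), n_pe)
    return np.concatenate([A[rows] @ x for rows in row_blocks])

A = np.random.rand(8, 8)
x = np.random.rand(8)
assert np.allclose(mesh_matvec(A, x), A @ x)
```

On real hardware the broadcast of x and the concatenation of the result slices are exactly the steps served by the high-bandwidth fabric.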
B. Stencil Computation
Stencil computation, as used in finite element and finite difference methods [135], is widely applied in Partial Differential Equation (PDE) applications. PDEs are mathematical equations describing various phenomena in nature, typically involving derivatives with respect to spatial and temporal variables. Numerically solving PDEs requires discretizing continuous physical domains (e.g., spatial regions) into grids or arrays that approximate the differential operators of the PDE. In these computational applications, the arithmetic intensity per element is low, which makes traditional hierarchical memory systems unsuitable. In a Wafer-scale Computing system, by contrast, each processor equipped with dozens of kB of SRAM can be assigned an appropriately sized stencil tile, and the cores can efficiently exchange data with their neighbors through a high-bandwidth fabric.
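For instance, one Jacobi-style sweep of a 5-point stencil looks as follows (a single-machine NumPy sketch; on a wafer-scale mesh each PE would own one tile of the grid and exchange only its halo rows/columns with its four neighbors):

```python
import numpy as np

def jacobi_step(u):
    """One 5-point stencil sweep: every interior point is replaced by
    the average of its four neighbors."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((64, 64))
u[0, :] = 1.0            # a fixed boundary condition
for _ in range(100):     # relax toward the steady-state solution
    u = jacobi_step(u)
```

Each update touches only nearest neighbors, so per-element arithmetic intensity is tiny and performance is set almost entirely by how fast neighbor data can be exchanged, which is precisely where the wafer-scale fabric helps.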
C. N-body Problems
The N-body problem [136, 137] involves calculating the interactions between every pair of particles, where each particle is influenced by gravity or interaction forces from all other particles. As the number of particles N increases, the amount of computation grows sharply, requiring on the order of N² interaction calculations. N-body computations involve complex communication patterns between particles. Thanks to their flexible data routing and the single-cycle access of on-chip SRAM, Wafer-scale Computing systems perform excellently in these N-body applications.
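The all-pairs pattern is captured by the following direct O(N²) acceleration kernel (a vectorized NumPy sketch with G = 1 and an assumed softening term; on a wafer-scale system the particle set would instead be partitioned across PEs and exchanged over the fabric):

```python
import numpy as np

def pairwise_accel(pos, mass, eps=1e-3):
    """Direct-summation gravitational acceleration: every particle
    accumulates the contribution of every other particle (O(N^2))."""
    d = pos[None, :, :] - pos[:, None, :]        # d[i, j] = r_j - r_i
    r2 = (d ** 2).sum(axis=-1) + eps ** 2        # softened squared distances
    np.fill_diagonal(r2, np.inf)                 # no self-interaction
    return (mass[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)

pos = np.random.rand(256, 3)
mass = np.ones(256)
acc = pairwise_accel(pos, mass)                  # shape (256, 3)
```

The quadratic cost and the need to stream every particle past every other particle explain why flexible routing and fast local memory matter so much here.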
D. Spectral Methods
Spectral methods [138] are numerical computing techniques commonly used for solving PDEs and other mathematical problems. The method expresses the target function as a linear combination of orthogonal basis functions; by computing on these basis functions, the original PDE can be transformed into algebraic equations, and numerical solutions are obtained by solving the resulting algebraic equations. Due to the global nature of this approach, it exhibits the typical all-to-all and reduction communication behaviors that pose challenges for traditional hardware architectures. Wafer-scale Computing systems can handle these communication patterns through their high-bandwidth, low-latency fabric between PEs.
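As a one-dimensional illustration of the method (a NumPy sketch on an assumed periodic domain; the FFTs are exactly where the all-to-all communication arises on distributed hardware), a Poisson problem u'' = f can be solved by dividing by -k² in Fourier space:

```python
import numpy as np

n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
f = np.sin(3.0 * x)                        # right-hand side of u'' = f

k = 2.0 * np.pi * np.fft.fftfreq(n, d=2.0 * np.pi / n)  # integer wavenumbers
fk = np.fft.fft(f)
uk = np.zeros_like(fk)
nonzero = k != 0
uk[nonzero] = -fk[nonzero] / k[nonzero] ** 2  # invert d^2/dx^2 in Fourier space

u = np.fft.ifft(uk).real
assert np.allclose(u, -np.sin(3.0 * x) / 9.0)  # matches the analytic solution
```

The two FFTs touch every grid point when forming every coefficient, which is the all-to-all behavior that a high-bandwidth on-wafer fabric serves far better than an inter-accelerator network.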
X. CONCLUSION
Wafer-scale Computing is an emerging approach to closing the gap between the computing power required by large artificial intelligence models and that provided by chips. Existing publications mainly introduce individual products and techniques, so there is an urgent need for a comprehensive survey of the current state of knowledge of Wafer-scale Computing technologies. In this paper, we compare the different designs, summarize their similarities and differences, extract the essential points of Wafer-scale Computing, discuss the achievements and shortcomings of existing research, and offer advice on possible future research directions.
ACKNOWLEDGEMENT
This work was supported in part by NSFC Grant 62125403; in part by the Science and Technology Innovation 2030 - New Generation of AI Project under Grant 2022ZD0115200; in part by Beijing Municipal Science and Technology Project Grant Z221100007722023; in part by the National Key Research and Development Program under Grant 2021ZD0114400; in part by the Beijing National Research Center for Information Science and Technology; in part by the Beijing Advanced Innovation Center for Integrated Circuits; and in part by the Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute.
REFERENCES
[1] K. Tunyasuvunakool, J. Adler, Z. Wu, T. Green, M. Zielinski, A. Žídek, A. Bridgland, A. Cowie, C. Meyer, A. Laydon et al., "Highly accurate protein structure prediction for the human proteome," Nature, vol. 596, no. 7873, pp. 590-596, 2021.
[2] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz et al., "Discovering faster matrix multiplication algorithms with reinforcement learning," Nature, vol. 610, no. 7930, pp. 47-53, 2022.
[3] S. Ravuri, K. Lenc, M. Willson, D. Kangin, R. Lam, P. Mirowski, M. Fitzsimons, M. Athanassiadou, S. Kashem, S. Madge et al., "Skilful precipitation nowcasting using deep generative models of radar," Nature, vol. 597, no. 7878, pp. 672-677, 2021.
[4] J. Kirkpatrick, B. McMorrow, D. H. Turban, A. L. Gaunt, J. S. Spencer, A. G. Matthews, A. Obika, L. Thiry, M. Fortunato, D. Pfau et al., "Pushing the frontiers of density functionals by solving the fractional electron problem," Science, vol. 374, no. 6573, pp. 1385-1389, 2021.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[7] T. N. Theis and H.-S. P. Wong, "The end of Moore's law: A new beginning for information technology," Computing in Science & Engineering, vol. 19, no. 2, pp. 41-50, 2017.
[8] G. Lauterbach, "The path to successful wafer-scale integration: The Cerebras story," IEEE Micro, vol. 41, no. 6, pp. 52-57, 2021.
[9] Cerebras Systems, "Wafer-scale deep learning," in 2019 IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 2019.
[10] K. Rocki, D. Van Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, "Fast stencil-code computation on a wafer-scale processor," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1-14.
[12] J. H. Lau, Semiconductor Advanced Packaging. Springer Nature, 2021.
[13] W. Flack and G. Flores, "Lithographic manufacturing techniques for wafer scale integration," in [1992] Proceedings International Conference on Wafer Scale Integration. IEEE, 1992, pp. 4-13.
[15] E. Talpes, D. Williams, and D. D. Sarma, "Dojo: The microarchitecture of Tesla's exa-scale computer," in 2022 IEEE Hot Chips 34 Symposium (HCS). IEEE Computer Society, 2022, pp. 1-28.
[16] S. Pal, J. Liu, I. Alam, N. Cebry, H. Suhail, S. Bu, S. S. Iyer, S. Pamarti, R. Kumar, and P. Gupta, "Designing a 2048-chiplet, 14336-core waferscale processor," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 1183-1188.
[17] S. Lie, "Cerebras architecture deep dive: First look inside the HW/SW co-design for deep learning: Cerebras Systems," in 2022 IEEE Hot Chips 34 Symposium (HCS). IEEE Computer Society, 2022, pp. 1-34.
[19] O. Mencer, D. Allison, E. Blatt, M. Cummings, M. J. Flynn, J. Harris, C. Hewitt, Q. Jacobson, M. Lavasani, M. Moazami et al., "The history, status, and future of FPGAs: Hitting a nerve with field-programmable gate arrays," Queue, vol. 18, no. 3, pp. 71-82, 2020.
[20] K. Barr, ASIC Design in the Silicon Sandbox: A Complete Guide to Building Mixed-Signal Integrated Circuits. McGraw-Hill Education, 2007.
[21] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications," ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1-39, 2019.
[23] M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, 2012.
[24] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," ACM SIGPLAN Notices, vol. 53, no. 2, pp. 461-475, 2018.
[25] C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, and H.-J. Yoo, "A 2.1 TFLOPS/W mobile deep RL accelerator with transposable PE array and experience compression," in 2019 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2019, pp. 136-138.
[26] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, 2019.
[27] A. Fuchs and D. Wentzlaff, "The accelerator wall: Limits of chip specialization," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 1-14.
[28] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, no. 2, 2012.
[29] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NeurIPS'12. Red Hook, NY, USA: Curran Associates Inc., 2012, pp. 1223-1231.
[31] P. Patarasuk and X. Yuan, "Bandwidth optimal all-reduce algorithms for clusters of workstations," Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117-124, 2009.
[32] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
[33] M. Cho, U. Finkler, and D. Kung, "BlueConnect: Novel hierarchical all-reduce on multi-tiered network for deep learning," in Proceedings of the 2nd SysML Conference, 2019.
[34] C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing," IEEE Transactions on Computers, vol. 100, no. 10, pp. 892-901, 1985.
[36] P.-J. Lu, M.-C. Lai, and J.-S. Chang, "A survey of high-performance interconnection networks in high-performance computer systems," Electronics, vol. 11, no. 9, p. 1369, 2022.
[37] G. Cisneros-Stoianowski and C. Stunkel, "Nvidia's Quantum InfiniBand network congestion control technology and its impact on application performance," in High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29-June 2, 2022, Proceedings, vol. 13289. Springer Nature, 2022, p. 26.
[38] W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, "ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale," arXiv preprint arXiv:2303.14006, 2023.
[39] Z. Zhang, C. Chang, H. Lin, Y. Wang, R. Arora, and X. Jin, "Is network the bottleneck of distributed training?" in Proceedings of the Workshop on Network Meets AI & ML, 2020, pp. 8-13.
[40] S. Pati, S. Aga, M. Islam, N. Jayasena, and M. D. Sinclair, "Computation vs. communication scaling for future transformers on future hardware," arXiv preprint arXiv:2302.02825, 2023.
[43] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[44] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., "Efficient large-scale language model training on GPU clusters using Megatron-LM," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1-15.
[45] L. Floridi and M. Chiriatti, "GPT-3: Its nature, scope, limits, and consequences," Minds and Machines, vol. 30, pp. 681-694, 2020.
[46] J. H. Lau, "Recent advances and trends in advanced packaging," IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 12, no. 2, pp. 228-252, 2022.
[47] H.-J. Lee, R. Mahajan, F. Sheikh, R. Nagisetty, and M. Deo, "Multi-die integration using advanced packaging technologies," in 2020 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 2020, pp. 1-7.
[48] A. Podpod, J. Slabbekoorn, A. Phommahaxay, F. Duval, A. Salahouelhadj, M. Gonzalez, K. Rebibis, A. Miller, G. Beyer, and E. Beyne, "A novel fan-out concept for ultra-high chip-to-chip interconnect density with pitch," in 2018 IEEE 68th Electronic Components and Technology Conference (ECTC). IEEE, 2018, pp. 370-378.
[49] A. A. Bajwa, S. Jangam, S. Pal, N. Marathe, T. Bai, T. Fukushima, M. Goorsky, and S. S. Iyer, "Heterogeneous integration at fine pitch using thermal compression bonding," in 2017 IEEE 67th Electronic Components and Technology Conference (ECTC). IEEE, 2017, pp. 1276-1284.
[50] J. H. Lau, M. Li, M. L. Qingqian, T. Chen, I. Xu, Q. X. Yong, Z. Cheng, N. Fan, E. Kuah, Z. Li et al., "Fan-out wafer-level packaging for heterogeneous integration," IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 8, no. 9, pp. 1544-1560, 2018.
[51] T. Li, J. Hou, J. Yan, R. Liu, H. Yang, and Z. Sun, "Chiplet heterogeneous integration technology - status and challenges," Electronics, vol. 9, no. 4, p. 670, 2020.
[52] M. Kawano, "Technology trends in packaging and heterogeneous integration," in 2021 5th IEEE Electron Devices Technology & Manufacturing Conference (EDTM). IEEE, 2021, pp. 1-3.
[53] S. S. Iyer, "Chips, dies, chiplets and dielets and heterogeneous integration," in 2022 6th IEEE Electron Devices Technology & Manufacturing Conference (EDTM). IEEE, 2022, pp. 8-11.
[54] X. Ma, Y. Wang, Y. Wang, X. Cai, and Y. Han, "Survey on chiplets: Interface, interconnect and integration methodology," CCF Transactions on High Performance Computing, pp. 1-10, 2022.
[55] G. Shan, Y. Zheng, C. Xing, D. Chen, G. Li, and Y. Yang, "Architecture of computing system based on chiplet," Micromachines, vol. 13, no. 2, p. 205, 2022.
[56] Wccftech. (2022) Apple M1 Ultra chip is nearly 3 times bigger than AMD's Ryzen CPUs, benchmarks show desktop Intel & AMD CPUs still ahead. [Online]. Available: https://wccftech.com/apple-m1-ultra-chip-is-nearly-3-times-bigger-amd-ryzen-cpus-benchmarks-show-intel-amd-cpus-still-ahead/
[58] S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar, "Architecting waferscale processors - a GPU case study," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 250-263.
[59] M. James, M. Tom, P. Groeneveld, and V. Kibardin, "ISPD 2020 physical mapping of neural networks on a wafer-scale deep learning accelerator," in Proceedings of the 2020 International Symposium on Physical Design, 2020, pp. 145-149.
[61] N. Cebry, "Network-on-chip design for a chiplet-based waferscale processor," Ph.D. dissertation, 2020.
[62] B. Chang, R. Kurian, D. Williams, and E. Quinnell, "Dojo: Super-compute system scaling for ML training," in 2022 IEEE Hot Chips 34 Symposium (HCS). IEEE Computer Society, 2022, pp. 1-45.
[63] S. Pal, "Scale-out packageless processing," Ph.D. dissertation, University of California, Los Angeles, 2021.
[64] S. Lie, M. E. James, M. Morrison, S. Arekapudi, and G. R. Lauterbach, "Processor element redundancy for accelerated deep learning," Patent US011328208B2, May 2022.
[65] G. Blake, R. G. Dreslinski, and T. Mudge, "A survey of multicore processors," IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 26-37, 2009.
[66] R. P. Patil and P. V. Sangamkar, "A review of system-on-chip bus protocols," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 4, no. 1, pp. 271-281, 2015.
[67] A. Agarwal, C. Iskander, and R. Shankar, "Survey of network on chip (NoC) architectures & contributions," Journal of Engineering, Computing and Architecture, vol. 3, no. 1, pp. 21-27, 2009.
[68] D. R. Stauffer, J. T. Mechler, M. A. Sorna, K. Dramstad, C. R. Ogilvie, A. Mohammad, and J. D. Rockrohr, High Speed Serdes Devices and Applications. Springer Science & Business Media, 2008.
[69] M. Mattioli, "Meet the fam1ly," IEEE Micro, vol. 42, no. 3, pp. 78-84, 2022.
[70] N. Beck, S. White, M. Paraschou, and S. Naffziger, "'Zeppelin': An SoC for multichip architectures," in 2018 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2018, pp. 40-42.
[71] J. H. Lau, "Overview of heterogeneous integrations," in Heterogeneous Integrations. Springer, 2019, pp. 1-59.
[72] R. Mahajan, R. Sankman, N. Patel, D.-W. Kim, K. Aygun, Z. Qian, Y. Mekonnen, I. Salama, S. Sharan, D. Iyengar et al., "Embedded multi-die interconnect bridge (EMIB) - a high density, high bandwidth packaging interconnect," in 2016 IEEE 66th Electronic Components and Technology Conference (ECTC). IEEE, 2016, pp. 557-565.
[73] D. Ingerly, S. Amin, L. Aryasomayajula, A. Balankutty, D. Borst, A. Chandra, K. Cheemalapati, C. Cook, R. Criss, K. Enamul et al., "Foveros: 3D integration and the use of face-to-face chip stacking for logic devices," in 2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 2019, pp. 19-6.
[74] M.-S. Lin, T.-C. Huang, C.-C. Tsai, K.-H. Tam, K. C.-H. Hsieh, C.-F. Chen, W.-H. Huang, C.-W. Hu, Y.-C. Chen, S. K. Goel et al., "A 7-nm 4-GHz Arm-core-based CoWoS chiplet design for high-performance computing," IEEE Journal of Solid-State Circuits, vol. 55, no. 4, pp. 956-966, 2020.
[75] K. Drucker, D. Jani, I. Agarwal, G. Miller, M. Mittal, R. Wang, and B. Vinnakota, "The open domain-specific architecture," in 2020 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 2020, pp. 25-32.
[77] T. Fountain, A. McCarthy, F. Peng et al., "PCI Express: An overview of PCI Express, cabled PCI Express and PXI Express," in 10th ICALEPCS Int. Conf. on Accelerator & Large Expt. Physics Control Systems, 2005.
[78] D. D. Sharma, "PCI Express 6.0 specification: A low-latency, high-bandwidth, high-reliability, and cost-effective interconnect with PAM-4 signaling," IEEE Micro, vol. 41, no. 1, pp. 23-29, 2020.
[79] D. D. Sharma, G. Pasdast, Z. Qian, and K. Aygun, "Universal chiplet interconnect express (UCIe): An open industry standard for innovations with chiplets at package level," IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 12, no. 9, pp. 1423-1431, 2022.
[80] C. E. Spurgeon, Ethernet: The Definitive Guide. O'Reilly Media, Inc., 2000.
[81] S. Kim, H. Ahn, S. Jun, Y. Park, and W. Han, "Trends of the CCIX interconnect and memory expansion technology," Electronics and Telecommunications Trends, vol. 37, no. 1, pp. 42-52, 2022.
[82] J. Stuecheli, W. J. Starke, J. D. Irish, L. B. Arimilli, D. Dreps, B. Blaner, C. Wollbrink, and B. Allison, "IBM POWER9 opens up a new era of acceleration enablement: OpenCAPI," IBM Journal of Research and Development, vol. 62.
[83] S. Van Doren, "HOTI 2019: Compute Express Link," in 2019 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 2019, pp. 18-18.
[84] Y. Xing, J. Weng, Y. Wang, L. Sui, Y. Shan, and Y. Wang, "An in-depth comparison of compilers for deep neural networks on hardware," in 2019 IEEE International Conference on Embedded Software and Systems (ICESS). IEEE, 2019, pp. 1-8.
[85] M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, "The deep learning compiler: A comprehensive survey," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 3, pp. 708-727, 2020.
[86] P. Liang, Y. Tang, X. Zhang, Y. Bai, T. Su, Z. Lai, L. Qiao, and D. Li, "A survey on auto-parallelism of large-scale deep learning training," IEEE Transactions on Parallel and Distributed Systems, 2023.
[90] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., "TVM: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578-594.
[91] S. Cyphers, A. K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart, A. Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi et al., "Intel nGraph: An intermediate representation, compiler, and executor for deep learning," arXiv preprint arXiv:1801.08058, 2018.
[92] M. Kirisame, S. Lyubomirsky, A. Haan, J. Brennan, M. He, J. Roesch, T. Chen, and Z. Tatlock, "Dynamic tensor rematerialization," arXiv preprint arXiv:2006.09616, 2020.
[93] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, J. E. Gonzalez et al., "Alpa: Automating inter- and intra-operator parallelism for distributed deep learning," arXiv preprint.
[94] Z. Tan, H. Cai, R. Dong, and K. Ma, "NN-Baton: DNN workload orchestration and chiplet granularity exploration for multichip accelerators," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1013-1026.
[95] B. Jiang, J. Chen, J. Liu, L. Liu, F. Wang, X. Zhang, and E. F. Young, "CU.POKer: Placing DNNs on wafer-scale AI accelerator with optimal kernel sizing," in Proceedings of the 39th International Conference on Computer-Aided Design, 2020, pp. 1-9.
[96] S. Özdemir, M. Khasawneh, S. Rao, and P. H. Madden, "Kernel mapping techniques for deep learning neural network accelerators," in Proceedings of the 2022 International Symposium on Physical Design, 2022.
[97] B. Li, Q. Du, D. Liu, J. Zhang, G. Chen, and H. You, "Placement for wafer-scale deep learning accelerator," in 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2021, pp. 665-670.
[98] H. Peng, L. Guo, L. Sun, and X. Zhang, "Resource allocation for wafer-scale deep learning accelerator," in 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE, 2021, pp. 1114-1115.
[99] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731.
[100] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.
[101] M. Dorigo, M. Birattari, and T. Stutzle, "Ant colony optimization," IEEE Computational Intelligence Magazine, vol. 1, no. 4, pp. 28-39, 2006.
[102] S. Forrest, "Genetic algorithms," ACM Computing Surveys (CSUR), vol. 28, no. 1, pp. 77-80, 1996.
[103] S. Thrun and M. L. Littman, "Reinforcement learning: An introduction," AI Magazine, vol. 21, no. 1, pp. 103-103, 2000.
[104] M. Matsuo, N. Hayasaka, K. Okumura, E. Hosomi, and C. Takubo, "Silicon interposer technology for high-density package," in 2000 Proceedings 50th Electronic Components and Technology Conference (Cat. No. 00CH37070). IEEE, 2000, pp. 1455-1459.
[105] S. Jangam, A. A. Bajwa, K. K. Thankkappan, P. Kittur, and S. S. Iyer, "Electrical characterization of high performance fine pitch interconnects in silicon-interconnect fabric," in 2018 IEEE 68th Electronic Components and Technology Conference (ECTC). IEEE, 2018.
[106] A. A. Bajwa, S. Jangam, S. Pal, B. Vaisband, R. Irwin, M. Goorsky, and S. S. Iyer, "Demonstration of a heterogeneously integrated system-on-wafer (SoW) assembly," in 2018 IEEE 68th Electronic Components and Technology Conference (ECTC). IEEE, 2018, pp. 1926-1930.
[107] M. Brunnbauer, E. Furgut, G. Beer, T. Meyer, H. Hedler, J. Belonio, E. Nomura, K. Kiuchi, and K. Kobayashi, "An embedded device technology based on a molded reconfigured wafer," in 56th Electronic Components and Technology Conference 2006. IEEE, 2006, 5 pp.
[108] M. Brunnbauer, T. Meyer, G. Ofner, K. Mueller, and R. Hagen, "Embedded wafer level ball grid array (eWLB)," in 2008 33rd IEEE/CPMT International Electronics Manufacturing Technology Conference (IEMT). IEEE, 2008, pp. 1-6.
[109] S.-R. Chun, T.-H. Kuo, H.-Y. Tsai, C.-S. Liu, C.-T. Wang, J.-S. Hsieh, T.-S. Lin, T. Ku, and D. Yu, "InFO_SoW (system-on-wafer) for high performance computing," in 2020 IEEE 70th Electronic Components and Technology Conference (ECTC). IEEE, 2020.
[110] D. Yu, "TSMC packaging technologies for chiplets and 3D," in 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Computer Society, 2021.
[111] J. H. Bruning, "Optical lithography - thirty years and three orders of magnitude: The evolution of optical lithography tools," in Advances in Resist Technology and Processing XIV, vol. 3049. SPIE, 1997, pp. 14-27.
[112] D. P. Mancini, D. J. Resnick, K. A. Gehoski, L. L. Popovich, and D. Chang, "Low surface energy polymeric release coating for improved contact print lithography," in 21st Annual BACUS Symposium on Photomask Technology, vol. 4562. SPIE, 2001, pp. 593-599.
[113] J. H. Bruning, "Optical lithography: 40 years and holding," in Optical Microlithography XX, vol. 6520. SPIE, 2007, pp. 62-74.
[114] Y. Wu and Z. Xiao, "The recent progress of lithography machine and the state-of-art facilities," Highlights in Science, Engineering and Technology, vol. 5, pp. 155-165, 2022.
[115] M.-H. Liu, B. Vaisband, A. Hanna, Y. Luo, Z. Wan, and S. S. Iyer, "Process development of power delivery through wafer vias for silicon interconnect fabric," in 2019 IEEE 69th Electronic Components and Technology Conference (ECTC), 2019, pp. 579-586.
[117] J.-P. Fricker, "Systems and methods for powering an integrated circuit having multiple interconnected die," Patent US011201137B2, December 2020.
[118] J. Zhao, R. Ramachandran, S. Sun, W. Chang, and T. Fischer, "Computing system with vertical clock delivery architecture," Patent US2022/040333, February 2023.
[119] S. Wang, Y. Yin, C. Hu, and P. Rezai, "3D integrated circuit cooling with microfluidics," Micromachines, vol. 9, no. 6, p. 287, 2018.
[120] STH. (2019) Cerebras cs-1 wafer-scale ai system at sc19. [Online]. Available: https://www.servethehome.c om/cerebras-cs-1-wafer-scale-ai-system-at-sc19/ [120] STH.(2019) Cerebras cs-1 wafer-scale ai system at sc19.[Online].Available:https://www.servethehome.c om/cerebras-cs-1-wafer-scale-ai-system-at-sc19/
[121] L. Zheng, Y. Zhang, X. Zhang, and M. S. Bakir, 'Silicon interposer with embedded microfluidic cooling for high-performance computing systems," in 2015 IEEE 65th Electronic Components and Technology Conference (ECTC). IEEE, 2015, pp. 828-832. [121] L. Zheng、Y. Zhang、X. Zhang 和 M. S. Bakir,"用于高性能计算系统的嵌入式微流体冷却硅插层",2015 年 IEEE 第 65 届电子元件与技术大会(ECTC)。IEEE, 2015, pp.
[122] L. Zheng, Y. Zhang, and M. S. Bakir, "Design, fabrication and assembly of a novel electrical and microfluidic i/os for 3-d chip stack and silicon interposer," in 2013 IEEE 63rd Electronic Components and Technology Conference. IEEE, 2013, pp. 2243-2248. [122] L. Zheng, Y. Zhang, and M. S. Bakir, "Design, fabrication and assembly of a novel electrical and microfluidic i/os for 3-d chip stack and silicon interposer," in 2013 IEEE 63rd Electronic Components and Technology Conference.IEEE, 2013, pp.
[123] D. B. Tuckerman and R. F. W. Pease, "Highperformance heat sinking for vlsi," IEEE Electron device letters, vol. 2, no. 5, pp. 126-129, 1981. [123] D. B. Tuckerman and R. F. W. Pease, "Highperformance heat sinking for vlsi," IEEE Electron device letters, vol. 2, no.5, pp.
[124] W. R. Moore, "A review of fault-tolerant techniques for the enhancement of integrated circuit yield," Proceedings of the IEEE, vol. 74, no. 5, pp. 684-698, 1986. [124] W. R. Moore, "A review of fault-tolerant techniques for the enhancement of integrated circuit yield," Proceedings of the IEEE, vol. 74, no.5, pp.
[125] N. Kumar, K. Kennedy, K. Gildersleeve, R. Abelson, C. Mastrangelo, and D. Montgomery, "A review of yield modelling techniques for semiconductor manufacturing," International Journal of Production Research, vol. 44, no. 23, pp. 5019-5036, 2006.
[126] F.-K. Wang, "Process yield with measurement errors in semiconductor manufacturing," IEEE Transactions on Semiconductor Manufacturing, vol. 21, no. 2, pp. 279-284, 2008.
[127] A. Nutsch, R. Oechsner, U. Schoepka, and L. Pfitzner, "Yield model for estimation of yield impact of semiconductor manufacturing equipment," in 2010 International Symposium on Semiconductor Manufacturing (ISSM), 2010, pp. 1-4.
[128] I. Cutress. (2020) 'Better yield on 5nm than 7nm': TSMC update on defect rates for N5. [Online]. Available: https://www.anandtech.com/show/16028/better-yield-on-5nm-than-7nm-tsmc-update-on-defect-rates-for-n5
[129] C. H. Stapper, "On Murphy's yield integral (IC manufacture)," IEEE Transactions on Semiconductor Manufacturing, vol. 4, no. 4, pp. 294-297, 1991.
[132] Y. Feng and K. Ma, "Chiplet actuary: A quantitative cost model and multi-chiplet architecture exploration," in Proceedings of the 59th ACM/IEEE Design Automation Conference, 2022, pp. 121-126.
[134] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Numerical Linear Algebra for High Performance Computers. USA: Society for Industrial and Applied Mathematics, 1998.
[135] K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. Dietiker, M. Syamlal, and M. James, "Fast stencil-code computation on a wafer-scale processor," CoRR, vol. abs/2010.03660, 2020. [Online]. Available: https://arxiv.org/abs/2010.03660
[136] J. Makino, T. Ito, T. Ebisuzaki, and D. Sugimoto, "Grape: a special-purpose computer for n-body problems," in Proceedings of the International Conference on Application Specific Array Processors, 1990, pp. 180-189.
[138] J. Shen, T. Tang, and L.-L. Wang, Spectral Methods: Algorithms, Analysis and Applications, 1st ed. Springer Publishing Company, Incorporated, 2011.
Dr. Yang Hu is currently an Associate Professor in the School of Integrated Circuits at Tsinghua University. He received his B.S. degree from Tianjin University in 2007, his M.S. degree from Tsinghua University in 2011, and his Ph.D. degree from the University of Florida in 2017. He was a tenure-track assistant professor in the ECE department at the University of Texas at Dallas from 2017 to 2021. Dr. Hu now works on high-performance AI chip architecture and compilation tools.
Dr. Hu is an NSF CAREER Awardee. He has published more than 50 papers in computer architecture conferences and journals such as ISCA, HPCA, ASPLOS, MICRO, SC, DAC, ICS, RTAS, ICPP, ICCD, IEEE TCAD, IEEE TC, and IEEE TPDS. His research won Best Paper Nominations at HPCA in 2017 and 2018, and he received the Best of IEEE Computer Architecture Letters Award in 2015. He is an Associate Editor of the Elsevier Chip journal. He has served as a TPC track chair of DAC and as a TPC member of HPCA, DAC, IWQoS, ISPASS, ICPP, ICDCS, IPDPS, and other conferences. He has also served as an NSF panelist and an external reviewer for the Hong Kong Research Grants Council, as a session chair of HPCA 2022, and as the registration chair of ICS 2018.
Xinhan Lin received the B.S. degree in computer science and technology from Sichuan University, Chengdu, China, in 2010, the M.S. degree in integrated circuits engineering from Tsinghua University, Beijing, China, in 2016, and the Ph.D. degree in electronics science and technology from Tsinghua University, Beijing, China, in 2022.
He is currently with the Shanghai Artificial Intelligence Laboratory, Shanghai, China. His research interests include deep learning, reconfigurable computing, and neural network acceleration.
Huizheng Wang received the B.S. and M.S. degrees in communication engineering from Southeast University (SEU), Nanjing, China, in 2019 and 2022, respectively. He is currently working toward the Ph.D. degree at the School of Integrated Circuits, Tsinghua University, Beijing, China. His research interests include efficient algorithms and VLSI architectures for massive MIMO detection, polar decoders, stochastic computing, deep learning, and neural network acceleration.
Zhen He received the B.S. degree from Chongqing University, Chongqing, China, in 2022. He is currently pursuing the Ph.D. degree with the School of Integrated Circuits, Tsinghua University, Beijing, China. His research interests include deep learning, in-memory computing, and computer architecture.
Jiahao Zhang received the B.S. degree in microelectronics science and engineering from the School of Integrated Circuits, Tsinghua University, Beijing, China, in 2023.
He is currently working with Professor Shouyi Yin in the School of Integrated Circuits, Tsinghua University. His research interests include computer architecture, AI acceleration and processors, and large-scale chip design.
Zheng Xu received the B.S. degree from the School of Information and Electronics, Beijing Institute of Technology, Beijing, China, in 2022.
He is currently a Ph.D. candidate with the School of Integrated Circuits, Tsinghua University, Beijing, China. His research interests include neural network acceleration, AI compilers, and wafer-scale chips.
Sihan Guan received the B.S. degree in Electronics and Information Engineering from Beihang University, Beijing, China, in 2019. She is currently working toward the M.S. degree in Integrated Circuit Engineering with the School of Integrated Circuits, Tsinghua University, Beijing, China. Her research interests include neural network acceleration, hardware simulation, and design space exploration.
Jiahao Fang received the B.S. degree in microelectronics science and engineering from Tsinghua University, Beijing, China, in 2021.
He is currently pursuing the M.S. degree at the School of Integrated Circuits, Tsinghua University, with a research focus on large AI models and hardware architecture.
Haoran Shang received the B.S. degree in communication engineering from Beihang University, Beijing, China, in 2023.
He is currently a graduate student with the School of Integrated Circuits, Tsinghua University. His research interests include networks-on-chip and on-package interconnects.
Xinru Tang received the B.S. degree in integrated chip design and integrated system from Huazhong University of Science and Technology, Hubei, China, in 2023. He is currently working toward the Ph.D. degree in the School of Integrated Circuits, Tsinghua University. His current research interests include the microarchitecture of spatial accelerators, AI processors, and large-scale chip design.
Xu Dai received the B.S. degree in microelectronics from Xidian University, Xi'an, China, in 2013, and the M.S. degree in electronic science and technology from Tsinghua University, Beijing, China, in 2016.
He is currently working with the Shanghai Artificial Intelligence Laboratory, Shanghai, China. His research interests include AI compilers, AI systems, and deep learning.
Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from the Faculté Polytechnique de Mons, Mons, Belgium, in 1991.
He became a Professor at the Institute of Microelectronics, Tsinghua University, Beijing, China, in 1995. His main research interests include VLSI SoC design, electronic design automation (EDA) methodology, and communication application-specific integrated circuit (ASIC) design.
Dr. Wei is a Senior Member of the Chinese
Shouyi Yin (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000, 2002, and 2005, respectively.
He was a Research Associate with Imperial College London, London, U.K. He is currently a Full Professor and the Vice-Director of the School of Integrated Circuits, Tsinghua University. He has published more than 100 journal articles and more than 50 conference papers. His research interests include reconfigurable computing, AI processors, and high-level synthesis.
Dr. Yin has served as a Technical Program Committee Member of the top very-large-scale integration (VLSI) and electronic design automation (EDA) conferences, such as the Asian Solid-State Circuits Conference (A-SSCC), the IEEE/ACM International Symposium on Microarchitecture (MICRO), the Design Automation Conference (DAC), the International Conference on Computer-Aided Design (ICCAD), and the Asia and South Pacific Design Automation Conference (ASP-DAC). He is also an Associate Editor of IEEE Transactions on Circuits and Systems-I: Regular Papers, ACM Transactions on Reconfigurable Technology and Systems (TRETS), and Integration, the VLSI Journal.
Yang Hu, Huizheng Wang, Zhen He, Xingmao Yu, Jiahao Zhang, Qize Yang, Zheng Xu, Sihan Guan, Jiahao Fang, Haoran Shang, Xinru Tang, Shaojun Wei, and Shouyi Yin are with the School of Integrated Circuits, Tsinghua University, Beijing 100084, China.
Xinhan Lin and Xu Dai are with the Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China.