ispc: A SPMD Compiler for High-Performance CPU Programming
Matt Pharr
Intel Corporation
matt.pharr@intel.com
William R. Mark
Intel Corporation
william.r.mark@intel.com
Abstract
SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware's capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs. We have developed a compiler, the Intel SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35x speedups on a 4-core system and up to 240x speedups on a 40-core system for complex workloads (compared to serial code).
1. INTRODUCTION
Recent work has shown that CPUs are capable of delivering high performance on a variety of highly parallel workloads by using both SIMD and multi-core parallelism [22]. Coupled with their ability to also efficiently execute code with moderate to small amounts of parallelism, this makes CPUs an attractive target for a range of computationally-intensive applications, particularly those that exhibit varying amounts of parallelism over the course of their execution.
However, achieving this performance is difficult in practice; although techniques for parallelizing across CPU cores are well-known and reasonably easily adopted, parallelizing across SIMD vector lanes remains difficult, often requiring laboriously writing intrinsics code to generate desired instruction sequences by hand. The most common parallel programming languages and libraries designed for CPUs (including OpenMP, MPI, Thread Building Blocks, UPC, and Cilk) focus on multi-core parallelism and do not provide any assistance for targeting SIMD parallelism within a core. There has been some hope that CPU implementations of GPU-oriented languages that support SIMD hardware (such as OpenCL) might address this gap [10], but OpenCL lacks capabilities needed to achieve maximum efficiency on CPUs and imposes productivity penalties caused by needing to accommodate GPU limitations such as a separate memory system. This situation led us to ask what would be possible if one were to design a language specifically for achieving high performance and productivity for using SIMD vector units on modern CPUs.
We have implemented a language and compiler, the Intel SPMD Program Compiler (ispc), that extends a C-based language with "single program, multiple data" (SPMD) constructs for high-performance SIMD programming. ispc's "SPMD-on-SIMD" execution model provides the key feature of being able to execute programs that have divergent control flow across the SIMD lanes. ispc's main focus is effective use of CPU SIMD units, though it supports multi-core parallelism as well.
The language and underlying programming model are designed to fully expose the capabilities of modern CPU hardware, while providing ease of use and high programmer productivity. Programs written in ispc generally see their performance scale with the product of both the number of processing cores and their SIMD width; this is a standard characteristic of GPU programming models but one that is much less common on the CPU.
The most important features of ispc for performance are:
Explicit language support for both scalar and SIMD operations.
Support for structure-of-arrays data structures, including for converting previously-declared data types into structure of arrays layout.
Access to the full flexibility of the underlying CPU hardware, including the ability to launch asynchronous tasks and to perform fast cross-lane SIMD operations.
The most important features of ispc for usability are:
Support for tight coupling between C/C++ and ispc, including the ability to directly call ispc routines from C/C++ and to also call C/C++ routines from ispc.
Coherent shared memory between C++ and ispc.
Familiar syntax and language features due to its basis in C.
Most ispc language features are inspired by earlier SIMD languages such as C* [33], Pixar's FLAP C [24], the RenderMan Shading Language [11] and the MasPar programming language [29], and in some cases by more modern GPU-oriented languages such as CUDA [31] and OpenCL [19]. The primary contribution of this paper is to design and implement a language and compiler targeted at modern CPU architectures, and to evaluate the performance impact of key language features on these architectures.
2. DESIGN GOALS
In order to motivate some of the design differences between ispc and other parallel languages, we will discuss the specific goals of the system and the key characteristics of the hardware that it targets.
2.1 Goals
Performance on today's CPU hardware: The target users for ispc are performance-focused programmers. Therefore, a key design goal is that the system should provide performance transparency: just as with C, it should be straightforward for the user to understand how code written in the language will be compiled to the hardware and roughly how the code will perform. The target hardware is modern CPU hardware, with SIMD units from four to sixteen elements wide, and in particular x86 CPUs with SSE or AVX instructions. A skilled user should be able to achieve performance similar to (or better than) that achievable by programming in C with SSE and AVX intrinsic functions. Modern x86 workstations have up to forty cores and the Intel MIC architecture will have over fifty cores on a single chip, so ispc programs should be able to scale to these core counts and beyond.
Programmer productivity: Programmer productivity should be substantially higher than that achievable by programming with intrinsic functions, and should be comparable to that of writing high-performance serial C code or OpenCL kernels. Productivity is measured not just in terms of writing code, but also by the ease of reading and modifying code. Of course, it is expected that it will take longer and require more skill to write highly-tuned code than it does to write less efficient code, just as when programming in other languages. It should be possible (though not mandatory) to write code that is portable across architectures with differing SIMD widths.
Ease of adoption and interoperability: It should be easy for programmers to adopt the language, both for new code and for incremental enhancements to existing systems. The language should be as familiar as possible so that it is easy to learn. Ideally the language should be so similar to C that porting code or sharing data structure definitions between it and C/C++ is easy. To support incremental use of the system, it should be easy to call back and forth between C/C++ and ispc code, and it should be easy to share complex pointer-based data structures. Code generated by the compiler should interoperate with existing memory allocators and task schedulers, rather than imposing its own. Finally, the system should easily work with existing build systems, debuggers and tools like memory verifiers (e.g. valgrind). (Lua's "embeddability" goals are similar [12].)
2.2 Non-goals
It is useful to specifically list several non-goals of ispc.
No support for GPUs: CPU and GPU architectures are sufficiently different that a single performance-focused programming model for both is unlikely to be an ideal fit for either. Thus, we focus exclusively on CPU architectures. For example, we assume a single cache-coherent address space, which would not be possible if the language had to support today's discrete GPUs.
Don't try to provide "safe" parallelism: we do not attempt to protect the programmer by making races or deadlock difficult or impossible. Doing so would place too many levels of abstraction between the programmer and the underlying hardware, and we choose to focus on programmers who are willing to give up some safety in return for achieving peak machine performance, just as they might do when they choose C or C++ over Java.
2.3 Target Hardware
Since one of the primary goals of the ispc language is to provide high efficiency on modern CPU hardware, it is helpful to review some of the characteristics of this hardware that impact the language and compiler design.
Multi-core and SIMD parallelism: A modern CPU consists of several cores, each of which has a scalar unit and a SIMD unit. The instructions for accessing the SIMD unit have different names on different architectures: SSE for 128-bit wide SIMD on x86 processors, AVX for 256-bit wide SIMD on Intel processors, AltiVec on PowerPC processors, and Neon on ARM processors. The ispc compiler currently supports SSE and AVX.
Simultaneous execution of scalar and SIMD instructions: Modern CPU architectures can issue multiple instructions per cycle when appropriate execution units are available for those instructions. There is often a performance advantage from replacing a SIMD instruction with a scalar instruction due to better occupancy of execution units. The architects of the Pixar FLAP observed long ago that even SIMD-heavy code has a large number of addressing and control computations that can be executed on a scalar unit [24].
One program counter per core: The scalar unit and all lanes of the associated SIMD unit share a single hardware program counter.
Single coherent memory: All cores share a single cache-coherent address space and memory system for both scalar and SIMD operations. This capability greatly simplifies the sharing of data structures between serial and parallel code. This capability is lacking on today's GPUs.
Cross-lane SIMD operations: SSE and AVX efficiently support various cross-lane SIMD operations such as swizzles via a single instruction. GPUs generally provide weaker support for these operations, although they can be mimicked at lower performance via memory.
Tightly defined execution order and memory model: Modern CPUs have relatively strict rules on the order with which instructions are completed and the rules for when memory stores become visible to memory loads. GPUs have more relaxed rules, which provides greater freedom for hardware scheduling but makes it more difficult to provide ordering guarantees at the language level.
3. PARALLELISM MODEL: SPMD ON SIMD
Any language for parallel programming requires a conceptual model for expressing parallelism in the language and for mapping this language-level parallelism to the underlying hardware. For the following discussion of ispc's approach, we rely on Flynn's taxonomy of programming models into SIMD, MIMD, etc. [8], with Darema's enhancement to include SPMD (Single Program Multiple Data) [7].
3.1 Why SPMD?
Recall that our goal is to design a language and compiler for today's SIMD CPU hardware. One option would be to use a purely sequential language, such as unmodified C, and rely on the compiler to find parallelism and map it to the SIMD hardware. This approach is commonly referred to as auto-vectorization [37]. Although auto-vectorization can work well for regular code that lacks conditional operations, a number of issues limit the applicability of the technique in practice. All optimizations performed by an auto-vectorizer must honor the original sequential semantics of the program; the auto-vectorizer thus must have visibility into the entire loop body, which precludes vectorizing loops that call out to externally-defined functions, for example. Complex control flow and deeply nested function calls also often inhibit auto-vectorization in practice, in part due to heuristics that auto-vectorizers must apply to decide when to try to vectorize. As a result, auto-vectorization fails to provide good performance transparency: it is difficult to know whether a particular fragment of code will be successfully vectorized by a given compiler and how it will perform.
To achieve ispc's goals of efficiency and performance transparency it is clear that the language must have parallel semantics. This leads to the question: how should parallelism be expressed? The most obvious option is to explicitly express SIMD operations as explicit vector computations. This approach works acceptably in many cases when the SIMD width is four or less, since explicit operations on 3-vectors and 4-vectors are common in many algorithms. For SIMD widths greater than four, this option is still effective for algorithms without data-dependent control flow, and can be implemented in C++ using operator overloading layered over intrinsics. However, this option becomes less viable once complex control flow is required.
Given complex control flow, what the programmer ideally wants is a programming model that is as close as possible to MIMD, but that can be efficiently compiled to the available SIMD hardware. SPMD provides just such a model: with SPMD, there are multiple instances of a single program executing concurrently and operating on different data. SPMD programs largely look like scalar programs (unlike explicit SIMD), which leads to a productivity advantage for programmers working with SPMD programs. Furthermore, the SPMD approach aids with performance transparency: vectorization of a SPMD program is guaranteed by the underlying model, so a programmer can write SPMD code with a clear mental model of how it will be compiled. Over the past ten years the SPMD model has become widely used on GPUs, first for programmable shading [28] and then for more general-purpose computation via CUDA and OpenCL.
ispc implements SPMD execution on the SIMD vector units of CPUs; we refer to this model as "SPMD-on-SIMD". Each instance of the program corresponds to a different SIMD lane; conditionals and control flow that are different between the program instances are allowed. As long as each program instance operates only on its own data, it produces the same results that would be obtained if it was running on a dedicated MIMD processor. Figure 1 illustrates how SPMD execution is implemented on CPU SIMD hardware.
3.2 Basic Execution Model
Upon entry to an ispc function called from C/C++ code, the execution model switches from the application's serial model to ispc's SPMD model. Conceptually, a number of program instances start running concurrently. The group of running program instances is called a gang (harkening to "gang scheduling", since ispc provides certain guarantees about when program instances running in a gang run concurrently with other program instances in the gang, detailed below). The gang of program instances starts executing in the same hardware thread and context as the application code that called the ispc function; no thread creation or implicit context switching is done by ispc.
The number of program instances in a gang is relatively small; in practice, it's no more than twice the SIMD width of the hardware that it is executing on. Thus, there are four or eight program instances in a gang on a CPU using the 4-wide SSE instruction set, and eight or sixteen on a CPU using 8-wide AVX. The gang size is set at compile time.
SPMD parallelization across the SIMD lanes of a single core is complementary to multi-core parallelism. For example, if an application has already been parallelized across cores, then threads in the application can independently call functions written in ispc to use the SIMD unit on the core where they are running. Alternatively, ispc has capabilities for launching asynchronous tasks for multi-core parallelism; they will be introduced in Section 5.4.
3.3 Mapping SPMD To Hardware: Control
One of the challenges in SPMD execution is handling divergent control flow. Consider a while loop with a termination test n > 0; when different program instances have different values for n, they will need to execute the loop body different numbers of times.
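As a minimal sketch (assuming n is a varying, per-instance value), the loop below runs a different number of iterations in different lanes:

int n = ...;     // varying: each program instance has its own count
float x = ...;
while (n > 0) {  // lanes whose n has reached zero are masked off
    x *= x;
    n--;
}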
ispc's SPMD-on-SIMD model provides the illusion of separate control flow for each SIMD lane, but the burden of supporting this illusion falls on the compiler. Control flow constructs are compiled following the approach described by Allen et al. [2] and recently generalized by Karrenberg et al. [17], where control flow is transformed to data flow.
Figure 1: Execution of a 4-wide SPMD program on 4-wide SIMD vector hardware. On the left we have a short program with simple control flow; the right illustrates how this program is compiled to run on SIMD vector hardware. Here, the if statement has been converted into partially predicated instructions, so the instructions for both the "true" and "false" cases are always executed. A mask is used to prevent side effects for program instances that should not themselves be executing instructions in a particular control flow path.
A simple example of this transformation is shown in Figure 1, where assignments to a variable are controlled by an if statement. The SIMD code generated by this example maintains a mask that indicates which program instances are currently active during program execution. Operations with side-effects are masked so that they don't have any effect for program instances with an "off" mask value. This approach is also applied to loops (including break and continue statements) and to multiple return statements within one function.
Implementation of this transformation is complex on SSE hardware due to limited support for SIMD write-masks, in contrast to AVX, MIC and most GPUs. Instead, the compiler must use separate blend/select instructions designed for this purpose. Fortunately, masking isn't required for all operations; it is unnecessary for most temporaries computed when evaluating an expression, for example. Because the SPMD programming model is used pervasively on GPUs, most GPUs have some hardware/ISA support for SPMD control flow, thereby reducing the burden on the compiler.
3.4 SPMD and Synchronization
ispc provides stricter guarantees of execution convergence than GPUs running SPMD programs do; these guarantees in turn provide ease-of-use benefits to the programmer. ispc specifically provides an important guarantee about the behavior of the program counter and execution mask: the execution of program instances within a gang is maximally converged. Maximal convergence means that if two program instances follow the same control path, they are guaranteed to execute each program statement concurrently. If two program instances follow diverging control paths, it is guaranteed that they will re-converge at the earliest point in the program where they could re-converge.
In contrast, CUDA and OpenCL have much looser guarantees on execution order, requiring explicit barrier synchronization among program instances with __syncthreads() or barrier(), respectively, when there is communication between program instances via memory. Implementing these barriers efficiently for OpenCL on CPUs is challenging [10].
Maximally converged execution provides several advantages compared to the looser model on GPUs; it is particularly helpful for efficient communication of values between program instances without needing to explicitly synchronize among them. However, this property also can introduce a dependency on SIMD width; by definition, ordering changes if the gang size changes. The programmer generally only needs to consider this issue when doing cross-program-instance communication.
The concept of lockstep execution must be precisely defined at the language level in order to write well-formed programs where program instances depend on values that are written to memory by other program instances within their gang. With ispc, any side effect from one program instance is visible to other program instances in the gang after the next sequence point in the program, where sequence points are defined as in C. Generally, sequence points include the end of a full expression, before a function is entered in a function call, at function return, and at the end of initializer expressions. The fact that there is no sequence point between the increment of i and the assignment to i in i = i++ is why that expression yields undefined behavior in C, for example. Similarly, if multiple program instances write to the same location without an intervening sequence point, undefined behavior results. (The ispc User's Guide has further details about these convergence guarantees and resulting implications for language semantics [14].)
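A hedged illustration of code that depends on these guarantees (the array and variable names are ours); the statement boundary supplies the sequence point that makes the neighbor read well-defined:

uniform float slots[programCount];
float x = ...;
slots[programIndex] = x;  // each instance writes its own slot
// the ";" above is a sequence point, so reading a neighbor's
// slot here is well-defined:
float next = slots[(programIndex + 1) % programCount];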
3.5 Mapping SPMD To Hardware: Memory
The loads and stores generated by SPMD execution can present a performance challenge. Consider a simple array indexing operation like a[index]: when executed in SPMD, each of the program instances will in general have a different value for index and thus access a different memory location. Loads and stores of this type, corresponding to loads and stores with SIMD vectors of pointers, are typically called "gathers" and "scatters" respectively. It is frequently the case at runtime that these accesses are to the same location or to sequential locations in memory; we refer to this as a coherent gather or scatter. For coherent gather/scatter, modern DRAM typically delivers better performance from a single memory transaction than a series of discrete memory transactions.
Modern GPUs have memory controllers that coalesce coherent gathers and scatters into more efficient vector loads and stores [25]. The range of cases that this hardware handles has generally expanded over successive hardware generations. Current CPU hardware lacks "gather" and "scatter" instructions; SSE and AVX only provide vector load and store instructions for contiguous data. Therefore, when gathers and scatters are required, they must be implemented via a less-efficient series of scalar instructions. ispc's techniques for minimizing unnecessary gathers and scatters are described in Section 6.4.
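To make the distinction concrete, a hedged sketch (array and index names are illustrative):

uniform float a[1024];
int index = ...;                   // varying, arbitrary per instance
float g = a[index];                // general case: compiles to a gather
uniform int base = ...;
float v = a[base + programIndex];  // contiguous: can become one vector load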
4. LANGUAGE OVERVIEW
To give a flavor of the syntax and how the language is used, here is a simple example of using ispc. For more extensive examples and language documentation, see the ispc online documentation.
First, we have some setup code in C++ that dynamically allocates and initializes two arrays. It then calls an update() function.
float *values = new float[1024];
int *iterations = new int[1024];
// ... initialize values[], iterations[] ...
update(values, iterations, 1024);
The call to update() is a regular function call; in this case it happens that update() is implemented in ispc. The function squares each element in the values array the number of times indicated by the corresponding entry in the iterations array.
export void update(uniform float values[],
uniform int iterations[],
uniform int num_values) {
for (int i = programIndex; i < num_values;
i += programCount) {
int iters = iterations[i];
while (iters-- > 0)
values[i] *= values[i];
}
}
The syntax and basic capabilities of ispc are based on C (C89 [3], specifically), though it adopts a number of constructs from C99 and C++. (Examples include the ability to declare variables anywhere in a function, a built-in bool type, references, and function overloading.) Matching C's syntax as closely as possible is an important aid to the adoptability of the language.
The update() function has an export qualifier, which indicates that it should be made callable from C/C++; the uniform variable qualifier specifies scalar storage and computation and will be described in Section 5.1.
ispc supports arbitrary structured control flow within functions, including if statements, switch statements, for, while, and do loops, as well as break, continue, and return statements in all of the places where they are allowed in C. Different program instances can follow different control paths; in the example above, the while loop may execute a different number of times for different elements of the array.
ispc provides a standard library of useful functions, including hardware atomic operations, transcendentals, functions for communication between program instances, and various data-parallel primitives such as reductions and scans across the program instances.
4.1 Mapping Computation to Data
Given a number of instances of the program running in SPMD (i.e. one gang), it's necessary for the instances to iterate over the input data (which is typically larger than a gang). The example above does this using a for loop and the built-in variables programIndex and programCount. programCount gives the total number of instances running (i.e. the gang size) and programIndex gives each program instance an index from zero to programCount-1. Thus, in the above, for each for loop iteration a programCount-sized number of contiguous elements of the input arrays are processed concurrently by the program instances.
ispc's built-in programIndex variable is analogous to the threadIdx variable in CUDA and to the get_global_id() function in OpenCL, though a key difference is that in ispc, looping over more than a gang's worth of items to process is implemented by the programmer as an in-language for or foreach loop, while in those languages, the corresponding iteration is effectively done by the hardware and runtime thread scheduler outside of the user's kernel code. Performing this mapping in user code gives the programmer more control over the structure of the parallel computation.
4.2 Implementation
The ispc compiler uses flex and bison for tokenization and parsing. The compiler front-end performs type-checking and standard early optimizations such as constant folding before transforming the program to the vector intermediate representation of the LLVM toolkit [21]. LLVM then performs an additional set of traditional optimizations. Next our custom optimizations are applied, as discussed in Section 6. LLVM then generates final assembly code.
It is reasonably easy to add support for new target instruction sets: most of the compiler is implemented in a fashion that is target agnostic (e.g. "gathers" are issued generically and only late in the compilation process are they transformed to a target-specific operation).
5. DESIGN FEATURES
We'll now describe some key features of ispc and how they support the goals introduced in Section 2.
5.1 "Uniform" Datatypes 5.1 "统一 "数据类型
In a SPMD language like ispc, a declaration of a variable like float x represents a variable with a separate storage location (and thus, potentially different value) for each of the program instances. However, some variables and their associated computations do not need to be replicated across program instances. For example, address computations and loop iteration variables can often be shared.
Since CPU hardware provides separate scalar computation units, it is important to be able to express non-replicated storage and computation in the language. ispc provides a uniform storage class for this purpose, which corresponds to a single value in memory and thus, a value that is the same across all elements. In addition to the obvious direct benefits, the use of uniform variables facilitates additional optimizations as discussed in Section 6.1. It is a compile-time error to assign a non-uniform (i.e., "varying") value to a uniform variable.
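A brief hedged illustration (the variable names are ours):

uniform int n = 10;    // one value shared across the gang
int i = programIndex;  // varying: a distinct value per instance
i = n;                 // fine: the uniform value broadcasts
// n = i;              // compile-time error: varying to uniform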
In the absence of the uniform storage class, an optimizing compiler could convert varying variables into uniform variables when appropriate. (For example, in OpenCL or CUDA, all kernel parameters are effectively uniform and only variables that have values that are derived from the thread index are varying.) However, there are a number of reasons why having uniform as an explicit property of types in the language is important:
Interoperability with C/C++ data structures: uniform is necessary to explicitly declare in-memory variables of just a single element, as is common in C/C++.
Performance transparency: Treating uniform as an optimization rather than an explicit type property would make it difficult for the programmer to reason about performance. A small change to a program could inadvertently inhibit the optimization elsewhere resulting in significant and difficult-to-understand performance regressions.
Support for separate compilation: Optimizations cannot cross separate-compilation boundaries, so at a minimum it must be possible to define a formal function parameter as uniform. But to guarantee that a call to such a function with a variable as an actual parameter is legal, uniform must be an explicit part of the type system. Otherwise, the legality of the function call would depend on the optimizer's behavior for the variable.
There is a downside to distinguishing between uniform and varying types in the type system: with separately compiled libraries of functions, to provide optimum performance it may be necessary to have multiple variants of functions that take different combinations of uniform and varying parameters.
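For example, a separately compiled library might expose both of the following overloads (a hedged sketch; the function name is illustrative), relying on ispc's function overloading:

float scale(uniform float s, float x);  // scalar multiplier
float scale(float s, float x);          // per-instance multiplier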
The uniform and varying keywords were first used in the RenderMan shading language [11], but a similar distinction was made even earlier in general-purpose SIMD languages. To designate a SIMD variable, C* uses poly; Pixar's FLAP-C uses parallel; and MPL uses plural. CUDA and OpenCL do not provide this distinction; all variables are semantically varying in those languages.
5.2 Support For SOA Layout
It is well known that the standard layout in memory for an "array of structures" (AOS) leads to sub-optimal performance for SIMD code. The top third of Figure 2 illustrates the issue using a simple structure, which corresponds to the ispc code below:
Figure 2: "Array of structures" layout (top), "hybrid structure of arrays" layout (middle), and "short structure of arrays" layout (bottom) of the example structure from Section 5.2. Reading data in an AOS layout generally leads to expensive gather instructions, while the SOA layouts lead to efficient vector load instructions.
struct Foo { float x, y, z; };
uniform Foo a[...] = { ... };
int index = ...;
float x = a[index].x;
Even if program instances access the elements of contiguous structures (i.e. the values of index are sequential over the program instances), the locations accessed are strided in memory and performance suffers from gathers (Section 3.5).
A better performing in-memory layout is "hybrid structure of arrays" (hybrid SOA layout), where the structure members are widened to be short arrays. On a system with a 4-wide vector unit, one might instead use the following struct declaration and access code:
struct Foo4 { float x[4], y[4], z[4]; };
uniform Foo4 a[...] = { ... };
int index = ...;
float x = a[index / 4].x[index & 3];
The corresponding memory layout is shown in the middle third of Figure 2. In many cases, accessing structure elements in hybrid SOA layout can be done with efficient vector load and store instructions.
The above syntax for declaring hybrid SOA layout and accessing hybrid SOA data is awkward and unnecessarily verbose; each element of Foo4 has the same array width repeated in its declaration. If we want both SOA and AOS versions of the struct, we would have to declare two structs with different types, which is undesirable. Furthermore, accessing elements of the structure is much more unwieldy to express than in the AOS case.
ispc addresses these problems and encourages more efficient hybrid SOA data layout by introducing a keyword soa, which modifies existing types to be laid out in SOA format. The soa qualifier converts primitive types (e.g. float or int) to fixed-sized arrays of that type, while for nested data structures or arrays, soa propagates downward through the data structure until it reaches a primitive type. Traditional array indexing syntax is used for indexing into hybrid SOA data, while the code generated by the compiler actually implements the two-stage indexing calculation. Thus, use of the more efficient hybrid SOA layout can be expressed as follows in ispc:
struct Foo { float x, y, z; };
soa<4> struct Foo a[...] = { ... };
int index = ...;
float x = a[index].x;
Other than the soa<4> keyword, the code looks just like what one would write for an AOS layout, yet it delivers all of the performance benefits of hybrid SOA. As far as we know, these SOA capabilities have not been provided before in a general-purpose C-like language.
SOA layout also improves the performance of accesses to variables used by each program instance in a gang. We refer to this layout as a "short SOA layout" and illustrate it in the bottom of Figure 2. In the SPMD programming model, such variables should "look" scalar when they are used in expressions, so the indexing of such variables by programIndex should be implicit. Note that CUDA and OpenCL achieve similar behavior by storing such variables in a separate per-lane memory space. The keyword varying produces the desired behavior in ispc: it causes a structure to be widened to the gang size and to be implicitly indexed by programIndex. In the code below, after the expensive AOS structure loads have been performed by the indexing operation, the elements of fv are laid out contiguously in memory and so can be accessed efficiently.
uniform struct Foo a[...] = { ... };
int index = ...;
varying Foo fv = a[index];
// now e.g. fv.x is contiguous in memory
fv.x = fv.y + fv.z; // looks scalar
varying structures of this form are also available in the VecImp and IVL languages, designed concurrently with ispc [23].
The ability to conveniently but explicitly declare and access hybrid SOA and short SOA data structures is one of the major advantages of ispc over OpenCL when targeting CPU hardware. Note that languages that do not define memory layout as strictly as C (and hence typically forbid pointers or restrict pointer arithmetic) may choose to optimize layout to SOA form even when the declaration appears to be AOS. For languages with strict layout rules, the compiler may still optimize layout to SOA form if it can guarantee that pointers are never used to access the data. However, these approaches provide less performance transparency than ispc's approach and cannot be used for zero-copy data structures that are shared with the C/C++ application.
5.3 Full C Pointer Model
ispc generalizes the full set of C pointer operations to SPMD, including both uniform and varying pointers, pointers to pointers, and function pointers. This feature is important for the expressability of algorithms that use complex pointer-based data structures in ispc and is also critical for allowing ispc programs to interoperate with existing application data structures. Often the code that builds these data structures is not performance-critical and can be left in C/C++, while the performance-critical portions of the application that read or update the data structure can be rewritten in ispc.
The distinction between uniform and varying data exists for both the pointer itself and for the data that is pointed to. (MasPar's C extensions make a similar distinction [29].) Thus, there are four kinds of pointers:
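A sketch with explicit qualifiers (the pointer names are illustrative):

uniform float * uniform p0;  // uniform pointer to uniform data
varying float * uniform p1;  // uniform pointer to varying data
uniform float * varying p2;  // varying pointer to uniform data
varying float * varying p3;  // varying pointer to varying data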
The first two declarations above are uniform pointers; the first is to uniform data and the second is to varying data. Both are thus represented as single scalar pointers. The second two declarations are varying pointers, representing a separate pointer for each program instance. Because all variables and dynamically allocated storage reside in a single coherent address space, any pointer can point to any data of the appropriate underlying type in memory.
In OpenCL and CUDA, all locally-declared pointers are in effect varying pointers to data, with additional limitations imposed by the fragmented memory architecture. CUDA supports function pointers and pointers to pointers, whereas OpenCL does not support function pointers and only supports certain cases of pointers to pointers.
5.4 Task Launch
In order to make it easy to fill multiple CPU cores with computation, ispc provides an asynchronous task launch mechanism, closely modeled on the "spawn" facility provided by Cilk [4]. ispc functions called in this manner are semantically asynchronous function calls that may run concurrently in different hardware threads than the function that launched them. This capability makes multi-core parallelization of ispc programs straightforward when independent computation is available; generally just a few lines of additional code are needed to use this construct.
Any complex multi-core C++ application typically has its own task system or thread pool, which may be custom designed or may be an existing one such as Microsoft's Concurrency Runtime, or Intel Thread Building Blocks. To interoperate with the application's task system, ispc allows the user to provide a callback to a task enqueue function, and then uses this callback to enqueue asynchronous tasks.
As in Cilk, all tasks launched from an ispc function must have returned before the function is allowed to return. This characteristic ensures parallel composability by freeing callers of functions from having to be aware of whether tasks are still executing (or yet to be executed) from functions they called. ispc also provides an explicit built-in sync construct that waits for tasks launched earlier in the function to complete.
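A hedged sketch of the mechanism, assuming ispc's launch and sync constructs (the function and the work decomposition are illustrative):

task void process_chunk(uniform float data[], uniform int start,
                        uniform int count) {
    foreach (i = start ... start + count)
        data[i] *= 2.0f;  // scale this chunk's elements
}

export void process_all(uniform float data[], uniform int n) {
    uniform int chunkSize = 4096;
    for (uniform int start = 0; start < n; start += chunkSize) {
        uniform int count = (n - start < chunkSize) ? n - start : chunkSize;
        launch process_chunk(data, start, count);  // asynchronous
    }
    sync;  // wait for all launched tasks (also implicit at return)
}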
Current GPU programming languages have no support for task launch from the GPU, although it is possible to implement a task system in "user space" in CUDA [1].
5.5 Cross-lane Operations
One strength of SIMD capabilities on CPUs is the rich set of fast cross-lane operations. For example, there are instructions for broadcasting a value from one lane to all other lanes, and instructions for permuting values between lanes. ispc exposes these capabilities through built-in functions that allow the program instances in a gang to exchange data. These operations are particularly lightweight thanks to the gang convergence guarantees described in Section 3.4.
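A hedged sketch using functions from ispc's standard library (broadcast(), rotate(), and reduce_add()):

float v = ...;                        // varying value
float fromLane0 = broadcast(v, 0);    // lane 0's value to all lanes
float neighbor = rotate(v, 1);        // value from the next lane over
uniform float total = reduce_add(v);  // sum of v across the gang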
5.6 Coherent Control Flow Hints
As described in Section 3.3, divergent control flow requires extra instructions on CPU hardware compared to regular control flow. In many uses of control flow, the common case is that all program instances follow the same control path. If the compiler had a way to know this, it could perform a number of optimizations, which are introduced in Section 6.5. ispc provides language constructs to express the programmer's expectation that control flow will typically be converged at a given point in the program. For each control flow construct, there is a corresponding "coherent" variant with the character "c" prepended to it. The following code shows cif in use:
float x = ...;
cif (x < 0) {
// handle negative x
}
These coherent control flow variants do not affect program correctness or the final results computed, but can potentially lead to higher performance.
For similar reasons, ispc provides convenience foreach constructs that loop over arrays of one or more dimensions and automatically set the execution mask at boundaries. These constructs allow the ispc compiler to easily produce optimized code for the subset of iterations that completely fill a gang of program instances (see Section 6.6 for a description of these optimizations).
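As a hedged sketch (the array and bound names are illustrative), a 2D foreach loop over an image; the compiler handles the partially full gangs at the right and bottom edges automatically:

foreach (y = 0 ... height, x = 0 ... width) {
    float v = input[y * width + x];
    output[y * width + x] = v * v;  // mask covers partial gangs at edges
}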
5.7 Native Object Files and Function Calls
The ispc compiler generates native object files that can be linked into the application binary in the same way that other object files are. ispc code can be split into multiple object files if desired, with function calls between them resolved at link time. Standard debugging information is optionally included. These capabilities allow standard debuggers and disassemblers to be used with ispc programs and make it easy to add ispc code to existing build systems.
ispc's calling conventions are based on the platform's standard ABI, though functions not marked export are augmented with an additional parameter to provide the current execution mask. Functions that are marked export can be called with a regular function call from C or C++; calling an ispc function is thus a lightweight operation: it's the same as the overhead of calling an externally-defined C or C++ function. In particular, no data copying or reformatting is performed, other than possibly pushing parameters onto the stack if required by the platform ABI. While there are some circumstances where such reformatting could lead to improved performance, introducing such a layer is against our goals of performance transparency.
Lightweight function calls are a significant difference from OpenCL on the CPU, where an API call to a driver must be made in order to launch a kernel and where additional API calls are required to set each kernel parameter value.
6. EFFICIENT SPMD-ON-SIMD
There are a number of specialized optimizations that ispc applies to generate efficient code for SPMD on CPUs. We will show how the features introduced in Section 5 make a number of these optimizations possible. We focus on the optimizations that are unique to SPMD-on-SIMD; ispc also applies a standard set of traditional optimizations (constant folding, inlining, etc).
6.1 Benefits of "Uniform"
Having scalar uniform data types, as introduced in Section 5.1, provides a number of benefits compared to always having a separate per-program-instance storage location for each variable in the program:
It reduces the total amount of in-memory storage used for data, which in turn can lead to better cache performance.
Less bandwidth is consumed when reading and writing scalar values to memory.
CPUs have separate register sets for scalar and vector values; storing values in scalar registers when possible reduces pressure on vector registers.
CPUs can co-issue scalar and vector instructions, so that scalar and vector computations can happen concurrently.
In the usual case of using 64-bit pointers, pointer arithmetic (e.g. for addressing calculations) is more efficient for scalar pointers than for vector pointers.
Dereferencing a uniform pointer (or using a uniform value to index into an array) corresponds to a single scalar or vector memory access, rather than a general gather or scatter.
Code for control flow based on uniform quantities can be more efficient than code for control flow based on non-uniform quantities (Section 6.2).
For the workloads we use for evaluation in Section 7, if all uses of the uniform qualifier were removed, thus eliminating all of the above benefits, the workloads ran at geometric mean (geomean) 0.45x the speed of when uniform was present. The ray tracer was hardest hit, running at 0.05x of its previous performance; "aobench" ran at 0.36x its original performance without uniform and "stencil" at 0.21x.
There were multiple causes of these substantial performance reductions without uniform; the most significant were the higher overhead of non-uniform control flow and the much greater expense of varying pointer operations compared to uniform pointer operations. Increased pressure on the vector registers, which in turn led to more register spills to memory, also hurt performance without uniform.
6.2 Uniform Control Flow
When a control flow test is based on a uniform quantity, all program instances will follow the same path at that point in a function. Therefore, the compiler is able to generate regular jump instructions for control flow in this case, avoiding the costs of mask updates and overhead for handling control flow divergence.
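For example (a sketch of ours; the function is hypothetical):

    export uniform float blend(uniform int mode, uniform float x) {
        uniform float r;
        if (mode == 0)      // test of a uniform value: all program instances
            r = x * 2.0f;   // agree, so this compiles to a regular jump with
        else                // no mask updates or divergence handling
            r = x * 0.5f;
        return r;
    }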
Treating all uniform control flow statements as varying caused the example workloads to run with performance geomean 0.91x as fast as when this optimization was enabled. This optimization had roughly similar effectiveness on all of the workloads, though the ray tracer was particularly hard-hit without it, running 0.65x as fast as when the optimization was enabled.
6.3 Benefits of SOA
We measured the benefits of SOA versus AOS layout with a workload based on a collision detection algorithm that computed collision points, when present, between two groups of spheres. We implemented this workload with AOS layout and then modified the implementation to also use SOA. By avoiding the gathers required with AOS layout, the SOA version was 1.25x faster than the AOS one.
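The declaration involved might look roughly like the following; the Sphere struct and the 8-wide layout are our own illustration, and the soa<n> qualifier syntax is our assumption about ispc's facility for deriving an SOA variant of a previously-declared struct type:

    struct Sphere { float x, y, z, radius; };

    // soa<8> lays the array out in memory as structure-of-arrays in 8-wide
    // blocks (xxxxxxxx yyyyyyyy ...), so a gang reading spheres[i].x can
    // use one contiguous vector load instead of a gather.
    soa<8> Sphere spheres[1024];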
6.4 Coherent Memory Access
After conventional compiler optimizations have been applied, it's often possible to detect additional cases where the program instances are actually accessing memory coherently [17]. The ispc compiler performs an additional optimization pass late in compilation that detects cases where all the instances, even if using "varying" indexing or pointers, are actually accessing the same location or consecutive locations.
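For instance (our own sketch), a varying index whose values are provably consecutive across the gang can be serviced with a single vector load:

    export void copyShifted(uniform float out[], uniform float data[],
                            uniform int baseOffset) {
        // "index" is varying, but its values are consecutive across the gang
        int index = baseOffset + programIndex;
        // the late optimization pass detects this and emits a vector load
        // rather than a general gather
        out[programIndex] = data[index];
    }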
When this optimization was disabled for the example workloads, performance was geomean 0.79x that of when it was enabled. This optimization had a significant effect on the "stencil" workload, which ran 0.23x as fast with it disabled.
6.5 Dynamic Control Flow Coherence
Recall from Section 3.3 that control flow is generally transformed by the compiler to data flow with masking, so that for example both the "if" and "else" clauses of an if statement are executed. In many such cases, the executing program instances will actually follow a converged control-flow path at runtime; for example, only the "else" clause might be actually needed. The code generated by the compiler can check for this case at points in the program where control flow could diverge. When it actually does not diverge, a more efficient code path can be followed. Performing this check can be especially helpful to performance for code paths that are rarely executed (corner case handling, etc.).
The ispc compiler uses the "coherent" control flow statements described in Section 5.6 to indicate when these additional tests should be performed. Performing this check for dynamic convergence at runtime gives two main advantages (a sketch follows the list):
- It makes it possible to avoid executing instructions when the mask is "all off" and to jump over them.
- It gives an opportunity for dynamically reestablishing that the mask is "all on" and then taking a specialized code path for that case; the advantages of doing so are discussed in the following subsection.
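A sketch of the coherent variant of if (ispc spells it cif); the clamping function itself is our own example:

    export void clampNegative(uniform float x[], uniform int n) {
        foreach (i = 0 ... n) {
            cif (x[i] < 0.0f) {
                // rarely-taken corner case: the generated code tests the
                // mask at runtime, jumping over this block when it is
                // "all off" and taking a specialized path when "all on"
                x[i] = 0.0f;
            }
        }
    }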
Disabling the coherent control flow statements caused the example workloads to run at geomean 0.85x their performance of when they were enabled. This optimization was particularly important for "aobench", which ran at 0.33x of regular performance without it. For the workloads that only have "uniform" control flow (e.g. Black-Scholes), disabling this optimization had no effect.
6.6 All On Mask
When it can be determined (statically or dynamically) that all of the program instances in a gang are executing at a given point in the program, there are additional optimization opportunities.
Sec.   Optimization                        Perf. when disabled
6.1    Uniform data & control flow         0.45x
6.2    Uniform control flow                0.91x
6.4    Gather/scatter improvements         0.79x
6.5    Coherent control flow               0.85x
6.6    "All on" mask improvements          0.73x

Table 1: Effect of individually disabling various optimizations (geometric mean over all of the example workloads).
For example, scatters need to be "scalarized" on current CPUs; they are turned into a scalar store for each currently-executing program instance. In the general case, this scalarization requires a conditional test for each program instance before the corresponding store instruction. If all program instances are known to be executing, however, the per-lane mask check can be omitted.
There is furthermore some benefit, even on systems that natively support masked memory instructions, to turning masked loads and stores into regular loads and stores when the mask is known to be all on. Doing so can in turn allow those memory operations to be emitted as direct memory operands to instructions, without needing to be first loaded into registers.
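To illustrate (a C-style sketch of ours, not actual compiler output), a 4-wide scatter might be scalarized roughly as follows:

    /* general case: a per-lane mask test guards each scalar store */
    void scatter4(float *addr[4], const float value[4], int mask) {
        for (int lane = 0; lane < 4; ++lane)
            if (mask & (1 << lane))
                *addr[lane] = value[lane];
    }

    /* mask statically known to be all-on: the per-lane tests are omitted */
    void scatter4AllOn(float *addr[4], const float value[4]) {
        for (int lane = 0; lane < 4; ++lane)
            *addr[lane] = value[lane];
    }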
Disabling all of the optimizations that take advantage of statically determining that the execution mask is all on led to geomean 0.73x the performance of when they were enabled.
7. RESULTS
We have measured the performance of a variety of workloads written in ispc, comparing to serial, non-SIMD C++ implementations of the same algorithms. Both the C++ and ispc implementations received equivalent amounts of performance tuning. (In general, the ispc and C++ implementations are syntactically very similar.)
These workloads are all included in the open-source ispc distribution. They include two options pricing workloads, a third-order stencil computation, a ray tracer, a volume renderer, Mandelbrot set computation, and "aobench", a Monte Carlo rendering benchmark [9]. Most of these are not suitable for conventional auto-vectorization, due to complex data-dependent control flow and program complexity.
For the results reported here, we did a number of runs of each workload, reporting the minimum time; the results were within a few percent over each run. Other than the results on a 40-core system, results were measured on a 4-core Apple iMac with a 3.4 GHz Intel Core i7 processor, using the AVX instruction set. The basis for comparison is a reference C++ implementation compiled with a version of the clang compiler built using the same version of the LLVM libraries that are used by ispc. Thus, the results should generally indicate the performance due to more effective use of the vector unit, rather than differences in the implementation of traditional compiler optimizations or code generation.
We have not performed direct comparisons between ispc and CPU OpenCL implementations in these evaluations.
Workload            1 core, 1 thread    4 cores, 8 threads
aobench             5.58x               26.26x
Binomial Options    4.39x               18.63x
Black-Scholes       7.43x               26.69x
Mandelbrot Set      5.85x               24.67x
Ray Tracer          6.85x               34.82x
Stencil             3.37x               12.03x
Volume Rendering    3.24x               15.92x

Table 2: Speedup of various workloads on a single core and on four cores of a system with 8-wide SIMD units, compared to a serial C++ implementation. The one-core speedup shows the benefit from using the SIMD lanes of a single core efficiently, while the four-core speedup shows the benefit from filling the entire processor with useful computation.
Such results would be hard to interpret, in that the effects of different compiler optimizations and code generators would be confounded with the effects of the language designs. Instead, we have focused on evaluating the performance benefit of various ispc features by disabling them individually, thus isolating the effect of the factor under evaluation.
Table 1 recaps the effects of the various compiler optimizations that were reported in Section 6.
7.1 Speedup Compared to Serial Code
Table 2 shows speedups due to ispc's effective use of SIMD hardware and due to ispc's use of task parallelism on a 4-core system. The table compares three cases for each workload: a serial non-SIMD C++ implementation; an ispc implementation running in a single hardware thread on a single core of the system; and an ispc implementation running eight threads on the four two-way hyper-threaded cores of the system. The four-core performance shows the result of filling the entire processor with computation via both task parallelism and SPMD. For the four-core results, the workloads were parallelized over cores using the tasking functionality described in Section 5.4.
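A sketch of how such a workload can be spread across cores with ispc's task constructs; the function, the row-based decomposition, and the launch[] count syntax are our own illustration of the Section 5.4 functionality and may differ in detail from the version of ispc described here:

    task void shadeRows(uniform float out[], uniform int width,
                        uniform int rowsPerTask) {
        uniform int y0 = taskIndex * rowsPerTask;
        for (uniform int y = y0; y < y0 + rowsPerTask; ++y)
            foreach (x = 0 ... width)              // SPMD within each task
                out[y * width + x] = (float)x / (float)width;
    }

    export void shadeImage(uniform float out[], uniform int width,
                           uniform int height, uniform int nTasks) {
        launch[nTasks] shadeRows(out, width, height / nTasks);  // across cores
    }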
7.2 Speedup Versus Intrinsics
The complexity of the example workloads (which are as much as 700 lines of ispc code) makes it impractical to also implement intrinsics-based versions of them for performance comparisons. However, a number of users of the system have implemented computations in ispc after previously implementing the same computation with intrinsics and seen good results; the examples we've seen are an image downsampling kernel (ispc performance 0.99x of intrinsics), a collision detection computation (ispc 1.05x faster), and a particle system rasterizer (ispc 1.01x faster).
7.3 Speedup with Wider Vectors
We compared the performance of compiling the example workloads to use four-wide SSE vector instructions versus eight-wide AVX on a system that supported both instruction sets. No changes were made to the workloads' ispc source code. The geometric mean of the speedup for the workloads when going from SSE to AVX was 1.42x.
Workload            40 cores, 80 threads
aobench             182.36x
Binomial Options    63.85x
Black-Scholes       83.97x
Mandelbrot Set      76.48x
Ray Tracer          195.67x
Stencil             9.40x
Volume Rendering    243.18x

Table 3: Speedup versus serial C++ implementations of various workloads on a 40-core system with 4-wide SIMD units.
Though this is not as good as the potential 2x speedup from the doubling of vector width, there are a number of microarchitectural details in the first generation of AVX systems that inhibit ideal speedups; they include the fact that the integer vector units are still only four-wide, as well as the fact that cache write bandwidth was not doubled to keep up with the widening of the vector registers.
7.4 Scalability on Larger Systems
Table 3 shows the result of running the example workloads with 80 threads on a 2-way hyper-threaded 40-core Intel Xeon E7-8870 system at 2.40 GHz, using the SSE4 instruction set and running Microsoft Windows Server 2008 Enterprise. For these tests, the serial C/C++ baseline code was compiled with MSVC 2010. No changes were made to the implementation of the workloads after their initial parallelization, though the "aobench" and options pricing workloads were run with larger data sets than the four-core runs (a higher image resolution, and 2M options versus 128k options, respectively).
The results fall into three categories. Some workloads (aobench, the ray tracer, and volume rendering) saw substantial speedups versus the serial baseline, thanks to effective use of all of the system's computational resources, achieving speedups of more than the theoretically-ideal 160x (the product of the number of cores and the SIMD width of each core); again, the super-linear component of the speedups is mostly due to hyper-threading. Other workloads (both of the options pricing workloads and the Mandelbrot set workload) saw speedups around 2x the system's core count; for these, the MSVC compiler seems to have been somewhat effective at automatically vectorizing the serial baseline, thus improving its performance. Note, however, that these are the simplest of the workloads; for the more complex workloads the auto-vectorizer is much less effective.
The stencil computation saw a poor speedup versus the serial baseline (and indeed, a worse speedup than on a four-core system). The main issue is that the computation is iterative, requiring that each set of asynchronous tasks complete before the set of tasks for the next iteration can be launched; the repeated ramp up and ramp down of parallel computation hurts scalability. Improvements to the implementation of the underlying task system could presumably reduce the impact of this issue.
7.5 Users
There have been over 1,500 downloads of the ispc binaries since the system was first released; we don't know how many additional users are building the system from source. Users have reported roughly fifty bugs and made a number of suggestions for improvements to code generation and language syntax and capabilities.
Overall feedback from users has been positive, both from users with a background in SPMD programming from GPUs and from users with extensive background in intrinsics programming. Their experience has generally been that ispc's interoperability features and close relationship to C have made it easy to adopt the system; users can port existing code to ispc by starting with their existing C++ code, updating it to remove any constructs that ispc doesn't support (like classes), and then modifying it to use ispc's parallel constructs. It hasn't been unusual for a user with a bit of ispc experience to port an existing 500-1000 line program from C++ to ispc in a morning's work. From the other direction, many ispc programs can be compiled as C with the introduction of a few preprocessor definitions; being able to go back to serial C with the same source code has been useful for a number of users as well.
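For example (a sketch of ours of the kind of shim involved; the exact set of definitions depends on the program, and constructs like foreach and launch need slightly more involved treatment):

    /* serial_shim.h -- illustrative only: with definitions like these,
       straightforward ispc code can be compiled as serial C, with a
       "gang" of one program instance */
    #define uniform
    #define varying
    #define export
    #define programCount 1
    #define programIndex 0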
Applications that users have reported using ispc for include implementing a 2D Jacobi Poisson solver (achieving a 3.60x speedup compared to the previous implementation, both running on a single core); implementing a variety of image processing operations for a production imaging system (achieving a 3.2x speedup, again on a single core); and implementing physical simulation of airflow for aircraft design (speedups not reported to us). Most of these users had not previously bothered to try to vectorize their workloads with intrinsics, but have been able to see substantial speedups using ispc; they have generally been quite happy with both performance transparency and absolute performance.
8. RELATED WORK
The challenge of providing language and compiler support for parallel programming has received considerable attention over many years. To keep the discussion of related work tractable, we focus on languages whose goal is high performance (or more precisely, high efficiency) programming of SIMD hardware. We further focus on general purpose languages (in contrast to domain-specific languages) with a particular emphasis on languages that are C-like. We do not discuss languages and libraries that are focused just on multi-core or distributed parallelism, such as OpenMP, TBB, Cilk, MPI, etc., even though some of these languages use an SPMD programming model.
8.1 Historical systems
In the late 1980s and early 1990s, there was a wave of interest in SIMD architectures and accompanying languages. In all of the cases we discuss, SIMD computations were supported with a true superset of C; that is, serial C code could always be compiled, but the SIMD hardware was accessible via language extensions. The Pixar FLAP computer had a scalar integer ALU and a 4-wide SIMD floating-point ALU, with an accompanying extended-C language [24]. FLAP is also notable for providing hardware support for SIMD mask operations, like the MIC ISA and some modern GPUs. The Thinking Machines CM-1 and CM-2 and the MasPar MP-1 and MP-2 supercomputers used very wide SIMD (1000s of ALUs), programmed in the extended-C languages C* [33] and MPL [29], respectively.
All of these systems used a single language for both serial and parallel computations; had a single hardware program counter; and provided keywords similar to ispc's uniform and varying to distinguish between scalar and SIMD variables. MPL provided vector control constructs with syntax similar to ispc, OpenCL, and CUDA; C* provided a more limited capability just for if statements. MPL provided generalized SIMD pointers similar to the ones in ispc, but each SIMD ALU and the scalar ALU had its own memory so these pointers could not be used to communicate data between units as they can in ispc. Both C* and MPL had sophisticated communication primitives for explicitly moving data between SIMD ALUs.
ClearSpeed's Cn language is a more recent example of this family; the paper describing it has a good discussion of design trade-offs [26].
8.2 Contemporary systems
CUDA is a SPMD language for NVIDIA GPUs [31], and OpenCL is a similar language developed as an open standard, with some enhancements such as API-level task parallelism designed to make it usable for CPUs as well as GPUs [19]. At a high level, the most important differences between these languages and ispc are that ispc's design was not restricted by GPU constraints such as a separate memory system, and that ispc includes numerous features designed specifically to provide efficient performance on CPUs. All three languages are C-like but do not support all features of C; ispc and CUDA have some C++ features as well.
The difference in hardware focus between CUDA/OpenCL and ispc drives many specific differences. OpenCL has several different address spaces, including a per-SIMD-lane memory address space (called "private") and a per-workgroup address space (called "local"), whereas ispc has a single global coherent address space for all storage. OpenCL and CUDA also have complex APIs for moving data to and from a discrete graphics card that are unnecessary in ispc. ispc has language-level support for task parallelism, unlike OpenCL and CUDA. CUDA and OpenCL lack ispc's support for "uniform" variables and convenient declaration of structure-of-arrays data types. Although these features are less important for performance on GPUs than on CPUs, we believe they would provide some benefit even on GPUs.
There are several implementations of CUDA and OpenCL for CPUs. Some do not attempt to vectorize across SIMD lanes in the presence of control flow [10, 36]. Intel's OpenCL compiler does perform SIMD vectorization [34], using an approach related to Karrenberg et al.'s [17] (who also applied their technique to OpenCL kernels).
Parker et al.'s RTSL system provided SPMD-on-SIMD on current CPUs in a domain-specific language for implementing ray tracers [32].
Microsoft's C++ AMP [30] provides a set of extensions to C++ to support GPU programming. As with CUDA and OpenCL, its design was constrained by the goal of running on today's GPUs. It is syntactically very different from CUDA, OpenCL, and ispc because of its choice of mechanisms for extending C++.
The UPC language extends C to provide an SPMD programming model for multiple cores [5]. UPC includes mechanisms for scaling to very large systems that lack hardware memory coherence, but the language was not designed to target SIMD parallelism within a core and as far as we know it has never been used for this purpose.
8.3 Concurrently-developed systems
The IVL and VecImp languages described in a recent paper are similar to ispc in a number of ways [23]; they were developed concurrently with ispc, with some cross-pollination of ideas. These three languages are the only C-like general-purpose languages that we are aware of that provide a mechanism for creating a structure-of-arrays variant of a previously-declared struct data type.
There are substantial differences in emphasis between the VecImp/IVL paper and this work. The VecImp/IVL paper focuses on describing the language and formally proving the soundness of the type system, whereas we focus on justifying and quantitatively evaluating language features such as uniform variables and structure-of-arrays support. IVL and its evaluation focus on the MIC architecture, whereas ispc focuses on the SSE and AVX architectures which have less dedicated ISA support for SPMD-style computation. This paper also introduces and analyzes the compiler optimizations required to reap the full benefit of language features such as uniform variables.
There are a variety of other detailed differences between ispc, IVL, and VecImp. For example, IVL supports function polymorphism, which is not currently supported in ispc, and ispc's pointer model is more powerful than IVL's. ispc uses LLVM for code generation, but the IVL compiler generates C++ code with intrinsics. ispc is the only one of the three languages with an implementation available for public use.
The Intel C/C++ compiler provides an "elemental functions" extension of C++ that is intended to provide SPMD as an extension of a full C++ compiler [16]. Its language functionality for SPMD is more limited than ispc's; for example, its equivalent of uniform can only be applied to function parameters, and there is no general facility for creating SOA types from AOS types. It has been demonstrated that its capabilities can be used to achieve good utilization of SIMD units [20].
9. CONCLUSION
We have presented ispc, a SPMD language for programming CPU vector units that is easy to adopt and productive to use. We have shown that a few key language features (uniform data types, native support for SOA structure layout, and in-language task launch), coupled with a series of custom optimization passes, make it possible to efficiently execute SPMD programs on the SIMD hardware of modern CPUs. These programs can effectively target the full capabilities of CPUs, executing code with performance essentially the same as hand-written intrinsics. Support for uniform types is particularly important; our experiments showed that this capability provides over a 2x performance improvement.
In the future, we plan to further refine the ispc language, eliminating remaining differences with C and adding convenience features like polymorphic functions. We are already adding support for the MIC architecture, which is an attractive target due to its 16-wide SIMD and good ISA support for SPMD execution.
Experience with ispc suggests a number of avenues for improving future hardware architectures. For conventional CPUs, improved support for masking and scatter would be desirable, and extending vector units to operate on 64-bit integer values at the same performance as when operating on 32-bit integer values (for vector pointer addressing calculations) may be helpful.
The decision to include both scalar and SIMD computation as first-class operations in the language may be applicable to other architectures. For example, AMD's forthcoming GPU has a scalar unit alongside its vector unit [27], as does a research architecture from NVIDIA [18]. Such architectures could have a variety of efficiency advantages versus a traditional "brute force" SIMD-only GPU implementation [6]. More broadly, many of the available approaches for achieving high SIMD efficiency can be implemented in different ways: by the programmer/language, by the compiler, or by the hardware. In the power-constrained environment that limits all hardware architectures today, we expect continued exploration of the complex trade-offs between these different approaches.
Acknowledgments
The parser from the (non-SPMD) C-based "nit" language written by Geoff Berry and Tim Foley at Neoptica provided the starting-point for the ispc implementation; ongoing feedback from Geoff and Tim about design and implementation issues in ispc has been extremely helpful. Tim suggested the "SPMD on SIMD" terminology and has extensively argued for the advantages of the SPMD model.
We'd like to specifically thank the LLVM development team; without LLVM, this work wouldn't have been possible. Bruno Cardoso Lopes's work on support for AVX in LLVM was particularly helpful for the results reported here.
We have had many fruitful discussions with Ingo Wald that have influenced the system's design; ispc's approach to SPMD and Ingo's approach with the IVL language have had bidirectional and mutually-beneficial influence. More recently, discussions with Roland Leißa and Sebastian Hack about VecImp have been quite helpful.
We appreciate the support of Geoff Lowney and Jim Hurley for this work, as well as Elliot Garbus's early enthusiasm and support for it. Thanks to Kayvon Fatahalian, Solomon Boulos, and Jonathan Ragan-Kelley for discussions about SPMD parallel languages and SIMD hardware architectures. Discussions with Nadav Rotem about SIMD code generation and LLVM, as well as discussions with Matt Walsh, have also directly improved the system. Ali Adl-Tabatabai's feedback and detailed questions about the precise semantics of ispc have been extremely helpful as well.
Thanks to Tim Foley, Mark Lacey, Jacob Munkberg, Doug McNabb, Andrew Lauritzen, Misha Smelyanskiy, Stefanus Du Toit, Geoff Berry, Roland Leißa, Aaron Lefohn, Dillon Sharlet, and Jean-Luc Duprat for comments on this paper, and thanks to the early users of ispc inside Intel (Doug McNabb, Mike MacPherson, Ingo Wald, Nico Galoppo, Bret Stastny, Andrew Lauritzen, Jefferson Montgomery, Jacob Munkberg, Masamichi Sugihara, and Wooyoung Kim) in particular for helpful suggestions and bug reports as well as for their patience with early versions of the system.
10. REFERENCES
[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proc. High-Performance Graphics 2009, pages 145-149, 2009.
[2] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. POPL '83.
[3] American National Standards Institute. American National Standard Programming Language C, ANSI X3.159-1989, 1989.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP), July 1995.
[5] W.-Y. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick. A performance analysis of the Berkeley UPC compiler. In Proc. of 17th Annual Intl. Conf. on Supercomputing, pages 63-73, 2003.
[6] S. Collange, D. Defour, and Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proc. of the 2009 Intl. Conf. on Parallel Processing, Euro-Par '09.
[7] F. Darema, D. George, V. Norton, and G. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7(1), 1988.
[8] M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, Sept. 1972.
[10] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng. Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors. In Proc. PACT '10.
[11] P. Hanrahan and J. Lawson. A language for shading and lighting calculations. SIGGRAPH Comput. Graph., 24:289-298, September 1990.
[12] R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. Passing a language through the eye of a needle. Queue, 9(5).
[15] Intel. Intel Advanced Vector Extensions programming reference. June 2011.
[16] Intel. Intel Cilk Plus Language Extension Specification, Version 1.1, 2011. Online document.
[17] R. Karrenberg and S. Hack. Whole function vectorization. In Proc. of CGO '11.
[18] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31:7-17, Sept.-Oct. 2011.
[19] Khronos OpenCL Working Group. The OpenCL Specification, Sept. 2010.
[20] C. Kim, N. Satish, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey. Closing the ninja performance gap through traditional programming and compiler technology. Technical report, Intel Corporation, Dec. 2011.
[21] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. of CGO '04, Mar. 2004.
[22] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In Proc. ISCA 2010.
[23] R. Leißa, S. Hack, and I. Wald. Extending a C-like language for portable SIMD programming. In Proc. PPoPP '12, Feb. 2012.
[24] A. Levinthal, P. Hanrahan, M. Paquette, and J. Lawson. Parallel computers for graphics applications. SIGPLAN Not., October 1987.
[25] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, Mar.-Apr. 2008.
[26] A. Lokhmotov, B. Gaster, A. Mycroft, N. Hickey, and D. Stuttard. Revisiting SIMD programming. In Languages and Compilers for Parallel Computing, pages 32-46, 2008.
[27] M. Mantor and M. Houston. AMD Graphics Core Next: Low power high performance graphics and parallel compute. Hot3D, High Performance Graphics Conf., 2011.
[28] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph., July 2003.
[29] MasPar Computer Corporation. MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 3.0, July 1992.
[30] Microsoft Corporation. MSDN Library: Overview of C++ Accelerated Massive Parallelism (C++ AMP), 2011. Online preview documentation, visited Dec. 2011.
[31] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6:40-53, March 2008.
[32] S. G. Parker, S. Boulos, J. Bigler, and A. Robison. RTSL: a ray tracing shading language. In Proc. of the 2007 IEEE Symp. on Interactive Ray Tracing, 2007.
[33] J. Rose and J. G. Steele. C*: An extended C language for data parallel programming. In Proc. of the Second Intl. Conf. on Supercomputing, May 1987.
[34] N. Rotem. Intel OpenCL SDK vectorizer. LLVM Developer Conf. presentation, Nov. 2011.
[35] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., August 2008.
[36] J. A. Stratton, S. S. Stone, and W.-M. W. Hwu. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. 21st Int'l Workshop on Languages and Compilers for Parallel Computing, 2008.
[37] M. Wolfe, C. Shanklin, and L. Ortega. High-Performance Compilers for Parallel Computing. Addison Wesley, 1995.
Here and throughout this paper, we use "C++ code" or "application code" to indicate the rest of the software system that ispc is being used with. This could include, for example, Fortran or Python code that called ispc code.
Program instances thus correspond to threads in CUDA and work items in OpenCL. A gang roughly corresponds to a CUDA warp.
Running gangs wider than the SIMD width can give performance benefits from amortizing shared computation (such as scalar control flow overhead) over more program instances, better cache reuse across the program instances, and from more instruction-level parallelism being available. The costs are greater register pressure and potentially more control flow divergence across the program instances.
This guarantee is not provided across gangs in different threads; in that case, explicit synchronization must be used.
This limitation will be removed in future hardware (the Haswell New Instructions provide gather [15] and MIC provides both gather and scatter [35]).
Unstructured control flow (i.e. goto statements) is more difficult to support efficiently, though ispc does support goto in cases where it can be statically determined that all program instances will execute the goto.
We have also tested with various versions of gcc with essentially the same results.