This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18). 本文收录于第 13 届 USENIX 操作系统设计与实现研讨会 (OSDI '18) 论文集。
October 8-10, Carlsbad, CA, USA 10 月 8-10 日, 美国加利福尼亚州卡尔斯巴德
ISBN 978-1-939133-08-3
The benefits and costs of writing a POSIX kernel in a high-level language 用高级语言编写 POSIX 内核的好处和代价
Cody Cutler, M. Frans Kaashoek, Robert T. Morris 科迪-卡特勒、M. 弗兰斯-卡舒克、罗伯特-T. 莫里斯MIT CSAIL 麻省理工学院 CSAIL
Abstract 摘要
This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits. The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge. On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to . The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was to slower. 本文对使用带有垃圾回收功能的高级语言(HLL)来实现单片 POSIX 风格内核进行了评估。目的是通过考察性能成本、实施挑战以及可编程性和安全性优势,探讨使用 HLL 而不是 C 语言来实现此类内核是否合理。本文介绍了用 Go 编写的内核 Biscuit,它实现了足够多的 POSIX(虚拟内存、mmap、TCP/IP 套接字、日志文件系统、轮询等),可以执行重要的应用程序。Biscuit 充分利用了 Go 的 HLL 功能(闭包、通道、映射、接口、垃圾回收堆分配),这在主观上简化了编程。最具挑战性的难题是处理内核堆内存耗尽的可能性;Biscuit 利用 Go 源代码的可分析性解决了这一难题。在一组内核密集型基准测试(包括 NGINX 和 Redis)中,Biscuit 用于 HLL 功能(主要是垃圾回收和线程栈扩展检查)的内核 CPU 时间比例高达 。NGINX 遭受的与 GC 相关的最长单次暂停时间为 115 微秒;观察到的 NGINX 客户端完整请求的最长 GC 延迟总和为 600 微秒。在比较用 Go 和 C 语言编写的几乎完全相同的系统调用、页面故障和上下文切换代码路径的实验中,Go 版本的速度从 到 不等。
1 Introduction 1 引言
The default language for operating system kernels is C : Linux, macOS, and Windows all use C. C is popular for kernels because it can deliver high performance via flexible low-level access to memory and control over memory management (allocation and freeing). C, however, requires care and experience to use safely, and even then low-level bugs are common. For example, in 2017 at least 50 Linux kernel security vulnerabilities were reported that involved buffer overflow or use-after-free bugs in C code [34]. 操作系统内核的默认语言是 C 语言:Linux、macOS 和 Windows 都使用 C 语言。C 语言在内核中很受欢迎,因为它可以通过灵活的底层内存访问和内存管理(分配和释放)控制提供高性能。不过,C 语言需要谨慎和经验才能安全使用,即便如此,低级错误也屡见不鲜。例如,2017 年报告的至少 50 个 Linux 内核安全漏洞涉及 C 代码中的缓冲区溢出或使用后释放 bug [34]。
High-level languages (HLLs) provide type- and memory-safety and convenient abstractions such as threads. Many HLLs provide garbage collection to further reduce programmer burden and memory bugs. It is well-known that HLLs can be used in kernels: multiple kernels have been written in HLLs, often as platforms to explore innovative ideas ( ). On the other hand, leading OS designers have been skeptical that HLLs' memory management and abstractions are compatible with highperformance production kernels [51][47, p. 71]. 高级语言(HLL)提供了类型安全和内存安全,以及线程等方便的抽象。许多 HLL 还提供垃圾回收功能,以进一步减轻程序员的负担,减少内存错误。众所周知,内核中可以使用 HLL:多个内核都是用 HLL 编写的,通常作为探索创新思想的平台( )。另一方面,操作系统的主要设计者一直怀疑 HLL 的内存管理和抽象是否与高性能的生产内核兼容[51][47, p. 71]。
While it would probably not make sense to re-write an existing C kernel in an HLL, it is worth considering what languages new kernel projects should use. Since kernels impose different constraints and requirements than typical applications, it makes sense to explore this question in the context of a kernel. 虽然用 HLL 重写现有的 C 内核可能没有意义,但新的内核项目应该使用什么语言却值得考虑。由于内核与典型的应用程序有着不同的限制和要求,因此在内核的背景下探讨这个问题是有意义的。
We built a new kernel, Biscuit, written in Go [15] for x86-64 hardware. Go is a type-safe language with garbage collection. Biscuit runs significant existing applications such as NGINX and Redis without source modification by exposing a POSIX-subset system call interface. Supported features include multi-core, kernel-supported user threads, futexes, IPC, mmap, copy-on-write fork, vnode and name caches, a logging file system, and TCP/IP sockets. Biscuit implements two significant device drivers in Go: one for AHCI SATA disk controllers and one for Intel 82599-based Ethernet controllers. Biscuit has nearly 28 thousand lines of Go, 1546 lines of assembler, and no C. We report lessons learned about use of Go in Biscuit, including ways in which the language helped development, and situations in which it was less helpful. 我们为 x86-64 硬件构建了一个用 Go [15] 编写的新内核 Biscuit。Go 是一种带有垃圾回收功能的类型安全语言。Biscuit 通过公开 POSIX 子集系统调用接口,无需修改源代码即可运行 NGINX 和 Redis 等重要的现有应用程序。支持的功能包括多核、内核支持的用户线程、futexes、IPC、mmap、写时复制 fork、vnode 和名称缓存、日志文件系统和 TCP/IP 套接字。Biscuit 用 Go 实现了两个重要的设备驱动程序:一个用于 AHCI SATA 磁盘控制器,另一个用于基于 Intel 82599 的以太网控制器。Biscuit 有将近 2.8 万行 Go 语言,1546 行汇编语言,没有 C 语言。我们报告了在 Biscuit 中使用 Go 语言的经验教训,包括该语言对开发有帮助的方式,以及帮助不大的情况。
In most ways the design of Biscuit is that of a traditional monolithic POSIX/Unix kernel, and Go was a comfortable language for that approach. In one respect the design of Biscuit is novel: its mechanism for coping with kernel heap exhaustion. We use static analysis of the Biscuit source to determine how much heap memory each system call (and other kernel activity) might need, and each system call waits (if needed) when it starts until it can reserve that much heap. Once a system call is allowed to continue, its allocations are guaranteed to succeed without blocking. This obviates the need for complex allocation failure recovery or deadlock-prone waiting for free memory in the allocator. The use of an HLL that is conducive to static analysis made this approach possible. 在大多数方面,Biscuit 的设计都是传统的 POSIX/Unix 内核,而 Go 是一种适用于这种方法的语言。Biscuit 的设计有一点很新颖,那就是它应对内核堆耗尽的机制。我们使用 Biscuit 代码的静态分析来确定每个系统调用(和其他内核活动)可能需要多少堆内存,每个系统调用在启动时都会等待(如果需要),直到能预留出足够的堆内存。一旦系统调用被允许继续,就能保证其分配成功而不会阻塞。这样就不需要进行复杂的分配失败恢复,也不需要在分配器中等待空闲内存,以免造成死锁。使用有利于静态分析的 HLL 使这种方法成为可能。
We run several kernel-intensive applications on Biscuit and measure the effects of Go's type safety and garbage collection on kernel performance. For our benchmarks, GC costs up to 3% of CPU. For NGINX, the longest single 我们在 Biscuit 上运行了几个内核密集型应用程序,并测量了 Go 的类型安全和垃圾回收对内核性能的影响。在我们的基准测试中,垃圾回收最多耗费 3% 的 CPU。对于 NGINX,最长的单次
GC-related pause was 115 microseconds, and the longest a single NGINX client request was delayed (by many individual pauses) was a total of 600 microseconds. Other identifiable HLL performance costs amount to about of CPU. 与 GC 相关的暂停时间为 115 微秒,而单个 NGINX 客户端请求的最长延迟时间(多次单独暂停)总计为 600 微秒。其他可识别的 HLL 性能成本约为 CPU。
To shed light on the specific question of C versus Go performance in the kernel, we modify Biscuit and a kernel to have nearly identical source-level code paths for two benchmarks that stress system calls, page faults, and context switches. The C versions are about and faster than the Go versions. 为了阐明内核中 C 语言与 Go 语言性能对比的具体问题,我们修改了 Biscuit 和 内核,使其在强调系统调用、页面故障和上下文切换的两个基准测试中具有几乎相同的源代码级代码路径。C 版本比 Go 版本快约 和 。
Finally, we compare the performance of Biscuit and Linux on our kernel-intensive application benchmarks, finding that Linux is up to faster than Biscuit. This result is not very illuminating about choice of language, since performance is also affected by differences in the features, design and implementation of Biscuit and Linux. However, the results do provide an idea of whether the absolute performance of Biscuit is in the same league as that of a kernel. 最后,我们比较了 Biscuit 和 Linux 在内核密集型应用程序基准测试中的性能,发现 Linux 比 Biscuit 快 。由于 Biscuit 和 Linux 在功能、设计和实现方面的差异也会影响性能,因此这一结果并不能很好地说明语言的选择。不过,该结果确实提供了一个概念,即 Biscuit 的绝对性能是否与 内核处于同一水平。
In summary, the main contributions of this paper are: (1) Biscuit, a kernel written in Go with good performance; (2) a novel scheme for coping with kernel heap exhaustion; (3) a discussion of qualitative ways in which use of an HLL in a kernel was and was not helpful; (4) measurements of the performance tax imposed by use of an HLL; and (5) a direct Go-vs-C performance comparison of equivalent code typical of that found in a kernel. 总之,本文的主要贡献是(1)用 Go 编写的性能良好的内核 Biscuit;(2)应对内核堆耗尽的新方案;(3)讨论在内核中使用 HLL 是否有帮助的定性方式;(4)测量使用 HLL 带来的性能税;(5)直接比较内核中典型的等价代码的 Go-vs-C 性能。
This paper does not draw any top-level conclusion about versus an HLL as a kernel implementation language. Instead, it presents experience and measurements that may be helpful for others making this decision, who have specific goals and requirements with respect to programmability, safety and performance. Section 9 summarizes the key factors in this decision. 本文并未就 与 HLL 作为内核实现语言的对比得出任何最高级别的结论。相反,本文介绍的经验和测量结果可能会对做出这一决定的其他人有所帮助,因为他们在可编程性、安全性和性能方面有特定的目标和要求。第 9 节总结了做出这一决定的关键因素。
2 Related work 2 相关工作
Biscuit builds on multiple areas of previous work: highlevel languages in operating systems, high-level systems programming languages, and memory allocation in the kernel. As far as we know the question of the impact of language choice on kernel performance, all else being equal, has not been explored. Biscuit 建立在以前工作的多个领域之上:操作系统中的高级语言、高级系统编程语言和内核中的内存分配。据我们所知,在其他条件相同的情况下,语言选择对内核性能的影响问题还没有被探讨过。
Kernels in high-level languages. The Pilot [44] kernel and the Lisp machine [17] are early examples of use of a high-level language (Mesa [14] and Lisp, respectively) in an operating system. Mesa lacked garbage-collection, but it was a high-priority requirement for its successor language Cedar [48]. The Lisp machine had a real-time garbage collector [5]. 高级语言内核。Pilot [44] 内核和 Lisp 机器 [17] 是在操作系统中使用高级语言(分别是 Mesa [14] 和 Lisp)的早期例子。Mesa 缺乏垃圾回收功能,但这是其后续语言 Cedar [48] 的优先要求。Lisp 机器有一个实时垃圾回收器[5]。
A number of research kernels are written in high-level languages (e.g., Taos [49], Spin [7], Singularity [23], J- kernel [19], and KaffeOS [3, 4], House [18], the Mirage unikernel [29], and Tock [27]). The main thrust of these projects was to explore new ideas in operating system architecture, often enabled by the use of a type-safe highlevel language. While performance was often a concern, usually the performance in question related to the new ideas, rather than to the choice of language. Singularity quantified the cost of hardware and software isolation [22], which is related to the use of a HLL, but didn't quantify the cost of safety features of a HLL language, as we do in . 许多研究内核都是用高级语言编写的(如 Taos [49]、Spin [7]、Singularity [23]、J- kernel [19]、KaffeOS [3,4]、House [18]、Mirage unikernel [29] 和 Tock [27])。这些项目的主旨是探索操作系统架构的新思路,通常通过使用类型安全的高级语言来实现。虽然性能往往是一个值得关注的问题,但有关性能的问题通常与新思路有关,而不是与语言的选择有关。奇点量化了硬件和软件隔离的成本[22],这与高级语言的使用有关,但并没有像我们在 中所做的那样,量化高级语言安全特性的成本。
High-level systems programming languages. A number of systems-oriented high-level programming languages with type safety and garbage collection seem suitable for kernels, including Go, Java, C#, and Cyclone [25] (and, less recently, Cedar [48] and Modula-3 [37]). Other systems HLLs are less compatible with existing kernel designs. For example, Erlang [2] is a "shared-nothing" language with immutable objects, which would likely result in a kernel design that is quite different from traditional C shared-memory kernels. 高级系统编程语言。一些具有类型安全和垃圾回收功能的面向系统的高级编程语言似乎适合内核,包括 Go、Java、C# 和 Cyclone [25](最近还有 Cedar [48] 和 Modula-3 [37])。其他系统的 HLL 与现有内核设计的兼容性较差。例如,Erlang[2]是一种具有不可变对象的 "无共享 "语言,其内核设计很可能与传统的C语言共享内存内核大相径庭。
Frampton et al. introduce a framework for language extensions to support low-level programming features in Java, applying it to a GC toolkit [13]. Biscuit's goal is efficiency for kernels without modifying Go. Kernels have additional challenges such as dealing with user/kernel space, page tables, interrupts, and system calls. Frampton 等人引入了一个语言扩展框架,以支持 Java 的底层编程功能,并将其应用于 GC 工具包[13]。Biscuit 的目标是在不修改 Go 的情况下提高内核的效率。内核还面临其他挑战,如处理用户/内核空间、页表、中断和系统调用等。
A number of new languages have recently emerged for systems programming: D [11], Nim(rod) [42], Go [15], and Rust [36]. There are a number of kernels in Rust [12, 26, 27, 28, 39, 50], but none were written with the goal of comparing with C as an implementation language. Gopher OS is a Go kernel with a similar goal as Biscuit, but the project is at an early stage of development [1]. Other Go kernels exists but they don't target the questions that Biscuit answers. For example, Clive [6] is a unikernel and doesn't run on the bare metal. The Ethos OS uses C for the kernel and Go for user-space programs, with a design focused on security [41]. gVisor is a user-space kernel, written in Go, that implements a substantial portion of the Linux system API to sandbox containers [16]. 最近出现了许多新的系统编程语言:D [11]、Nim(rod) [42]、Go [15] 和 Rust [36]。Rust 中有许多内核[12, 26, 27, 28, 39, 50],但没有一种内核是以与作为实现语言的 C 语言相比较为目标而编写的。Gopher OS 是一个 Go 内核,目标与 Biscuit 类似,但该项目还处于早期开发阶段[1]。其他 Go 内核也存在,但它们并不以 Biscuit 所回答的问题为目标。例如,Clive [6] 是一个单内核,不能在裸机上运行。Ethos OS 的内核使用 C 语言,用户空间程序使用 Go 语言,其设计重点在于安全性[41]。gVisor 是一个用 Go 语言编写的用户空间内核,它实现了 Linux 系统 API 的大部分,并将其用于沙箱容器[16]。
Memory allocation. There is no consensus about whether a systems programming language should have automatic garbage-collection. For example, Rust is partially motivated by the idea that garbage collection cannot be made efficient; instead, the Rust compiler analyzes the program to partially automate freeing of memory. This approach can make sharing data among multiple threads or closures awkward [26]. 内存分配。关于系统编程语言是否应该自动收集垃圾,目前还没有达成共识。例如,Rust 的部分动机是垃圾回收无法做到高效;相反,Rust 编译器会分析程序,部分自动释放内存。这种方法会使多个线程或闭包之间的数据共享变得困难[26]。
Concurrent garbage collectors [5, 24, 30] reduce pause times by collecting while the application runs. Go 1.10 has such a collector [21], which Biscuit uses. 并发垃圾收集器 [5, 24, 30] 通过在应用程序运行时收集垃圾来减少暂停时间。Go 1.10 就有这样一个垃圾收集器 [21],Biscuit 就使用了它。
Several papers have studied manual memory allocation versus automatic garbage collection [20, 52], focusing on heap headroom memory's effect in reducing garbage collection costs in user-level programs. Headroom is also important for Biscuit's performance ( and ). 有几篇论文研究了手动内存分配与自动垃圾回收的对比[20, 52],重点关注堆净空内存在降低用户级程序垃圾回收成本方面的作用。净空对 Biscuit 的性能也很重要( 和 )。
Rafkind et al. added garbage collection to parts of Linux through automatic translation of source [43]. The authors observe that the kernel environment made this task difficult and adapted a fraction of a uniprocessor Linux kernel to be compatible with garbage collection. Biscuit required a fresh start in a new language, but as a result required less programmer effort for GC compatibility and benefited from a concurrent and parallel collector. Rafkind 等人通过自动翻译 源代码,为 Linux 的某些部分添加了垃圾回收功能 [43]。作者注意到内核环境给这项任务带来了困难,因此对单核处理器 Linux 内核的一部分进行了调整,使其与垃圾回收兼容。Biscuit 需要从头开始使用一种新的语言,但因此减少了程序员为兼容 GC 所付出的努力,并受益于并发和并行的收集器。
Linux's slab allocators [8] are specifically tuned for use in the kernel; they segregate free objects by type to avoid re-initialization costs and fragmentation. A hypothesis in the design of Biscuit is that Go's single general-purpose allocator and garbage collector are suitable for a wide range of different kernel objects. Linux 的板块分配器 [8] 专门针对内核进行了调整;它们按类型隔离空闲对象,以避免重新初始化成本和碎片化。Biscuit 设计中的一个假设是,Go 的单一通用分配器和垃圾回收器适用于各种不同的内核对象。
Kernel heap exhaustion. All kernels have to cope with the possibility of running out of memory for the kernel heap. Linux optimistically lets system calls proceed up until the point where an allocation fails. In some cases code waits and re-tries the allocation a few times, to give an "out-of-memory" killer thread time to find and destroy an abusive process to free memory. However, the allocating thread cannot generally wait indefinitely: it may hold locks, so there is a risk of deadlock if the victim of the killer thread is itself waiting for a lock [9]. As a result Linux system calls must contain code to recover from allocation failures, undoing any changes made so far, perhaps unwinding through many function calls. This undo code has a history of bugs [10]. Worse, the final result will be an error return from a system call. Once the heap is exhausted, any system call that allocates will likely fail; few programs continue to operate correctly in the face of unexpected errors from system calls, so the end effect may be application-level failure even if the kernel code handles heap exhaustion correctly. 内核堆耗尽。所有内核都必须应对内核堆内存耗尽的可能性。Linux 会乐观地让系统调用继续进行,直到分配失败为止。在某些情况下,代码会等待并重试几次分配,以便让 "内存不足 "的杀手线程有时间找到并摧毁一个滥用内存的进程,从而释放内存。但是,分配线程一般不能无限期地等待:它可能持有锁,因此如果杀手线程的受害者本身也在等待锁,就会有死锁的风险[9]。因此,Linux 系统调用必须包含从分配失败中恢复的代码,撤消迄今为止所做的任何更改,也许要通过多次函数调用才能解除。这种撤销代码历来就有错误[10]。更糟的是,最终结果将是系统调用错误返回。一旦堆耗尽,任何分配的系统调用都可能失败;面对系统调用的意外错误,很少有程序能继续正常运行,因此即使内核代码正确处理了堆耗尽,最终的结果也可能是应用程序级的失败。
Biscuit's reservation approach yields simpler code than Linux's. Biscuit kernel heap allocations do not fail (much as with Linux's contentious "too small to fail" rule [9, 10]), eliminating a whole class of complex error recovery code. Instead, each Biscuit system call reserves kernel heap memory when it starts (waiting if necessary), using a static analysis system to decide how much to reserve. Further, Biscuit applications don't see system call failures when the heap is exhausted; instead, they see delays. Biscuit 的保留方法比 Linux 的代码更简单。Biscuit 内核堆分配不会失败(就像 Linux 有争议的 "小到不能失败 "规则一样 [9,10]),从而消除了一整类复杂的错误恢复代码。相反,每个 Biscuit 系统调用都会在启动时(必要时等待)保留内核堆内存,并使用静态分析系统来决定保留多少内存。此外,当堆耗尽时,Biscuit 应用程序不会看到系统调用失败;相反,它们会看到延迟。
3 Motivation 3 动机
This section outlines our view of the main considerations in the choice between C and an HLL for the kernel. 本节概述了我们对内核选择 C 语言还是 HLL 的主要考虑因素的看法。
3.1 Why C? 3.1 为什么是 C?
A major reason for C's popularity in kernels is that it supports low-level techniques that can help performance, particularly pointer arithmetic, easy escape from type enforcement, explicit memory allocation, and custom allocators [51][47, p. 71]. There are other reasons too (e.g. C can manipulate hardware registers and doesn't depend on a complex runtime), but performance seems most important. C 语言在内核中流行的一个主要原因是它支持有助于提高性能的底层技术,特别是指针运算、轻松摆脱类型强制、显式内存分配和自定义分配器[51][47, p. 71]。还有其他一些原因(例如,C 语言可以操作硬件寄存器,并且不依赖于复杂的运行时),但性能似乎是最重要的。
3.2 Why an HLL? 3.2 为什么要使用 HLL?
The potential benefits of high-level languages are well understood. Automatic memory management reduces programmer effort and use-after-free bugs; type-safety detects bugs; runtime typing and method dispatch help with abstraction; and language support for threads and synchronization eases concurrent programming. 高级语言的潜在优势众所周知。自动内存管理可减少程序员的工作量和使用后的错误;类型安全可检测错误;运行时类型和方法调度有助于抽象;语言对线程和同步的支持可简化并发编程。
Certain kinds of bugs seem much less likely in an HLL than in C: buffer overruns, use-after-free bugs [40], and bugs caused by reliance on C's relaxed type enforcement. Even C code written with care by expert programmers has C-related bugs [40]. The CVE database for the Linux kernel [34] lists 40 execute-code vulnerabilities for 2017 which would be wholly or partially ameliorated by use of an HLL (see ). 与 C 语言相比,某些类型的错误在 HLL 中出现的可能性要小得多:缓冲区超限、use-after-free 错误[40],以及由于依赖 C 语言宽松的类型执行而导致的错误。即使是由专业程序员精心编写的 C 代码,也会出现与 C 相关的错误 [40]。Linux 内核的 CVE 数据库 [34] 列出了 2017 年的 40 个执行代码漏洞,使用 HLL 可以全部或部分改善这些漏洞(参见 )。
Use-after-free bugs are notoriously difficult to debug, yet occur often enough that the Linux kernel includes a memory checker that detects some use-after-free and buffer overrun bugs at runtime [46]. Nevertheless, Linux developers routinely discover and fix use-after-free bugs: Linux has at least 36 commits from January to April of 2018 for the specific purpose of fixing use-after-free bugs. 免用程序错误是众所周知的难以调试的问题,但却经常出现,以至于 Linux 内核包含了一个内存检查器,可以在运行时检测到一些免用程序和缓冲区超限错误[46]。尽管如此,Linux 开发人员还是经常发现并修复 "空闲后使用 "错误:从 2018 年 1 月到 4 月,Linux 至少有 36 次提交专门用于修复使用后即免错误。
Another area of kernel programming that would benefit from HLLs is concurrency. Transient worker threads can be cumbersome in C because the code must decide when the last thread has stopped using any shared objects that need to be freed; this is easier in a garbage collected language. 内核编程的另一个受益于 HLL 的领域是并发性。在 C 语言中,瞬时工作线程可能会很麻烦,因为代码必须决定最后一个线程何时停止使用任何需要释放的共享对象;而在垃圾回收语言中,这一点会更容易。
However, use of a garbage-collected HLL is not free. The garbage collector and safety checks consume CPU time and can cause delays; the expense of high-level features may deter their use; the language's runtime layer hides important mechanisms such as memory allocation; and enforced abstraction and safety may reduce developers' implementation options. 然而,使用垃圾回收的 HLL 并不是免费的。垃圾收集器和安全检查会消耗 CPU 时间,并可能导致延迟;高级功能的费用可能会阻碍它们的使用;语言的运行时层隐藏了内存分配等重要机制;强制抽象和安全可能会减少开发人员的实施选择。
4 Overview 4 概述
Biscuit's main purpose is to help evaluate the practicality of writing a kernel in a high-level language. Its design is similar to common practice in monolithic UNIX-like kernels, to facilitate comparison. Biscuit runs on 64-bit x86 hardware and is written in Go. It uses a modified version of the Go 1.10 runtime implementation; the runtime is Biscuit 的主要目的是帮助评估用高级语言编写内核的实用性。为了便于比较,它的设计与单片式 UNIX 内核的常见做法类似。Biscuit 在 64 位 x86 硬件上运行,由 Go 语言编写。它使用 Go 1.10 运行时实现的一个修改版本;运行时为
Figure 1: Biscuit's overall structure. 图 1:Biscuit 的整体结构。
written in Go with some assembly. Biscuit adds more assembly to handle boot and entry/exit for system calls and interrupts. There is no C. This section briefly describes Biscuit's components, focusing on areas in which use of Go affected the design and implementation. 是用 Go 语言和一些汇编语言编写的。Biscuit 增加了更多的汇编程序,以处理启动以及系统调用和中断的进入/退出。本节将简要介绍 Biscuit 的组件,重点是 Go 的使用对设计和实现产生影响的领域。
Boot and Go Runtime. The boot block loads Biscuit, the Go runtime, and a "shim" layer (as shown in Figure 1). The Go runtime, which we use mostly unmodified, expects to be able to call an underlying kernel for certain services, particularly memory allocation and control of execution contexts (cores, or in Go terminology, threads). The shim layer provides these functions, since there is no underlying kernel. Most of the shim layer's activity occurs during initialization, for example to pre-allocate memory for the Go kernel heap. 引导和 Go 运行时。引导块加载 Biscuit、Go 运行时和 "垫片 "层(如图 1 所示)。我们使用的 Go 运行时大多未经修改,它希望能够调用底层内核来提供某些服务,特别是内存分配和执行上下文(内核,或 Go 术语中的线程)控制。由于不存在底层内核,因此由 shim 层提供这些功能。shim 层的大部分活动都发生在初始化过程中,例如为 Go 内核堆预先分配内存。
Processes and Kernel Goroutines. Biscuit provides user processes with a POSIX interface: fork, exec, and so on, including kernel-supported threads and futexes. A user process has one address space and one or more threads. Biscuit uses hardware page protection to isolate user processes. A user program can be written in any language; we have implemented them only in C and (not Go). Biscuit maintains a kernel goroutine corresponding to each user thread; that goroutine executes system calls and handlers for page faults and exceptions for the user thread. "goroutine" is Go's name for a thread, and in this paper refers only to threads running inside the kernel. 进程和内核线程Biscuit 为用户进程提供 POSIX 接口:fork、exec 等,包括内核支持的线程和 futexes。一个用户进程有一个地址空间和一个或多个线程。Biscuit 使用硬件页面保护来隔离用户进程。用户程序可以用任何语言编写;我们仅用 C 和 实现了它们(不是 Go)。Biscuit 为每个用户线程维护一个对应的内核 goroutine;该 goroutine 为用户线程执行系统调用以及页面故障和异常处理程序。"goroutine "是 Go 对线程的称呼,本文仅指内核中运行的线程。
Biscuit's runtime schedules the kernel goroutines of user processes, each executing its own user thread in usermode when necessary. Biscuit uses timer interrupts to switch pre-emptively away from user threads. It relies on pre-emption checks generated by the Go compiler to switch among kernel goroutines. Biscuit 的运行时会调度用户进程的内核程序,必要时每个用户进程都会在用户模式下执行自己的用户线程。Biscuit 使用定时器中断来优先切换用户线程。它依靠 Go 编译器生成的抢占式检查来切换内核 goroutines。
Interrupts. A Biscuit device interrupt handler marks an associated device-driver goroutine as runnable and then returns, as previous kernels have done [35, 45]. Interrupt handlers cannot do much more without risk of deadlock, because the Go runtime does not turn off interrupts during sensitive operations such as goroutine context switch. 中断。Biscuit 设备中断处理程序会将相关的设备驱动程序 goroutine 标记为可运行,然后返回,就像以前的内核所做的那样 [35, 45]。中断处理程序不能做更多的事情,否则会有死锁的风险,因为 Go 运行时不会在goroutine 上下文切换等敏感操作期间关闭中断。
Handlers for system calls and faults from user space can execute any Go code. Biscuit executes this code in the context of the goroutine that is associated with the current user thread. 来自用户空间的系统调用和故障处理程序可以执行任何 Go 代码。Biscuit 会在与当前用户线程相关的 goroutine 上下文中执行这些代码。
Multi-Core and Synchronization. Biscuit runs in parallel on multi-core hardware. It guards its data structures using Go's mutexes, and synchronizes using Go's channels and condition variables. The locking is fine-grained enough that system calls from threads on different cores can execute in parallel in many common situations, for example when operating on different files, pipes, sockets, or when forking or execing in different processes. Biscuit uses read-lock-free lookups in some performance-critical code (see below). 多核与同步。Biscuit 可在多核硬件上并行运行。它使用 Go 的互斥来保护其数据结构,并使用 Go 的通道和条件变量进行同步。锁定的精细程度足以让不同内核线程的系统调用在许多常见情况下并行执行,例如在不同文件、管道、套接字上运行时,或在不同进程中分叉或执行时。Biscuit 在某些性能关键代码中使用免读锁定查找(见下文)。
Virtual Memory. Biscuit uses page-table hardware to implement zero-fill-on-demand memory allocation, copyon-write fork, and lazy mapping of files (e.g., for exec) in which the PTEs are populated only when the process page-faults, and mmap. 虚拟内存。Biscuit 使用页表硬件来实现按需零填充内存分配、写入即复制分叉和文件的懒映射(如用于执行),其中 PTE 仅在进程发生页故障和毫米映射时才被填充。
Biscuit records contiguous memory mappings compactly, so in the common case large numbers of mapping objects aren't needed. Physical pages can have multiple references; Biscuit tracks these using reference counts. Biscuit 会紧凑地记录连续的内存映射,因此在一般情况下不需要大量的映射对象。物理页面可以有多个引用;Biscuit 使用引用计数跟踪这些引用。
File System. Biscuit implements a file system supporting the core POSIX file system calls. The file system has a file name lookup cache, a vnode cache, and a block cache. Biscuit guards each vnode with a mutex and resolves pathnames by first attempting each lookup in a read-lockfree directory cache before falling back to locking each directory named in the path, one after the other. Biscuit runs each file system call as a transaction and has a journal to commit updates to disk atomically. The journal batches transactions through deferred group commit, and allows file content writes to bypass the journal. Biscuit has an AHCI disk driver that uses DMA, command coalescing, native command queuing, and MSI interrupts. 文件系统Biscuit 实现了一个支持核心 POSIX 文件系统调用的文件系统。文件系统有一个文件名查找缓存、一个 vnode 缓存和一个块缓存。Biscuit 通过一个互斥来保护每个 vnode,并通过以下方式解析路径名:首先在一个无读取锁定的目录缓存中尝试每次查找,然后再逐个锁定路径中命名的每个目录。Biscuit 将每个文件系统调用作为一个事务运行,并通过日志将更新以原子方式提交到磁盘。日志通过延迟组提交对事务进行批处理,并允许绕过日志写入文件内容。Biscuit 有一个 AHCI 磁盘驱动程序,使用 DMA、命令聚合、本地命令队列和 MSI 中断。
Network Stack. Biscuit implements a TCP/IP stack and a driver for Intel PCI-Express Ethernet NICs in Go. The driver uses DMA and MSI interrupts. The system-call API provides POSIX sockets. 网络协议栈Biscuit 在 Go 中为英特尔 PCI-Express 以太网网卡实现了 TCP/IP 协议栈和驱动程序。驱动程序使用 DMA 和 MSI 中断。系统调用 API 提供 POSIX 套接字。
Limitations. Although Biscuit can run many Linux C programs without source modification, it is a research prototype and lacks many features. Biscuit does not support scheduling priority because it relies on the Go runtime scheduler. Biscuit is optimized for a small number of cores, but not for large multicore machines or NUMA. Biscuit does not swap or page out to disk, and does not implement the reverse mapping that would be required to steal mapped pages. Biscuit lacks many security features like users, access control lists, or address space randomization. 局限性。虽然 Biscuit 可以在不修改源代码的情况下运行许多 Linux C 程序,但它只是一个研究原型,缺乏许多功能。Biscuit 不支持调度优先级,因为它依赖于 Go 运行时调度器。Biscuit 针对少量内核进行了优化,但不适合大型多核机器或 NUMA。Biscuit 不会交换或分页到磁盘,也没有实现窃取映射页所需的反向映射。Biscuit 缺乏许多安全功能,如用户、访问控制列表或地址空间随机化。
5 Garbage collection 5 垃圾收集
Biscuit's use of garbage collection is a clear threat to its performance. This section outlines the Go collector's design and describes how Biscuit configures the collector; evaluates performance costs. Biscuit 对垃圾回收的使用显然会对其性能造成威胁。本节概述了 Go 收集器的设计,并介绍了 Biscuit 如何配置收集器; 评估了性能成本。
5.1 Go's collector 5.1 围棋的收集器
Go 1.10 has a concurrent parallel mark-and-sweep garbage collector [21]. The concurrent aspect is critical for Biscuit, since it minimizes the collector's "stop-theworld" pauses. Go 1.10 有一个并发并扫垃圾收集器 [21]。并发方面对 Biscuit 至关重要,因为它能最大限度地减少收集器的 "停止-世界 "停顿。
When the Go collector is idle, the runtime allocates from the free lists built by the last collection. When the free space falls below a threshold, the runtime enables concurrent collection. When collection is enabled, the work of following ("tracing") pointers to find and mark reachable ("live") objects is interleaved with execution: each allocator call does a small amount of tracing and marking. Writes to already-traced objects are detected with compiler-generated "write barriers" so that any newly installed pointers will be traced. Once all pointers have been traced, the collector turns off write barriers and resumes ordinary execution. The collector suspends ordinary execution on all cores (a "stop-the-world" pause) twice during a collection: at the beginning to enable the write barrier on all cores and at the end to check that all objects have been marked. These stop-the-world pauses typically last dozens of microseconds. The collector rebuilds the free lists from the unmarked parts of memory ("sweeps"), again interleaved with Biscuit execution, and then becomes idle when all free heap memory has been swept. The collector does not move objects, so it does not reduce fragmentation. 当 Go 收集器空闲时,运行时会从上次收集建立的空闲列表中进行分配。当空闲空间低于阈值时,运行时会启用并发收集。启用收集后,跟踪("tracing")指针以查找和标记可到达("live")对象的工作将与执行交错进行:每次分配器调用都会进行少量的跟踪和标记。通过编译器生成的 "写入障碍 "来检测对已跟踪对象的写入,这样任何新安装的指针都会被跟踪。一旦跟踪完所有指针,收集器就会关闭写入障碍并恢复正常执行。在收集过程中,收集器会暂停所有内核上的普通执行("stop-the-world "暂停)两次:开始时在所有内核上启用写入障碍,结束时检查所有对象是否已被标记。这些 "世界停止 "暂停通常持续数十微秒。收集器从内存中未标记的部分重建空闲列表("清扫"),同样与 Biscuit 执行交错进行,然后在清扫完所有空闲堆内存后进入空闲状态。收集器不会移动对象,因此不会减少碎片。
The fraction of CPU time spent collecting is roughly proportional to the number of live objects, and inversely proportional to the interval between collections [20, 52]. This interval can be made large by devoting enough RAM to the heap that a substantial amount of space ("headroom") is freed by each collection. 用于收集的 CPU 时间比例与实时对象的数量大致成正比,与收集间隔成反比 [20, 52]。通过为堆分配足够的 RAM,每次收集都会释放出大量空间("净空"),从而拉大收集间隔。
The Go collector does most of its work during calls to the heap allocator, spreading out this work roughly evenly among calls. Thus goroutines see delays proportional to the amount that they allocate; presents measurements of these delays for Biscuit. Go 收集器的大部分工作都是在调用堆分配器时完成的,这些工作在调用中大致平均分配。因此,goroutines 的延迟与它们分配的数量成正比; 为 Biscuit 介绍了这些延迟的测量结果。
5.2 Biscuit's heap size 5.2 Biscuit 的堆大小
At boot time, Biscuit allocates a fixed amount of RAM for its Go heap, defaulting to nd of total RAM. Go's collector ordinarily expands the heap memory when live data exceeds half the existing heap memory; Biscuit disables this expansion. The next section (§6) explains how Biscuit copes with situations where the heap space is nearly filled with live data. 启动时,Biscuit 会为 Go 堆分配固定数量的 RAM,默认为总 RAM 的 nd。当实时数据超过现有堆内存的一半时,Go 的收集器通常会扩展堆内存;Biscuit 禁用了这种扩展。下一节(§6)将解释 Biscuit 如何处理堆空间几乎被实时数据填满的情况。
6 Avoiding heap exhaustion 6 避免堆耗尽
Biscuit must address the possibility of live kernel data completely filling the RAM allocated for the heap ("heap exhaustion"). This is a difficult area that existing kernels struggle with (§2). Biscuit 必须解决实时内核数据完全填满为堆分配的 RAM 的可能性("堆耗尽")。这是现有内核难以解决的问题(第 2 节)。
6.1 Approach: reservations 6.1 方法:保留
Biscuit is designed to tolerate heap exhaustion without kernel failure. In addition, it can take corrective action when there are identifiable "bad citizen" processes that allocate excessive kernel resources implemented with heap objects, such as the structures describing open files and pipes. Biscuit tries to identify bad citizens and kill them, in order to free kernel heap space and allow good citizens to make progress. Biscuit 在设计上可以容忍堆耗尽而不会导致内核故障。此外,如果有可识别的 "坏公民 "进程分配了过多的内核资源(如描述打开文件和管道的结构),Biscuit 还能采取纠正措施。Biscuit 会尝试识别 "坏公民 "并杀死它们,以释放内核堆空间,让 "好公民 "取得进展。
Biscuit's approach to kernel heap exhaustion has three elements. First, it purges caches and soft state as the heap nears exhaustion. Second, code at the start of each system call waits until it can reserve enough heap space to complete the call; the reservation ensures that the heap allocations made in the call will succeed once the wait (if any) is over. Third, a kernel "killer" thread watches for processes that are consuming lots of kernel heap when the heap is near exhaustion, and kills them. Biscuit 处理内核堆耗尽的方法有三个要素。首先,当堆接近耗尽时,它会清除缓存和软状态。其次,每次系统调用开始时,代码都会等待,直到它能预留足够的堆空间来完成调用;预留空间能确保调用中的堆分配在等待(如果有的话)结束后取得成功。第三,内核 "杀手 "线程会在堆空间接近耗尽时监视消耗大量内核堆空间的进程,并杀死它们。
This approach has some good properties. Applications do not have to cope with system call failures due to kernel heap exhaustion. Kernel code does not see heap allocation failure (with a few exceptions), and need not include logic to recover from such failures midway through a system call. System calls may have to wait for reservations, but they wait at their entry points without locks held, so deadlock is avoided. 这种方法具有一些良好的特性。应用程序不必应对内核堆耗尽导致的系统调用失败。内核代码不会看到堆分配失败(少数例外),也不需要在系统调用中途包含从这种失败中恢复的逻辑。系统调用可能需要等待保留,但它们在入口点等待时不会加锁,因此避免了死锁。
The killer thread must distinguish between good and bad citizens, since killing a critical process (e.g., init) can make the system unusable. If there is no obvious "bad citizen," this approach may block and/or kill valuable processes. Lack of a way within POSIX for the kernel to gracefully revoke resources causes there to be no good solution in some out-of-memory situations. 杀手线程必须区分好公民和坏公民,因为杀死一个关键进程(如 init)会导致系统无法使用。如果没有明显的 "坏公民",这种方法可能会阻塞和/或杀死有价值的进程。由于 POSIX 内核缺乏从容撤销资源的方法,因此在某些内存不足的情况下没有好的解决方案。
The mechanisms in this section do not apply to nonheap allocations. In particular, Biscuit allocates physical memory pages from a separate allocator, not from the Go heap; page allocations can fail, and kernel code must check for failure and recover (typically by returning an error to a system call). 本节中的机制不适用于非堆分配。特别是,Biscuit 从单独的分配器而非 Go 堆分配物理内存页;页分配可能失败,内核代码必须检查失败并恢复(通常是向系统调用返回错误)。
6.2 How Biscuit reserves 6.2 饼干如何储备
Biscuit dedicates a fixed amount of RAM for the kernel heap. A system call only starts if it can reserve enough heap memory for the maximum amount of simultaneously live data that it uses, called . A system call may allocate more than from the heap, but the amount over must be dead and can be freed by the collector. This means that, even in the extreme case in which all but of the Biscuit 为内核堆分配了固定数量的 RAM 。系统调用只有在为其使用的最大同时在线数据量(称为 )预留足够的堆内存时才会启动。系统调用可以从堆中分配超过 的内存,但超过 的内存必须是死内存,并且可以被收集器释放。这意味着,即使在极端情况下,除了 之外的所有
reserve(s):
g := last GC live bytes
c := used bytes
n := reserved bytes
L := g + c + n
M := heap RAM bytes
if L + S < M:
reserved bytes += s
else:
wake killer thread
wait for OK from killer thread
release(s):
a := bytes allocated by syscall
if a < s:
used bytes += a
else:
used bytes += s
reserved bytes -= s
Figure 2: Pseudo code for heap reservations in Biscuit. 图 2:Biscuit 中的堆保留伪代码。
heap RAM is used by live data or is already reserved, the system call can execute, with collections as needed to recover the call's own dead data in excess of . 如果堆 RAM 已被实时数据使用或已被预留,则系统调用可以执行,并根据需要进行收集,以恢复调用自身超过 的死数据。
Ideally, a reservation should check that minus the amount of live and reserved data in the heap is greater than or equal to . However, except immediately after a collection, the amount of live heap data is not known. Biscuit maintains a conservative over-estimate of live heap data using three counters: , and is the amount of live data marked by the previous garbage collection. is the total amount of reservations made by system calls that have completed. is the total outstanding reservations of system calls that are executing but not finished. Let be the sum of , and . 理想情况下,保留应检查 减去堆中的实时和保留数据量是否大于或等于 。但是,除了在收集后立即进行的情况外,堆中的实时数据量是未知的。Biscuit 使用三个计数器来保守地高估堆中的实时数据: 和 是上一次垃圾回收标记的实时数据量。 是已完成的系统调用所做保留的总量。 是正在执行但尚未完成的系统调用的未完成预订总量。让 成为 和 的总和。
Figure 2 presents pseudo code for reserving and releasing the reservation of heap RAM in Biscuit. Before starting a system call, a thread checks that . If , the thread reserves by adding to , otherwise the thread wakes up the killer thread and sleeps. When finished, a system call calculates , the total amount actually allocated, and uses to (partially) release any over-reservation: if , the system call adds to and subtracts from . Otherwise, and the system call adds to and subtracts from . 图 2 展示了在 Biscuit 中预留和释放堆 RAM 的伪代码。在启动系统调用之前,线程会检查 。如果 ,则线程通过将 加入 来保留,否则线程会唤醒杀手线程并休眠。完成后,系统调用会计算 ,即实际分配的总量,并使用 来(部分)释放超额预留:如果 ,系统调用会将 加到 中,并从 中减去 。否则, 和系统调用将 加入 ,并从 中减去 。
The reason for separate and is to carry over reservations of system calls that span a garbage collection; a collection sets to zero but leaves unchanged. 之所以要将 和 分开,是为了将跨越垃圾回收的系统调用保留下来;垃圾回收会将 设置为零,但会保留 不变。
If heap memory is plentiful (live data ), the amount of live+dead data in the heap usually grows faster than , so collections are triggered by heap free list exhaustion rather than by . Thus system calls do not wait for memory, and do not trigger the killer thread. As live heap data increases, and gets close to , may reach before a collection would ordinarily be triggered. For this reason the killer thread performs a collection before deciding whether to kill processes. 如果堆内存充裕(活数据 ),堆中的活数据+死数据量的增长速度通常快于 ,因此集合是由堆空闲列表耗尽触发的,而不是由 触发的。因此,系统调用不会等待内存,也不会触发杀手线程。随着实时堆数据的增加,以及 接近 时, 可能会在通常触发收集之前达到 。因此,杀手线程在决定是否杀死进程之前会执行一次收集。
6.3 Static analysis to find 6.3 通过静态分析查找
We have developed a tool, MaxLive, that analyzes the Biscuit source code and the Go packages Biscuit uses to find for each system call. The core challenge is detecting statically when allocated memory can no longer be live, since many system calls allocate memory for transient uses. Other challenges include analyzing loops with non-constant bounds, and determining reservations for background kernel activities that are not associated with a specific system call. 我们开发了一款名为 MaxLive 的工具,它可以分析 Biscuit 源代码和 Biscuit 使用的 Go 包,为每个系统调用查找 。核心挑战是静态检测已分配的内存何时不能继续使用,因为许多系统调用分配内存都是为了瞬时使用。其他挑战包括分析具有非恒定边界的循环,以及确定与特定系统调用无关的后台内核活动的保留。
We address these challenges by exploiting the characteristic event-handler-style structure of most kernel code, which does a modest amount of work and then returns (or goes idle); system call implementations, for example, work this way. Furthermore, we are willing to change the kernel code to make it amenable to the reservation approach, for example to avoid recursion (we changed a few functions). Two modifications were required to standard Go packages that Biscuit uses (packages time and ). 我们利用大多数内核代码特有的事件处理程序式结构来应对这些挑战,这种结构只做少量工作,然后返回(或闲置);例如,系统调用的实现就是以这种方式工作的。此外,我们愿意修改内核代码,使其适合保留方法,例如避免递归(我们修改了几个函数)。需要对 Biscuit 使用的标准 Go 软件包(软件包 time 和 )进行两处修改。
6.3.1 Basic MAXLIVE operation 6.3.1 MAXLIVE 的基本操作
MAXLIVE examines the call graph (using Go's and callgraph packages) to detect all allocations a system call may perform. It uses escape and pointer analysis (Go's pointer package) to detect when an allocation does not "escape" above a certain point in the call graph, meaning that the allocation must be dead on return from that point. MAXLIVE 会检查调用图(使用 Go 的 和 callgraph 包),以检测系统调用可能执行的所有分配。它使用转义和指针分析(Go 的指针包)来检测分配是否 "转义 "到调用图中的某个点之上,这意味着从该点返回时,分配必须是死的。
MaxLiVE handles a few kinds of allocation specially: go, defer, maps, and slices. go (which creates a goroutine) is treated as an escaping allocation of the maximum kernel stack size (the new goroutine itself must reserve memory when it starts, much as if it were itself a system call). defer is a non-escaping allocation, but is not represented by an object in the SSA so MAXLIVE specifically considers it an allocation. Every insertion into a map or slice could double its allocated size; MAXLIVE generally doesn't know the old size, so it cannot predict how much memory would be allocated. To avoid this problem, we annotate the Biscuit source to declare the maximum size of slices and maps, which required 70 annotations. MaxLiVE特别处理了几种分配:go、defer、map和slices。go(创建一个goroutine)被视为内核堆栈最大大小的escaping分配(新的goroutine本身必须在启动时预留内存,就像它本身是一个系统调用一样)。 defer是一种非escaping分配,但在SSA中没有对象表示,所以MAXLIVE特别将其视为一种分配。每次插入映射或片段都可能使分配的大小翻倍;MAXLIVE通常不知道旧的大小,因此无法预测会分配多少内存。为了避免这个问题,我们对Biscuit源进行注释,声明切片和映射的最大大小,这需要70个注释。
6.3.2 Handling loops 6.3.2 处理循环
For loops where MaxLive cannot determine a useful bound on the number of iterations, we supply a bound with an annotation; there were 78 such loops. Biscuit contains about 20 loops whose bounds cannot easily be expressed with an annotation, or for which the worst case is too large to be useful. Examples include retries to handle wakeup races in poll, iterating over a directory's data blocks during a path component lookup, and iterating over the pages of a user buffer in write. 对于 MaxLive 无法确定迭代次数有用界限的循环,我们提供了一个带有注释的界限;这样的循环有 78 个。Biscuit 包含约 20 个循环,这些循环的边界无法通过注释轻松表达,或者最坏情况下的边界太大,无法使用。例如在轮询中处理唤醒竞赛的重试、在路径组件查找过程中对目录数据块的遍历,以及在写入过程中对用户缓冲区页面的遍历。
We handle such loops with deep reservations. Each loop iteration tries to reserve enough heap for just the one 我们通过深度保留来处理这种循环。每次循环迭代都会尝试为一个
iteration. If there is insufficient free heap, the loop aborts and waits for free memory at the beginning of the system call, retrying when memory is available. Two loops (in exec and rename) needed code to undo changes after an allocation failure; the others did not. 迭代。如果没有足够的空闲堆,循环就会中止,并在系统调用开始时等待空闲内存,当内存可用时再重试。有两个循环(执行循环和重命名循环)需要代码来撤销分配失败后的更改,其他循环则不需要。
Three system calls have particularly challenging loops: exit, fork, and exec. These calls can close many file descriptors, either directly or on error paths, and each close may end up updating the file system (e.g. on last close of a deleted file). The file system writes allocate memory, and may create entries in file system caches. Thus, for example, an exiting process that has many file descriptors may need a large amount of heap memory for the one exit system call. However, in fact exit's memory requirements are much smaller than this: the cache entries will be deleted if heap memory is tight, so only enough memory is required to execute a single close. We bound the memory use of close by using MaxLive to find all allocations that may be live once close returns. We then manually ensure that all such allocations are either dead once close returns or are evictable cache entries. That way exit, fork, and exec only need to reserve enough kernel heap for one call to close. This results in heap bounds of less than 500 kB for all system calls but rename and fork ( 1 MB and 641 kB , respectively). The close system call is the only one we manually analyze with the assistance of MAXLIVE. 有三个系统调用的循环特别具有挑战性:exit、fork 和 exec。这些调用会直接或通过错误路径关闭许多文件描述符,而且每次关闭都可能最终更新文件系统(例如最后一次关闭已删除的文件)。文件系统写入会分配内存,并可能在文件系统缓存中创建条目。因此,举例来说,一个有许多文件描述符的退出进程可能需要大量堆内存来执行一个 exit 系统调用。但事实上,exit 对内存的需求远小于此:如果堆内存紧张,缓存条目就会被删除,因此只需要足够的内存来执行一次 close。我们使用 MaxLive 来查找 close 返回后可能仍在运行的所有分配,从而约束 close 的内存使用。然后,我们手动确保所有此类分配要么在 close 返回时已死亡,要么是可驱逐的缓存条目。这样,exit、fork 和 exec 只需为一次关闭调用预留足够的内核堆。这样,除了 rename 和 fork(分别为 1 MB 和 641 kB)之外,所有系统调用的堆边界都小于 500 kB。关闭系统调用是我们在 MAXLIVE 协助下手动分析的唯一一个调用。
6.3.3 Kernel threads 6.3.3 内核线程
A final area of special treatment applies to long-running kernel threads. An example is the filesystem logging thread, which acts on behalf of many processes. Each long-running kernel thread has its own kernel heap reservation. Since exit must always be able to proceed when the killer thread kills a process, kernel threads upon which exit depends must never release their heap reservation. For example, exit may need to free the blocks of unlinked files when closing file descriptors and thus depends on the filesystem logging thread. Other kernel threads, like the ICMP packet processing thread, block and wait for heap reservations when needed and release them when idle. 最后一个需要特殊处理的领域是长期运行的内核线程。文件系统日志线程就是一个例子,它代表许多进程运行。每个长期运行的内核线程都有自己的内核堆预留。由于当杀手线程杀死一个进程时,exit 必须始终能够继续运行,因此 exit 所依赖的内核线程绝不能释放其堆保留。例如,exit 在关闭文件描述符时可能需要释放未链接文件的块,因此需要依赖文件系统日志记录线程。其他内核线程(如 ICMP 数据包处理线程)会在需要时阻塞和等待堆保留,并在空闲时释放它们。
6.3.4 Killer thread 6.3.4 杀手线程
The killer thread is woken up when a system call's reservation fails. The thread first starts a garbage collection and waits for it to complete. If the collection doesn't free enough memory, the killer thread asks each cache to free as many entries as possible, and collects again. If that doesn't yield enough free memory, the killer thread finds the process with the largest total number of mapped memory regions, file descriptors, and threads, in the assumption that it is a genuine bad citizen, kills it, and again collects. As soon as the killer thread sees that enough memory has been freed to satisfy the waiting reservation, it wakes up the waiting thread and goes back to sleep. 当系统调用的保留失败时,杀手线程就会被唤醒。该线程首先启动垃圾回收并等待完成。如果垃圾回收没有释放足够的内存,杀手线程会要求每个缓存释放尽可能多的条目,然后再次进行垃圾回收。如果还不能释放足够的内存,杀手线程就会找到映射内存区域、文件描述符和线程总数最多的进程,假定它是一个真正的坏公民,将其杀死,并再次进行收集。一旦 "杀手 "线程发现已释放的内存足以满足等待保留的需求,它就会唤醒等待线程并继续休眠。
6.4 Limitations 6.4 局限性
Biscuit's approach for handling heap exhaustion requires that the garbage collector run successfully when there is little or no free memory available. However, Go's garbage collector may need to allocate memory during a collection in order to make progress, particularly for the work stack of outstanding pointers to scan. We haven't implemented it, but Biscuit could recover from this situation by detecting when the work stack is full and falling back to using the mark bitmap as the work stack, scanning for objects which are marked but contain unmarked pointers. This strategy will allow the garbage collection to complete, but will likely be slow. We expect this situation to be rare since the work stack buffers can be preallocated for little cost: in our experiments, the garbage collector allocates at most of the heap RAM for work stacks. Biscuit 处理堆耗尽的方法要求垃圾收集器在可用内存很少或没有可用内存时成功运行。但是,Go 的垃圾收集器可能需要在收集过程中分配内存,以便取得进展,尤其是对工作栈中的未完成指针进行扫描。我们还没有实现它,但 Biscuit 可以通过检测工作栈是否已满来恢复这种情况,并退回到使用标记位图作为工作栈,扫描已标记但包含未标记指针的对象。这种策略将允许完成垃圾回收,但可能会比较慢。我们预计这种情况很少发生,因为工作堆栈缓冲区只需很少的成本即可预先分配:在我们的实验中,垃圾收集器最多分配 的堆 RAM 用于工作堆栈。
Because the Go collector doesn't move objects, it doesn't reduce fragmentation. Hence, there might be enough free memory but in fragments too small to satisfy a large allocation. To eliminate this risk, MaxLive should compute for each size class of objects allocated during a system call. Our current implementation doesn't do this yet. 由于 Go 收集器不会移动对象,因此不会减少碎片。因此,可能会有足够的空闲内存,但碎片太小,无法满足大量分配的需要。为了消除这种风险,MaxLive 应该为系统调用期间分配的对象的每个大小类别计算 。我们目前的实现还没有做到这一点。
6.5 Heap exhaustion summary 6.5 堆耗尽总结
Biscuit borrows ideas for heap exhaustion from Linux: the killer thread, and the idea of waiting and retrying after the killer thread has produced free memory. Biscuit simplifies the situation by using reservation checks at the start of each system call, rather than Linux's failure checks at each allocation point; this means that Biscuit has less recovery code to back out of partial system calls, and can wait indefinitely for memory without fear of deadlock. Go's static analyzability helped automate Biscuit's simpler approach. Biscuit 借鉴了 Linux 中有关堆耗尽的思想:杀手线程,以及在杀手线程产生空闲内存后等待和重试的思想。Biscuit 通过在每次系统调用开始时使用保留检查来简化这种情况,而不是像 Linux 那样在每个分配点都进行失败检查;这意味着 Biscuit 只需较少的恢复代码来退出部分系统调用,并且可以无限期地等待内存而不必担心死锁。Go 的静态分析能力帮助 Biscuit 实现了更简单方法的自动化。
7 Implementation 7 实施
The Biscuit kernel is written almost entirely in Go: Figure 3 shows that it has 27,583 lines of Go, 1,546 lines of assembly, and no C. Biscuit 内核几乎完全由 Go 语言编写:图 3 显示,它有 27,583 行 Go 语言,1,546 行汇编语言,没有 C 语言。
Biscuit provides 58 system calls, listed in Figure 4. It has enough POSIX compatibility to run some existing server programs (for example, NGINX and Redis). Biscuit 提供 58 个系统调用,如图 4 所示。它的 POSIX 兼容性足以运行一些现有的服务器程序(如 NGINX 和 Redis)。
Biscuit includes device drivers for AHCI SATA disk controllers and for Intel 82599-based Ethernet controllers such as the X540 10-gigabit NIC. Both drivers use DMA. The drivers use Go's unsafe. Pointer to access device registers and in-memory structures (such as DMA descriptors) defined by device hardware, and Go's atomic package to control the order of these accesses. The code Biscuit 包含用于 AHCI SATA 磁盘控制器和基于 Intel 82599 的以太网控制器(如 X540 万兆网卡)的设备驱动程序。这两种驱动程序都使用 DMA。这些驱动程序使用 Go 的不安全指针来访问设备寄存器和由设备硬件定义的内存结构(如 DMA 描述符),并使用 Go 的原子包来控制这些访问的顺序。代码
Component 组件
Lang 朗
LOC
Biscuit kernel (mostly boot) 饼干内核(主要是引导)
asm
546
Biscuit kernel 饼干核
Go 转到
Core 核心
1700
Device drivers 设备驱动程序
4088
File system 文件系统
7338
Network 网络
4519
Other 其他
1105
Processes 流程
935
Reservations 预订
749
Syscalls 系统调用
5292
Virtual memory 虚拟内存
1857
Total 总计
27583
MaxLive
Go 转到
1299
Runtime modifications 运行时修改
asm
1,000
Runtime modifications 运行时修改
Go 转到
3,200
Figure 3: Lines of code in Biscuit. Not shown are about 50,000 lines of Go runtime and 32,000 lines of standard Go packages that Biscuit uses. 图 3:Biscuit 中的代码行数。图中未显示 Biscuit 使用的约 50,000 行 Go 运行时代码和 32,000 行标准 Go 软件包代码。
would be more concise if Go supported some kind of memory fence. 如果 Go 支持某种内存栅栏,就会更简洁。
Biscuit contains 90 uses of Go's "unsafe" routines (excluding uses in the Go runtime). These unsafe accesses parse and format packets, convert between physical page numbers and pointers, read and write user memory, and access hardware registers. Biscuit 包含 Go 的 90 种 "不安全 "例程用法(不包括 Go 运行时中的用法)。这些不安全访问会解析和格式化数据包、在物理页码和指针之间进行转换、读写用户内存以及访问硬件寄存器。
We modified the Go runtime to record the number of bytes allocated by each goroutine (for heap reservations), to check for runnable device handler goroutines, and to increase the default stack size from 2 kB to 8 kB to avoid stack expansion for a few common system calls. 我们修改了 Go 运行时,以记录每个 goroutine 分配的字节数(用于堆保留),检查可运行的设备处理程序 goroutine,并将默认堆栈大小从 2 kB 增加到 8 kB,以避免一些常见系统调用的堆栈扩展。
Biscuit lives with some properties of the Go runtime and compiler in order to avoid significantly modifying them. The runtime does not turn interrupts off when holding locks or when manipulating a goroutine's own private state. Therefore, in order to avoid deadlock, Biscuit interrupt handlers just set a flag indicating that a device handler goroutine should wake up. Biscuit's timer interrupt handler cannot directly force goroutine context switches because the runtime might itself be in the middle of a context switch. Instead, Biscuit relies on Go's pre-emption mechanism for kernel goroutines (the Go compiler inserts pre-emption checks in the generated code). Timer interrupts do force context switches when they arrive from user space. Biscuit 保留了 Go 运行时和编译器的某些特性,以避免对它们进行重大修改。当持有锁或操作 goroutine 自身的私有状态时,运行时不会关闭中断。因此,为了避免死锁,Biscuit 中断处理程序只需设置一个标志,指示设备处理程序 goroutine 应该唤醒。Biscuit 的定时器中断处理程序不能直接强制 goroutine 进行上下文切换,因为运行时本身可能正在进行上下文切换。相反,Biscuit 依赖 Go 的内核 goroutine 抢占机制(Go 编译器会在生成的代码中插入抢占检查)。来自用户空间的定时器中断会强制进行上下文切换。
Goroutine scheduling decisions and the context switch implementation live in the runtime, not in Biscuit. One consequence is that Biscuit does not control scheduling policy; it inherits the runtime's policy. Another consequence is that per-process page tables are not switched when switching goroutines, so Biscuit system call code cannot safely dereference user addresses directly. Instead, Biscuit explicitly translates user virtual addresses to physical addresses, and also explicitly checks page permissions. Goroutine调度决策和上下文切换实现都在运行时中,而不是在 Biscuit 中。其后果之一是,Biscuit 无法控制调度策略,而是继承运行时的策略。另一个后果是,在切换程序时不会切换每个进程的页表,因此 Biscuit 系统调用代码无法安全地直接引用用户地址。相反,Biscuit 会将用户虚拟地址明确转换为物理地址,并明确检查页面权限。
Biscuit switches page tables if necessary before switching to user space. 在切换到用户空间之前,Biscuit 会根据需要切换页面表。
We modified the runtime in three ways to reduce delays due to garbage collection. First, we disabled the dedicated garbage collector worker threads so that application threads don't compete with garbage collector threads for CPU cycles. Second, we made root marking provide allocation credit so that an unlucky allocating thread wouldn't mark many roots all at once. Third, we reduced the size of the pieces that large objects are broken into for marking from 128 kB to 10 kB . 我们从三个方面修改了运行时,以减少垃圾收集导致的延迟。首先,我们禁用了专门的垃圾回收器工作线程,这样应用程序线程就不会与垃圾回收器线程争夺 CPU 周期。其次,我们让根标记提供分配积分,这样一个不走运的分配线程就不会一次标记很多根。第三,我们将用于标记的大对象碎片大小从 128 kB 减少到 10 kB。
Biscuit implements many standard kernel performance optimizations. For example, Biscuit maps the kernel text using large pages to reduce iTLB misses, uses per-CPU NIC transmit queues, and uses read-lock-free data structures in some performance critical code such as the directory cache and TCP polling. In general, we found that Go did not hinder optimizations. Biscuit 实现了许多标准的内核性能优化。例如,Biscuit 使用大页面映射内核文本以减少 iTLB 错失,使用每 CPU NIC 发送队列,并在目录缓存和 TCP 轮询等一些性能关键代码中使用免读锁定数据结构。总的来说,我们发现 Go 并没有妨碍优化。
8 Evaluation 8 评估
This section analyzes the costs and benefits of writing a kernel in an HLL by exploring the following questions: 本节将通过探讨以下问题来分析用 HLL 编写内核的成本和收益:
To what degree does Biscuit benefit from Go's highlevel language features? To answer, we count and explain Biscuit's use of these features ( ). Biscuit 在多大程度上受益于 Go 的高级语言特性?为了回答这个问题,我们将统计并解释 Biscuit 对这些特性的使用 ( )。
Do C kernels have safety bugs that a high-level language might mitigate? We evaluate whether bugs reported in Linux kernel CVEs would likely apply to Biscuit ( ). C 内核是否存在高级语言可能缓解的安全漏洞?我们评估了 Linux 内核 CVE 中报告的错误是否可能适用于 Biscuit ( )。
How much performance does Biscuit pay for Go’s HLL features? We measure the time Biscuit spends in garbage collection, bounds checking, etc., and the delays that GC introduces . Biscuit 为 Go 的 HLL 功能付出了多少性能代价?我们测量了 Biscuit 在垃圾回收、边界检查等方面花费的时间,以及 GC 带来的 延迟。
What is the performance impact of using Go instead of C? We compare nearly-identical pipe and page-fault handler implementations in Go and ( ). 使用 Go 而不是 C 会对性能产生什么影响?我们比较了 Go 和 ( ) 中几乎相同的管道和页面故障处理程序实现。
Is Biscuit's performance in the same ballpark as Linux, a C kernel (§8.8)? Biscuit 的性能是否与 C 内核 Linux 相同(第 8.8 节)?
Is Biscuit's reservation scheme effective at handling kernel heap exhaustion (§8.9)? Biscuit 的保留方案是否能有效处理内核堆耗尽(§8.9)?
Can Biscuit benefit from RCU-like lock-free lookups Biscuit 能否从类似 RCU 的无锁查找 中受益?
8.1 Biscuit's use of HLL features 8.1 Biscuit 对 HLL 功能的使用
Our subjective feeling is that Go has helped us produce clear code and helped reduce programming difficulty, primarily by abstracting and automating low-level tasks. 我们的主观感受是,主要通过抽象化和自动化底层任务,Go 帮助我们生成了清晰的代码,并降低了编程难度。
Figure 5 shows how often Biscuit uses Go's HLL features, and compares with two other major Go systems: 图 5 显示了 Biscuit 使用围棋 HLL 功能的频率,并与其他两个主要围棋系统进行了比较:
Figure 5: Uses of Go HLL features in the Git repositories for Biscuit, Golang ( lines), and Moby ( lines) per 1,000 lines. For data types (such as slices), the numbers indicate the number of declarations of a variable, argument, or structure field of that type. 图 5:Biscuit、Golang( 行)和 Moby( 行)的 Git 仓库中每千行 Go HLL 功能的使用情况。对于数据类型(如切片),数字表示该类型的变量、参数或结构字段的声明次数。
the Golang repository (containing Go's compiler, runtime, and standard packages), and Moby , which contains Docker's container software and is the most starred Go repository on Github at the time of writing. Figure 5 shows the number of times each feature is used per 1,000 lines of code. Biscuit uses Go's HLL features about as much as other Go systems software. Golang 代码库(包含 Go 的编译器、运行时和标准软件包),以及 Moby ,其中包含 Docker 的容器软件,也是本文撰写时 Github 上星级最高的 Go 代码库。图 5 显示了每千行代码中每种特性的使用次数。Biscuit 使用 Go 的 HLL 功能的次数与其他 Go 系统软件差不多。
To give a sense how these HLL features can benefit a kernel, the rest of this section provides examples of successful uses, as well as situations where we didn't use them. Biscuit relies on the Go allocator and garbage collector for nearly all kernel objects. Biscuit has 302 statements that allocate an object from the GC-managed heap. Some of the objects are compound (composed of multiple Go objects). For example, Biscuit's Vmregion_t, which describes a mapped region of virtual memory, has a red/black tree of Vminfo_t, which itself is compound (e.g., when it is backed by a file). The garbage collector eliminates the need for explicit code to free the parts of such compound data types. 为了让大家了解这些 HLL 功能对内核的益处,本节的其余部分将提供成功使用的例子,以及我们没有使用这些功能的情况。几乎所有内核对象都依赖于 Go 分配器和垃圾回收器。Biscuit 有 302 条语句从 GC 管理的堆中分配对象。其中一些对象是复合对象(由多个 Go 对象组成)。例如,Biscuit 的 Vmregion_t(描述虚拟内存的映射区域)有一个 Vminfo_t 的红/黑树,而 Vminfo_t 本身就是复合对象(例如,当它由文件支持时)。垃圾回收器无需使用显式代码来释放此类复合数据类型的各个部分。
Biscuit's only special-purpose allocator is its physical page allocator. It is used for process memory pages, file cache pages, socket and pipe buffers, and page table pages. Biscuit 唯一的特殊用途分配器是物理页面分配器。它用于进程内存页、文件缓存页、套接字和管道缓冲区以及页表页。
Biscuit uses many goroutines. For example, device drivers create long-running goroutines to handle events such as packet arrival. Biscuit avoids goroutine creation, however, in frequently executed code. The reason is that the garbage collector produces pauses proportional to the number of goroutines; these are insignificant for thousands of goroutines but a problem with hundreds of thousands. Biscuit 使用许多 goroutines。例如,设备驱动程序会创建长时间运行的 goroutine 来处理数据包到达等事件。不过,Biscuit 避免在频繁执行的代码中创建 goroutine。原因是垃圾回收器会产生与 goroutines 数量成正比的停顿;对于数千个 goroutines 而言,这些停顿并不重要,但对于数十万个 goroutines 而言,停顿就成了问题。
The combination of threads and garbage collection is particularly pleasing, since it avoids forcing the programmer to worry about delaying frees for shared objects until the last sharing thread has finished. For example, Biscuit's poll system call installs a pointer to a helper object in each file descriptor being polled. When input arrives on a descriptor, the goroutine delivering the input uses the helper object to wake up the polling thread. Garbage collection eliminates races between arriving input and freeing the helper object. 线程与垃圾回收的结合尤其令人满意,因为它避免了迫使程序员担心将共享对象的释放延迟到最后一个共享线程结束之后。例如,Biscuit 的轮询系统调用会在每个被轮询的文件描述符中安装一个指向辅助对象的指针。当输入到达一个描述符时,提供输入的 goroutine 会使用辅助对象唤醒轮询线程。垃圾回收消除了输入到达和释放辅助对象之间的竞争。
Some Biscuit objects, when the last reference to them disappears, need to take clean-up actions before their memory can be collected; for example, TCP connections must run the TCP shutdown protocol. Go's finalizers were not convenient in these situations because of the prohibition against cycles among objects with finalizers. Biscuit maintains reference counts in objects that require clean-up actions. 有些 Biscuit 对象在最后一次引用消失后,需要在内存被回收前进行清理操作;例如,TCP 连接必须运行 TCP 关闭协议。在这种情况下,Go 的终结器并不方便,因为终结器禁止对象间的循环。Biscuit 会维护需要清理操作的对象中的引用计数。
Biscuit uses many standard Go packages. For example, Biscuit imports sync in 28 files and atomic packages in 18 files. These packages provide mutexes, condition variables, and low-level atomic memory primitives. Biscuit's MAXLIVE tool depends on Go's code analysis packages (ssa, callgraph, and pointer). Biscuit 使用了许多标准 Go 软件包。例如,Biscuit 在 28 个文件中导入了同步包,在 18 个文件中导入了原子包。这些包提供了互斥、条件变量和底层原子内存原语。Biscuit 的 MAXLIVE 工具依赖于 Go 的代码分析包(ssa、callgraph 和 pointer)。
Biscuit itself is split into 31 Go packages. Packages allowed some code to be developed and tested in user space. For example, we tested the file system package for races and crash-safety in user space. The package system also made it easy to use the file system code to create boot disks. Biscuit 本身分为 31 个 Go 软件包。软件包允许在用户空间开发和测试某些代码。例如,我们在用户空间测试了文件系统包的竞赛和崩溃安全性。软件包系统还可以方便地使用文件系统代码来创建启动盘。
8.2 Potential to reduce bugs 8.2 减少虫子的潜力
An HLL might help avoid problems such as memory corruption from buffer overflows. To see how this applies to kernels, we looked at Linux execute-code bugs in the CVE database published in 2017 [34]. There are 65 bugs HLL 可能有助于避免缓冲区溢出造成的内存损坏等问题。为了了解这一点如何适用于内核,我们查看了 2017 年发布的 CVE 数据库中的 Linux 执行代码漏洞[34]。共有 65 个漏洞