2024_09_10_fd810be3ac55454cc573g

The benefits and costs of writing a POSIX kernel in a high-level language
用高级语言编写 POSIX 内核的好处和代价

Cody Cutler, M. Frans Kaashoek, and Robert T. Morris, MIT CSAIL
Cody Cutler、M. Frans Kaashoek 和 Robert T. Morris，麻省理工学院 CSAIL
https://www.usenix.org/conference/osdi18/presentation/cutler

This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
本文收录于第 13 届 USENIX 操作系统设计与实现研讨会 (OSDI '18) 论文集。
October 8-10,

Carlsbad, CA, USA
10 月 8-10 日，

美国加利福尼亚州卡尔斯巴德
ISBN 978-1-939133-08-3

The benefits and costs of writing a POSIX kernel in a high-level language
用高级语言编写 POSIX 内核的好处和代价

Cody Cutler, M. Frans Kaashoek, Robert T. Morris
科迪-卡特勒、M. 弗兰斯-卡舒克、罗伯特-T. 莫里斯MIT CSAIL 麻省理工学院 CSAIL

Abstract 摘要

This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits. The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge. On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to . The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was to slower.
本文对使用带有垃圾回收功能的高级语言（HLL）来实现单片 POSIX 风格内核进行了评估。目的是通过考察性能成本、实施挑战以及可编程性和安全性优势，探讨使用 HLL 而不是 C 语言来实现此类内核是否合理。本文介绍了用 Go 编写的内核 Biscuit，它实现了足够多的 POSIX（虚拟内存、mmap、TCP/IP 套接字、日志文件系统、轮询等），可以执行重要的应用程序。Biscuit 充分利用了 Go 的 HLL 功能（闭包、通道、映射、接口、垃圾回收堆分配），这在主观上简化了编程。最具挑战性的难题是处理内核堆内存耗尽的可能性；Biscuit 利用 Go 源代码的可分析性解决了这一难题。在一组内核密集型基准测试（包括 NGINX 和 Redis）中，Biscuit 用于 HLL 功能（主要是垃圾回收和线程栈扩展检查）的内核 CPU 时间比例高达。NGINX 遭受的与 GC 相关的最长单次暂停时间为 115 微秒；观察到的 NGINX 客户端完整请求的最长 GC 延迟总和为 600 微秒。在比较用 Go 和 C 语言编写的几乎完全相同的系统调用、页面故障和上下文切换代码路径的实验中，Go 版本的速度从到不等。

1 Introduction 1 引言

The default language for operating system kernels is C : Linux, macOS, and Windows all use C. C is popular for kernels because it can deliver high performance via flexible low-level access to memory and control over memory management (allocation and freeing). C, however, requires care and experience to use safely, and even then low-level bugs are common. For example, in 2017 at least 50 Linux kernel security vulnerabilities were reported that involved buffer overflow or use-after-free bugs in C code [34].
操作系统内核的默认语言是 C 语言：Linux、macOS 和 Windows 都使用 C 语言。C 语言在内核中很受欢迎，因为它可以通过灵活的底层内存访问和内存管理（分配和释放）控制提供高性能。不过，C 语言需要谨慎和经验才能安全使用，即便如此，低级错误也屡见不鲜。例如，2017 年报告的至少 50 个 Linux 内核安全漏洞涉及 C 代码中的缓冲区溢出或使用后释放 bug [34]。

High-level languages (HLLs) provide type- and memory-safety and convenient abstractions such as threads. Many HLLs provide garbage collection to further reduce programmer burden and memory bugs. It is well-known that HLLs can be used in kernels: multiple kernels have been written in HLLs, often as platforms to explore innovative ideas (

§

). On the other hand, leading OS designers have been skeptical that HLLs' memory management and abstractions are compatible with highperformance production kernels [51][47, p. 71].
高级语言（HLL）提供了类型安全和内存安全，以及线程等方便的抽象。许多 HLL 还提供垃圾回收功能，以进一步减轻程序员的负担，减少内存错误。众所周知，内核中可以使用 HLL：多个内核都是用 HLL 编写的，通常作为探索创新思想的平台（

§

）。另一方面，操作系统的主要设计者一直怀疑 HLL 的内存管理和抽象是否与高性能的生产内核兼容[51][47, p. 71]。

While it would probably not make sense to re-write an existing C kernel in an HLL, it is worth considering what languages new kernel projects should use. Since kernels impose different constraints and requirements than typical applications, it makes sense to explore this question in the context of a kernel.
虽然用 HLL 重写现有的 C 内核可能没有意义，但新的内核项目应该使用什么语言却值得考虑。由于内核与典型的应用程序有着不同的限制和要求，因此在内核的背景下探讨这个问题是有意义的。

We built a new kernel, Biscuit, written in Go [15] for x86-64 hardware. Go is a type-safe language with garbage collection. Biscuit runs significant existing applications such as NGINX and Redis without source modification by exposing a POSIX-subset system call interface. Supported features include multi-core, kernel-supported user threads, futexes, IPC, mmap, copy-on-write fork, vnode and name caches, a logging file system, and TCP/IP sockets. Biscuit implements two significant device drivers in Go: one for AHCI SATA disk controllers and one for Intel 82599-based Ethernet controllers. Biscuit has nearly 28 thousand lines of Go, 1546 lines of assembler, and no C. We report lessons learned about use of Go in Biscuit, including ways in which the language helped development, and situations in which it was less helpful.
我们为 x86-64 硬件构建了一个用 Go [15] 编写的新内核 Biscuit。Go 是一种带有垃圾回收功能的类型安全语言。Biscuit 通过公开 POSIX 子集系统调用接口，无需修改源代码即可运行 NGINX 和 Redis 等重要的现有应用程序。支持的功能包括多核、内核支持的用户线程、futexes、IPC、mmap、写时复制 fork、vnode 和名称缓存、日志文件系统和 TCP/IP 套接字。Biscuit 用 Go 实现了两个重要的设备驱动程序：一个用于 AHCI SATA 磁盘控制器，另一个用于基于 Intel 82599 的以太网控制器。Biscuit 有将近 2.8 万行 Go 语言，1546 行汇编语言，没有 C 语言。我们报告了在 Biscuit 中使用 Go 语言的经验教训，包括该语言对开发有帮助的方式，以及帮助不大的情况。

In most ways the design of Biscuit is that of a traditional monolithic POSIX/Unix kernel, and Go was a comfortable language for that approach. In one respect the design of Biscuit is novel: its mechanism for coping with kernel heap exhaustion. We use static analysis of the Biscuit source to determine how much heap memory each system call (and other kernel activity) might need, and each system call waits (if needed) when it starts until it can reserve that much heap. Once a system call is allowed to continue, its allocations are guaranteed to succeed without blocking. This obviates the need for complex allocation failure recovery or deadlock-prone waiting for free memory in the allocator. The use of an HLL that is conducive to static analysis made this approach possible.
在大多数方面，Biscuit 的设计都是传统的 POSIX/Unix 内核，而 Go 是一种适用于这种方法的语言。Biscuit 的设计有一点很新颖，那就是它应对内核堆耗尽的机制。我们使用 Biscuit 代码的静态分析来确定每个系统调用（和其他内核活动）可能需要多少堆内存，每个系统调用在启动时都会等待（如果需要），直到能预留出足够的堆内存。一旦系统调用被允许继续，就能保证其分配成功而不会阻塞。这样就不需要进行复杂的分配失败恢复，也不需要在分配器中等待空闲内存，以免造成死锁。使用有利于静态分析的 HLL 使这种方法成为可能。

We run several kernel-intensive applications on Biscuit and measure the effects of Go's type safety and garbage collection on kernel performance. For our benchmarks, GC costs up to 3% of CPU. For NGINX, the longest single
我们在 Biscuit 上运行了几个内核密集型应用程序，并测量了 Go 的类型安全和垃圾回收对内核性能的影响。在我们的基准测试中，垃圾回收最多耗费 3% 的 CPU。对于 NGINX，最长的单次

GC-related pause was 115 microseconds, and the longest a single NGINX client request was delayed (by many individual pauses) was a total of 600 microseconds. Other identifiable HLL performance costs amount to about

of CPU.
与 GC 相关的暂停时间为 115 微秒，而单个 NGINX 客户端请求的最长延迟时间（多次单独暂停）总计为 600 微秒。其他可识别的 HLL 性能成本约为

CPU。

To shed light on the specific question of C versus Go performance in the kernel, we modify Biscuit and a

kernel to have nearly identical source-level code paths for two benchmarks that stress system calls, page faults, and context switches. The C versions are about

and

faster than the Go versions.
为了阐明内核中 C 语言与 Go 语言性能对比的具体问题，我们修改了 Biscuit 和

内核，使其在强调系统调用、页面故障和上下文切换的两个基准测试中具有几乎相同的源代码级代码路径。C 版本比 Go 版本快约

和

。

Finally, we compare the performance of Biscuit and Linux on our kernel-intensive application benchmarks, finding that Linux is up to

faster than Biscuit. This result is not very illuminating about choice of language, since performance is also affected by differences in the features, design and implementation of Biscuit and Linux. However, the results do provide an idea of whether the absolute performance of Biscuit is in the same league as that of a

kernel.
最后，我们比较了 Biscuit 和 Linux 在内核密集型应用程序基准测试中的性能，发现 Linux 比 Biscuit 快

。由于 Biscuit 和 Linux 在功能、设计和实现方面的差异也会影响性能，因此这一结果并不能很好地说明语言的选择。不过，该结果确实提供了一个概念，即 Biscuit 的绝对性能是否与

内核处于同一水平。

In summary, the main contributions of this paper are: (1) Biscuit, a kernel written in Go with good performance; (2) a novel scheme for coping with kernel heap exhaustion; (3) a discussion of qualitative ways in which use of an HLL in a kernel was and was not helpful; (4) measurements of the performance tax imposed by use of an HLL; and (5) a direct Go-vs-C performance comparison of equivalent code typical of that found in a kernel.
总之，本文的主要贡献是(1)用 Go 编写的性能良好的内核 Biscuit；(2)应对内核堆耗尽的新方案；(3)讨论在内核中使用 HLL 是否有帮助的定性方式；(4)测量使用 HLL 带来的性能税；(5)直接比较内核中典型的等价代码的 Go-vs-C 性能。

This paper does not draw any top-level conclusion about

versus an HLL as a kernel implementation language. Instead, it presents experience and measurements that may be helpful for others making this decision, who have specific goals and requirements with respect to programmability, safety and performance. Section 9 summarizes the key factors in this decision.
本文并未就

与 HLL 作为内核实现语言的对比得出任何最高级别的结论。相反，本文介绍的经验和测量结果可能会对做出这一决定的其他人有所帮助，因为他们在可编程性、安全性和性能方面有特定的目标和要求。第 9 节总结了做出这一决定的关键因素。

Biscuit builds on multiple areas of previous work: highlevel languages in operating systems, high-level systems programming languages, and memory allocation in the kernel. As far as we know the question of the impact of language choice on kernel performance, all else being equal, has not been explored.
Biscuit 建立在以前工作的多个领域之上：操作系统中的高级语言、高级系统编程语言和内核中的内存分配。据我们所知，在其他条件相同的情况下，语言选择对内核性能的影响问题还没有被探讨过。

Kernels in high-level languages. The Pilot [44] kernel and the Lisp machine [17] are early examples of use of a high-level language (Mesa [14] and Lisp, respectively) in an operating system. Mesa lacked garbage-collection, but it was a high-priority requirement for its successor language Cedar [48]. The Lisp machine had a real-time garbage collector [5].
高级语言内核。Pilot [44] 内核和 Lisp 机器 [17] 是在操作系统中使用高级语言（分别是 Mesa [14] 和 Lisp）的早期例子。Mesa 缺乏垃圾回收功能，但这是其后续语言 Cedar [48] 的优先要求。Lisp 机器有一个实时垃圾回收器[5]。

A number of research kernels are written in high-level languages (e.g., Taos [49], Spin [7], Singularity [23], J- kernel [19], and KaffeOS [3, 4], House [18], the Mirage unikernel [29], and Tock [27]). The main thrust of these projects was to explore new ideas in operating system architecture, often enabled by the use of a type-safe highlevel language. While performance was often a concern, usually the performance in question related to the new ideas, rather than to the choice of language. Singularity quantified the cost of hardware and software isolation [22], which is related to the use of a HLL, but didn't quantify the cost of safety features of a HLL language, as we do in

§

.
许多研究内核都是用高级语言编写的（如 Taos [49]、Spin [7]、Singularity [23]、J- kernel [19]、KaffeOS [3，4]、House [18]、Mirage unikernel [29] 和 Tock [27]）。这些项目的主旨是探索操作系统架构的新思路，通常通过使用类型安全的高级语言来实现。虽然性能往往是一个值得关注的问题，但有关性能的问题通常与新思路有关，而不是与语言的选择有关。奇点量化了硬件和软件隔离的成本[22]，这与高级语言的使用有关，但并没有像我们在

§

中所做的那样，量化高级语言安全特性的成本。

High-level systems programming languages. A number of systems-oriented high-level programming languages with type safety and garbage collection seem suitable for kernels, including Go, Java, C#, and Cyclone [25] (and, less recently, Cedar [48] and Modula-3 [37]). Other systems HLLs are less compatible with existing kernel designs. For example, Erlang [2] is a "shared-nothing" language with immutable objects, which would likely result in a kernel design that is quite different from traditional C shared-memory kernels.
高级系统编程语言。一些具有类型安全和垃圾回收功能的面向系统的高级编程语言似乎适合内核，包括 Go、Java、C# 和 Cyclone [25]（最近还有 Cedar [48] 和 Modula-3 [37]）。其他系统的 HLL 与现有内核设计的兼容性较差。例如，Erlang[2]是一种具有不可变对象的 "无共享 "语言，其内核设计很可能与传统的C语言共享内存内核大相径庭。

Frampton et al. introduce a framework for language extensions to support low-level programming features in Java, applying it to a GC toolkit [13]. Biscuit's goal is efficiency for kernels without modifying Go. Kernels have additional challenges such as dealing with user/kernel space, page tables, interrupts, and system calls.
Frampton 等人引入了一个语言扩展框架，以支持 Java 的底层编程功能，并将其应用于 GC 工具包[13]。Biscuit 的目标是在不修改 Go 的情况下提高内核的效率。内核还面临其他挑战，如处理用户/内核空间、页表、中断和系统调用等。

A number of new languages have recently emerged for systems programming: D [11], Nim(rod) [42], Go [15], and Rust [36]. There are a number of kernels in Rust [12, 26, 27, 28, 39, 50], but none were written with the goal of comparing with C as an implementation language. Gopher OS is a Go kernel with a similar goal as Biscuit, but the project is at an early stage of development [1]. Other Go kernels exists but they don't target the questions that Biscuit answers. For example, Clive [6] is a unikernel and doesn't run on the bare metal. The Ethos OS uses C for the kernel and Go for user-space programs, with a design focused on security [41]. gVisor is a user-space kernel, written in Go, that implements a substantial portion of the Linux system API to sandbox containers [16].
最近出现了许多新的系统编程语言：D [11]、Nim(rod) [42]、Go [15] 和 Rust [36]。Rust 中有许多内核[12, 26, 27, 28, 39, 50]，但没有一种内核是以与作为实现语言的 C 语言相比较为目标而编写的。Gopher OS 是一个 Go 内核，目标与 Biscuit 类似，但该项目还处于早期开发阶段[1]。其他 Go 内核也存在，但它们并不以 Biscuit 所回答的问题为目标。例如，Clive [6] 是一个单内核，不能在裸机上运行。Ethos OS 的内核使用 C 语言，用户空间程序使用 Go 语言，其设计重点在于安全性[41]。gVisor 是一个用 Go 语言编写的用户空间内核，它实现了 Linux 系统 API 的大部分，并将其用于沙箱容器[16]。
Memory allocation. There is no consensus about whether a systems programming language should have automatic garbage-collection. For example, Rust is partially motivated by the idea that garbage collection cannot be made efficient; instead, the Rust compiler analyzes the program to partially automate freeing of memory. This approach can make sharing data among multiple threads or closures awkward [26].
内存分配。关于系统编程语言是否应该自动收集垃圾，目前还没有达成共识。例如，Rust 的部分动机是垃圾回收无法做到高效；相反，Rust 编译器会分析程序，部分自动释放内存。这种方法会使多个线程或闭包之间的数据共享变得困难[26]。

Concurrent garbage collectors [5, 24, 30] reduce pause times by collecting while the application runs. Go 1.10 has such a collector [21], which Biscuit uses.
并发垃圾收集器 [5, 24, 30] 通过在应用程序运行时收集垃圾来减少暂停时间。Go 1.10 就有这样一个垃圾收集器 [21]，Biscuit 就使用了它。

Several papers have studied manual memory allocation versus automatic garbage collection [20, 52], focusing on heap headroom memory's effect in reducing garbage collection costs in user-level programs. Headroom is also important for Biscuit's performance (

§

and

§

).
有几篇论文研究了手动内存分配与自动垃圾回收的对比[20, 52]，重点关注堆净空内存在降低用户级程序垃圾回收成本方面的作用。净空对 Biscuit 的性能也很重要（

§

和

§

）。

Rafkind et al. added garbage collection to parts of Linux through automatic translation of

source [43]. The authors observe that the kernel environment made this task difficult and adapted a fraction of a uniprocessor Linux kernel to be compatible with garbage collection. Biscuit required a fresh start in a new language, but as a result required less programmer effort for GC compatibility and benefited from a concurrent and parallel collector.
Rafkind 等人通过自动翻译

源代码，为 Linux 的某些部分添加了垃圾回收功能 [43]。作者注意到内核环境给这项任务带来了困难，因此对单核处理器 Linux 内核的一部分进行了调整，使其与垃圾回收兼容。Biscuit 需要从头开始使用一种新的语言，但因此减少了程序员为兼容 GC 所付出的努力，并受益于并发和并行的收集器。

Linux's slab allocators [8] are specifically tuned for use in the kernel; they segregate free objects by type to avoid re-initialization costs and fragmentation. A hypothesis in the design of Biscuit is that Go's single general-purpose allocator and garbage collector are suitable for a wide range of different kernel objects.
Linux 的板块分配器 [8] 专门针对内核进行了调整；它们按类型隔离空闲对象，以避免重新初始化成本和碎片化。Biscuit 设计中的一个假设是，Go 的单一通用分配器和垃圾回收器适用于各种不同的内核对象。

Kernel heap exhaustion. All kernels have to cope with the possibility of running out of memory for the kernel heap. Linux optimistically lets system calls proceed up until the point where an allocation fails. In some cases code waits and re-tries the allocation a few times, to give an "out-of-memory" killer thread time to find and destroy an abusive process to free memory. However, the allocating thread cannot generally wait indefinitely: it may hold locks, so there is a risk of deadlock if the victim of the killer thread is itself waiting for a lock [9]. As a result Linux system calls must contain code to recover from allocation failures, undoing any changes made so far, perhaps unwinding through many function calls. This undo code has a history of bugs [10]. Worse, the final result will be an error return from a system call. Once the heap is exhausted, any system call that allocates will likely fail; few programs continue to operate correctly in the face of unexpected errors from system calls, so the end effect may be application-level failure even if the kernel code handles heap exhaustion correctly.
内核堆耗尽。所有内核都必须应对内核堆内存耗尽的可能性。Linux 会乐观地让系统调用继续进行，直到分配失败为止。在某些情况下，代码会等待并重试几次分配，以便让 "内存不足 "的杀手线程有时间找到并摧毁一个滥用内存的进程，从而释放内存。但是，分配线程一般不能无限期地等待：它可能持有锁，因此如果杀手线程的受害者本身也在等待锁，就会有死锁的风险[9]。因此，Linux 系统调用必须包含从分配失败中恢复的代码，撤消迄今为止所做的任何更改，也许要通过多次函数调用才能解除。这种撤销代码历来就有错误[10]。更糟的是，最终结果将是系统调用错误返回。一旦堆耗尽，任何分配的系统调用都可能失败；面对系统调用的意外错误，很少有程序能继续正常运行，因此即使内核代码正确处理了堆耗尽，最终的结果也可能是应用程序级的失败。

Biscuit's reservation approach yields simpler code than Linux's. Biscuit kernel heap allocations do not fail (much as with Linux's contentious "too small to fail" rule [9, 10]), eliminating a whole class of complex error recovery code. Instead, each Biscuit system call reserves kernel heap memory when it starts (waiting if necessary), using a static analysis system to decide how much to reserve. Further, Biscuit applications don't see system call failures when the heap is exhausted; instead, they see delays.
Biscuit 的保留方法比 Linux 的代码更简单。Biscuit 内核堆分配不会失败（就像 Linux 有争议的 "小到不能失败 "规则一样 [9,10]），从而消除了一整类复杂的错误恢复代码。相反，每个 Biscuit 系统调用都会在启动时（必要时等待）保留内核堆内存，并使用静态分析系统来决定保留多少内存。此外，当堆耗尽时，Biscuit 应用程序不会看到系统调用失败；相反，它们会看到延迟。

3 Motivation 3 动机

This section outlines our view of the main considerations in the choice between C and an HLL for the kernel.
本节概述了我们对内核选择 C 语言还是 HLL 的主要考虑因素的看法。

3.1 Why C? 3.1 为什么是 C？

A major reason for C's popularity in kernels is that it supports low-level techniques that can help performance, particularly pointer arithmetic, easy escape from type enforcement, explicit memory allocation, and custom allocators [51][47, p. 71]. There are other reasons too (e.g. C can manipulate hardware registers and doesn't depend on a complex runtime), but performance seems most important.
C 语言在内核中流行的一个主要原因是它支持有助于提高性能的底层技术，特别是指针运算、轻松摆脱类型强制、显式内存分配和自定义分配器[51][47, p. 71]。还有其他一些原因（例如，C 语言可以操作硬件寄存器，并且不依赖于复杂的运行时），但性能似乎是最重要的。

3.2 Why an HLL? 3.2 为什么要使用 HLL？

The potential benefits of high-level languages are well understood. Automatic memory management reduces programmer effort and use-after-free bugs; type-safety detects bugs; runtime typing and method dispatch help with abstraction; and language support for threads and synchronization eases concurrent programming.
高级语言的潜在优势众所周知。自动内存管理可减少程序员的工作量和使用后的错误；类型安全可检测错误；运行时类型和方法调度有助于抽象；语言对线程和同步的支持可简化并发编程。

Certain kinds of bugs seem much less likely in an HLL than in C: buffer overruns, use-after-free bugs [40], and bugs caused by reliance on C's relaxed type enforcement. Even C code written with care by expert programmers has C-related bugs [40]. The CVE database for the Linux kernel [34] lists 40 execute-code vulnerabilities for 2017 which would be wholly or partially ameliorated by use of an HLL (see

§

).
与 C 语言相比，某些类型的错误在 HLL 中出现的可能性要小得多：缓冲区超限、use-after-free 错误[40]，以及由于依赖 C 语言宽松的类型执行而导致的错误。即使是由专业程序员精心编写的 C 代码，也会出现与 C 相关的错误 [40]。Linux 内核的 CVE 数据库 [34] 列出了 2017 年的 40 个执行代码漏洞，使用 HLL 可以全部或部分改善这些漏洞（参见

§

）。

Use-after-free bugs are notoriously difficult to debug, yet occur often enough that the Linux kernel includes a memory checker that detects some use-after-free and buffer overrun bugs at runtime [46]. Nevertheless, Linux developers routinely discover and fix use-after-free bugs: Linux has at least 36 commits from January to April of 2018 for the specific purpose of fixing use-after-free bugs.
免用程序错误是众所周知的难以调试的问题，但却经常出现，以至于 Linux 内核包含了一个内存检查器，可以在运行时检测到一些免用程序和缓冲区超限错误[46]。尽管如此，Linux 开发人员还是经常发现并修复 "空闲后使用 "错误：从 2018 年 1 月到 4 月，Linux 至少有 36 次提交专门用于修复使用后即免错误。

Another area of kernel programming that would benefit from HLLs is concurrency. Transient worker threads can be cumbersome in C because the code must decide when the last thread has stopped using any shared objects that need to be freed; this is easier in a garbage collected language.
内核编程的另一个受益于 HLL 的领域是并发性。在 C 语言中，瞬时工作线程可能会很麻烦，因为代码必须决定最后一个线程何时停止使用任何需要释放的共享对象；而在垃圾回收语言中，这一点会更容易。

However, use of a garbage-collected HLL is not free. The garbage collector and safety checks consume CPU time and can cause delays; the expense of high-level features may deter their use; the language's runtime layer hides important mechanisms such as memory allocation; and enforced abstraction and safety may reduce developers' implementation options.
然而，使用垃圾回收的 HLL 并不是免费的。垃圾收集器和安全检查会消耗 CPU 时间，并可能导致延迟；高级功能的费用可能会阻碍它们的使用；语言的运行时层隐藏了内存分配等重要机制；强制抽象和安全可能会减少开发人员的实施选择。

4 Overview 4 概述

Biscuit's main purpose is to help evaluate the practicality of writing a kernel in a high-level language. Its design is similar to common practice in monolithic UNIX-like kernels, to facilitate comparison. Biscuit runs on 64-bit x86 hardware and is written in Go. It uses a modified version of the Go 1.10 runtime implementation; the runtime is
Biscuit 的主要目的是帮助评估用高级语言编写内核的实用性。为了便于比较，它的设计与单片式 UNIX 内核的常见做法类似。Biscuit 在 64 位 x86 硬件上运行，由 Go 语言编写。它使用 Go 1.10 运行时实现的一个修改版本；运行时为

Figure 1: Biscuit's overall structure.
图 1：Biscuit 的整体结构。
written in Go with some assembly. Biscuit adds more assembly to handle boot and entry/exit for system calls and interrupts. There is no C. This section briefly describes Biscuit's components, focusing on areas in which use of Go affected the design and implementation.
是用 Go 语言和一些汇编语言编写的。Biscuit 增加了更多的汇编程序，以处理启动以及系统调用和中断的进入/退出。本节将简要介绍 Biscuit 的组件，重点是 Go 的使用对设计和实现产生影响的领域。

Boot and Go Runtime. The boot block loads Biscuit, the Go runtime, and a "shim" layer (as shown in Figure 1). The Go runtime, which we use mostly unmodified, expects to be able to call an underlying kernel for certain services, particularly memory allocation and control of execution contexts (cores, or in Go terminology, threads). The shim layer provides these functions, since there is no underlying kernel. Most of the shim layer's activity occurs during initialization, for example to pre-allocate memory for the Go kernel heap.
引导和 Go 运行时。引导块加载 Biscuit、Go 运行时和 "垫片 "层（如图 1 所示）。我们使用的 Go 运行时大多未经修改，它希望能够调用底层内核来提供某些服务，特别是内存分配和执行上下文（内核，或 Go 术语中的线程）控制。由于不存在底层内核，因此由 shim 层提供这些功能。shim 层的大部分活动都发生在初始化过程中，例如为 Go 内核堆预先分配内存。

Processes and Kernel Goroutines. Biscuit provides user processes with a POSIX interface: fork, exec, and so on, including kernel-supported threads and futexes. A user process has one address space and one or more threads. Biscuit uses hardware page protection to isolate user processes. A user program can be written in any language; we have implemented them only in C and

(not Go). Biscuit maintains a kernel goroutine corresponding to each user thread; that goroutine executes system calls and handlers for page faults and exceptions for the user thread. "goroutine" is Go's name for a thread, and in this paper refers only to threads running inside the kernel.
进程和内核线程Biscuit 为用户进程提供 POSIX 接口：fork、exec 等，包括内核支持的线程和 futexes。一个用户进程有一个地址空间和一个或多个线程。Biscuit 使用硬件页面保护来隔离用户进程。用户程序可以用任何语言编写；我们仅用 C 和

实现了它们（不是 Go）。Biscuit 为每个用户线程维护一个对应的内核 goroutine；该 goroutine 为用户线程执行系统调用以及页面故障和异常处理程序。"goroutine "是 Go 对线程的称呼，本文仅指内核中运行的线程。

Biscuit's runtime schedules the kernel goroutines of user processes, each executing its own user thread in usermode when necessary. Biscuit uses timer interrupts to switch pre-emptively away from user threads. It relies on pre-emption checks generated by the Go compiler to switch among kernel goroutines.
Biscuit 的运行时会调度用户进程的内核程序，必要时每个用户进程都会在用户模式下执行自己的用户线程。Biscuit 使用定时器中断来优先切换用户线程。它依靠 Go 编译器生成的抢占式检查来切换内核 goroutines。

Interrupts. A Biscuit device interrupt handler marks an associated device-driver goroutine as runnable and then returns, as previous kernels have done [35, 45]. Interrupt handlers cannot do much more without risk of deadlock, because the Go runtime does not turn off interrupts during sensitive operations such as goroutine context switch.
中断。Biscuit 设备中断处理程序会将相关的设备驱动程序 goroutine 标记为可运行，然后返回，就像以前的内核所做的那样 [35, 45]。中断处理程序不能做更多的事情，否则会有死锁的风险，因为 Go 运行时不会在goroutine 上下文切换等敏感操作期间关闭中断。

Handlers for system calls and faults from user space can execute any Go code. Biscuit executes this code in the context of the goroutine that is associated with the current user thread.
来自用户空间的系统调用和故障处理程序可以执行任何 Go 代码。Biscuit 会在与当前用户线程相关的 goroutine 上下文中执行这些代码。

Multi-Core and Synchronization. Biscuit runs in parallel on multi-core hardware. It guards its data structures using Go's mutexes, and synchronizes using Go's channels and condition variables. The locking is fine-grained enough that system calls from threads on different cores can execute in parallel in many common situations, for example when operating on different files, pipes, sockets, or when forking or execing in different processes. Biscuit uses read-lock-free lookups in some performance-critical code (see below).
多核与同步。Biscuit 可在多核硬件上并行运行。它使用 Go 的互斥来保护其数据结构，并使用 Go 的通道和条件变量进行同步。锁定的精细程度足以让不同内核线程的系统调用在许多常见情况下并行执行，例如在不同文件、管道、套接字上运行时，或在不同进程中分叉或执行时。Biscuit 在某些性能关键代码中使用免读锁定查找（见下文）。

Virtual Memory. Biscuit uses page-table hardware to implement zero-fill-on-demand memory allocation, copyon-write fork, and lazy mapping of files (e.g., for exec) in which the PTEs are populated only when the process page-faults, and mmap.
虚拟内存。Biscuit 使用页表硬件来实现按需零填充内存分配、写入即复制分叉和文件的懒映射（如用于执行），其中 PTE 仅在进程发生页故障和毫米映射时才被填充。

Biscuit records contiguous memory mappings compactly, so in the common case large numbers of mapping objects aren't needed. Physical pages can have multiple references; Biscuit tracks these using reference counts.
Biscuit 会紧凑地记录连续的内存映射，因此在一般情况下不需要大量的映射对象。物理页面可以有多个引用；Biscuit 使用引用计数跟踪这些引用。

File System. Biscuit implements a file system supporting the core POSIX file system calls. The file system has a file name lookup cache, a vnode cache, and a block cache. Biscuit guards each vnode with a mutex and resolves pathnames by first attempting each lookup in a read-lockfree directory cache before falling back to locking each directory named in the path, one after the other. Biscuit runs each file system call as a transaction and has a journal to commit updates to disk atomically. The journal batches transactions through deferred group commit, and allows file content writes to bypass the journal. Biscuit has an AHCI disk driver that uses DMA, command coalescing, native command queuing, and MSI interrupts.
文件系统Biscuit 实现了一个支持核心 POSIX 文件系统调用的文件系统。文件系统有一个文件名查找缓存、一个 vnode 缓存和一个块缓存。Biscuit 通过一个互斥来保护每个 vnode，并通过以下方式解析路径名：首先在一个无读取锁定的目录缓存中尝试每次查找，然后再逐个锁定路径中命名的每个目录。Biscuit 将每个文件系统调用作为一个事务运行，并通过日志将更新以原子方式提交到磁盘。日志通过延迟组提交对事务进行批处理，并允许绕过日志写入文件内容。Biscuit 有一个 AHCI 磁盘驱动程序，使用 DMA、命令聚合、本地命令队列和 MSI 中断。

Network Stack. Biscuit implements a TCP/IP stack and a driver for Intel PCI-Express Ethernet NICs in Go. The driver uses DMA and MSI interrupts. The system-call API provides POSIX sockets.
网络协议栈Biscuit 在 Go 中为英特尔 PCI-Express 以太网网卡实现了 TCP/IP 协议栈和驱动程序。驱动程序使用 DMA 和 MSI 中断。系统调用 API 提供 POSIX 套接字。

Limitations. Although Biscuit can run many Linux C programs without source modification, it is a research prototype and lacks many features. Biscuit does not support scheduling priority because it relies on the Go runtime scheduler. Biscuit is optimized for a small number of cores, but not for large multicore machines or NUMA. Biscuit does not swap or page out to disk, and does not implement the reverse mapping that would be required to steal mapped pages. Biscuit lacks many security features like users, access control lists, or address space randomization.
局限性。虽然 Biscuit 可以在不修改源代码的情况下运行许多 Linux C 程序，但它只是一个研究原型，缺乏许多功能。Biscuit 不支持调度优先级，因为它依赖于 Go 运行时调度器。Biscuit 针对少量内核进行了优化，但不适合大型多核机器或 NUMA。Biscuit 不会交换或分页到磁盘，也没有实现窃取映射页所需的反向映射。Biscuit 缺乏许多安全功能，如用户、访问控制列表或地址空间随机化。

5 Garbage collection 5 垃圾收集

Biscuit's use of garbage collection is a clear threat to its performance. This section outlines the Go collector's design and describes how Biscuit configures the collector;

§

evaluates performance costs.
Biscuit 对垃圾回收的使用显然会对其性能造成威胁。本节概述了 Go 收集器的设计，并介绍了 Biscuit 如何配置收集器；

§

评估了性能成本。

5.1 Go's collector 5.1 围棋的收集器

Go 1.10 has a concurrent parallel mark-and-sweep garbage collector [21]. The concurrent aspect is critical for Biscuit, since it minimizes the collector's "stop-theworld" pauses.
Go 1.10 有一个并发并扫垃圾收集器 [21]。并发方面对 Biscuit 至关重要，因为它能最大限度地减少收集器的 "停止-世界 "停顿。

When the Go collector is idle, the runtime allocates from the free lists built by the last collection. When the free space falls below a threshold, the runtime enables concurrent collection. When collection is enabled, the work of following ("tracing") pointers to find and mark reachable ("live") objects is interleaved with execution: each allocator call does a small amount of tracing and marking. Writes to already-traced objects are detected with compiler-generated "write barriers" so that any newly installed pointers will be traced. Once all pointers have been traced, the collector turns off write barriers and resumes ordinary execution. The collector suspends ordinary execution on all cores (a "stop-the-world" pause) twice during a collection: at the beginning to enable the write barrier on all cores and at the end to check that all objects have been marked. These stop-the-world pauses typically last dozens of microseconds. The collector rebuilds the free lists from the unmarked parts of memory ("sweeps"), again interleaved with Biscuit execution, and then becomes idle when all free heap memory has been swept. The collector does not move objects, so it does not reduce fragmentation.
当 Go 收集器空闲时，运行时会从上次收集建立的空闲列表中进行分配。当空闲空间低于阈值时，运行时会启用并发收集。启用收集后，跟踪（"tracing"）指针以查找和标记可到达（"live"）对象的工作将与执行交错进行：每次分配器调用都会进行少量的跟踪和标记。通过编译器生成的 "写入障碍 "来检测对已跟踪对象的写入，这样任何新安装的指针都会被跟踪。一旦跟踪完所有指针，收集器就会关闭写入障碍并恢复正常执行。在收集过程中，收集器会暂停所有内核上的普通执行（"stop-the-world "暂停）两次：开始时在所有内核上启用写入障碍，结束时检查所有对象是否已被标记。这些 "世界停止 "暂停通常持续数十微秒。收集器从内存中未标记的部分重建空闲列表（"清扫"），同样与 Biscuit 执行交错进行，然后在清扫完所有空闲堆内存后进入空闲状态。收集器不会移动对象，因此不会减少碎片。

The fraction of CPU time spent collecting is roughly proportional to the number of live objects, and inversely proportional to the interval between collections [20, 52]. This interval can be made large by devoting enough RAM to the heap that a substantial amount of space ("headroom") is freed by each collection.
用于收集的 CPU 时间比例与实时对象的数量大致成正比，与收集间隔成反比 [20, 52]。通过为堆分配足够的 RAM，每次收集都会释放出大量空间（"净空"），从而拉大收集间隔。

The Go collector does most of its work during calls to the heap allocator, spreading out this work roughly evenly among calls. Thus goroutines see delays proportional to the amount that they allocate;

§

presents measurements of these delays for Biscuit.
Go 收集器的大部分工作都是在调用堆分配器时完成的，这些工作在调用中大致平均分配。因此，goroutines 的延迟与它们分配的数量成正比；

§

为 Biscuit 介绍了这些延迟的测量结果。

5.2 Biscuit's heap size 5.2 Biscuit 的堆大小

At boot time, Biscuit allocates a fixed amount of RAM for its Go heap, defaulting to

nd of total RAM. Go's collector ordinarily expands the heap memory when live data exceeds half the existing heap memory; Biscuit disables this expansion. The next section (§6) explains how Biscuit copes with situations where the heap space is nearly filled with live data.
启动时，Biscuit 会为 Go 堆分配固定数量的 RAM，默认为总 RAM 的

nd。当实时数据超过现有堆内存的一半时，Go 的收集器通常会扩展堆内存；Biscuit 禁用了这种扩展。下一节（§6）将解释 Biscuit 如何处理堆空间几乎被实时数据填满的情况。

6 Avoiding heap exhaustion
6 避免堆耗尽

Biscuit must address the possibility of live kernel data completely filling the RAM allocated for the heap ("heap exhaustion"). This is a difficult area that existing kernels struggle with (§2).
Biscuit 必须解决实时内核数据完全填满为堆分配的 RAM 的可能性（"堆耗尽"）。这是现有内核难以解决的问题（第 2 节）。

6.1 Approach: reservations
6.1 方法：保留

Biscuit is designed to tolerate heap exhaustion without kernel failure. In addition, it can take corrective action when there are identifiable "bad citizen" processes that allocate excessive kernel resources implemented with heap objects, such as the structures describing open files and pipes. Biscuit tries to identify bad citizens and kill them, in order to free kernel heap space and allow good citizens to make progress.
Biscuit 在设计上可以容忍堆耗尽而不会导致内核故障。此外，如果有可识别的 "坏公民 "进程分配了过多的内核资源（如描述打开文件和管道的结构），Biscuit 还能采取纠正措施。Biscuit 会尝试识别 "坏公民 "并杀死它们，以释放内核堆空间，让 "好公民 "取得进展。

Biscuit's approach to kernel heap exhaustion has three elements. First, it purges caches and soft state as the heap nears exhaustion. Second, code at the start of each system call waits until it can reserve enough heap space to complete the call; the reservation ensures that the heap allocations made in the call will succeed once the wait (if any) is over. Third, a kernel "killer" thread watches for processes that are consuming lots of kernel heap when the heap is near exhaustion, and kills them.
Biscuit 处理内核堆耗尽的方法有三个要素。首先，当堆接近耗尽时，它会清除缓存和软状态。其次，每次系统调用开始时，代码都会等待，直到它能预留足够的堆空间来完成调用；预留空间能确保调用中的堆分配在等待（如果有的话）结束后取得成功。第三，内核 "杀手 "线程会在堆空间接近耗尽时监视消耗大量内核堆空间的进程，并杀死它们。

This approach has some good properties. Applications do not have to cope with system call failures due to kernel heap exhaustion. Kernel code does not see heap allocation failure (with a few exceptions), and need not include logic to recover from such failures midway through a system call. System calls may have to wait for reservations, but they wait at their entry points without locks held, so deadlock is avoided.
这种方法具有一些良好的特性。应用程序不必应对内核堆耗尽导致的系统调用失败。内核代码不会看到堆分配失败（少数例外），也不需要在系统调用中途包含从这种失败中恢复的逻辑。系统调用可能需要等待保留，但它们在入口点等待时不会加锁，因此避免了死锁。

The killer thread must distinguish between good and bad citizens, since killing a critical process (e.g., init) can make the system unusable. If there is no obvious "bad citizen," this approach may block and/or kill valuable processes. Lack of a way within POSIX for the kernel to gracefully revoke resources causes there to be no good solution in some out-of-memory situations.
杀手线程必须区分好公民和坏公民，因为杀死一个关键进程（如 init）会导致系统无法使用。如果没有明显的 "坏公民"，这种方法可能会阻塞和/或杀死有价值的进程。由于 POSIX 内核缺乏从容撤销资源的方法，因此在某些内存不足的情况下没有好的解决方案。

The mechanisms in this section do not apply to nonheap allocations. In particular, Biscuit allocates physical memory pages from a separate allocator, not from the Go heap; page allocations can fail, and kernel code must check for failure and recover (typically by returning an error to a system call).
本节中的机制不适用于非堆分配。特别是，Biscuit 从单独的分配器而非 Go 堆分配物理内存页；页分配可能失败，内核代码必须检查失败并恢复（通常是向系统调用返回错误）。

6.2 How Biscuit reserves 6.2 饼干如何储备

Biscuit dedicates a fixed amount of RAM

for the kernel heap. A system call only starts if it can reserve enough heap memory for the maximum amount of simultaneously live data that it uses, called

. A system call may allocate more than

from the heap, but the amount over

must be dead and can be freed by the collector. This means that, even in the extreme case in which all but

of the
Biscuit 为内核堆分配了固定数量的 RAM

。系统调用只有在为其使用的最大同时在线数据量（称为

）预留足够的堆内存时才会启动。系统调用可以从堆中分配超过

的内存，但超过

的内存必须是死内存，并且可以被收集器释放。这意味着，即使在极端情况下，除了

之外的所有

reserve(s):
    g := last GC live bytes
    c := used bytes
    n := reserved bytes
    L := g + c + n
    M := heap RAM bytes
    if L + S < M:
        reserved bytes += s
    else:
        wake killer thread
        wait for OK from killer thread
release(s):
    a := bytes allocated by syscall
    if a < s:
        used bytes += a
    else:
        used bytes += s
    reserved bytes -= s

Figure 2: Pseudo code for heap reservations in Biscuit.
图 2：Biscuit 中的堆保留伪代码。
heap RAM is used by live data or is already reserved, the system call can execute, with collections as needed to recover the call's own dead data in excess of

.
如果堆 RAM 已被实时数据使用或已被预留，则系统调用可以执行，并根据需要进行收集，以恢复调用自身超过

的死数据。

Ideally, a reservation should check that

minus the amount of live and reserved data in the heap is greater than or equal to

. However, except immediately after a collection, the amount of live heap data is not known. Biscuit maintains a conservative over-estimate of live heap data using three counters:

, and

is the amount of live data marked by the previous garbage collection.

is the total amount of reservations made by system calls that have completed.

is the total outstanding reservations of system calls that are executing but not finished. Let

be the sum of

, and

.
理想情况下，保留应检查

减去堆中的实时和保留数据量是否大于或等于

。但是，除了在收集后立即进行的情况外，堆中的实时数据量是未知的。Biscuit 使用三个计数器来保守地高估堆中的实时数据：

和

是上一次垃圾回收标记的实时数据量。

是已完成的系统调用所做保留的总量。

是正在执行但尚未完成的系统调用的未完成预订总量。让

成为

和

的总和。

Figure 2 presents pseudo code for reserving and releasing the reservation of heap RAM in Biscuit. Before starting a system call, a thread checks that

. If

, the thread reserves by adding

, otherwise the thread wakes up the killer thread and sleeps. When finished, a system call calculates

, the total amount actually allocated, and uses

to (partially) release any over-reservation: if

, the system call adds

and subtracts

from

. Otherwise,

and the system call adds

and subtracts

from

.
图 2 展示了在 Biscuit 中预留和释放堆 RAM 的伪代码。在启动系统调用之前，线程会检查

。如果

，则线程通过将

加入

来保留，否则线程会唤醒杀手线程并休眠。完成后，系统调用会计算

，即实际分配的总量，并使用

来（部分）释放超额预留：如果

，系统调用会将

加到

中，并从

中减去

。否则，

和系统调用将

加入

，并从

中减去

。

The reason for separate

and

is to carry over reservations of system calls that span a garbage collection; a collection sets

to zero but leaves

unchanged.
之所以要将

和

分开，是为了将跨越垃圾回收的系统调用保留下来；垃圾回收会将

设置为零，但会保留

不变。

If heap memory is plentiful (live data

), the amount of live+dead data in the heap usually grows faster than

, so collections are triggered by heap free list exhaustion rather than by

. Thus system calls do not wait for memory, and do not trigger the killer thread. As live heap data increases, and

gets close to

may reach

before a collection would ordinarily be triggered. For this reason the killer thread performs a collection before deciding whether to kill processes.
如果堆内存充裕（活数据

），堆中的活数据+死数据量的增长速度通常快于

，因此集合是由堆空闲列表耗尽触发的，而不是由

触发的。因此，系统调用不会等待内存，也不会触发杀手线程。随着实时堆数据的增加，以及

接近

时，

可能会在通常触发收集之前达到

。因此，杀手线程在决定是否杀死进程之前会执行一次收集。

6.3 Static analysis to find
6.3 通过静态分析查找

We have developed a tool, MaxLive, that analyzes the Biscuit source code and the Go packages Biscuit uses to find

for each system call. The core challenge is detecting statically when allocated memory can no longer be live, since many system calls allocate memory for transient uses. Other challenges include analyzing loops with non-constant bounds, and determining reservations for background kernel activities that are not associated with a specific system call.
我们开发了一款名为 MaxLive 的工具，它可以分析 Biscuit 源代码和 Biscuit 使用的 Go 包，为每个系统调用查找

。核心挑战是静态检测已分配的内存何时不能继续使用，因为许多系统调用分配内存都是为了瞬时使用。其他挑战包括分析具有非恒定边界的循环，以及确定与特定系统调用无关的后台内核活动的保留。

We address these challenges by exploiting the characteristic event-handler-style structure of most kernel code, which does a modest amount of work and then returns (or goes idle); system call implementations, for example, work this way. Furthermore, we are willing to change the kernel code to make it amenable to the reservation approach, for example to avoid recursion (we changed a few functions). Two modifications were required to standard Go packages that Biscuit uses (packages time and

).
我们利用大多数内核代码特有的事件处理程序式结构来应对这些挑战，这种结构只做少量工作，然后返回（或闲置）；例如，系统调用的实现就是以这种方式工作的。此外，我们愿意修改内核代码，使其适合保留方法，例如避免递归（我们修改了几个函数）。需要对 Biscuit 使用的标准 Go 软件包（软件包 time 和

）进行两处修改。

6.3.1 Basic MAXLIVE operation
6.3.1 MAXLIVE 的基本操作

MAXLIVE examines the call graph (using Go's

and callgraph packages) to detect all allocations a system call may perform. It uses escape and pointer analysis (Go's pointer package) to detect when an allocation does not "escape" above a certain point in the call graph, meaning that the allocation must be dead on return from that point.
MAXLIVE 会检查调用图（使用 Go 的

和 callgraph 包），以检测系统调用可能执行的所有分配。它使用转义和指针分析（Go 的指针包）来检测分配是否 "转义 "到调用图中的某个点之上，这意味着从该点返回时，分配必须是死的。

MaxLiVE handles a few kinds of allocation specially: go, defer, maps, and slices. go (which creates a goroutine) is treated as an escaping allocation of the maximum kernel stack size (the new goroutine itself must reserve memory when it starts, much as if it were itself a system call). defer is a non-escaping allocation, but is not represented by an object in the SSA so MAXLIVE specifically considers it an allocation. Every insertion into a map or slice could double its allocated size; MAXLIVE generally doesn't know the old size, so it cannot predict how much memory would be allocated. To avoid this problem, we annotate the Biscuit source to declare the maximum size of slices and maps, which required 70 annotations.
MaxLiVE特别处理了几种分配：go、defer、map和slices。go（创建一个goroutine）被视为内核堆栈最大大小的escaping分配（新的goroutine本身必须在启动时预留内存，就像它本身是一个系统调用一样）。 defer是一种非escaping分配，但在SSA中没有对象表示，所以MAXLIVE特别将其视为一种分配。每次插入映射或片段都可能使分配的大小翻倍；MAXLIVE通常不知道旧的大小，因此无法预测会分配多少内存。为了避免这个问题，我们对Biscuit源进行注释，声明切片和映射的最大大小，这需要70个注释。

6.3.2 Handling loops 6.3.2 处理循环

For loops where MaxLive cannot determine a useful bound on the number of iterations, we supply a bound with an annotation; there were 78 such loops. Biscuit contains about 20 loops whose bounds cannot easily be expressed with an annotation, or for which the worst case is too large to be useful. Examples include retries to handle wakeup races in poll, iterating over a directory's data blocks during a path component lookup, and iterating over the pages of a user buffer in write.
对于 MaxLive 无法确定迭代次数有用界限的循环，我们提供了一个带有注释的界限；这样的循环有 78 个。Biscuit 包含约 20 个循环，这些循环的边界无法通过注释轻松表达，或者最坏情况下的边界太大，无法使用。例如在轮询中处理唤醒竞赛的重试、在路径组件查找过程中对目录数据块的遍历，以及在写入过程中对用户缓冲区页面的遍历。

We handle such loops with deep reservations. Each loop iteration tries to reserve enough heap for just the one
我们通过深度保留来处理这种循环。每次循环迭代都会尝试为一个
iteration. If there is insufficient free heap, the loop aborts and waits for free memory at the beginning of the system call, retrying when memory is available. Two loops (in exec and rename) needed code to undo changes after an allocation failure; the others did not.
迭代。如果没有足够的空闲堆，循环就会中止，并在系统调用开始时等待空闲内存，当内存可用时再重试。有两个循环（执行循环和重命名循环）需要代码来撤销分配失败后的更改，其他循环则不需要。

Three system calls have particularly challenging loops: exit, fork, and exec. These calls can close many file descriptors, either directly or on error paths, and each close may end up updating the file system (e.g. on last close of a deleted file). The file system writes allocate memory, and may create entries in file system caches. Thus, for example, an exiting process that has many file descriptors may need a large amount of heap memory for the one exit system call. However, in fact exit's memory requirements are much smaller than this: the cache entries will be deleted if heap memory is tight, so only enough memory is required to execute a single close. We bound the memory use of close by using MaxLive to find all allocations that may be live once close returns. We then manually ensure that all such allocations are either dead once close returns or are evictable cache entries. That way exit, fork, and exec only need to reserve enough kernel heap for one call to close. This results in heap bounds of less than 500 kB for all system calls but rename and fork ( 1 MB and 641 kB , respectively). The close system call is the only one we manually analyze with the assistance of MAXLIVE.
有三个系统调用的循环特别具有挑战性：exit、fork 和 exec。这些调用会直接或通过错误路径关闭许多文件描述符，而且每次关闭都可能最终更新文件系统（例如最后一次关闭已删除的文件）。文件系统写入会分配内存，并可能在文件系统缓存中创建条目。因此，举例来说，一个有许多文件描述符的退出进程可能需要大量堆内存来执行一个 exit 系统调用。但事实上，exit 对内存的需求远小于此：如果堆内存紧张，缓存条目就会被删除，因此只需要足够的内存来执行一次 close。我们使用 MaxLive 来查找 close 返回后可能仍在运行的所有分配，从而约束 close 的内存使用。然后，我们手动确保所有此类分配要么在 close 返回时已死亡，要么是可驱逐的缓存条目。这样，exit、fork 和 exec 只需为一次关闭调用预留足够的内核堆。这样，除了 rename 和 fork（分别为 1 MB 和 641 kB）之外，所有系统调用的堆边界都小于 500 kB。关闭系统调用是我们在 MAXLIVE 协助下手动分析的唯一一个调用。

6.3.3 Kernel threads 6.3.3 内核线程

A final area of special treatment applies to long-running kernel threads. An example is the filesystem logging thread, which acts on behalf of many processes. Each long-running kernel thread has its own kernel heap reservation. Since exit must always be able to proceed when the killer thread kills a process, kernel threads upon which exit depends must never release their heap reservation. For example, exit may need to free the blocks of unlinked files when closing file descriptors and thus depends on the filesystem logging thread. Other kernel threads, like the ICMP packet processing thread, block and wait for heap reservations when needed and release them when idle.
最后一个需要特殊处理的领域是长期运行的内核线程。文件系统日志线程就是一个例子，它代表许多进程运行。每个长期运行的内核线程都有自己的内核堆预留。由于当杀手线程杀死一个进程时，exit 必须始终能够继续运行，因此 exit 所依赖的内核线程绝不能释放其堆保留。例如，exit 在关闭文件描述符时可能需要释放未链接文件的块，因此需要依赖文件系统日志记录线程。其他内核线程（如 ICMP 数据包处理线程）会在需要时阻塞和等待堆保留，并在空闲时释放它们。

6.3.4 Killer thread 6.3.4 杀手线程

The killer thread is woken up when a system call's reservation fails. The thread first starts a garbage collection and waits for it to complete. If the collection doesn't free enough memory, the killer thread asks each cache to free as many entries as possible, and collects again. If that doesn't yield enough free memory, the killer thread finds the process with the largest total number of mapped memory regions, file descriptors, and threads, in the assumption that it is a genuine bad citizen, kills it, and again collects. As soon as the killer thread sees that enough memory has been freed to satisfy the waiting reservation, it wakes up the waiting thread and goes back to sleep.
当系统调用的保留失败时，杀手线程就会被唤醒。该线程首先启动垃圾回收并等待完成。如果垃圾回收没有释放足够的内存，杀手线程会要求每个缓存释放尽可能多的条目，然后再次进行垃圾回收。如果还不能释放足够的内存，杀手线程就会找到映射内存区域、文件描述符和线程总数最多的进程，假定它是一个真正的坏公民，将其杀死，并再次进行收集。一旦 "杀手 "线程发现已释放的内存足以满足等待保留的需求，它就会唤醒等待线程并继续休眠。

6.4 Limitations 6.4 局限性

Biscuit's approach for handling heap exhaustion requires that the garbage collector run successfully when there is little or no free memory available. However, Go's garbage collector may need to allocate memory during a collection in order to make progress, particularly for the work stack of outstanding pointers to scan. We haven't implemented it, but Biscuit could recover from this situation by detecting when the work stack is full and falling back to using the mark bitmap as the work stack, scanning for objects which are marked but contain unmarked pointers. This strategy will allow the garbage collection to complete, but will likely be slow. We expect this situation to be rare since the work stack buffers can be preallocated for little cost: in our experiments, the garbage collector allocates at most

of the heap RAM for work stacks.
Biscuit 处理堆耗尽的方法要求垃圾收集器在可用内存很少或没有可用内存时成功运行。但是，Go 的垃圾收集器可能需要在收集过程中分配内存，以便取得进展，尤其是对工作栈中的未完成指针进行扫描。我们还没有实现它，但 Biscuit 可以通过检测工作栈是否已满来恢复这种情况，并退回到使用标记位图作为工作栈，扫描已标记但包含未标记指针的对象。这种策略将允许完成垃圾回收，但可能会比较慢。我们预计这种情况很少发生，因为工作堆栈缓冲区只需很少的成本即可预先分配：在我们的实验中，垃圾收集器最多分配

的堆 RAM 用于工作堆栈。

Because the Go collector doesn't move objects, it doesn't reduce fragmentation. Hence, there might be enough free memory but in fragments too small to satisfy a large allocation. To eliminate this risk, MaxLive should compute

for each size class of objects allocated during a system call. Our current implementation doesn't do this yet.
由于 Go 收集器不会移动对象，因此不会减少碎片。因此，可能会有足够的空闲内存，但碎片太小，无法满足大量分配的需要。为了消除这种风险，MaxLive 应该为系统调用期间分配的对象的每个大小类别计算

。我们目前的实现还没有做到这一点。

6.5 Heap exhaustion summary
6.5 堆耗尽总结

Biscuit borrows ideas for heap exhaustion from Linux: the killer thread, and the idea of waiting and retrying after the killer thread has produced free memory. Biscuit simplifies the situation by using reservation checks at the start of each system call, rather than Linux's failure checks at each allocation point; this means that Biscuit has less recovery code to back out of partial system calls, and can wait indefinitely for memory without fear of deadlock. Go's static analyzability helped automate Biscuit's simpler approach.
Biscuit 借鉴了 Linux 中有关堆耗尽的思想：杀手线程，以及在杀手线程产生空闲内存后等待和重试的思想。Biscuit 通过在每次系统调用开始时使用保留检查来简化这种情况，而不是像 Linux 那样在每个分配点都进行失败检查；这意味着 Biscuit 只需较少的恢复代码来退出部分系统调用，并且可以无限期地等待内存而不必担心死锁。Go 的静态分析能力帮助 Biscuit 实现了更简单方法的自动化。

7 Implementation 7 实施

The Biscuit kernel is written almost entirely in Go: Figure 3 shows that it has 27,583 lines of Go, 1,546 lines of assembly, and no C.
Biscuit 内核几乎完全由 Go 语言编写：图 3 显示，它有 27,583 行 Go 语言，1,546 行汇编语言，没有 C 语言。

Biscuit provides 58 system calls, listed in Figure 4. It has enough POSIX compatibility to run some existing server programs (for example, NGINX and Redis).
Biscuit 提供 58 个系统调用，如图 4 所示。它的 POSIX 兼容性足以运行一些现有的服务器程序（如 NGINX 和 Redis）。

Biscuit includes device drivers for AHCI SATA disk controllers and for Intel 82599-based Ethernet controllers such as the X540 10-gigabit NIC. Both drivers use DMA. The drivers use Go's unsafe. Pointer to access device registers and in-memory structures (such as DMA descriptors) defined by device hardware, and Go's atomic package to control the order of these accesses. The code
Biscuit 包含用于 AHCI SATA 磁盘控制器和基于 Intel 82599 的以太网控制器（如 X540 万兆网卡）的设备驱动程序。这两种驱动程序都使用 DMA。这些驱动程序使用 Go 的不安全指针来访问设备寄存器和由设备硬件定义的内存结构（如 DMA 描述符），并使用 Go 的原子包来控制这些访问的顺序。代码

Component 组件	Lang 朗	LOC
Biscuit kernel (mostly boot) 饼干内核（主要是引导）	asm	546
Biscuit kernel 饼干核	Go 转到
Core 核心		1700
Device drivers 设备驱动程序		4088
File system 文件系统	7338
Network 网络	4519
Other 其他	1105
Processes 流程	935
Reservations 预订		749
Syscalls 系统调用		5292
Virtual memory 虚拟内存		1857
Total 总计		27583
MaxLive	Go 转到	1299
Runtime modifications 运行时修改	asm	1,000
Runtime modifications 运行时修改	Go 转到	3,200

Figure 3: Lines of code in Biscuit. Not shown are about 50,000 lines of Go runtime and 32,000 lines of standard Go packages that Biscuit uses.
图 3：Biscuit 中的代码行数。图中未显示 Biscuit 使用的约 50,000 行 Go 运行时代码和 32,000 行标准 Go 软件包代码。
would be more concise if Go supported some kind of memory fence.
如果 Go 支持某种内存栅栏，就会更简洁。

Biscuit contains 90 uses of Go's "unsafe" routines (excluding uses in the Go runtime). These unsafe accesses parse and format packets, convert between physical page numbers and pointers, read and write user memory, and access hardware registers.
Biscuit 包含 Go 的 90 种 "不安全 "例程用法（不包括 Go 运行时中的用法）。这些不安全访问会解析和格式化数据包、在物理页码和指针之间进行转换、读写用户内存以及访问硬件寄存器。

We modified the Go runtime to record the number of bytes allocated by each goroutine (for heap reservations), to check for runnable device handler goroutines, and to increase the default stack size from 2 kB to 8 kB to avoid stack expansion for a few common system calls.
我们修改了 Go 运行时，以记录每个 goroutine 分配的字节数（用于堆保留），检查可运行的设备处理程序 goroutine，并将默认堆栈大小从 2 kB 增加到 8 kB，以避免一些常见系统调用的堆栈扩展。

Biscuit lives with some properties of the Go runtime and compiler in order to avoid significantly modifying them. The runtime does not turn interrupts off when holding locks or when manipulating a goroutine's own private state. Therefore, in order to avoid deadlock, Biscuit interrupt handlers just set a flag indicating that a device handler goroutine should wake up. Biscuit's timer interrupt handler cannot directly force goroutine context switches because the runtime might itself be in the middle of a context switch. Instead, Biscuit relies on Go's pre-emption mechanism for kernel goroutines (the Go compiler inserts pre-emption checks in the generated code). Timer interrupts do force context switches when they arrive from user space.
Biscuit 保留了 Go 运行时和编译器的某些特性，以避免对它们进行重大修改。当持有锁或操作 goroutine 自身的私有状态时，运行时不会关闭中断。因此，为了避免死锁，Biscuit 中断处理程序只需设置一个标志，指示设备处理程序 goroutine 应该唤醒。Biscuit 的定时器中断处理程序不能直接强制 goroutine 进行上下文切换，因为运行时本身可能正在进行上下文切换。相反，Biscuit 依赖 Go 的内核 goroutine 抢占机制（Go 编译器会在生成的代码中插入抢占检查）。来自用户空间的定时器中断会强制进行上下文切换。

Goroutine scheduling decisions and the context switch implementation live in the runtime, not in Biscuit. One consequence is that Biscuit does not control scheduling policy; it inherits the runtime's policy. Another consequence is that per-process page tables are not switched when switching goroutines, so Biscuit system call code cannot safely dereference user addresses directly. Instead, Biscuit explicitly translates user virtual addresses to physical addresses, and also explicitly checks page permissions.
Goroutine调度决策和上下文切换实现都在运行时中，而不是在 Biscuit 中。其后果之一是，Biscuit 无法控制调度策略，而是继承运行时的策略。另一个后果是，在切换程序时不会切换每个进程的页表，因此 Biscuit 系统调用代码无法安全地直接引用用户地址。相反，Biscuit 会将用户虚拟地址明确转换为物理地址，并明确检查页面权限。
Biscuit switches page tables if necessary before switching to user space.
在切换到用户空间之前，Biscuit 会根据需要切换页面表。

We modified the runtime in three ways to reduce delays due to garbage collection. First, we disabled the dedicated garbage collector worker threads so that application threads don't compete with garbage collector threads for CPU cycles. Second, we made root marking provide allocation credit so that an unlucky allocating thread wouldn't mark many roots all at once. Third, we reduced the size of the pieces that large objects are broken into for marking from 128 kB to 10 kB .
我们从三个方面修改了运行时，以减少垃圾收集导致的延迟。首先，我们禁用了专门的垃圾回收器工作线程，这样应用程序线程就不会与垃圾回收器线程争夺 CPU 周期。其次，我们让根标记提供分配积分，这样一个不走运的分配线程就不会一次标记很多根。第三，我们将用于标记的大对象碎片大小从 128 kB 减少到 10 kB。

Biscuit implements many standard kernel performance optimizations. For example, Biscuit maps the kernel text using large pages to reduce iTLB misses, uses per-CPU NIC transmit queues, and uses read-lock-free data structures in some performance critical code such as the directory cache and TCP polling. In general, we found that Go did not hinder optimizations.
Biscuit 实现了许多标准的内核性能优化。例如，Biscuit 使用大页面映射内核文本以减少 iTLB 错失，使用每 CPU NIC 发送队列，并在目录缓存和 TCP 轮询等一些性能关键代码中使用免读锁定数据结构。总的来说，我们发现 Go 并没有妨碍优化。

8 Evaluation 8 评估

This section analyzes the costs and benefits of writing a kernel in an HLL by exploring the following questions:
本节将通过探讨以下问题来分析用 HLL 编写内核的成本和收益：

To what degree does Biscuit benefit from Go's highlevel language features? To answer, we count and explain Biscuit's use of these features ( $§$ ).
Biscuit 在多大程度上受益于 Go 的高级语言特性？为了回答这个问题，我们将统计并解释 Biscuit 对这些特性的使用 ( $§$ )。
Do C kernels have safety bugs that a high-level language might mitigate? We evaluate whether bugs reported in Linux kernel CVEs would likely apply to Biscuit ( $§$ ).
C 内核是否存在高级语言可能缓解的安全漏洞？我们评估了 Linux 内核 CVE 中报告的错误是否可能适用于 Biscuit ( $§$ )。
How much performance does Biscuit pay for Go’s HLL features? We measure the time Biscuit spends in garbage collection, bounds checking, etc., and the delays that GC introduces $§$ .
Biscuit 为 Go 的 HLL 功能付出了多少性能代价？我们测量了 Biscuit 在垃圾回收、边界检查等方面花费的时间，以及 GC 带来的 $§$ 延迟。
What is the performance impact of using Go instead of C? We compare nearly-identical pipe and page-fault handler implementations in Go and ( $§$ ).
使用 Go 而不是 C 会对性能产生什么影响？我们比较了 Go 和 ( $§$ ) 中几乎相同的管道和页面故障处理程序实现。
Is Biscuit's performance in the same ballpark as Linux, a C kernel (§8.8)?
Biscuit 的性能是否与 C 内核 Linux 相同（第 8.8 节）？
Is Biscuit's reservation scheme effective at handling kernel heap exhaustion (§8.9)?
Biscuit 的保留方案是否能有效处理内核堆耗尽（§8.9）？
Can Biscuit benefit from RCU-like lock-free lookups $§$
Biscuit 能否从类似 RCU 的无锁查找 $§$ 中受益？

8.1 Biscuit's use of HLL features
8.1 Biscuit 对 HLL 功能的使用

Our subjective feeling is that Go has helped us produce clear code and helped reduce programming difficulty, primarily by abstracting and automating low-level tasks.
我们的主观感受是，主要通过抽象化和自动化底层任务，Go 帮助我们生成了清晰的代码，并降低了编程难度。

Figure 5 shows how often Biscuit uses Go's HLL features, and compares with two other major Go systems:
图 5 显示了 Biscuit 使用围棋 HLL 功能的频率，并与其他两个主要围棋系统进行了比较：

accept 接受	bind 约束	chdir 目录	close 关闭	connect 接通	dup2	execv	exit 出口
fcntl	fork 分叉	fstat	ftruncate 截断	futex 黄麻	getcwd	getpid	getppid
getrlimit	getrusage	getsockopt 获取选项	gettid	gettimeofday 获取时间	info 信息	kill 杀死	link 链接
listen 倾听	lseek	mkdir	mknod	mmap 移动	munmap 地图	nanosleep 纳米睡眠	open 开
pipe2 管道2	poll 民意调查	pread 预载	prof 概况	pwrite	read 阅读	readv	reboot 重新启动
recvfrom 从	recvmsg	rename 重命名	sendmsg	sendto 发送到	setrlimit 设置限制	setsockopt 设置时钟选项	shutdown 关闭
socket 插座	socketpair 插座对	stat 统计	sync 同步	threxit 暴发户	truncate 截形	unlink 取消链接	wait4 等待4
write 写道	writev 写

Figure 4: Biscuit's 58 system calls.
图 4：Biscuit 的 58 个系统调用。

Figure 5: Uses of Go HLL features in the Git repositories for Biscuit, Golang (

lines), and Moby (

lines) per 1,000 lines. For data types (such as slices), the numbers indicate the number of declarations of a variable, argument, or structure field of that type.
图 5：Biscuit、Golang（

行）和 Moby（

行）的 Git 仓库中每千行 Go HLL 功能的使用情况。对于数据类型（如切片），数字表示该类型的变量、参数或结构字段的声明次数。
the Golang repository (containing Go's compiler, runtime, and standard packages), and Moby

, which contains Docker's container software and is the most starred Go repository on Github at the time of writing. Figure 5 shows the number of times each feature is used per 1,000 lines of code. Biscuit uses Go's HLL features about as much as other Go systems software.
Golang 代码库（包含 Go 的编译器、运行时和标准软件包），以及 Moby

，其中包含 Docker 的容器软件，也是本文撰写时 Github 上星级最高的 Go 代码库。图 5 显示了每千行代码中每种特性的使用次数。Biscuit 使用 Go 的 HLL 功能的次数与其他 Go 系统软件差不多。

To give a sense how these HLL features can benefit a kernel, the rest of this section provides examples of successful uses, as well as situations where we didn't use them. Biscuit relies on the Go allocator and garbage collector for nearly all kernel objects. Biscuit has 302 statements that allocate an object from the GC-managed heap. Some of the objects are compound (composed of multiple Go objects). For example, Biscuit's Vmregion_t, which describes a mapped region of virtual memory, has a red/black tree of Vminfo_t, which itself is compound (e.g., when it is backed by a file). The garbage collector eliminates the need for explicit code to free the parts of such compound data types.
为了让大家了解这些 HLL 功能对内核的益处，本节的其余部分将提供成功使用的例子，以及我们没有使用这些功能的情况。几乎所有内核对象都依赖于 Go 分配器和垃圾回收器。Biscuit 有 302 条语句从 GC 管理的堆中分配对象。其中一些对象是复合对象（由多个 Go 对象组成）。例如，Biscuit 的 Vmregion_t（描述虚拟内存的映射区域）有一个 Vminfo_t 的红/黑树，而 Vminfo_t 本身就是复合对象（例如，当它由文件支持时）。垃圾回收器无需使用显式代码来释放此类复合数据类型的各个部分。

Biscuit's only special-purpose allocator is its physical page allocator. It is used for process memory pages, file cache pages, socket and pipe buffers, and page table pages.
Biscuit 唯一的特殊用途分配器是物理页面分配器。它用于进程内存页、文件缓存页、套接字和管道缓冲区以及页表页。

Biscuit uses many goroutines. For example, device drivers create long-running goroutines to handle events such as packet arrival. Biscuit avoids goroutine creation, however, in frequently executed code. The reason is that the garbage collector produces pauses proportional to the number of goroutines; these are insignificant for thousands of goroutines but a problem with hundreds of thousands.
Biscuit 使用许多 goroutines。例如，设备驱动程序会创建长时间运行的 goroutine 来处理数据包到达等事件。不过，Biscuit 避免在频繁执行的代码中创建 goroutine。原因是垃圾回收器会产生与 goroutines 数量成正比的停顿；对于数千个 goroutines 而言，这些停顿并不重要，但对于数十万个 goroutines 而言，停顿就成了问题。

The combination of threads and garbage collection is particularly pleasing, since it avoids forcing the programmer to worry about delaying frees for shared objects until the last sharing thread has finished. For example, Biscuit's poll system call installs a pointer to a helper object in each file descriptor being polled. When input arrives on a descriptor, the goroutine delivering the input uses the helper object to wake up the polling thread. Garbage collection eliminates races between arriving input and freeing the helper object.
线程与垃圾回收的结合尤其令人满意，因为它避免了迫使程序员担心将共享对象的释放延迟到最后一个共享线程结束之后。例如，Biscuit 的轮询系统调用会在每个被轮询的文件描述符中安装一个指向辅助对象的指针。当输入到达一个描述符时，提供输入的 goroutine 会使用辅助对象唤醒轮询线程。垃圾回收消除了输入到达和释放辅助对象之间的竞争。

Some Biscuit objects, when the last reference to them disappears, need to take clean-up actions before their memory can be collected; for example, TCP connections must run the TCP shutdown protocol. Go's finalizers were not convenient in these situations because of the prohibition against cycles among objects with finalizers. Biscuit maintains reference counts in objects that require clean-up actions.
有些 Biscuit 对象在最后一次引用消失后，需要在内存被回收前进行清理操作；例如，TCP 连接必须运行 TCP 关闭协议。在这种情况下，Go 的终结器并不方便，因为终结器禁止对象间的循环。Biscuit 会维护需要清理操作的对象中的引用计数。

Biscuit uses many standard Go packages. For example, Biscuit imports sync in 28 files and atomic packages in 18 files. These packages provide mutexes, condition variables, and low-level atomic memory primitives. Biscuit's MAXLIVE tool depends on Go's code analysis packages (ssa, callgraph, and pointer).
Biscuit 使用了许多标准 Go 软件包。例如，Biscuit 在 28 个文件中导入了同步包，在 18 个文件中导入了原子包。这些包提供了互斥、条件变量和底层原子内存原语。Biscuit 的 MAXLIVE 工具依赖于 Go 的代码分析包（ssa、callgraph 和 pointer）。

Biscuit itself is split into 31 Go packages. Packages allowed some code to be developed and tested in user space. For example, we tested the file system package for races and crash-safety in user space. The package system also made it easy to use the file system code to create boot disks.
Biscuit 本身分为 31 个 Go 软件包。软件包允许在用户空间开发和测试某些代码。例如，我们在用户空间测试了文件系统包的竞赛和崩溃安全性。软件包系统还可以方便地使用文件系统代码来创建启动盘。

8.2 Potential to reduce bugs
8.2 减少虫子的潜力

An HLL might help avoid problems such as memory corruption from buffer overflows. To see how this applies to kernels, we looked at Linux execute-code bugs in the CVE database published in 2017 [34]. There are 65 bugs
HLL 可能有助于避免缓冲区溢出造成的内存损坏等问题。为了了解这一点如何适用于内核，我们查看了 2017 年发布的 CVE 数据库中的 Linux 执行代码漏洞[34]。共有 65 个漏洞

Type 类型

CVE-...

使用后免费或双倍免费

Use-after-free

or double-free

2016-10290, 2016-10288, 2016-8480,

2016-8449, 2016-8436, 2016-8392,

2016-8391, 2016-6791

界外访问

Out-of-bounds

access

、

,	,
,	,	,
,	,	,
,	,	,
,	,	,
,	,	,
,	,	,
,	,	,
,	,	,
,	,	,
,

Figure 6: Linux kernel CVEs from 2017 that would not cause memory corruption, code execution, or information disclosure in Biscuit.
图 6：2017 年以来不会在 Biscuit 中导致内存破坏、代码执行或信息泄露的 Linux 内核 CVE。
where the patch is publicly available. For 11 bugs of the 65 , we aren't sure whether Go would have improved the outcome. 14 of the 65 are logic bugs that could arise as easily in Go as they do in C. Use of Go would have improved the outcome of the remaining 40 bugs (listed in Figure 6), based on manual inspection of the patch that fixed the bug. The impact of some of these 40 bugs is severe: several allow remote code execution or information disclosure. Many of the bugs in the out-of-bounds category would have resulted in runtime errors in Go, and caused a panic. This is not ideal, but better than allowing a code execution or information disclosure exploit. The bugs in the use-after-free category would not have occurred in Go, because garbage collection would obviate them.
在这里，补丁是公开的。对于 65 个错误中的 11 个，我们不确定 Go 是否会改善结果。65 个错误中有 14 个是逻辑错误，在 Go 语言中和在 C 语言中一样容易出现。根据对修复错误的补丁的人工检查，使用 Go 语言会改善其余 40 个错误（如图 6 所示）的结果。在这 40 个漏洞中，有些漏洞的影响非常严重：其中几个漏洞允许远程代码执行或信息泄露。许多越界类错误会导致 Go 运行时出错，并引起恐慌。这并不理想，但总比允许代码执行或信息泄露漏洞要好。在 Go 中不会出现 "无用后使用 "类别中的错误，因为垃圾回收可以避免这些错误。

The Go runtime and packages that Biscuit relies on also have bugs. There are 14 CVEs in Go published from 2016 to 2018. Two of them allow code execution (all in go get) and two allow information gain (due to bugs in Go's smtp and math/big packages).
Biscuit 依赖的 Go 运行时和软件包也存在漏洞。从 2016 年到 2018 年，Go 共发布了 14 个 CVE。其中两个允许执行代码（全部在 go get 中），两个允许获取信息（由于 Go 的 smtp 和 math/big 软件包中的 bug）。

8.3 Experimental Setup 8.3 实验装置

The performance experiments reported below were run on a four-core 2.8 GHz Xeon X3460 with hyper-threading disabled and 16 GB of memory. Biscuit uses Go version 1.10. Except where noted, the benchmarks use an inmemory file system, rather than a disk, in order to stress the CPU efficiency of the kernel. The in-memory file system is the same as the disk file system, except that it doesn't append disk blocks to the in-memory log or call the disk driver. The disk file system uses a Samsung 850 SSD.
下面报告的性能实验是在四核 2.8 GHz Xeon X3460（禁用超线程）和 16 GB 内存上运行的。Biscuit 使用 Go 1.10 版本。除特别注明外，基准测试均使用内存文件系统，而非磁盘，以强调内核的 CPU 效率。内存文件系统与磁盘文件系统相同，只是不向内存日志附加磁盘块，也不调用磁盘驱动程序。磁盘文件系统使用三星 850 SSD。

The network server benchmarks have a dedicated tengigabit Ethernet switch between a client and a server machine, with no other traffic. The machines use Intel X540 ten-gigabit network interfaces. The network interfaces use an interrupt coalescing period of

. The client runs Linux.
网络服务器基准测试在客户端和服务器机器之间有一个专用的万兆以太网交换机，没有其他流量。这些机器使用 Intel X540 万兆网络接口。网络接口使用

的中断凝聚周期。客户端运行 Linux。
Except when noted, Biscuit allocates 512MB of RAM to the kernel heap. The reported fraction of CPU time spent in the garbage collector is calculated as

, where

is the time to execute a benchmark with garbage collection and

is the time without garbage collection. To measure

, we reserve enough RAM for the kernel heap that the kernel doesn't run out of free memory and thus never collects. This method does not remove the cost to check, for each write, whether write barriers are enabled.
除注明外，Biscuit 为内核堆分配 512MB 内存。所报告的 CPU 在垃圾回收器中所用时间的比例计算公式为

，其中

是执行有垃圾回收器的基准所需的时间，

是无垃圾回收器的时间。为了测量

，我们为内核堆预留了足够的 RAM，这样内核就不会耗尽空闲内存，从而永远不会进行垃圾回收。这种方法并不能消除每次写入时检查是否启用写屏障的成本。

We report the average of three runs for all figures except maximums. Except when noted, each run lasts for one minute, and variation in repeated runs for all measurements is less than

.
除最大值外，我们报告了所有数据三次运行的平均值。除特别注明外，每次运行均持续一分钟，所有测量的重复运行变化均小于

。

Many of the performance experiments use three applications, all of which are kernel-intensive:
许多性能实验使用了三个应用程序，它们都是内核密集型的：

CMailbench CMailbench is a mail-server-like benchmark which stresses the virtual memory system via fork and exec. The benchmark runs four server processes and four associated clients, all on the same machine. For each message delivery, the client forks and execs a helper; the helper sends a 1660-byte message to its server over a UNIX-domain socket; the server forks and execs a delivery agent; the delivery agent writes the message to a new file in a separate directory for each server. Each message involves two calls to each of fork, exec, and rename as well as one or two calls to read, write, open, close, fstat, unlink, and stat.
CMailbench CMailbench 是一个类似于邮件服务器的基准测试，它通过 fork 和 exec 对虚拟内存系统施加压力。该基准测试在同一台机器上运行四个服务器进程和四个相关客户端。每次发送信息时，客户端分叉并执行一个辅助进程；辅助进程通过 UNIX 域套接字向服务器发送 1660 字节的信息；服务器分叉并执行一个发送代理；发送代理将信息写入每个服务器单独目录下的一个新文件。每条信息都涉及 fork、exec 和 rename 的两次调用，以及 read、write、open、close、fstat、unlink 和 stat 的一到两次调用。

NGINX NGINX [38] (version 1.11.5) is a highperformance web server. The server is configured with four processes, all of which listen on the same socket for TCP connections from clients. The server processes use poll to wait for input on multiple connections. NGINX's request log is disabled. A separate client machine keeps 64 requests in flight; each request involves a fresh TCP connection to the server. For each incoming connection, a server process parses the request, opens and reads a 612-byte file, sends the 612 bytes plus headers to the client, and closes the connection. All requests fetch the same file.
NGINX NGINX [38]（版本 1.11.5）是一个高性能网络服务器。服务器配置了四个进程，所有进程都在同一个套接字上监听来自客户端的 TCP 连接。服务器进程使用轮询来等待多个连接上的输入。NGINX 的请求日志被禁用。一台单独的客户端机器会保持 64 个请求在飞行中；每个请求都涉及到一个到服务器的新 TCP 连接。对于每个传入连接，服务器进程都会对请求进行解析，打开并读取一个 612 字节的文件，将 612 字节加标题发送给客户端，然后关闭连接。所有请求都获取同一个文件。

Redis Redis (version 3.0.5) is an in-memory key/value database. We modified it to use poll instead of select (since Biscuit doesn't support select). The benchmark runs four single-threaded Redis server processes. A client machine generates load over the network using two instances of Redis's "redis-benchmark" per Redis server process, each of which opens 100 connections to the Redis process and keeps a single GET outstanding on each connection. Each GET requests one of 10,000 keys at random. The values are two bytes.
Redis Redis（3.0.5 版）是一个内存键/值数据库。我们对其进行了修改，使其使用 poll 代替 select（因为 Biscuit 不支持 select）。该基准测试运行四个单线程 Redis 服务器进程。每个 Redis 服务器进程使用两个 Redis 的 "redis-benchmark "实例，客户端机器通过网络产生负载，每个实例向 Redis 进程打开 100 个连接，并在每个连接上保持一个未完成的 GET。每次 GET 都会随机请求 10,000 个密钥中的一个。键值为两个字节。

8.4 HLL tax 8.4 HLL 税收

This section investigates the performance costs of Go's HLL features for the three applications. Figure 7 shows the results.
本节将研究 Go 的 HLL 功能在三个应用中的性能成本。图 7 显示了结果。

The "Tput" column shows throughput in application requests per second.
Tput "列显示的是以每秒应用程序请求为单位的吞吐量。

The "Kernel time" column (fraction of time spent in the kernel, rather than in user space) shows that the results are dominated by kernel activity. All of the benchmarks keep all four cores

busy.
内核时间 "列（用于内核而非用户空间的时间比例）显示，结果主要由内核活动决定。所有基准测试都使所有四个内核

处于忙碌状态。

The applications cause Biscuit to average between 18 and 48 MB of live data in the kernel heap. They allocate transient objects fast enough to trigger dozens of collections during each benchmark run ("GCs"). These collections use between

and

of the total CPU time.
这些应用导致 Biscuit 在内核堆中的平均实时数据量在 18 到 48 MB 之间。它们分配瞬态对象的速度足以在每次基准运行期间触发数十次集合（"GC"）。这些集合使用的 CPU 总时间在

和

之间。
"Prologue cycles" are the fraction of total time used by compiler-generated code at the start of each function that checks whether the stack must be expanded, and whether the garbage collector needs a stop-the-world pause. "WB cycles" reflect compiler-generated write-barriers that take special action when an object is modified during a concurrent garbage collection.
"序幕周期 "是编译器生成的代码在每个函数开始时使用的总时间的一部分，这些代码检查堆栈是否必须展开，以及垃圾收集器是否需要 "停止-世界 "暂停。"WB 周期 "反映的是编译器生成的写入障碍，当对象在并发垃圾收集过程中被修改时，编译器会采取特殊措施。
"Safety cycles" reports the cost of runtime checks for nil pointers, array and slice bounds, divide by zero, and incorrect dynamic casts. These checks occur throughout the compiler output; we wrote a tool that finds them in the Biscuit binary and cross-references them with CPU time profiles.
"安全周期 "报告了运行时对零指针、数组和片边界、除以零和不正确的动态转换进行检查的成本。这些检查出现在整个编译器输出中；我们编写了一个工具，可以在 Biscuit 二进制文件中找到这些检查，并将其与 CPU 时间曲线进行交叉引用。
"Alloc cycles" measures the time spent in the Go allocator, examining free lists to satisfy allocation requests (but not including concurrent collection work). Allocation is not an HLL-specific task, but it is one that some C kernels streamline with custom allocators [8].
"分配周期 "衡量的是 Go 分配器中花费的时间，即检查空闲列表以满足分配请求的时间（但不包括并发收集工作）。分配并不是 HLL 特有的任务，但有些 C 内核会使用自定义分配器来简化这项任务 [8]。

Figure 7 shows that the function prologues are the most expensive HLL feature. Garbage collection costs are noticeable but not the largest of the costs. On the other hand,

§

shows that collection cost grows with the amount of live data, and it seems likely that prologue costs could be reduced.
图 7 显示，函数序言是最昂贵的 HLL 功能。垃圾收集成本很明显，但不是最大的成本。另一方面，

§

显示，收集成本会随着实时数据量的增加而增加，因此似乎可以降低序幕成本。

8.5 GC delays 8.5 GC 延误

We measured the delays caused by garbage collection (including interleaved concurrent work) during the execution of NGINX, aggregated by allocator call, system call, and NGINX request.
我们按分配器调用、系统调用和 NGINX 请求汇总，测量了 NGINX 执行期间垃圾回收（包括交错并发工作）造成的延迟。

of heap allocator calls are delayed by collection work. Of the delayed allocator calls, the average delay is 0.9 microseconds, and the worst case is 115 microseconds, due to marking a large portion of the TCP connection hashtable.

的堆分配器调用因收集工作而延迟。在延迟的分配器调用中，平均延迟为 0.9 微秒，最差情况为 115 微秒，原因是标记了大部分 TCP 连接的 hashtable。

of system calls are delayed by collection work; of the delayed system calls, the average delay is 1.5 mi croseconds, and the worst case is 574 microseconds, incurred by a poll system call that involved 25 allocator calls that performed collection work.

的系统调用因收集工作而延迟；在延迟的系统调用中，平均延迟时间为 1.5 毫秒，最糟糕的情况为 574 微秒，这是一个轮询系统调用造成的，该调用涉及 25 个执行收集工作的分配器调用。

of NGINX web requests are delayed by collection work. Of the delayed requests, the average total collection delay is 1.8 microseconds (out of an average request processing time of 45 microseconds). Less than

of requests spend more than 100 microseconds garbage collecting. The worst case is 582 microseconds, which includes the worst-case system call described above.

的 NGINX Web 请求因收集工作而延迟。在延迟的请求中，平均总收集延迟为 1.8 微秒（平均请求处理时间为 45 微秒）。不到

个请求的垃圾收集时间超过 100 微秒。最糟糕的情况是 582 微秒，其中包括上述最糟糕的系统调用。

8.6 Sensitivity to heap size
8.6 对堆大小的敏感性

A potential problem with garbage collection is that it consumes a fraction of CPU time proportional to the "headroom ratio" between the amount of live data and the amount of RAM allocated to the heap. This section explores the effect of headroom on collection cost.
垃圾收集的一个潜在问题是，它消耗的 CPU 时间与实时数据量和分配给堆的 RAM 量之间的 "余量比 "成正比。本节将探讨余量对收集成本的影响。

This experiment uses the CMailbench benchmark. We artificially increased the live data by inserting two or four million vnodes ( 640 or 1280 MB of live data) into Biscuit's vnode cache. We varied the amount of RAM allocated to the kernel heap.
本实验使用 CMailbench 基准。我们在 Biscuit 的 vnode 缓存中插入了 200 万或 400 万个 vnode（640 或 1280 MB 的实时数据），从而人为地增加了实时数据。我们改变了分配给内核堆的 RAM 量。

Figure 8 shows the results. The two most significant columns are "Headroom ratio" and "GC%;" together they show roughly the expected relationship. For example, comparing the second and last table rows shows that increasing both live data and total heap RAM, so that the ratio remains the same, does not change the fraction of CPU time spent collecting; the reason is that the increased absolute amount of headroom decreases collection frequency, but that is offset by the fact that doubling the live data doubles the cost of each individual collection.
图 8 显示了结果。其中最重要的两列是 "净空比率 "和 "GC%"；它们大致显示了预期的关系。例如，比较表中的第二行和最后一行可以发现，增加实时数据和堆 RAM 总量，使比率保持不变，并不会改变 CPU 用于收集的时间比例；原因是净空绝对量的增加会降低收集频率，但实时数据增加一倍会使每次收集的成本增加一倍，这就抵消了净空绝对量的增加。

In summary, while the benchmarks in

§

/ Figure 7 incur modest collection costs, a kernel heap with millions of live objects but limited heap RAM might spend a significant fraction of its time collecting. We expect that decisions about how much RAM to buy for busy machines would include a small multiple (2 or 3 ) of the expected peak kernel heap live data size.
总之，虽然

§

/ 图 7 中的基准会产生适度的收集成本，但拥有数百万个实时对象但堆 RAM 有限的内核堆可能会花费大量时间进行收集。我们认为，在决定为繁忙的机器购买多少 RAM 时，应考虑内核堆实时数据的预期峰值大小的一个小倍数（2 或 3）。

8.7 Go versus 8.7 围棋对

This section compares the performance of code paths in C and Go that are nearly identical except for language. The goal is to focus on the impact of language choice on performance for kernel code. The benchmarks involve a small amount of code because of the need to ensure that the C and Go versions are very similar.
本节将比较除语言外几乎完全相同的 C 和 Go 代码路径的性能。我们的目标是关注语言选择对内核代码性能的影响。由于需要确保 C 和 Go 版本非常相似，因此基准测试只涉及少量代码。

The code paths are embedded in Biscuit (for Go) and Linux (for C). We modified both to ensure that the kernel code paths exercised by the benchmarks are nearly identical. We disabled Linux's kernel page-table isolation, retpoline, address space randomization, transparent hugepages, hardened usercopy, cgroup, fair group, and bandwidth scheduling, scheduling statistics, ftrace, kprobes, and paravirtualization to make its code paths similar to Biscuit. We also disabled Linux's FS notifications, atime and mtime updates to pipes, and replaced Linux's
代码路径被嵌入到 Biscuit（针对 Go）和 Linux（针对 C）中。我们对两者都进行了修改，以确保基准所使用的内核代码路径几乎相同。我们禁用了 Linux 的内核页表隔离、retpoline、地址空间随机化、透明 hugepages、加固 usercopy、cgroup、fair group 和带宽调度、调度统计、ftrace、kprobes 和准虚拟化，以使其代码路径与 Biscuit 相似。我们还禁用了 Linux 的 FS 通知、atime 和 mtime 更新管道，并替换了 Linux 的

Tput

内核时间

Kernel

time

实时数据

Live

data

GCs

气相色谱循环

cycles

序幕周期

Prologue

cycles

WB 周期

cycles

安全周期

Safety

cycles

分配周期

Alloc

cycles

CMailbench

15,862

34 MB

NGINX

88,592

48 MB

Redis

711,792

18 MB

Figure 7: Measured costs of HLL features in Biscuit for three kernel-intensive benchmarks. "Alloc cycles" are not an HLL-specific cost, since C code has significant allocation costs as well.
图 7：Biscuit 中 HLL 功能在三个内核密集型基准测试中的测量成本。"分配周期 "不是 HLL 特有的成本，因为 C 代码也有大量的分配成本。

现场 (MB)

Live

(MB)

总计（MB）

Total

(MB)

净空比

Headroom

ratio

吞吐量（毫秒/秒）

Tput

(msg/s)

GC%

GCs

640

960

0.66

10,448

640

1280

0.50

12,848

640

1920

0.33

14,430

1280

2560

0.50

13,041

	Biscuit 饼干	Linux 利纳克斯	Ratio 比率
CMailbench (mem)	15,862	17,034	1.07
CMailbench (SSD) CMailbench（固态硬盘）	254	252	0.99
NGINX	88,592	94,492	1.07
Redis	711,792	775,317	1.09

Figure 8: CMailbench throughput on Biscuit with different kernel heap sizes. The columns indicate live heap memory; RAM allocated to the heap; the ratio of live heap memory to heap RAM; the benchmark's throughput on Biscuit; the fraction of CPU cycles (over all four cores) spent garbage collecting; and the number of collections.
图 8：不同内核堆大小的 CMailbench 在 Biscuit 上的吞吐量。列中显示了实时堆内存、分配给堆的 RAM、实时堆内存与堆 RAM 的比率、基准在 Biscuit 上的吞吐量、用于垃圾收集的 CPU 周期分数（所有四个内核）以及收集次数。
scheduler and page allocator with simple versions, like Biscuit's. The benchmarks allocate no heap memory in steady-state, so Biscuit's garbage collector is not invoked.
调度器和页面分配器的简单版本，如 Biscuit。这些基准在稳定状态下不分配堆内存，因此不会调用 Biscuit 的垃圾回收器。

8.7.1 Ping-pong 8.7.1 乒乓球

The first benchmark is "ping-pong" over a pair of pipes between two user processes. Each process takes turns performing five-byte reads and writes to the other process. Both processes are pinned to the same CPU in order to require the kernel to context switch between them. The benchmark exercises core kernel tasks: system calls, sleep/wakeup, and context switch.
第一个基准测试是两个用户进程通过一对管道进行 "乒乓球 "比赛。每个进程轮流对另一个进程执行五个字节的读写操作。两个进程被固定在同一个 CPU 上，以便要求内核在它们之间进行上下文切换。该基准测试演练了内核的核心任务：系统调用、睡眠/唤醒和上下文切换。

We manually verified the similarity of the steady-state kernel code paths (1,200 lines for Go, 1,786 lines for C, including many comments and macros which compile to nothing). The CPU-time profiles for the two showed that time was spent in near-identical ways. The ten most expensive instructions match: saving and restoring SSE registers on context switch, entering and exiting the kernel, wrmsr to restore the thread-local-storage register, the copy to/from user memory, atomic instructions for locks, and swapgs.
我们手动验证了稳态内核代码路径的相似性（Go 1,200 行，C 1,786 行，包括许多编译为空的注释和宏）。两者的 CPU 时间曲线显示，时间的花费方式几乎完全相同。最耗时的十条指令是：上下文切换时保存和恢复 SSE 寄存器、进入和退出内核、恢复线程本地存储寄存器的 wrmsr、复制到/从用户内存、锁的原子指令和 swapgs。

The results are 465,811 round-trips/second for Go and

second for C ; thus C is

faster than Go on this benchmark. The benchmark spends

and

of its time in the kernel (as opposed to user space) for Go and C , respectively. A round trip takes 5,259 instructions for Go and 4,540 for C . Most of the difference is due to HLL features: 250, 200, 144, and 112 instructions per round-trip for stack expansion prologues, write barrier, bounds, and nil pointer/type checks, respectively.
结果是 Go 的往返次数为 465,811 次/秒，而 C 的往返次数为

秒；因此在该基准测试中，C 比 Go 快

。对于 Go 和 C，该基准测试在内核（而非用户空间）中花费的时间分别为

和

。Go 一次往返需要 5,259 条指令，而 C 一次往返需要 4,540 条指令。大部分差异是由于 HLL 功能造成的：堆栈扩展序言、写入障碍、边界和零指针/类型检查每次往返分别需要 250、200、144 和 112 条指令。

Figure 9: Application throughput of Biscuit and Linux. "Ratio" is the Linux to Biscuit throughput ratio.
图 9：Biscuit 和 Linux 的应用程序吞吐量。"比率 "是 Linux 与 Biscuit 的吞吐量比率。

8.7.2 Page-faults 8.7.2 页面故障

The second Go-versus-C benchmark is a user-space program that repeatedly calls mmap ( ) to map 4 MB of zerofill-on-demand 4096-byte pages, writes a byte on each page, and then unmaps the memory. Both kernels initially map the pages lazily, so that each write generates a page fault, in which the kernel allocates a physical page, zeroes it, adds it to the process page table, and returns. We ran the benchmark on a single CPU on Biscuit and Linux and recorded the average number of page-faults per second.
第二个 Go-versus-C 基准是一个用户空间程序，它重复调用 mmap ( ) 映射 4 MB 的按需零填充 4096 字节页面，在每个页面上写入一个字节，然后取消映射内存。两个内核最初都是懒散地映射页面，因此每次写入都会产生页面错误，内核会分配一个物理页面，将其清零，添加到进程页表中，然后返回。我们在 Biscuit 和 Linux 的单 CPU 上运行了该基准测试，并记录了每秒发生的平均页故障次数。

We manually verified the similarity of the steady-state kernel code: there are about 480 and 650 lines of code for Biscuit and Linux, respectively. The benchmark spends nearly the same amount of time in the kernel on both kernels (

on Biscuit and

on Linux). We verified with CPU-time profiles that the top five most expensive instructions match: entering the kernel on the page-fault, zeroing the newly allocated page, the userspace store after handling the fault, saving registers, and atomics for locks.
我们手动验证了稳态内核代码的相似性：Biscuit 和 Linux 的内核代码分别约为 480 行和 650 行。在两个内核上，基准测试在内核中花费的时间几乎相同（Biscuit 上为

，Linux 上为

）。我们通过 CPU 时间曲线验证了最昂贵的前五条指令：在页面故障时进入内核、将新分配的页面清零、处理故障后的用户空间存储、保存寄存器和锁的原子指令。

The results are 731 nanoseconds per page-fault for Go and 695 nanoseconds for C ; C is

faster on this benchmark. The two implementations spend much of their time in three ways: entering the kernel's page-fault handler, zeroing the newly allocated page, and returning to userspace. These operations use

, and

of CPU cycles for Biscuit and

, and

of CPU cycles for Linux, respectively.
Go 的结果是每个页面故障 731 纳秒，而 C 的结果是 695 纳秒；在这一基准测试中，C 更快

。这两种实现在三个方面花费了大量时间：进入内核的页面故障处理程序、将新分配的页面清零以及返回用户空间。这些操作分别使用了 Biscuit 的

和

个 CPU 周期，以及 Linux 的

和

个 CPU 周期。

These results give a feel for performance differences due just to choice of language. They don't involve garbage collection; for that, see

§

and

§

.
通过这些结果，我们可以感受到语言选择所带来的性能差异。它们不涉及垃圾回收；有关垃圾回收，请参阅

§

和

§

。

8.8 Biscuit versus Linux 8.8 饼干与 Linux

To get a sense of whether Biscuit's performance is in the same ballpark as a high-performance C kernel, we report the performance of Linux on the three applications of

§

. The applications make the same system calls on Linux and on Biscuit. These results cannot be used to conclude
为了了解 Biscuit 的性能是否与高性能 C 内核处于同一水平，我们报告了 Linux 在

§

的三个应用程序上的性能。这些应用程序在 Linux 和 Biscuit 上进行相同的系统调用。这些结果不能用来得出结论

Figure 10: The amount of live data (in red) in the kernel heap during the first 35 seconds of the heap exhaustion experiment. The blue line indicates the RAM allocated to the kernel heap ( 512 MB ). The four vertical black lines indicate the points at which the killer thread killed the abusive child process.
图 10：在堆耗尽实验的前 35 秒内，内核堆中的实时数据量（红色）。蓝线表示分配给内核堆的 RAM（512 MB）。四条垂直黑线表示杀手线程杀死滥用子进程的时间点。
much about performance differences due to Biscuit's use of Go, since Linux includes many features that Biscuit omits, and Linux may sacrifice some performance on these benchmarks in return for better performance in other situations (e.g., large core counts or NUMA).
因为 Linux 包含许多 Biscuit 所省略的功能，而且 Linux 在这些基准测试中可能会牺牲一些性能，以换取在其他情况下（如大内核数或 NUMA）更好的性能。

We use Debian 9.4 with Linux kernel 4.9.82. We increased Linux's performance by disabling some costly features: kernel page-table isolation, retpoline, address space randomization, transparent hugepages, TCP selective ACKs, and SYN cookies. We replaced glibc with musl (nearly doubling the performance of CMailbench on Linux) and pinned the application threads to CPUs when it improves the benchmark's performance. We ran CMailbench in two configurations: one using an in-memory file system and the other using an SSD file system (tmpfs and ext- 4 on Linux, respectively). The benchmarks use

of all cores on both Biscuit and Linux, except for CMailbench (SSD), which is bottlenecked by the SSD. The proportion of time each benchmark spends in the kernel on Linux is nearly the same as on Biscuit (differing by at most two percentage points).
我们使用的是 Debian 9.4 和 Linux 内核 4.9.82。我们通过禁用一些代价高昂的功能来提高 Linux 的性能：内核页表隔离、retpoline、地址空间随机化、透明 hugepages、TCP 选择性 ACK 和 SYN cookies。我们用 musl 替换了 glibc（几乎将 CMailbench 在 Linux 上的性能提高了一倍），并在能提高基准性能时将应用程序线程固定在 CPU 上。我们在两种配置下运行 CMailbench：一种使用内存文件系统，另一种使用 SSD 文件系统（Linux 上分别为 tmpfs 和 ext-4）。除了 CMailbench (SSD) 受到 SSD 的瓶颈限制外，其他基准都使用了 Biscuit 和 Linux 上所有内核的

。每个基准在 Linux 内核中花费的时间比例与在 Biscuit 上几乎相同（最多相差两个百分点）。

Figure 9 presents the results: Linux achieves up to 10% better performance than Biscuit. The "HLL taxes" identified in

§

contribute to the results, but the difference in performance is most likely due to the fact that the two kernels have different designs and amounts of functionality. It took effort to make Biscuit achieve this level of performance. Most of the work was in understanding why Linux was more efficient than Biscuit, and then implementing similar optimizations in Biscuit. These optimizations had little to do with the choice of language, but were for the most part standard kernel optimizations (e.g., avoiding lock contention, avoiding TLB flushes, using better data structures, adding caches).
图 9 显示了结果：Linux 的性能比 Biscuit 高出 10%。

§

中确定的 "HLL 税 "对结果有所影响，但性能差异很可能是由于两个内核的设计和功能数量不同。要使 Biscuit 达到这样的性能水平，我们付出了很多努力。大部分工作是了解 Linux 为什么比 Biscuit 更高效，然后在 Biscuit 中实施类似的优化。这些优化与语言的选择关系不大，大部分都是标准的内核优化（例如，避免锁争用、避免 TLB 刷新、使用更好的数据结构、添加缓存）。

8.9 Handling kernel heap exhaustion
8.9 处理内核堆耗尽

This experiment demonstrates two things. First, that the system calls of a good citizen process do not fail when executing concurrently with an application that tries to exhaust the kernel heap. Second, that Biscuit's heap RAM reservations aren't too conservative: that the reservations allow most of the heap RAM to be used before forcing system calls to wait.
这个实验证明了两件事。首先，良好公民进程的系统调用在与试图耗尽内核堆的应用程序同时执行时不会失败。其次，Biscuit 的堆 RAM 保留并不是太保守：保留允许在迫使系统调用等待之前使用大部分堆 RAM。

The experiment involves two programs. An abusive program repeatedly forks a child and waits for it. The child creates many non-contiguous memory mappings, which cause the kernel to allocate many heap objects describing the mappings. These objects eventually cause the kernel heap to approach fullness, at which point the out-of-memory killer kills the child. Meanwhile, a wellbehaved program behaves like a UNIX mail daemon: it repeatedly delivers dozens of messages and then sleeps for a few seconds. This process complains and exits if any of its system calls returns an unexpected error. The kernel has 512 MB of RAM allocated to its heap. The programs run for 25 minutes, and we record the amount of live data in the kernel heap at the end of every garbage collection.
实验涉及两个程序。一个滥用程序反复分叉一个子程序并等待它。子程序创建了许多非连续内存映射，导致内核分配了许多描述这些映射的堆对象。这些对象最终导致内核堆接近饱和，这时内存不足杀手就会杀死子程序。与此同时，一个行为良好的程序会像 UNIX 邮件守护进程一样：反复发送几十条信息，然后休眠几秒钟。如果任何系统调用返回意外错误，该进程就会抱怨并退出。内核为其堆分配了 512 MB 内存。程序运行 25 分钟，我们在每次垃圾回收结束时记录内核堆中的实时数据量。

Figure 10 shows the first 35 seconds of the experiment. Each red cross indicates the amount of live kernel heap data after a GC. The blue line at the top corresponds to 512 MB . The four vertical lines show the times at which the out-of-memory killer killed the abusive program's child process.
图 10 显示了实验的前 35 秒。每个红叉表示 GC 后内核堆的实时数据量。顶部的蓝线对应 512 MB。四条垂直线表示内存不足杀手杀死滥用程序子进程的时间。

Biscuit allows the live data in its heap to grow to about 500 MB , or

of the heap RAM. The main reason that live data does not reach 512 MB is that the reservation for the file system logger thread is 6 MB , more than the thread actually uses. When the child is killed, it takes a couple seconds to release the kernel heap objects describing its many virtual memory mappings. The system calls of the good citizen process wait for reservations hundreds of thousands of times, but none return an error.
Biscuit 允许堆中的实时数据增长到约 500 MB，即堆 RAM 的

。实时数据未达到 512 MB 的主要原因是，文件系统记录器线程的预留空间为 6 MB，超过了线程的实际使用量。当子进程被杀死时，需要几秒钟的时间来释放描述其众多虚拟内存映射的内核堆对象。好公民进程的系统调用等待保留的次数多达数十万次，但没有一次返回错误。

8.10 Lock-free lookups 8.10 无锁查找

This section explores whether read-lock-free data structures in Go increase parallel performance.
本节将探讨 Go 中的免读锁定数据结构是否能提高并行性能。

C kernels often use read-lock-free data structures to increase performance when multiple cores read the data. The goal is to allow reads without locking or dirtying cache lines, both of which are expensive when there is contention. However, safely deleting objects from a data structure with lock-free readers requires the deleter to defer freeing memory that a thread might still be reading. Linux uses read-copy update ( RCU ) to delay such frees, typically until all cores have performed a thread context switch; coupled with a rule that readers not hold references across context switch, this ensures safety [32, 33]. Linux's full set of RCU rules is complex; see "Review Checklist for RCU patches" [31].
当多个内核读取数据时，C 内核通常使用免读锁定数据结构来提高性能。这样做的目的是允许在不锁定或弄脏缓存行的情况下读取数据，而这两种情况在出现争用时都很昂贵。不过，要安全地从使用无锁读取器的数据结构中删除对象，需要删除者推迟释放线程可能仍在读取的内存。Linux 使用读拷贝更新（RCU）来延迟此类释放，通常要等到所有内核都执行了线程上下文切换之后；再加上读取器在上下文切换时不得持有引用的规则，这就确保了安全性[32, 33]。Linux 的整套 RCU 规则非常复杂，请参阅 "RCU 补丁审查清单"[31]。

Directory cache 目录缓存	Tput
Lock-free lookups 无锁查找
Read-locked lookups 读锁定查找

Figure 11: The performance of CMailbench with two versions of Biscuit's directory cache, one read-lock-free and one using read locks.
图 11：CMailbench 在使用两个版本的 Biscuit 目录缓存（一个无读锁版本和一个使用读锁版本）时的性能。

Garbage collection automates the freeing decision, simplifying use of read-lock-free data structures and increasing the set of situations in which they can safely be used (e.g. across context switches). However, HLLs and garbage collection add their own overheads, so it is worth exploring whether read-lock-free data structures nevertheless increase performance.
垃圾回收自动执行释放决定，简化了免读锁数据结构的使用，并增加了可安全使用免读锁数据结构的情况（如跨上下文切换）。不过，HLL 和垃圾回收会增加各自的开销，因此值得探讨的是，免读锁数据结构是否仍能提高性能。

In order to explore this question, we wrote two variants of a directory cache for Biscuit, one that is read-lock-free and one with read-locks. Both versions use an array of buckets as a hash table, each bucket containing a singlylinked list of elements. Insert and delete lock the relevant bucket, create new versions of list elements to be inserted or updated, and modify next pointers to refer to the new elements. The read-lock-free version of lookup simply traverses the linked list.

The read-locked version first read-locks the bucket (forbidding writers but allowing other readers) and then traverses the list. We use CMailbench for the benchmark since it stresses creation and deletion of entries in the directory cache. The file system is in-memory, so there is no disk I/O.
为了探讨这个问题，我们为 Biscuit 编写了两种不同的目录缓存，一种是无读锁缓存，另一种是有读锁缓存。这两个版本都将一个桶数组用作哈希表，每个桶包含一个单链路元素列表。插入和删除会锁定相关的桶，创建要插入或更新的列表元素的新版本，并修改下一个指针以引用新元素。无读取锁版本的查找只需遍历链接列表即可。

读锁定版本首先对桶进行读锁定（禁止写入，但允许其他读取），然后遍历列表。我们使用 CMailbench 作为基准，因为它强调在目录缓存中创建和删除条目。文件系统位于内存中，因此没有磁盘 I/O。

Figure 11 shows the throughput of CMailbench using the read-lock-free directory cache and the read-locked directory cache. The read-lock-free version provides an

throughput increase: use of Go does not eliminate the performance advantage of read-lock-free data in this example.
图 11 显示了 CMailbench 使用无读锁目录缓存和有读锁目录缓存时的吞吐量。免读锁定版本的吞吐量提高了

：在此示例中，使用 Go 并没有消除免读锁定数据的性能优势。

9 Discussion and future work
9 讨论与未来工作

Should one write a kernel in Go or in C? We have no simple answer, but we can make a few observations. For existing large kernels in C , the programming cost of conversion to Go would likely outweigh the benefits, particularly considering investment in expertise, ecosystem, and development process. The question makes more sense for new kernels and similar projects such as VMMs.
应该用 Go 语言还是 C 语言编写内核？我们没有简单的答案，但可以提出一些看法。对于现有的 C 语言大型内核，转换为 Go 语言的编程成本很可能会超过收益，尤其是考虑到在专业知识、生态系统和开发流程方面的投资。对于新内核和类似项目（如 VMM）来说，这个问题更有意义。

If a primary goal is avoiding common security pitfalls, then Go helps by avoiding some classes of security bugs (see

§

). If the goal is to experiment with OS ideas, then Go's HLL features may help rapid exploration of different designs (see

§

). If CPU performance is paramount, then C is the right answer, since it is faster

§ §

. If efficient memory use is vital, then C is also the right
如果主要目标是避免常见的安全隐患，那么 Go 可以避免某些类别的安全 bug（参见

§

）。如果目标是尝试操作系统的想法，那么 Go 的 HLL 功能可能有助于快速探索不同的设计（参见

§

）。如果 CPU 性能是最重要的，那么 C 就是正确答案，因为它更快

§ §

。如果高效使用内存至关重要，那么 C 也是正确答案。

answer: Go's garbage collector needs a factor of 2 to 3 of heap headroom to run efficiently (see

§

).
答案是Go 的垃圾回收器需要 2 到 3 倍的堆净空才能高效运行（参见

§

）。

We have found Go effective and pleasant for kernel development. Biscuit's performance on OS-intensive applications is good (about

as fast as Linux). Achieving this performance usually involved implementing the right optimizations; Go versus C was rarely an issue.
我们发现，Go 对内核开发来说既有效又令人愉快。Biscuit 在操作系统密集型应用程序上的性能很好（大约

和 Linux 一样快）。要达到这样的性能，通常需要实施正确的优化；Go 与 C 的比较很少成为问题。

An HLL other than Go might change these considerations. A language without a compiler as good as Go's, or whose design was more removed from the underlying machine, might perform less well. On the other hand, a language such as Rust that avoids garbage collection might provide higher performance as well as safety, though perhaps at some cost in programmability for threaded code.
Go 之外的 HLL 可能会改变这些考虑因素。如果一种语言没有像 Go 那样优秀的编译器，或者其设计更加脱离底层机器，那么其性能可能会大打折扣。另一方面，像 Rust 这样避免垃圾回收的语言，可能会提供更高的性能和安全性，尽管可能会在线程代码的可编程性方面付出一些代价。

There are some Biscuit-specific issues we would like to explore further. We would like Biscuit to expand and contract the RAM used for the heap dynamically. We would like to modify the Go runtime to allow Biscuit to control scheduling policies. We would like to scale Biscuit to larger numbers of cores. Finally, we would like to explore if Biscuit's heap reservation scheme could simplify the implementation of C kernels.
我们希望进一步探讨一些 Biscuit 特有的问题。我们希望 Biscuit 能够动态扩展和收缩堆所使用的 RAM。我们希望修改 Go 运行时，允许 Biscuit 控制调度策略。我们希望将 Biscuit 扩展到更多内核。最后，我们希望探索 Biscuit 的堆预留方案能否简化 C 内核的实现。

10 Conclusions 10 结论

Our subjective experience using Go to implement the Biscuit kernel has been positive. Go's high-level language features are helpful in the context of a kernel. Examination of historical Linux kernel bugs due to C suggests that a type- and memory-safe language such as Go might avoid real-world bugs, or handle them more cleanly than C . The ability to statically analyze Go helped us implement defenses against kernel heap exhaustion, a traditionally difficult task.
我们使用 Go 实现 Biscuit 内核的主观体验是积极的。Go 的高级语言特性对内核很有帮助。对历史上由 C 语言引起的 Linux 内核错误的研究表明，Go 这样的类型和内存安全语言可以避免现实世界中的错误，或者比 C 语言更干净地处理这些错误。静态分析 Go 的能力帮助我们实现了对内核堆耗尽的防御，这是一项传统的艰巨任务。

The paper presents measurements of some of the performance costs of Biscuit's use of Go's HLL features, on a set of kernel-intensive benchmarks. The fraction of CPU time consumed by garbage collection and safety checks is less than

. The paper compares the performance of equivalent kernel code paths written in C and Go , finding that the C version is about

faster.
本文介绍了 Biscuit 在一组内核密集型基准测试中使用 Go 的 HLL 功能所产生的一些性能代价。垃圾回收和安全检查所消耗的 CPU 时间小于

。本文比较了用 C 语言和 Go 语言编写的等效内核代码路径的性能，发现 C 语言版本的速度约为

。

We hope that this paper helps readers to make a decision about whether to write a new kernel in C or in an HLL.
我们希望本文能帮助读者做出决定，是用 C 语言还是用 HLL 语言编写新内核。

Acknowledgements 致谢

We thank Nickolai Zeldovich, PDOS, Austin Clements, the anonymous reviewers, and our shepherd, Liuba Shrira, for their feedback. This research was supported by NSF award CSR-1617487.
我们感谢尼古拉-泽尔多维奇（Nickolai Zeldovich）、PDOS、奥斯汀-克莱门茨（Austin Clements）、匿名审稿人以及我们的导师柳芭-施瑞拉（Liuba Shrira）提供的反馈意见。本研究得到了国家自然科学基金奖 CSR-1617487 的支持。

References 参考资料

[1] A. Anagnostopoulos. gopher-os, 2018. https: //github.com/achilleasa/gopher-os.
[2] J. Armstrong. Erlang. Commun. ACM, 53(9):68-75, Sept. 2010.
[2] J. Armstrong.Erlang.Commun.ACM，53（9）：68-75，2010 年 9 月。
[3] G. Back and W. C. Hsieh. The KaffeOS Java runtime system. ACM Trans. Program. Lang. Syst., 27(4):583-630, July 2005.
[3] G. Back 和 W. C. Hsieh.KaffeOS Java 运行时系统。ACM Trans.Program.Lang.Syst., 27(4):583-630, July 2005.
[4] G. Back, P. Tullmann, L. Stoller, W. C. Hsieh, and J. Lepreau. Techniques for the design of Java operating systems. In In Proceedings of the 2000 Usenix Annual Technical Conference, pages 197210. USENIX Association, 2000.
[4] G. Back、P. Tullmann、L. Stoller、W. C. Hsieh 和 J. Lepreau。Java 操作系统设计技术。In Proceedings of the 2000 Usenix Annual Technical Conference, pages 197210.USENIX 协会，2000 年。
[5] H. G. Baker, Jr. List processing in real time on a serial computer. Commun. ACM, 21(4):280-294, Apr. 1978.
[5] H. G. Baker, Jr.串行计算机上的实时列表处理。Commun.ACM，21（4）：280-294，1978 年 4 月。
[6] F. J. Ballesteros. The Clive operating system, 2014. http://lsub.org/ls/clive.html.
[6] F. J. Ballesteros.Clive 操作系统，2014.http://lsub.org/ls/clive.html。
[7] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), pages 267-284, Copper Mountain, CO, Dec. 1995.
[7] B. N. Bershad、S. Savage、P. Pardyak、E. G. Sirer、M. Fiuczynski、D. Becker、S. Eggers 和 C. Chambers。SPIN 操作系统的可扩展性、安全性和性能。第 15 届 ACM 操作系统原理研讨会（SOSP）论文集》，第 267-284 页，科罗拉多州铜山，1995 年 12 月。
[8] J. Bonwick. The slab allocator: An object-caching kernel memory allocator. In Proceedings of the USENIX Summer Conference, 1994.
[8] J. Bonwick.板块分配器：对象缓存内核内存分配器。1994 年 USENIX 夏季会议论文集。
[9] J. Corbet. The too small to fail memory-allocation rule. from https://lwn.net/Articles/ 627419/, Dec 2014.
[9] J. Corbet.The too small to fail memory-allocation rule. from https://lwn.net/Articles/ 627419/, Dec 2014.
[10] J. Corbet. Revisiting too small to fail. from https: / /lwn. net/Articles/723317/, May 2017.
[10] J. Corbet.Revisiting too small to fail. from https: / /lwn. net/Articles/723317/, May 2017.
[11] D Language Foundation. D programming language, 2017. https://dlang.org/.
[11] D 语言基金会。D 编程语言，2017 年。https://dlang.org/。
[12] D. Evans. cs4414: Operating Systems, 2014. http: //www.rust-class.org/.
[13] D. Frampton, S. M. Blackburn, P. Cheng, R. J. Garner, D. Grove, J. E. B. Moss, and S. I. Salishev. Demystifying magic: High-level low-level programming. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’09, pages 81-90, New York, NY, USA, 2009. ACM.
[13] D. Frampton、S. M. Blackburn、P. Cheng、R. J. Garner、D. Grove、J. E. B. Moss 和 S. I. Salishev.解密魔法：高级低级编程。In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 81-90, New York, NY, USA, 2009.ACM.
[14] C. M. Geschke, J. H. Morris, Jr., and E. H. Satterthwaite. Early experience with Mesa. SIGOPS Oper. Syst. Rev., Apr. 1977.
[14] C. M. Geschke, J. H. Morris, Jr., and E. H. Satterthwaite.使用 Mesa 的早期经验。SIGOPS Oper.Syst.Rev.，1977 年 4 月。
[15] Google. The Go Programming Language, 2017. https://golang.org/.
[15] Google.Go 编程语言，2017.https://golang.org/。
[16] Google. gvisor, 2018. https://github.com/ google/gvisor.
[16] Google. Gvisor, 2018.https://github.com/ google/gvisor。
[17] R. D. Greenblatt, T. F. Knight, J. T. Holloway, and D. A. Moon. A LISP machine. In Proceedings of the Fifth Workshop on Computer Architecture for Non-numeric Processing, CAW '80, pages 137-138, New York, NY, USA, 1980. ACM.
[17] R. D. Greenblatt、T. F. Knight、J. T. Holloway 和 D. A. Moon。LISP 机器。In Proceedings of the Fifth Workshop on Computer Architecture for Non-numeric Processing, CAW '80, pages 137-138, New York, NY, USA, 1980.ACM.
[18] T. Hallgren, M. P. Jones, R. Leslie, and A. Tolmach. A principled approach to operating system construction in Haskell. In Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming, ICFP '05, pages 116-128, New York, NY, USA, 2005. ACM.
[18] T. Hallgren、M. P. Jones、R. Leslie 和 A. Tolmach。用 Haskell 构建操作系统的原则性方法。In Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming, ICFP '05, pages 116-128, New York, NY, USA, 2005.ACM.
[19] C. Hawblitzel, C.-C. Chang, G. Czajkowski, D. Hu, and T. von Eicken. Implementing multiple protection domains in Java. In Proceedings of the 1998 USENIX Annual Technical Conference, pages 259270, 1998.
[19] C. Hawblitzel, C.-C. Chang, G. Czajkowski, D. Hu, and T. von Eicken.Chang、G. Czajkowski、D. Hu 和 T. von Eicken.在 Java 中实现多重保护域。1998 USENIX 年度技术会议论文集》，第 259270 页，1998 年。
[20] M. Hertz and E. Berger. Quantifying the performance of garbage collection vs. explicit memory management. In ACM OOPSLA, 2005.
[20] M. Hertz 和 E. Berger.量化垃圾回收与显式内存管理的性能。ACM OOPSLA, 2005.
[21] R. Hudson. Go GC: Prioritizing low latency and simplicity. from https://blog.golang.org/ go15gc, Aug 2015.
[21] R. Hudson.Go GC: Prioritizing low latency and simlicity. from https://blog.golang.org/ go15gc, Aug 2015.
[22] G. C. Hunt and J. R. Larus. Singularity: Rethinking the software stack. In Proceedings of the 21 st ACM Symposium on Operating Systems Principles (SOSP), pages 37-49, Stevenson, WA, Oct. 2007.
[22] G. C. Hunt 和 J. R. Larus.奇点：重新思考软件栈。第 21 届 ACM 操作系统原理研讨会（SOSP）论文集》，第 37-49 页，华盛顿州史蒂文森，2007 年 10 月。
[23] G. C. Hunt, J. R. Larus, M. Abadi, M. Aiken, P. Barham, M. Fahndrich, C. Hawblitzel, O. Hodson, S. Levi, N. Murphy, B. Steensgaard, D. Tarditi, T. Wobber, and B. Zill. An overview of the Singularity project. Technical Report MSR-TR-2005-135, Microsoft, Redmond, WA, Oct. 2005.
[23] G. C. Hunt, J. R. Larus, M. Abadi, M. Aiken, P. Barham, M. Fahndrich, C. Hawblitzel, O. Hodson, S. Levi, N. Murphy, B. Steensgaard, D. Tarditi, T. Wobber, and B. Zill.奇点项目概览.技术报告 MSR-TR-2005-135，微软，华盛顿州雷德蒙德，2005 年 10 月。
[24] B. Iyengar, G. Tene, M. Wolf, and E. Gehringer. The Collie: A Wait-free Compacting Collector. In Proceedings of the 2012 International Symposium on Memory Management, ISMM '12, pages 85-96, Beijing, China, 2012. ACM.
[24] B. Iyengar、G. Tene、M. Wolf 和 E. Gehringer。Collie: A Wait-free Compacting Collector.In Proceedings of the 2012 International Symposium on Memory Management, ISMM '12, pages 85-96, Beijing, China, 2012.ACM.
[25] T. Jim, J. G. Morrisett, D. Grossman, M. W. Hicks, J. Cheney, and Y. Wang. Cyclone: A safe dialect
[25] T. Jim、J. G. Morrisett、D. Grossman、M. W. Hicks、J. Cheney 和 Y. Wang。旋风：一种安全的方言
of C. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference, ATEC '02, pages 275-288, Berkeley, CA, USA, 2002. USENIX Association.
在美国 USENIX 年度技术会议 ATEC '02 的一般主题会议记录中，第 275-288 页，美国加利福尼亚州伯克利，2002 年。USENIX 协会。
[26] A. Levy, M. P. Andersen, B. Campbell, D. Culler, P. Dutta, B. Ghena, P. Levis, and P. Pannuto. Ownership is theft: Experiences building an embedded OS in Rust. In Proceedings of the 8th Workshop on Programming Languages and Operating Systems, PLOS '15, pages 21-26, New York, NY, USA, 2015. ACM.
[26] A. Levy、M. P. Andersen、B. Campbell、D. Culler、P. Dutta、B. Ghena、P. Levis 和 P. Pannuto。所有权即盗窃：用 Rust 构建嵌入式操作系统的经验。第 8 届编程语言与操作系统研讨会论文集，PLOS '15，第 21-26 页，美国纽约，2015 年。ACM.
[27] A. Levy, B. Campbell, B. Ghena, D. B. Giffin, P. Pannuto, P. Dutta, and P. Levis. Multiprogramming a 64 kb computer safely and efficiently. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 234-251, New York, NY, USA, 2017. ACM.
[27] A. Levy、B. Campbell、B. Ghena、D. B. Giffin、P. Pannuto、P. Dutta 和 P. Levis。安全高效地对 64KB 计算机进行多重编程。第 26 届操作系统原理研讨会论文集，SOSP '17，第 234-251 页，美国纽约州纽约市，2017 年。ACM.
[28] A. Light. Reenix: implementing a Unix-like operating system in Rust, Apr. 2015.
[28] A. Light.Reenix：用 Rust 实现类 Unix 操作系统，2015 年 4 月。
[29] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, and J. Crowcroft. Unikernels: Library operating systems for the cloud. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 461-472, Houston, TX, Mar. 2013.
[29] A. Madhavapeddy、R. Mortier、C. Rotsos、D. Scott、B. Singh、T. Gazagnaire、S. Smith、S. Hand 和 J. Crowcroft。Unikernels：云的库操作系统。第 18 届编程语言和操作系统架构支持国际会议（ASPLOS）论文集》，第 461-472 页，德克萨斯州休斯顿，2013 年 3 月。
[30] B. McCloskey, D. F. Bacon, P. Cheng, and D. Grove. Staccato: A Parallel and Concurrent Real-time Compacting Garbage Collector for Multiprocessors. Technical report, IBM, 2008.
[30] B. McCloskey、D. F. Bacon、P. Cheng 和 D. Grove。Staccato：用于多处理器的并行并发实时压缩垃圾收集器。技术报告，IBM，2008 年。
[31] P. McKenney. Review list for RCU patches. https://www.kernel.org/doc/ Documentation/RCU/checklist.txt.
[31] P. McKenney.RCU 补丁审查列表。https://www.kernel.org/doc/ Documentation/RCU/checklist.txt.
[32] P. E. McKenney, S. Boyd-Wickizer, and J. Walpole. RCU usage in the Linux kernel: One decade later. 2012.
[32] P. E. McKenney, S. Boyd-Wickizer, and J. Walpole.Linux 内核中 RCU 的使用：十年之后。2012.
[33] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509-518, 1998.
[33] P. E. McKenney 和 J. D. Slingwine.读拷贝更新：利用执行历史解决并发问题。并行和分布式计算与系统》，第 509-518 页，1998 年。
[34] MITRE Corporation. CVE Linux Kernel Vulnerability Statistics, 2018. http: //www.cvedetails.com/product/47/ Linux-Linux-Kernel.html?vendor_ id=33.
[34] MITRE Corporation.CVE Linux Kernel Vulnerability Statistics, 2018. http: //www.cvedetails.com/product/47/ Linux-Linux-Kernel.html?vendor_ id=33.
[35] J. Mogul. Eliminating receive livelock in an interrupt-driven kernel. In USENIX 1996 Annual Technical Conference, January 1996.
[35] J. Mogul.消除中断驱动内核中的接收活锁。在 USENIX 1996 年度技术会议上，1996 年 1 月。
[36] Mozilla research. The Rust Programming Language, 2017. https://doc.rust-lang. org/book/.
[36] Mozilla research.Rust 编程语言，2017.https://doc.rust-lang. org/book/.
[37] G. Nelson, editor. Systems Programming with Modula-3. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1991.
[37] G. Nelson 主编。Systems Programming with Modula-3.Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1991.
[38] NGINX. Nginx, 2018. https://www.nginx. com/.
[38] NGINX。Nginx，2018 年。https://www.nginx. com/.
[39] P. Oppermann. Writing an OS in Rust, 2017. http: //os.phil-opp.com/.
[39] P. Oppermann.用 Rust 编写操作系统，2017 年。http：//os.phil-opp.com/.
[40] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller. Faults in Linux: Ten years later. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 305-318, New York, NY, USA, 2011. ACM.
[40] N. Palix、G. Thomas、S. Saha、C. Calvès、J. Lawall 和 G. Muller。Linux 中的故障：十年之后。第 16 届编程语言和操作系统架构支持国际会议论文集，ASPLOS XVI，第 305-318 页，美国纽约，2011 年。ACM.
[41] W. M. Petullo, W. Fei, J. A. Solworth, and P. Gavlin. Ethos' deeply integrated distributed types. In IEEE Security and Privacy Workshop on LangSec, May 2014.
[41] W. M. Petullo, W. Fei, J. A. Solworth, and P. Gavlin.Ethos 的深度集成分布式类型。In IEEE Security and Privacy Workshop on LangSec, May 2014.
[42] D. Picheta. Nim in action, 2017. http:// nim-lang.org/.
[42] D. Picheta.Nim in action, 2017. http:// nim-lang.org/.
[43] J. Rafkind, A. Wick, J. Regehr, and M. Flatt. Precise Garbage Collection for C. In Proceedings of the 9th International Symposium on Memory Management, ISMM ’09, Dublin, Ireland, June 2009. ACM.
[43] J. Rafkind、A. Wick、J. Regehr 和 M. Flatt.第 9 届国际内存管理研讨会论文集，ISMM '09，爱尔兰都柏林，2009 年 6 月。ACM.
[44] D. Redell, Y. Dalal, T. Horsley, H. Lauer, W. Lynch, P. McJones, H. Murray, and S. Purcell. Pilot: An operating system for a personal computer. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP), Pacific Grove, CA, 1979. ACM.
[44] D. Redell、Y. Dalal、T. Horsley、H. Lauer、W. Lynch、P. McJones、H. Murray 和 S. Purcell。飞行员：个人计算机操作系统。第 7 届 ACM 操作系统原理研讨会（SOSP）论文集，加利福尼亚州太平洋格罗夫，1979 年。ACM.
[45] M. Schroeder and M. Burrows. Performance of Firefly RPC. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, SOSP '89, pages 83-90, New York, NY, USA, 1989. ACM.
[45] M. Schroeder 和 M. Burrows。Firefly RPC 的性能。第十二届 ACM 操作系统原理研讨会论文集，SOSP '89，第 83-90 页，美国纽约，1989 年。ACM.
[46] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov. Addresssanitizer: A fast address sanity checker. In Proceedings of the 2012 USENIX Annual Technical Conference, Boston, MA, 2012. USENIX.
[46] K. Serebryany、D. Bruening、A. Potapenko 和 D. Vyukov。Addresssanitizer：快速地址正确性检查器。2012 年 USENIX 年度技术大会论文集，马萨诸塞州波士顿，2012 年。USENIX.
[47] A. S. Tanenbaum. Modern Operating Systems. Pearson Prentice Hall, 2008.
[47] A. S. Tanenbaum.Modern Operating Systems.Pearson Prentice Hall，2008.
[48] W. Teitelman. The Cedar programming environment: A midterm report and examination. Technical Report CSL-83-11, Xerox PARC, 1984.
[48] W. Teitelman.雪松编程环境：中期报告和检查。技术报告 CSL-83-11，施乐 PARC，1984 年。
[49] C. P. Thacker and L. C. Stewart. Firefly: a multiprocessor workstation. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, Apr. 1987.
[49] C. P. Thacker 和 L. C. Stewart.萤火虫：多处理器工作站。第二届编程语言和操作系统架构支持国际会议（ASPLOS）论文集》。ACM，1987 年 4 月。
[50] Ticki. Redox - Your Next(Gen) OS, 2017. https: //doc.redox-os.org/book/.
[50] Ticki.https: //doc.redox-os.org/book/.
[51] L. Torvalds. http://harmful.cat-v.org/ software/c++/linus, Jan 2004.
[51] L. Torvalds.http://harmful.cat-v.org/ software/c++/linus, Jan 2004.
[52] T. Yang, M. Hertz, E. Berger, S. Kaplan, and J. E. B. Moss. Automatic heap sizing: Taking real memory into account. In ACM ISMM, 2004.
[52] T. Yang, M. Hertz, E. Berger, S. Kaplan, and J. E. B. Moss.自动堆大小：考虑实际内存。ACM ISMM, 2004.

https://github.com/moby/moby
We used Go's atomic package to prevent re-ordering of memory reads and writes; it is not clear that this approach is portable.
我们使用 Go 的原子包来防止内存读写的重新排序；目前还不清楚这种方法是否具有可移植性。

The benefits and costs of writing a POSIX kernel in a high-level language 用高级语言编写 POSIX 内核的好处和代价

The benefits and costs of writing a POSIX kernel in a high-level language 用高级语言编写 POSIX 内核的好处和代价

Abstract 摘要

1 Introduction 1 引言

2 Related work 2 相关工作

3 Motivation 3 动机

3.1 Why C? 3.1 为什么是 C？

3.2 Why an HLL? 3.2 为什么要使用 HLL？

4 Overview 4 概述

5 Garbage collection 5 垃圾收集

5.1 Go's collector 5.1 围棋的收集器

5.2 Biscuit's heap size 5.2 Biscuit 的堆大小

6 Avoiding heap exhaustion6 避免堆耗尽

6.1 Approach: reservations6.1 方法：保留

6.2 How Biscuit reserves 6.2 饼干如何储备

6.3 Static analysis to find 6.3 通过静态分析查找

6.3.1 Basic MAXLIVE operation6.3.1 MAXLIVE 的基本操作

6.3.2 Handling loops 6.3.2 处理循环

6.3.3 Kernel threads 6.3.3 内核线程

6.3.4 Killer thread 6.3.4 杀手线程

6.4 Limitations 6.4 局限性

6.5 Heap exhaustion summary6.5 堆耗尽总结

7 Implementation 7 实施

8 Evaluation 8 评估

8.1 Biscuit's use of HLL features8.1 Biscuit 对 HLL 功能的使用

8.2 Potential to reduce bugs8.2 减少虫子的潜力

8.3 Experimental Setup 8.3 实验装置

8.4 HLL tax 8.4 HLL 税收

8.5 GC delays 8.5 GC 延误

8.6 Sensitivity to heap size8.6 对堆大小的敏感性

8.7 Go versus 8.7 围棋对

8.7.1 Ping-pong 8.7.1 乒乓球

8.7.2 Page-faults 8.7.2 页面故障

8.8 Biscuit versus Linux 8.8 饼干与 Linux

8.9 Handling kernel heap exhaustion8.9 处理内核堆耗尽

8.10 Lock-free lookups 8.10 无锁查找

9 Discussion and future work9 讨论与未来工作

10 Conclusions 10 结论

Acknowledgements 致谢

References 参考资料

The benefits and costs of writing a POSIX kernel in a high-level language
用高级语言编写 POSIX 内核的好处和代价

The benefits and costs of writing a POSIX kernel in a high-level language
用高级语言编写 POSIX 内核的好处和代价

6 Avoiding heap exhaustion
6 避免堆耗尽

6.1 Approach: reservations
6.1 方法：保留

6.3 Static analysis to find
6.3 通过静态分析查找

6.3.1 Basic MAXLIVE operation
6.3.1 MAXLIVE 的基本操作

6.5 Heap exhaustion summary
6.5 堆耗尽总结

8.1 Biscuit's use of HLL features
8.1 Biscuit 对 HLL 功能的使用

8.2 Potential to reduce bugs
8.2 减少虫子的潜力

8.6 Sensitivity to heap size
8.6 对堆大小的敏感性

8.9 Handling kernel heap exhaustion
8.9 处理内核堆耗尽

9 Discussion and future work
9 讨论与未来工作