这是用户在 2024-5-18 6:44 为 https://app.immersivetranslate.com/pdf-pro/718682d1-cde6-428c-8d4e-c768df47ba6c 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
2024_05_17_4cbbb5f360c7939c1c6bg

EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices
EROFS:针对资源稀缺设备的压缩友好型只读文件系统

Xiang Gao, Huawei Technologies Co., Ltd.; Mingkai Dong, Shanghai Jiao Tong University; Xie Miao, Wei Du, and Chao Yu, Huawei Technologies Co., Ltd.;: Haibo Chen, Shanghai Jiao Tong University / Huawei Technologies Co., Ltd.
高翔,华为技术有限公司;董明凯,上海交通大学;谢淼、杜伟、余超,华为技术有限公司;陈海波,上海交通大学/华为技术有限公司:陈海波,上海交通大学/华为技术有限公司。
This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference.
本文收录在《2019 USENIX 年度技术大会论文集》中。
July 10-12, Renton, WA, USA
7 月 10-12 日, 美国华盛顿州伦顿
ISBN 978-1-939133-03-8
Open access to the Proceedings of the 2019 USENIX Annual Technical Conference is sponsored by USENIX.
2019 USENIX 年度技术大会论文集的开放获取由 USENIX 赞助。

EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices
EROFS:针对资源稀缺设备的压缩友好型只读文件系统

Xiang Gao , Mingkai Dong , Xie Miao , Wei Du , Chao , and Haibo Chen
Xiang Gao , Mingkai Dong , Xie Miao , Wei Du , Chao , and Haibo Chen
Huawei Technologies Co., Ltd.
华为技术有限公司
Shanghai Jiao Tong University
上海交通大学

Abstract 摘要

Smartphones usually have limited storage and runtime memory. Compressed read-only file systems can dramatically decrease the storage used by read-only system resources. However, existing compressed read-only file systems use fixedsized input compression, which causes significant I/O amplification and unnecessary computation. They also consume excessive runtime memory during decompression and deteriorate the performance when the runtime memory is scarce. In this paper, we describe EROFS , a new compression-friendly read-only file system that leverages fixed-sized output compression and memory-efficient decompression to achieve high performance with little extra memory overhead. We also report our experience of deploying EROFS on tens of millions of smartphones. Evaluation results show that EROFS outperforms existing compressed read-only file systems with various micro-benchmarks and reduces the boot time of realworld applications by up to while nearly halving the storage usage.
智能手机的存储空间和运行内存通常有限。压缩只读文件系统可以显著减少只读系统资源所使用的存储空间。然而,现有的压缩只读文件系统使用固定大小的输入压缩,这会导致明显的 I/O 放大和不必要的计算。在解压缩过程中,它们还会消耗过多的运行内存,并在运行内存不足时降低性能。在本文中,我们介绍了EROFS ,这是一种新的压缩友好型只读文件系统,它利用固定大小的输出压缩和高效内存解压缩技术,以极小的额外内存开销实现高性能。我们还报告了在数千万部智能手机上部署EROFS的经验。评估结果表明,EROFS 在各种微基准测试中的表现优于现有的压缩只读文件系统,并将真实世界应用程序的启动时间缩短了多达 ,同时将存储使用量减少了近一半。

1 Introduction 1 引言

Low-end smartphones with relatively low price are still prevalent in the market , especially in developing countries. At a price, such devices are usually equipped with limited resources in both capacity and performance. For example, a low-end Android smartphone may have 1-2GB runtime memory and 8-16GB slow eMMC storage [19, 21, 22]. Even worse, the Android operating system itself can consume more than 3GB in storage, leaving scarce storage space available to users [46]. Even for high-end smartphones, the increasing storage and runtime memory consumption of popular or resident apps usually render a device resource-scarce for both user-initiated and system-initiated operations.
价格相对较低的低端智能手机在市场上仍很普遍 ,尤其是在发展中国家。由于价格低廉,这类设备通常在容量和性能方面都配备了有限的资源。例如,一款低端安卓智能手机可能只有 1-2GB 运行内存和 8-16GB 的慢速 eMMC 存储 [19、21、22]。更糟糕的是,Android 操作系统本身可能会消耗超过 3GB 的存储空间,留给用户的存储空间非常有限 [46]。即使是高端智能手机,流行或常驻应用程序不断增加的存储和运行时内存消耗通常也会导致设备在用户启动和系统启动操作时资源匮乏。
File systems with compression support, or compressed file systems, can be used to release more space to the
支持压缩的文件系统或压缩文件系统可用于释放更多空间到
users by transparently compressing/decompressing file data upon accesses. However, such file systems usually consume more resources and yield notably worse performance during compression/decompression. Thus they are not suitable for resource-limited devices, especially smartphones, on which user experience has the top priority.
通过在访问时以透明方式压缩/解压缩文件数据,用户可以使用这些文件系统。然而,这种文件系统通常会消耗更多资源,而且在压缩/解压缩过程中性能明显降低。因此,它们并不适合资源有限的设备,尤其是智能手机,因为在智能手机上,用户体验是重中之重。
Fortunately, for partitions with read-only data, such as the /system, /vendor and /odm partitions of Android, the file system can be made read-only to boost the performance by simplifying the structures and designs for file changes. However, existing compressed read-only file systems, such as Squashfs , usually cause notable degradation on access performance and incur extra memory usage during decompression. One key issue is that such file systems use fixed-sized input compression, in which file data is divided into fixed-sized chunks (e.g., ) and each chunk is compressed individually. The fixed-sized input compression incurs significant read amplification and excessively unnecessary computations ( ). Even worse, they usually require a huge amount of runtime memory, which is scarce on low-end smartphones or heavily-used high-end smartphones ( ).
幸运的是,对于具有只读数据的分区(如安卓系统中的/system、/vendor 和/odm 分区),可以将文件系统设为只读,通过简化结构和文件更改设计来提高性能。不过,现有的压缩只读文件系统(如 Squashfs )通常会明显降低访问性能,并在解压缩过程中产生额外的内存占用。其中一个关键问题是,这类文件系统使用固定大小的输入压缩,即把文件数据分成固定大小的块(如 ),然后对每个块进行单独压缩。固定大小的输入压缩会产生显著的读取放大和过多不必要的计算 ( )。更糟糕的是,它们通常需要大量运行时内存,而这在低端智能手机或大量使用的高端智能手机上是非常稀缺的 ( )。
To save the storage space and retain high performance with low memory overhead, we design and implement EROFS, an enhanced read-only file system with compression support. EROFS introduces the fixed-sized output compression, which compresses file data to multiple fixed-sized blocks, to significantly mitigate the read amplification problem and reduce unnecessary computations as much as possible. By exploiting the characteristics of compression algorithms (such as LZ4), EROFS designs different memoryefficient decompression schemes to reduce extra memory usage during the decompression. EROFS also adopts a set of optimizations that carefully ensure guaranteed user experience.
为了节省存储空间,以较低的内存开销保持高性能,我们设计并实现了支持压缩的增强型只读文件系统EROFS。EROFS引入了固定大小输出压缩,将文件数据压缩为多个固定大小的块,从而大大缓解了读取放大问题,并尽可能减少了不必要的计算。利用压缩算法(如 LZ4)的特点,EROFS 设计了不同的内存效率解压缩方案,以减少解压缩过程中的额外内存使用。此外,EROFS 还采用了一系列优化措施,精心确保用户体验。
The main contributions of this paper include:
本文的主要贡献包括
  • A study of existing compressed file systems which reveals the performance issues on resource-hungry devices .
    对现有压缩文件系统的研究揭示了资源消耗型设备的性能问题
  • A fixed-sized output compression scheme that signifi-
    一个固定大小的输出压缩方案,可以显著减少输出量。

    cantly mitigates the read amplification issue ( ).
    大大缓解了读取扩增问题 ( )。
  • A set of novel decompression schemes for both memory-efficiency and high performance ( § ).
    一套兼顾内存效率和高性能的新型解压缩方案 ( § )。
  • An evaluation of EROFS against other file systems to validate the effectiveness of EROFS ( ) and a study on the deployment experience of EROFS on tens of millions of smartphones ( § ).
    EROFS与其他文件系统的对比评估,以验证EROFS的有效性 ( ) ,以及EROFS在数千万部智能手机上的部署经验研究 ( § ) 。

2 Background and Motivation
2 背景与动机

2.1 Low user-perceived storage space
2.1 用户认为存储空间小

Smartphones are usually resource-scarce due to the cost constraint. Meanwhile, the space occupied by the Android operating system is constantly increasing. Fig. 1 shows the /system partition size in stock Android factory images [6] for different Android versions. The sparse image strips off all zero blocks and thus only contains all effective data; while the raw image is the actual space consumed once stored into the devices. From the figure, we can see the data size of the /system partition increases from 184MB in Android 2.3.6 to in Android 9.0.0. Besides the trend of increasing the effective data size, we can also see a large number of zero blocks in Android 7 and 8, which also consume large space. For Android 9, the zero blocks are significantly less, which is due to the support of data block deduplication [20] in the ext4 file system. Besides the /system partition shown in Fig. 1, there are other space-consuming partitions for Android such as /vendor, /oem and /odm [8]. As reported in previous work [46], the space used by the whole Android system itself is increasing and far larger than what we show here. For example, Android 6.0.0 consumes 3.17GB storage after a factory-reset .
由于成本限制,智能手机通常资源稀缺。与此同时,安卓操作系统占用的空间也在不断增加。图 1 显示了不同安卓版本的安卓出厂图片[6]中/系统分区的大小。稀疏图像去除了所有零块,因此只包含所有有效数据;而原始图像则是存储到设备中后实际消耗的空间。从图中可以看出,/system 分区的数据大小从 Android 2.3.6 的 184MB 增加到 Android 9.0.0 的 。除了有效数据大小有增加的趋势外,我们还可以看到 Android 7 和 8 中存在大量的零块,这也会消耗大量空间。而在安卓 9 中,零块明显减少,这是因为 ext4 文件系统支持重复数据块删除 [20]。除了图 1 所示的 /system 分区外,Android 还存在其他消耗空间的分区,如 /vendor、/oem 和 /odm [8]。正如之前的工作[46]所报告的,整个 Android 系统本身所使用的空间在不断增加,远远大于我们在这里所展示的。例如,Android 6.0.0 在出厂重置后会消耗 3.17GB 的存储空间
Meanwhile, the storage consumption of Android applications also keeps growing. As reported by Google Play, by early 2017, the average app size has quintupled compared with that at the time Google starts its Android application marketplace [45]. As a result, the storage capacity of low-end smartphones available for users is rather small. Further, many top apps for smartphones tend to consume a huge amount of memory, leaving only a small amount of memory for systeminitiated operations even on a high-end smartphone.
与此同时,安卓应用程序的存储消耗也在不断增长。根据 Google Play 的报告,到 2017 年初,应用程序的平均大小比谷歌启动安卓应用市场时增加了五倍[45]。因此,可供用户使用的低端智能手机的存储容量相当小。此外,智能手机上的许多顶级应用程序往往会消耗大量内存,即使在高端智能手机上,也只能留下少量内存用于系统启动的操作。
Fig. 1: Android /system partition sizes
图 1:安卓/系统分区大小
Compressed file systems. One intuitive approach to unleash- ing more spaces for users is adopting compressed file systems, which exposes standard file interfaces to the applications but transparently compress and decompress file data during file writes and reads.
压缩文件系统。为用户释放更多空间的一种直观方法是采用压缩文件系统,它向应用程序提供标准文件接口,但在文件写入和读取过程中透明地压缩和解压文件数据。
Btrfs [2] is a modern B-tree file system with compression support. When compression is enabled, the file data is divided into multiple chunks and compressed separately. Each of the compressed chunks will be stored in an extent, which is a run of contiguous blocks that store data sequentially. The locations of these extents are recorded as indexes in the B-tree structures. To read the file data, the corresponding extents are read from the storage and the whole chunks are decompressed. To update a file, the new data is compressed and written to new extents, and then the indexes are updated. To read the file data, Btrfs reads the corresponding extents from the storage and decompresses the whole chunks. To update a file, Btrfs compresses the new data, writes it to new extents, and updates the indexes.
Btrfs [2] 是一种支持压缩的现代 B 树文件系统。启用压缩功能后,文件数据会被分成多个 块,并分别进行压缩。每个压缩后的数据块都将存储在一个extent中,extent是按顺序存储数据的连续块。这些扩展的位置作为索引记录在 B 树结构中。要读取文件数据,需要从存储中读取相应的扩展,并对整个块进行解压缩。更新文件时,新数据会被压缩并写入新的扩展名,然后更新索引。要读取文件数据,Btrfs 会从存储中读取相应的扩展并解压缩整个块。更新文件时,Btrfs 会压缩新数据,将其写入新的扩展,然后更新索引。
Btrfs is a general-purpose file system, so its internal structures must consider efficient data modifications and cannot be aggressively optimized for compression. Furthermore, compression is not the only metric. The memory consumption during decompression should also be constrained.
Btrfs 是一种通用文件系统,因此其内部结构必须考虑高效的数据修改,而不能为压缩而积极优化。此外,压缩并不是唯一的衡量标准。解压缩过程中的内存消耗也应受到限制。
For devices like smartphones, performance and responsiveness are important key metrics that cannot be compromised. Hence, with the burden of efficient data modification, Btrfs can hardly satisfy the requirements of both performance and compression efficiency, as we will show later in the evaluation ( § ).
对于智能手机等设备来说,性能和响应速度是不可妥协的重要关键指标。因此,在高效数据修改的重负下,Btrfs 很难同时满足性能和压缩效率的要求,这一点我们将在后面的评估中说明 ( § )。
Compressed read-only file systems. Considering the access patterns of partitions in Android, we find that system resources are rarely modified once the Android operating system is installed. We can thus use compressed read-only file systems on read-only partitions to reduce the space consumption for system resources while retaining the performance. Unlike compressed read-write file systems which are complicated by data modifications, compressed read-only file systems exclude data updates by design, which exposes more opportunities for higher compression ratio and faster data reads.
压缩只读文件系统。考虑到安卓系统中分区的访问模式,我们发现一旦安装了安卓操作系统,系统资源就很少被修改。因此,我们可以在只读分区上使用压缩只读文件系统,在保留性能的同时减少系统资源的空间消耗。与压缩读写文件系统因数据修改而变得复杂不同,压缩只读文件系统在设计上排除了数据更新,这就为更高的压缩率和更快的数据读取提供了更多机会。
Squashfs [11] is a widely-used compressed read-only file system in Linux with many features and moderate performance. It supports several compression algorithms, and the chunk (i.e., compression input) size can be chosen from to 1MB. In Squashfs, metadata can be compressed, and inodes and directories are stored more compactly. File data is compressed chunk by chunk, and the compressed data blocks are stored sequentially. The compressed sizes of each original data chunk are stored in a list within the inode. These sizes are used to locate the position of compressed blocks during decompression.
Squashfs [11] 是 Linux 中广泛使用的压缩只读文件系统,具有许多功能和适中的性能。它支持多种压缩算法,块(即压缩输入)大小可从 至 1MB 之间选择。在 Squashfs 中,元数据可以压缩,节点和目录的存储也更加紧凑。文件数据逐块压缩,压缩后的数据块按顺序存储。每个原始数据块的压缩大小都存储在 inode 中的一个列表中。这些大小用于在解压缩时定位压缩块的位置。

2.2 Deficiency of existing readonly file systems
2.2 现有只读文件系统的缺陷

Compressed read-only file systems are designed to minimize storage usage. However, applying existing compressed read-
压缩只读文件系统旨在尽量减少存储空间的使用。然而,应用现有的压缩只读文件系统,并不能最大限度地减少存储空间的使用。

only file systems on resource-scarce smartphones can induce significant overhead on both performance and memory consumption. For example, we first tried to use Squashfs for the read-only partitions on Android. While the system boots successfully with Squashfs, booting the camera application requires tens of seconds even with light background workloads.
在资源稀缺的智能手机上使用只读文件系统,会给性能和内存消耗带来巨大开销。例如,我们首先尝试在安卓系统上使用 Squashfs 作为只读分区。虽然使用 Squashfs 可以成功启动系统,但即使后台工作负载很轻,启动相机应用程序也需要数十秒。
Why is there such a huge performance slowdown? We conducted a detailed study of Squashfs with default configuration using microbenchmarks and uncover that the performance degradation mainly originates from two parts. The first one is I/O amplification. We used FIO [23] to evaluate the basic performance of Squashfs. When Android sequentially reads from the enwik 9 [40] file stored in Squashfs, the actually issued I/O is . While the number looks decent regarding compression, Squashfs issues 165.27MB I/O reads when Android reads 16MB randomly. Moreover, when Android reads the first of every , reading file data issued as much as I/O read. The difference suggests that when Squashfs reads some data that is not decompressed and cached before, the size of data requested is significantly amplified.
为什么会出现如此严重的性能下降?我们使用微基准测试对默认配置下的 Squashfs 进行了详细研究,发现性能下降主要源于两个方面。首先是 I/O 放大。我们使用 FIO [23] 来评估 Squashfs 的基本性能。当 Android 从存储在 Squashfs 中的 enwik 9 [40] 文件中顺序读取 时,实际发出的 I/O 为 。虽然这个数字在压缩方面看起来还不错,但当安卓随机读取 16MB 时,Squashfs 发出的 I/O 读数为 165.27MB。此外,当 Android 读取每个 的第一个 时,读取 文件数据发出的 I/O 读取次数与 一样多。这种差异表明,当 Squashfs 读取一些之前未解压缩和缓存的数据时,请求的数据大小会明显增大。
The second reason is extra memory consumption. The total memory consumption after sequentially reading the enwik9 file on Squashfs is about , which suggests that decompression in Squashfs requires a significant amount of temporary memory compared to the size of the original data needed. This causes high pressure to Android since memory is a key factor for user experience given that Android and its apps already consume a large amount of memory. On one hand, allocating memory during decompression may trigger memory swapping, which involves victim selections and I/Os with high cost. On the other hand, consuming much extra memory during decompression affects other components or applications by dropping their cached data or swapping out useful memory pages.
第二个原因是额外的内存消耗。在 Squashfs 上顺序读取 enwik9 文件后的总内存消耗约为 ,这表明与所需的原始数据大小相比,Squashfs 的解压缩需要大量的临时内存。这给 Android 带来了很大压力,因为 Android 及其应用程序已经消耗了大量内存,而内存是影响用户体验的关键因素。一方面,在解压缩过程中分配内存可能会触发内存交换,这涉及到代价高昂的受害者选择和 I/O。另一方面,解压缩期间消耗大量额外内存会影响其他组件或应用程序,因为它们会丢弃缓存数据或交换出有用的内存页。
We further analyzed the design and implementation of Squashfs and found the following two defects.
我们进一步分析了 Squashfs 的设计和实施,发现了以下两个缺陷。
Fixed-sized input compression. Existing file systems compress original data in a fixed-sized chunk, generating variable-sized compressed data. As shown in Fig. 2(a), Squashfs takes a fixed-sized data (e.g., a 128KB chunk) as the input of a single invocation of the compression algorithm. The compression algorithm then generates the compressed data whose size depends on the content of the input data. The compressed data of one file is usually compacted in the original data order, to reduce wasted space in the first and the last blocks of each compressed chunk.
固定大小的输入压缩。现有的文件系统以固定大小的块为单位压缩原始数据,生成大小可变的压缩数据。如图 2(a)所示,Squashfs 将固定大小的数据(如 128KB 的数据块)作为压缩算法单次调用的输入。然后,压缩算法生成压缩数据,其大小取决于输入数据的内容。一个文件的压缩数据通常按原始数据顺序压缩,以减少每个压缩块的第一个和最后一个数据块的空间浪费。
Such a compression approach appears decent but has a notable deficiency due to amplified I/O and wasted computation. For example, in Fig. 2(b), the application wants to get the first byte of the chunk. To satisfy the application's request, the Squashfs has to read all compressed data from block 1 to block 7. Considering the minimal requested block size of the underlying storage devices is , the I/O is amplified 7 times! This is because the file system must read all related compressed blocks, even if the number of compressed blocks is very large. Even worse, even if not all data stored in the first block and the last block are useful for the decompression, they must be read from the storage altogether. In the example, the shadowed parts of block 1 and block 7 in Fig. 2(b) contribute nothing to the decompression but have to be read from the storage. Besides, the decompression process for useless data also causes huge CPU wastes that lead to high performance interference of other running apps (such as the Camera mentioned before).
这种压缩方法看似不错,但由于扩大了 I/O,浪费了计算量,因此存在明显不足。例如,在图 2(b)中,应用程序希望获取 块的第一个字节。为了满足应用程序的要求,Squashfs 必须读取从块 1 到块 7 的所有压缩数据。考虑到底层存储设备要求的最小块大小为 ,I/O 被放大了 7 倍!这是因为文件系统必须读取所有相关的压缩块,即使压缩块的数量非常大。更糟糕的是,即使第一个块和最后一个块中存储的所有数据对解压缩都没有用,也必须从存储设备中全部读取。例如,图 2(b)中块 1 和块 7 的阴影部分对解压缩毫无用处,但却必须从存储器中读取。此外,无用数据的解压缩过程也会造成巨大的 CPU 浪费,从而对其他正在运行的应用程序(如前面提到的相机)造成高性能干扰。
One possible mitigation would be reducing the input chunk size to in Squashfs. While this might alleviate the I/O amplification, this non-trivially reduces the compression ratio and incurs higher CPU utilization, as we will show in section 5 .
一种可能的缓解方法是将 Squashfs 中的输入块大小减小到 。虽然这可能会减轻 I/O 放大,但正如我们将在第 5 节中展示的那样,这并不意味着压缩率会降低,CPU 利用率也会提高。
Massive memory consumption and data movements. The other defect we found is that Squashfs requires massive temporary memory during the decompression. Upon file read requests, Squashfs will first look up the metadata to get the number of related compressed blocks. It then allocates memory (e.g., the buffer_head structure) for each of the compressed blocks, and issues I/O reads to fetch the compressed blocks from the storage to the allocated buffer_heads. Since the buffers in buffer_heads of adjacent compressed blocks might not have continuous virtual addresses, Squashfs has to copy data in the buffer_heads of all compressed blocks to a single continuous buffer. Then, the compression algorithm decompresses all original data and puts them in a temporary output buffer. Finally, Squashfs copies the original data from the temporary output buffer to the corresponding page cache pages.
大量内存消耗和数据移动。我们发现的另一个缺陷是,Squashfs 在解压缩过程中需要大量临时内存。在读取文件请求时,Squashfs 会首先查找元数据,以获得相关压缩块的数量。然后,它会为每个压缩块分配内存(如 buffer_head 结构),并发出 I/O 读取请求,将压缩块从存储区提取到分配的 buffer_head。由于相邻压缩块的缓冲区头中的缓冲区可能没有连续的虚拟地址,因此 Squashfs 必须将所有压缩块的缓冲区头中的数据复制到一个连续的缓冲区中。然后,压缩算法会解压所有原始数据,并将其放入临时输出缓冲区。最后,Squashfs 会将临时输出缓冲区中的原始数据复制到相应的页面缓存页中。
From the above routine, two pre-allocated temporary buffers are used and an array of buffer_heads are dynamically allocated for the decompression. The number of buffer_head needs to be large enough to store all compressed blocks. However, allocating such a large amount of memory can cause severe performance degradation under a lowmemory situation.
在上述例程中,使用了两个预先分配的临时缓冲区,并为解压缩动态分配了一个缓冲区头数组。缓冲区头的数量必须足够大,以存储所有压缩块。但是,在内存不足的情况下,分配如此大的内存可能会导致性能严重下降。
In addition to extra memory allocation, there are two data movements during decompression: from the buffer_heads to the temporary input buffer, and from the temporary output buffer to the page cache. These two data movements also cause performance overhead since, most of the time, the compression/decompression algorithm is bottlenecked by memory accesses.
除了额外的内存分配外,解压缩过程中还有两次数据移动:从缓冲区头到临时输入缓冲区,以及从临时输出缓冲区到页面缓存。这两个数据移动也会造成性能开销,因为在大多数情况下,压缩/解压缩算法都会受到内存访问的瓶颈制约。
The above two defects in Squashfs reveal two challenges when designing a compressed read-only file system for resource-scarce smartphones.
Squashfs 的上述两个缺陷揭示了为资源稀缺的智能手机设计压缩只读文件系统时所面临的两个挑战。
  • How to reduce I/O amplification during the decompression without sacrificing the compression ratio?
    如何在不影响压缩比的情况下减少解压缩过程中的 I/O 放大?
  • How to reduce memory consumption during the decompression to prevent performance degradation?
    如何减少解压缩过程中的内存消耗,防止性能下降?
(a) Fixed-sized input (a) 固定大小的输入
(b) Fixed-sized input issues
(b) 固定规模的输入问题
(c) Fixed-sized output (c) 固定规模的产出
Fig. 2: Compression approaches
图 2:压缩方法

3 EROFS:Enhanced Compressed File System
3 EROFS:增强型压缩文件系统

This section presents the design of EROFS, a compressionfriendly readonly file system which overcomes the deficiency of prior systems. The key design of EROFS includes fixedsized output compression, cached I/O and in-place I/O, and memory-efficient decompression.
本节介绍EROFS的设计,它是一种便于压缩的只读文件系统,克服了以往系统的不足。EROFS的关键设计包括固定大小的输出压缩、缓存I/O和就地I/O以及内存高效解压缩。

3.1 Fixed-sized output compression
3.1 固定尺寸输出压缩

To overcome the read amplification incurred by the fixedsized input compression, EROFS adopts a different compression approach: fixed-sized output compression.
为了克服固定大小的输入压缩带来的读取放大,EROFS 采用了另一种压缩方法:固定大小的输出压缩。
To generate fixed-sized output, EROFS compresses the file data using a sliding window, whose size is a fixed value and can be configured during image generation. The compression algorithm is invoked multiple times until all file data is compressed. For example, with a sliding window, EROFS feeds the compression algorithm with original data at a time. The algorithm then compresses the original data as much as possible until all data is consumed or the consumed data can generate exactly compressed data. The remaining original data is combined with more data, forming another 1MB original data for the next invocation of compression. Fig. 2(c) depicts the fixed-sized output compression, in which variable-sized original data is compressed to blocks.
为了生成固定大小的输出,EROFS 使用一个滑动窗口来压缩文件数据,该窗口的大小是一个固定值,可在生成图像时进行配置。压缩算法会被多次调用,直到所有文件数据都被压缩。例如,使用 滑动窗口,EROFS 每次向压缩算法提供 原始数据。然后,算法会尽可能多地压缩原始数据,直到所有 数据消耗完毕,或消耗的数据正好能生成 压缩数据。剩余的原始数据与更多数据合并,形成另一个 1MB 的原始数据,供下一次调用压缩。图 2(c) 描述了固定大小的输出压缩,其中大小可变的原始数据被压缩为 块。
There are several benefits of using fixed-sized output compression compared to the fixed-sized input one. First, as what we will show in the evaluation ( § ), it has better compression ratio under the same compression unit size. This is reasonable since the fixed-sized output compression can compress data as much as possible until the output limit is reached, while the fixed-sized input compression can only compress a fixed size of data at a time. Second, during the decompression, only the compressed blocks that contain the requested data will be read and processed. In the previous example where a single original block is requested, at most two compressed blocks will be read and decompressed. Third, as we will show later in §, the fixed-sized output compression makes it possible to do in-place decompression, which further reduces the memory consumption in EROFS.
与固定大小的输入相比,使用固定大小的输出压缩有几个好处。首先,正如我们将在评估( § )中展示的那样,在相同的压缩单元大小下,它具有更好的压缩比。这是合理的,因为固定大小的输出压缩可以尽可能多地压缩数据,直到达到输出极限,而固定大小的输入压缩每次只能压缩固定大小的数据。其次,在解压缩过程中,只有包含请求数据的压缩块才会被读取和处理。在前面的例子中,要求处理一个原始数据块,最多只能读取和解压缩两个压缩数据块。第三,正如我们将在后面的 § 中展示的,固定大小的输出压缩使得就地解压缩成为可能,从而进一步降低了EROFS 的内存消耗。

3.2 Cached I/O and in-place I/O
3.2 缓存 I/O 和就地 I/O

Before the actual decompression, EROFS needs space to store the compressed data retrieved from the storage. While this is costly for fixed-sized input compression due to excessive memory allocation and even page swapping, fixedsized output compression would incur much less cost since EROFS clearly knows that each compression only retrieves up to two compressed blocks. There are two strategies for EROFS: cached I/O and in-place I/O. EROFS uses cached I/O for compressed blocks that will be partially decompressed. EROFS manages a special inode whose page cache stores compressed blocks indexed by the physical block number. Thus, for cached I/O, EROFS will allocate a page in the special inode's page cache to initiate the I/O request, so that the compressed data will be directly fetched to the allocated page by the storage driver.
在实际解压缩之前,EROFS 需要空间来存储从存储器中获取的压缩数据。对于固定大小的输入压缩来说,由于过多的内存分配,甚至是页面交换,这都会造成高昂的成本,而固定大小的输出压缩所产生的成本则要低得多,因为EROFS 清楚地知道,每次压缩最多只能检索两个压缩块。EROFS有两种策略:缓存I/O和就地I/O。EROFS对将被部分解压缩的压缩块使用缓存I/O。EROFS管理着一个特殊的inode,其页面缓存存储着以物理块编号为索引的压缩块。因此,对于缓存 I/O,EROFS 将在特殊 inode 的页面缓存中分配一个页面来启动 I/O 请求,这样存储驱动程序就会直接将压缩数据提取到分配的页面。
For compressed blocks that will be completely decompressed, EROFS uses in-place I/O. On each file read, VFS will allocate pages in the page cache for the file system to put file data. For any one of these pages that contains no meaningful data before the decompression, we call it a reusable page. For in-place I/O, EROFS uses the last reusable page to initialize the I/O request.
对于将完全解压缩的压缩块,EROFS 使用就地 I/O。每次读取文件时,VFS 都会在页面缓存中分配页面,供文件系统放置文件数据。对于解压缩前不包含任何有意义数据的页面,我们称之为可重复使用页面。对于就地 I/O,EROFS 会使用最后一个可重用页面来初始化 I/O 请求。
Both I/O strategies are necessary. For cached I/O, partially decompressed blocks are cached in the special page cache, so that subsequent reads to the uncompressed part can use these blocks without invoking additional I/O requests. For blocks that are fully decompressed, they are unlikely to be used later since all decompressed data is stored in the page cache, which can serve subsequent reads without decompression. Thus, cached I/O vainly increases the memory spike due to page allocations for fully compressed blocks, while not contributing to subsequent file reads. In such cases, inplace I/O avoids unnecessary memory allocation, which relieves the memory pressure especially when there are many in-flight file read requests on different compressed blocks. Note that although it is possible to put the compressed block on the stack, it is not recommended to do so since the stack size is limited to be [14] and it is not easy to know how many bytes of the stack are still available.
这两种 I/O 策略都是必要的。对于缓存 I/O,部分解压缩的数据块被缓存在特殊的页面缓存中,这样,对未压缩部分的后续读取就可以使用这些数据块,而无需调用额外的 I/O 请求。对于完全解压缩的数据块,由于所有解压缩数据都存储在页面缓存中,无需解压缩就能满足后续读取,因此它们以后不太可能被使用。因此,缓存 I/O 会徒劳地增加完全压缩块的页面分配所导致的内存峰值,而对后续文件读取却没有任何帮助。在这种情况下,就地 I/O 可以避免不必要的内存分配,从而减轻内存压力,尤其是当不同压缩块上有许多飞行中的文件读取请求时。需要注意的是,虽然可以将压缩块放在堆栈中,但不建议这样做,因为堆栈大小限制为 [14],而且不容易知道堆栈中还有多少字节可用。

3.3 Decompression 3.3 减压

After loading compressed data into memory, we illustrate how EROFS decompresses data both fast and memoryefficiently. Examples in this section are based on Fig. 3(a) where the first five blocks (D0 to D4) and part of the block D5 are compressed to block , and the rest blocks are compressed to block C1. In this subsection, we only introduce
将压缩数据加载到内存后,我们将说明EROFS 如何既快速又高效地解压数据。本节中的示例基于图 3(a),其中前五个数据块(D0 至 D4)和数据块 D5 的一部分被压缩到数据块 ,其余数据块被压缩到数据块 C1。在本小节中,我们只介绍
(a) Compression (a) 压缩
(b) Vmap decompression (b) Vmap 解压缩
(c) Buffer decompression
(c) 缓冲区解压缩
(d) In-place decompression
(d) 就地减压
Fig. 3: Decompression 图 3:解压缩
how a single compressed block is decompressed since, for read requests containing data in multiple compressed blocks, the compressed blocks are decompressed one by one similarly. For example, to read blocks D4 to D6 in Fig. 3(a), C0 is firstly decompressed to get D4 and the first part of D5; then is decompressed to get the rest of D5 and D6.
因为对于包含多个压缩块中数据的读取请求,压缩块的解压缩方式是逐个类似的。例如,要读取图 3(a) 中的 D4 至 D6 块,首先对 C0 进行解压缩,得到 D4 和 D5 的第一部分;然后对 进行解压缩,得到 D5 和 D6 的其余部分。
Vmap decompression To get the data in block D3 and D4, EROFS first reads the compressed block C0 from the storage and stores it in the memory. Then EROFS will decompress it in following steps.
Vmap 解压缩 要获取块 D3 和 D4 中的数据,EROFS 首先从存储器中读取压缩块 C0 并将其存储到内存中。然后,EROFS 将按以下步骤进行解压缩。
  1. Find the largest needed block number that is stored in the compressed block (C0), which is the fifth block (D4) in the example. As an advantage, EROFS only needs to decompress the first five blocks (D0 to D4), rather than decompressing all original data blocks.
    找出存储在压缩数据块 (C0) 中的所需最大数据块编号,即示例中的第五个数据块 (D4)。这样做的好处是,EROFS 只需解压缩前五个数据块(D0 至 D4),而无需解压缩所有原始数据块。
  2. For each of the data blocks that need to be decompressed, find memory space to store them. In the example shown in Fig. 3(b), EROFS allocates three temporary physical pages to store D0, D1, and D2. For the requested two blocks, D3 and D4, EROFS reuses the two physical pages that have been allocated by VFS in the page cache.
    为每个需要解压缩的数据块寻找存储空间。在图 3(b) 所示的示例中,EROFS 分配了三个临时物理页来存储 D0、D1 和 D2。对于请求的两个数据块 D3 和 D4,EROFS 会重新使用页面缓存中已由 VFS 分配的两个物理页面。
  3. Since the decompression algorithm requires continuous memory as the destination of decompression, EROFS maps physical pages prepared in the previous step into a continuous virtual memory area via the vmap interface.
    由于解压缩算法要求将连续内存作为解压缩的目的地,EROFS 通过 vmap 接口将上一步准备好的物理页映射到一个连续的虚拟内存区域。
  4. If it's in-place I/O, in which case the compressed block (C0) is stored in the page cache page, EROFS also needs to copy the compressed data (C0) to a temporary perCPU page so that the decompressed data won't overwrite the compressed data during the decompression.
    如果是就地 I/O,在这种情况下,压缩块 (C0) 会存储在页面缓存页中,EROFS 还需要将压缩数据 (C0) 复制到临时的 perCPU 页面,这样解压缩数据就不会在解压缩过程中覆盖压缩数据。
  5. Finally, the decompression algorithm is invoked, and data in the compressed block is extracted to the continuous memory area. After the compression, the three temporal physical pages and the virtual memory area can be reclaimed, and the requested data has already been written to the corresponding page cache pages.
    最后,调用解压缩算法,将压缩块中的数据提取到连续内存区域。压缩完成后,三个临时物理页和虚拟内存区可被回收,请求的数据已被写入相应的页面缓存页。
Per-CPU buffer decompression The above decompression approach causes two problems. The first one is that it is still required to dynamically allocate physical memory pages, which increases the memory pressure on memoryconstrained devices. The second problem is that using vmap and vunmap on each decompression is inefficient.
每 CPU 缓冲区解压缩 上述解压缩方法会带来两个问题。第一个问题是,仍然需要动态分配物理内存页,这增加了内存受限设备的内存压力。第二个问题是,每次解压缩都使用 vmap 和 vunmap 的效率很低。
EROFS leverages per-CPU buffers to mitigate the problems when the decompressed data is less the four pages. As shown in Fig. 3(c), a four-page memory buffer is preallocated for each CPU as the per-CPU buffer. For decompression that extracts no more than four blocks of data, EROFS decompresses the data to the per-CPU buffer and then copy the requested data to the page cache pages. In the example demonstrated in Fig. 3(c), data in block D8 is requested. The compressed data in is directly decompressed to the per-CPU buffer, and the content of D8 is copied to the page cache page via memcpy.
当解压缩数据少于四页时,EROFS 利用每 CPU 缓冲区来缓解问题。如图 3(c)所示,每个 CPU 都预先分配了一个四页的内存缓冲区,作为每个 CPU 的缓冲区。对于提取不超过四个数据块的解压缩,EROFS 会将数据解压缩到每个 CPU 的缓冲区,然后将请求的数据复制到页面缓存页。在图 3(c) 演示的示例中,请求的数据位于块 D8 中。 中的压缩数据被直接解压到每个 CPU 的缓冲区,D8 的内容通过 memcpy 复制到页面缓存页。
The length of the per-CPU buffer is empirically decided, but it can effectively eliminate memory allocations since the per-CPU buffer can be reused across different decompressions. The per-CPU buffer decompression is a cost-effective trade-off which mitigates issues in the vmap decompression while introducing extra memory copies.
每个 CPU 缓冲区的长度根据经验决定,但由于每个 CPU 缓冲区可以在不同的解压缩过程中重复使用,因此可以有效消除内存分配。每 CPU 缓冲区解压缩是一种经济有效的权衡方法,它在减少 vmap 解压缩问题的同时,也引入了额外的内存拷贝。
Rolling decompression To avoid the overhead of vmap and vunmap and eliminate other dynamic page allocations, EROFS allocates a large virtual memory area and 16 physical pages for each CPU.
滚动解压缩 为了避免 vmap 和 vunmap 的开销,并消除其他动态页面分配,EROFS 为每个 CPU 分配了一个大的虚拟内存区 和 16 个物理页。
Before each compression, EROFS uses the 16 physical pages, along with the physical pages of the page cache to fill in the VM area, so that step 2 and step 3 of the vmap decompression can be skipped.
每次压缩前,EROFS 都会使用 16 个物理页以及页面缓存的物理页来填充虚拟机区域,这样就可以跳过 vmap 解压缩的第 2 步和第 3 步。
EROFS uses LZ4 as the compression algorithm, which needs to look backward at no more than of the decompressed data [7]. Thus, for a compression that extracts more than 16 pages, EROFS can reuse the physical page mapped 16 virtual pages (i.e., ) before. For example, in Fig. 4, the virtual addresses to store blocks D0 to D15 are backed by the 16 physical pages. The virtual page of D16 can be backed by the same physical page with D0 since each virtual address in D16 is away from the corresponding address in D0. D17 is backed in the same way by the physical page used by D1. D18, which is requested by the file read, uses the physical page of the page cache.
EROFS使用LZ4作为压缩算法,它需要回溯不超过 的解压缩数据[7]。因此,对于提取超过 16 个页面的压缩,EROFS 可以重复使用之前映射 16 个虚拟页面(即 )的物理页面。例如,在图 4 中,存储块 D0 至 D15 的虚拟地址由 16 个物理页支持。D16 的虚拟页面可以由与 D0 相同的物理页面支持,因为 D16 中的每个虚拟地址都与 D0 中的相应地址相距 。D17 以同样的方式由 D1 使用的物理页支持。文件读取请求的 D18 使用页面缓存的物理页面。
As a result, 16 physical pages are sufficient for any decompression cases by using such a rolling decompression.
因此,使用这种滚动解压缩方式,16 个物理页足以满足任何解压缩情况的需要。
Fig. 4: An example of rolling decompression
图 4:滚动减压示例
In-place decompression In step 4 of the vmap decompression approach, the compressed data is moved to a temporary per-CPU page to avoid data not yet compressed from being overwritten by the compressed data (Fig. 5).
就地解压缩 在 vmap 解压缩方法的第 4 步中,压缩数据被移至每个 CPU 的临时页面,以避免尚未压缩的数据被压缩数据覆盖(图 5)。
Fig. 5: An example compressed block (C0) that cannot be decompressed in-place. The decompressed data of sequence A corrupts the next sequence (the shadow part) which has not been decompressed.
图 5:无法就地解压缩的压缩块 (C0) 示例。序列 A 的解压缩数据会损坏未解压缩的下一个序列(阴影部分)。
However, if such a situation will never happen for a compressed block, we can decompress it in-place to avoid the extra memory allocation and memory copies. EROFS simulates the decompression during and marks whether a compressed block can be decompressed in-place in the block index. During the decompression of a block that can be decompressed in-place, step 4 is skipped. In our tested workload enwik 9 in § compressed blocks can be decompressed in-place; thus, most blocks can benefit from the inplace decompression as long as they are retrieved by in-place I/O.
不过,如果压缩块永远不会出现这种情况,我们可以就地解压缩,以避免额外的内存分配和内存拷贝。EROFS会在 ,并在块索引中标记压缩块是否可以就地解压缩。在对可就地解压缩的数据块进行解压缩时,将跳过步骤 4。在我们测试的工作负载 enwik 9 中, § ,压缩块可以就地解压缩;因此,只要通过就地 I/O 检索,大多数块都能从就地解压缩中受益。

4 Implementation 4 执行

We have implemented EROFS as a Linux file system and upstreamed the common part of EROFS to Linux kernel .
我们将EROFS作为Linux文件系统实现,并将EROFS的通用部分上传到Linux内核
In the current implementation, we use as the fixed output size since it is the minimal unit of page management and storage data transfers, and thus I/O amplification can be minimized. We support LZ4 (v1.8.3) as the compression algorithm since it has the fastest decompression speed and good compression ratio in our case. Other compression algorithms, such as LZO, can be supported once they are modified to provide fixed-sized output compression interfaces. Only the file data is compressed in EROFS; metadata such as inode and directory entries is stored without compression.
在当前的实现中,我们使用 作为固定输出大小,因为它是页面管理和存储数据传输的最小单位,因此可以最大限度地减少 I/O 放大。我们支持 LZ4(v1.8.3)作为压缩算法,因为在我们的案例中,它具有最快的解压缩速度和良好的压缩比。一旦对其他压缩算法(如 LZO)进行修改,使其提供固定大小的输出压缩接口,就可以支持这些算法。在 EROFS 中,只有文件数据被压缩;元数据(如 inode 和目录条目)无需压缩即可存储。
Currently, EROFS is still under active development and new features are constantly shipped into the smartphones
目前,EROFS 仍在积极开发中,新功能也在不断加入到智能手机中。
after a rigorous commercial testing process. Hence, we introduce two versions of EROFS: 1) the latest version with all features and optimizations presented in this paper; 2) the commercially-deployed version, which has all features and optimizations except the rolling decompression and the inplace decompression. The two versions are also different in decompression policies which we will illustrate in §4.2.
经过严格的商业测试过程。因此,我们介绍了两个版本的EROFS:1)最新版本,包含本文介绍的所有功能和优化;2)商业部署版本,包含除滚动解压缩和就地解压缩之外的所有功能和优化。这两个版本在解压缩策略上也有所不同,我们将在第 4.2 节中加以说明。

4.1 EROFS image layout
4.1 EROFS 图像布局

Fig. 6 shows the layout of an EROFS image. As in other file systems, a super block is located at the beginning of the image. Following the super block, metadata and data may be stored in a mixed style without constraints on the order.
图 6 显示了EROFS 映像的布局。与其他文件系统一样,超级块位于映像的开头。在超级块之后,元数据和数据可以混合方式存储,顺序不受限制。
In the current implementation, metadata and data of a file are stored together for better locality. For each file, as shown in Fig. 6, an inode is stored at the beginning, followed by blocks containing the extended attributes (i.e., xattrs) and the block index. Blocks for compressed or uncompressed file data (encoded blocks) are stored at the end of each file.
在当前的实现中,文件的元数据和数据存储在一起,以获得更好的定位性。如图 6 所示,每个文件的开头都会存储一个 inode,然后是包含扩展属性(即 xattrs)和块索引的块。压缩或未压缩文件数据块(编码块)存储在每个文件的末尾。
Since an inode can be placed anywhere in the image, the inode number is calculated from the position of an inode, so that the inode can be quickly located. Blocks for xattrs and the block index are omitted if a file contains no xattrs or is uncompressed. Further, xattrs, the block index and file data can also be inlined within an inode if possible, which reduces storage overhead and decreases the number of I/O requests since the inlined data/metadata is read along with the inode.
由于 inode 可以放置在图像中的任何位置,因此 inode 编号是根据 inode 的位置计算出来的,这样可以快速定位 inode。如果文件不包含 xattrs 或未压缩,则会省略 xattrs 块和块索引。此外,如果可能,xattrs、块索引和文件数据也可以内联到 inode 中,这样可以减少存储开销和 I/O 请求次数,因为内联的数据/元数据是与 inode 一起读取的。
The block index is used to quickly locate the corresponding encoded block for read requests. Fig. 6 shows an example block index for a regular file containing ten blocks before compression. The block index is an array of 8B-length entries, each of which corresponds to a data block before compression. Each entry indicates whether the corresponding data block is the head block (the boolean head field in Fig. 6), which starts a new encoded block. If so, the encoded block address (blkaddr), the offset of the first byte in the new encoded block ( ), whether the encoded block is compressed (cmpr), and whether the block can be decompressed in-place (dip) are also stored. If not, there must be a head block before the uncompressed block, and the block number difference to the head block is recorded in dist.
块索引用于快速定位读取请求的相应编码块。图 6 显示了一个常规文件的块索引示例,该文件在压缩前包含 10 个块。块索引是一个由 8B 长度条目组成的数组,每个条目对应一个压缩前的数据块。每个条目指示相应的数据块是否是头部块(图 6 中的布尔头部字段),头部块是一个新编码块的开始。如果是,则还会存储编码块地址 (blkaddr)、新编码块中第一个字节的偏移量 ( )、编码块是否被压缩 (cmpr) 以及该块是否可以就地解压缩 (dip)。如果不是,则在未压缩块之前必须有一个头部块,头部块的块号差值记录在 dist 中。
For a read request to an uncompressed data block, EROFS gets the block index entry according to the requested block number. For a head block, EROFS decompresses data from the block at blkaddr, and if the offset is non-zero, EROFS may also need to decompress from the nearest encode block stored before the blkaddr. For a non-head block, EROFS calculates the location of the corresponding head block according to the stored dist, and starts to decompress until the requested block data is decompressed.
对于未压缩数据块的读取请求,EROFS 会根据请求的块编号获取块索引条目。对于头部数据块,EROFS 从 blkaddr 处的数据块解压数据,如果偏移量不为零,EROFS 可能还需要从存储在 blkaddr 之前的最近编码块解压数据。对于非头部区块,EROFS 会根据存储的偏移量计算相应头部区块的位置,并开始解压缩,直到请求的区块数据解压缩完毕。
Some data blocks (e.g., block 5 in Fig. 6), which are larger after compression, are not compressed and directly stored as encoded blocks. For these cases, the corresponding cmpr fields are set to false (i.e., " " in the figure).
有些数据块(如图 6 中的数据块 5)在压缩后较大,则不进行压缩,直接作为编码块存储。在这种情况下,相应的 cmpr 字段被设置为 false(即图中的 " ")。
Directories are stored similarly as the regular files, except that there is no block index, and the encoded blocks are used to store uncompressed directory entries. For better locality of random accesses in directories, EROFS puts all dirent headers (e.g., inode number, file type, and name length) at the beginning of directory entries part, and places filenames after those headers.
目录的存储方式与普通文件类似,只是没有块索引,编码块用于存储未压缩的目录条目。为了更好地定位目录中的随机存取,EROFS 将所有目录头(如 inode 编号、文件类型和名称长度)放在目录条目部分的开头,并将文件名放在目录头之后。
Fig. 6: EROFS image layout and the block index
图 6:EROFS 图像布局和区块索引

4.2 Decompression policy
4.2 减压政策

Two versions of EROFS have different decompression policies. The commercially-deployed EROFS uses the per-CPU buffer decompression if less than four original data blocks are to be extracted; otherwise, the vmap decompression is used.
两个版本的EROFS有不同的解压缩策略。如果要提取的原始数据块少于四个,商业部署的EROFS会使用每CPU缓冲区解压缩,否则会使用vmap解压缩。
In the latest EROFS, all four decompression approaches are implemented. If there is no more than one data block to be extracted, the per-CPU buffer decompression is chosen. Otherwise, if a compressed block is retrieved using in-place I/O and can be decompressed in-place, EROFS employs the in-place decompression approach which avoids unnecessary memory allocations and memory copies. For other cases where the decompressed blocks can fit in the pre-allocated VM area, EROFS uses the rolling decompression since it beats the vmap decompression with less memory allocation overheads. For any other cases, the vmap decompression approach is adopted.
在最新的 EROFS 中,所有四种解压缩方法都已实现。如果要提取的数据块不超过一个,则选择每 CPU 缓冲区解压缩。否则,如果压缩数据块是通过就地 I/O 提取的,并且可以就地解压缩,EROFS 就会采用就地解压缩方法,从而避免不必要的内存分配和内存拷贝。在其他情况下,如果解压缩后的数据块可以放入预分配的虚拟机区域,EROFS 会使用滚动解压缩,因为它比 vmap 解压缩的内存分配开销更少。对于其他情况,则采用 vmap 解压缩方法。

4.3 Optimizations 4.3 优化

Index memory optimization It is possible that EROFS compresses hundreds of pages of original data into a single compressed block. In such a case, EROFS needs hundreds of pointers to keep track of where each page of the original data should be stored. These pointers can consume a large amount of memory. To address such a problem, EROFS tries to store the information with the help of reusable pages. If there are more than one VFS allocated pages are reusable, EROFS uses the last page to store the compressed data, and the other pages to store some of these pointers during the I/O. Before the actual decompression, these pointers are moved onto the stack, so that the reusable pages are free to store the decompressed data.
索引内存优化 EROFS 有可能将数百页原始数据压缩到一个压缩块中。在这种情况下,EROFS 需要数百个指针来跟踪原始数据每一页的存储位置。这些指针会占用大量内存。为了解决这个问题,EROFS 尝试借助可重复使用的页面来存储信息。如果有多个 VFS 分配的页面可重复使用,EROFS 就会使用最后一个页面来存储压缩数据,而在 I/O 过程中使用其他页面来存储其中的一些指针。在实际解压缩之前,这些指针会被移到堆栈中,这样可重复使用的页面就可以用来存储解压缩数据。
Scheduling optimization Decompression requires a relatively long time. Thus it is not suitable to be done within the interrupt context. In some file systems, such as Btrfs [2], when compressed data has been fetched to memory, a dedicated thread will be woken up to decompress the data. When the decompression is finished, the reader thread which issues the I/O will be woken up to get the decompressed data from the page cache. To reduce scheduling overhead, EROFS decompresses data in the reader thread, without dedicated threads for decompression. Thus once the compressed data has been fetched to memory, the reader thread will be directly woken up and start decompressing the data.
优化调度 解压缩需要较长的时间。因此不适合在中断环境下进行。在某些文件系统(如 Btrfs [2])中,当压缩数据被获取到内存中时,会唤醒一个专门的线程来解压缩数据。解压缩完成后,发出 I/O 的读取线程将被唤醒,以便从页面缓存中获取解压缩数据。为了减少调度开销,EROFS 在阅读器线程中解压数据,而不使用专用线程进行解压。因此,一旦压缩数据被获取到内存中,阅读器线程就会被直接唤醒并开始解压数据。
Cohort decompression Several requests can be in-flight simultaneously. If an original data block is requested on thread A and the corresponding compressed block is being decompressed by another thread named thread B, rather than decompressing the data by itself, thread A can wait for thread B to finish the decompression, and then directly read the decompressed data from the page cache. Such cooperation reuses the decompressed data and prevents a single data being decompressed multiple times.
协同解压缩 可以同时处理多个请求。如果线程 A 请求一个原始数据块,而相应的压缩块正在被另一个名为线程 B 的线程解压缩,那么线程 A 可以等待线程 B 完成解压缩,然后直接从页面缓存中读取解压缩后的数据,而不是自己解压缩数据。这种合作可重复使用解压缩数据,并防止单个数据被多次解压缩。
Image patching Although EROFS is a compressed readonly file system, there are cases such as system upgrade or security patching where the data stored in EROFS needs to be updated. EROFS provides a feature called image patching, which supports partial data updates. Usually, modifying a single bit in the original file data might cause a huge amount of scattered modifications in the compressed data. Instead of modification in-place, image patching places updated data at the end of the EROFS image, and when the corresponding file data blocks are requested, the origin data blocks are firstly decompressed and then the updated data is applied to overwrite the decompressed data in memory. In this way, image patching prevents scattering of changes and supports partial data updates without re-compressing the while file system.
镜像修补 尽管EROFS是一个压缩的只读文件系统,但在系统升级或安全修补等情况下,需要更新存储在EROFS中的数据。EROFS提供了一种叫做图像修补的功能,它支持部分数据更新。通常,修改原始文件数据中的一个位可能会导致压缩数据中出现大量分散的修改。与就地修改不同,图像修补将更新数据放在EROFS图像的末尾,当请求相应的文件数据块时,首先对原始数据块进行解压缩,然后应用更新数据覆盖内存中的解压缩数据。通过这种方式,映像修补可防止更改分散,并支持部分数据更新,而无需重新压缩文件系统。

5 Evaluation 5 评估

We have conducted a set of experiments to answer the following questions:
我们进行了一系列实验来回答以下问题:
  • How does compression affect the performance of file system read accesses?
    压缩如何影响文件系统读取访问的性能?
  • How much memory does EROFS consume during decompression?
    EROFS在解压缩过程中会消耗多少内存?
  • How does EROFS affect the boot time on real-world applications?
    EROFS对实际应用的启动时间有何影响?

5.1 Evaluation setup 5.1 评估设置

By default, we conduct experiments on an ARM development board, HiKey 960, running Android 9 Pie with Linux kernel 4.14. The board is equipped with Kirin 960 (four Cortex-A73 big cores and four Cortex-A53 little cores), 3GB Hynix LPDDR4 memory and 32GB Samsung UFS storage. We also evaluate on two kinds of smartphones in some experiments. The low-end smartphones are equipped with MT6765 (eight Cortex-A53 cores), 2GB memory and 32GB
默认情况下,我们在运行 Android 9 Pie 和 Linux 内核 4.14 的 ARM 开发板 HiKey 960 上进行实验。该开发板配备麒麟 960(四个 Cortex-A73 大核和四个 Cortex-A53 小核)、3GB 海力士 LPDDR4 内存和 32GB 三星 UFS 存储。我们还在两种智能手机上进行了评估。低端智能手机配备 MT6765(8 个 Cortex-A53 内核)、2GB 内存和 32GB UFS 存储器。
eMMC storage. High-end smartphones run with Kirin 980 (four Cortex-A76 cores and four Cortex-A55 cores), 6GB memory and 64GB UFS storage.
eMMC 存储。高端智能手机采用麒麟 980(四个 Cortex-A76 内核和四个 Cortex-A55 内核)、6GB 内存和 64GB UFS 存储。
For micro-benchmarks, we run FIO [23], a flexible I/O tester, on various file systems including EROFS, Squashfs, Btrfs, Ext4, and F2FS. We use the latest version of EROFS for micro-benchmark evaluation. Among these file systems, EROFS and Squashfs are designed to be compressed readonly file systems; Btrfs is a file system with compression support, but it is not a file system designed for read-only data; Ext4 is the default file system used by Android; F2FS is a file system designed for mobiles and is widely used in some smartphones.
在微基准测试中,我们在各种文件系统(包括 EROFS、Squashfs、Btrfs、Ext4 和 F2FS)上运行灵活的 I/O 测试器 FIO [23]。我们使用最新版本的EROFS进行微基准评估。在这些文件系统中,EROFS 和 Squashfs 被设计为压缩只读文件系统;Btrfs 是一个支持压缩的文件系统,但它不是一个为只读数据设计的文件系统;Ext4 是安卓系统使用的默认文件系统;F2FS 是一个为手机设计的文件系统,在一些智能手机中被广泛使用。
EROFS is configured to use 4KB-sized output compression with LZ4. Squashfs is configured to use LZ4 with , , and compression chunk sizes, indicated by Squashfs-4K, Squashfs-8K, Squashfs-16K, and Squashfs , respectively. Btrfs is configured to run in read-only mode without data integrity checks for a fair comparison. The compression algorithm used by Btrfs is LZO, since Btrfs does not support LZ4. Both Ext4 and F2FS are used without compression in the experiments since they do not support it.
EROFS配置为使用LZ4压缩4KB大小的输出。Squashfs 配置为使用 LZ4,压缩块大小为 ,分别用 Squashfs-4K、Squashfs-8K、Squashfs-16K 和 Squashfs 表示。为进行公平比较,Btrfs 被配置为以只读模式运行,不进行数据完整性检查。Btrfs 使用的压缩算法是 LZO,因为 Btrfs 不支持 LZ4。Ext4 和 F2FS 都不支持压缩,因此在实验中没有使用。
For real-world applications, we compare EROFS with Ext4, since Ext4 is now the default file system used by Android [17]. We use the commercial version for real-world evaluation since it takes time to ship the latest version to smartphones. We also tried to use Squashfs on Android. However, it costed too much CPU and memory resources, and when trying to run a camera application, the phone froze for tens of seconds before it finally failed.
在实际应用中,我们将EROFS与Ext4进行比较,因为Ext4是目前安卓系统使用的默认文件系统[17]。我们使用商业版本进行实际评估,因为向智能手机发布最新版本需要时间。我们还尝试在 Android 上使用 Squashfs。然而,它耗费了太多的 CPU 和内存资源,而且在尝试运行相机应用程序时,手机冻结了几十秒才最终失败。

5.2 Micro-benchmarks 5.2 微型基准

We use FIO to show the basic I/O efficiency of different file systems. In this experiment, we use enwik9 [40] as the workload, which is the first bytes of the English Wikipedia dump. We store the file in different file systems and read the file to test the file system read throughput. Each read is a buffered read. We test the throughput under three scenarios: sequential read, random read, and stride read. For the sequential read, we read the file by sequentially; thus the following reads are highly likely to hit in cache since the data is already loaded in the memory by previous decompression or prefetching (i.e., readahead). For the random read, we randomly read the whole file; thus the reads can hit in the cache if the data is already decompressed by previous reads. The last scenario is the stride read, in which we only read the first in every data. Since the largest compression chunk is , stride reads will not hit in cache . We test stride reads to illustrate the worst-case performance for compressed file systems.
我们使用 FIO 来展示不同文件系统的基本 I/O 效率。在本实验中,我们使用 enwik9 [40] 作为工作负载,它是英文维基百科转储的第一个 字节。我们将该文件存储在不同的文件系统中并读取该文件,以测试文件系统的读取吞吐量。每次读取都是 缓冲读取。我们在三种情况下测试吞吐量:顺序读取、随机读取和跨步读取。在顺序读取时,我们通过 按顺序读取文件 ;因此,由于数据已通过之前的解压缩或预抓取(即超前读取)加载到内存中,接下来的读取很可能会命中缓存。对于随机读取,我们随机读取整个文件;因此,如果数据已经被之前的读取解压缩,读取数据就有可能在缓存中命中。最后一种情况是跨步读取,我们只读取每个 数据中的第一个 。由于最大的压缩块是 ,因此跨步读取不会命中缓存 。我们测试了跨步读取,以说明压缩文件系统的最差性能。
Before each test, the page cache is dropped to reduce interference. All tests are done at least ten times, and the average
每次测试前,都会放弃页面缓存以减少干扰。所有测试至少进行十次,平均
throughputs are reported. The max relative standard deviation is for stride reads on A53 cores and for the rest results. Fig. 7 shows the following results we observed. Btrfs performs worst in all tests compared with EROFS and Squashfs-128K, since it is designed neither for compression nor for read-only data. On one hand, Btrfs does not take advantage of the read-only property and has to consider updates; thus it is outperformed by the compressed read-only file systems EROFS and Squashfs-128K. On the other hand, decompression in Btrfs incurs notable performance overhead compared to Ext4 and F2FS which do not need to decompress data during reads. This is reasonable since Btrfs is not designed to be a compressed read-only file system.
报告了吞吐量。A53 内核上的跨步读取的最大相对标准偏差为 ,其余结果的最大相对标准偏差为 。图 7 显示了我们观察到的以下结果。与 EROFS 和 Squashfs-128K 相比,Btrfs 在所有测试中表现最差,因为它既不是为压缩设计的,也不是为只读数据设计的。一方面,Btrfs 无法利用只读属性的优势,必须考虑更新;因此它的性能被压缩只读文件系统 EROFS 和 Squashfs-128K 所超越。另一方面,与在读取过程中不需要解压数据的 Ext4 和 F2FS 相比,Btrfs 的解压缩会产生显著的性能开销。这是合理的,因为 Btrfs 并不是设计为压缩只读文件系统。
Btrfs performs better than other configurations of Squashfs for sequential reads, which is caused by its larger compression chunk (128KB). The advantages shrink in random reads where prefetching does not work; the advantage disappears in stride reads where decompressing more data than requested becomes the burden.
在顺序读取方面,Btrfs 的性能优于 Squashfs 的其他配置,这是因为它的压缩块较大(128KB)。在预取不起作用的随机读取中,Btrfs 的优势缩小了;在跨步读取中,优势消失了,因为在跨步读取中,解压缩的数据比请求的数据多成为一种负担。
Overall, this result shows the inefficiency of using general file systems with compression support for read-only data and emphasizes the necessity of designing compressed read-only file systems.
总之,这一结果表明,使用支持压缩的普通文件系统处理只读数据的效率很低,并强调了设计压缩只读文件系统的必要性。
As the size of compression input increases, the performance of Squashfs increases for random reads and sequential reads, but decreases for stride reads. The main reason for this phenomenon is the locality and cache. Since file systems have enough memory to cache file data in this experiment, all decompressed data will be cached and possibly be read in the future. Thus for random reads and sequential reads, the larger-sized data is decompressed, more future reads will hit the cache. That is basically the reason why the Squashfs throughputs grow as the compression chunk size increases.
随着压缩输入大小的增加,Squashfs 的随机读取和顺序读取性能会提高,但字符串读取性能会降低。造成这种现象的主要原因是本地性和缓存。在本实验中,由于文件系统有足够的内存来缓存文件数据,所有解压缩数据都将被缓存,并有可能在未来被读取。因此,对于随机读取和顺序读取,解压缩的数据越大,缓存中的未来读取次数就越多。这就是 Squashfs 吞吐量随压缩块大小增加而增长的根本原因。
Since both sequential and random reads will read the whole file, there is only a little performance difference, which is caused by the good locality and prefetching.
由于顺序读取和随机读取都会读取整个文件,因此性能差异很小,这是由于良好的定位和预取造成的。
For stride reads, however, FIO only reads the first data for each data, which eradicates the benefits of memory cache since all the data decompressed but not requested will never be used in the future. Thus the more irrelevant data is read and decompressed, the more time and resource are wasted, yielding worse performance. That explains why the throughput drops with the increase of the compression chunk size for Squashfs.
然而,对于跨步读取,FIO 只读取每个 数据的第一个 数据,这消除了内存缓存的好处,因为所有解压缩但未请求的数据将来都不会使用。因此,读取和解压缩的无关数据越多,浪费的时间和资源就越多,性能也就越差。这就是为什么 Squashfs 的吞吐量会随着压缩块大小的增加而下降。
EROFS performs best in most of the tests among file systems with compression support and sometimes outperforms file systems that do not compress data. For sequential reads, EROFS exhibits the best performance among compressed file systems. Most wins come from the design of fixed-sized output compression and the elimination of unnecessary memory allocations and data movements compared with Squashfs. For random reads, EROFS is outperformed by Squashfs since the latter can decompress and cache the whole file during the test, while EROFS only benefits from cached
在支持压缩的文件系统中,EROFS 在大多数测试中表现最佳,有时甚至优于不压缩数据的文件系统。在顺序读取方面,EROFS 在压缩文件系统中表现最佳。与 Squashfs 相比,EROFS 的大部分优势来自于固定大小的输出压缩设计,以及消除了不必要的内存分配和数据移动。在随机读取方面,EROFS 的性能优于 Squashfs ,因为后者可以在测试期间解压缩并缓存整个文件,而EROFS 只能从缓存的文件中获益。
Fig. 7: FIO micro-benchmark results under three read patterns at four CPU frequencies
图 7:四种 CPU 频率下三种读取模式下的 FIO 微基准测试结果
I/Os. However, EROFS still performs better than other compressed file systems. For stride reads, since the prefetching is barely useful, EROFS still yields the best throughput among compressed file systems, but the win is limited.
I/O。不过,EROFS 的性能仍然优于其他压缩文件系统。对于跨步读取,由于预取几乎没有用处,EROFS 仍然是压缩文件系统中吞吐量最好的,但胜算有限。
Compared with Ext4 and F2FS without compression support, EROFS always performs comparably with and even outperforms them (e.g., the sequential reads on A73 cores). The reason is that even if EROFS needs to decompress data, it reads much less data from the storage thanks to compression.
与不支持压缩的 Ext4 和 F2FS 相比,EROFS 的性能始终不相上下,甚至更胜一筹(例如,在 A73 内核上的顺序读取)。原因在于,即使EROFS 需要解压数据,但由于压缩,它从存储中读取的数据要少得多。
Fig. 8: FIO micro-benchmark results for silesia.tar
图 8:silesia.tar 的 FIO 微基准测试结果
We also use the workload, silesia.tar [9], to conduct the same experiment. Silesia.tar is a tarball of the silesia compression corpus, which covers typical data types used nowadays. The results show the same trends as enwik9, so we only present the result for A73 cores at in Fig. 8 .
我们还使用工作负载 silesia.tar [9] 进行了同样的实验。Silesia.tar 是 silesia 压缩语料库的压缩包,涵盖了当前使用的典型数据类型。实验结果显示了与 enwik9 相同的趋势,因此我们在图 8 中只列出了 A73 内核( )的结果。

5.3 Compression ratio and memory usage
5.3 压缩率和内存使用量

We also evaluated the compression ratio and memory consumption during the decompression of each file system. We use both enwik9 and silesia.tar to present the compression ratio of different file systems. Fig. 9(a) and Fig. 9(b) show the number of bytes used on each file system for enwik9 and silesia.tar. The origin line in both figures represents the size of the uncompressed workload file, which is for enwik9 and 202.1MB for silesia.tar. Currently, EROFS only supports 4KB-sized output compression, but compared with Squashfs- , the compressed size is and smaller for the two workloads. The figure also matches the facts that the larger the compression unit, the better the compression ratio.
我们还评估了每个文件系统的压缩率和解压缩过程中的内存消耗。我们使用 enwik9 和 silesia.tar 来展示不同文件系统的压缩率。图 9(a)和图 9(b)显示了 enwik9 和 silesia.tar 在每个文件系统中使用的字节数。两图中的原点线代表未压缩工作负载文件的大小,enwik9 为 ,silesia.tar 为 202.1MB。目前,EROFS 仅支持 4KB 大小的输出压缩,但与 Squashfs- 相比,两个工作负载的压缩大小分别小于 。该图也符合压缩单元越大,压缩率越高的事实。
(a) enwik9
(b) silesia.tar
Fig. 9: Compressed size
图 9:压缩尺寸
Fig. 10: Decompression memory usage
图 10:解压缩内存使用情况
Fig. 10 shows the memory used after decompressing the enwik9 file. The test is conducted as follows: boot the machine, mount the file system, read the file stored in the file system, check the memory used, and then reboot.
图 10 显示了解压 enwik9 文件后使用的内存。测试过程如下:启动机器、挂载文件系统、读取存储在文件系统中的文件、检查内存使用情况,然后重新启动。
Since the file is roughly , the remainders are either used by other parts of the operating system or temporarily
由于文件的大小约为 ,剩余部分要么被操作系统的其他部分使用,要么被临时使用。
Table 1: I/O amount under different read patterns
表 1:不同读取模式下的 I/O 量
I/O (MB) seq-read rand-read stride-read
Requested 16.00 16.00 16.00
Squashfs-4K 10.65 26.19 26.23
Squashfs-8K 9.82 33.52 34.08
Squashfs-16K 9.05 46.42 48.32
Squashfs-128K 7.25 165.27 203.91
EROFS 10.14 26.12 25.93
Table 2: I/O patterns
表 2:输入/输出模式
I/O size
19.0 23.9 30.4 78.9 19.9 1.2
used by the file system. Besides EROFS and Squashfs, we also tested Ext4 as the baseline. From the figure, we can see that compared to Ext4, the memory overhead for four configurations of Squashfs ranges from to . However, memory used by EROFS is only slightly higher than the Ext 4 (about 4.9%). The result shows that EROFS has much lower memory spikes than Squashfs and proves the effectiveness of memory-friendly decompression of EROFS.
文件系统使用。除了EROFS和Squashfs,我们还测试了作为基准的Ext4。从图中我们可以看到,与 Ext4 相比,Squashfs 四种配置的内存开销从 不等。不过,EROFS 使用的内存仅略高于 Ext4(约 4.9%)。结果表明,EROFS 的内存峰值比 Squashfs 低得多,这也证明了EROFS 的内存友好解压缩的有效性。
In the test, there is only one file to be decompressed, and we allocate abundant memory to ensure that no memory reclamation or swapping will happen during the decompression. However, in a real-world scenario where more files will be decompressed simultaneously, more memory will be needed by the decompression of Squashfs. Once the available memory is scarce, memory reclamation or swapping may happen, which is very expansive and affects not only the file systems, but also other components or applications in the whole system. Thus in real-world scenarios, the advantages of EROFS, which uses as little as memory during the decompression, will be more remarkable.
在测试中,只有一个文件需要解压缩,我们分配了充足的内存,以确保在解压缩过程中不会发生内存回收或交换。然而,在实际场景中,会有更多的文件同时解压缩,Squashfs 的解压缩过程就会需要更多的内存。一旦可用内存不足,就会发生内存开垦或交换,这不仅会影响文件系统,还会影响整个系统中的其他组件或应用程序。因此,在实际应用场景中,EROFS 在解压缩过程中占用的内存越少,其优势就越明显。

5.4 I/O amplification and I/O patterns
5.4 输入/输出放大和输入/输出模式

We reran tests mentioned in § on EROFS and different Squashfs configurations. Table 1 lists the actual I/O issued when reading file data under three read patterns. EROFS issued the least I/O for random reads and stride reads. Yet, since Squashfs-8K, Squashfs-16K and Squashfs-128K have a better compression ratio, they read less data than EROFS for sequential reads. In summary, EROFS reduces the I/O amplification for most cases compared with Squashfs, especially for random reads and stride reads.
我们在 EROFS 和不同 Squashfs 配置上重新进行了 § 中提到的测试。表 1 列出了在三种读取模式下读取 文件数据时产生的实际 I/O。在随机读取和stride读取时,EROFS发出的I/O最少。然而,由于 Squashfs-8K、Squashfs-16K 和 Squashfs-128K 有更好的压缩比,它们在顺序读取时比 EROFS 读取的数据更少。总之,与 Squashfs 相比,EROFS 在大多数情况下都能减少 I/O 放大,尤其是在随机读取和跨步读取时。
We further identified the I/O pattern in a simulated realworld environment to illustrate how I/O amplification will affect real-world applications. We installed 100 apps and ran the Monkey tool [13] to randomly tap the screen once per second for 3 hours. We collected the I/O sizes passed to the readpage and readpages interfaces and show the proportion of different I/O sizes in Table 2. The result shows that there are quite a lot of I/Os ( ) with sizes no more than , which we consider as random I/Os. The amount of random I/Os is reasonable since as the system keeps running for a long time, some pages in the applications' page cache are reclaimed due to memory shortage. The insignificant amount of random I/Os emphasizes the importance of EROFS's effort of reducing the I/O amplification.
我们进一步确定了模拟真实世界环境中的 I/O 模式,以说明 I/O 放大将如何影响真实世界的应用。我们安装了 100 个应用程序,并运行 Monkey 工具 [13],每秒随机点击屏幕一次,持续 3 小时。我们收集了传递到 readpage 和 readpages 接口的 I/O 大小,并在表 2 中显示了不同 I/O 大小所占的比例。结果显示,有相当多的 I/O ( ) 大小不超过 ,我们将其视为随机 I/O。随机 I/O 的数量是合理的,因为随着系统的长期运行,应用程序页面缓存中的一些页面会因内存不足而被回收。随机 I/O 量的微不足道强调了 EROFS 在减少 I/O 放大方面所做努力的重要性。

5.5 Throughput and space savings
5.5 吞吐量和节省空间

(a) Throughput on A73 cores
(a) A73 内核的吞吐量
(b) Throughput on A73 cores
(b) A73 内核的吞吐量
Fig. 11: Throughput under different space savings
图 11:不同空间节省情况下的吞吐量
Fig. 11 depicts the throughput of EROFS and Ext4 under different space savings. The space savings is the value of space reduced by the compression divided by the size of original data; thus, a larger space savings indicates more space saved. For simplicity, we only show the results for big cores, since similar trends are presented on little cores.
图 11 描述了EROFS 和 Ext4 在不同空间节省情况下的吞吐量。节省的空间是压缩后减少的空间值除以原始数据的大小;因此,节省的空间越大,表示节省的空间越多。为简单起见,我们只显示了大内核的结果,因为小内核也有类似的趋势。
For the test, we collected blocks in our compressed partitions and check their space savings. When we found a block that matches our expected space savings, the original data was decompressed and duplicated multiple times to form a roughly 512MB file. Then we stored the file in EROFS and read it to get the read throughput under its space savings. We also stored the file in Ext4 and got the corresponding read throughput for comparison.
在测试中,我们收集了压缩分区中的数据块,并检查其节省的空间。当我们找到一个符合预期节省空间的区块时,原始数据就会被解压缩并复制多次,形成一个大约 512MB 的文件。然后,我们将该文件存储在 EROFS 中并读取它,以获得其节省空间情况下的读取吞吐量。我们还将文件存储在 Ext4 中,并获得相应的读取吞吐量,以进行比较。
Generally, the throughput of Ext4 remains stable during the test and the performance of EROFS increases together with space savings. EROFS achieves much better throughput than Ext4 when the space savings is high enough. In such cases, a single compressed block can be decompressed to dozens of blocks. Thus, the number of I/O requests is notably reduced, leading to higher performance. While when the space savings is low, the performance of EROFS is similar to Ext4 for random reads and worse than Ext4 for sequential reads. This is the result of the conjunction of I/O costs and decompression computation costs. For random reads, the I/O is more expensive than the decompression computation. While for sequential reads, due to prefetching, I/O is less costly and the computation cost dominates when the space
一般来说,Ext4 的吞吐量在测试过程中保持稳定,而 EROFS 的性能则随着空间节省而提高。当节省的空间足够大时,EROFS 的吞吐量要比 Ext4 高得多。在这种情况下,一个压缩块可以解压缩为几十个块。这样,I/O 请求的数量就会明显减少,从而获得更高的性能。而当节省的空间较小时,EROFS 的随机读取性能与 Ext4 类似,而顺序读取性能则比 Ext4 差。这是 I/O 成本和解压缩计算成本共同作用的结果。对于随机读取,I/O 比解压缩计算更昂贵。而对于顺序读取,由于预取的存在,I/O 的成本较低,当空间不足时,计算成本占主导地位。

savings is low. 储蓄率低。

5.6 Different decompression approaches and optimization
5.6 不同的解压缩方法和优化

To illustrate the effect of different decompression approaches, we ran FIO sequential reads on the Kirin 980 smartphone with A76 cores at . The vmap decompression approach serves file reads at while the per-CPU buffer decompression yields throughput of . File data is read at in the latest EROFS with the rolling decompression and the in-place decompression added.
为了说明不同解压缩方法的效果,我们在配备 A76 内核的麒麟 980 智能手机上运行了 FIO 连续读取,网址为 。vmap 解压缩方法的文件读取速度为 ,而每 CPU 缓冲区解压缩的吞吐量为 。在最新的EROFS中,文件数据的读取速度为 ,并增加了滚动解压缩和就地解压缩。
We also evaluated the effect of scheduling optimization in § with the same configuration. In the random read workload, the average throughput of EROFS without the scheduling optimization is , while with the optimization, the performance improves to be .
我们还在具有相同配置的 § 中评估了调度优化的效果。在随机读取工作负载中,未进行调度优化的EROFS平均吞吐量为 ,而进行优化后, ,性能提高到

5.7 Real-world applications
5.7 实际应用

For real-world applications, we ran modified Android 9 Pie on both low-end smartphones and high-end smartphones, whose hardware configurations are listed in §. Android system partitions such as /system,/vendor, and/odm are compressed with EROFS, and the space savings ranges from 30% to . We tested the boot time of thirteen popular applications required by the production team. We compared application boot time on EROFS to those on Ext4 and present relative boot time in Table 3. On average, EROFS reduces the boot time by for low-end smartphones and for high-end smartphones compared with Ext4.
在实际应用中,我们在低端智能手机和高端智能手机上运行了修改后的 Android 9 Pie,其硬件配置见 § 。Android系统分区(如/system、/vendor和/odm)使用EROFS压缩,节省的空间从30%到 。我们测试了生产团队所需的 13 个常用应用程序的启动时间。我们比较了EROFS 和 Ext4 上的应用程序启动时间,并在表 3 中列出了相对启动时间。与 Ext4 相比,EROFS 的启动时间在低端智能手机上平均缩短了 ,在高端智能手机上平均缩短了
We also conducted the same test while running FIO as the background workload to simulate real-world scenarios. In the FIO workloads, four threads randomly read and write individual files with rate limited at for both reads and writes. The last two rows in Table 3 show the boot time with FIO workloads, where the reduction of boot time is and for low-end and high-end smartphones, respectively.
我们还在运行 FIO 作为后台工作负载时进行了相同的测试,以模拟真实世界的场景。在 FIO 工作负载中,四个线程随机读写单个文件,读写速度限制为 。表 3 最后两行显示了 FIO 工作负载的启动时间,其中低端和高端智能手机的启动时间分别缩短了
Fig. 12: Camera boot time
图 12:相机启动时间
Other than the boot time of various applications, we also tested the boot time distribution of the camera application on the aforementioned high-end smartphones. To simulate the situation where memory is scarce, we ran a program in the background which continuously allocates memory and fills in garbage data. We waited until the program consumed all zram in the system before starting the experiment, and kept it running during the evaluation. In the experiment, we booted several applications in turn and recorded the boot time when it's the camera's turn to boot. We collected each time of 92 camera boots for both EROFS and Ext4, and present the cumulative distribution in Fig. 12. The camera application running on EROFS boots faster than on Ext4 for more than of the boots, while the longest boot time on EROFS is worse than that on Ext4. We think the result is acceptable since EROFS saves the storage space while reduces the boot time in most of the cases.
除了各种应用程序的启动时间外,我们还测试了上述高端智能手机上相机应用程序的启动时间分布。为了模拟内存不足的情况,我们在后台运行了一个程序,该程序会不断分配内存并填入垃圾数据。我们等到程序耗尽系统中所有的 zram 后才开始实验,并在评估期间保持运行。在实验中,我们依次启动多个应用程序,并记录轮到相机启动时的启动时间。我们收集了EROFS 和 Ext4 的 92 个相机启动时间,并在图 12 中给出了累积分布。运行在 EROFS 上的相机应用程序在超过 的启动时间内比在 Ext4 上的启动时间更快,而在 EROFS 上的最长启动时间比在 Ext4 上的更短。我们认为这个结果是可以接受的,因为在大多数情况下,EROFS 既节省了存储空间,又缩短了启动时间。

6 Experience over deployment
6 部署经验

EROFS has been deployed to tens of millions of smartphones. Here, we report some experiences during the deployment.
EROFS已部署到数千万部智能手机上。在此,我们将报告部署过程中的一些经验。
Optimizing for all cases, not only common cases. Deploying a new file system to replace an existing one is much harder than we first imagine. The reason is that a commercial product like a smartphone needs to retain the benefit and features of an existing file system. Hence, we need to carefully optimize EROFS for all cases instead of common cases to avoid performance degradation even in some rare cases.
优化所有情况,而不仅仅是常见情况。部署一个新的文件系统来取代现有的文件系统,比我们最初想象的要难得多。因为像智能手机这样的商业产品需要保留现有文件系统的优势和功能。因此,我们需要针对所有情况而不是常见情况仔细优化EROFS,以避免在某些罕见情况下出现性能下降。
For example, the performance of reading some files on EROFS is slightly worse than on Ext4. To optimize, we leave files with low compression ratio uncompressed for better performance. Further, we collect access frequencies of file blocks from anonymous beta users and store them in dedicated files. According to the frequency information, we pre-decompress the most frequently requested parts of compressed files and pin them in the memory to balance the storage consumption and the performance. As a result, the performance of EROFS can be as good as, and sometimes better than Ext4, while the storage consumption is significantly reduced.
例如,在 EROFS 上读取某些文件的性能比在 Ext4 上稍差。为了优化,我们不对压缩率低的文件进行压缩,以获得更好的性能。此外,我们还收集匿名 beta 用户对文件块的访问频率,并将其存储在专用文件中。根据频率信息,我们预先解压缩压缩文件中请求频率最高的部分,并将其固定在内存中,以平衡存储消耗和性能。因此,EROFS 的性能与 Ext4 不相上下,有时甚至更好,而存储消耗却大大减少。
Incomplete implementation leads to performance abnormalities and failures in real-world scenarios. We tested EROFS after the main functionalities have been implemented. However, several kinds of malfunctions happened during real-world tests.
不完整的实施会导致实际应用中的性能异常和故障。我们在实现了EROFS的主要功能后对其进行了测试。然而,在实际测试过程中出现了几种故障。
One example is that after the smartphone runs on EROFS for days, several applications become extremely slow at times. Eventually, we found that the root cause is the missing implementation of page migration in EROFS. Page migration is invoked by the memory management subsystem to ask file systems to move their data somewhere else. Most of the time, the page migration will not be triggered and thus leaving it unimplemented is benign. However, when the memory is fragmented, which is the case when the bug happens, the functionality of page migration is crucial to the success of allocating contiguous memory in the system. The issue was solved after we implemented the page migration in EROFS. Bottlenecks shift on different platforms. We developed
其中一个例子是,智能手机在EROFS上运行几天后,几个应用程序有时会变得非常慢。最终,我们发现根本原因在于EROFS中的页面迁移功能缺失。页面迁移由内存管理子系统调用,要求文件系统将数据转移到其他地方。大多数情况下,页面迁移不会被触发,因此不实施页面迁移是无害的。然而,当内存碎片化时,也就是出现错误时,页面迁移功能对于在系统中成功分配连续内存至关重要。我们在EROFS中实现了页面迁移功能后,这个问题就迎刃而解了。瓶颈在不同平台上发生转移。我们开发了
Table 3: Relative boot time of thirteen applications on low-end and high-end smartphones. Each number in the table is the average value of at least five boots. Negative numbers show the boot time reduction (in %) compared with Ext4, while positive numbers indicate the boot time is prolonged (in %). FIO workloads are running in the background for cases with 'w/ FIO' suffix.
表 3:13 个应用程序在低端和高端智能手机上的相对启动时间。表中的每个数字都是至少五次启动的平均值。与 Ext4 相比,负数表示启动时间缩短(单位:%),正数表示启动时间延长(单位:%)。后缀为 "w/FIO "的情况下,FIO 工作负载在后台运行。
App. # 1 2 3 4 5 6 7 8 9 10 11 12 13
Low-end -3.5 +4.2 -4.0 -7.5 -1.4 -6.8 +6.3 -2.2 -3.3 -7.6 -4.5
High-end -1.8 -0.7 -2.1 -1.8 -3.7 +1.2 -8.0 -2.8 +0.7 +2.0 -2.7 +1.9
Low-end w/ FIO 带 FIO 的低端 -2.8 -5.4 -7.6 -2.6
High-end w/ FIO 带 FIO 的高端设备 -4.6 -7.0 +0.8 -5.0 -0.7
early versions of EROFS on high-end smartphones where resources are abundant, and tuned it to use as fewer resources as possible. However, when we adopt the tuned EROFS to low-end smartphones, the performance is lower than we expected. This is surprising since we have already considered the limited resources and EROFS should work well. At last, the trouble-maker turned out to be the scheduler. Scheduling on low-end smartphones is much more costly and becomes the bottleneck of EROFS's decompression, which motivated us to introduce the scheduling optimization described in §4.3. Different platforms not only reflect resource limitation directly on the amount of memory available or how fast processors can run, but they can also shift the bottleneck of software.
我们在资源丰富的高端智能手机上使用了早期版本的EROFS,并对其进行了调整,以尽可能减少资源的使用。然而,当我们在低端智能手机上采用经过调整的EROFS时,其性能却低于我们的预期。这令人惊讶,因为我们已经考虑到资源有限,EROFS 应该运行良好。最后,麻烦制造者变成了调度器。低端智能手机的调度成本更高,成为EROFS解压缩的瓶颈,这促使我们引入第4.3节所述的调度优化。不同的平台不仅直接反映了可用内存量或处理器运行速度的资源限制,也会改变软件的瓶颈。
Other compressed file systems. Several other file systems support compression. AXFS [24] is a compressed read-only file system. It is designed to enable execute-in-place (XIP), which is not supported by common smartphone storage like eMMC or UFS. CramFS [3] is another compressed readonly file system, which is designed to be simple and spaceefficient. However, it also has several limitations such as limited file size. Cramfs was once obsoleted by Squashfs in the Linux kernel [4] and then revived for XIP [5], which is not supported by common smartphone storage like eMMC or UFS. LeCramFS [27] extends CramFS for better read performance and memory efficiency on flash memory. However, the compression ratio is reduced, and LeCramFS generates much larger images.
其他压缩文件系统其他一些文件系统也支持压缩。AXFS [24] 是一种压缩只读文件系统。它旨在实现就地执行(XIP),而 eMMC 或 UFS 等常见智能手机存储设备不支持这种功能。CramFS [3] 是另一种压缩只读文件系统,设计简单且节省空间。不过,它也有一些局限性,比如文件大小有限。Cramfs 曾一度被 Linux 内核中的 Squashfs 所取代[4],后来又在 XIP [5]中复兴,但 eMMC 或 UFS 等常见智能手机存储设备并不支持 Cramfs。LeCramFS [27] 对 CramFS 进行了扩展,以提高闪存的读取性能和内存效率。不过,压缩率有所降低,而且 LeCramFS 生成的图像要大得多。
JFFS2 [15], and UBIFS [12] are two file systems designed for flash memory. Although they support compression, they have to manage the wear-leveling, address translation, and garbage collection for NAND flash. Because all such features are already provided by the eMMC and UFS firmware, EROFS is much simpler and faster than JFFS2 and UBIFS.
JFFS2 [15] 和 UBIFS [12] 是为闪存设计的两个文件系统。虽然它们支持压缩,但必须管理 NAND 闪存的损耗平衡、地址转换和垃圾回收。由于 eMMC 和 UFS 固件已经提供了所有这些功能,EROFS 比 JFFS2 和 UBIFS 简单得多,速度也快得多。
Bcachefs [1] is a file system with an emphasis on reliability and robustness. ZFS [16] is a full-fledged file system designed by Sun Microsystems for Solaris. Although they both support compression, they have to consider updates on compressed files, which has a similar issue with Btrfs.
Bcachefs [1] 是一种注重可靠性和稳健性的文件系统。ZFS [16] 是 Sun Microsystems 为 Solaris 设计的成熟文件系统。虽然它们都支持压缩,但都必须考虑压缩文件的更新,这与 Btrfs 存在类似的问题。
File system and storage for smartphones. File system and storage for smartphones have long been a hot topic. For example, Kim et al. [33] illustrated that storage can affect application performance on smartphones and proposed several approaches to mitigating the performance impact. Jeong et al. [31] uncovered and mitigated the journaling of journal (JOJ) anomaly by overhauling file systems on smartphones, which has generated several follow-up efforts [36, 38, 44]. They further identified Quasi-Asynchronous I/O in smartphone file systems and boosted them for responsiveness [28]. SmartIO [41] reduces the application delay by prioritizing reads over writes. MobiFS [43] is a memory-centric design for smartphone data storage that improves response time and energy consumption.
智能手机的文件系统和存储。智能手机的文件系统和存储一直是个热门话题。例如,Kim 等人[33] 指出,存储会影响智能手机的应用性能,并提出了几种减轻性能影响的方法。Jeong 等人[31]通过对智能手机上的文件系统进行全面检查,发现并缓解了日记日志(JOJ)异常现象,并引发了多项后续工作[36, 38, 44]。他们进一步发现了智能手机文件系统中的准异步 I/O,并提高了它们的响应速度[28]。SmartIO [41] 通过优先读取而不是写入来减少应用延迟。MobiFS [43] 是一种以内存为中心的智能手机数据存储设计,可改善响应时间和能耗。
There has also been much work providing benchmarking frameworks to evaluate file systems and storage on smartphones. Due to the emergence of nonvolatile memory (NVM), recent researchers , 47] also investigated how NVM can be used on smartphones.
也有许多工作提供了基准框架,用于评估智能手机上的文件系统和存储。由于非易失性存储器(NVM)的出现,最近的研究人员 , 47] 也研究了如何在智能手机上使用 NVM。

8 Conclusion and Future Work
8 结论与未来工作

We introduce EROFS, a new compressed read-only file system designed for smartphones with limited resources. EROFS provides a comparable compression ratio while having much higher performance and less extra memory overhead compared to Squashfs. With fixed-sized output compression and fast and memory-efficient decompression, EROFS can store system code and resources with less storage usage and sometimes even better performance compared with file systems without compression support. Evaluation shows that apps on a system installed on EROFS can boot comparably or even faster compared with on Ext4. EROFS has been merged to the mainline Linux and has been deployed and used in tens of millions of smartphones. Currently, EROFS is still under active development, and we are continuously adding new features, such as deduplication, extended file statistics, fiemap, and EROFS-fuse in future versions of EROFS.
我们介绍的EROFS是一种新的压缩只读文件系统,专为资源有限的智能手机设计。与 Squashfs 相比,EROFS 在提供可比压缩率的同时,还具有更高的性能和更少的额外内存开销。通过固定大小的输出压缩和快速高效的解压缩,EROFS 可以以更少的存储空间占用来存储系统代码和资源,有时甚至比不支持压缩的文件系统性能更好。评估显示,安装在EROFS系统上的应用程序启动速度与Ext4相当,甚至更快。EROFS已并入Linux主线,并已在数千万部智能手机中部署和使用。目前,EROFS 仍在积极开发中,我们将在EROFS 的未来版本中不断添加新功能,如重复数据删除、扩展文件统计、fiemap 和 EROFS-fuse。

Acknowledgment 鸣谢

We thank our shepherd Ric Wheeler and the anonymous reviewers for the constructive comments, Guifu Li for helping prototype the mkfs utility, and Qiuyang Sun for his help with the early testing and evaluation. Haibo Chen is the corresponding author.
我们感谢 Ric Wheeler 和匿名审稿人提出的建设性意见,感谢李贵富帮助设计 mkfs 工具的原型,感谢孙秋阳在早期测试和评估中提供的帮助。陈海波为本文通讯作者。

References 参考资料

[1] Bcachefs. https://bcachefs.org.
[2] Btrfs: Main page. https://btrfs.wiki.kernel. org/index.php/Main_Page.
[2] Btrfs:主页。https://btrfs.wiki.kernel. org/index.php/Main_Page。
[3] Cramfs - cram a filesystem onto a small ROM. https://www.kernel.org/doc/Documentation/ filesystems/cramfs.txt.
[3] Cramfs - 将文件系统压缩到小 ROM 上。https://www.kernel.org/doc/Documentation/ filesystems/cramfs.txt。
[4] Cramfs: mark as obsolete. https://lkml.org/lkml/