Accelerating genomic workflows using NVIDIA Parabricks
Kyle A. O'Connell¹, Zelaikha B. Yosufzai¹, Ross A. Campbell¹, Collin J. Lobb¹, Haley T. Engelken¹, Laura M. Gorrell¹, Thad B. Carlson², Josh J. Catana², Dina Mikdadi¹, Vivien R. Bonazzi¹* and Juergen A. Klenk¹*
*Correspondence: vbonazzi@deloitte.com; jklenk@deloitte.com
¹ Health Data and AI, Deloitte Consulting LLP, Arlington, VA 22009, USA
² Cloud Managed Services, Deloitte Consulting LLP, Detroit, MI 48226, USA
Abstract
Background: As genome sequencing becomes better integrated into scientific research, government policy, and personalized medicine, the primary challenge for researchers is shifting from generating raw data to analyzing these vast datasets. Although much work has been done to reduce compute times using various configurations of traditional CPU computing infrastructures, Graphics Processing Units (GPUs) offer opportunities to accelerate genomic workflows by orders of magnitude. Here we benchmark one GPU-accelerated software suite called NVIDIA Parabricks on Amazon Web Services (AWS), Google Cloud Platform (GCP), and an NVIDIA DGX cluster. We benchmarked six variant calling pipelines, including two germline callers (HaplotypeCaller and DeepVariant) and four somatic callers (Mutect2, Muse, LoFreq, SomaticSniper).

Results: We achieved up to 65× acceleration with germline variant callers, bringing HaplotypeCaller runtimes down from 36 h to 33 min on AWS, 35 min on GCP, and 24 min on the NVIDIA DGX. Somatic callers exhibited more variation across the number of GPUs and computing platforms. On cloud platforms, GPU-accelerated germline callers resulted in cost savings compared with CPU runs, whereas some somatic callers were more expensive than CPU runs because their GPU acceleration was not sufficient to overcome the increased GPU cost.

Conclusions: Germline variant callers scaled well with the number of GPUs across platforms, whereas for somatic variant callers the number of GPUs yielding the fastest runtimes varied, suggesting that, at least with the version of Parabricks used here, these workflows are less GPU optimized and require benchmarking on the platform of choice before being deployed at production scales. Our study demonstrates that GPUs can be used to greatly accelerate genomic workflows, bringing urgent societal advances in the areas of biosurveillance and personalized medicine closer within reach.
Keywords: GPU acceleration, NVIDIA Parabricks, Cloud computing, Amazon Web Services, Google Cloud Platform
Background
As the cost of genome sequencing continues to decrease, genomic datasets grow in both size and availability [1]. These processes will greatly enhance aims such as whole genome biosurveillance and personalized medicine [2, 3]. However, one challenge to attaining these goals is the computational burden of analyzing large amounts of genomic sequence data [4]. Two trends (among others) are helping to ameliorate this burden. The first is the migration to the cloud for data analysis and storage, and the second is the use of Graphics Processing Units (GPUs) to accelerate data processing and analysis [5, 6]. We discuss each of these trends in this article.
Cloud computing addresses many of the challenges associated with large whole genome sequencing projects, which can suffer from siloed data, long download times, and slow workflow runtimes [7]. Several papers have reviewed the potential of cloud platforms for sequence data storage, sharing, and analysis [1, 5, 8-12], so here we focus on one cloud computing challenge: how to select the right compute configuration to optimize cost and performance [13, 14].
GPU acceleration in either a cloud or High Performance Computing (HPC) environment makes rapid genomic analysis possible at previously unattainable scales. While these are still early days for GPU acceleration in the 'omics fields, several studies have begun benchmarking various algorithmic and hardware configurations to find the 'Goldilocks zone' between cost and performance. Two recent studies [6, 15] benchmarked GATK HaplotypeCaller using the original CPU algorithm and the GPU-accelerated version from NVIDIA Clara™ Parabricks (hereafter Parabricks) on HPC platforms and found notable acceleration (8× and 21× speedups, respectively) when using GPUs. They also inferred high concordance of SNP calls (99.5%) between the CPU and GPU algorithms, suggesting little to no loss of accuracy with the GPU-configured algorithms, for both germline and somatic variant callers [16], a finding also corroborated by [17]. Likewise, [18] introduced a new GPU-accelerated pipeline called BaseNumber, which achieved runtimes slightly faster than previous benchmarks using Parabricks.
While the aforementioned studies conducted benchmarking using on-premises computing clusters, only a few studies have begun benchmarking GPU-accelerated algorithms in the cloud. The Parabricks team at NVIDIA benchmarked GATK HaplotypeCaller using Parabricks on Amazon Web Services (AWS) and achieved runtimes as low as 28 min for a 30× genome with eight A100 NVIDIA GPUs [17]. NVIDIA compared an m5-family virtual machine (32 CPUs, 128 GB memory; Intel Skylake 8175M or Cascade Lake 8259CL) with several GPU configurations, including the g4dn.12xlarge (four T4 GPUs, 48 2.5 GHz Cascade Lake 24C processors), the g4dn.metal (eight T4 GPUs, 96 2.5 GHz Cascade Lake 24C processors), the p3dn.24xlarge (8 V100 GPUs, 96 Intel Skylake 8175 CPU processors), and the p4d.24xlarge (8 A100 GPUs, 96 Intel Cascade Lake P-8275CL processors), with the largest acceleration observed with the p4 machine family (with NVIDIA A100s). NVIDIA also benchmarked four somatic callers, and achieved speedups ranging from 4× to 42× with a 50× human genome. In this somatic variant calling study, they compared an m5 machine to the g4dn.12xlarge (with four T4 GPUs), though they did not benchmark the newer compute-optimized A100 and V100 GPU machines [16]. Relatedly, [13] benchmarked GWAS workflows using Spark clusters (not NVIDIA Parabricks) on Google Cloud Platform (GCP; using standard n2 machines)
and AWS (machines not specified) and found comparable performance between cloud platforms. While several of these studies have shed light on the performance of GATK HaplotypeCaller using Parabricks, fewer studies have compared CPU and GPU performance across a range of germline and somatic variant callers, or compared performance across AWS, GCP, and an NVIDIA DGX cluster. Benchmarking a range of algorithms on several platforms and hardware configurations is important to inform future decisions around algorithmic, hardware, and platform selection.
Here, we benchmark two germline variant callers and four somatic variant callers, comparing traditional x86 CPU algorithms with GPU-accelerated algorithms implemented with NVIDIA Parabricks on AWS and GCP, and benchmark GPU-accelerated algorithms on an NVIDIA DGX cluster. For GPU-accelerated algorithms, we compare 2, 4, and 8 GPU configurations. For germline callers, we observed speedups of up to 65× (GATK HaplotypeCaller) and found that performance scaled linearly with the number of GPUs. We also found that because GPUs run so quickly, researchers can save money by using them for germline variant calling. Somatic variant callers, by contrast, achieved speedups of up to 56.8× for the Mutect2 algorithm but, surprisingly, did not scale linearly with the number of GPUs in some contexts, emphasizing the need for algorithmic benchmarking before embarking on large-scale projects, where sub-optimal configurations can substantially increase costs.
Results
CPU baseline across cloud platforms
CPU machine performance varied considerably between the c6i machine on AWS and the n2 machine on GCP for most analyses. For germline analyses, GCP was faster for DeepVariant (18.8 h) compared with AWS (22 h), whereas AWS was faster for HaplotypeCaller (36.2 h) compared with GCP (38.8 h; Table 1, Fig. 1). Somatic runtimes favored AWS machines, except for Mutect2, where the n2 machine on GCP ran in 8.1 h compared with 16.9 h on AWS (Table 1, Fig. 1).
GPU performance across cloud platforms
For germline callers, 8-GPU runtimes were below 43 min for HaplotypeCaller and DeepVariant across both cloud platforms. On AWS, we observed faster runtimes for the A100 compared with the V100 GPU machines (p4 vs p3 machine families), but the differences with 8 GPUs, where the numbers of CPUs were equal, were small for most workflows. Further, comparisons between the 2 and 4 A100 GPU machines on GCP/AWS were not precise because we were unable to limit the number of CPUs available for all AWS workflows due to constraints on available machine configurations (2 and 4 GPU machines were not available). As such, execution time differences between the two cloud platforms were biased towards AWS for some algorithms (DeepVariant and LoFreq with 2 GPUs) that were able to take advantage of the additional CPUs and memory of the larger GPU machine (see "Materials and methods" section). Although the two germline workflows scaled linearly with the number of GPUs (Fig. 2), somatic callers ran faster with 4 than with 8 GPUs for Muse on AWS (but not GCP), and for Mutect2 and SomaticSniper on both platforms (Fig. 2; Additional file 1: Figure S1).
Table 1 Results of benchmarking for AWS, GCP and NVIDIA DGX workflow runs
AWS results presented here are for the p3 family with the NVIDIA Tesla V100 GPU; results for the p4 family with the A100 GPU are shown in Additional file 1: Table S1
Fig. 1 Comparison of execution times of variant calling algorithms on CPU and GPU environments between AWS and GCP. A 32 vCPU machine with the latest processors was used for CPU benchmarking on both cloud platforms. Here we show results for varying numbers of NVIDIA Tesla V100 GPUs running the Parabricks bioinformatics suite for AWS, and NVIDIA Tesla A100 GPUs for GCP
Fig. 2 GPU benchmarking results for NVIDIA Tesla GPUs. On GCP and the DGX, results are shown for A100 GPUs, whereas AWS results are shown for the V100 GPU runs
Fig. 3 Comparison of AWS (V100 GPU machine) versus GCP GPU cost savings per variant caller. Percentage of total cost savings shows higher cost savings using GPUs in algorithms optimized for GPU acceleration, but losses when algorithms are not well optimized
Compared with the CPU baselines, GPU runs on AWS (p4 machines with A100 GPUs) accelerated HaplotypeCaller up to 65.1×, DeepVariant up to 30.7×, Mutect2 up to 56.8×, SomaticSniper up to 7.7×, Muse up to 18.9×, and LoFreq up to 3.7× (Table 1). On GCP, GPUs accelerated HaplotypeCaller up to 65.8×, DeepVariant up to 26.5×, Mutect2 up to 29.3×, SomaticSniper up to 7.0×, Muse up to 21.8×, and LoFreq up to 4.5×.
Although GPU machines are much more expensive on a per-hour basis than CPU machines, the accelerated runtimes resulted in cost savings for most algorithms (Fig. 3). Leveraging GPUs on AWS with the p3 machine (with V100 GPUs) resulted in cost savings of up to 63% for HaplotypeCaller with 8 GPUs and up to 21% for DeepVariant with 8 GPUs (Additional file 1: Table S1). Using the p4 machine with the A100 GPU resulted in savings of 63% for HaplotypeCaller with 4 GPUs, 34% for DeepVariant with 4 GPUs, and 53% for Mutect2 with 4 GPUs (Table 1).
On GCP, GPU runs resulted in cost savings of up to 80% for HaplotypeCaller with 2 GPUs, 44% for DeepVariant with 4 GPUs, 72% for Mutect2 with 4 GPUs, 26% for SomaticSniper with 2 GPUs, and up to 70.1% for Muse with 2 GPUs. However, on both platforms, algorithms that were not well optimized cost much more to run with GPUs than with CPUs, because the difference in runtimes was not sufficient to offset the extra GPU cost (Fig. 3; Additional file 1: Figure S4). For example, CPU runs of LoFreq cost less than $9/sample on both platforms, but as much as $30/sample with GPUs (Additional file 1: Fig. S2). Likewise, CPU runs of SomaticSniper cost less than $14.50/sample on both platforms, but as much as $75/sample on AWS with 8 GPUs.
For well-optimized algorithms, the number of GPUs yielding the fastest runtime varied between variant callers (ranging from 2 to 8); consequently, cost savings reflect a balance between the speed and cost of a particular machine type that is not consistent across algorithms or cloud providers. For example, A100 GPU runs were expensive on AWS because the p4d.24xlarge machine type's on-demand price is $32.8/h, whereas the V100 machine types range from $12.24/h for a 4 GPU machine to $24.5/h for an 8 GPU machine. On GCP, the a2-highgpu machine types range from $7.4/h (2 GPUs) to $29.4/h (8 GPUs). CPU runs, by contrast, were slightly cheaper on AWS, with an on-demand price of $1.36/h compared with $1.75/h on GCP. Interestingly, because the somatic callers did not scale with additional GPUs, the greatest acceleration (and thus the greatest cost savings) was observed with 2 GPUs. Adding additional GPUs to the somatic runs resulted in minor improvements in runtimes (if any), but substantial increases in cost per hour. Prices are given for the northern Virginia region, calculated (at the time of writing) using the pricing calculators of the respective cloud service providers. As time goes on, these machine types will likely become less expensive.
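The interplay between hourly price and runtime is simple to work through. The following Python sketch reproduces the back-of-the-envelope arithmetic behind Fig. 3, using the HaplotypeCaller runtimes and northern Virginia on-demand prices quoted above; the numbers are illustrative and will drift as prices change.

```python
# Per-sample cost = runtime (hours) x on-demand price (dollars/hour).
# Illustrative figures from the text: HaplotypeCaller at ~36.2 h on the AWS
# c6i CPU machine ($1.36/h) versus ~33 min (0.55 h) on the 8-GPU p4d.24xlarge
# ($32.8/h). Prices are northern Virginia on-demand rates at time of writing.

def run_cost(runtime_hours: float, price_per_hour: float) -> float:
    """Cost of a single workflow run billed at an hourly on-demand rate."""
    return runtime_hours * price_per_hour

cpu_cost = run_cost(36.2, 1.36)    # ~ $49.23 per sample
gpu_cost = run_cost(0.55, 32.8)    # ~ $18.04 per sample
savings = 1 - gpu_cost / cpu_cost  # ~ 63%: cheaper despite the pricier machine
print(f"CPU: ${cpu_cost:.2f}  GPU: ${gpu_cost:.2f}  savings: {savings:.0%}")
```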
GPU performance on the DGX
Germline workflows ran considerably faster on the DGX than on the cloud platforms, with HaplotypeCaller finishing in 24.4 min and DeepVariant finishing in 27.1 min with 8 GPUs (Fig. 2; Additional file 1: Figure S1). In most cases, somatic variant callers were not faster than on the cloud platforms, and in one case (SomaticSniper) ran slower than on the cloud (Fig. 2; Additional file 1: Figure S1). Interestingly, the pattern we observed in the cloud, where the 4 GPU runtimes were the fastest for Muse and SomaticSniper, did not manifest on the DGX, where the 8 GPU runs were the fastest for all algorithms except Mutect2 (Fig. 2; Additional file 1: Figure S1). For Mutect2, the 4 GPU run was still the fastest on the DGX, but the 8 GPU run was faster on the DGX than on both AWS and GCP (Additional file 1: Fig. S1).
We also tested the effect of CPU count on the performance of GPU runs. On AWS and GCP the GPU machine types are preconfigured (and thus unalterable) with 12 CPUs per GPU, but on the DGX we were able to modify the number of CPUs for each run. We found that adding CPUs does decrease runtimes (increase performance), but the reduction in runtimes plateaued after 48 CPUs (Additional file 1: Fig. S5).
Discussion
The acceleration provided by GPU-accelerated algorithms confers several advantages on researchers. First, GPU acceleration enables researchers to rapidly run multiple algorithms for the cost of a single CPU run [19]. Different variant callers exhibit biases leading to slightly different variant calls [3]. Combining calls across algorithms can improve accuracy, albeit with a slightly higher type I error. Future studies could help better understand this trade-off by comparing false positive and false negative rates for different strategies of combining calls across algorithms, such as majority rule versus consensus site calls. Another advantage of GPU-accelerated genomic workflows is that they allow researchers to process more samples on a fixed budget. Academic research programs are often constrained by limited funding; the use of GPU acceleration may allow researchers to reduce compute costs (and labor overhead) and thus process more samples for the same amount of money. Finally, GPU-accelerated algorithms enable near-real-time decision making. Pathogen biosurveillance benefits from rapid data processing to identify novel pathogens and could help policymakers act more quickly during an outbreak [20]. Likewise, faster clinical test processing could lead to more timely patient-care decisions in clinical settings.
Cloud platform considerations
CPU-only runs
As more research programs migrate to cloud platforms, researchers will need to decide which platform provides the most advantages for both performance and cost. CPU runs were faster on the AWS c6i.8xlarge machine than on the GCP n2-32 for four algorithms, while DeepVariant and Mutect2 ran faster on GCP (Fig. 1). While the AWS machines use 3rd generation Intel Ice Lake processors, the GCP n2 machines default to 2nd generation Cascade Lake processors, although Ice Lake is available in some regions/zones. This difference in processor generation most likely explains the differences in runtime we observed between cloud platforms, unless unaccounted-for factors are also influencing the observed variation. Past work within our research group showed that the reduced runtimes driven by using the latest CPU processors outweigh the increased per-second cost (TC unpublished), suggesting that researchers should also aim to use the latest processors for CPU platforms.
Another consideration that researchers should be aware of in the near term is that AWS is migrating to newer ARM-based machine types, rather than x86 architectures. We had trouble installing existing software on the ARM-based machines, and thus used the c6i.8xlarge machine, which retains the x86 architecture. This could present challenges for researchers on AWS in the future as the platform migrates more machine types to ARM-based architectures, necessitating the rewriting and/or recompiling of common software. On GCP, we chose the N2 machine family as a balance between performance and cost. GCP does offer the compute-optimized C2 machine family, which may run faster than the N2 machines (it also uses Cascade Lake processors), but we did not benchmark those machines here. Further, future work could quantify the CPU scaling plateau of each variant caller to help identify the ideal CPU machine type, particularly for designing cloud-based computing clusters [13].
GPU considerations on the cloud
For germline workflows, AWS and GCP performed very similarly for both speed and cost when using 8 A100 GPUs, although the 2 and 4 GPU runs exhibited more variation (Figs. 2, 3). To quantify the balance between cost and performance on each cloud platform, we calculated a cost ratio metric by dividing the cost of a GPU workflow by its fold speedup relative to the CPU run for that workflow. A lower cost ratio thus indicates a better value for a given GPU configuration (Table 1; Fig. 4). For the germline variant callers, the best cost ratio on both platforms used 8 GPUs, and the ratios for AWS and GCP were similar enough that we feel they should not impact the choice between cloud providers. For somatic variant workflows, the best cost ratio was usually at 2-4 GPUs, as these workflows were less optimized (substantially more expensive relative to speed gains) when using 8 GPUs on the cloud. Further, because LoFreq and SomaticSniper were less accelerated, their high cost ratios suggest that it is not worth the extra cost to run these workflows on GPUs with the version of Parabricks we tested. One caveat to these findings is that we used synthetic somatic data (though based on sites from a real patient), and some of our findings could be artifacts of our somatic variant sampling design. Future work could repeat similar analyses using a variety of somatic variant samples and test whether different variant numbers or allele frequency variation impact algorithmic performance on GPU platforms. The newest version of Parabricks may also address some of these biases.
Fig. 4 Comparison of AWS V100 versus GCP A100 GPU cost ratio per variant caller. Cost ratio is the ratio between cost per hour and fold speedup. Cost per fold speedup shows the benefit of harnessing GPUs over CPUs for select algorithms, while other algorithms are more cost-efficient on CPUs when using the version of Parabricks that we benchmarked
Further, we observed faster runtimes with 4 GPUs than with 8 GPUs for Mutect2 (on all platforms) and SomaticSniper (on GCP); in fact, for Mutect2, using 8 GPUs was barely faster than using only 2 GPUs. We attempt to explain these results by hypothesizing that either (1) the algorithm is hard-coded to use at most 4 GPUs, or (2) the MPI implementation becomes overloaded as GPUs are added. We struggled to compare our results with those of [16] because the NVIDIA study only presented results for the T4 GPU machine with 4 GPUs. Further, they benchmarked on 50× whole genome samples compared with our 30× data, making it difficult to directly compare runtimes. Nonetheless, future releases of Parabricks may resolve the issue with 4 versus 8 GPUs, but more work is needed to understand the underlying causes of these patterns.
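For reference, the cost ratio metric defined above reduces to a few lines of arithmetic; a minimal Python sketch with hypothetical inputs follows.

```python
# Cost ratio = cost of the GPU run / fold speedup over the CPU baseline;
# lower is better value. Inputs below are hypothetical.

def cost_ratio(gpu_runtime_h: float, gpu_price_per_h: float,
               cpu_runtime_h: float) -> float:
    speedup = cpu_runtime_h / gpu_runtime_h     # x-fold acceleration
    gpu_cost = gpu_runtime_h * gpu_price_per_h  # dollars for the GPU run
    return gpu_cost / speedup

# A 36 h CPU workflow finishing in 0.55 h on a $32.8/h GPU node:
print(f"{cost_ratio(0.55, 32.8, 36.0):.2f}")  # ~0.28
```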
GPU-accelerated bioinformatic workflows are still relatively new to the cloud, and as such, not all tools are readily available everywhere. For example, while we were conducting our analyses, Parabricks did not offer a Marketplace solution for GCP, although one has since been released. Likewise, the Marketplace solution on AWS offered a user-friendly way to access the Parabricks software suite without purchasing an annual license, but this machine image did not support the p4 machine family with the A100 GPUs. Although we were able to install Parabricks on the A100 machine on AWS, this machine type was not readily available (at the time of writing) in most regions, and it was difficult to procure for our benchmarking. Using spot instances might have been a better solution for these hard-to-procure machine types. Since we conducted our study, NVIDIA has made Parabricks free to download and has made it available on several platforms, including Terra and Amazon Omics. Finally, we observed some decreases in runtime between the A100 and V100 GPU machines on AWS (Fig. 5). However, the differences were relatively minor when using 8 GPUs: less than a minute for DeepVariant and 8 min for HaplotypeCaller. The 8 GPU p3 machine also uses newer Intel CPU processors, which may explain some of this difference. Future work could investigate the relative impact of the GPUs versus CPUs when running GPU-accelerated algorithms to better inform machine selection.
Fig. 5 Comparison of runtimes between V100 and A100 GPU machines on AWS
Nonetheless, because the A100 machine type is difficult to obtain and was not available with the Marketplace machine image, we recommend using the V100 GPU machine, which carries no significant cost-to-performance penalty (Table 1; Additional file 1: Table S1; Fig. S3).
On-premises computing clusters
For a myriad of reasons, some bioinformatic analyses will not migrate to the cloud and will thus require on-premises infrastructure. Although not every institution will have a DGX cluster with A100 GPUs available, we show here that Parabricks runs well in an on-premises environment. For those looking to achieve the fastest possible runtimes in a production environment, the DGX ran considerably faster than AWS or GCP for germline callers, reducing runtimes for HaplotypeCaller by 8 min and DeepVariant by 15 min, differences that could be significant at large enough scales. We attribute these differences to the network communication between GPUs and CPUs, which is better optimized on the DGX than on cloud-based instances, where GPUs may not be located in such close proximity.
Conclusions
We found that germline variant callers were well optimized with Parabricks and that GPU-accelerated workflows can yield substantial savings in both time and cost. Somatic callers, by contrast, were accelerated but exhibited substantial variation across algorithms, number of GPUs, and computing platform, suggesting that benchmarking algorithms with a reduced dataset is important before scaling up to an entire study or running at production scale. Though these are early days for GPU-accelerated bioinformatic pipelines, ever faster computing processors bring us closer to important societal aims such as tracking pathogens in near real time to monitor emerging pandemics, or reaching milestones in the field of personalized medicine.
Materials and methods
Sampling and algorithms
We benchmarked six variant callers for CPU and GPU speed and cost. Herein, we define algorithms as well optimized for GPUs if they resulted in both time and cost savings when run with GPUs compared with CPU-only runs. We conducted all benchmarking on the individual 'HG002' from the Genome in a Bottle Consortium [21, 22], hosted by the National Institute of Standards and Technology and made available as part of the Precision FDA Truth Challenge V2 (https://precision.fda.gov/challenges/10). We down-sampled the fastq files to 30× coverage using Samtools v1.9 [23]. We used Grch38, downloaded from the GATK Reference Bundle, as our reference genome. Our germline variant calling pipeline evaluated two germline variant callers: HaplotypeCaller v4.2.0.0 [24, 25] and DeepVariant v1.1.0 [24]. GPU benchmarking used Parabricks v3.7.0-1. For germline callers we used the 'Germline Pipeline' for GATK HaplotypeCaller, and for DeepVariant we used the 'DeepVariant Germline Pipeline'. Each of these pipelines takes fastq files as input and outputs unfiltered variant call format (VCF) files. CPU benchmarking was conducted by writing custom workflows using Snakemake v6.6.1 [26], following best practices for each tool and exactly matching the workflows used by Parabricks (Data and Materials). In short, our HaplotypeCaller pipeline mapped reads to the reference using bwa mem v0.7.15 [27] with the 'threads' flag set to $CPUs, sorted using Samtools [23], marked duplicates and performed base quality score recalibration using GATK v4.2.0.0, and then called variants with HaplotypeCaller (--native-pair-hmm-threads = $CPUs). Likewise, our DeepVariant pipeline mapped reads to the reference with bwa mem, sorted with Samtools, marked duplicates with GATK, then ran DeepVariant as a shell script (--num_shards = $CPUs).
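A condensed Snakemake-style sketch of this CPU pipeline is shown below; rule names and file paths are illustrative, the MarkDuplicates/BQSR steps are omitted for brevity, and the complete workflows are in our GitHub repository.

```python
# Condensed sketch of the CPU HaplotypeCaller workflow (illustrative paths;
# MarkDuplicates and BQSR rules omitted). CPUS stands in for the $CPUs value.
CPUS = 32

rule map_and_sort:
    input:
        ref="ref/Grch38.fasta",
        r1="fastq/HG002_R1.fastq.gz",
        r2="fastq/HG002_R2.fastq.gz",
    output:
        "bam/HG002.sorted.bam",
    threads: CPUS
    shell:
        "bwa mem -t {threads} {input.ref} {input.r1} {input.r2} | "
        "samtools sort -@ {threads} -o {output} -"

rule haplotypecaller:
    input:
        ref="ref/Grch38.fasta",
        bam="bam/HG002.recal.bam",  # post MarkDuplicates/BQSR
    output:
        "vcf/HG002.vcf.gz",
    threads: CPUS
    shell:
        "gatk HaplotypeCaller -R {input.ref} -I {input.bam} -O {output} "
        "--native-pair-hmm-threads {threads}"
```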
Our somatic variant calling pipeline evaluated four somatic variant callers: Mutect2 [25], SomaticSniper [28], Muse [29], and LoFreq [30]. While our full workflows are described in detail in our GitHub repository, we outline the general steps for each algorithm here. Mutect2 v4.2.0.0 was run with a single command (with --native-pair-hmm-threads = $CPUs). SomaticSniper v1.0.5.0 was run with the 'bam-somaticsniper' command with single threading, followed by filtering with the Perl scripts distributed with the main program. LoFreq v2.1 was run with a single command with threads = $CPUs, and finally Muse v2.0 was run in two steps, 'call' and 'sump', with single threading.
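As an example of these single-command runs, the Mutect2 step can be sketched as the following Snakemake rule (illustrative paths; tumor-only mode shown for brevity; the exact commands are in our repository).

```python
# Sketch of the single-command Mutect2 CPU run (illustrative paths;
# tumor-only mode shown for brevity).
rule mutect2:
    input:
        ref="ref/Grch38.fasta",
        bam="bam/HG002.somatosim.bam",  # synthetic tumor BAM from SomatoSim
    output:
        "vcf/HG002.mutect2.vcf.gz",
    threads: 32
    shell:
        "gatk Mutect2 -R {input.ref} -I {input.bam} "
        "--native-pair-hmm-threads {threads} -O {output}"
```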
We generated synthetic somatic tumor data using SomatoSim v1.0.0 [31]. We added 198 single nucleotide polymorphisms (SNPs) at random variant allele frequencies ranging from 0.001 to 0.4 (randomly generated using custom Python scripts). Sites were selected from the ICGC Data Portal ovarian cancer patient DO32536 (https://dcc.icgc.org/donors/DO32536?mutations=%7B%22size%22:50,%22from%22:151%7D). We used the BAM file from the HaplotypeCaller pipeline (i.e., MarkDuplicates, BaseRecalibration, and ApplyBQSR were run prior to the mutation process) as the input for SomatoSim. For somatic variant callers, we used the Parabricks variant caller scripts ('mutectcaller', 'somaticsniper_workflow', 'muse', 'lofreq'), which take BAM files as input and output VCF files. Each Parabricks tool was compared to a compatible CPU command as listed in the Parabricks 3.7 documentation. We used Snakemake scripts as described for the germline callers. For benchmarking of MuSE, we used version v2.0 and set the number of threads to 1 to replicate MuSE v1.0's lack of parallel computing, because of version conflicts with MuSE v1 in our compute environment. We created a conda environment before running each workflow because we found that using the '--use-conda' flag in Snakemake dramatically increased runtimes. After initial algorithmic exploration we recorded the time of our final workflow run; we observed very minor variation in runtimes for serially run GPU workflows. Complete workflows are described, along with all scripts necessary to repeat our analyses, at https://github.com/kyleoconnell/gpu-acclerated-genomics.
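The random VAF assignment mentioned above is simple to reproduce; the sketch below draws uniform VAFs in [0.001, 0.4] for a set of placeholder sites. The BED-style layout with a target-VAF column is an assumption about SomatoSim's input format; the scripts actually used are in our repository.

```python
# Draw random variant allele frequencies (VAFs) in [0.001, 0.4] for selected
# sites. The BED-plus-VAF column layout is an assumption about SomatoSim's
# input format; placeholder coordinates stand in for the ICGC DO32536 sites.
import random

random.seed(42)  # make the draws reproducible

sites = [("chr1", 1_000_000, 1_000_001),
         ("chr2", 2_500_000, 2_500_001)]  # (chrom, 0-based start, end)

with open("somatosim_input.bed", "w") as out:
    for chrom, start, end in sites:
        vaf = round(random.uniform(0.001, 0.4), 4)
        out.write(f"{chrom}\t{start}\t{end}\t{vaf}\n")
```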
GCP configuration
Benchmarking on GCP leveraged virtual machines that were launched programmatically for CPU machines, or manually for GPU machines. On GCP, a vCPU is implemented as a single hardware hyper-thread; by default, GCP physical hardware cores use simultaneous multithreading such that two vCPUs are assigned to each core. Our CPU workflows used the 'n2-standard-32' machine type with Intel Xeon Cascade Lake processors, 32 vCPUs, and 128 GB of memory. We assigned 1 TB of block storage to our instance. We launched these machines using a startup script that installed the conda environment, then ran the Snakemake workflows. All data was already loaded on a machine image, and runtimes were concatenated from each Snakemake rule using a custom script available in our GitHub repository. We also benchmarked the older generation E2 family of processors but found the runtimes to be much slower, and thus only present the results for N2 processors here.
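As an illustrative stand-in for that custom script, total workflow runtime can be recovered by summing the per-rule wall-clock times; the sketch below assumes each Snakemake rule writes a benchmark TSV with a seconds column named 's', as Snakemake's benchmark directive produces.

```python
# Sum per-rule wall-clock times from Snakemake benchmark TSVs (one per rule).
# Assumes the 's' (seconds) column produced by Snakemake's benchmark directive.
import csv
import glob

total_s = 0.0
for path in glob.glob("benchmarks/*.tsv"):
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            total_s += float(row["s"])

print(f"total workflow runtime: {total_s / 3600:.2f} h")
```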
GPU benchmarking on GCP used the accelerator-optimized a2-highgpu machine types with two A100 GPUs, 24 vCPUs (Intel Xeon Cascade Lake processors), and 170 GB RAM; four A100 GPUs with 48 vCPUs and 340 GB RAM; and eight A100 GPUs with 96 vCPUs and 680 GB RAM. One virtual machine was utilized with 4 TB of block storage, which we stopped and resized between runs.
AWS configuration
Benchmarking on AWS also used multiple virtual machines for CPU and GPU benchmarking. Similar to GCP, AWS assigns two vCPUs to each physical core to enable multithreading. CPU benchmarking used the c6i.8xlarge machine type, which has a 3rd generation Intel Xeon Scalable processor (Ice Lake 8375C) with 32 vCPUs and 64 GiB RAM. We assigned 800 GB of EBS storage to our instance. We did some preliminary testing with the new ARM-based processors (C7g family) but had issues installing several of the dependencies (particularly with mamba/conda), suggesting that a migration to ARM-based processors may prove problematic for bioinformatics in the cloud.
We benchmarked two GPU machine families. First, we benchmarked the p4 machine family, which is similar to the GCP a2-highgpu machines, utilizing eight of the latest NVIDIA A100 Tensor Core GPUs with 96 vCPUs (Intel Xeon Cascade Lake P-8275CL) and 1152 GiB RAM. AWS currently has only one machine type with A100 GPUs, the p4d.24xlarge, which only runs with 8 GPUs. To ensure consistency with GCP, we ran the 8 GPU machine but specified the number of GPUs to use in our Parabricks commands for the smaller GPU-count runs. As this machine type was not compatible with the Marketplace image (see below), we installed Parabricks manually using scripts provided by NVIDIA. When possible (i.e., when a CPU-limiting flag was available) we limited the number of CPUs used on the p4 machine, but most analyses did not allow us to control the number of CPUs. For example, we ran HaplotypeCaller with 2 GPUs but 96 CPUs, compared with GCP, where the machine had 2 GPUs and 24 CPUs.
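For example, restricting a run to two of the machine's eight GPUs looked roughly like the following Snakemake-style sketch; file paths are illustrative, and the --num-gpus flag should be checked against the documentation of the Parabricks version in use.

```python
# Sketch of pinning a Parabricks run to 2 of the p4d.24xlarge's 8 GPUs via
# --num-gpus (per Parabricks 3.7 documentation; paths are illustrative).
rule pb_germline_2gpu:
    input:
        ref="ref/Grch38.fasta",
        r1="fastq/HG002_R1.fastq.gz",
        r2="fastq/HG002_R2.fastq.gz",
    output:
        "vcf/HG002.pb.vcf",
    shell:
        "pbrun germline --ref {input.ref} "
        "--in-fq {input.r1} {input.r2} "
        "--out-bam bam/HG002.pb.bam "
        "--out-variants {output} "
        "--num-gpus 2"
```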
To compare GPU and CPU configurations directly with GCP, we further benchmarked the p3 machine family using the 'NVIDIA Clara Parabricks Pipelines' AWS Marketplace image. At the time of writing, the image supported V100 GPUs (but not A100 GPUs), an older model of Tensor Core GPU, on the p3.8xlarge machine type with 4 GPUs and 32 Intel Broadwell E5-2686 v4 CPUs. We also benchmarked on the p3dn.24xlarge with 8 GPUs and 96 Intel Skylake 8175 CPUs. The Marketplace image had Parabricks preinstalled at a cost of $0.30/h (NVIDIA has since made Parabricks free). This configuration allowed us to directly compare 4 and 8 GPU machines with equal CPU numbers between AWS and GCP.
DGX configuration
We also conducted GPU benchmarking on an NVIDIA DGX cluster (DGX SuperPOD), a computing cluster with six DGX A100 systems, each of which contains eight NVIDIA A100 GPUs and 64-core AMD Rome CPUs with 1 TB RAM. Although the cluster provides a total of 48 A100 GPUs, Parabricks is only able to run on a single DGX A100 system, limiting any Parabricks analysis to 8 GPUs. Jobs were launched using a Kubernetes-based scheduler, allocating a maximum memory of 300 GB and matching the GPU and CPU configurations of the GCP/AWS runs, except for GATK HaplotypeCaller. For this workflow, we benchmarked times for 8 GPUs using 24, 48, 96, and 124 CPUs to test the effect of the number of CPUs on execution time. For all other algorithms, we ran at least three iterations of each run to ensure consistency of results, and present the time of the final run.
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s12859-023-05292-2.
Additional file 1. Additional results of benchmarking on AWS. Table S1 shows NVIDIA A100 GPU machine benchmarking results, and figures show benchmarking of the NVIDIA V100 GPU machine.
Acknowledgements
We thank G. Barnett and J. Fenwick for help troubleshooting the Parabricks installation and analyses, and the Deloitte Center for AI Computing for help getting onboarded to the DGX cluster.
Author contributions
KAO, CJL, TBC, DM, VRB, and JAK conceived the study. KAO, ZBY, RAC, and CJL designed the study. KAO, ZBY, RAC, and CJL ran cloud-based analyses. KAO and JJC ran DGX analyses. KAO, ZBY, and HTE wrote the manuscript, and all authors read and approved the text.
Funding
Deloitte Consulting LLP funded all aspects of this work.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the GitHub repository accessible at https://github.com/kyleoconnell/gpu-acclerated-genomics.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Deloitte Consulting LLP is an alliance partner with NVIDIA, Amazon Web Services, and Google.
Received: 20 July 2022 Accepted: 15 April 2023
Published online: 31 May 2023
References
1. Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet. 2018;19(4):208-19.
2. Nwadiugwu MC, Monteiro N. Applied genomics for identification of virulent biothreats and for disease outbreak surveillance. Postgrad Med J; 2022.
3. Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep. 2020;10(1):1-12.
4. Liu B, et al. Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses. J Biomed Inform. 2014;49:119-33.
5. Cole BS, Moore JH. Eleven quick tips for architecting biomedical informatics workflows with cloud computing. PLoS Comput Biol. 2018;14(3):e1005994.
6. Franke KR, Crowgey EL. Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms. Genom Inform. 2020;18(1):e10.
7. Tanjo T, Kawai Y, Tokunaga K, Ogasawara O, Nagasaki M. Practical guide for managing large-scale human genome data in research. J Hum Genet. 2021;66(1):39-52.
8. Augustyn DR, Wyciślik Ł, Mrozek D. Perspectives of using Cloud computing in integrative analysis of multi-omics data. Brief Funct Genom. 2021;20(4):198-206.
9. Grossman RL. Data lakes, clouds, and commons: a review of platforms for analyzing and sharing genomic data. Trends Genet. 2019;35(3):223-34.
10. Grzesik P, Augustyn DR, Wyciślik Ł, Mrozek D. Serverless computing in omics data analysis and integration. Brief Bioinform. 2022;23(1):bbab349.
11. Koppad S, Gkoutos GV, Acharjee A. Cloud computing enabled big multi-omics data analytics. Bioinform Biol Insights. 2021;15:11779322211035920.
12. Leonard C, et al. Running genomic analyses in the cloud. Stud Health Technol Inf. 2019;266:149-55.
13. Krissaane I, et al. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services. J Am Med Inform Assoc. 2020;27(9):1425-30.
14. Ray U, et al. Hummingbird: efficient performance prediction for executing genomics applications in the cloud. In: Presented at the computational approaches for cancer workshop; 2018.
15. Rosati S. Comparison of CPU and Parabricks GPU enabled bioinformatics software for high throughput clinical genomic applications; 2020.
16. Benchmarking NVIDIA Clara Parabricks somatic variant calling pipeline on AWS. AWS HPC Blog. https://aws.amazon.com/blogs/hpc/benchmarking-nvidia-clara-parabricks-somatic-variant-calling-pipeline-on-aws/. Accessed 28 July 2022.
17. Benchmarking the NVIDIA Clara Parabricks germline pipeline on AWS. AWS HPC Blog. https://aws.amazon.com/blogs/hpc/benchmarking-the-nvidia-clara-parabricks-germline-pipeline-on-aws/. Accessed 28 July 2022.
18. Zhang Q, Liu H, Bu F. High performance of a GPU-accelerated variant calling tool in genome data analysis. bioRxiv; 2021.
19. Crowgey EL, et al. Enhanced processing of genomic sequencing data for pediatric cancers: GPUs and machine learning techniques for variant detection. Cancer Res. 2021;81(13_Supplement):165.
20. Gardy JL, Loman NJ. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet. 2018;19(1):9-20.
21. Krusche P, et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019;37(5):555-60.
22. Zook JM, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3(1):1-26.
23. Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078-9.
24. Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983-7.
25. Van der Auwera GA, O'Connor BD. Genomics in the cloud: using Docker, GATK, and WDL in Terra. O'Reilly Media; 2020.
26. Mölder F, et al. Sustainable data analysis with Snakemake. F1000Research. 2021;10.
27. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997; 2013.
28. Larson DE, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28(3):311-7.
29. Fan Y, et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 2016;17(1):1-11.
30. Wilm A, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40(22):11189-201.
31. Hawari MA, Hong CS, Biesecker LG. SomatoSim: precision simulation of somatic single nucleotide variants. BMC Bioinform. 2021;22(1):1-13.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.