Heterogeneous computing architecture for fast detection of SNP-SNP interactions

Davor Sluga$^{1}$, Tomaz Curk$^{1}$, Blaz Zupan$^{1,2}$ and Uros Lotric$^{1*}$

Abstract

Background: The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested.

Results: We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Using them resulted in execution times an order of magnitude shorter than those of the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi 5110P coprocessor, but also requires considerably more programming effort.

Conclusions: General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.

Keywords: SNP-SNP interactions, Genome-wide association studies, Graphic processing unit, Many Integrated Core coprocessor, Intel Xeon Phi, CUDA

Background

We are witnessing a dramatic shift in the design of personal computer systems, where speedups are achieved by porting the parallel traits of supercomputers into the world of personal computing. Modern computers are heterogeneous platforms with many different types of computational units, including central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), coprocessors and custom acceleration logic. Today’s CPUs contain from two to twelve cores, each capable of executing multiple instructions per clock cycle. Assisting the CPU, graphics processing units usually render 3D graphics, but can also provide a general-purpose computing platform. Current GPUs are designed as massively parallel processors offering substantially more computing power than CPUs. GPUs are the most powerful computational hardware available at an affordable price [1,2]. The availability of general-purpose GPUs with computing abilities in commodity laptop and desktop computers has generated a wide interest, including applications in bioinformatics [3-9].
The newest addition to the commodity computer parallel processing hardware is the Intel Xeon Phi family of coprocessors [10] designed for computationally intensive applications. Xeon Phi implements Intel’s Many Integrated Core (MIC) architecture and offers a theoretical performance similar to that of modern GPUs, but promises easier porting of existing software to the new architecture. Tianhe-2, currently the world’s fastest supercomputer, has 48000 Xeon Phi coprocessors [11].

Many computational problems in bioinformatics require substantial computational resources [12]. Problems that can be computed with a high degree of parallel and independent processing are most suited for heterogeneous massively parallel hardware. Our aim was to investigate how these modern architectures cope with problems that are typical for bioinformatics, such as the problem of SNP-SNP interaction detection. As a proof-of-concept, we focused on a parallel implementation of the computational core of the web application SNPsyn [13] that exploits heterogeneous processing resources: multi-core CPUs, GPUs, and the new MIC coprocessors.
SNPsyn [13] (Figure 1) was developed as an interactive software tool for efficient exploration and discovery of interactions among single nucleotide polymorphisms (SNPs) in case-control genome-wide association study (GWAS) data. It uses an information-theoretic approach to evaluate SNP-SNP interactions [14]. Information gain is computed for every individual SNP, which allows the user to identify SNPs that are most associated with the disease under study. When searching for interesting pairs of SNPs, SNPsyn estimates the synergy between a pair of SNPs by computing the interaction gain. Information gain can identify SNP pairs with non-additive effects. Results are presented in an interactive graphical user interface that allows the user to select the most synergistic pairs, perform Gene Ontology enrichment analysis and visualize the synergy network among the selected SNP-SNP pairs.

Figure 1 SNPsyn graphical user interface. a) A synergy versus information gain plot is used to select SNP-SNP pairs. b) Gene Ontology enrichment analysis for genes overlapping with selected SNP-SNP pairs. c) Synergy network of selected SNPs.

SNPsyn computes the information gain exhaustively across all SNP pairs to avoid missing any pair where SNPs on their own provide no information about the phenotype under study. Because the number of pairs grows quadratically with the number of SNPs, the exhaustive search quickly becomes computationally intractable for commodity computer systems. The information-theoretic detection of SNP-SNP interactions has a high degree of data parallelism and requires much more processing power than memory storage. This makes it a perfect candidate for processing on modern massively parallel architectures.
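To make the quadratic growth concrete, the following minimal sketch counts the SNP pairs that an exhaustive search has to score. The helper name `pairCount` is hypothetical and not part of SNPsyn. It uses 64-bit arithmetic because the pair count for the largest data set considered in this study (660000 SNPs) does not fit into a 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Number of unordered SNP pairs among n SNPs: n * (n - 1) / 2.
// Hypothetical helper for illustration; not taken from SNPsyn.
std::uint64_t pairCount(std::uint64_t n) {
    // Keep n below 2^32 so that n * (n - 1) cannot overflow 64 bits.
    assert(n <= 0xFFFFFFFFull);
    // One of n and n - 1 is even, so the division is exact.
    return n * (n - 1) / 2;
}
```

For 10000 SNPs this gives 49995000 pairs, and for 660000 SNPs it gives 217799670000 pairs: a 66-fold increase in SNPs raises the number of pairs roughly 4356-fold.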

Implementation

Below we describe the SNP-SNP interaction scoring approach we use in SNPsyn and discuss its implementation on CPU, CUDA and MIC architectures. Our particular concern is to evaluate Intel’s new MIC architecture and compare its advantages against the currently prevailing CUDA architecture.

SNP-SNP interaction scoring

The SNP-SNP interaction scoring scheduler, written in Python, partitions and distributes the computational tasks to all available, user-specified resources: CPUs, GPUs, and Xeon Phi coprocessors (Figure 2). It then merges the results from individual units into a final result file. Each thread (CPU, GPU or Xeon Phi) takes one pair of SNPs and performs all the calculations needed to compute the synergy score of the pair. The synergy of a pair of SNPs $X$ and $Y$ with respect to phenotype $P$ is obtained by subtracting the information gains of the individual SNPs from the information gain of the combined pair [13]:

$$G(X, Y) = I(X, Y; P) - I(X; P) - I(Y; P). \tag{1}$$

Figure 2 SNPsyn software architecture. Computation of SNP-SNP interaction is coded in C++ for the CPU, CUDA and MIC architectures. The scheduler that invokes the three heterogeneous implementations is written in Python.

Given the two SNPs and the phenotype as random variables $X$, $Y$ and $P$, respectively, the information gains required in Equation 1 are calculated as [14]:

$$
\begin{aligned}
I(X; P) &= \sum_{x \in X,\, p \in P} q(x, p) \log_{2} \frac{q(x, p)}{q(x)\, q(p)}, \\
I(Y; P) &= \sum_{y \in Y,\, p \in P} q(y, p) \log_{2} \frac{q(y, p)}{q(y)\, q(p)}, \\
I(X, Y; P) &= \sum_{x \in X,\, y \in Y,\, p \in P} q(x, y, p) \log_{2} \frac{q(x, y, p)}{q(x, y)\, q(p)}.
\end{aligned}
$$

Computation of the marginal probabilities $q(x)$, $q(y)$, $q(p)$ and the joint probability distributions $q(x, p)$, $q(y, p)$, $q(x, y)$ and $q(x, y, p)$ requires a single scan through case and control samples. The number of joint probability distributions $q(x, y)$ and $q(x, y, p)$ that need to be determined grows quadratically with the number of SNPs. This ensures enough computational load to compensate for the memory transfer costs and makes the method well suited to an implementation on parallel hardware.
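The scoring of one SNP pair can be sketched in plain C++ as follows. This is an illustration under stated assumptions rather than SNPsyn's actual code: genotypes are assumed to be coded 0, 1, 2 and the phenotype 0 (control) or 1 (case), and the names `mutualInformation` and `synergy` are hypothetical. A single scan over the samples fills the count tables, from which the three information gains and their difference are computed.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed coding: genotypes 0, 1, 2; phenotype 0 (control) or 1 (case).
constexpr int kGenotypes = 3;
constexpr int kPhenotypes = 2;

// Mutual information I(A; P) in bits from joint counts, where
// counts[a * kPhenotypes + p] is the number of samples with A = a, P = p.
double mutualInformation(const std::vector<double>& counts, int levels) {
    std::vector<double> rowSum(levels, 0.0);
    std::array<double, kPhenotypes> colSum{};  // zero-initialized
    double total = 0.0;
    for (int a = 0; a < levels; ++a)
        for (int p = 0; p < kPhenotypes; ++p) {
            const double c = counts[a * kPhenotypes + p];
            rowSum[a] += c;
            colSum[p] += c;
            total += c;
        }
    double mi = 0.0;
    for (int a = 0; a < levels; ++a)
        for (int p = 0; p < kPhenotypes; ++p) {
            const double c = counts[a * kPhenotypes + p];
            if (c > 0.0)  // empty cells contribute nothing (0 * log 0 = 0)
                mi += (c / total) * std::log2(c * total / (rowSum[a] * colSum[p]));
        }
    return mi;
}

// Synergy I(X,Y;P) - I(X;P) - I(Y;P) of one SNP pair.
double synergy(const std::vector<int>& x, const std::vector<int>& y,
               const std::vector<int>& p) {
    assert(!p.empty() && x.size() == p.size() && y.size() == p.size());
    std::vector<double> cx(kGenotypes * kPhenotypes, 0.0);
    std::vector<double> cy(kGenotypes * kPhenotypes, 0.0);
    std::vector<double> cxy(kGenotypes * kGenotypes * kPhenotypes, 0.0);
    // Single scan through all samples fills the three count tables.
    for (std::size_t s = 0; s < p.size(); ++s) {
        cx[x[s] * kPhenotypes + p[s]] += 1.0;
        cy[y[s] * kPhenotypes + p[s]] += 1.0;
        cxy[(x[s] * kGenotypes + y[s]) * kPhenotypes + p[s]] += 1.0;
    }
    return mutualInformation(cxy, kGenotypes * kGenotypes)
         - mutualInformation(cx, kGenotypes)
         - mutualInformation(cy, kGenotypes);
}
```

For a phenotype that is the exclusive-or of two binary genotypes, neither SNP alone carries information, so the synergy equals the full 1 bit of the pair. For two identical SNPs that each determine the phenotype, the synergy is -1 bit, which indicates redundancy.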
Permutation analysis is used to evaluate the significance of results on true data. The data are randomly shuffled thirty times. Each time, information gain and synergy for all pairs are calculated to obtain the null distribution, which is used to determine the significance of results on true data. Details on permutation analysis are described in Curk et al. [13].
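A permutation analysis of this kind might be sketched as below. The text does not state how the data are shuffled or how significance is derived, so two assumptions are made here: the phenotype labels are permuted, and significance is an empirical p-value with the common add-one correction. The names `nullDistribution` and `empiricalPValue` and the callback `scoreAllPairs` are hypothetical. See Curk et al. [13] for the procedure actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Collects scores obtained on permuted phenotype labels (assumption:
// shuffling the labels is how the data are randomized).
// scoreAllPairs is a caller-supplied function returning one score per pair.
std::vector<double> nullDistribution(
    std::vector<int> phenotype,
    const std::function<std::vector<double>(const std::vector<int>&)>& scoreAllPairs,
    int permutations, unsigned seed) {
    assert(permutations > 0);
    std::mt19937 rng(seed);
    std::vector<double> null;
    for (int r = 0; r < permutations; ++r) {
        std::shuffle(phenotype.begin(), phenotype.end(), rng);
        const std::vector<double> scores = scoreAllPairs(phenotype);
        null.insert(null.end(), scores.begin(), scores.end());
    }
    return null;
}

// Empirical p-value: share of null scores at least as large as the
// observed one, with an add-one correction (an assumption of this sketch).
double empiricalPValue(const std::vector<double>& null, double observed) {
    const auto atLeast = std::count_if(
        null.begin(), null.end(),
        [observed](double v) { return v >= observed; });
    return (static_cast<double>(atLeast) + 1.0) /
           (static_cast<double>(null.size()) + 1.0);
}
```

With thirty permutations, as in the text, `nullDistribution` would be called with `permutations = 30` and returns thirty times as many null scores as there are pairs.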

Parallel implementations of interaction scoring

Calculations are performed in parallel for as many pairs of SNPs as allowed by the hardware. We took special care to efficiently use the GPU and Xeon Phi hardware. We minimized memory transfers between the main CPU and the coprocessors to avoid bottlenecks and vectorized the code wherever possible. We optimized the number of threads running on the GPU to maximize throughput. To cope with the memory limitation of the GPU, SNPsyn includes optional heuristics to quickly estimate the importance of SNPs and reduce the data set prior to analysis. In the following sections we present the implementation details for both architectures.
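One way to share the pairs among heterogeneous devices is to split the linear range of pair indices into contiguous chunks proportional to each device's relative throughput. The sketch below illustrates that idea only. The partitioning strategy of the actual Python scheduler is not described in the text, and the names `Range` and `partitionPairs` as well as the weight values are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Half-open range [begin, end) of linear pair indices.
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

// Splits [0, totalPairs) into one contiguous chunk per device, with
// chunk sizes proportional to the given non-negative weights.
std::vector<Range> partitionPairs(std::uint64_t totalPairs,
                                  const std::vector<double>& weights) {
    assert(!weights.empty());
    const double weightSum = std::accumulate(weights.begin(), weights.end(), 0.0);
    assert(weightSum > 0.0);
    std::vector<Range> ranges;
    std::uint64_t begin = 0;
    double cumulative = 0.0;
    for (std::size_t d = 0; d < weights.size(); ++d) {
        cumulative += weights[d];
        // The last device always takes the remainder, so nothing is lost
        // to rounding.
        std::uint64_t end = (d + 1 == weights.size())
            ? totalPairs
            : static_cast<std::uint64_t>(totalPairs * (cumulative / weightSum));
        end = std::min(std::max(end, begin), totalPairs);
        ranges.push_back({begin, end});
        begin = end;
    }
    return ranges;
}
```

For example, `partitionPairs(100, {1.0, 1.0, 2.0})` yields the ranges [0, 25), [25, 50) and [50, 100). In practice the weights could be taken from measured speedups of the individual devices.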

GPU and CUDA

GPUs gain their computational power from the numerous processing cores packed into one chip. For example, the modern Nvidia Tesla K20 GPU has 13 streaming multiprocessors, each containing 192 computational units called CUDA cores. These cores lack sophisticated control units and thus work best when executing the same instruction on many data elements in parallel, with no divergent program paths in the algorithm. A programmer sees the GPU as a parallel coprocessor and can use it to speed up computationally intensive parts of the algorithm. Of course, there must be enough data parallelism in the code to make it worthwhile.

Different tools are available for programming GPUs. Nvidia offers the CUDA toolkit [15] for programming its own products. It includes a proprietary compiler and a set of libraries that extend the C++ syntax with parallel programming constructs. Another popular option is the OpenCL framework [16]. It supports hardware from different vendors but usually lags slightly in terms of performance when compared to specialized development kits such as CUDA.

Regardless of the development tool used, the programmer must follow certain rules to obtain maximum performance [17]. The most important one is to partition the algorithm into blocks small enough to simultaneously start a sufficient number of threads to utilize all available resources. For example, consider the code snippet in Figure 3, a simplified version of the code that scores pairs of SNPs. Function computeIGain calculates the synergy of a SNP pair using Equation 1. The details of the calculation are omitted to emphasize the architecture-specific parts of the code. The snippet includes all the peculiarities of programming for GPUs. The program has to implement the GPU-specific part separately from the CPU code and explicitly transfer data from the host to the GPU. Special functions called kernels (marked with the keyword __global__) must be written to be executed on the GPU. Memory transfer and allocation functions must be called to supply the necessary data to the GPU and collect the results afterwards. Usually, the programmer performs measurements to determine which thread configuration, including the number of threads to launch, is most suitable for a particular problem size.

Xeon Phi and MIC

Intel designed the Xeon Phi family of coprocessors around the new MIC architecture [18] to compete with GPUs specialized in general-purpose computing. The design follows a different approach in comparison to GPUs. The coprocessors consist of many simple but fully functional processor cores derived from the Intel Pentium architecture. Intel improved the original design by adding a 512-bit wide vector unit and Hyper-Threading Technology. This enables Xeon Phi to achieve theoretical performance similar to that of modern GPUs. The model 5110P, which we used in this study, includes sixty cores interconnected with a bidirectional ring bus. Each core is capable of running four threads in parallel. The cores fetch data from the 8 GB of on-board RAM and communicate with the host CPU through the PCIe bus. In contrast to GPUs, each core on a Xeon Phi can efficiently execute the code even if threads do not follow the same program path. This makes it suitable for a wider range of problems, including multiplications of sparse matrices [19], and operations on trees and graphs [20].

__device__
float computeMISingle(char *dataGPU, int N, int S,
                      int x)
{
    // compute average mutual information between
    // the phenotype and SNP x from data array
    // and return the result
}

__device__
float computeMIPair(char *dataGPU, int N, int S,
                    int x, int y)
{
    // compute average mutual information between
    // the phenotype and SNP pair (x,y) from data
    // array and return the result
}

__global__
void computeIGain(char *dataGPU, float *resultGPU,
                  int N, int S)
{
    int Np = N*(N-1)/2;
    // linear index of this thread, one thread per SNP pair
    int i = (blockIdx.y*blockDim.y+threadIdx.y)*
                (blockDim.x*gridDim.x) +
                blockIdx.x*blockDim.x+threadIdx.x;
    if (i < Np)
    {
        // map the linear index i to the SNP pair (x,y), x < y;
        // isqrt is an integer square root helper (not shown)
        int t = (isqrt((8*Np-8*i-8)+1) - 1)/2;
        int x = (N-1)-1-t;
        int y = i-Np+(((t+1)*(t+2))/2)+x+1;
        float MIxy = computeMIPair(dataGPU,N,S,x,y);
        float MIx = computeMISingle(dataGPU,N,S,x);
        float MIy = computeMISingle(dataGPU,N,S,y);
        resultGPU[i] = MIxy-MIx-MIy;
    }
}

// char *data - points to the data set with all
// phenotypes and genotypes
// int N - number of SNPs
// int S - number of samples
...
int Np = N*(N-1)/2;
int blocks = (Np+threads-1)/threads;
float *result = (float*)malloc(Np*sizeof(float));
cudaMalloc((void**)&dataGPU,N*S*sizeof(char));
cudaMalloc((void**)&resultGPU,Np*sizeof(float));
cudaMemcpy(dataGPU,data,N*S*sizeof(char),
    cudaMemcpyHostToDevice);
computeIGain<<<blocks,threads>>>(dataGPU,
                        resultGPU,N,S);
cudaMemcpy(result,resultGPU,Np*sizeof(float),
        cudaMemcpyDeviceToHost);
cudaFree(dataGPU);
cudaFree(resultGPU);

Figure 3 CUDA code snippet. Variables threads and blocks store the thread configuration. Function cudaMemcpy feeds the data into the GPU and retrieves the results afterwards. Each of the preconfigured GPU threads independently executes the computeIGain function and scores the associated SNP pair.
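The kernel in Figure 3 derives a SNP pair from a linear thread index using integer arithmetic. A minimal sketch of one such mapping in plain C++ follows, so that it can be checked on the host. The pairs (x, y) with x < y are assumed to be enumerated row by row, and the helper names `isqrt` and `pairFromIndex` are illustrative rather than taken from the SNPsyn sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Integer square root: the largest r with r * r <= v.
std::int64_t isqrt(std::int64_t v) {
    assert(v >= 0);
    std::int64_t r = static_cast<std::int64_t>(std::sqrt(static_cast<double>(v)));
    while (r * r > v) --r;               // correct floating-point overshoot
    while ((r + 1) * (r + 1) <= v) ++r;  // correct floating-point undershoot
    return r;
}

// Maps a linear index i in [0, n(n-1)/2) to the pair (x, y), 0 <= x < y < n,
// with pairs ordered (0,1), (0,2), ..., (0,n-1), (1,2), ..., (n-2,n-1).
std::pair<std::int64_t, std::int64_t> pairFromIndex(std::int64_t i, std::int64_t n) {
    const std::int64_t np = n * (n - 1) / 2;
    assert(i >= 0 && i < np);
    const std::int64_t k = np - 1 - i;                   // pairs that follow i
    const std::int64_t t = (isqrt(8 * k + 1) - 1) / 2;   // largest t with t(t+1)/2 <= k
    const std::int64_t x = n - 2 - t;
    const std::int64_t y = i - np + (t + 1) * (t + 2) / 2 + x + 1;
    return {x, y};
}
```

For n = 4 the indices 0 to 5 map to (0,1), (0,2), (0,3), (1,2), (1,3) and (2,3). Because every thread can compute its pair from its own index, no lookup table of pairs has to be transferred to the device.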
Intel provides a C++ compiler suite and all the tools needed to exploit the hardware [21]. The code can be parallelized using OpenMP directives or the MPI library and compiled for the MIC architecture. The resulting applications can then run only on the Xeon Phi coprocessors. Another, more general way to specify parallel execution is to use offload constructs along with OpenMP to mark the data and the code to be transferred and executed on the Xeon Phi. All other parts of the program run normally on the host CPU. A third possibility is to use the OpenCL framework in the same manner as with GPUs.
MIC development tools facilitate data management through compiler directives. The example in Figure 4 demonstrates this programming paradigm. It performs the same operation as the snippet from Figure 3. The programmer marks the data and the code that are needed on the coprocessor. All memory allocations and transfers are done implicitly. To obtain the best performance, the programmer must tailor the algorithms to fully utilize the vector unit. The Intel compiler automatically vectorizes sections of code where possible.
If a computer lacks Xeon Phi, the MIC code can be executed by the main CPU, which is not the case with the CUDA-specific implementation. The MIC code is also much cleaner and easier to handle than the CUDA code. The current drawbacks of using Xeon Phi are the small number of supported Linux distributions (officially only RedHat and SuSE) and the pricey development environment for the Windows operating system. The main aspects of each architecture that are relevant to the developer are shown in Table 1.
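The host-side form of such a loop can be sketched with standard OpenMP alone, without any offload directive. The function `scorePair` is a hypothetical placeholder for the synergy computation, and `scoreAllPairs` is likewise an illustrative name. When the compiler is invoked with OpenMP support (for example `-fopenmp` with GCC or Clang), the iterations are distributed among the CPU threads. Without it, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical placeholder for the per-pair synergy computation.
float scorePair(std::size_t i) {
    return static_cast<float>(i % 7) * 0.5f;
}

// Scores all pairs on the host CPU.
std::vector<float> scoreAllPairs(std::size_t pairCount) {
    std::vector<float> result(pairCount);
    const long long n = static_cast<long long>(pairCount);
    // Each iteration writes to its own element, so no synchronization
    // between threads is needed.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        result[static_cast<std::size_t>(i)] = scorePair(static_cast<std::size_t>(i));
    }
    return result;
}
```

Keeping the loop body free of shared mutable state is what allows the same source to serve both the serial and the parallel build.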

Results

We benchmarked SNPsyn on a workstation with two six-core Intel Xeon E5-2620 2.00 GHz CPUs capable of running up to twenty-four threads in parallel, 64 GB of RAM, two Nvidia Tesla K20 general-purpose computing cards with 5 GB of RAM each and one Intel Xeon Phi 5110P coprocessor with 8 GB of RAM. The operating system was CentOS 6.4.
__declspec(target(mic))
float computeMISingle(char *data, int N, int S,
                      int x)
{
    // compute average mutual information between
    // the phenotype and SNP x from data array
    // and return the result
}

__declspec(target(mic))
float computeMIPair(char *data, int N, int S,
                    int x, int y)
{
    // compute average mutual information between
    // the phenotype and SNP pair (x,y) from data
    // array and return the result
}

// char *data - points to the data set with all
// phenotypes and genotypes
// int N - number of SNPs
// int S - number of samples
int Np = N*(N-1)/2;
float *result = (float*)malloc(Np*sizeof(float));
#pragma offload target(mic) \
            in(data:length(N*S)) \
            out(result:length(Np))
#pragma omp parallel for
for (int i = 0; i < Np; ++i)
{
    // map the linear index i to the SNP pair (x,y), x < y;
    // isqrt is an integer square root helper (not shown)
    int t = (isqrt((8*Np-8*i-8)+1)-1)/2;
    int x = (N-1)-1-t;
    int y = i-Np+(((t+1)*(t+2))/2)+x+1;
    float MIxy = computeMIPair(data,N,S,x,y);
    float MIx = computeMISingle(data,N,S,x);
    float MIy = computeMISingle(data,N,S,y);
    result[i] = MIxy-MIx-MIy;
}

Figure 4 MIC code snippet. The first pragma directive marks the start of a MIC code section. Keywords in and out indicate the data to be transferred to and from the Xeon Phi. The OpenMP directive omp parallel for distributes the loop iterations among all available threads, which execute the body of the loop and score the SNP pairs.

We evaluated the performance on a series of representative GWAS data sets constructed from the Infinium_20060727fs1_gt_MS_GCf data set found in the WTCCC study [22]. Our goal was to observe the effect of the number of SNPs and study subjects on the execution time on different configurations. We sampled with replacement the original data on 994 subjects and 15436 SNPs to obtain data sets with the desired number of subjects and SNPs. We performed the analysis on data with 1000, 6000, and 20000 subjects and 10000, 100000, and 660000 SNPs. The study considered only the data sets that could fit into the GPU memory. Xeon Phi clearly has an advantage over K20 regarding the amount of RAM (8 GB versus 5 GB). We tested six hardware configurations: one CPU core running a single thread, twelve CPU cores running twelve threads, twelve CPU cores running twenty-four threads, one GPU, both GPUs, and Xeon Phi.
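As a rough check of which data sets fit into coprocessor memory, assume one byte per genotype, as the char arrays in Figures 3 and 4 suggest, and ignore the phenotype vector and the result buffer. With $N$ SNPs and $S$ subjects the genotype data then occupy about $N \cdot S$ bytes:

$$
\begin{aligned}
660000 \times 1000 &= 6.6 \times 10^{8}\ \text{bytes} \approx 0.66\ \text{GB}, \\
660000 \times 6000 &= 3.96 \times 10^{9}\ \text{bytes} \approx 3.96\ \text{GB}, \\
660000 \times 20000 &= 1.32 \times 10^{10}\ \text{bytes} \approx 13.2\ \text{GB}.
\end{aligned}
$$

Under these assumptions the first two combinations fit into the 5 GB of a K20, while the third exceeds both the 5 GB of the K20 and the 8 GB of the Xeon Phi. The storage format actually used by SNPsyn may differ from this estimate.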
Figure 5 reports the execution times of the exhaustive SNP-SNP interaction analysis and the speedups achieved using various hardware configurations. For easier comparison, execution times are plotted on a logarithmic scale. As expected, execution times increase proportionally with the number of subjects and quadratically with the number of SNPs included in the analysis.
The single-thread CPU configuration takes more than 30 days to analyze the data on 660000 SNPs and 1000 subjects. Running twelve threads in parallel, one on each of the CPU cores, speeds up the computation by a factor of 10 and reduces the execution time to approximately 3 days. Increasing the number of threads to twenty-four reduces the analysis time to around 2 days, with the speedup peaking at 12.8 compared to the single-thread configuration. The memory bottleneck is the main reason for the poor speedup, which is far below the theoretical value of 24. Interestingly, similar speedups are achieved on all (smaller) data sets, meaning that there is enough data parallelism to keep the CPU busy.

Table 1 Comparison of parallel computer architecture platforms with key aspects from the viewpoint of software development

	x86/x64 single CPU	Nvidia GPU	Intel Xeon Phi
Tools	Arbitrary compiler	CUDA Toolkit or OpenCL framework	Intel compiler suite
OS support	Many	Windows, Linux, Mac OSX	Linux (RedHat and SuSE), Windows
Required programming skills	Low	High	Medium
Lines of code*	260	460	360
Programming remarks	None	Architecture specific optimizations	Recommended optimizations using
Platform maturity	Mature	Extensive documentation, many programming examples	Bugs in drivers, documentation needs

*Lines of code reports the approximate length of the code that implements the computationally intensive tasks of SNPsyn.

Nvidia K20 considerably reduces execution times: the analysis of the largest data set takes only around 17 hours, a speedup of 42 in comparison to a single CPU thread. Sharing the work between both GPU cards doubles the speedup and reduces the execution time to 8 hours. Increasing the number of subjects leads to a noticeable decrease in speedup, as more data is transferred between the main memory and the GPU. On the other hand, increasing the number of SNPs introduces more data parallelism into the computations, which is reflected in an improved speedup.

Figure 5 Execution times and speedups achieved on various computing resources. Shown are execution times on each hardware configuration for different problem sizes (a) and speedups in comparison to a single CPU thread execution (b).

Table 2 Technical specification of hardware platforms

	Intel Xeon E5-2620	Nvidia Tesla K20	Intel Xeon Phi 5110P
Number of transistors	2.3 billion	7.1 billion	5 billion
Peak power consumption	95 W	225 W	225 W
Single precision floating point performance	96 GFLOPS	3.5 TFLOPS	2.0 TFLOPS
Main memory	64 GB (can be expanded)	5 GB	8 GB

Xeon Phi is positioned between K20 and the CPU-only implementation. It achieves a speedup of nearly 20 on the largest data set, so the analysis runs for a day and a half, which is double the time needed on a K20. The speedup behaves similarly for Xeon Phi as for K20: it increases with the number of SNPs and decreases with the number of subjects. This confirms that the drop is caused by transferring larger amounts of data without introducing additional parallelism.
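The reported run times on the largest data set are consistent with the reported speedups if the single-thread run is taken as roughly 30 days (720 hours). Since the baseline is given only as "more than 30 days", the values below are approximations:

$$
\begin{aligned}
\text{12 CPU threads:} \quad & 720 / 10 = 72\ \text{h} = 3\ \text{days}, \\
\text{24 CPU threads:} \quad & 720 / 12.8 \approx 56\ \text{h} \approx 2.3\ \text{days}, \\
\text{one K20:} \quad & 720 / 42 \approx 17\ \text{h}, \\
\text{two K20s:} \quad & 720 / 84 \approx 8.6\ \text{h}, \\
\text{Xeon Phi:} \quad & 720 / 20 = 36\ \text{h} = 1.5\ \text{days}.
\end{aligned}
$$

Relative to the theoretical value of 24, the 24-thread CPU run reaches a parallel efficiency of $12.8 / 24 \approx 0.53$.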
Using only CPUs to analyze the data is unfeasible except for small data sets, since the computations can take days to complete even on multiple cores. Xeon Phi provides a considerable performance boost, with a maximum speedup of nearly 20, and has plenty of on-board memory to store the data. Nvidia K20 clearly outperforms every other configuration in terms of speed and is the best choice when the goal is to cut execution times as much as possible. This comes at the price of cumbersome programming and less on-board memory, which limits the size of the data.
Technical specifications presented in Table 2 show similar trends: Nvidia K20 offers the highest theoretical performance in terms of TFLOPS and has the most complex design. Xeon Phi has considerably less computing power, but interestingly draws the same amount of power as K20 at maximum load. The Xeon E5-2620 CPU is the least efficient of all and lacks the performance to remain competitive at computationally intensive tasks.
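The efficiency ranking can be checked by dividing the peak single precision performance by the peak power consumption, using only the values from Table 2. These are theoretical peaks, not measurements of SNPsyn:

$$
\begin{aligned}
\text{Xeon E5-2620:} \quad & 96\ \text{GFLOPS} / 95\ \text{W} \approx 1.0\ \text{GFLOPS/W}, \\
\text{Tesla K20:} \quad & 3500\ \text{GFLOPS} / 225\ \text{W} \approx 15.6\ \text{GFLOPS/W}, \\
\text{Xeon Phi 5110P:} \quad & 2000\ \text{GFLOPS} / 225\ \text{W} \approx 8.9\ \text{GFLOPS/W}.
\end{aligned}
$$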

Conclusion

We investigated how modern heterogeneous architectures cope with a selected computational problem typical for bioinformatics. The proof-of-concept implementation of SNPsyn on heterogeneous systems greatly reduces the (wall-clock) time needed for analysis of large GWAS data sets. GPUs proved to be a mature platform that offers a large amount of computing power to address inherently parallel problems, but is demanding for the programmer. A user who is only interested in using SNPsyn to analyze their data will profit the most by having multiple GPUs in their system. The new MIC architecture greatly eases programming but lags in performance. Its ease of programming combined with good performance has a lot to offer to developers who don’t want to spend too much time optimizing their algorithms. Nevertheless, MIC is a general platform capable of tackling a wider range of more complex problems. This makes it a promising platform for more complex analyses of SNP-SNP interactions, such as adjustment for covariates [23].

Availability and requirements

Project name: SNPsyn
Project home page: http://snpsyn.biolab.si
Operating systems: Linux, Windows, Mac OS
Programming language: C++
Other requirements: CUDA 2.0 or higher, Intel Composer XE 2013 or newer, make
License: GNU GPLv3
Restrictions to use by non-academics: none

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

UL, DS, TC, and BZ designed the study. DS implemented the CUDA and MIC software and measured the performance of the software. TC implemented the CPU and Python part of the software. DS wrote the first draft of the manuscript. All authors have written, read and approved the final manuscript.

Acknowledgements

BZ and TC were supported by the Slovenian Research Agency (ARRS, P2-0209); UL and DS were supported by the Slovenian Research Agency (ARRS, P2-0241).

Author details

$^{1}$ Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, SI 1000 Ljubljana, SI, Slovenia.
$^{2}$ Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, TX 77030 Houston, USA.

Received: 14 November 2013 Accepted: 19 June 2014
Published: 25 June 2014

References

1. Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC: GPU computing. In Proceedings of the IEEE. Volume 96. New York, USA: IEEE; 2008:879-899.
2. Nickolls J, Dally WJ: The GPU computing era. IEEE Micro 2010, 30(2):56-69.
3. Greene CS, Sinnott-Armstrong NA, Himmelstein DS, Park PJ, Moore JH, Harris BT: Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS. Bioinformatics 2010, 26(5):694-695.
4. Liu Y, Schmidt B, Maskell D: CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes 2010, 3(1):93-104.
5. Zhou Y, Liepe J, Sheng X, Stumpf MPH: GPU accelerated biochemical network simulation. Bioinformatics 2011, 27(6):874-876.
6. Ueki M, Tamiya G: Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis. BMC Bioinformatics 2012, 13(1):72.
7. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309-1310.
8. Kam-Thong T, Czamara D, Tsuda K, Borgwardt K, Lewis CM, Erhardt-Lehmann A, Hemmer B, Rieckmann P, Daake M, Weber F, Wolf C, Ziegler A, Pütz B, Holsboer F, Schölkopf B, Müller-Myhsok B: EPIBLASTER-fast exhaustive two-locus epistasis detection strategy using graphical processing units. Eur J Hum Genet 2011, 19(4):465-471.
9. Kam-Thong T, Azencott C-A, Cayton L, Pütz B, Altmann A, Karbalai N, Sämann PG, Schölkopf B, Müller-Myhsok B, Borgwardt KM: GLIDE: GPU-based linear regression for detection of epistasis. Hum Hered 2012, 73(4):220-236.
10. Chrysos G, Engineer SP: Intel® Xeon Phi coprocessor (codename Knights Corner). In Proceedings of the 24th Hot Chips Symposium, HC. Stanford, USA: Stanford University; 2012.
11. Courtland R: Intel strikes back [news]. IEEE Spectrum 2013, 50(8):14.
12. Payne JL, Sinnott-Armstrong NA, Moore JH: Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci Comput Life Sci 2010, 2(3):213-220.
13. Curk T, Rot G, Zupan B: SNPsyn: detection and exploration of SNP-SNP interactions. Nucleic Acids Res 2011, 39(2):444-449.
14. Anastassiou D: Computational analysis of the synergy among multiple interacting genes. Mol Syst Biol 2007, 3(83):1-8.
15. Cohen J, Garland M: Solving computational problems with GPU computing. Comput Sci Eng 2009, 11(5):58-63.
16. Stone JE, Gohara D, Shi G: OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 2010, 12(3):66.
17. Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro 2008, 28(2):39-55.
18. Saule E, Kaya K, Çatalyürek ÜV: Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. In Parallel Processing and Applied Mathematics. Berlin, Germany; 2014:559-570.
19. Liu X, Smelyanskiy M, Chow E, Dubey P: Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS'13. New York: ACM; 2013:273-282.
20. Gao T, Lu Y, Zhang B, Suo G: Using the Intel Many Integrated Core to accelerate graph traversal. Int J High Perform Comput Appl 2014. doi:10.1177/1094342014524240.
21. Cramer T, Schmidl D, Klemm M, an Mey D: OpenMP programming on Intel® Xeon Phi™ coprocessors: an early performance comparison. In Proceedings of the Many-core Applications Research Community (MARC) Symp. at RWTH Aachen University. Aachen, Germany: RWTH Aachen University; 2012:38-44.
22. Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145):661-678.
23. Zhu Z, Tong X, Zhu Z, Liang M, Cui W, Su K, Li MD, Zhu J: Development of GMDR-GPU for gene-gene interaction analysis and its application to WTCCC GWAS data for type 2 diabetes. PLoS One 2013, 8(4):e61943.
doi:10.1186/1471-2105-15-216
Cite this article as: Sluga et al.: Heterogeneous computing architecture for fast detection of SNP-SNP interactions. BMC Bioinformatics 2014 15:216.

*Correspondence: uros.lotric@fri.uni-lj.si