Background: The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within modern Graphics Processing Units (GPUs) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested.
Results: We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Both shortened execution times by an order of magnitude compared to the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi 5110P coprocessor, but also requires considerably more programming effort.
Conclusions: General-purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. The new MIC architecture, on the other hand, lags in raw performance but reduces the programming effort and compensates with a more general architecture suitable for a wider range of problems.
Keywords: SNP-SNP interactions, Genome-wide association studies, Graphic processing unit, Many Integrated Core coprocessor, Intel Xeon Phi, CUDA
Background
We are witnessing a dramatic shift in the design of personal computer systems, where speedups are achieved by porting the parallel traits of supercomputers into the world of personal computing. Modern computers are heterogeneous platforms with many different types of computational units, including central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), coprocessors and custom acceleration logic. Today's CPUs contain from two to twelve cores, each capable of executing multiple instructions per clock cycle. Assisting the CPU, graphics processing units usually render 3D graphics, but can also provide a general-purpose computing platform. Current GPUs are designed as massively parallel processors offering substantially more computing power than CPUs, and are the most powerful computational hardware available at an affordable price [1,2]. The availability of general-purpose computing on GPUs in commodity laptop and desktop computers has generated wide interest, including applications in bioinformatics [3-9].
The newest addition to commodity parallel processing hardware is the Intel Xeon Phi family of coprocessors [10], designed for computationally intensive applications. Xeon Phi implements Intel's Many Integrated Core (MIC) architecture, offers a theoretical performance similar to that of modern GPUs, and promises easier porting of existing software to the new architecture. Tianhe-2, currently the world's fastest supercomputer, contains 48,000 Xeon Phi coprocessors [11].
Many computational problems in bioinformatics require substantial computational resources [12]. Problems that can be computed with a high degree of parallel and independent processing are best suited for heterogeneous massively parallel hardware. Our aim was to investigate how these modern architectures cope with problems typical for bioinformatics, such as SNP-SNP interaction detection. As a proof of concept, we focused on a parallel implementation of the computational core of the web application SNPsyn [13] that exploits heterogeneous processing resources: multi-core CPUs, GPUs, and the new MIC coprocessors.
SNPsyn [13] (Figure 1) was developed as an interactive software tool for efficient exploration and discovery of interactions among single nucleotide polymorphisms (SNPs) in case-control genome-wide association study (GWAS) data. It uses an information-theoretic approach to evaluate SNP-SNP interactions [14]. Information gain is computed for every individual SNP, which allows the user to identify SNPs that are most associated with the disease under study. When searching for interesting pairs of SNPs, SNPsyn estimates the synergy between a pair of SNPs by computing the interaction gain, which can identify SNP pairs with non-additive effects. Results are presented in an interactive graphical user interface that allows the user to select the most synergistic pairs, perform Gene Ontology enrichment analysis and visualize the synergy network among the selected SNP-SNP pairs.
SNPsyn computes the information gain exhaustively across all SNP pairs to avoid missing any pair where SNPs on their own provide no information about the phenotype under study. Because the number of pairs
Figure 1 SNPsyn graphical user interface. a) A synergy versus information gain plot is used to select SNP-SNP pairs. b) Gene Ontology enrichment analysis for genes overlapping with selected SNP-SNP pairs. c) Synergy network of selected SNPs.
is quadratic in the number of SNPs, the exhaustive search quickly becomes computationally intractable for commodity computer systems. The information-theoretic detection of SNP-SNP interactions has a high degree of data parallelism and requires much more processing power than memory storage. This makes it a perfect candidate for processing on modern massively parallel architectures.
Implementation
Below we describe the SNP-SNP interaction scoring approach we use in SNPsyn and discuss its implementation on CPU, CUDA and MIC architectures. Our particular concern is to evaluate Intel's new MIC architecture and compare its advantages against the currently prevailing CUDA architecture.
SNP-SNP interaction scoring
The SNP-SNP interaction scoring scheduler, written in Python, partitions and distributes the computational tasks to all available, user-specified resources: CPUs, GPUs, and Xeon Phi coprocessors (Figure 2). It then merges the results from the individual units into a final result file. Each thread (CPU, GPU or Xeon Phi) takes one pair of SNPs and performs all the calculations needed to compute the synergy score of the pair. The synergy of a pair of SNPs $X$ and $Y$ with respect to phenotype $P$ is obtained by
Figure 2 SNPsyn software architecture. Computation of SNP-SNP interactions is coded in C++ for the CPU, CUDA and MIC architectures. The scheduler that invokes the three heterogeneous implementations is written in Python.
subtracting the information gains of individual SNPs from the information gain of the combined pair [13]:
$$G(X, Y) = I(X, Y; P) - I(X; P) - I(Y; P). \qquad (1)$$
Given the two SNPs and the phenotype as random variables $X$, $Y$ and $P$, respectively, the information gains required in Equation 1 are calculated as [14]:
$$
\begin{aligned}
I(X ; P) & = \sum_{x \in X,\, p \in P} q(x, p) \log_{2} \frac{q(x, p)}{q(x)\, q(p)}, \\
I(Y ; P) & = \sum_{y \in Y,\, p \in P} q(y, p) \log_{2} \frac{q(y, p)}{q(y)\, q(p)}, \\
I(X, Y ; P) & = \sum_{x \in X,\, y \in Y,\, p \in P} q(x, y, p) \log_{2} \frac{q(x, y, p)}{q(x, y)\, q(p)}.
\end{aligned}
$$
Computation of the marginal probabilities $q(x)$, $q(y)$, $q(p)$ and the joint probability distributions $q(x, p)$, $q(y, p)$, $q(x, y)$ and $q(x, y, p)$ requires a single scan through case and control samples. The number of joint probability distributions $q(x, y)$ and $q(x, y, p)$ that need to be determined grows quadratically with the number of SNPs. This provides enough computational load to compensate for the memory transfer costs and makes the problem efficient to implement on parallel hardware.
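To make the single-scan computation concrete, the following minimal, single-threaded C++ sketch fills the joint count table for one SNP pair and derives $I(X, Y; P)$ from it. This is an illustration rather than the SNPsyn implementation: the 0/1/2 genotype coding, the binary phenotype, the data layout and the function name pairInformation are assumptions made for the example.

#include <cmath>

// Mutual information I(X,Y;P) for the SNP pair (x, y). Genotypes are assumed
// to be coded 0/1/2 and stored as genotype[s*N + snp] for sample s; the
// binary phenotype (0 = control, 1 = case) is stored in phenotype[s].
float pairInformation(const char *genotype, const char *phenotype,
                      int N, int S, int x, int y)
{
    float nxyp[3][3][2] = {};                 // joint counts for q(x, y, p)
    for (int s = 0; s < S; ++s)               // single scan through samples
        nxyp[genotype[s*N + x]][genotype[s*N + y]][phenotype[s]] += 1.0f;

    float nxy[3][3] = {}, np[2] = {};         // marginal counts q(x, y), q(p)
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b)
            for (int p = 0; p < 2; ++p) {
                nxy[a][b] += nxyp[a][b][p];
                np[p]     += nxyp[a][b][p];
            }

    float mi = 0.0f;                          // I(X,Y;P) as defined above
    for (int a = 0; a < 3; ++a)
        for (int b = 0; b < 3; ++b)
            for (int p = 0; p < 2; ++p)
                if (nxyp[a][b][p] > 0.0f)
                    mi += nxyp[a][b][p] / S *
                          std::log2(nxyp[a][b][p] * S / (nxy[a][b] * np[p]));
    return mi;
}

The single-SNP gains $I(X; P)$ and $I(Y; P)$ follow analogously from the two-dimensional count tables, and the synergy of the pair is then given by Equation 1.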
Permutation analysis is used to evaluate the significance of results on true data. Data is randomly shuffled thirty times. Each time, information gain and synergy for all pairs are calculated to obtain the null distribution, which is used to determine the significance of results on true data. Details on permutation analysis are described in Curk et al. [13].
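A schematic sketch of one possible permutation loop is given below. It is not the SNPsyn code: the helper pairSynergy (standing in for the scoring of a single pair via Equation 1), the choice to record the best score of each shuffled run, and the fixed random seed are illustrative assumptions; the actual procedure is the one described in Curk et al. [13].

#include <algorithm>
#include <random>
#include <vector>

// Assumed helper that scores one SNP pair as in Equation 1.
float pairSynergy(const char *genotype, const char *phenotype,
                  int N, int S, int x, int y);

// Hypothetical permutation loop: shuffle the phenotype labels, rescore all
// SNP pairs on the shuffled data, and keep the top score of each run as a
// sample from the null distribution.
std::vector<float> nullDistribution(const char *genotype,
                                    std::vector<char> phenotype,
                                    int N, int S, int permutations = 30)
{
    std::vector<float> nullScores;
    std::mt19937 rng(0);                        // fixed seed, for illustration
    for (int r = 0; r < permutations; ++r) {
        std::shuffle(phenotype.begin(), phenotype.end(), rng);
        float best = 0.0f;
        for (int x = 0; x < N; ++x)             // exhaustive rescoring
            for (int y = x + 1; y < N; ++y)
                best = std::max(best, pairSynergy(genotype, phenotype.data(),
                                                  N, S, x, y));
        nullScores.push_back(best);
    }
    return nullScores;                          // one value per shuffled run
}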
Parallel implementations of interaction scoring
Calculations are performed in parallel for as many SNP pairs as the hardware allows. We took special care to use the GPU and Xeon Phi hardware efficiently. We minimized memory transfers between the main CPU and the coprocessors to avoid bottlenecks and vectorized the code wherever possible. We optimized the number of threads running on the GPU to maximize throughput. To cope with the memory limitation of the GPU, SNPsyn includes optional heuristics to quickly estimate the importance of SNPs and reduce the data set prior to analysis. In the following sections we present the implementation details for both architectures.
GPU and CUDA
GPUs gain their computational power from the numerous processing cores packed into one chip. For example, the modern Nvidia Tesla K20 GPU has 13 streaming multiprocessors, each containing 192 computational units called CUDA cores. These cores lack sophisticated control units and thus work best when executing the same instruction on many data elements in parallel, with no divergent program paths in the algorithm. A programmer sees the GPU as a parallel coprocessor and can use it to speed up computationally intensive parts of the algorithm, provided there is enough data parallelism in the code to make it worthwhile.
Different tools are available for programming GPUs. Nvidia offers the CUDA toolkit [15] for programming its own products. It includes a proprietary compiler and a set of libraries that extend the C++ syntax with parallel programming constructs. Another popular option is the OpenCL framework [16]. It supports hardware from different vendors but usually lags slightly in terms of performance when compared to specialized development kits such as CUDA.
Regardless of the development tool used, the programmer must follow certain rules to obtain maximum performance [17]. The most important one is to partition the algorithm into blocks small enough to simultaneously start a sufficient number of threads to utilize all available resources. For example, consider the code snippet in Figure 3, a simplified version of the code that scores pairs of SNPs. Function computeIGain calculates the information gain of a SNP pair using Equation 1. The details of the calculation are omitted to emphasize the architecture-specific parts of the code. The snippet includes all the peculiarities of programming for GPUs. The programmer has to implement the GPU-specific part separately from the CPU code and explicitly transfer data from the host to the GPU. Special functions called kernels (marked with the keyword __global__) must be written to be executed on the GPU. Memory transfer and allocation functions must be called to supply the necessary data to the GPU and to collect the results afterwards. Usually, the programmer performs measurements to determine which thread configuration is most suitable for a particular problem size and the appropriate number of threads to launch. The mapping from a flat thread index to a SNP pair used in the snippet is isolated and checked in the sketch below.
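The following host-side C++ sketch isolates the flat-index-to-pair mapping that appears in the code snippets (Figures 3 and 4): thread i, with 0 <= i < Np and Np = N(N-1)/2, is assigned the unique SNP pair (x, y) with x < y, and the result is checked against a straightforward double loop. The isqrt integer square-root helper and the main routine are assumptions made for this self-contained example.

#include <cassert>
#include <cmath>

// Integer square root, assumed here; the device code would use its own helper.
static int isqrt(long long v) { return (int)std::sqrt((double)v); }

// Map the flat pair index i to the SNP pair (x, y), x < y, without a loop.
void indexToPair(int i, int N, int &x, int &y)
{
    int Np = N*(N - 1)/2;
    int t = (isqrt(8LL*(Np - i - 1) + 1) - 1)/2;   // triangular root
    x = (N - 1) - 1 - t;
    y = i - Np + ((t + 1)*(t + 2))/2 + x + 1;
}

int main()
{
    int N = 1000, x, y, i = 0;
    for (int a = 0; a < N; ++a)                     // reference enumeration
        for (int b = a + 1; b < N; ++b, ++i) {
            indexToPair(i, N, x, y);
            assert(x == a && y == b);               // the mapping must agree
        }
    return 0;
}

Computing the pair directly from the index keeps the GPU threads and the OpenMP loop iterations independent, so no thread needs to know how many pairs its predecessors have already processed.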
Xeon Phi and MIC
Intel designed the Xeon Phi family of coprocessors around the new MIC architecture [18] to compete with GPUs specialized in general-purpose computing. The design follows a different approach than GPUs: the coprocessor consists of many simple, but fully functional processor cores derived from the Intel Pentium architecture. Intel improved the original design by adding a 512-bit wide vector unit and Hyper-Threading Technology. This enables the Xeon Phi to achieve a theoretical performance similar to that of modern GPUs. The model 5110P, which we used in this study, includes sixty cores interconnected with a bidirectional ring bus. Each core is capable of running four threads in parallel. The cores fetch data from the 8 GB of on-board RAM and communicate with the
__device__
float computeMISingle(char *dataGPU, int N, int S,
                      int x)
{
    // compute average mutual information between
    // the phenotype and SNP x from the data array
    // and return the result
}

__device__
float computeMIPair(char *dataGPU, int N, int S,
                    int x, int y)
{
    // compute average mutual information between
    // the phenotype and SNP pair (x,y) from the data
    // array and return the result
}

__global__
void computeIGain(char *dataGPU, float *resultGPU,
                  int N, int S)
{
    int Np = N*(N - 1)/2;
    // flat index of this thread within the 2D grid
    int i = (blockIdx.y*blockDim.y + threadIdx.y)*
            (blockDim.x*gridDim.x) +
            blockIdx.x*blockDim.x + threadIdx.x;
    // map the flat index i to the SNP pair (x, y), x < y
    int t = (isqrt(8*(Np - i - 1) + 1) - 1)/2;
    int x = (N - 1) - 1 - t;
    int y = i - Np + ((t + 1)*(t + 2))/2 + x + 1;
    if (i < Np) {
        float MIxy = computeMIPair(dataGPU, N, S, x, y);
        float MIx = computeMISingle(dataGPU, N, S, x);
        float MIy = computeMISingle(dataGPU, N, S, y);
        resultGPU[i] = MIxy - MIx - MIy;
    }
}

// char *data - points to the data set with all
//              phenotypes and genotypes
// int N - number of SNPs
// int S - number of samples
...
int Np = N*(N - 1)/2;
int blocks = (Np + threads - 1)/threads;
float *result = (float*)malloc(Np*sizeof(float));
cudaMalloc((void**)&dataGPU, N*S*sizeof(char));
cudaMalloc((void**)&resultGPU, Np*sizeof(float));
cudaMemcpy(dataGPU, data, N*S*sizeof(char),
           cudaMemcpyHostToDevice);
computeIGain<<<blocks, threads>>>(dataGPU,
           resultGPU, N, S);
cudaMemcpy(result, resultGPU, Np*sizeof(float),
           cudaMemcpyDeviceToHost);
cudaFree(dataGPU);
cudaFree(resultGPU);
Figure 3 CUDA code snippet. Variables threads and blocks store the thread configuration. Function cudaMemcpy feeds the data into the GPU and retrieves the results afterwards. Each of the preconfigured GPU threads independently executes the computeIGain function and scores the associated SNP pair.
host CPU through the PCIe bus. In comparison to GPUs, each core on a Xeon Phi can efficiently execute the code even if threads do not follow the same program path. This makes it suitable for a wider range of problems, including sparse matrix multiplication [19] and operations on trees and graphs [20].
Intel provides a C++ compiler suite and all the tools needed to exploit the hardware [21]. The code can be parallelized using OpenMP directives or the MPI library and compiled for the MIC architecture; the resulting applications then run only on the Xeon Phi coprocessors. Another, more general way to specify parallel execution is to use offload constructs along with OpenMP to mark the data and the code to be transferred to and executed on the Xeon Phi. All other parts of the program run normally on the host CPU. A third possibility is to use the OpenCL framework in the same manner as with GPUs.
MIC development tools facilitate data management through compiler directives. The example in Figure 4 demonstrates this programming paradigm. It performs the same operation as the snippet from Figure 3. The programmer marks the data and the code that is needed on the coprocessor. All memory allocations and transfers are done implicitly. To obtain the best performance, the programmer must tailor the algorithms to fully utilize the vector unit. The Intel compiler automatically vectorizes sections of code where possible.
If a computer lacks a Xeon Phi, the MIC code can still be executed by the main CPU, which is not the case with the CUDA-specific implementation. The MIC code also looks much cleaner and is easier to maintain than CUDA code. The current drawbacks of using the Xeon Phi are the small number of officially supported Linux distributions (only RedHat and SuSE) and the pricey development environment for the Windows operating system. The main aspects of each architecture that are relevant to the developer are summarized in Table 1.
Results
We benchmarked SNPsyn on a workstation with two six-core Intel Xeon E5-2620 2.00 GHz CPUs capable of running up to twenty-four threads in parallel, 64 GB of RAM, two Nvidia Tesla K20 general-purpose computing cards with 5 GB of RAM each and one Intel Xeon Phi 5110P coprocessor with 8 GB of RAM. The operating system was CentOS 6.4.
We evaluated the performance on a series of representative GWAS data sets constructed from the Infinium_20060727fs1_gt_MS_GCf data set found in the WTCCC study [22]. Our goal was to observe the effect of the number of SNPs and study subjects on the execution time for different configurations. We sampled with replacement the original data on 994 subjects and 15436 SNPs to obtain data sets with the desired number of subjects and SNPs. We performed the analysis on data with 1000, 6000, and 20 000 subjects and 10 000, 100 000, and 660 000 SNPs. The study considered only the data sets that could fit into the GPU memory. The Xeon Phi
__declspec(target(mic))
float computeMISingle(char *data, int N, int S,
                      int x)
{
    // compute average mutual information between
    // the phenotype and SNP x from the data array
    // and return the result
}

__declspec(target(mic))
float computeMIPair(char *data, int N, int S,
                    int x, int y)
{
    // compute average mutual information between
    // the phenotype and SNP pair (x,y) from the data
    // array and return the result
}

// char *data - points to the data set with all
//              phenotypes and genotypes
// int N - number of SNPs
// int S - number of samples
int Np = N*(N - 1)/2;
float *result = (float*)malloc(Np*sizeof(float));
#pragma offload target(mic) \
        in(data:length(N*S)) \
        out(result:length(Np))
#pragma omp parallel for
for (int i = 0; i < Np; ++i)
{
    // map the flat index i to the SNP pair (x, y), x < y
    int t = (isqrt(8*(Np - i - 1) + 1) - 1)/2;
    int x = (N - 1) - 1 - t;
    int y = i - Np + ((t + 1)*(t + 2))/2 + x + 1;
    float MIxy = computeMIPair(data, N, S, x, y);
    float MIx = computeMISingle(data, N, S, x);
    float MIy = computeMISingle(data, N, S, y);
    result[i] = MIxy - MIx - MIy;
}
Figure 4 MIC code snippet. The first pragma directive marks the start of a MIC code section. Keywords in and out indicate the data to be transferred to and from the Xeon Phi. The OpenMP clause omp parallel for launches all available threads in parallel, which execute the code in the body of the loop and score the SNP pairs.
is clearly at an advantage compared to the K20 regarding the amount of on-board RAM (8 GB versus 5 GB). We tested six hardware configurations: one CPU core running a single thread, twelve CPU cores running twelve threads, twelve CPU cores running twenty-four threads, one GPU card, both GPU cards, and the Xeon Phi.
Figure 5 reports on execution times of the exhaustive SNP-SNP interaction analysis and the speedups achieved using the various hardware configurations. For easier comparison, execution times are plotted on a logarithmic scale. As expected, execution times increase linearly with the number of subjects and quadratically with the number of SNPs included in the analysis.
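This scaling follows from the cost model outlined in the Implementation section: each of the $N(N-1)/2$ pairs requires a single scan over the $S$ subjects, so, neglecting data transfer overheads, the wall-clock time of the exhaustive analysis behaves as

$$T_{\text{exhaustive}} \propto S \cdot \frac{N(N-1)}{2}.$$

Doubling the number of SNPs therefore roughly quadruples the run time, while doubling the number of subjects roughly doubles it.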
The single-thread CPU configuration takes more than 30 days to analyze the data on 660 000 SNPs and 1000
Table 1 Comparison of parallel computer architecture platforms with key aspects from the viewpoint of software development

| | x86/x64 single CPU | Nvidia GPU | Intel Xeon Phi |
| :--- | :--- | :--- | :--- |
| Tools | Arbitrary compiler | CUDA Toolkit or OpenCL framework | Intel compiler suite |
| OS support | Many | Windows, Linux, Mac OSX | Linux (RedHat and SuSE), Windows |
| Required programming skills | Low | High | Medium |
| Lines of code* | 260 | 460 | 360 |
| Programming remarks | None | Architecture specific optimizations | Recommended optimizations using … |
| Platform maturity | Mature | Extensive documentation, many programming examples | Bugs in drivers, documentation needs … |

Lines of code (*) reports the approximate length of the code that implements the computationally intensive tasks of SNPsyn.
subjects. Running twelve threads in parallel, one on each of the CPU cores, speeds up the computation by a factor of 10 and reduces the execution time to approximately 3 days. Increasing the number of threads to twenty-four reduces the time to perform the analysis to around 2 days, with the speedup peaking at 12.8 compared to the single-thread configuration. The memory bottleneck is the main reason why the speedup stays far below the theoretical value of 24. Interestingly, similar speedups are achieved on all (smaller) data sets, meaning that there is enough data parallelism to keep the CPU busy.
The Nvidia K20 provides a considerable reduction in execution times, with the analysis of the largest data set taking only around 17 hours, a speedup of 42 in comparison to a single CPU thread. Sharing the work between both GPU cards doubles the speedup and reduces the execution time to 8 hours. Increasing the number of subjects leads to a noticeable decrease in speedup, as more data is transferred between the main memory and the GPU. On the other hand, increasing the number of SNPs introduces more data parallelism into the computations, which is reflected in an improved speedup.
Figure 5 Execution times and speedups achieved on various computing resources. Shown are execution times on each hardware configuration for different problem sizes (a) and speedups in comparison to a single CPU thread execution (b).
Table 2 Technical specification of hardware platforms

| | Intel Xeon E5-2620 | Nvidia Tesla K20 | Intel Xeon Phi 5110P |
| :--- | :--- | :--- | :--- |
| Number of transistors | 2.3 billion | 7.1 billion | 5 billion |
| Peak power consumption | 95 W | 225 W | 225 W |
| Single precision floating point performance | 96 GFLOPS | 3.5 TFLOPS | 2.0 TFLOPS |
| Main memory | 64 GB (can be expanded) | 5 GB | 8 GB |
The Xeon Phi is positioned somewhere in between the K20 and the CPU-only implementation. It achieves a speedup of nearly 20 on the largest data set, so the analysis runs for a day and a half, double the time needed on a K20. The speedup behaves similarly for the Xeon Phi as for the K20: it increases with the number of SNPs and decreases with the number of subjects. This confirms that the drop is caused by transferring larger amounts of data without introducing additional parallelism.
Using only CPUs to analyze the data is infeasible except for small data sets, since the computations can take days to complete even on multiple cores. The Xeon Phi provides a considerable performance boost, with a maximum speedup of nearly 20, and plenty of on-board memory to store the data. The Nvidia K20 clearly outperforms every other configuration in terms of speed and is the perfect choice when one wants to cut execution times as much as possible. This comes at the price of more cumbersome programming and less on-board memory, which limits the size of the data.
The technical specifications presented in Table 2 show similar trends: the Nvidia K20 offers the highest theoretical performance in terms of TFLOPS and has the most complex design. The Xeon Phi has considerably less computing power, but interestingly draws the same amount of power as the K20 at maximum load. The Xeon E5-2620 CPU is the least efficient of all and lacks the performance to remain competitive at computationally intensive tasks.
Conclusion
We investigated how modern heterogeneous architectures cope with a computational problem typical for bioinformatics. The proof-of-concept implementation of SNPsyn on heterogeneous systems greatly reduces the wall-clock time needed for the analysis of large GWAS data sets. GPUs proved to be a mature platform that offers a large amount of computing power to address inherently parallel problems, but is demanding for the programmer. A user who is only interested in using SNPsyn to analyze their data will profit the most from having multiple GPUs in their system. The new MIC architecture greatly eases programming but lags in performance. Its ease of programming combined with respectable performance has a lot to offer to developers who do not want to spend too much time optimizing their algorithms. Moreover, MIC is a general platform capable of tackling a wider range of more complex problems, which makes it a promising choice for more complex analyses of SNP-SNP interactions, such as adjustment for covariates [23].
Availability and requirements
Project name: SNPsyn
Project home page: http://snpsyn.biolab.si
Operating systems: Linux, Windows, Mac OS
Programming language: C++
Other requirements: CUDA 2.0 or higher, Intel Composer XE 2013 or newer, make
License: GNU GPLv3
Restrictions to use by non-academics: none
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
UL, DS, TC, and BZ designed the study. DS implemented the CUDA and MIC software and measured the performance of the software. TC implemented the CPU and Python part of the software. DS wrote the first draft of the manuscript. All authors have written, read and approved the final manuscript.

Acknowledgements
BZ and TC were supported by the Slovenian Research Agency (ARRS, P2-0209); UL and DS were supported by the Slovenian Research Agency (ARRS, P2-0241).
Author details
1 Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, SI-1000 Ljubljana, Slovenia. 2 Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.
Received: 14 November 2013 Accepted: 19 June 2014
Published: 25 June 2014
References
Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC: GPU computing. Proceedings of the IEEE 2008, 96:879-899.
Greene CS, Sinnott-Armstrong NA, Himmelstein DS, Park PJ, Moore JH, Harris BT: Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS. Bioinformatics 2010, 26(5):694-695.
Liu Y, Schmidt B, Maskell D: CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes 2010, 3(1):93-104.
Ueki M, Tamiya G: Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis. BMC Bioinformatics 2012, 13(1):72.
Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309-1310.
Kam-Thong T, Czamara D, Tsuda K, Borgwardt K, Lewis CM, Erhardt-Lehmann A, Hemmer B, Rieckmann P, Daake M, Weber F, Wolf C, Ziegler A, Pütz B, Holsboer F, Schölkopf B, Müller-Myhsok B: EPIBLASTER - fast exhaustive two-locus epistasis detection strategy using graphical processing units. Eur J Hum Genet 2011, 19(4):465-471.
Kam-Thong T, Azencott C-A, Cayton L, Pütz B, Altmann A, Karbalai N, Sämann PG, Schölkopf B, Müller-Myhsok B, Borgwardt KM: GLIDE: GPU-based linear regression for detection of epistasis. Hum Hered 2012, 73(4):220-236.
Chrysos G: Intel Xeon Phi coprocessor (codename Knights Corner). In Proceedings of the 24th Hot Chips Symposium. Stanford, USA: Stanford University; 2012.
Stone JE, Gohara D, Shi G: OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 2010, 12(3):66.
Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro 2008, 28(2):39-55.
Saule E, Kaya K, Çatalyürek ÜV: Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. In Parallel Processing and Applied Mathematics. Berlin, Germany; 2014:559-570.
Liu X, Smelyanskiy M, Chow E, Dubey P: Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on Supercomputing (ICS'13). New York: ACM; 2013:273-282.
Gao T, Lu Y, Zhang B, Suo G: Using the Intel Many Integrated Core to accelerate graph traversal. Int J High Perform Comput Appl 2014. doi:10.1177/1094342014524240.
Cramer T, Schmidl D, Klemm M, an Mey D: OpenMP programming on Intel Xeon Phi coprocessors: an early performance comparison. In Proceedings of the Many-core Applications Research Community (MARC) Symposium. Aachen, Germany: RWTH Aachen University; 2012:38-44.
Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447:661-678.
Zhu Z, Tong X, Zhu Z, Liang M, Cui W, Su K, Li MD, Zhu J: Development of GMDR-GPU for gene-gene interaction analysis and its application to WTCCC GWAS data for type 2 diabetes. PLoS ONE 2013, 8(4):e61943.
doi:10.1186/1471-2105-15-216
Cite this article as: Sluga et al.: Heterogeneous computing architecture for fast detection of SNP-SNP interactions. BMC Bioinformatics 2014 15:216.
*Correspondence: uros.lotric@fri.uni-lj.si