Overview of NCCL

The NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library providing inter-GPU communication primitives that are topology-aware and can be easily integrated into applications.

NCCL implements both collective communication and point-to-point send/receive primitives. It is not a full-blown parallel programming framework; rather, it is a library focused on accelerating inter-GPU communication.

NCCL provides the following collective communication primitives:

  • AllReduce
  • Broadcast
  • Reduce
  • AllGather
  • ReduceScatter
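
For illustration only, here is a minimal sketch of the first primitive in the list, AllReduce. It assumes a communicator comm and a CUDA stream stream were created elsewhere (for example with ncclCommInitAll and cudaStreamCreate), that sendbuff and recvbuff are hypothetical device buffers holding count floats, and the helper name is ours; error handling is omitted.

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Sum `count` floats across every rank of `comm`. Both buffers are
     * device pointers; the call only enqueues work on `stream` and returns
     * immediately, so completion is observed by synchronizing the stream. */
    ncclResult_t allreduce_sum(const float *sendbuff, float *recvbuff,
                               size_t count, ncclComm_t comm,
                               cudaStream_t stream)
    {
        return ncclAllReduce(sendbuff, recvbuff, count,
                             ncclFloat, ncclSum, comm, stream);
    }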

Additionally, it supports point-to-point send/receive communication, which can be used to implement scatter, gather, or all-to-all operations.
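
As a hedged sketch of that composition, the fragment below builds an all-to-all exchange from ncclSend and ncclRecv, wrapping the calls in ncclGroupStart/ncclGroupEnd so NCCL can progress them concurrently. The buffer layout (one chunkcount-element chunk per peer in each buffer), the helper name, and the variable names are assumptions of the example; comm and stream are taken to exist already.

    #include <nccl.h>
    #include <cuda_runtime.h>

    /* Hypothetical all-to-all: each rank's device buffers hold one
     * `chunkcount`-element chunk per peer, exchanged with the matching
     * chunk on every other rank. */
    ncclResult_t alltoall(const float *sendbuff, float *recvbuff,
                          size_t chunkcount, int nranks,
                          ncclComm_t comm, cudaStream_t stream)
    {
        ncclGroupStart();
        for (int peer = 0; peer < nranks; peer++) {
            ncclSend(sendbuff + peer * chunkcount, chunkcount, ncclFloat,
                     peer, comm, stream);
            ncclRecv(recvbuff + peer * chunkcount, chunkcount, ncclFloat,
                     peer, comm, stream);
        }
        return ncclGroupEnd();
    }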

Tight synchronization between communicating processors is a key aspect of collective communication. CUDA-based collectives would traditionally be realized through a combination of CUDA memory copy operations and CUDA kernels for local reductions. NCCL, on the other hand, implements each collective in a single kernel handling both communication and computation operations. This allows for fast synchronization and minimizes the resources needed to reach peak bandwidth.

NCCL conveniently removes the need for developers to optimize their applications for specific machines. NCCL provides fast collectives over multiple GPUs both within and across nodes. It supports a variety of interconnect technologies including PCIe, NVLINK, InfiniBand Verbs, and IP sockets.

Next to performance, ease of programming was the primary consideration in the design of NCCL. NCCL uses a simple C API, which can be easily accessed from a variety of programming languages. NCCL closely follows the popular collectives API defined by MPI (Message Passing Interface). Anyone familiar with MPI will thus find NCCL’s API very natural to use. In a minor departure from MPI, NCCL collectives take a “stream” argument which provides direct integration with the CUDA programming model. Finally, NCCL is compatible with virtually any multi-GPU parallelization model, for example:

  • single-threaded control of all GPUs
  • multi-threaded, for example, using one thread per GPU
  • multi-process, for example, MPI
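
As a hedged sketch of the first model above (a single thread driving every local GPU), the example below creates one communicator and one stream per device with ncclCommInitAll. The device list, allocation pattern, and lack of error checking are simplifications for illustration.

    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        ncclComm_t   *comms   = malloc(ndev * sizeof(ncclComm_t));
        cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
        int          *devs    = malloc(ndev * sizeof(int));
        for (int i = 0; i < ndev; i++) devs[i] = i;   /* devices 0..ndev-1 */

        /* One call creates a clique of communicators over the local GPUs. */
        ncclCommInitAll(comms, ndev, devs);

        for (int i = 0; i < ndev; i++) {
            cudaSetDevice(i);
            cudaStreamCreate(&streams[i]);
            /* Collectives for GPU i are later enqueued on streams[i]; when one
             * thread issues them for several GPUs, the calls are wrapped in
             * ncclGroupStart()/ncclGroupEnd(). */
        }

        for (int i = 0; i < ndev; i++) {
            cudaSetDevice(i);
            cudaStreamDestroy(streams[i]);
            ncclCommDestroy(comms[i]);
        }
        free(devs); free(streams); free(comms);
        return 0;
    }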

NCCL has found great application in Deep Learning Frameworks, where the AllReduce collective is heavily used for neural network training. Efficient scaling of neural network training is possible with the multi-GPU and multi-node communication provided by NCCL.