Group Calls
Group functions (ncclGroupStart/ncclGroupEnd) can be used to merge multiple calls into one. This is needed for
three purposes: managing multiple GPUs from one thread (to avoid deadlocks), aggregating communication operations
to improve performance, or merging multiple send/receive point-to-point operations (see the Point-to-point communication
section). All three usages can be combined together, with one exception: calls to ncclCommInitRank()
cannot be merged with others.
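For example, a send and the matching receive issued by the same rank must be placed in the same group so that neither call blocks the other. The following is a minimal sketch, assuming a ring exchange where sendbuff, recvbuff, count, datatype, rank, nranks, comm and stream are set up by the application:

/* Each rank sends to the next rank and receives from the previous one.
   Grouping the two calls lets NCCL progress them together instead of
   blocking on the send before the receive is posted. */
ncclGroupStart();
ncclSend(sendbuff, count, datatype, (rank + 1) % nranks, comm, stream);
ncclRecv(recvbuff, count, datatype, (rank - 1 + nranks) % nranks, comm, stream);
ncclGroupEnd();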
Management Of Multiple GPUs From One Thread
When a single thread is managing multiple devices, group semantics must be used.
This is because every NCCL call may have to block, waiting for other threads/ranks to arrive, before effectively posting the NCCL operation on the given stream. Hence, a simple loop over multiple devices like the one shown below could block on the first call, waiting for the other ones:
for (int i=0; i<nLocalDevs; i++) {
ncclAllReduce(..., comm[i], stream[i]);
}
To define that these calls are part of the same collective operation, ncclGroupStart and ncclGroupEnd should be used:
ncclGroupStart();
for (int i=0; i<nLocalDevs; i++) {
ncclAllReduce(..., comm[i], stream[i]);
}
ncclGroupEnd();
This will tell NCCL to treat all calls between ncclGroupStart and ncclGroupEnd as a single call to many devices.
Caution: When called inside a group, stream operations (like ncclAllReduce) can return without having enqueued the
operation on the stream. Stream operations like cudaStreamSynchronize can therefore be called only after ncclGroupEnd
returns.
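For instance, a sketch of the correct ordering, using the same placeholder arguments as the loop above:

ncclGroupStart();
for (int i=0; i<nLocalDevs; i++) {
  ncclAllReduce(..., comm[i], stream[i]);
}
ncclGroupEnd();                        /* operations are guaranteed to be enqueued only after this returns */
for (int i=0; i<nLocalDevs; i++) {
  cudaStreamSynchronize(stream[i]);    /* safe: ncclGroupEnd has already returned */
}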
Group calls must also be used to create a communicator when one thread manages more than one device:
ncclGroupStart();
for (int i=0; i<nLocalDevs; i++) {
cudaSetDevice(device[i]);
ncclCommInitRank(comms+i, nranks, commId, rank[i]);
}
ncclGroupEnd();
Note: Contrary to NCCL 1.x, there is no need to set the CUDA device before every NCCL communication call within a group,
but it is still needed when calling ncclCommInitRank within a group.
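For illustration, here is a sketch combining both steps (assuming device, comms, streams and the allReduce arguments are set up by the application); the CUDA device is set explicitly for ncclCommInitRank, but not for the subsequent communication calls:

ncclGroupStart();
for (int i=0; i<nLocalDevs; i++) {
  cudaSetDevice(device[i]);                            /* required before ncclCommInitRank */
  ncclCommInitRank(comms+i, nranks, commId, rank[i]);
}
ncclGroupEnd();

ncclGroupStart();
for (int i=0; i<nLocalDevs; i++) {
  ncclAllReduce(..., comms[i], streams[i]);            /* no cudaSetDevice needed here */
}
ncclGroupEnd();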
Aggregated Operations (2.2 and later)
The group semantics can also be used to have multiple collective operations performed within a single NCCL launch. This
is useful for reducing the launch overhead, in other words latency, since it is incurred only once for multiple operations.
Init functions cannot be aggregated with other init functions, nor with communication functions.
Aggregation of collective operations can be done simply by having multiple calls to NCCL within a ncclGroupStart /
ncclGroupEnd section.
In the following example, we launch one broadcast and two allReduce operations together as a single NCCL launch.
ncclGroupStart();
ncclBroadcast(sendbuff1, recvbuff1, count1, datatype, root, comm, stream);
ncclAllReduce(sendbuff2, recvbuff2, count2, datatype, op, comm, stream);
ncclAllReduce(sendbuff3, recvbuff3, count3, datatype, op, comm, stream);
ncclGroupEnd();
It is permitted to combine aggregation with multi-GPU launch and to use different communicators in a group launch,
as shown in the Management Of Multiple GPUs From One Thread topic. When combining multi-GPU launch and aggregation,
ncclGroupStart and ncclGroupEnd can be used either once or at each level. The following example groups the allReduce
operations from different layers and on multiple CUDA devices:
ncclGroupStart();
for (int i=0; i<nlayers; i++) {
ncclGroupStart();
for (int g=0; g<ngpus; g++) {
ncclAllReduce(sendbuffs[g]+offsets[i], recvbuffs[g]+offsets[i], counts[i], datatype[i], op, comms[g], streams[g]);
}
ncclGroupEnd();
}
ncclGroupEnd();
Note: The NCCL operation will only be started as a whole during the last call to ncclGroupEnd. The ncclGroupStart and
ncclGroupEnd calls within the for loop are not necessary and do nothing.
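Equivalently, since the inner calls have no effect here, the same aggregation can be expressed with a single group (a sketch using the same names as the example above):

ncclGroupStart();
for (int i=0; i<nlayers; i++) {
  for (int g=0; g<ngpus; g++) {
    ncclAllReduce(sendbuffs[g]+offsets[i], recvbuffs[g]+offsets[i], counts[i], datatype[i], op, comms[g], streams[g]);
  }
}
ncclGroupEnd();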
Nonblocking Group Operation
If a communicator is marked as nonblocking through ncclCommInitRankConfig, the group functions become asynchronous
as well. In this case, if users issue multiple NCCL operations in one group, returning from ncclGroupEnd() might
not mean the NCCL communication kernels have been issued to CUDA streams. If ncclGroupEnd() returns ncclSuccess, the
NCCL kernels have been issued to the streams; if it returns ncclInProgress, the NCCL kernels are being issued to the streams
in the background. It is the user's responsibility to make sure the state of the communicator changes to ncclSuccess
before making related CUDA calls (e.g. cudaStreamSynchronize):
ncclGroupStart();
for (int g=0; g<ngpus; g++) {
ncclAllReduce(sendbuffs[g], recvbuffs[g], count, datatype, op, comms[g], streams[g]);
}
ret = ncclGroupEnd();
if (ret == ncclInProgress) {
for (int g=0; g<ngpus; g++) {
do {
ncclCommGetAsyncError(comms[g], &state);
} while (state == ncclInProgress);
}
} else if (ret == ncclSuccess) {
/* Successfully issued */
printf("NCCL kernel issue succeeded\n");
} else {
/* Errors happen */
reportErrorAndRestart();
}
for (int g=0; g<ngpus; g++) {
cudaStreamSynchronize(streams[g]);
}
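For reference, a communicator is marked as nonblocking by passing a config with blocking set to 0 to ncclCommInitRankConfig. A minimal sketch, assuming comm, nranks, commId and rank are set up as in the earlier examples:

ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;                                   /* request a nonblocking communicator */
ncclCommInitRankConfig(&comm, nranks, commId, rank, &config);
/* With a nonblocking communicator, this call may itself return ncclInProgress;
   poll ncclCommGetAsyncError until the state is no longer ncclInProgress. */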