Creating a Communicator
When creating a communicator, a unique rank between 0 and n-1 has to be assigned to each of the n CUDA devices which
are part of the communicator. Using the same CUDA device multiple times as different ranks of the same NCCL
communicator is not supported and may lead to hangs.
Given a static mapping of ranks to CUDA devices, the ncclCommInitRank(), ncclCommInitRankConfig() and
ncclCommInitAll() functions create communicator objects, each associated with a fixed rank and CUDA device.
Those objects are then used to launch communication operations.
Before calling ncclCommInitRank(), you first need to create a unique object which all processes and threads
will use to synchronize and establish that they are part of the same communicator. This is done by calling the
ncclGetUniqueId() function.
The ncclGetUniqueId() function returns an ID which has to be broadcast to all participating threads and
processes using any CPU communication system, for example by passing the ID pointer to multiple threads, or by
broadcasting it to other processes with MPI or another parallel environment (using sockets, for instance).
You can also call the ncclCommInitAll operation to create n communicator objects at once within a single process. As it
is limited to a single process, this function does not permit inter-node communication. ncclCommInitAll is equivalent
to calling a combination of ncclGetUniqueId and ncclCommInitRank.
The following sample code is a simplified implementation of ncclCommInitAll.
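ncclResult_t ncclCommInitAll(ncclComm_t* comm, int ndev, const int* devlist) {
  ncclUniqueId Id;
  ncclGetUniqueId(&Id);
  // Group the per-device initializations so that they can proceed together.
  ncclGroupStart();
  for (int i=0; i<ndev; i++) {
    cudaSetDevice(devlist[i]);
    ncclCommInitRank(comm+i, ndev, Id, i);
  }
  // Return the result of the group so that the function always returns a value.
  return ncclGroupEnd();
}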
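Because using the same CUDA device for several ranks of one communicator is not supported, a caller may want to
validate the device list before handing it to ncclCommInitAll. The helper below is a minimal sketch of such a check.
devlistIsUnique is a hypothetical name, not an NCCL function, and it only inspects the array without calling CUDA.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of NCCL): returns true when every CUDA
 * device index in devlist appears at most once, and false when an index
 * is duplicated or negative. */
bool devlistIsUnique(const int* devlist, int ndev) {
  for (int i = 0; i < ndev; i++) {
    if (devlist[i] < 0) return false;
    for (int j = i + 1; j < ndev; j++) {
      if (devlist[i] == devlist[j]) return false;
    }
  }
  return true;
}
```

The quadratic scan is a deliberate choice here: a device list has one entry per local GPU, so a plain nested loop is
easier to audit than a hash set.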
Creating a communicator with options
The ncclCommInitRankConfig() function creates a NCCL communicator with specific options.
The configuration parameters NCCL supports are listed under ncclConfig_t.
For example, “blocking” can be set to 0 to ask NCCL to never block in any NCCL call, and other config
parameters can be set at the same time to define communicator behavior more precisely. A simple example
is shown below:
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;
config.minCTAs = 4;
config.maxCTAs = 16;
config.cgaClusterSize = 2;
config.netName = "Socket";
CHECK(ncclCommInitRankConfig(&comm, nranks, id, rank, &config));
do {
  CHECK(ncclCommGetAsyncError(comm, &state));
  // Handle outside events, timeouts, progress, ...
} while(state == ncclInProgress);
Related link: ncclCommGetAsyncError()
Creating a communicator using multiple ncclUniqueIds
The ncclCommInitRankScalable() function enables the creation of a NCCL communicator using many ncclUniqueIds.
All NCCL ranks have to provide the same array of ncclUniqueIds (the same ncclUniqueIds, in the same order).
Internally, each NCCL rank mostly communicates with a single ncclUniqueId.
Therefore, for the best performance, we recommend distributing the ncclUniqueIds as evenly as possible across the
NCCL ranks.
The following function can be used to decide whether a NCCL rank should create a ncclUniqueId:
bool rankHasRoot(const int rank, const int nRanks, const int nIds) {
  const int rmr = nRanks % nIds;
  const int rpr = nRanks / nIds;
  const int rlim = rmr * (rpr+1);
  if (rank < rlim) {
    return !(rank % (rpr + 1));
  } else {
    return !((rank - rlim) % rpr);
  }
}
For example, if 3 ncclUniqueIds are to be distributed across 7 NCCL ranks, the first ncclUniqueId will be associated
with ranks 0-2, while the others will be associated with ranks 3-4 and 5-6.
This function will therefore return true on ranks 0, 3, and 5, and false otherwise.
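To make the grouping in this example easier to check, the sketch below pairs rankHasRoot with a companion that reports
which ncclUniqueId slot a rank falls into. rankToIdIndex is a hypothetical helper written for illustration, not an NCCL
function. It assumes the same split as rankHasRoot: the first nRanks % nIds groups get one extra rank.

```c
#include <stdbool.h>

/* Copied from the text: true when this rank is the first rank of its group. */
bool rankHasRoot(const int rank, const int nRanks, const int nIds) {
  const int rmr = nRanks % nIds;
  const int rpr = nRanks / nIds;
  const int rlim = rmr * (rpr + 1);
  if (rank < rlim) {
    return !(rank % (rpr + 1));
  } else {
    return !((rank - rlim) % rpr);
  }
}

/* Hypothetical companion: index (0..nIds-1) of the ncclUniqueId group that
 * a rank belongs to, using the same arithmetic as rankHasRoot. */
int rankToIdIndex(const int rank, const int nRanks, const int nIds) {
  const int rmr = nRanks % nIds;      /* number of groups with one extra rank */
  const int rpr = nRanks / nIds;      /* base number of ranks per group */
  const int rlim = rmr * (rpr + 1);   /* first rank that sits in a base-size group */
  if (rank < rlim) {
    return rank / (rpr + 1);
  }
  return rmr + (rank - rlim) / rpr;
}
```

With 7 ranks and 3 ids, rankToIdIndex maps ranks 0-2 to slot 0, ranks 3-4 to slot 1 and ranks 5-6 to slot 2, which
matches the ranks on which rankHasRoot returns true.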
Note: only the first ncclUniqueId will be used to create the communicator hash id, which is used to identify the
communicator in the log file and in the replay tool.
Creating more communicators
The ncclCommSplit function can be used to create communicators based on an existing one. It can split an existing
communicator into multiple sub-partitions, duplicate an existing communicator, or even create a single communicator with
fewer ranks.
The ncclCommSplit function needs to be called by all ranks in the original communicator. If some ranks will not be part
of any sub-group, they still need to call ncclCommSplit with color being NCCL_SPLIT_NOCOLOR.
Newly created communicators inherit the parent communicator configuration (e.g. non-blocking).
If the parent communicator operates in non-blocking mode, a ncclCommSplit operation may be stopped by calling ncclCommAbort
on the parent communicator, then on any new communicator returned. Both aborts are needed because a hang could happen
during operations on either of the two communicators.
The following code duplicates an existing communicator:
int rank;
ncclCommUserRank(comm, &rank);
ncclCommSplit(comm, 0, rank, &newcomm, NULL);
This splits a communicator in two halves:
int rank, nranks;
ncclCommUserRank(comm, &rank);
ncclCommCount(comm, &nranks);
ncclCommSplit(comm, rank/(nranks/2), rank%(nranks/2), &newcomm, NULL);
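The color and key arithmetic of the halving example can be checked without NCCL. The sketch below isolates it in plain C.
SplitArgs, splitInHalves and splitInTwoGroups are hypothetical names introduced for illustration. The second function is
an alternative, not something the text prescribes: it keeps exactly two colors when nranks is odd. Both assume
nranks >= 2.

```c
/* Hypothetical container for the color and key arguments of ncclCommSplit. */
typedef struct {
  int color;
  int key;
} SplitArgs;

/* Same arithmetic as the example in the text. Requires nranks >= 2. */
SplitArgs splitInHalves(int rank, int nranks) {
  const int half = nranks / 2;
  SplitArgs args = { rank / half, rank % half };
  return args;
}

/* Alternative sketch: always two colors, the upper group takes the extra
 * rank when nranks is odd. Using the original rank as key keeps the
 * original ordering inside each group. */
SplitArgs splitInTwoGroups(int rank, int nranks) {
  const int half = nranks / 2;
  SplitArgs args = { rank >= half ? 1 : 0, rank };
  return args;
}
```

For an even nranks both functions produce the same two groups. For an odd count such as 5, the expression from the
text gives the last rank color 2, so it ends up alone in a third communicator, while splitInTwoGroups places it in
color 1.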
This creates a communicator with only the first 2 ranks:
int rank;
ncclCommUserRank(comm, &rank);
ncclCommSplit(comm, rank<2 ? 0 : NCCL_SPLIT_NOCOLOR, rank, &newcomm, NULL);
Using multiple NCCL communicators concurrently
Using multiple NCCL communicators requires careful synchronization, or can lead to deadlocks.
NCCL kernels are blocking (waiting for data to arrive), and any CUDA operation can cause a device synchronization,
meaning it will wait for all NCCL kernels to complete. This can quickly lead to deadlocks since NCCL operations perform
CUDA calls themselves.
Operations on different communicators should therefore be used at different epochs with a locking mechanism, and
applications should ensure operations are submitted in the same order across ranks.
Launching multiple communication operations (on different streams) might work provided they can fit within the GPU, but
could break at any time if NCCL were to use more CUDA blocks per operation, or if some calls used inside NCCL
collectives were to perform a device synchronization (e.g. allocate some CUDA memory dynamically).
Finalizing a communicator
ncclCommFinalize transitions a communicator from the ncclSuccess state to the ncclInProgress state, starts
completing all operations in the background, and synchronizes with other ranks which may be using resources for their
communications with other ranks.
All uncompleted operations and network-related resources associated with a communicator are flushed and freed by
ncclCommFinalize.
Once all NCCL operations are complete, the communicator transitions back to the ncclSuccess state. Users can
query that state with ncclCommGetAsyncError.
If a communicator is marked as nonblocking, this operation is nonblocking; otherwise, it is blocking.
Related link: ncclCommFinalize()
Destroying a communicator
Once a communicator has been finalized, the next step is to free all resources, including the communicator itself.
Local resources associated with a communicator can be destroyed with ncclCommDestroy. If the state of a communicator
is ncclSuccess when calling ncclCommDestroy, the call is guaranteed to be nonblocking; otherwise
ncclCommDestroy might block.
In all cases, the ncclCommDestroy call frees the resources of the communicator and returns, and
the communicator must no longer be accessed after ncclCommDestroy returns.
Related link: ncclCommDestroy()
Error handling and communicator abort
All NCCL calls return a NCCL error code, summarized in the table below. If a NCCL call returns an error code
different from ncclSuccess and ncclInternalError, and if NCCL_DEBUG is set to WARN, NCCL will print a human-readable
message explaining what happened.
If NCCL_DEBUG is set to INFO, NCCL will also print the call stack which led to the error.
This message is intended to help the user fix the problem.
The table below summarizes how different errors should be understood and handled. Each case is explained in detail
in the following sections.
Error | Description | Resolution | Error handling | Group behavior |
---|---|---|---|---|
ncclSuccess | No error | None | None | None |
ncclUnhandledCudaError | Error during a CUDA call (1) | CUDA configuration / usage (1) | Communicator abort (5) | Global (6) |
ncclSystemError | Error during a system call (1) | System configuration / usage (1) | Communicator abort (5) | Global (6) |
ncclInternalError | Error inside NCCL (2) | Fix in NCCL (2) | Communicator abort (5) | Global (6) |
ncclInvalidArgument | An argument to a NCCL call is invalid (3) | Fix in the application (3) | None (3) | Individual (3) |
ncclInvalidUsage | The usage of NCCL calls is invalid (4) | Fix in the application (4) | Communicator abort (5) | Global (6) |
ncclInProgress | The NCCL call is still in progress | Poll for completion using ncclCommGetAsyncError | None | None |
(1) ncclUnhandledCudaError and ncclSystemError indicate that a call NCCL made to an external component failed,
which caused the NCCL operation to fail. The error message should explain which component the user should look
at and try to fix, potentially with the help of the administrators of the system.
(2) ncclInternalError denotes a NCCL bug. It might not report a message with NCCL_DEBUG=WARN since it requires a
fix in the NCCL source code. NCCL_DEBUG=INFO will print the back trace which led to the error.
(3) ncclInvalidArgument indicates an argument value is incorrect, like a NULL pointer or an out-of-bounds value.
When this error is returned, the NCCL call had no effect. The group state remains unchanged, the communicator is
still functioning normally. The application can call ncclCommAbort or continue as if the call did not happen.
This error will be returned immediately for a call happening within a group and applies to that specific NCCL
call. It will not be returned by ncclGroupEnd since ncclGroupEnd takes no argument.
(4) ncclInvalidUsage is returned when a dynamic condition causes a failure, which denotes an incorrect usage of
the NCCL API.
(5) These errors are fatal for the communicator. To recover, the application needs to call ncclCommAbort on the
communicator and re-create it.
(6) Dynamic errors for operations within a group are always reported by ncclGroupEnd and apply to all operations
within the group, which may or may not have completed. The application must call ncclCommAbort on all communicators
within the group.
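The "Error handling" column of the table can be read as a small decision function. The sketch below encodes it so that
it compiles without NCCL: the ncclResult_t enum here is only a stand-in listing the codes from the table (its numeric
values are not meant to match the library), and RecoveryAction and recoveryActionFor are hypothetical names. Treating
any code that is not in the table as fatal is an assumption of this sketch.

```c
/* Stand-in for the result codes listed in the table, so that the sketch
 * compiles on its own. When building against NCCL, remove this enum and
 * include <nccl.h>, which defines the real ncclResult_t. */
typedef enum {
  ncclSuccess,
  ncclUnhandledCudaError,
  ncclSystemError,
  ncclInternalError,
  ncclInvalidArgument,
  ncclInvalidUsage,
  ncclInProgress
} ncclResult_t;

/* Hypothetical classification of what the application does next. */
typedef enum {
  ACTION_CONTINUE,   /* no error, nothing to do */
  ACTION_POLL,       /* poll again with ncclCommGetAsyncError */
  ACTION_FIX_CALL,   /* the call had no effect; the communicator is still usable */
  ACTION_ABORT_COMM  /* call ncclCommAbort, then re-create the communicator */
} RecoveryAction;

RecoveryAction recoveryActionFor(ncclResult_t res) {
  switch (res) {
    case ncclSuccess:         return ACTION_CONTINUE;
    case ncclInProgress:      return ACTION_POLL;
    case ncclInvalidArgument: return ACTION_FIX_CALL;
    case ncclUnhandledCudaError:
    case ncclSystemError:
    case ncclInternalError:
    case ncclInvalidUsage:
    default:
      /* Assumption: codes not listed in the table are handled as fatal. */
      return ACTION_ABORT_COMM;
  }
}
```

Per note (6), an ACTION_ABORT_COMM result for an operation inside a group applies to every communicator in that
group, so a caller would abort all of them rather than only the one that reported the error.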
Asynchronous errors and error handling
Some communication errors, and in particular network errors, are reported through the ncclCommGetAsyncError function.
Operations experiencing an asynchronous error will usually not progress and never complete. When an asynchronous error
happens, the operation should be aborted and the communicator destroyed using ncclCommAbort.
When waiting for NCCL operations to complete, applications should call ncclCommGetAsyncError and destroy the
communicator when an error happens.
The following code shows how to wait on NCCL operations and poll for asynchronous errors, instead of using
cudaStreamSynchronize.
#include <stdio.h>
#include <sched.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Returns 0 once the stream has completed, 1 on a CUDA or NCCL query failure,
// and 2 after an asynchronous NCCL error (the communicator has been aborted).
int ncclStreamSynchronize(cudaStream_t stream, ncclComm_t comm) {
  cudaError_t cudaErr;
  ncclResult_t ncclErr, ncclAsyncErr;
  while (1) {
    cudaErr = cudaStreamQuery(stream);
    if (cudaErr == cudaSuccess)
      return 0;
    if (cudaErr != cudaErrorNotReady) {
      printf("CUDA Error : cudaStreamQuery returned %d\n", cudaErr);
      return 1;
    }
    ncclErr = ncclCommGetAsyncError(comm, &ncclAsyncErr);
    if (ncclErr != ncclSuccess) {
      printf("NCCL Error : ncclCommGetAsyncError returned %d\n", ncclErr);
      return 1;
    }
    if (ncclAsyncErr != ncclSuccess) {
      // An asynchronous error happened. Stop the operation and destroy
      // the communicator
      ncclErr = ncclCommAbort(comm);
      if (ncclErr != ncclSuccess)
        printf("NCCL Error : ncclCommAbort returned %d\n", ncclErr);
      // Caller may abort or try to create a new communicator.
      return 2;
    }
    // We might want to let other threads (including NCCL threads) use the CPU.
    sched_yield();
  }
}
Fault Tolerance
NCCL provides a set of features to allow applications to recover from fatal errors such as a network failure,
a node failure, or a process failure. When such an error happens, the application should be able to call ncclCommAbort
on the communicator to free all resources, then create a new communicator to continue.
All NCCL calls can be non-blocking to ensure ncclCommAbort can be called at any point, during initialization,
communication or when finalizing the communicator.
To correctly abort, when any rank in a communicator fails (e.g., due to a segmentation fault), all other ranks need to
call ncclCommAbort to abort their own NCCL communicator.
Users can implement methods to decide when and whether to abort the communicators and restart the NCCL operation.
Here is an example showing how to initialize and split a communicator in a non-blocking manner, allowing for an abort at any point:
bool globalFlag;
bool abortFlag = false;
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;
CHECK(ncclCommInitRankConfig(&comm, nRanks, id, myRank, &config));
do {
  CHECK(ncclCommGetAsyncError(comm, &state));
} while(state == ncclInProgress && checkTimeout() != true);
if (checkTimeout() == true || state != ncclSuccess) abortFlag = true;
/* sync abortFlag among all healthy ranks. */
reportErrorGlobally(abortFlag, &globalFlag);
if (globalFlag) {
  /* time is out or initialization failed: every rank needs to abort and restart. */
  ncclCommAbort(comm);
  /* restart NCCL; this is a user implemented function, it might include
   * resource cleanup and ncclCommInitRankConfig() to create new communicators. */
  restartNCCL(&comm);
}
/* nonblocking communicator split. */
CHECK(ncclCommSplit(comm, color, key, &childComm, &config));
do {
  CHECK(ncclCommGetAsyncError(comm, &state));
} while(state == ncclInProgress && checkTimeout() != true);
if (checkTimeout() == true || state != ncclSuccess) abortFlag = true;
/* sync abortFlag among all healthy ranks. */
reportErrorGlobally(abortFlag, &globalFlag);
if (globalFlag) {
  ncclCommAbort(comm);
  /* if childComm is not NCCL_COMM_NULL, user should abort child communicator
   * here as well for resource reclamation. */
  if (childComm != NCCL_COMM_NULL) ncclCommAbort(childComm);
  restartNCCL(&comm);
}
/* application workload */
The checkTimeout function needs to be provided by users to determine the longest time the application should wait for
NCCL initialization; likewise, users can apply other methods to detect errors besides a timeout function. Similar methods
can be applied to NCCL finalization as well.
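As an illustration of what a user-provided checkTimeout could look like, here is a minimal sketch built on the
standard C time() function. startTimeout is a hypothetical companion that arms the timer before each polling loop; it
does not appear in the example above. The sketch assumes one-second resolution is acceptable and that the wall clock
is not adjusted while waiting. A monotonic clock would be more robust in production.

```c
#include <stdbool.h>
#include <time.h>

/* State shared by the two helpers below (hypothetical, single-threaded use). */
static time_t timeoutStart;
static double timeoutSeconds;

/* Arms the timer: call this right before entering a polling loop. */
void startTimeout(double seconds) {
  timeoutStart = time(NULL);
  timeoutSeconds = seconds;
}

/* Returns true once at least timeoutSeconds have elapsed since startTimeout. */
bool checkTimeout(void) {
  return difftime(time(NULL), timeoutStart) >= timeoutSeconds;
}
```

Calling startTimeout before the initialization loop and again before the split loop gives each phase its own budget
instead of sharing a single deadline.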