CUDA Stream Semantics
Each NCCL call is associated with a stream, which is passed as the last argument of the collective communication function. The call returns once the operation has been enqueued on the given stream, or returns an error. The collective operation then executes asynchronously on the CUDA device. Its status can be queried using standard CUDA semantics, for example by calling cudaStreamSynchronize or by using CUDA events.
Mixing Multiple Streams within the same ncclGroupStart/End() group
NCCL allows multiple streams to be used within a group call. Doing so enforces
a dependency on all of those streams: the NCCL kernel does not start until prior
work on every stream has completed, and every stream is blocked until the NCCL
kernel completes.
The group behaves as if the NCCL group operation had been posted on every
stream but, because it is a single operation, it creates a global
synchronization point between the streams.