Fabric Manager for NVIDIA NVSwitch Systems
This document describes the NVIDIA® Fabric Manager for NVSwitch™ systems.
Chapter 1. Introduction
As deep learning neural networks become more sophisticated, their size and complexity continue to expand. The result is exponential growth in the computing capacity required to train these networks in a reasonable period. To meet this challenge, applications have turned to multi-GPU implementations.
NVIDIA NVLink®, which was introduced to connect multiple GPUs, is a direct GPU-to-GPU interconnect that scales multi-GPU input/output (IO) in the server. To further scale performance and connect more GPUs, NVIDIA introduced NVIDIA NVSwitch, which connects multiple NVLinks to provide all-to-all GPU communication at full NVLink speed.
Over the years, NVIDIA has introduced three generations of NVSwitch and the associated DGX and NVIDIA HGX™ server systems.
NVIDIA DGX-2™ and NVIDIA HGX-2 systems consist of two identical GPU baseboards with eight NVIDIA V100 GPUs and six first-generation NVSwitches on each baseboard. Each V100 GPU has one NVLink connection to each NVSwitch on the same GPU baseboard. Two GPU baseboards are connected to build a 16-GPU system. Between the two GPU baseboards, the only NVLink connections are between NVSwitches, where each NVSwitch from one GPU baseboard is connected to one NVSwitch on the second GPU baseboard for a total of eight NVLink connections.
The NVIDIA DGX™ A100 and NVIDIA HGX A100 8-GPU systems consist of a GPU baseboard with eight NVIDIA A100 GPUs and six second-generation NVSwitches. The GPU baseboard NVLink topology is like the first-generation version, where each A100 GPU has two NVLink connections to each NVSwitch on the same GPU baseboard. This generation supports connecting two GPU baseboards for a total of sixteen NVLink connections between the baseboards.
Third-generation NVSwitches are used in DGX H100 and NVIDIA HGX H100 8-GPU server systems. This server variant consists of one GPU baseboard with eight NVIDIA H100 GPUs and four NVSwitches. The corresponding NVLink topology is different from the previous generation because every GPU has four NVLinks that connect to two of the NVSwitches and five NVLinks that connect to the remaining two NVSwitches. This generation has deprecated support for connecting two GPU baseboards using NVLink.
Chapter 3. Terminology
| Term | Definition |
| :--- | :--- |
| FM | Fabric Manager |
| MMIO | Memory Mapped IO |
| VM | Virtual Machine |
| GPU register | A location in the GPU MMIO space |
| SBR | Secondary Bus Reset |
| DCGM | NVIDIA Data Center GPU Manager |
| NVML | NVIDIA Management Library |
| Service VM | A privileged VM where the NVIDIA NVSwitch software stack runs |
| Access NVLink | NVLink between a GPU and an NVSwitch |
| Trunk NVLink | NVLink between two GPU baseboards |
| SMBPBI | NVIDIA SMBus Post-Box Interface |
| vGPU | NVIDIA GRID Virtual GPU |
| MIG | Multi-Instance GPU |
| SR-IOV | Single-Root IO Virtualization |
| PF | Physical Function |
| VF | Virtual Function |
| GFID | GPU Function Identification |
| Partition | A collection of GPUs that are allowed to perform NVLink peer-to-peer communication among themselves |
| ALI | Autonomous Link Initialization |
The core software stack for NVSwitch management consists of an NVSwitch kernel driver and a privileged process called NVIDIA Fabric Manager (FM). The kernel driver performs low-level hardware management in response to FM requests. The software stack also provides in-band and out-of-band monitoring solutions to report NVSwitch and GPU errors and status information.
Chapter 5. What is Fabric Manager?
FM configures the NVSwitch memory fabrics to form one memory fabric among all participating GPUs and monitors the NVLinks that support the fabric. At a high level, FM has the following responsibilities:
Configures routing among NVSwitch ports.
Coordinates with the GPU driver to initialize GPUs.
Monitors the fabric for NVLink and NVSwitch errors.
On systems that are not capable of Autonomous Link Initialization (ALI)-based NVLink training (the first- and second-generation NVSwitch-based systems), FM also has the following additional responsibilities:
Coordinates with the NVSwitch driver to train NVSwitch-to-NVSwitch NVLink interconnects.
Coordinates with the GPU driver to initialize and train NVSwitch-to-GPU NVLink interconnects.
This document provides an overview of various FM features and is intended for system administrators and individual users of NVSwitch-based server systems.
Chapter 6. Getting Started with Fabric Manager
6.1. Basic Components
This section provides information about the basic components in FM.
6.1.1. Fabric Manager Service
The core component of FM is implemented as a standalone executable that runs as a Unix daemon process. The FM installation package installs the required core components and registers the daemon as a system service called nvidia-fabricmanager.
6.1.2. Software Development Kit
FM also provides a shared library, a set of C/C++ APIs (SDK), and the corresponding header files. These APIs are used to interface with the Fabric Manager service to query/activate/deactivate GPU partitions when FM is running in Shared NVSwitch and vGPU multi-tenancy modes. All these SDK components are installed through a separate development package. For more information, refer to Shared NVSwitch Virtualization Model and vGPU Virtualization Model.
6.2. NVSwitch and NVLink Initialization
NVIDIA GPUs and NVSwitch memory fabrics are PCIe endpoint devices that require an NVIDIA kernel driver.
On DGX-2, NVIDIA HGX-2, DGX A100, and NVIDIA HGX A100 systems that do not have ALI support, the NVLink connections are enabled after the system boots and the NVIDIA kernel driver is loaded, and FM configures these connections. CUDA initialization will fail with the cudaErrorSystemNotReady error if the application is launched before FM completely initializes the system or when FM fails to initialize the system.
On DGX H100 and NVIDIA HGX H100 systems that have ALI support, NVLinks are trained at the GPU and NVSwitch hardware levels without FM. To enable NVLink peer-to-peer support, the GPUs must register with the NVLink fabric. If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available only for non-peer-to-peer use cases. The CUDA initialization process will start after the GPUs complete their registration with the NVLink fabric.
GPU fabric registration status is exposed through the NVML APIs and as part of the nvidia-smi -q command. Refer to the following nvidia-smi command output for more information.
Here is the Fabric state output when the GPU is being registered:
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : In Progress
Status : N/A
Here is the Fabric state output after the GPU has been successfully registered:
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : Completed
Status : Success
Fabric Manager plays a critical role in the functionality of NVSwitch-based systems and is typically started during system boot or workload activation. Restarting the service intermittently is unnecessary, but if a restart is necessary because of workflow requirements, or as part of a GPU reset operation, complete the following procedure on DGX H100 and NVIDIA HGX H100 systems to ensure the system returns to a coherent state.
1. Stop all CUDA applications and GPU-related services.
Halt all running CUDA applications and services (for example, DCGM) that are actively using GPUs.
You can leave the nvidia-persistenced service running.
2. Stop the Fabric Manager service by terminating the Fabric Manager service process.
3. Perform a GPU reset by issuing the nvidia-smi -r command.
4. Start the Fabric Manager service again by restarting the Fabric Manager service, which restores its functionality.
5. Resume stopped services by restarting any services that were halted in step 1, such as DCGM or other GPU-related services.
6. Launch CUDA applications.
After completing these steps, launch your CUDA applications as needed.
Note: System administrators can set their GPU application launcher services, such as SSHD, Docker, and so on, to start after the FM service is started. Refer to your Linux distribution's manual for more information about setting up service dependencies and the service start order.
6.3. Supported Platforms
This section provides information about the products and environments that FM currently supports.
6.3.1. Hardware Architectures
x86_64
AArch64
6.3.2. NVIDIA Server Architectures
DGX-2 and NVIDIA HGX-2 systems that use V100 GPUs and first-generation NVSwitches.
DGX A100 and NVIDIA HGX A100 systems that use A100 GPUs and second-generation NVSwitches.
NVIDIA HGX A800 systems that use A800 GPUs and second-generation NVSwitches.
DGX H100 and NVIDIA HGX H100 systems that use H100 GPUs and third-generation NVSwitches.
NVIDIA HGX H800 systems that use H800 GPUs and third-generation NVSwitches.
Note: Unless specified otherwise, the steps for NVIDIA HGX A800 and NVIDIA HGX H800 are the same as the steps for NVIDIA HGX A100 and NVIDIA HGX H100. The only difference is that the number of GPU NVLinks will differ depending on the actual platform.
6.3.3. OS Environment
FM is supported on the following major Linux OS distributions:
RHEL/CentOS 7.x and RHEL/CentOS 8.x
Ubuntu 18.04.x, Ubuntu 20.04.x, and Ubuntu 22.04.x
6.4. Supported Deployment Models
NVSwitch-based systems can be deployed as bare metal servers or in a virtualized (full passthrough, Shared NVSwitch, or vGPU) multi-tenant environment. FM supports these deployment models. Refer to the following sections for more information:
Bare Metal Mode configuration
Full Passthrough Virtualized Configurations
To run the FM service, the target system must include a compatible driver for the NVIDIA Data Center GPUs, starting with version R450.
Note: During initialization, the FM service checks the currently loaded kernel driver stack version for compatibility and, if the loaded driver stack version is not compatible, aborts the process.
6.6. Installation
6.6.1. On NVSwitch-Based DGX Server Systems
The FM service is preinstalled in all the NVSwitch-based DGX-2, DGX A100, and DGX H100 systems as part of the supported DGX OS package. The service is enabled and started on OS boot.
6.6.2. On NVSwitch-Based NVIDIA HGX Server Systems
On NVSwitch-based NVIDIA HGX systems, to configure the NVLinks and NVSwitch memory fabrics to support one memory fabric, the FM service needs to be manually installed. The FM package is available through the NVIDIA CUDA network repository. Refer to the NVIDIA Driver Installation Guide for more information about setting up your system's package manager and downloading packages from the desired CUDA network repositories.
Each release version of Fabric Manager comprises the following packages:
nvidia-fabric-manager
This package includes the essential components such as the core standalone FM service process, service unit file, and topology files. For bare metal, you can install just this package.
nvidia-fabric-manager-devel
The "devel" package encompasses the FM shared library and its associated header files. This package is important when you implement the Shared NVSwitch and vGPU multi-tenancy virtualization models.
By splitting the functionality into these packages, users can selectively install the components most relevant to their needs.
After the package manager network repositories are set up, use the following distro-specific FM installation commands:
Note: In the following commands, <driver-branch> should be substituted with the required NVIDIA driver branch number for qualified data center drivers (for example, 560).
For Debian and Ubuntu based OS distributions:
sudo apt-get install -V nvidia-open-<driver-branch>
sudo apt-get install -V nvidia-fabric-manager-<driver-branch> nvidia-fabricmanager-dev-<driver-branch>
For Red Hat Enterprise Linux 8 based OS distributions:
sudo dnf module install nvidia-driver:<driver-branch>-open/fm
For SUSE Linux based OS distributions:
sudo zypper install nvidia-open-<driver-branch>-<kernel-flavor>
sudo zypper install nvidia-fabric-manager-<driver-branch> nvidia-fabricmanager-devel-<driver-branch>
For SLES based OS distributions:
sudo zypper update nvidia-fabric-manager=<driver-branch> nvidia-fabric-manager-devel=<driver-branch>
Note: On NVSwitch-based NVIDIA HGX systems, before you install FM, install the compatible driver for NVIDIA Data Center GPUs. As part of the installation, the FM service unit file (nvidia-fabricmanager.service) will be copied to systemd. However, the system administrator must manually enable and start the FM service.
6.7. Managing the Fabric Manager Service
6.7.1. Start Fabric Manager
To start FM, run the following command:
# For Linux based OS distributions
sudo systemctl start nvidia-fabricmanager
6.7.2. Stop Fabric Manager
To stop FM, run the following command:
# For Linux based OS distributions
sudo systemctl stop nvidia-fabricmanager
6.7.3. Check the Fabric Manager Status
To check the FM status, run the following command:
# For Linux based OS distributions
sudo systemctl status nvidia-fabricmanager
6.7.4. Enable the Fabric Manager Service to Auto Start at Boot
To enable the FM service to start automatically at boot, run the following command:
# For Linux based OS distributions
sudo systemctl enable nvidia-fabricmanager
6.7.5. Disable the Fabric Manager Service Auto Start at Boot
To disable the FM service from starting automatically at boot, run the following command:
# For Linux based OS distributions
sudo systemctl disable nvidia-fabricmanager
To check the FM system log messages, run the following command:
# For Linux based OS distributions
sudo journalctl -u nvidia-fabricmanager
FM supports the following command-line options:
./nv-fabricmanager -h
NVIDIA Fabric Manager
Runs as a background process to configure the NVSwitches to form
a single memory fabric among all participating GPUs.
Usage: nv-fabricmanager [options]
Options include:
[-h | --help]: Displays help information
[-v | --version]: Displays the Fabric Manager version and exit.
[-c | --config]: Provides Fabric Manager config file path/name which controls all
the config options.
[-r | --restart]: Restart Fabric Manager after exit. Applicable to Shared NVSwitch and vGPU multitenancy modes.
Most of the FM configurable parameters and options are specified through a text config file. FM installation will copy a default config file to a predefined location, and this file will be used by default. To use a different config file location, specify it with the [-c | --config] command-line argument.
Note: On Linux based installations, the default FM config file is /usr/share/nvidia/nvswitch/fabricmanager.cfg. If the default config file on the system is modified, subsequent FM package updates will provide options such as merge/keep/overwrite to manage the existing config file.
6.9. Fabric Manager Service File
6.9.1. On Linux-Based Systems
On Linux-based systems, the installation package registers the FM service with a systemd service unit file. To change the FM service start-up options, modify the service unit file at /lib/systemd/system/nvidia-fabricmanager.service.
6.10. Running Fabric Manager as Non-Root
On Linux-based systems, by default, the FM service requires administrative (root) privileges to configure all the GPU NVLinks and NVSwitches to support a memory fabric. However, system administrators and advanced users can complete the following steps to run FM from a non-root account:
1. If the FM instance is running, stop it.
2. FM requires access to the following directories and files, so adjust the corresponding directory/file access to the desired user/user group:
/var/run/nvidia-fabricmanager
A fixed location to save runtime information.
/var/log/
A configurable location to save the FM log file.
/usr/share/nvidia/nvswitch
A configurable location for fabric topology files.
This configurable directory/file information is based on the default FM config file options. If the default configuration values are changed, adjust the directory/file information accordingly.
3. The NVIDIA driver will create the following proc entry with default permission to root. Change its read/write access to the desired user/user group:
/proc/driver/nvidia-nvlink/capabilities/fabric-mgmt
4. FM also requires access to the following device node files.
On all the NVSwitch-based NVIDIA HGX systems:
/dev/nvidia-nvlink
/dev/nvidia-nvswitchctl
/dev/nvidia-nvswitchX (one for each NVSwitch device)
Here are the additional device node files on DGX-2 and NVIDIA HGX A100 systems:
/dev/nvidiactl
/dev/nvidiaX (one for each GPU device)
By default, these device node files are created by the nvidia-modprobe utility, which is installed as part of the NVIDIA Driver package for Data Center GPUs, with access permission for all users. If these device node files are created manually or outside of nvidia-modprobe, assign read/write access to the user/user group.
5. After the required permissions are assigned, manually start the FM process from the user/user group account.
6. The NVIDIA driver will create/recreate the above /proc entry during driver load, so repeat these steps on every driver reload or system boot.
When FM is configured as a systemd service, the system administrator must edit the FM service unit file to instruct systemd to run the FM service as a specific user/group. This user/group can be specified through the User= and Group= directives in the [Service] section of the FM service unit file. The system administrator must ensure that the proc entry and associated file node permissions are changed to the desired user/user group before the FM service starts at system boot time.
When FM is configured to run as a specific user/user group as specified above, the nvswitch-audit command-line utility should be started from the same user/user group account.
Note: System administrators can set up the necessary udev rules to automate the process of changing these proc entry permissions.
The configurable parameters and options used by FM are specified through a text config file. The following sections list the currently supported configurable parameters and options.
Note: The FM config file is read as part of FM service startup. If you change any config file options, restart the FM service for the new settings to take effect.
6.11.1. Logging Related Config Items
6.11.1.1 Setting the Log File Location and Name
Config Item
LOG_FILE_NAME=<value>
Supported/Possible Values
The complete path/filename string (max length of 256) for the log.
Config Item
LOG_MAX_ROTATE_COUNT=<value>
Supported/Possible Values
0 - Logging is stopped once the log file reaches the size specified in the LOG_FILE_MAX_SIZE option.
Non-zero - Rotate the current log file once it reaches the individual log file size. The combined Fabric Manager log size is LOG_FILE_MAX_SIZE multiplied by (LOG_MAX_ROTATE_COUNT + 1). After this threshold is reached, the oldest log file will be purged.
Default Value
LOG_MAX_ROTATE_COUNT=3
Note: The FM log is in a clear-text format, and NVIDIA recommends that you run the FM service with logging enabled at the INFO level for troubleshooting field issues.
6.11.2. Operating Mode Related Config Items
Note: This section of config items is applicable only to Shared NVSwitch and vGPU multitenancy deployments.
Config Item
FABRIC_MODE=<value>
Supported/Possible Values
0 - Start FM in bare metal or full passthrough virtualization mode.
1 - Start FM in Shared NVSwitch multitenancy mode. For more information, refer to Shared NVSwitch Virtualization Configurations.
2 - Start FM in vGPU multitenancy mode. For more information, refer to vGPU Virtualization Model.
Default Value
FABRIC_MODE=0
Note: The older SHARED_FABRIC_MODE configuration item is still supported, but we recommend that you use the FABRIC_MODE configuration item.
Config Item
FABRIC_MODE_RESTART=<value>
Supported/Possible Values
0 - Start FM and complete the initialization sequence.
1 - Start FM and follow the Shared NVSwitch or vGPU multitenancy mode resiliency/restart sequence.
This option is equivalent to the --restart command-line argument to the FM process and is provided to enable the Shared NVSwitch or vGPU multitenancy mode resiliency without modifying the command-line arguments to the FM process. Refer to Resiliency for more information on the FM resiliency flow.
Default Value
FABRIC_MODE_RESTART=0
Note: The older SHARED_FABRIC_MODE_RESTART configuration item is still supported, but we recommend that you use the FABRIC_MODE_RESTART configuration item.
6.11.2.3 Fabric Manager API Interface
Config Item
FM_CMD_BIND_INTERFACE=<value>
Supported/Possible Values
The network interface for the FM SDK/API to listen on, and for the hypervisor to communicate with the running FM instance for the Shared NVSwitch and vGPU multitenancy operations.
Default Value
FM_CMD_BIND_INTERFACE=127.0.0.1
6.11.2.4 Fabric Manager API TCP Port
Config Item
FM_CMD_PORT_NUMBER=<value>
Supported/Possible Values
The TCP port number for the FM SDK/API for the hypervisor to communicate with the running FM instance for Shared NVSwitch and vGPU multitenancy operations.
Config Item
FM_CMD_UNIX_SOCKET_PATH=<value>
Supported/Possible Values
The Unix domain socket path, used instead of the TCP/IP socket, for the FM SDK/API to listen on and to communicate with the running FM instance for the Shared NVSwitch and vGPU multitenancy operations.
Default Value
FM_CMD_UNIX_SOCKET_PATH=<empty value>
6.11.2.6 Fabric Manager State
Config Item
STATE_FILE_NAME=<value>
Supported/Possible Values
Specify the filename to be used to save the FM states to restart FM after a crash or a successful exit. This is only valid when the Shared NVSwitch or vGPU multitenancy mode is enabled.
Config Item
DAEMONIZE=<value>
Supported/Possible Values
0 - Do not daemonize and run FM as a normal process.
1 - Run the FM process as a Unix daemon.
Default Value
DAEMONIZE=1
6.11.3.2 Fabric Manager Communication Socket Interface
Config Item
BIND_INTERFACE_IP=<value>
Supported/Possible Values
Network interface to listen on for the FM internal communication/IPC; this value should be a valid IPv4 address.
Default Value
BIND_INTERFACE_IP=127.0.0.1
6.11.3.3 Fabric Manager Communication TCP Port
Config Item
STARTING_TCP_PORT=<value>
Supported/Possible Values
Starting TCP port number for the FM internal communication/IPC; this value should be between 0 and 65535.
Default Value
STARTING_TCP_PORT=16000
6.11.3.4 Unix Domain Socket for Fabric Manager Communication
Config Item
UNIX_SOCKET_PATH=<value>
Supported/Possible Values
Use a Unix domain socket instead of a TCP/IP socket for FM internal communication/IPC. An empty value means that the Unix domain socket is not used.
Config Item
TOPOLOGY_FILE_PATH=<value>
Supported/Possible Values
Configuration option to specify the FM topology files directory path information.
Default Value
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch
6.11.4. High Availability Mode-Related Config Items
6.11.4.1 Control Fabric Manager Behavior on Initialization Failure
Config Item
FM_STAY_RESIDENT_ON_FAILURES=<value>
Supported/Possible Values
0 - The FM service will terminate on errors such as NVSwitch and GPU config failure, typical software errors, and so on.
1 - The FM service will stay running on errors such as NVSwitch and GPU config failure, typical software errors, and so on. However, the system will be uninitialized, and CUDA application launches will fail.
The available high-availability options when there is an Access NVLink failure (GPU to NVSwitch NVLink). Refer to Supported High Availability Modes for more information about supported values and behavior.
The available high-availability options when there is a Trunk NVLink failure (NVSwitch to NVSwitch connection between GPU baseboards). Refer to Supported High Availability Modes for more information about supported values and behavior.
The available high-availability options when there is an NVSwitch failure. Refer to Supported High Availability Modes for more information about supported values and behavior.
Default Value
NVSWITCH_FAILURE_MODE=0
6.11.4.5 CUDA Jobs Behavior When the Fabric Manager Service is Stopped or Terminated
Config Item
ABORT_CUDA_JOBS_ON_FM_EXIT=<value>
Supported/Possible Values
0 - Do not abort running CUDA jobs when the FM service is stopped or exits. However, a new CUDA job launch will fail with the cudaErrorSystemNotReady error.
1 - Abort all running CUDA jobs when the FM service is stopped or exits. Also, a new CUDA job launch will fail with the cudaErrorSystemNotReady error.
Note: This config option is not effective on DGX H100 and NVIDIA HGX H100 NVSwitch-based systems. Also, this config option is applicable only to the bare metal and full passthrough virtualization models.
Default Value
ABORT_CUDA_JOBS_ON_FM_EXIT=1
Chapter 7. Bare Metal Mode
7.1. Introduction
The NVSwitch-based DGX and NVIDIA HGX server systems' default software configuration is to run the systems as bare-metal machines for workloads such as AI, machine learning, and so on. This chapter provides information about the FM installation requirements to support a bare metal configuration.
7.2.1. On NVSwitch-Based DGX Server Systems
As part of the supported DGX OS package installation, the FM service is preinstalled in all the NVSwitch-based DGX systems. The service is enabled and started when the OS boots, and the default installation configuration supports bare metal mode.
7.2.2. On NVSwitch-Based NVIDIA HGX Server Systems
To configure NVSwitch-based NVIDIA HGX systems for bare metal mode, system administrators must install the NVIDIA FM package, which is the same version as the driver package.
The driver package is for NVIDIA Data Center GPUs (version 450.xx and later) for NVIDIA HGX-2 and NVIDIA HGX A100 systems. For NVIDIA HGX H100 systems, version 525.xx and later is required.
The FM default installation mode and configuration file options support bare metal mode.
When an NVSwitch port or GPU generates a runtime error, the corresponding information will be logged into the operating system's kernel log or event log. An error report from NVSwitch will be logged with the SXid prefix, and a GPU error report will be logged with the Xid prefix by the NVIDIA driver.
The NVSwitch SXid errors use the following reporting convention:
<nvidia-nvswitchX: SXid (PCI:<switch_pci_bdf>): <SXid_Value>, <Fatal
or Non-Fatal>, <Link No> <Error Description>
<raw error information for additional troubleshooting>
The following is an example of an SXid error log:
[...] nvidia-nvswitch3: SXid (PCI:0000:c1:00.0): 28006, Non-fatal, Link
46 MC TS crumbstore MCTO (First)
[...] nvidia-nvswitch3: SXid (PCI:0000:c1:00.0): 28006, Severity 0
Engine instance 46 Sub-engine instance 00
[...] nvidia-nvswitch3: SXid (PCI:0000:c1:00.0): 28006, Data
{0x00140004, 0x00100000, 0x00140004, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000}
The GPU Xid errors use the following reporting convention:
NVRM: GPU at PCI:<gpu_pci_bdf>: <gpu_uuid>
NVRM: GPU Board Serial Number: <gpu_serial_number>
NVRM: Xid (PCI:<gpu_pci_bdf>): <Xid_Value>, <raw error information>
The following is an example of an Xid error log:
[...] NVRM: GPU at PCI:0000:34:00: GPU-c43f0536-e751-7211-d7a7-78c95249ee7d
[...] NVRM: GPU Board Serial Number: 0323618040756
[...] NVRM: Xid (PCI:0000:34:00): 45, Ch 00000010
Depending on the severity (fatal vs non-fatal) and the impacted port, the SXid and Xid errors can abort existing CUDA jobs and prevent new CUDA job launches. The next section provides information about the potential impact of SXid and Xid errors and the corresponding recovery procedures.
NVSwitch non-fatal SXids are for informational purposes only, and FM will not terminate running CUDA jobs or prevent new CUDA job launches. The existing CUDA jobs should resume, but depending on the exact error, CUDA jobs might experience issues such as a performance drop, no forward progress for a brief time, and so on.
When a fatal SXid error is reported on an NVSwitch port that connects a GPU and an NVSwitch, the corresponding error will be propagated to the GPU. The CUDA jobs that are running on that GPU will be aborted, and the GPU might report Xid 74 and Xid 45 errors. The FM service will log the corresponding GPU index and PCI bus information in its log file and syslog. The system administrator must use the following recovery procedure to clear the error state before using the GPU for an additional CUDA workload.
Reset the specified GPU (and all the participating GPUs in the affected workload) via the NVIDIA System Management Interface (nvidia-smi) command line utility.
Refer to the -r or the --gpu-reset options in nvidia-smi for more information and the individual GPU reset operation. If the problem persists, reboot or power cycle the system.
When a fatal SXid error is reported on an NVSwitch port that connects two GPU baseboards, FM will abort all the running CUDA jobs and prevent any new CUDA job launches. The GPU will also report an Xid 45 error as part of aborting CUDA jobs. The FM service will log the corresponding error information in its log file and syslog.
The system administrator must use the following recovery procedure to clear the error state before CUDA jobs can be launched successfully again:
Reset all the GPUs and NVSwitches.
Stop the FM service.
Stop all the applications that are using the GPU.
Reset all the GPUs and NVSwitches using the nvidia-smi command line utility with the -r or the --gpu-reset option.
Do not use the -i or the -id options.
After the reset operation is complete, start the FM service again.
If the problem persists, reboot or power cycle the system.
7.3.2. GPU Xid Errors
GPU Xid messages indicate that a general GPU error occurred. The messages can indicate a hardware problem, an NVIDIA software problem, or a user application problem. When a GPU experiences an Xid error, the CUDA jobs that are running on that GPU will typically be aborted. Complete the GPU reset procedure in the previous section for additional troubleshooting.
On DGX H100 and NVIDIA HGX H100 systems, FM no longer monitors and logs GPU errors. The NVIDIA driver will continue to monitor and log GPU errors in the syslog.
7.4. Interoperability With MIG
Multi-Instance GPU (MIG) partitions an NVIDIA A100 or H100 GPU into many independent GPU instances. These instances run simultaneously, each with its own memory, cache, and streaming multiprocessors. However, when you enable MIG mode, the GPU NVLinks will be disabled, and the GPU will lose its NVLink peer-to-peer (P2P) capability. After MIG mode is successfully disabled, the GPU NVLinks will be enabled again, and the GPU NVLink P2P capability will be restored. On NVSwitch-based DGX and NVIDIA HGX systems, the FM service can cooperate with GPU MIG instances. Also, on these systems, to successfully restore GPU NVLink peer-to-peer capability after MIG mode is disabled, the FM service must be running. On DGX A100 and NVIDIA HGX A100 systems, the corresponding GPU NVLinks and NVSwitch-side NVLinks are trained off when MIG mode is enabled and retrained when MIG mode is disabled. However, on DGX H100 and NVIDIA HGX H100 systems, GPU NVLinks stay active during MIG mode.
Chapter 8. Virtualization Models
8.1. Introduction
NVSwitch-based systems support multiple models to isolate NVLink interconnects in a multi-tenant environment. In virtualized environments, VM workloads often cannot be trusted and must be isolated from each other and from the host or hypervisor. The switches used to maintain this isolation cannot be directly controlled by untrusted VMs and must instead be controlled by the trusted software.
This chapter provides a high-level overview of the supported virtualization models.
8.2. Supported Virtualization Models
The NVSwitch-based systems support the following virtualization models:
Full Passthrough
GPUs and NVSwitch memory fabrics are passed to the guest OS.
Easy to deploy and requires minimal changes to the hypervisor/host OS.
Reduced NVLink bandwidth for two- and four-GPU VMs.
FM provides a shared library, a set of C/C++ APIs (SDK), and the corresponding header files. The library and APIs are used to interface with FM when it runs in the Shared NVSwitch and vGPU multi-tenant modes to query/activate/deactivate GPU partitions. All FM interface API definitions, libraries, sample code, and associated data structure definitions are delivered as a separate development package (RPM/Debian). To compile the sample code provided in this user guide, this package must be installed.
9.1. Data Structures
Here are the data structures:
// max number of GPU/fabric partitions supported by FM
#define FM_MAX_FABRIC_PARTITIONS 64
// max number of GPUs supported by FM
#define FM_MAX_NUM_GPUS 16
// Max number of ports per NVLink device supported by FM
#define FM_MAX_NUM_NVLINK_PORTS 64
// connection options for fmConnect()
typedef struct
{
unsigned int version;
char addressInfo[FM_MAX_STR_LENGTH];
unsigned int timeoutMs;
unsigned int addressIsUnixSocket;
} fmConnectParams_v1;
typedef fmConnectParams_v1 fmConnectParams_t;
// VF PCI Device Information
typedef struct
{
unsigned int domain;
unsigned int bus;
unsigned int device;
unsigned int function;
} fmPciDevice_t;
// structure to store information about a GPU belonging to fabric partition
typedef struct
{
unsigned int physicalId;
char uuid[FM_UUID_BUFFER_SIZE];
char pciBusId[FM_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
unsigned int numNvLinksAvailable;
unsigned int maxNumNvLinks;
unsigned int nvlinkLineRateMBps;
} fmFabricPartitionGpuInfo_t;
// structure to store information about a fabric partition
typedef struct
{
fmFabricPartitionId_t partitionId;
unsigned int isActive;
unsigned int numGpus;
fmFabricPartitionGpuInfo_t gpuInfo[FM_MAX_NUM_GPUS];
} fmFabricPartitionInfo_t;
// structure to store information about all the supported fabric partitions
typedef struct
{
unsigned int version;
unsigned int numPartitions;
unsigned int maxNumPartitions;
fmFabricPartitionInfo_t partitionInfo[FM_MAX_FABRIC_PARTITIONS];
} fmFabricPartitionList_v2;
typedef fmFabricPartitionList_v2 fmFabricPartitionList_t;
// structure to store information about all the activated fabric partitionIds
typedef struct
{
unsigned int version;
unsigned int numPartitions;
fmFabricPartitionId_t partitionIds[FM_MAX_FABRIC_PARTITIONS];
} fmActivatedFabricPartitionList_v1;
typedef fmActivatedFabricPartitionList_v1 fmActivatedFabricPartitionList_t;
// Structure to store information about an NVSwitch or GPU with failed NVLinks
typedef struct
{
char uuid[FM_UUID_BUFFER_SIZE];
char pciBusId[FM_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
unsigned int numPorts;
unsigned int portNum[FM_MAX_NUM_NVLINK_PORTS];
} fmNvlinkFailedDeviceInfo_t;
// Structure to store a list of NVSwitches and GPUs with failed NVLinks
typedef struct
{
unsigned int version;
unsigned int numGpus;
unsigned int numSwitches;
fmNvlinkFailedDeviceInfo_t gpuInfo[FM_MAX_NUM_GPUS];
fmNvlinkFailedDeviceInfo_t switchInfo[FM_MAX_NUM_NVSWITCHES];
} fmNvlinkFailedDevices_v1;
typedef fmNvlinkFailedDevices_v1 fmNvlinkFailedDevices_t;
/**
 * Structure to store information about an unsupported fabric partition
 */
typedef struct
{
    fmFabricPartitionId_t partitionId;            //!< a unique id assigned to reference this partition
    unsigned int numGpus;                         //!< number of GPUs in this partition
    unsigned int gpuPhysicalIds[FM_MAX_NUM_GPUS]; //!< physicalId of each GPU assigned to this partition
} fmUnsupportedFabricPartitionInfo_t;

/**
 * Structure to store information about all the unsupported fabric partitions
 */
typedef struct
{
    unsigned int version;        //!< version number. Use fmFabricPartitionList_version
    unsigned int numPartitions;  //!< total number of unsupported partitions
    fmUnsupportedFabricPartitionInfo_t partitionInfo[FM_MAX_FABRIC_PARTITIONS];  /*!< detailed information of each unsupported partition */
} fmUnsupportedFabricPartitionList_v1;
typedef fmUnsupportedFabricPartitionList_v1 fmUnsupportedFabricPartitionList_t;

#define fmUnsupportedFabricPartitionList_version1 MAKE_FM_PARAM_VERSION(fmUnsupportedFabricPartitionList_v1, 1)
#define fmUnsupportedFabricPartitionList_version fmUnsupportedFabricPartitionList_version1
Note: On DGX H100 and NVIDIA HGX H100 systems, the GPU physical ID information has the same value as the GPU Module ID information that is returned by the nvidia-smi -q output. On these systems, when partition information is reported, GPU information such as the UUID and PCI Device (BDF) will be empty. The hypervisor stack should use the GPU physical ID information to correlate the GPUs in the partition, and the actual GPUs need to be assigned to the corresponding partition's guest VM.
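The note above matters when these structures are walked in code. As a minimal sketch, assuming the SDK header is named nv_fm_agent.h (verify the name against the files installed by the devel package), the following helper walks an fmFabricPartitionList_t previously filled by fmGetSupportedFabricPartitions() (described later in this chapter) and prints each partition's GPUs by physical ID, UUID, and PCI bus ID.
#include <stdio.h>
#include <nv_fm_agent.h>   /* assumed SDK header name; check the devel package */

/* Walk a partition list previously filled by fmGetSupportedFabricPartitions().
 * On DGX H100 and NVIDIA HGX H100 systems only physicalId is meaningful;
 * uuid and pciBusId may be empty strings. */
static void printPartitionList(const fmFabricPartitionList_t *list)
{
    for (unsigned int i = 0; i < list->numPartitions; i++) {
        const fmFabricPartitionInfo_t *part = &list->partitionInfo[i];
        printf("Partition %u: active=%u numGpus=%u\n",
               (unsigned int)part->partitionId, part->isActive, part->numGpus);
        for (unsigned int g = 0; g < part->numGpus; g++) {
            const fmFabricPartitionGpuInfo_t *gpu = &part->gpuInfo[g];
            printf("  GPU physicalId=%u uuid=%s pciBusId=%s nvlinks=%u/%u\n",
                   gpu->physicalId, gpu->uuid, gpu->pciBusId,
                   gpu->numNvLinksAvailable, gpu->maxNumNvLinks);
        }
    }
}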
9.2. Initializing the Fabric Manager API Interface
To initialize the FM API interface library, run the following command:
fmReturn_t fmLibInit(void)
Parameters
None
Return Values
FM_ST_SUCCESS - if FM API interface library has been properly initialized
FM_ST_IN_USE - FM API interface library is already in initialized state.
FM_ST_GENERIC_ERROR - A generic, unspecified error occurred
9.3. Shutting Down the Fabric Manager API Interface
The following method is used to shut down the FM API interface library, and the remote connections that were established through fmConnect() will also be shut down.
fmReturn_t fmLibShutdown(void)
Parameters
None
Return Values
FM_ST_SUCCESS - if FM API interface library has been properly shut down
FM_ST_UNINITIALIZED - interface library was not in initialized state.
9.4. Connect to a Running Fabric Manager Instance
To connect to a running instance of FM, use the following method. The FM instance is started as part of the system service or manually by the system administrator. This connection will be used by the APIs to exchange information with the running FM instance.
fmReturn_t fmConnect(fmConnectParams_t *connectParams, fmHandle_t *pFmHandle)
Parameters
connectParams
Valid IP address for the remote host engine to connect to. If ipAddress
is specified as x.x.x.x it will attempt to connect to the default port
specified by FM_CMD_PORT_NUMBER. If ipAddress is specified as x.x.x.x:yyyy
it will attempt to connect to the port specified by yyyy. To connect to
an FM instance that was started with unix domain socket fill the socket
path in addressInfo member and set addressIsUnixSocket flag.
pFmHandle
Fabric Manager API interface abstracted handle for subsequent API calls
Return Values
FM_ST_SUCCESS - successfully connected to the FM instance
FM_ST_CONNECTION_NOT_VALID - if the FM instance could not be reached
FM_ST_UNINITIALIZED - FM interface library has not been initialized
FM_ST_BADPARAM - pFmHandle is NULL or IP Address/format is invalid
FM_ST_VERSION_MISMATCH - provided versions of params do not match
9.5. Disconnect from a Running Fabric Manager Instance
To disconnect from an FM instance, run the following command.
fmReturn_t fmDisconnect(fmHandle_t pFmHandle)
Parameters
pFmHandle
Handle that came from fmConnect
Return Values
FM_ST_SUCCESS - successfully disconnected from the FM instance
FM_ST_UNINITIALIZED - FM interface library has not been initialized
FM_ST_BADPARAM - if pFmHandle is not a valid handle
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
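As a minimal end-to-end sketch of the connection lifecycle, the following program initializes the library, connects to an FM instance on the loopback interface, and then disconnects and shuts the library down. The header name (nv_fm_agent.h), the link library (-lnvfm), and the fmConnectParams_version macro are assumptions that should be verified against the installed devel package.
#include <stdio.h>
#include <string.h>
#include <nv_fm_agent.h>   /* assumed SDK header name */

int main(void)
{
    fmReturn_t rc;
    fmHandle_t handle = NULL;
    fmConnectParams_t params;

    rc = fmLibInit();
    if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmLibInit failed: %d\n", (int)rc);
        return 1;
    }

    memset(&params, 0, sizeof(params));
    params.version = fmConnectParams_version;  /* assumed version macro, analogous to fmFabricPartitionList_version */
    params.timeoutMs = 1000;                   /* connection timeout in milliseconds */
    params.addressIsUnixSocket = 0;            /* set to 1 and put the FM_CMD_UNIX_SOCKET_PATH value in addressInfo for socket setups */
    strncpy(params.addressInfo, "127.0.0.1", sizeof(params.addressInfo) - 1);

    rc = fmConnect(&params, &handle);
    if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmConnect failed: %d\n", (int)rc);
        fmLibShutdown();
        return 1;
    }

    /* ... query, activate, or deactivate partitions here ... */

    fmDisconnect(handle);
    fmLibShutdown();
    return 0;
}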
9.6. Getting Supported Partitions
To query the list of supported (static) GPU fabric partitions in an NVSwitch-based system, run the following command.
fmReturn_t fmGetSupportedFabricPartitions(fmHandle_t pFmHandle,
fmFabricPartitionList_t *pFmFabricPartition)
Parameters
pFmHandle
Handle returned by fmConnect()
pFmFabricPartition
Pointer to fmFabricPartitionList_t structure. On success, the list of
supported (static) partition information will be populated in this structure.
Return Values
FM_ST_SUCCESS - successfully queried the list of supported partitions
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_VERSION_MISMATCH - provided versions of params do not match
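A hedged sketch of calling this API follows: the version field must be set before the call (the fmFabricPartitionList_version macro is referenced in the data-structure comments in Section 9.1), and FM_ST_NOT_CONFIGURED indicates that FM is still initializing, so the query should be retried. The nv_fm_agent.h header name is an assumption.
#include <stdio.h>
#include <string.h>
#include <nv_fm_agent.h>   /* assumed SDK header name */

/* Query the static partitions supported by the connected FM instance. */
static fmReturn_t listSupportedPartitions(fmHandle_t handle)
{
    fmFabricPartitionList_t partitionList;
    memset(&partitionList, 0, sizeof(partitionList));
    partitionList.version = fmFabricPartitionList_version;

    fmReturn_t rc = fmGetSupportedFabricPartitions(handle, &partitionList);
    if (rc == FM_ST_NOT_CONFIGURED) {
        fprintf(stderr, "FM is still initializing; retry later\n");
        return rc;
    }
    if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmGetSupportedFabricPartitions failed: %d\n", (int)rc);
        return rc;
    }
    printf("Supported partitions: %u (max %u)\n",
           partitionList.numPartitions, partitionList.maxNumPartitions);
    return rc;
}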
9.7. Activate a GPU Partition
To activate a supported GPU fabric partition in an NVSwitch-based system, run the following command.
Note: This API is supported only in Shared NVSwitch multi-tenancy mode.
fmReturn_t fmActivateFabricPartition(fmHandle_t pFmHandle,
fmFabricPartitionId_t partitionId)
Parameters
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be activated.
Return Values
FM_ST_SUCCESS - successfully activated the requested partition
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_IN_USE - specified partition is already active or the GPUs are in
use by other partitions.
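As a short illustrative sketch (the partition ID would normally come from fmGetSupportedFabricPartitions(), and the nv_fm_agent.h header name is an assumption), activation with basic error handling might look like this:
#include <stdio.h>
#include <nv_fm_agent.h>   /* assumed SDK header name */

/* Activate one fabric partition; valid only in Shared NVSwitch multi-tenancy mode. */
static int activatePartition(fmHandle_t handle, fmFabricPartitionId_t partitionId)
{
    fmReturn_t rc = fmActivateFabricPartition(handle, partitionId);
    if (rc == FM_ST_IN_USE) {
        fprintf(stderr, "Partition %u is already active or its GPUs are in use\n",
                (unsigned int)partitionId);
    } else if (rc != FM_ST_SUCCESS) {
        fprintf(stderr, "fmActivateFabricPartition failed: %d\n", (int)rc);
    }
    return (rc == FM_ST_SUCCESS) ? 0 : -1;
}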
9.8. Activate a GPU Partition with Virtual Functions
In the vGPU Virtualization Mode, to activate an available GPU fabric partition with vGPU Virtual Functions (VFs), run this command.
fmReturn_t fmActivateFabricPartitionWithVFs(fmHandle_t pFmHandle,
fmFabricPartitionId_t partitionId, fmPciDevice_t *vfList, unsigned int numVfs)
Parameters:
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be activated.
*vfList
List of VFs associated with physical GPUs in the partition. The
ordering of VFs passed to this call is significant; for
migration/suspend/resume compatibility, the same ordering should be used each
time the partition is activated.
numVfs
Number of VFs
Return Values:
FM_ST_SUCCESS - successfully activated the requested GPU fabric partition with the provided VFs
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_IN_USE - specified partition is already active or the GPUs are in
use by other partitions.
Note: Before you start a vGPU VM, this API must be called, even if there is only one vGPU partition. 注意:在启动 vGPU 虚拟机之前,必须调用此 API,即使只有一个 vGPU 分区。
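As a hedged illustration, the sketch below shows how a hypervisor might build a two-VF list and activate a partition in vGPU mode. It assumes fmHandle was obtained with fmConnect() as in the earlier sketch; the fmPciDevice_t field names (domain, bus, device, function), the partition id, and the BDF values are assumptions for illustration only.

```c
/* Sketch only: field names of fmPciDevice_t and the values below are assumed. */
fmFabricPartitionId_t partitionId = 1;     /* illustrative partition id */
fmPciDevice_t vfList[2];
unsigned int numVfs = 2;

memset(vfList, 0, sizeof(vfList));
vfList[0].domain = 0; vfList[0].bus = 0x07; vfList[0].device = 0x00; vfList[0].function = 4;
vfList[1].domain = 0; vfList[1].bus = 0x0f; vfList[1].device = 0x00; vfList[1].function = 4;

/* Keep the VF ordering identical every time this partition is activated so that
   suspend/resume and migration remain compatible, as noted above. */
fmReturn_t ret = fmActivateFabricPartitionWithVFs(fmHandle, partitionId, vfList, numVfs);
if (ret != FM_ST_SUCCESS)
    fprintf(stderr, "Activating partition %u with VFs failed: %d\n", (unsigned)partitionId, ret);
```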
9.9. Deactivate a GPU Partition 9.9.停用 GPU 分区
To deactivate a previously activated GPU fabric partition in an NVSwitch-based system when FM is running in Shared NVSwitch or vGPU multi-tenancy mode, run the following command. 当 FM 在共享 NVSwitch 或 vGPU 多租户模式下运行时,要在基于 NVSwitch 的系统中停用先前激活的 GPU Fabric 分区,请运行以下命令。
fmReturn_t fmDeactivateFabricPartition(fmHandle_t pFmHandle,
fmFabricPartitionId_t partitionId)
Parameters
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be deactivated.
Return Values
FM_ST_SUCCESS - successfully deactivated the requested GPU fabric partition
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_UNINITIALIZED - specified partition is not activated
9.10. Set Activated Partition List after a Fabric Manager Restart 9.10.重启 Fabric Manager 后设置已激活的分区列表
To send a list of currently activated fabric partitions to FM after it has been restarted, run the following command. 要在 FM 重新启动后向其发送当前激活的 Fabric 分区列表,请运行以下命令。
Note: If there are no active partitions when FM is restarted, this call must be made with the number of partitions as zero. 注意:如果在重新启动 FM 时没有活动分区,则必须在分区数为零的情况下进行此调用。
fmReturn_t fmSetActivatedFabricPartitions(fmHandle_t pFmHandle,
fmActivatedFabricPartitionList_t *pFmActivatedPartitionList)
Parameters
pFmHandle
Handle returned by fmConnect()
pFmActivatedPartitionList
List of currently activated fabric partitions.
Return Values
FM_ST_SUCCESS - FM state is updated with active partition information
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
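For example, the note above requires this call even when no partitions are active after an FM restart. A minimal hedged sketch of that empty-list case follows; the structure field names (version, numPartitions) and the version macro are assumptions, so check the SDK header for the exact definitions.

```c
/* After restarting FM in resiliency mode with no guest VMs running,
   report an empty list of activated partitions. Field and macro names
   below are assumptions for illustration. */
fmActivatedFabricPartitionList_t activatedList;

memset(&activatedList, 0, sizeof(activatedList));
activatedList.version = fmActivatedFabricPartitionList_version;  /* assumed version macro */
activatedList.numPartitions = 0;                                  /* no active partitions */

fmReturn_t ret = fmSetActivatedFabricPartitions(fmHandle, &activatedList);
if (ret != FM_ST_SUCCESS)
    fprintf(stderr, "fmSetActivatedFabricPartitions failed: %d\n", ret);
```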
9.11. Get the NVLink Failed Devices 9.11.获取 NVLink 失败设备
To query all GPUs and NVSwitches with failed NVLinks detected as part of FM initialization, run the following command.
Note: This API is not supported when FM is running in Shared NVSwitch or vGPU multi-tenancy resiliency restart (--restart) modes.
fmReturn_t fmGetNvlinkFailedDevices(fmHandle_t pFmHandle,
fmNvlinkFailedDevices_t *pFmNvlinkFailedDevices)
Parameters
pFmHandle
Handle returned by fmConnect()
pFmNvlinkFailedDevices
List of GPU or NVSwitch devices that have failed NVLinks.
Return Values
FM_ST_SUCCESS - successfully queried list of devices with failed NVLinks
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_VERSION_MISMATCH - provided versions of params do not match
Note: On DGX H100 and NVIDIA HGX H100 systems, NVLinks are trained at the GPU and NVSwitch hardware level using the ALI feature, without FM coordination. On these systems, FM will always return FM_ST_SUCCESS with an empty list for this API.
9.12. Get Unsupported Partitions 9.12.获取不支持的分区
To query all the unsupported fabric partitions when FM is running in Shared NVSwitch or vGPU multitenancy modes, run the following command. 要在 FM 以共享 NVSwitch 或 vGPU 多租户模式运行时查询所有不支持的 Fabric 分区,请运行以下命令。
fmReturn_t fmGetUnsupportedFabricPartitions(fmHandle_t pFmHandle,
fmUnsupportedFabricPartitionList_t *pFmUnsupportedFabricPartition)
Parameters
pFmHandle
Handle returned by fmConnect()
pFmUnsupportedFabricPartition
List of unsupported fabric partitions on the system.
Return Values
FM_ST_SUCCESS - successfully queried the list of unsupported partitions
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM - Invalid input parameters
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data FM_ST_NOT_CONFIGURED - Fabric Manager 正在初始化,没有数据
FM_ST_VERSION_MISMATCH - provided versions of params do not match FM_ST_VERSION_MISMATCH - 所提供的参数版本不匹配
Note: On DGX H100 and NVIDIA HGX H100 systems, this API will always return FM_ST_SUCCESS with an empty list of unsupported partitions.
Chapter 10. Full Passthrough Virtualization Model 第 10 章完全直通虚拟化模式
The first supported virtualization model for NVSwitch-based systems is passthrough device assignment for GPUs and NVSwitch memory fabrics (switches). VMs with 16, eight, four, two, and one GPUs are supported with predefined subsets of GPUs and NVSwitches used for each VM size. 基于 NVSwitch 的系统支持的第一个虚拟化模型是 GPU 和 NVSwitch 内存结构(交换机)的直通设备分配。支持拥有 16 个、8 个、4 个、2 个和 1 个 GPU 的虚拟机,并为每种虚拟机规模使用预定义的 GPU 和 NVSwitch 子集。
A subset of GPUs and NVSwitches is referred to as a system partition. Non-overlapping partitions can be mixed and matched, which allows you to simultaneously support, for example, an 8-GPU VM, a 4-GPU VM, and two 2-GPU VMs on an NVSwitch-based system with two GPU baseboards. VMs with 16 and eight GPUs have no loss in bandwidth while in smaller VMs, there is some bandwidth tradeoff for isolation by using dedicated switches. GPU 和 NVSwitches 的子集称为系统分区。不重叠的分区可以混合和匹配,这样就可以在一个基于 NVSwitch 并有两个 GPU 基板的系统上同时支持 8 个 GPU 虚拟机、一个 4 个 GPU 虚拟机和两个 2 个 GPU 虚拟机。拥有16个和8个GPU的虚拟机在带宽方面不会有任何损失,而对于较小的虚拟机,则需要通过使用专用交换机进行隔离,从而在带宽方面做出一些折衷。
Table 1: DGX-2 and NVIDIA HGX-2 Systems Device Assignment 表 1:DGX-2 和 NVIDIA HGX-2 系统设备分配
| Number of GPUs assigned to a VM | Number of NVSwitches assigned to a VM | Enabled NVLink Interconnects Per GPU | Enabled NVLink Interconnects Per NVSwitch | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| 16 | 12 | 6 out of 6 | 16 out of 18 | None |
| 8 | 6 | 6 out of 6 | 8 out of 18 | One set of eight GPUs from each GPU baseboard |
| 4 | 3 | 3 out of 6 | 4 out of 18 | Two sets of four GPUs from each GPU baseboard |
| 2 | 1 | 1 out of 6 | 2 out of 18 | Four sets of two GPUs from each GPU baseboard |
| 1 | 0 | 0 out of 6 | 0 out of 18 | None |
Figure 2: Software Stack in a Two-GPU Virtual Machine (A Full Passthrough Model) 图 2:双 GPU 虚拟机中的软件栈(完全直通模式)
Table 2: DGX A100 and NVIDIA HGX A100 Systems Device Assignment
| Number of GPUs assigned to a VM | Number of NVSwitches assigned to a VM | Enabled NVLink Interconnects Per GPU | Enabled NVLink Interconnects Per NVSwitch | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| 16 | 12 | 12 out of 12 | 32 out of 36 | None |
| 8 | 6 | 12 out of 12 | 16 out of 36 | One set of eight GPUs from each GPU baseboard |
| 4 | 3 | 6 out of 12 | 6 out of 36 | Two sets of four GPUs from each GPU baseboard |
| 2 | 1 | 2 out of 12 | 4 out of 36 | Four sets of two GPUs from each GPU baseboard |
| 1 | 0 | 0 out of 12 | 0 out of 36 | None |
Table 3: DGX H100 and NVIDIA HGX H100 Systems Device Assignment
| Number of GPUs assigned to a VM | Number of NVSwitches assigned to a VM | Enabled NVLink Interconnects Per GPU | Enabled NVLink Interconnects Per NVSwitch | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| 8 | 4 | 18 out of 18 | 32 out of 64 for two NVSwitches; 40 out of 64 for the other two NVSwitches | None |
| 1 | 0 | 0 out of 18 | 0 out of 64 | Need to disable GPU NVLinks |
10.2. Virtual Machines with 16 GPUs
The available GPUs and NVSwitches are assigned to the guest VM. There are no disabled NVLink interconnects on the NVSwitches or on the GPUs. To support 16-GPU partitions, the system must be populated with two GPU baseboards.
10.3. Virtual Machines with Eight GPUs
Each VM has eight GPUs, and the NVSwitches on the same baseboard (six for DGX A100 and NVIDIA HGX A100, and four for DGX H100 and NVIDIA HGX H100) must be assigned to the guest VM. Each GPU has all its NVLink interconnects enabled. If the system has two GPU baseboards, two system partitions will be available where eight-GPU VMs can be created. Otherwise, only one partition will be available. All NVLink connections between GPU baseboards are disabled.
10.4. Virtual Machines with Four GPUs
If this configuration is supported, each VM gets four GPUs and three switches. As specified in the tables above, only a subset of NVLink interconnects per GPU is enabled. If the system is populated with two GPU baseboards, four partitions are available on the system. For single-baseboard systems, two partitions are available. All NVLink connections between GPU baseboards are disabled.
10.5. Virtual Machines with Two GPUs 10.5.带有两个 GPU 的虚拟机
If this configuration is supported, each VM gets two GPUs and one NVSwitch. Also, only a subset of NVLink interconnects per GPU is enabled. If the system is populated with two GPU baseboards, eight partitions are available on the system. For single-baseboard systems, four partitions are available. All NVLink connections between GPU baseboards are disabled.
10.6. Virtual Machine with One GPU 10.6.使用一个图形处理器的虚拟机
Each VM has one GPU and no switches. If the system is populated with two GPU baseboards, 16 partitions are available on the system. For single baseboard systems, eight partitions are available. All NVLink connections between GPU baseboards are disabled. 每个虚拟机有一个 GPU,没有交换机。如果系统装有两个 GPU 底板,系统上就有 16 个分区。对于单基板系统,可使用 8 个分区。禁用 GPU 底板之间的所有 NVLink 连接。
10.7. Other Requirements 10.7.其他要求
Here are some other requirements: 下面是一些其他要求:
The hypervisor needs to maintain the partition configuration, including which NVLink connections to block on each GPU and switch for each partition. 管理程序需要维护分区配置,包括在每个 GPU 和每个分区的交换机上屏蔽哪些 NVLink 连接。
The hypervisor needs to implement MMIO filtering for NVSwitch. 管理程序需要为 NVSwitch 实施 MMIO 过滤。
The hypervisor needs to finely control IOMMU mappings that were configured for GPUs and switches. 管理程序需要精细控制为 GPU 和交换机配置的 IOMMU 映射。
Guest VMs with more than one GPU need to run the core NVSwitch software stack, for example, NVIDIA Driver and FM to configure switches and NVLink connections. 拥有一个以上 GPU 的客户虚拟机需要运行核心 NVSwitch 软件栈,例如,英伟达驱动程序和 FM,以配置交换机和 NVLink 连接。
10.8. Hypervisor Sequences 10.8.管理程序序列
The hypervisor completes the following steps to launch, shut down, and reboot guest VMs.
1. Start the guest VM.
   a. Select an unused partition of GPUs and switches.
   b. Reset the GPUs and switches in the partition.
   c. Block the disabled NVLink connections on each GPU by performing the specified MMIO configuration.
   d. Block the disabled NVLink connections on each switch by configuring the MMIO intercept.
   e. Avoid configuring any IOMMU mappings between GPUs and switches:
      - Switches must not be accessible by any other PCIe device that the guest VM controls, so the switches cannot bypass the MMIO restrictions implemented for the CPU.
      - GPUs do not need to be accessible by any other GPUs or switches.
      - GPUs need to be accessible by third-party devices to support NVIDIA GPUDirect™ RDMA.
   f. Start the guest VM without performing additional GPU resets.
2. Shut down the guest VM.
   a. Shut down the guest VM as usual.
   b. Reset the GPUs and switches that belong to the partition.
3. Reboot the guest VM.
   a. Repeat steps 1a to 1f, but this time the partition has already been selected.
10.9. Monitoring Errors 10.9.监控错误
The NVSwitch, GPU and NVLink errors are visible to guest VMs. If you want the hypervisor to monitor the same items, use one of the following methods: 客户虚拟机可以看到 NVSwitch、GPU 和 NVLink 错误。如果希望管理程序监控相同的项目,请使用以下方法之一:
In-band monitoring 带内监测
Run NVIDIA Data Center GPU Manager (DCGM) on the guest VM or use the NVIDIA Management Library (NVML) APIs for GPU-specific monitoring. 在客户虚拟机上运行英伟达™(NVIDIA®)数据中心GPU管理器(DCGM),或使用英伟达™(NVIDIA®)管理库(NVML)API进行特定于GPU的监控。
Out-of-Band Monitoring 带外监测
Use the GPU and NVSwitch SMBus Post Box Interface (SMBPBI)-based OOB commands. 使用基于 GPU 和 NVSwitch SMBus Post Box Interface (SMBPBI) 的 OOB 命令。
10.10. Limitations 10.10.限制
NVSwitch errors are visible to the guest VMs. 客户虚拟机可看到 NVSwitch 错误。
Windows is only supported for single GPU VMs. Windows 仅支持单 GPU 虚拟机。
Chapter 11. Shared NVSwitch Virtualization Model
The Shared NVSwitch virtualization model extends the GPU passthrough model by managing the switches from one Service VM that runs permanently. The GPUs are made accessible to the Service VM for link training and are then reassigned to the guest VMs. Sharing switches among the guest VMs allows FM to enable more NVLink connections for two-GPU and four-GPU VMs, which observe reduced bandwidth in the GPU passthrough model.
11.1. Software Stack 11.1.软件堆栈
The software stack required for NVSwitch management runs in a Service VM. NVSwitch 管理所需的软件栈在服务虚拟机中运行。
NVSwitch units are always assigned as PCIe passthrough devices to the Service VM. GPUs are hot-plugged and hot-unplugged on demand (as PCI passthrough devices) to the Service VM.
At a high level, the Service VM has the following features: 服务虚拟机具有以下高级功能:
Provides an interface to query the available GPU VM partitions (groupings) and corresponding GPU information. 提供一个接口,用于查询可用的 GPU 虚拟机分区(分组)和相应的 GPU 信息。
Provides an interface to activate GPU VM partitions, which involves the following: 提供了激活 GPU 虚拟机分区的接口,其中涉及以下内容:
Training NVSwitch to NVSwitch NVLink interconnects (if required). 培训 NVSwitch 到 NVSwitch NVLink 互连(如需要)。
Training the corresponding GPU NVLink interfaces (if applicable). 训练相应的 GPU NVLink 接口(如适用)。
Programming NVSwitch to deny access to GPUs not assigned to the partition. 对 NVSwitch 进行编程,拒绝访问未指定给分区的 GPU。
Provides an interface to deactivate GPU VM partitions, which involves the following: 提供了一个停用 GPU 虚拟机分区的接口,其中涉及以下内容:
Untrain (power down) the NVSwitch to NVSwitch NVLink interconnects. 解除 NVSwitch 到 NVSwitch NVLink 互连的训练(关闭电源)。
Disable the corresponding NVSwitch routing and GPU access. 禁用相应的 NVSwitch 路由和 GPU 访问。
Report NVSwitch errors through in-band and out-of-band mechanisms. 通过带内和带外机制报告 NVSwitch 错误。
11.2. Guest VM to Service VM Interaction 11.2.客户虚拟机与服务虚拟机之间的交互
For NVIDIA HGX-2, NVIDIA HGX A100, and NVIDIA HGX A800 server systems, the GPU configurations that are required to enable NVLink communication are established as part of the initial partition activation process, which occurs before transferring GPU control to the guest VM. Consequently, there is no need for the guest VM to initiate communication with the Service VM while workloads are running.
However, on NVIDIA HGX H100 and NVIDIA HGX H800 systems, a different approach is required. In these systems, the GPUs are not assigned to the Service VM during partition activation. As a result, the configurations for GPU NVLink communication must be passed to the Guest VM. Additionally, the newly introduced NVLink Sharp feature in the H100 and H800 generations necessitates dynamic adjustments to the NVSwitch configuration based on the workload requirements of the Guest VM. 不过,在英伟达 HGX H100 和英伟达 HGX H800 系统上,需要采用不同的方法。在这些系统中,GPU 不会在分区激活时分配给服务虚拟机。因此,GPU NVLink 通信的配置必须传递给客户虚拟机。此外,H100 和 H800 新推出的 NVLink Sharp 功能要求根据客户虚拟机的工作负载要求对 NVSwitch 配置进行动态调整。
To facilitate these functionalities on NVIDIA HGX H100 and NVIDIA HGX H800 systems, GPUs in the guest VM communicate over NVLink by transmitting specialized packets to the FM instance that runs on the Service VM. To simplify integration efforts, communicating these requests over NVLink is the optimal solution because it can be completely managed in NVIDIA's software and firmware, without requiring custom integrations for the customer. This communication protocol is also version agnostic, which allows compatibility between different versions of NVIDIA drivers on the guest and Service VMs.
11.3. Preparing the Service Virtual Machine 11.3.准备服务虚拟机
11.3.1. The OS Image 11.3.1.操作系统映像
Internally, NVIDIA uses an Ubuntu distro as the Service VM OS image. However, there are no known limitations with other major Linux OS distributions. Refer to OS Environment for more information. 英伟达内部使用 Ubuntu 发行版作为服务虚拟机的操作系统映像。不过,其他主要 Linux 操作系统发行版并没有已知的限制。更多信息,请参阅操作系统环境。
11.3.2. Resource Requirements 11.3.2.资源需求
Refer to the corresponding OS distributions minimum resource guidelines for more information about the exact Service VM resource requirements. In addition to the specified minimum guidelines, NVIDIA internally uses the following hardware resources for Service VM. 有关 Service VM 确切资源要求的更多信息,请参阅相应的操作系统发行版最低资源指南。除指定的最低指导原则外,NVIDIA 还在内部为 Service VM 使用以下硬件资源。
Note: The resource requirements for the Service VM might vary if it is used for additional functionalities, such as conducting a GPU health check. The specific memory and vCPU demands might also fluctuate depending on the Linux distribution you selected and the OS features you enabled. We recommend that you make necessary adjustments to the allocated memory and vCPU resources accordingly. 注意:如果服务虚拟机用于执行 GPU 健康检查等附加功能,其资源需求可能会有所不同。具体的内存和 vCPU 需求也可能会因选择的 Linux 发行版和启用的操作系统功能而波动。我们建议您相应地对分配的内存和 vCPU 资源进行必要的调整。
The Service VM image must have the following NVIDIA software packages installed:
NVIDIA Data Center GPU Driver (version 450.xx and later for NVIDIA HGX-2 and NVIDIA HGX A100 systems; version 525.xx and later for NVIDIA HGX H100 systems).
NVIDIA Fabric Manager package (same version as the driver package).
To support the Shared NVSwitch mode, start the FM service in Shared NVSwitch mode by setting the FM config item FABRIC_MODE=1. 要支持共享 NVSwitch 模式,请通过设置 FM 配置项 FABRIC_MODE=1,在共享 NVSwitch 模式下启动 FM 服务。
Note: NVSwitches and GPUs on NVIDIA HGX-2 and NVIDIA HGX A100 systems must bind to nvidia.ko before the FM service starts. If the GPUs and NVSwitches are not plugged into the Service VM as part of the OS boot, start the FM service manually, or run the process directly with the appropriate command-line options, after the NVSwitches and GPUs are bound to nvidia.ko.
In Shared NVSwitch mode, FM supports a resiliency feature, which allows the non-stop forwarding of NVLink traffic between GPUs on active guest VMs after FM gracefully or non-gracefully exits in the Service VM. To support this feature, FM uses /tmp/fabricmanager.state to save certain metadata information. To use a different location/file to store this metadata information, modify the STATE_FILE_NAME FM config file item with the path and file name. FM uses a TCP/IP loopback (127.0.0.1)-based socket interface for communication. To use Unix domain sockets instead, modify the FM_CMD_UNIX_SOCKET_PATH and UNIX_SOCKET_PATH config file options with the Unix domain socket names.
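For illustration, a Service VM Fabric Manager config for Shared NVSwitch mode might combine the items above as follows. FABRIC_MODE=1 is the required setting; the file paths are illustrative placeholders rather than recommended values, and the location of the config file itself depends on your installation.

```
FABRIC_MODE=1
STATE_FILE_NAME=/var/run/nvidia-fabricmanager/fabricmanager.state
FM_CMD_UNIX_SOCKET_PATH=/var/run/nvidia-fabricmanager/fm_cmd.socket
UNIX_SOCKET_PATH=/var/run/nvidia-fabricmanager/fm.socket
```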
11.3.5. Other NVIDIA Software Packages 11.3.5.其他英伟达软件包
In Shared NVSwitch mode, no process or entity other than FM should open and interact with the GPUs while activating or deactivating a partition. Also, all GPU health check applications must be started after activating the partition and must be closed before unbinding the GPUs from nvidia.ko.
11.4. FM Shared Library and APIs 11.4.调频共享库和应用程序接口
Refer to Fabric Manager SDK for the list of the APIs that manage a shared NVSwitch partition life cycle. 有关管理共享 NVSwitch 分区生命周期的 API 列表,请参阅 Fabric Manager SDK。
11.4.1. Sample Code 11.4.1.代码示例
The following code snippet shows how to query supported partitions, activate, or deactivate partitions, and so on by using the FM APIs mentioned in Fabric Manager SDK. 下面的代码片段展示了如何使用 Fabric Manager SDK 中提到的 FM API 来查询支持的分区、激活或停用分区等。
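The full sample is not reproduced here; as a hedged illustration, a simplified flow might look like the sketch below (connect, query supported partitions, activate one, deactivate it). The header name, version macros, and structure member names such as numPartitions, partitionInfo, and partitionId are assumptions based on the API descriptions in Fabric Manager SDK; refer to the SDK header for the exact definitions.

```c
#include <stdio.h>
#include <string.h>
#include "nv_fm_agent.h"   /* assumed FM SDK header name */

int main(void)
{
    fmHandle_t fmHandle = NULL;
    fmConnectParams_t connectParams;
    fmFabricPartitionList_t partitionList;
    fmReturn_t ret;

    if (fmLibInit() != FM_ST_SUCCESS)            /* assumed library init call */
        return 1;

    memset(&connectParams, 0, sizeof(connectParams));
    connectParams.version = fmConnectParams_version;   /* assumed version macro */
    strncpy(connectParams.addressInfo, "127.0.0.1", sizeof(connectParams.addressInfo) - 1);
    if (fmConnect(&connectParams, &fmHandle) != FM_ST_SUCCESS) {
        fmLibShutdown();
        return 1;
    }

    memset(&partitionList, 0, sizeof(partitionList));
    partitionList.version = fmFabricPartitionList_version;   /* assumed version macro */
    ret = fmGetSupportedFabricPartitions(fmHandle, &partitionList);
    if (ret == FM_ST_SUCCESS && partitionList.numPartitions > 0) {
        /* partitionInfo and partitionId are assumed member names */
        fmFabricPartitionId_t id = partitionList.partitionInfo[0].partitionId;
        printf("Activating partition %u of %u supported partitions\n",
               (unsigned)id, partitionList.numPartitions);

        if (fmActivateFabricPartition(fmHandle, id) == FM_ST_SUCCESS) {
            /* ... start the guest VM that uses this partition ... */
            fmDeactivateFabricPartition(fmHandle, id);
        }
    }

    fmDisconnect(fmHandle);
    fmLibShutdown();
    return 0;
}
```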
Refer to GPU Partitions for the default and all supported partitions in the Shared NVSwitch virtualization mode.
11.6.2. Building GPUs to Partition Mapping 11.6.2.构建 GPU 到分区映射
The FM instance that runs on the Service VM and Hypervisor must use a common numbering scheme (GPU Physical ID) to uniquely identify each GPU. In this release, the Physical ID numbering is the same as in the Baseboard Pinout design collateral. 在服务虚拟机和管理程序上运行的 FM 实例必须使用通用的编号方案(GPU 物理 ID)来唯一标识每个 GPU。在此版本中,物理 ID 编号与底板引脚布局设计抵押品中的编号相同。
The hypervisor should maintain a list of GPU Physical IDs and the corresponding PCI BDF mapping information to identify each GPU in the hypervisor. This information is required to identify the GPUs that belong to a partition and to hot-attach the GPUs to the Service VM as part of guest VM activation.
11.6.3. Booting the Service Virtual Machine 11.6.3.启动服务虚拟机
As part of Service VM boot, the hypervisor must do the following: 作为服务虚拟机启动的一部分,管理程序必须执行以下操作:
Assign/plug the available NVSwitches as PCI passthrough devices to the Service VM without MMIO filtering. 将可用的 NVSwitches 作为 PCI 直通设备分配/插入服务虚拟机,而不进行 MMIO 过滤。
On NVIDIA HGX-2 and NVIDIA HGX A100 systems, assign/plug the available GPUs as PCI passthrough devices to the Service VM without MMIO filtering. 在英伟达™(NVIDIA®)HGX-2 和英伟达™(NVIDIA®)HGX A100 系统上,将可用的 GPU 作为 PCI 直通设备分配/插入服务虚拟机,而无需 MMIO 过滤。
Start and wait for the FM to fully initialize the GPUs and switches. The FM APIs will return FM_ST_NOT_CONFIGURED until the fabric is initialized and ready. 启动并等待 FM 完全初始化 GPU 和交换机。FM API 将返回 FM_ST_NOT_CONFIGURED,直到 Fabric 初始化就绪。
Query the list of currently supported VM partitions and build the available guest VM combinations accordingly. 查询当前支持的虚拟机分区列表,并据此构建可用的客户虚拟机组合。
Deassign/unplug the GPUs from the Service VM for NVIDIA HGX-2 and NVIDIA HGX A100 systems.
11.6.4. Restarting the Service Virtual Machine 11.6.4.重启服务虚拟机
The NVSwitch kernel software stack gets loaded and initializes the NVSwitches and GPUs as part of the Service VM booting, so restarting Service VM will affect currently activated GPU partitions. The hypervisor must follow the same procedure and steps as described in Booting the Service Virtual Machine. NVSwitch 内核软件栈加载并初始化 NVSwitch 和 GPU 是服务虚拟机启动的一部分,因此重启服务虚拟机将影响当前激活的 GPU 分区。管理程序必须遵循 "启动服务虚拟机 "中所述的相同程序和步骤。
11.6.5. Shutting Down the Service Virtual Machine
Currently activated VM partitions will not be affected by a Service VM shutdown because the NVSwitch configuration is preserved. However, if the hypervisor or the PCIe passthrough driver issues a Secondary Bus Reset (SBR) to the NVSwitch devices as part of the Service VM shutdown, the activated partitions will be affected. Because FM is not running and the driver is unloaded, there will be no active error monitoring and corresponding remediation.
Note: Do not leave the guest VMs in this state for a longer period. 注意:不要让客户虚拟机长时间处于这种状态。
11.7. Guest Virtual Machine Life Cycle Management 11.7.客户虚拟机生命周期管理
To use GPU NVLink interconnects, ensure that one of the following driver packages for NVIDIA Data Center GPUs is installed on the guest VM:
Version 450.xx and later for NVIDIA HGX-2 and NVIDIA HGX A100 systems.
Version 525.xx and later for NVIDIA HGX H100 systems.
11.7.2. Starting a Guest Virtual Machine 11.7.2.启动客户虚拟机
To start a guest VM, the hypervisor must complete one of the following procedures: 要启动客户虚拟机,管理程序必须完成以下程序之一:
Note: The sequences will be different depending on the NVSwitch generation used in the system. The key difference is whether the GPU needs to be attached to Service VM and bound to nvidia.ko. 注意:根据系统中使用的 NVSwitch 代,序列会有所不同。主要区别在于 GPU 是否需要连接到服务虚拟机并绑定到 nvidia.ko。
On NVIDIA HGX-2 and NVIDIA HGX A100 Systems:
Select one of the supported GPU partitions based on guest VM GPU demand. 根据客户虚拟机 GPU 需求,从支持的 GPU 分区中选择一个。
Identify the corresponding GPUs using the GPU Physical ID to PCI BDF mapping. 使用 GPU 物理 ID 与 PCI BDF 的映射关系识别相应的 GPU。
Reset (SBR) the selected GPUs. 重置(SBR)所选 GPU。
Hot plug the selected GPUs to the Service VM. 将选定的 GPU 热插拔到服务虚拟机。
Ensure that the GPUs are bound to nvidia.ko. 确保 GPU 已绑定到 nvidia.ko。
Request FM to activate the requested GPU partition using the fmActivateFabricPartition() API. 请求 FM 使用 fmActivateFabricPartition() API 激活所请求的 GPU 分区。
Unbind the GPUs from nvidia.ko. 从 nvidia.ko 中解除 GPU 的绑定。
Hot unplug the GPUs from Service VM (if needed). 从服务虚拟机热拔 GPU(如有必要)。
Start the guest VM without resetting the GPUs. 在不重置 GPU 的情况下启动客户虚拟机。
Note: If the GPUs get a PCIe reset as part of the guest VM launch, the GPU NVLinks will be in an Inactive state on the guest VM. Also, starting the guest VM without a GPU reset might require a modification in your hypervisor VM launch sequence path.
On NVIDIA HGX H100 systems:
Select one of the supported GPU partitions based on guest VM GPU demand. 根据客户虚拟机 GPU 需求,从支持的 GPU 分区中选择一个。
Identify the corresponding GPUs by using the GPU Physical ID to PCI BDF mapping. 通过 GPU 物理 ID 与 PCI BDF 的映射,识别相应的 GPU。
Request FM to activate the requested GPU partition using the fmActivateFabricPartition() API. 请求 FM 使用 fmActivateFabricPartition() API 激活所请求的 GPU 分区。
Start the guest VM. 启动客户虚拟机。
11.7.3. Shutting Down a Guest Virtual Machine 11.7.3.关闭客户虚拟机
To shut down a guest VM, the hypervisor must do the following. 要关闭客户虚拟机,管理程序必须执行以下操作。
Note: The sequences will be different depending on the NVSwitch generation used in the system. 注意:根据系统中使用的 NVSwitch 代,顺序会有所不同。
On NVIDIA HGX-2 and NVIDIA HGX A100 Systems:
Shut down the guest VM, but avoid GPU resets to prevent any NVSwitch-side NVLink errors.
Use the fmDeactivateFabricPartition() API and request FM to deactivate the specific GPU partition.
Reset the GPUs after the deactivation partition request has completed. 在停用分区请求完成后重置 GPU。
On NVIDIA HGX H100 Systems: 在英伟达 HGX H100 系统上:
Shut down the guest VM. 关闭客户虚拟机。
Use the fmDeactivateFabricPartition() API and request FM to deactivate the specific GPU partition. 使用 fmDeactivateFabricPartition() API 并请求 FM 停用特定 GPU 分区。
If the guest VM shutdown process is not completing an explicit GPU reset, reset the GPUs after the deactivate partition request has completed. 如果客户虚拟机关闭进程未完成显式 GPU 重置,请在停用分区请求完成后重置 GPU。
11.7.4. Rebooting a Guest Virtual Machine 11.7.4.重启客户虚拟机
When rebooting a guest VM, if the GPUs get an SBR as part of the VM reboot, the hypervisor must complete the steps in Starting a Guest Virtual Machine and Shutting Down a Guest Virtual Machine.
11.7.5. Verifying GPU Routing 11.7.5.验证 GPU 路由
The nvswitch-audit command line utility, which was installed as part of the FM package, can output the number of NVLinks that the NVSwitches are programmed to handle for each GPU. The tool reconstructs this information by reading and decoding the internal NVSwitch hardware routing table information. We recommend that you periodically verify the GPU reachability matrix on each VM partition activation and deactivation cycle by running this tool in the Service VM. 作为 FM 软件包的一部分安装的 nvswitch-audit 命令行实用程序可以输出 NVSwitches 程序为每个 GPU 处理的 NVLink 数量。该工具通过读取和解码内部 NVSwitch 硬件路由表信息来重建该信息。我们建议您在服务虚拟机中运行该工具,在每个虚拟机分区激活和停用周期定期验证 GPU 可及性矩阵。
The following options are supported by nvswitch-audit command line utility. nvswitch-audit 命令行实用程序支持以下选项。
root@host1-servicevm:~# ./nvswitch-audit -h
NVIDIA NVSwitch audit tool
Reads NVSwitch hardware tables and outputs the current number of
NVlink connections between each pair of GPUs
Usage: nvswitch-audit [options]
Options include:
[-h | --help]: Displays help information
[-v | --verbose]: Verbose output including all Request and Response table entries
[-f | --full-matrix]: Display All possible GPUs including those with no connecting paths
[-c | --csv]: Output the GPU Reachability Matrix as Comma Separated Values
[-s | --src]: Source GPU for displaying number of unidirectional connections
[-d | --dst]: Destination GPU for displaying number of unidirectional connections
The following example output shows the maximum GPU NVLink connectivity when an 8-GPU VM partition on an NVIDIA HGX A100 is activated. 下面的示例输出显示了激活英伟达 HGX A100 上的 8 GPU 虚拟机分区时 GPU NVLink 的最大连接性。
Refer to Error Handling for information about FM initialization, partition, and hardware specific errors and their handling. 有关调频初始化、分区和硬件特定错误及其处理的信息,请参阅错误处理。
When the guest VM is active, all GPU runtime errors will be logged in the guest VM syslog as Xid errors. On NVIDIA HGX-2 and NVIDIA HGX A100 systems, GPU NVLink errors that require retraining are not supported in this environment; to recover, you must complete the steps in Starting a Guest Virtual Machine and Shutting Down a Guest Virtual Machine.
11.8.2. Handling a Service Virtual Machine Crash 11.8.2.处理服务虚拟机崩溃
When a Service VM experiences a kernel crash, the remaining activated guest VMs will continue as expected. However, the VM partition activation and deactivation life cycle will be affected. To recover from this state, a Service VM restart, or a reboot is required. 当服务虚拟机发生内核崩溃时,其余已激活的客户虚拟机将按预期继续运行。不过,虚拟机分区的激活和停用生命周期将受到影响。要从这种状态恢复,需要重新启动或重启服务虚拟机。
11.9. Interoperability With a Multi-Instance GPU
The Shared NVSwitch virtualization model can interoperate with the MIG feature that is supported on NVIDIA A100 and H100 GPUs. However, to expose a shared NVSwitch partition with MIG-enabled GPUs to guest VMs, use one of the options in this section. Because NVLinks are not trained on H100 GPUs when MIG is enabled, these options are not applicable to NVIDIA HGX H100 systems.
11.9.1. Initializing Service Virtual Machine 11.9.1.初始化服务虚拟机
When FM initializes on the Service VM, without the --restart option for resiliency flow, the MIG mode must be disabled for the available GPUs. If any GPUs have MIG mode enabled, the FM service initialization will be aborted. 在服务虚拟机上初始化 FM 时,如果不使用弹性流的--restart 选项,则必须禁用可用 GPU 的 MIG 模式。如果任何 GPU 启用了 MIG 模式,FM 服务初始化将被中止。
11.9.2. Activating the Guest Virtual Machine 11.9.2.激活客户虚拟机
The FM-shared NVSwitch partition activation and deactivation sequence can handle MIG- enabled GPUs. However, GPUs in which MIG was enabled before the partition was activated, for example by the VM before the VM reboot, will not have NVLinks trained as part of the partition activation. The activation/deactivation flow works as expected. FM-shared NVSwitch 分区激活和停用序列可以处理启用了 MIG 的 GPU。但是,在分区激活之前(例如由虚拟机重启之前)已启用 MIG 的 GPU 将不会在分区激活过程中接受 NVLinks 训练。激活/停用流程可按预期运行。
Chapter 12. vGPU Virtualization Model 第 12 章 vGPU 虚拟化模型
The vGPU virtualization model supports VF passthrough by enabling SR-IOV functionality in all the supported GPUs and assigning a specific VF, or set of VFs, to the VM. vGPU 虚拟化模型通过在所有支持的 GPU 中启用 SR-IOV 功能并为虚拟机分配特定的 VF 或一组 VF 来支持 VF 直通。
GPU NVLinks are assigned to only one VF at a time. GPU NVLink 一次只能分配给一个 VF。
NVLink P2P between GPUs that belong to different VMs or partitions is not supported. 不支持属于不同虚拟机或分区的 GPU 之间的 NVLink P2P。
Refer to the vGPU Software User Guide for more information about the supported vGPU functionality, features, and configurations. 有关支持的 vGPU 功能、特性和配置的详细信息,请参阅《vGPU 软件用户指南》。
12.1. Software Stack 12.1.软件堆栈
In the vGPU virtualization model, the NVSwitch Software Stack (FM and Switch Driver) runs in the vGPU host. Like the bare-metal mode, the physical GPUs and NVSwitches are owned and managed by the vGPU host. The GPU and NVSwitch NVLinks are trained and configured as part of FM initialization. The switch routing table is initialized to prevent any GPU-GPU communication. 在 vGPU 虚拟化模式中,NVSwitch 软件栈(FM 和交换机驱动程序)在 vGPU 主机中运行。与裸机模式一样,物理 GPU 和 NVSwitch 由 vGPU 主机拥有和管理。GPU 和 NVSwitch NVLink 的训练和配置是 FM 初始化的一部分。交换机路由表已初始化,以防止任何 GPU-GPU 通信。
Note: The vGPU-based deployment model is not supported on first-generation NVSwitch systems such as DGX-2 and NVIDIA HGX-2.
Note: The vGPU-based deployment model is not supported in the current release for DGX H100 and NVIDIA HGX H100 systems. NVIDIA plans to add this support in a future software release.
12.2. Preparing the vGPU Host 12.2.准备 vGPU 主机
12.2.1. OS Image 12.2.1.操作系统图像
Refer to the vGPU Software User Guide for the list of supported OSs, hypervisors, and for information about installing and configuring the vGPU host driver software. 有关支持的操作系统和管理程序列表,以及安装和配置 vGPU 主机驱动程序软件的信息,请参阅《vGPU 软件用户指南》。
In addition to the NVIDIA vGPU host driver software, the vGPU host image must have the following NVIDIA software packages installed: 除英伟达 vGPU 主机驱动程序软件外,vGPU 主机映像还必须安装以下英伟达软件包:
To support vGPU virtualization, start the FM service in vGPU Virtualization mode by setting the FABRIC_MODE=2 FM config item. 要支持 vGPU 虚拟化,请通过设置 FABRIC_MODE=2 FM 配置项在 vGPU 虚拟化模式下启动 FM 服务。
Note: NVSwitches must bind to nvidia.ko before the FM service starts. On DGX A100 and NVIDIA HGX A100 systems, all the GPUs must also be bound to nvidia.ko before the FM service starts.
In the vGPU virtualization mode, FM supports a resiliency feature that allows the continuous forwarding of NVLink traffic between GPUs on active guest VMs after FM exits (gracefully or non-gracefully) on the vGPU host. To support this feature, FM uses /tmp/fabricmanager.state to save certain metadata information. To use a different location/file to store this metadata information, modify the STATE_FILE_NAME FM config file item with the new path and file name.
By default, FM uses TCP I/P loopback (127.0.0.1)-based socket interface for communication. To use Unix domain sockets instead, modify the FM_CMD_UNIX_SOCKET_PATH and UNIX_SOCKET_PATH FM config file options with the new Unix domain socket names. 默认情况下,FM 使用基于 TCP I/P loopback (127.0.0.1) 的套接字接口进行通信。要改用 Unix 域套接字,请使用新的 Unix 域套接字名称修改 FM_CMD_UNIX_SOCKET_PATH 和 UNIX_SOCKET_PATH FM 配置文件选项。
12.3. Fabric Manager-Shared Library and APIs 12.3.Fabric 管理器共享库和 API
Refer to Fabric Manager SDK for a list of the APIs to manage the vGPU partition life cycle. 有关管理 vGPU 分区生命周期的 API 列表,请参阅 Fabric Manager SDK。
Refer to Resiliency for more information about FM resiliency in vGPU Virtualization mode. 有关 vGPU 虚拟化模式下调频弹性的更多信息,请参阅弹性。
12.5. vGPU Partitions 12.5. vGPU 分区
Refer to GPU Partitions for the default supported partitions for the vGPU virtualization model. 有关 vGPU 虚拟化模型默认支持的分区,请参阅 GPU 分区。
12.6. Guest Virtual Machine Life Cycle Management 12.6.客户虚拟机生命周期管理
Here is an overview of the guest VM life cycle: 下面是客户虚拟机生命周期概览:
1. The system powers on and initializes.
   a. The vGPU host driver loads.
   b. SR-IOV is enabled.
   c. FM initializes in the vGPU Virtualization Mode.
   d. NVLinks are trained.
2. The partition is activated with the selected SR-IOV VFs.
3. The vGPU-enabled VM completes its life cycle with the VFs selected in step 2.
   This life cycle can involve boot, reboot, shutdown, suspend, resume, and migrate activities.
4. The partition deactivates.
These steps are explained in greater detail in the following sections.
12.6.1. Activating the Partition and Starting the Virtual Machine 12.6.1.激活分区并启动虚拟机
SR-IOV VFs must be enabled on the physical GPUs before you activate partitions and power on the vGPU VMs. 在激活分区和 vGPU 虚拟机电源之前,必须在物理 GPU 上启用 SR-IOV VF。
When starting a guest VM, the hypervisor must do the following: 启动客户虚拟机时,管理程序必须执行以下操作:
Select an available GPU partition that contains the required number of GPUs for the guest VM and select the VFs that will be used on those GPUs. 选择包含客户虚拟机所需 GPU 数量的可用 GPU 分区,并选择将在这些 GPU 上使用的 VF。
Use the fmActivateFabricPartitionWithVFs( ) API and request FM to activate the GPU partition, with the set of selected VFs. 使用 fmActivateFabricPartitionWithVFs( ) API 并请求 FM 使用所选 VF 激活 GPU 分区。
Start the guest VM with the selected VFs. 使用选定的虚拟机启动客户虚拟机。
Note: Partition activation is always required before starting a vGPU VM, even for VMs that use only one vGPU. 注意:启动 vGPU 虚拟机之前始终需要激活分区,即使只使用一个 vGPU 的虚拟机也是如此。
The ordering of VFs used during partition activation and VM assignment must remain consistent to ensure the correct suspend, resume, and migration operations. 分区激活和虚拟机分配过程中使用的虚拟机配置的顺序必须保持一致,以确保正确的挂起、恢复和迁移操作。
Deactivate partitions only when no VM is executing on the GPUs in the partition. To deactivate a partition: 只有当没有虚拟机在分区的 GPU 上执行时,才会停用分区。要停用分区,请
Shut down the guest VM that is currently operating in the partition. 关闭当前在分区中运行的客户虚拟机。
Use the fmDeactivateFabricPartition() API and request that FM deactivate the partition. 使用 fmDeactivateFabricPartition() API 并请求 FM 停用分区。
12.6.3. Migrating Virtual Machines 12.6.3.迁移虚拟机
VM migration is supported only between partitions with an identical number of GPUs, GPU type, and NVLink topology.
Refer to Migrating a VM Configured with vGPU for more information. 有关详细信息,请参阅迁移使用 vGPU 配置的虚拟机。
The nvswitch-audit command-line utility referenced in Verifying GPU Routing can also be used to verify NVSwitch routing information in the vGPU mode. We recommend that you run this tool to periodically verify the GPU reachability matrix on each VM partition activation and deactivation cycle.
12.7. Error Handling 12.7.错误处理
Refer to Error Handling for information about FM initialization, partition, and hardware-specific errors and their handling.
When the guest VM is active, GPU runtime errors will be logged in the vGPU host syslog as Xid errors. On DGX A100 and NVIDIA HGX A100 systems, GPU NVLink errors that require retraining are not supported in this environment, and you must complete the guest VM shutdown and start sequence to recover.
12.8. GPU Reset 12.8.重置图形处理器
If the GPU generates a runtime error or gets an Xid NVLink error, the system administrator can clear the corresponding error state and recover the GPU using the GPU reset operation. The operation must be initiated from the vGPU host after a VM that is using the GPU is shut down and the corresponding partition is deactivated. Refer to the nvidia-smi command-line utility documentation for more information. 如果 GPU 生成运行时错误或出现 Xid NVLink 错误,系统管理员可以清除相应的错误状态,并使用 GPU 重置操作恢复 GPU。必须在关闭使用 GPU 的虚拟机并停用相应分区后,从 vGPU 主机启动该操作。有关详细信息,请参阅 nvidia-smi 命令行实用程序文档。
12.9. Interoperability with MIG 12.9.与 MIG 的互操作性
MIG-backed vGPUs on NVIDIA A100 and NVIDIA HGX A100 cannot use NVLink. FM's vGPU Virtualization mode can still interoperate with the MIG feature to support use cases where a subset of GPUs is used in MIG mode.
12.9.1. Enabling MIG before Starting the Fabric Manager Service 12.9.1.启动 Fabric Manager 服务前启用 MIG
When MIG was enabled on a GPU before FM was started, FM will remove the partitions that contain GPUs in MIG mode from its list of available partitions.
These GPU partitions will not be available for deploying VMs. 这些 GPU 分区将无法用于部署虚拟机。
To enable partitions after disabling MIG mode on a GPU, reboot the system. 要在 GPU 上禁用 MIG 模式后启用分区,请重新启动系统。
12.9.2. Enabling MIG After Starting the Fabric Manager Service 12.9.2.启动 Fabric Manager 服务后启用 MIG
MIG functionality might be enabled on any GPU after starting the FM service but before a partition that contains the GPU is activated.
Activating a single-GPU partition will return success even if the GPU is in MIG mode.
Activating a multi-GPU partition will fail if any GPU in the partition is in MIG mode on DGX A100 and NVIDIA HGX A100 systems.
The process will succeed on DGX H100 and NVIDIA HGX H100 systems.
Chapter 13. Supported High Availability Modes
FM provides several High Availability Mode (Degraded Mode) configurations that allow system administrators to set appropriate policies when there are hardware failures, such as GPU failures, NVSwitch failures, NVLink connection failures, and so on, on NVSwitch- based systems. With this feature, system administrators can keep a subset of available GPUs that can be used while waiting to replace failed GPUs, baseboards, and so on. FM 提供多种高可用性模式(降级模式)配置,允许系统管理员在基于 NVSwitch 的系统出现硬件故障(如 GPU 故障、NVSwitch 故障、NVLink 连接故障等)时设置适当的策略。利用该功能,系统管理员可以保留一部分可用的 GPU,以便在等待更换故障 GPU、底板等时使用。
DGX A100 and NVIDIA HGX A100 systems behave differently from DGX H100 and NVIDIA HGX H100 systems. Refer to Error Handling for more information.
13.1. Common Terms 13.1.常用术语
GPU access NVLink is an NVLink connection between a GPU and an NVSwitch.
GPU access NVLink failure is a failure in the connection between a GPU and an NVSwitch. Failures can be the result of a GPU/NVSwitch pin failure, a mechanical failure in the GPU baseboard, or a similar failure.
Trunk NVLinks are the links that connect two GPU baseboards. Trunk NVLinks only occur between NVSwitches and travel over the NVLink bridge PCBs and connectors.
Trunk NVLink failure is a failure on a trunk NVLink that traverses between the two GPU baseboard trays. This failure can be the result of a bad backplane connector pin or a similar issue.
NVSwitch failure is an internal failure of an NVSwitch. This failure can be the result of the NVSwitch not appearing on the PCIe bus, a DBE error, or a similar issue.
GPU failure is a failure in which the GPU itself has failed. This failure can be the result of NVLink connectivity, a PCIe failure, or a similar issue.
Note: These high availability modes and their corresponding dynamic reconfiguration of the NVSwitch based system are applied in response to errors that are detected during FM initialization. Runtime errors that occur after the system is initialized, or when a GPU job is running, will not trigger these high availability mode policies. 注意:基于 NVSwitch 的系统的这些高可用性模式及其相应的动态重新配置是针对 FM 初始化过程中检测到的错误而应用的。系统初始化后或 GPU 作业运行时发生的运行时错误不会触发这些高可用性模式策略。
The GPU access NVLink failure mode is controlled through this FM config file item: GPU 访问 NVLink 故障模式通过此 FM 配置文件项进行控制:
ACCESS_LINK_FAILURE_MODE=<value>
13.2.2. Bare Metal Behavior 13.2.2.裸机行为
ACCESS_LINK_FAILURE_MODE=0
In this mode, FM removes the GPUs with access NVLink failure from NVSwitch routing and configures the rest of the GPUs to form one memory fabric. This means the GPUs with the access NVLink failure will lose their NVLink P2P capability with other GPUs. The failed GPUs are still visible to the NVIDIA software stack, such as CUDA, NVML, NVIDIA-SMI, and so on, and can be used for non-NVLink workloads. 在这种模式下,FM 会将出现访问 NVLink 故障的 GPU 从 NVSwitch 路由中移除,并将其余 GPU 配置为形成一个内存结构。这意味着出现访问 NVLink 故障的 GPU 将失去与其他 GPU 的 NVLink P2P 功能。英伟达软件栈(如 CUDA、NVML、NVIDIA-SMI 等)仍可看到发生故障的 GPU,并可将其用于非 NVLink 工作负载。
ACCESS_LINK_FAILURE_MODE=1
In this mode, FM disables the NVSwitch to which the failed GPU access NVLink is connected, along with its peer NVSwitch if two GPU baseboards are present. This reduces the NVLink P2P bandwidth to 5/6 throughout the fabric. If a GPU has access NVLink failures to more than one NVSwitch, this option removes the GPU from the NVSwitch routing configuration and disables its NVLink P2P capability.
This process will leave the other GPUs with complete NVLink P2P bandwidth. If multiple GPU access NVLink failures point to the same NVSwitch, that NVSwitch will be disabled. 这一过程将为其他 GPU 留出完整的 NVLink P2P 带宽。如果多个 GPU 访问 NVLink 故障指向同一个 NVSwitch,该 NVSwitch 将被禁用。
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
ACCESS_LINK_FAILURE_MODE=0
In this mode, FM removes the GPUs with access NVLink failures from the currently supported GPU partition list. Figure 4 shows the effect of one GPU having an access NVLink failure in a two-baseboard system. The failed GPUs will be available for single-GPU partitions.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
ACCESS_LINK_FAILURE_MODE=1
In the Shared NVSwitch mode, all GPU partitions will be available, but the partitions will reduce the available bandwidth to 5/6 throughout the fabric. If multiple access NVLinks fail on one GPU, the GPU will be removed, and the available GPU partitions will be adjusted as mentioned earlier. The failed GPUs will be available for single GPU partitions. 在共享 NVSwitch 模式下,所有 GPU 分区都将可用,但这些分区会将整个结构的可用带宽减少到 5/6。如果一个 GPU 上的多个访问 NVLink 出现故障,该 GPU 将被移除,可用的 GPU 分区将如前所述进行调整。出现故障的 GPU 将可用于单 GPU 分区。
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Figure 5: Shared NVSwitch and vGPU Partitions When a GPU Access NVLink Fails 图 5:当 GPU 访问 NVLink 出现故障时共享 NVSwitch 和 vGPU 分区
Note: Currently, the ACCESS_LINK_FAILURE_MODE=1 configuration is not supported in the vGPU Multitenancy Mode. 注意:目前,vGPU 多租户模式不支持 ACCESS_LINK_FAILURE_MODE=1 配置。
The Trunk NVLink failure mode is controlled through this FM config file item: 主干 NVLink 故障模式通过此 FM 配置文件项进行控制:
TRUNK_LINK_FAILURE_MODE=<value>
Note: This option applies only to systems with two GPU baseboards. 注意:该选项仅适用于带有两个 GPU 底板的系统。
13.3.2. Bare Metal Behavior 13.3.2.裸机行为
TRUNK_LINK_FAILURE_MODE=0
In this mode, FM aborts and leaves the system uninitialized when there is a trunk NVLink failure, and all CUDA application launches will fail with the cudaErrorSystemNotReady status. However, when the continue-with-error config option FM_STAY_RESIDENT_ON_FAILURES=1 is enabled, the FM service continues to run, and CUDA application launches will still fail with the cudaErrorSystemNotReady status.
This mode is effective only on the DGX A 100 and NVIDIA HGX A 100 NVSwitch-based systems. 该模式仅对基于 DGX A 100 和 NVIDIA HGX A 100 NVSwitch 的系统有效。
TRUNK_LINK_FAILURE_MODE=1
In this mode, if an NVSwitch has one or more trunk NVLink failures, that NVSwitch and its peer NVSwitch are disabled. This reduces the available bandwidth to 5/6 throughout the fabric. If multiple NVSwitches have trunk NVLink failures, FM falls back to the TRUNK_LINK_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
TRUNK_LINK_FAILURE_MODE=0
In this mode, FM removes the GPU partitions that use trunk NVLinks from the currently supported GPU partition list; the 16-GPU partition and the eight-GPU partitions that span baseboards are removed. The remaining partitions run with full NVLink bandwidth. This option supports an unlimited number of trunk NVLink failures on a connected pair of NVSwitches.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
TRUNK_LINK_FAILURE_MODE=1
In the Shared NVSwitch mode, the GPU partitions remain available, but the available bandwidth throughout the fabric is reduced to 5/6. This option is supported when multiple trunk NVLink failures are present on the same NVSwitch pair. If multiple trunk NVLink failures affect different NVSwitch pairs, FM falls back to the TRUNK_LINK_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Note: Currently, the TRUNK_LINK_FAILURE_MODE=1 configuration is not supported in the vGPU Multitenancy Mode.
13.4. NVSwitch Failure
The NVSwitch failure mode is controlled through this FM config file item:
NVSWITCH_FAILURE_MODE=<value>
13.4.2. Bare Metal Behavior
NVSWITCH_FAILURE_MODE=0
In this mode, FM aborts and leaves the system uninitialized when there is an NVSwitch failure, and all CUDA application launches will fail with the cudaErrorSystemNotReady status. However, when FM_STAY_RESIDENT_ON_FAILURES=1, the continue-with-error config option is enabled, the FM service continues to run, and CUDA application launches will still fail with the cudaErrorSystemNotReady status.
NVSWITCH_FAILURE_MODE=1
In this mode, when there is an NVSwitch failure, the failed NVSwitch and its peer NVSwitch are disabled. This reduces the available bandwidth to 5/6 throughout the fabric. If multiple NVSwitch failures happen, FM falls back to the NVSWITCH_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
NVSWITCH_FAILURE_MODE=0
In this mode, FM removes the multi-GPU partitions on the baseboard with the failing NVSwitch, as well as the eight-GPU partitions that span baseboards. On a single-baseboard system, only single-GPU partitions are supported. Figure 6 shows the supported partitions when an NVSwitch has failed.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Figure 6: Shared NVSwitch and vGPU Partitions When an NVSwitch has Failed
NVSWITCH_FAILURE_MODE=1
In the Shared NVSwitch mode, all the GPU partitions remain available, but the available bandwidth throughout the fabric is reduced to 5/6. If multiple NVSwitch failures happen, FM falls back to the NVSWITCH_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Note: Currently, the NVSWITCH_FAILURE_MODE=1 configuration is not supported in the vGPU Multitenancy Mode.
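The failure-mode settings described above are plain key=value entries in the Fabric Manager config file. A minimal excerpt is shown below for orientation only; the path is the typical packaged default, and the values are illustrative rather than recommendations for any particular deployment:

```
# Illustrative excerpt of /usr/share/nvidia/nvswitch/fabricmanager.cfg
# (values shown are examples; defaults may differ by release)
ACCESS_LINK_FAILURE_MODE=0
TRUNK_LINK_FAILURE_MODE=0
NVSWITCH_FAILURE_MODE=0
# Keep the FM service resident instead of aborting when initialization fails
FM_STAY_RESIDENT_ON_FAILURES=0
```

After editing the file, restart the FM service (for example, through systemd) for the new values to take effect.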
13.5. GPU Failure
13.5.1. Bare Metal Behavior
FM will ignore GPUs that have failed to initialize, do not appear on the PCI bus, and so on. FM will set up routing and enable NVLink P2P among the available GPUs.
FM will continue initialization and adjust the currently supported partition list by excluding the failed GPU partitions. Figure 7 shows the supported partitions when a GPU is missing or has failed to initialize.
Figure 7: Shared NVSwitch and vGPU Partitions When a GPU is Missing/Has Failed
13.6. Manual Degradation
Manual degradation prevents a consistently failing GPU, NVSwitch, or baseboard from being enumerated by the NVSwitch system software stack. Depending on the failing component, the system administrator must configure appropriate action.
13.6.1. GPU Exclusion
Depending on the errors, certain GPUs might be candidates for exclusion from the system so that FM can successfully initialize and configure the remaining GPU subsets. Based on failure analysis data from previous-generation GPUs, the following error conditions are the recommended triggers for excluding a GPU:
GPU double bit ECC errors.
GPU falling off the PCIe bus.
GPU failure to enumerate on the PCIe bus.
GPU-side NVLink training error.
GPU-side unexpected XID. This category can also be application induced.
For full passthrough virtualization, the administrator must identify the GPUs that should be excluded. The hypervisor must ensure that VMs are not created on the GPUs that have been identified as candidates for exclusion.
13.6.1.1 GPU Exclusion Flow
The GPU exclusion flow can be broken down into the following phases:
Running application error handling.
Diagnosing GPU failures.
Remediating the error.
The steps for each of these phases can vary based on whether the system is running in bare metal or in virtualized mode. The following sections describe the flow for bare metal and virtualized platforms.
Errors faced by the GPU during active execution, such as GPU ECC errors, GPU falling off the bus, and so on, are reported through the following means:
/var/log/syslog as an XID message
Table 5: Error Conditions and Signatures

| Error Condition | Error Signature on Running Application |
| :--- | :--- |
| GPU double bit ECC error | XID 48 output by the GPU driver |
| GPU falling off the PCIe bus | XID 79 output by the GPU driver |
| GPU failing to enumerate on the bus | GPU does not appear to applications (CUDA applications or nvidia-smi query) |
| GPU-side NVLink training error | Error output to /var/log/syslog by FM |
| GPU-side errors | Other XIDs output by the GPU driver. This can also be application induced. |
System administrators can create their own GPU monitoring/health check scripts to look for the error traces. This process requires monitoring at least one of the above-mentioned sources (syslog, NVML APIs, and so on) to collect the necessary data.
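As a starting point, here is a minimal health-check sketch that covers only two of the signatures from Table 5: XID messages in syslog and GPUs missing from enumeration. The expected GPU count and the syslog path are assumptions to adapt to your system:

```bash
#!/bin/bash
# Minimal GPU health-check sketch (illustrative, not exhaustive).
# Assumptions: syslog lives at /var/log/syslog and 8 GPUs are expected.
EXPECTED_GPU_COUNT=8
SYSLOG=/var/log/syslog

# 1. Look for XID messages reported by the GPU driver (for example, XID 48 or 79).
grep -E "NVRM: Xid" "$SYSLOG" | tail -n 20

# 2. Check that every expected GPU still enumerates and answers queries.
visible=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | wc -l)
if [ "$visible" -ne "$EXPECTED_GPU_COUNT" ]; then
    echo "WARNING: only $visible of $EXPECTED_GPU_COUNT GPUs are visible" >&2
fi
```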
DCGM includes an exclusion recommendation script that can be invoked by a system administrator to collect the GPU error information. This script queries information from the passive monitoring performed by DCGM to determine whether any conditions that might require a GPU to be excluded have occurred since the previous time the DCGM daemon was started. As part of the execution, the script invokes a validation test that determines whether unexpected XIDs are being generated by the execution of a known good application. Users can prevent the validation test from being run and choose to only monitor the passive information.
The DCGM exclusion recommendation script code is provided as a reference for system administrators to extend as appropriate or build their own monitoring/health check scripts.
Note: Refer to the NVIDIA DCGM Documentation for more information about the exclusion recommendation script, such as its location and supported options.
The GPU kernel driver on NVSwitch-based systems can be configured to ignore a set of GPUs, even if the GPUs were enumerated on the PCIe bus. The GPUs to be excluded are identified by the GPU's unique identifier (GPU UUID) via a kernel module parameter. After identifying whether the GPU exclude candidates are in the system, the GPU kernel module driver will exclude the GPUs from being used by applications. If a GPU UUID is in the exclude candidate list, but the UUID was not detected at runtime because the UUID belongs to a GPU that is not on the system, or because the PCIe enumeration of the GPU board failed, the GPU is not considered to have been excluded.
The list of exclude candidate GPUs can be persisted across reboots by specifying the module parameters in a .conf file in the filesystem. The exclude mechanism is specific to a GPU, rather than to a physical location on the baseboard. As a result, if a GPU is on the exclude candidate list and is later replaced by a new GPU, the new GPU will become visible to the system without updating the exclude candidates. Conversely, if a GPU has been excluded on a system, placing it in a different PCIe slot will still prevent the GPU from being visible to applications, unless the exclude candidate list is updated. Updating the GPU exclude candidates requires manual intervention by the system administrator.
13.6.1.5 Kernel Module Parameters
The set of candidate GPU UUIDs to be excluded is specified by using a kernel module parameter that consists of a set of comma-separated GPU UUIDs.
The kernel parameter can be specified when the nvidia.ko kernel module is loaded:
insmod nvidia.ko NVreg_ExcludedGpus=uuid1,uuid2…
To make the GPU UUIDs persistent, the set of exclude candidate GPU UUIDs can also be specified by using an nvidia.conf file in /etc/modprobe.d:
options nvidia NVreg_ExcludedGpus=uuid1,uuid2…
Adding GPUs into the exclude candidate list is a manual step that must be completed by a system administrator.
Note: The previously supported NVreg_GpuBlacklist module parameter option has been deprecated and will be removed in a future release.
13.6.1.6 Adding/Removing a GPU from the Exclude Candidate List
To add a GPU to the exclude candidate list or to remove it from the list, the system administrator must complete the following steps; a worked example follows the list:
If a .conf file does not exist, create a .conf file for the nvidia kernel module parameters.
Complete one of the following tasks:
Add the UUID of the excluded GPU to the .conf file.
Remove the UUID of the GPU from the list.
Restart the system to load the kernel driver with the updated module parameters.
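For illustration only, a persistent exclusion entry might look like the following. The UUID is a placeholder; use the real UUID reported by nvidia-smi -q or by the health check output, and note that any .conf file name under /etc/modprobe.d that modprobe reads will work:

```
# /etc/modprobe.d/nvidia.conf -- hypothetical example entry
options nvidia NVreg_ExcludedGpus=GPU-11111111-2222-3333-4444-555555555555
```

After the reboot, the exclusion can be confirmed with nvidia-smi --list-excluded-gpus, which is described below.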
An excluded GPU is not visible to CUDA applications, in basic queries that use nvidia-smi -q, or through NVML. This section provides information about the options to identify when a GPU has been excluded, that is, when the GPU's UUID was in the exclude candidate list and the GPU was detected in the system.
13.6.1.8 nvidia-smi
The new command, nvidia-smi -B or nvidia-smi --list-excluded-gpus, can be used to get a list of excluded GPUs.
13.6.1.9 Procfs
The procfs entry, /proc/driver/nvidia/gpus/<PCI_ID>/information, can specify whether the GPU has been excluded.
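For example, where <PCI_ID> is the GPU's PCI bus ID as it appears under /proc/driver/nvidia/gpus/:

```
# Inspect the per-GPU procfs information file; the exclusion status is reported here
cat /proc/driver/nvidia/gpus/<PCI_ID>/information
```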
13.6.1.10 Out-of-Band Query
Refer to the NVIDIA GPU SMBus Post-Box Interface (SMBPBI) documentation for more information.
The following section provides information about the recommended flow that a system administrator should follow to run GPU monitoring health checks or the DCGM exclusion recommendation script on various system configurations.
13.6.1.12 Bare Metal and vGPU Configurations
In bare metal and vGPU virtualization configurations, the monitoring runs in the same OS instance as the application programs. Here is the general flow that a system administrator will follow; a sample cron entry for the periodic run follows the list:
Periodically run the health check script or the DCGM exclusion recommendation script for all the GPUs and NVSwitches on the system.
(Optional) Monitor the system logs to trigger a run of the health check script or DCGM exclusion recommendation script.
Based on the output of the health check or exclusion recommendation script, add the GPU UUID to the exclude candidate list.
Also, if you are using the DCGM exclusion recommendation script, update the periodic run of the exclusion recommendation script with the newly expected GPU count.
Reboot the system to load the kernel driver with the updated module parameters.
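For the periodic health-check run mentioned above, a cron entry is one simple option. The script path below is a placeholder for a site-specific health check or for a wrapper around the DCGM exclusion recommendation script:

```
# /etc/cron.d/gpu-health (illustrative): run the hypothetical health check hourly
0 * * * * root /usr/local/bin/gpu_health_check.sh >> /var/log/gpu_health_check.log 2>&1
```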
13.6.1.13 Full Passthrough Virtualized Configurations
The primary difference in virtualized configurations is that the GPU kernel driver is left to the guest VMs. As a result, the GPU diagnosis and remediation phases must be performed by the hypervisor through the VM provisioning mechanism.
Here is the general flow that a hypervisor will follow; a sketch of the test-VM health check follows the list:
The guest VM finishes and returns control of a set of GPUs and NVSwitches to the hypervisor.
The hypervisor invokes a special test VM, which is trusted by the hypervisor. In the test VM, there should be a complete instance of the NVIDIA NVSwitch core software stack, including GPU drivers and FM.
On this test VM, run the health check script or DCGM exclusion recommendation script.
Based on the output of the health check or exclusion recommendation script, add the GPU UUID to a hypervisor readable database.
The hypervisor shuts down the test VM.
The hypervisor reads the database to identify the candidates for exclusion and updates its resource allocation mechanisms to prevent those GPUs from being assigned to future VM requests.
After the GPU board has been replaced, to make the GPU available again, the hypervisor updates the database.
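The hypervisor readable database mentioned above can be as simple as a file on storage that the hypervisor can read. The following rough sketch covers the test-VM side of that flow; the health-check script name and the shared results path are hypothetical placeholders:

```bash
#!/bin/bash
# Runs inside the trusted test VM (illustrative sketch).
# gpu_health_check.sh is a hypothetical site-specific script that prints one
# UUID per line for GPUs recommended for exclusion; the results path is a
# placeholder for whatever store the hypervisor reads.
RESULTS=/mnt/hypervisor-share/gpu_exclude_candidates.txt

# GPUs visible to this test VM.
nvidia-smi --query-gpu=uuid --format=csv,noheader > /tmp/visible_gpus.txt

# Run the health check and record any failing GPU UUIDs for the hypervisor.
/usr/local/bin/gpu_health_check.sh /tmp/visible_gpus.txt >> "$RESULTS"
```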
In a shared NVSwitch virtualization configuration, system administrators can run their GPU health check script or DCGM exclusion recommendation script in a dedicated test VM or, on DGX A100 and NVIDIA HGX A100 systems, in the Service VM immediately after the GPU partition is activated.
To run GPU health checks in a special test VM:
The guest VM completes and returns control of the GPUs in the partition to the hypervisor.
After the shared NVSwitch guest VM shutdown procedure is complete, activate the same GPU partition again.
The hypervisor schedules a special test VM, which is trusted by the hypervisor, on those GPUs.
On this test VM, run the health check script or DCGM exclusion recommendation script.
Based on the output of the health check or exclusion recommendation script, add the GPU UUID to a hypervisor readable database.
If the partition activation/deactivation cycle is consistently failing, the hypervisor can consider adding all the GPU UUIDs of the partition to the database.
After the health check is complete, shut down the test VM.
The hypervisor reads the database to identify the candidates for exclusion and removes the corresponding GPU partitions from its currently supported partitions.
The hypervisor resource allocation mechanisms ensure that the affected GPU partitions will not be activated.
When the Service VM is rebooted, the hypervisor can choose not to bind the excluded GPUs to the Service VM.
This way, FM will adjust its currently supported GPU partitions.
When the GPU board has been replaced, the hypervisor updates the database to make the GPU available and restarts the Service VM with all the GPUs to enable the previously disabled GPU partitions again.
To run GPU health checks in the Service VM on DGX A100 and NVIDIA HGX A100 systems:
The fmActivateFabricPartition() call returned successfully in a Shared NVSwitch partition activation flow.
Before the hypervisor detaches/unbinds the GPUs in the partition, run the required health check script or DCGM exclusion recommendation script on those GPUs in the Service VM.
Based on the output of the health check or exclusion recommendation script, add the GPU UUID to a hypervisor readable database.
If the health check fails, the hypervisor executes the partition deactivation flow by using fmDeactivateFabricPartition(), and the corresponding guest VM launch is deferred.
If the partition activation/deactivation cycle is consistently failing, the hypervisor can consider adding the GPU UUIDs of the partition to the database.
The hypervisor reads the database to identify the candidates for exclusion and removes the corresponding GPU partitions from its currently supported partitions.
The hypervisor resource allocation mechanisms ensure that the affected GPU partitions will not be activated.
After the Service VM is rebooted, the hypervisor can choose not to bind the excluded GPUs to the Service VM.
This way, FM will adjust its currently supported GPU partitions.
After the GPU board has been replaced, the hypervisor updates the database to make the GPU available and restarts the Service VM with the GPUs to enable the previously disabled GPU partitions again.
13.6.2. NVSwitch Exclusion
In DGX A100 and NVIDIA HGX A100 systems, if an NVSwitch is consistently failing, the system administrator can explicitly exclude the NVSwitch.
The NVSwitch kernel driver on NVSwitch-based systems can be configured to ignore an NVSwitch, even if it was enumerated on the PCIe bus, similar to the GPU exclusion feature. If NVSwitch exclusion candidates are present in the system, the NVSwitch kernel module driver will exclude them from being used by applications. If an NVSwitch UUID is in the exclusion candidate list, but the UUID is not detected at runtime because the UUID belongs to an NVSwitch that is not on the system, or because the PCIe enumeration of the NVSwitch fails, the NVSwitch is not considered to have been excluded.
Also, in NVIDIA HGX A100 systems with two GPU baseboards, if an NVSwitch is explicitly excluded, FM will also exclude its peer NVSwitch connected over the trunk NVLinks. This behavior can be configured with the NVSWITCH_FAILURE_MODE high availability configuration file item.
13.6.2.2 Kernel Module Parameters
To specify a candidate NVSwitch UUID as a kernel module parameter, run the following command:
insmod nvidia.ko NvSwitchExcludelist=<NVSwitch_uuid>
To make the NVSwitch UUID persistent, specify the UUID using an nvidia.conf file in /etc/modprobe.d:
options nvidia NvSwitchExcludelist=<NVSwitch_uuid>
The system administrator can get the NVSwitch UUID from the FM log file and add the UUID to the exclusion candidate list.
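For example, the NVSwitch UUIDs reported by FM can usually be found by searching the FM log; the path below assumes the default LOG_FILE_NAME setting in the FM config file:

```
# Search the Fabric Manager log for NVSwitch entries (default log path assumed)
grep -i "nvswitch" /var/log/fabricmanager.log
```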
Note: The previously supported NvSwitchBlacklist module parameter option has been deprecated and will be removed in a future release.
The following sections list the link IDs used by each GPU to connect to each NVSwitch on different versions of NVIDIA HGX baseboards.
14.1. NVIDIA HGX-2 GPU Baseboard
Every NVSwitch uses the 0/1, 2/3, 8/9, and 10/11 links for the inter-GPU-baseboard connection, and these links are not listed. The other NVLink connections, two per NVSwitch, are unused.
14.2. NVIDIA HGX A100 GPU Baseboard
Every NVSwitch uses links 0 to 7 and 16 to 23 for the inter-GPU-baseboard connection, and these links are not listed. The other NVLink connections (four per NVSwitch) are unused. The GPU numbering in Table 7 is the same numbering used in the HGX A100 Baseboard Pinout design document.
Table 7: NVLink Topology of the NVIDIA HGX A100 GPU Baseboard
14.3. NVIDIA HGX H100 GPU Baseboard
The GPU numbering in Table 8 is the same information that is returned through nvidia-smi as the module ID, which is derived from the GPIO connections on the baseboard.
Table 8: NVLink Topology of the NVIDIA HGX H100 GPU Baseboard

| GPU | GPU Link | NVSwitch | NVSwitch Link |
| :--- | :--- | :--- | :--- |
| 1 | 2, 3, 12, 13 | 1 | 40, 41, 44, 45 |