# Dynamo Distributed Runtime

## Overview
Dynamo `DistributedRuntime` is the core infrastructure in Dynamo that enables distributed communication and coordination between Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (for example, the Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure (a code sketch follows the list below):
- `DistributedRuntime`: This is the highest-level object that exposes the distributed runtime interface. It maintains connections to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from each other.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
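To make the hierarchy concrete, here is a minimal sketch using the Python bindings, modeled on the examples under `/lib/bindings/python/examples/`. The helper names (`dynamo_worker`, `create_service`, `serve_endpoint`) are assumptions taken from those examples and may differ between versions, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch of DistributedRuntime -> Namespace -> Component -> Endpoint.
# Names are modeled on /lib/bindings/python/examples/; verify against your
# installed bindings.
import asyncio

from dynamo.runtime import DistributedRuntime, dynamo_worker


async def generate(request):
    # Stream the response back chunk by chunk; each yielded item is one chunk.
    for token in ["hello", "world"]:
        yield token


@dynamo_worker()  # the decorator constructs the DistributedRuntime and injects it
async def worker(runtime: DistributedRuntime):
    # Namespace: logical grouping for this deployment.
    # Component: a discoverable unit of workers inside that namespace.
    component = runtime.namespace("dynamo").component("backend")
    await component.create_service()

    # Endpoint: a network-accessible function exposed by the component.
    endpoint = component.endpoint("generate")
    await endpoint.serve_endpoint(generate)


if __name__ == "__main__":
    asyncio.run(worker())
```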
While in theory each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (the same logic applies to `Component`s within a `Namespace` and `Endpoint`s within a `Component`), in practice each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, the components share the same namespace so they can discover each other.
For example, the deployment configuration `examples/llm/configs/disagg.yaml` has four workers:
- `Frontend`: Starts an HTTP server and registers a `chat/completions` endpoint. The HTTP server routes requests to the `Processor`.
- `Processor`: When a new request arrives, the `Processor` applies the chat template and performs tokenization. It then routes the request to the `VllmWorker`.
- `VllmWorker` and `PrefillWorker`: Perform the actual decode and prefill computation.
Since the four workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within its own `DistributedRuntime`, each worker has a `Namespace` named `dynamo`. Under that `dynamo` namespace, each has its own `Component` named `Frontend`, `Processor`, `VllmWorker`, or `PrefillWorker`. Lastly, for the `Endpoint`s: `Frontend` has no `Endpoint`s, `Processor` and `VllmWorker` each have a `generate` endpoint, and `PrefillWorker` has a placeholder `mock` endpoint. Their `DistributedRuntime`s and `Namespace`s are set in the `@service` decorators in `examples/llm/components/<frontend/processor/worker/prefill_worker>.py`. Their `Component`s are set by their names in `/deploy/dynamo/sdk/src/dynamo/sdk/cli/serve_dynamo.py`. Their `Endpoint`s are set by the `@endpoint` decorators in `examples/llm/components/<frontend/processor/worker/prefill_worker>.py` (see the sketch below).
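For illustration, such a component definition has roughly the following shape. The decorator arguments shown here, in particular the `dynamo={"namespace": ...}` field, are assumptions inferred from the description above; consult `examples/llm/components/` for the actual definitions.

```python
# Illustrative sketch of an SDK component; the decorator arguments are
# assumptions -- see examples/llm/components/ for the real definitions.
from dynamo.sdk import endpoint, service


@service(dynamo={"namespace": "dynamo"})  # assumed: sets the Namespace for this component
class VllmWorker:
    # The Component name is derived from the class name by serve_dynamo.py.

    @endpoint()  # registers a "generate" Endpoint under the VllmWorker Component
    async def generate(self, request):
        # ... run prefill/decode and stream results back ...
        yield {"text": "..."}
```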
## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime`/`Namespace`/`Component`/`Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic mode, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document; static mode is essentially dynamic mode without registration and discovery, and hence does not rely on etcd.
Caution
The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services:
  - etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
  - NATS (both static and dynamic modes): for messaging.

  etcd and NATS are global services (there can be multiple etcd and NATS instances for high availability). For etcd, the `DistributedRuntime` also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires, and the kv pairs stored with this lease_id are removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and are not registered in etcd. A `Namespace` provides the root path for all components under it.
- `Component`: When a `Component` object is created, similar to `Namespace`, it is not registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an `Endpoint` object is created and started, it performs two key registrations (a concrete example of the resulting names follows this list):
  - NATS registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
  - etcd registration: The endpoint information is stored in etcd at a path following the naming `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (e.g., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` names. They are distinguished by the different primary `lease_id`s of their `DistributedRuntime`s.
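Putting the naming together for the disaggregated example above, a `VllmWorker` whose `DistributedRuntime` holds a primary lease of, say, `0x1234abcd` (a made-up value for illustration; the exact encoding of the lease id may differ) would end up with entries along these lines:

```text
# NATS subject for the endpoint (lease id in hex):
dynamo.VllmWorker.generate-1234abcd

# etcd key holding the endpoint information:
/services/dynamo/VllmWorker/generate-1234abcd
```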
## Calling Endpoints
Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the names of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with information about the available `Endpoint`s, including their `lease_id`s and NATS subjects.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in `push_router.rs`. Dynamo supports three load balancing strategies (see the sketch after this list):

- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
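From the Python bindings, the client side looks roughly like the sketch below. The method names (`client()`, `generate()`, and the `random`/`round_robin`/`direct` variants) are assumptions modeled on the examples under `/lib/bindings/python/examples/` and may differ in your version.

```python
# Illustrative client-side sketch; method names are assumptions based on the
# Python binding examples under /lib/bindings/python/examples/.
from dynamo.runtime import DistributedRuntime, dynamo_worker


@dynamo_worker()
async def client_worker(runtime: DistributedRuntime):
    endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate")
    client = await endpoint.client()

    # Default call path: the router selects an instance for you.
    stream = await client.generate("hello world")

    # Or choose a strategy explicitly, mirroring push_router.rs:
    # stream = await client.random("hello world")
    # stream = await client.round_robin("hello world")
    # stream = await client.direct("hello world", lease_id)  # hit one specific instance

    # Responses arrive as a stream of serialized chunks over the TCP response stream.
    async for chunk in stream:
        print(chunk)
```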
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
## Examples
We provide native Rust and Python (through bindings) examples for basic usage of `DistributedRuntime`:

- Rust: `/lib/runtime/examples/`
- Python: `/lib/bindings/python/examples/`. We also provide a complete example that uses `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vLLM-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
Note
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see vLLM Issue 8878.
You can tune the number of parallel build jobs for building vLLM from source on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:

```bash
./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2
```
For example, on a GB200, which has very high CPU core and memory resources:

```bash
./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64
```
When vLLM publishes pre-built ARM wheels, this process can be improved.