High Level Architecture
Dynamo is NVIDIA’s high-throughput, low-latency inference framework that’s designed to serve generative AI and reasoning models in multi-node distributed environments. It’s inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
Disaggregated prefill & decode inference: Maximizes GPU throughput and helps you balance throughput and latency
Dynamic GPU scheduling: Optimizes performance based on real-time demand
LLM-aware request routing: Eliminates unnecessary KV cache recomputation
Accelerated data transfer: Reduces inference response time using NIXL
KV cache offloading: Uses multiple memory hierarchies for higher system throughput and lower latency
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach.
Motivation behind Dynamo
Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. Here’s what we’re solving:
There are multi-faceted challenges:
Difficult UX: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core, empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
GPU underutilization: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput (DistServe).
Expensive KV cache re-computation: When requests aren’t efficiently routed, KV caches (the intermediate states of the transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency (DeepSeek).
Memory bottlenecks: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe, or remote storage) enables models to scale beyond GPU memory limits and reduces latency (Mooncake, AIBrix, LMCache).
Fluctuating demand and inefficient GPU allocation: Inference workloads are use-case specific and dynamic; demand surges are inherently unpredictable, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization (AzureTrace).
Inefficient data transfer: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management, necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamism needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
High level architecture and key benefits
The following diagram outlines Dynamo’s high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
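To make the routing idea concrete, the following is a minimal Python sketch of KV cache-aware worker selection: a request goes to the worker whose cached prefix blocks overlap it most, discounted by that worker's current load. The block size, scoring weights, and `Worker` fields are illustrative assumptions, not Dynamo's router interfaces.

```python
# Minimal sketch of KV cache-aware routing: pick the worker whose cached
# prefix overlaps the request most, penalized by current load. Block size,
# weights, and the Worker fields are illustrative assumptions.
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # tokens per KV cache block (assumed)

@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks
    active_requests: int = 0

def prefix_block_hashes(tokens: list[int]) -> list[int]:
    """Hash each prefix-aligned block so cache overlap can be counted cheaply."""
    hashes, prefix = [], ()
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        prefix = prefix + tuple(tokens[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes

def route(tokens: list[int], workers: list[Worker], load_weight: float = 0.05) -> Worker:
    """Pick the worker with the best cache-hit-rate-vs-load score."""
    blocks = prefix_block_hashes(tokens)

    def score(w: Worker) -> float:
        hits = sum(1 for b in blocks if b in w.cached_blocks)
        hit_rate = hits / max(len(blocks), 1)
        return hit_rate - load_weight * w.active_requests  # balance hits against load

    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_blocks.update(blocks)  # this prefix is now cached on that worker
    return best
```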
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
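The sketch below illustrates this kind of demand-driven adjustment as a simple control loop. The `Metrics` fields, thresholds, and the `sample_metrics`/`apply_scale` hooks are illustrative assumptions, not the Planner's actual API.

```python
# Minimal sketch of demand-driven worker scaling: watch event-plane-style
# metrics and scale prefill workers when long-input traffic rises.
# Metrics fields, thresholds, and the hooks are illustrative assumptions.
import time
from dataclasses import dataclass

@dataclass
class Metrics:
    avg_input_len: int    # average input sequence length over the window
    queued_prefill: int   # prefill requests waiting for a worker

def plan_step(m: Metrics, prefill_workers: int,
              long_isl: int = 3000, max_queue: int = 8) -> int:
    """Return the desired prefill worker count for the next interval."""
    if m.avg_input_len > long_isl and m.queued_prefill > max_queue:
        return prefill_workers + 1   # scale up for long-prompt surges
    if m.queued_prefill == 0 and prefill_workers > 1:
        return prefill_workers - 1   # release idle capacity
    return prefill_workers

def planner_loop(sample_metrics, apply_scale, interval_s: float = 10.0):
    workers = 1
    while True:
        desired = plan_step(sample_metrics(), workers)
        if desired != workers:
            apply_scale(prefill_workers=desired)  # zero-downtime adjustment
            workers = desired
        time.sleep(interval_s)
```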
Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.
Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.
Performance benefits of key features
Disaggregated serving
Disaggregating prefill and decode boosts performance, with efficiency gains growing as more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
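As a schematic illustration of the split, the sketch below separates prompt processing from token generation. The `KvCache` type and the stand-in "forward pass" are placeholders; in a real deployment the handoff would be a NIXL transfer between GPU pools, not a function argument.

```python
# Schematic sketch of disaggregated serving: a prefill worker builds the KV
# cache for the prompt and a decode worker streams tokens from it. The
# KvCache type and token arithmetic are placeholders for real engine calls.
from typing import Iterator, List

class KvCache:
    """Opaque handle to the KV cache produced for a prompt (placeholder)."""
    def __init__(self, prompt_tokens: List[int]):
        self.prompt_tokens = prompt_tokens

def prefill_worker(prompt_tokens: List[int]) -> KvCache:
    # Compute-bound: processes the whole prompt once; dominates TTFT.
    return KvCache(prompt_tokens)

def decode_worker(kv: KvCache, max_new_tokens: int) -> Iterator[int]:
    # Latency-bound: generates one token at a time; dominates ITL.
    last = kv.prompt_tokens[-1] if kv.prompt_tokens else 0
    for _ in range(max_new_tokens):
        last = (last * 31 + 7) % 50_000  # stand-in for a real forward pass
        yield last

def serve(prompt_tokens: List[int], max_new_tokens: int = 8) -> List[int]:
    kv = prefill_worker(prompt_tokens)               # runs on prefill GPUs
    # In Dynamo the KV cache is moved to the decode pool via NIXL; here the
    # handoff is just a Python object.
    return list(decode_worker(kv, max_new_tokens))   # runs on decode GPUs

print(serve([101, 2023, 2003, 1037, 3231]))
```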
KV aware routing
Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
Existing routing methods, including load-based routing, overlook properties specific to LLMs that could improve performance. Routing user queries to the worker with the highest KV cache hit rate (rather than simply the least busy node) addresses this, allowing for immediate processing even under heavy load. The preceding figures illustrate the effectiveness of KV-aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
KV cache manager
Dynamo’s design enables KV cache offloading to system CPU memory. In accelerated servers, the CPU (system) memory is often larger than the GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled in the inference engine. In a scenario involving 10 multi-turn conversations with 80 users, system memory offloading resulted in a 40% improvement in TTFT, demonstrating benefits beyond basic prefix caching.
Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
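A minimal sketch of the offloading idea follows, assuming a simple two-tier (GPU, CPU) layout with LRU eviction. Block granularity and capacities are illustrative, not Dynamo's KV cache manager implementation.

```python
# Minimal sketch of KV cache offloading: keep hot cache blocks in (simulated)
# GPU memory and evict the least recently used blocks to CPU memory instead
# of dropping them, so a later hit avoids recomputation. The two-tier layout
# and capacities are illustrative assumptions.
from collections import OrderedDict
from typing import Optional

class TieredKvCache:
    def __init__(self, gpu_capacity_blocks: int):
        self.gpu = OrderedDict()   # block_id -> kv bytes, kept in LRU order
        self.cpu = {}              # offload tier (larger, slower)
        self.capacity = gpu_capacity_blocks

    def put(self, block_id: str, kv_bytes: bytes) -> None:
        self.gpu[block_id] = kv_bytes
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.capacity:
            victim, data = self.gpu.popitem(last=False)  # LRU block
            self.cpu[victim] = data                      # offload, don't drop

    def get(self, block_id: str) -> Optional[bytes]:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:                         # onboard back to GPU
            self.put(block_id, self.cpu.pop(block_id))
            return self.gpu[block_id]
        return None                                      # true miss: recompute
```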
NVIDIA Inference Transfer Library (NIXL)
NIXL streamlines data transfer through simplified synchronization, batching, and source/destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support only a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput.
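The sketch below is a hypothetical interface that illustrates this kind of unified abstraction: tier-agnostic memory descriptors plus per-pair transport selection. It is not NIXL's actual API, and the transport names are placeholders.

```python
# Hypothetical illustration of a unified transfer abstraction over
# heterogeneous memory tiers, with the transport chosen per source/
# destination pair. Not NIXL's actual API; names are placeholders.
from dataclasses import dataclass
from enum import Enum, auto

class Tier(Enum):
    GPU_HBM = auto()
    CPU_DRAM = auto()
    NVME = auto()
    REMOTE = auto()

@dataclass(frozen=True)
class MemDesc:
    tier: Tier
    addr: int
    length: int

def pick_transport(src: MemDesc, dst: MemDesc) -> str:
    """Choose a transport for the tier pair (illustrative policy only)."""
    if Tier.REMOTE in (src.tier, dst.tier):
        return "rdma"        # e.g. GPU <-> remote memory over the fabric
    if src.tier == dst.tier == Tier.GPU_HBM:
        return "nvlink"
    if Tier.NVME in (src.tier, dst.tier):
        return "gds"         # direct storage path
    return "pcie_copy"

def transfer(src: MemDesc, dst: MemDesc) -> str:
    transport = pick_transport(src, dst)
    # A real engine would post an asynchronous, batched transfer here;
    # the sketch just reports the selected path.
    return f"{src.tier.name} -> {dst.tier.name} via {transport}"

print(transfer(MemDesc(Tier.GPU_HBM, 0x1000, 1 << 20),
               MemDesc(Tier.CPU_DRAM, 0x2000, 1 << 20)))
```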
Acknowledgements
We’d like to acknowledge several open source software stacks that motivated our creation of Dynamo.
vLLM and vLLM-project
SGLang
DistServe
Mooncake
AIBrix
BentoML