
DeepSeek-V3 Technical Report

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.

Contents

1 Introduction
2 Architecture
2.1 Basic Architecture
2.1.1 Multi-Head Latent Attention
2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
2.2 Multi-Token Prediction
3 Infrastructures
3.1 Compute Clusters
3.2 Training Framework
3.2.1 DualPipe and Computation-Communication Overlap
3.2.2 Efficient Implementation of Cross-Node All-to-All Communication
3.2.3 Extremely Memory Saving with Minimal Overhead
3.3 FP8 Training
3.3.1 Mixed Precision Framework
3.3.2 Improved Precision from Quantization and Multiplication
3.3.3 Low-Precision Storage and Communication
3.4 Inference and Deployment
3.4.1 Prefilling
3.4.2 Decoding
3.5 Suggestions on Hardware Design
3.5.1 Communication Hardware
3.5.2 Compute Hardware
4 Pre-Training
4.1 Data Construction
4.2 Hyper-Parameters
4.3 Long Context Extension
4.4 Evaluations
4.4.1 Evaluation Benchmarks
4.4.2 Evaluation Results
4.5 Discussion
4.5.1 Ablation Studies for Multi-Token Prediction
4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy
4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance
5 Post-Training
5.1 Supervised Fine-Tuning
5.2 Reinforcement Learning
5.2.1 Reward Model
5.2.2 Group Relative Policy Optimization
5.3 Evaluations
5.3.1 Evaluation Settings
5.3.2 Standard Evaluation
5.3.3 Open-Ended Evaluation
5.3.4 DeepSeek-V3 as a Generative Reward Model
5.4 Discussion
5.4.1 Distillation from DeepSeek-R1
5.4.2 Self-Rewarding
5.4.3 Multi-Token Prediction Evaluation
6 Conclusion, Limitations, and Future Directions
A Contributions and Acknowledgments
B Ablation Studies for Low-Precision Training
B.1 FP8 vs. BF16 Training
B.2 Discussion About Block-Wise Quantization
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models
1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.
With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.
In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.
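To make the idea of low-precision storage concrete, here is a toy sketch of FP8 quantization with a single per-tensor scale. It is illustrative only, not the framework described in Section 3.3 (which uses fine-grained tile- and block-wise scaling), and it assumes a PyTorch build (2.1 or later) that exposes the torch.float8_e4m3fn dtype:

```python
import torch

def fp8_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 (E4M3) with a per-tensor scale.

    Simplified illustration only: production FP8 training uses
    finer-grained scaling than a single per-tensor factor.
    """
    FP8_MAX = 448.0  # largest representable magnitude in E4M3
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4096, 4096)
x_fp8, scale = fp8_quantize(x)
x_hat = fp8_dequantize(x_fp8, scale)
print("relative error:", ((x - x_hat).norm() / x.norm()).item())
print("bytes per element:", x_fp8.element_size())  # 1 byte, vs. 4 for FP32
```

The roundtrip shows the trade-off in miniature: storage and communication volume drop to a quarter of FP32, at the cost of quantization error that the scaling strategy must keep in check.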
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
| Training Costs | Pre-Training | Context Extension | Post-Training | Total |
| :--- | :---: | :---: | :---: | :---: |
| in H800 GPU Hours | 2664K | 119K | 5K | 2788K |
| in USD | \$5.328M | \$0.238M | \$0.01M | \$5.576M |

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of the H800 is \$2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is \$2 per GPU hour, our total training costs amount to only \$5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
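The arithmetic behind Table 1 can be checked directly; the short script below simply reproduces the figures quoted above:

```python
# Reproduce the training-cost arithmetic from Table 1.
gpu_hours_per_trillion_tokens = 180_000   # 180K H800 GPU hours
num_gpus = 2048
tokens_trillions = 14.8
price_per_gpu_hour = 2.0                  # USD, assumed H800 rental price

days_per_trillion = gpu_hours_per_trillion_tokens / num_gpus / 24
print(f"{days_per_trillion:.1f} days per trillion tokens")      # ~3.7

pre_training = gpu_hours_per_trillion_tokens * tokens_trillions  # 2664K
context_extension = 119_000
post_training = 5_000
total = pre_training + context_extension + post_training
print(f"total: {total / 1e3:.0f}K GPU hours")                    # 2788K
print(f"cost: ${total * price_per_gpu_hour / 1e6:.3f}M")         # $5.576M
```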
Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

  • On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
  • We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

Pre-Training: Towards Ultimate Training Efficiency

  • We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
  • Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computationcommunication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
  • At an economical cost of only 2.664 M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

Post-Training: Knowledge Distillation from DeepSeek-R1

  • We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Summary of Core Evaluation Results

  • Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
  • Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.
In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

2. Architecture

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeekV2 (DeepSeek-AI, 2024c).

2.1. Basic Architecture

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.

2.1.1. Multi-Head Latent Attention

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^d$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce the Key-Value (KV) cache during inference:
$$
\begin{aligned}
\boxed{\mathbf{c}_t^{KV}} &= W^{DKV} \mathbf{h}_t, \\
\left[\mathbf{k}_{t,1}^{C};\, \mathbf{k}_{t,2}^{C};\, \ldots;\, \mathbf{k}_{t,n_h}^{C}\right] = \mathbf{k}_t^{C} &= W^{UK} \mathbf{c}_t^{KV}, \\
\boxed{\mathbf{k}_t^{R}} &= \operatorname{RoPE}\left(W^{KR} \mathbf{h}_t\right), \\
\mathbf{k}_{t,i} &= \left[\mathbf{k}_{t,i}^{C};\, \mathbf{k}_t^{R}\right], \\
\left[\mathbf{v}_{t,1}^{C};\, \mathbf{v}_{t,2}^{C};\, \ldots;\, \mathbf{v}_{t,n_h}^{C}\right] = \mathbf{v}_t^{C} &= W^{UV} \mathbf{c}_t^{KV},
\end{aligned}
$$
where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c\ (\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot\,;\cdot]$ denotes concatenation. Note that for MLA, only the boxed vectors (i.e., $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$) need to be cached during generation, which results in a significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).
For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:
$$
\begin{aligned}
\mathbf{c}_t^{Q} &= W^{DQ} \mathbf{h}_t, \\
\left[\mathbf{q}_{t,1}^{C};\, \mathbf{q}_{t,2}^{C};\, \ldots;\, \mathbf{q}_{t,n_h}^{C}\right] = \mathbf{q}_t^{C} &= W^{UQ} \mathbf{c}_t^{Q}, \\
\left[\mathbf{q}_{t,1}^{R};\, \mathbf{q}_{t,2}^{R};\, \ldots;\, \mathbf{q}_{t,n_h}^{R}\right] = \mathbf{q}_t^{R} &= \operatorname{RoPE}\left(W^{QR} \mathbf{c}_t^{Q}\right), \\
\mathbf{q}_{t,i} &= \left[\mathbf{q}_{t,i}^{C};\, \mathbf{q}_{t,i}^{R}\right],
\end{aligned}
$$
where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'\ (\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.
Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_t$:
$$
\begin{aligned}
\mathbf{o}_{t,i} &= \sum_{j=1}^{t} \operatorname{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^{T} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^{C}, \\
\mathbf{u}_t &= W^{O}\left[\mathbf{o}_{t,1};\, \mathbf{o}_{t,2};\, \ldots;\, \mathbf{o}_{t,n_h}\right],
\end{aligned}
$$
where $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.
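To make the caching benefit tangible, the sketch below restates the key/value equations above in plain PyTorch. It is illustrative only: dimensions are toy values (the real ones appear in Section 4.2), the projection matrices are random stand-ins for trained weights, the RoPE helper is a simplified variant, and the analogous query compression is omitted. It computes the per-token cache entry of MLA and compares its size against standard MHA, which caches full per-head keys and values:

```python
import torch

# Toy dimensions for illustration only; note d_c << n_h * d_h.
d, n_h, d_h, d_h_R, d_c = 64, 4, 16, 8, 16

def rope(x: torch.Tensor, t: int) -> torch.Tensor:
    """Minimal rotary positional embedding for position t."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
    angles = t * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * angles.cos() - x2 * angles.sin(),
                      x1 * angles.sin() + x2 * angles.cos()], dim=-1)

# Random stand-ins for the trained projection matrices.
W_DKV = torch.randn(d_c, d)           # down-projection W^{DKV}
W_UK = torch.randn(n_h * d_h, d_c)    # up-projection for keys W^{UK}
W_UV = torch.randn(n_h * d_h, d_c)    # up-projection for values W^{UV}
W_KR = torch.randn(d_h_R, d)          # decoupled RoPE key projection W^{KR}

def mla_cache_entry(h_t: torch.Tensor, t: int):
    """Per-token cache: only c_t^{KV} and k_t^R are stored."""
    return W_DKV @ h_t, rope(W_KR @ h_t, t)

def expand_from_cache(c_kv: torch.Tensor, k_R: torch.Tensor):
    """Re-materialize per-head keys/values from the cached latents."""
    k_C = (W_UK @ c_kv).view(n_h, d_h)
    v_C = (W_UV @ c_kv).view(n_h, d_h)
    k = torch.cat([k_C, k_R.expand(n_h, d_h_R)], dim=-1)  # k_{t,i} = [k_{t,i}^C; k_t^R]
    return k, v_C

c_kv, k_R = mla_cache_entry(torch.randn(d), t=0)
print("MLA cache per token:", c_kv.numel() + k_R.numel())   # d_c + d_h^R = 24
print("MHA cache per token:", 2 * n_h * d_h)                 # keys + values = 128
```

Even at these toy sizes, the cache shrinks from $2 d_h n_h$ to $d_c + d_h^R$ values per token per layer, which is the source of MLA's inference-time memory savings.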

2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token; we compute the FFN output $\mathbf{h}_t'$ as follows:
$$
\begin{aligned}
\mathbf{h}_t' &= \mathbf{u}_t + \sum_{i=1}^{N_s} \operatorname{FFN}_i^{(s)}\left(\mathbf{u}_t\right) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}_i^{(r)}\left(\mathbf{u}_t\right), \\
g_{i,t} &= \frac{g_{i,t}'}{\sum_{j=1}^{N_r} g_{j,t}'}, \\
g_{i,t}' &= \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}\left(\left\{ s_{j,t} \mid 1 \leq j \leq N_r \right\}, K_r\right), \\ 0, & \text{otherwise}, \end{cases} \\
s_{i,t} &= \operatorname{Sigmoid}\left(\mathbf{u}_t^{T} \mathbf{e}_i\right),
\end{aligned}
$$
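A minimal PyTorch sketch of this gating computation follows, with toy dimensions ($N_s$ and $N_r$ are the numbers of shared and routed experts); torch.nn.Linear layers stand in for the expert FFN blocks, and the tensor e holds stand-ins for the learned centroid vectors $\mathbf{e}_i$:

```python
import torch

N_s, N_r, K_r, d = 1, 8, 2, 64   # toy counts of shared/routed experts, top-k, width

u_t = torch.randn(d)             # FFN input of one token
e = torch.randn(N_r, d)          # centroid e_i of each routed expert (stand-in)

s = torch.sigmoid(e @ u_t)                 # affinities s_{i,t} = Sigmoid(u_t^T e_i)
topk_vals, topk_idx = torch.topk(s, K_r)   # keep only the Top-K_r affinities
g_prime = torch.zeros(N_r)
g_prime[topk_idx] = topk_vals              # g'_{i,t}: s_{i,t} if selected, else 0
g = g_prime / g_prime.sum()                # g_{i,t}: normalize over selected experts

# Stand-in expert networks; the real experts are full FFN blocks.
shared_ffns = [torch.nn.Linear(d, d) for _ in range(N_s)]
routed_ffns = [torch.nn.Linear(d, d) for _ in range(N_r)]

# h'_t = u_t + sum_i FFN^(s)_i(u_t) + sum_i g_{i,t} FFN^(r)_i(u_t);
# only the K_r selected routed experts are actually evaluated.
h_t = u_t + sum(ffn(u_t) for ffn in shared_ffns) \
          + sum(g[i] * routed_ffns[i](u_t) for i in topk_idx.tolist())
print(g)   # exactly K_r nonzero gates, summing to 1
```

Because only the $K_r$ selected routed experts run per token, the sparsity of the gate, rather than the total expert count, determines the activated compute.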