DeepSeek-V3 Technical Report

DeepSeek-AI

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
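To make the sparse-activation figure above concrete (671B total parameters, but only 37B activated per token), the following is a minimal sketch of top-k expert routing in a Mixture-of-Experts layer. It is an illustration only, not DeepSeek-V3's implementation: the TopKMoE name, expert count, hidden sizes, and softmax gating are assumptions for the example; the actual DeepSeekMoE layer with auxiliary-loss-free balancing is specified in Section 2.1.2.

import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    """Toy MoE layer: each token is processed by its top-k experts only."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # The router produces one affinity score per expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)               # (tokens, experts)
        weights, idx = scores.topk(self.k, dim=-1)            # keep top-k per token
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (idx == e).any(dim=-1)  # tokens that routed to expert e
            if hit.any():
                # Gate weight each selected token assigned to this expert.
                w = weights[hit][idx[hit] == e].unsqueeze(-1)
                out[hit] = out[hit] + w * expert(x[hit])
        return out


tokens = torch.randn(10, 64)
y = TopKMoE()(tokens)  # every token used only 2 of the 8 expert MLPs

Because only k of the n_experts expert networks run for a given token, per-token compute tracks the activated parameter count rather than the total; this is the property that lets a 671B-parameter model cost roughly a 37B-parameter model's FLOPs per token. DeepSeek-V3 additionally keeps the load across experts balanced without the usual auxiliary loss, as described in Section 2.1.2.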

Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.

Contents

1 Introduction … 4
2 Architecture … 6
  2.1 Basic Architecture … 6
    2.1.1 Multi-Head Latent Attention … 7
    2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing … 8
  2.2 Multi-Token Prediction … 10
3 Infrastructures … 11
  3.1 Compute Clusters … 11
  3.2 Training Framework … 12
    3.2.1 DualPipe and Computation-Communication Overlap … 12
    3.2.2 Efficient Implementation of Cross-Node All-to-All Communication … 13
    3.2.3 Extreme Memory Saving with Minimal Overhead … 14
  3.3 FP8 Training … 14
    3.3.1 Mixed Precision Framework … 15
    3.3.2 Improved Precision from Quantization and Multiplication … 16
    3.3.3 Low-Precision Storage and Communication … 18
  3.4 Inference and Deployment … 18
    3.4.1 Prefilling … 19
    3.4.2 Decoding … 19
  3.5 Suggestions on Hardware Design … 20
    3.5.1 Communication Hardware … 20
    3.5.2 Compute Hardware … 20
4 Pre-Training … 22
  4.1 Data Construction … 22
  4.2 Hyper-Parameters … 22
  4.3 Long Context Extension … 23
  4.4 Evaluations … 24
    4.4.1 Evaluation Benchmarks … 24
    4.4.2 Evaluation Results … 25
  4.5 Discussion … 26
    4.5.1 Ablation Studies for Multi-Token Prediction … 26
    4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy … 27
    4.5.3 Batch-Wise Load Balance vs. Sequence-Wise Load Balance … 27
5 Post-Training … 28
  5.1 Supervised Fine-Tuning … 28
  5.2 Reinforcement Learning … 29
    5.2.1 Reward Model … 29
    5.2.2 Group Relative Policy Optimization … 30
  5.3 Evaluations … 30
    5.3.1 Evaluation Settings … 30
    5.3.2 Standard Evaluation … 32
    5.3.3 Open-Ended Evaluation … 33
    5.3.4 DeepSeek-V3 as a Generative Reward Model … 33
  5.4 Discussion … 34
    5.4.1 Distillation from DeepSeek-R1 … 34
    5.4.2 Self-Rewarding … 34
    5.4.3 Multi-Token Prediction Evaluation … 35
6 Conclusion, Limitations, and Future Directions … 35
A Contributions and Acknowledgments … 45
B Ablation Studies for Low-Precision Training … 47
  B.1 FP8 vs. BF16 Training … 47
  B.2 Discussion About Block-Wise Quantization … 47
C Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models … 48