We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and adopts a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire training run, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
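To make the auxiliary-loss-free load-balancing idea mentioned above concrete, the sketch below shows one way such a mechanism can be wired into MoE routing: a per-expert bias is added to the affinity scores only for top-k expert selection (gating weights still come from the unbiased scores), and the bias is nudged against the observed expert load. This is a minimal illustrative sketch, not the paper's reference implementation; the function names (`route_tokens`, `update_bias`), the `speed` hyperparameter, and the exact bias-update rule are assumptions introduced here for clarity.

```python
import torch

def route_tokens(affinities: torch.Tensor, bias: torch.Tensor, top_k: int):
    """Select top_k experts per token.

    affinities: (num_tokens, num_experts) token-to-expert scores.
    bias: (num_experts,) per-expert bias used ONLY for expert selection,
          so the gating weights remain based on the original affinities.
    """
    biased = affinities + bias                      # bias shifts selection only
    top_idx = biased.topk(top_k, dim=-1).indices
    gate = torch.gather(affinities, -1, top_idx).softmax(dim=-1)
    return top_idx, gate

def update_bias(bias: torch.Tensor, top_idx: torch.Tensor, num_experts: int,
                speed: float = 1e-3) -> torch.Tensor:
    """Nudge biases against the observed load: overloaded experts get a
    lower bias, underloaded experts a higher one (assumed update rule)."""
    load = torch.bincount(top_idx.flatten(), minlength=num_experts).float()
    return bias - speed * torch.sign(load - load.mean())

if __name__ == "__main__":
    num_tokens, num_experts, top_k = 16, 8, 2
    bias = torch.zeros(num_experts)
    for _ in range(100):
        affinities = torch.randn(num_tokens, num_experts)
        top_idx, gate = route_tokens(affinities, bias, top_k)
        bias = update_bias(bias, top_idx, num_experts)
```

Because balancing is driven by this bias adjustment rather than by an auxiliary loss term, the training objective is not perturbed by a balance penalty, which is the motivation the abstract attributes to the auxiliary-loss-free strategy.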
Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.