
DeepSeek_V3

Information Technology | 2025-03-06 | DeepSeek | 淘金 | 曹艳平

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.

Contents

2. Architecture
3. Infrastructures
4. Pre-Training
C. Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training.
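As a toy illustration of the sparse-activation idea behind MoE, here is a minimal top-k router: each token is sent to only its k highest-scoring experts, so only a small fraction of the layer's parameters participates per token (37B of 671B, about 5.5%, in DeepSeek-V3's case). The function name, dimensions, and plain softmax-and-renormalize gating below are illustrative assumptions, not DeepSeekMoE's actual routing, which additionally uses fine-grained and shared experts.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Toy top-k Mixture-of-Experts layer (illustrative sketch only).

    Each token activates only k of the n experts, so the per-token
    compute touches a small fraction of the layer's total parameters.
    """
    logits = x @ router_weights                      # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax gate
    top_k = np.argsort(probs, axis=-1)[:, -k:]       # k chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, top_k[t]]
        gates = gates / gates.sum()                  # renormalize over chosen experts
        for e, g in zip(top_k[t], gates):
            out[t] += g * (x[t] @ expert_weights[e])
    return out, top_k

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
out, chosen = moe_layer(
    x,
    expert_weights=rng.normal(size=(n_experts, d, d)),
    router_weights=rng.normal(size=(d, n_experts)),
    k=2,
)
print(out.shape, chosen.shape)   # (4, 16) (4, 2)
```

With k=2 of 8 experts, each token exercises a quarter of the expert parameters while the output keeps the full hidden dimension, which is the efficiency tradeoff the paragraph above describes.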
These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing; secondly, it sets a multi-token prediction training objective for stronger performance.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable: throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), carefully maintaining the balance between model accuracy and generation length.

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks.
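The pipeline bubbles that DualPipe reduces are easy to quantify for the baseline case. In a simple GPipe-style synchronous schedule with p stages and m microbatches, the idle ("bubble") fraction is (p-1)/(m+p-1). The sketch below computes that baseline figure only; it is not DualPipe itself, and the stage/microbatch counts are arbitrary examples.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style synchronous pipeline.

    The m microbatches need (m + p - 1) time slots to drain through
    p stages, of which (p - 1) are ramp-up/ramp-down bubbles.
    """
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# More microbatches amortize the bubble; schedulers like DualPipe
# instead restructure the schedule and overlap communication with
# computation to shrink the overhead further.
print(bubble_fraction(stages=16, microbatches=16))   # ~0.48
print(bubble_fraction(stages=16, microbatches=64))   # ~0.19
```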
Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to that of leading closed-source models.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, the total training costs amount to only $5.576M.

Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
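The cost accounting above is plain arithmetic and can be checked directly; all inputs are the report's own numbers (180K GPU hours per trillion tokens, 14.8T tokens, a 2048-GPU cluster, 119K + 5K GPU hours of extra stages, $2/GPU-hour H800 rental):

```python
# Sanity-checking the quoted training-cost figures.
GPUS, HOURS_PER_DAY, PRICE = 2048, 24, 2.0

per_trillion = 180_000                    # GPU hours per trillion tokens
pretrain = per_trillion * 14.8            # 14.8T tokens -> 2,664,000 GPU hours
total = pretrain + 119_000 + 5_000        # + context extension + post-training

print(round(per_trillion / (GPUS * HOURS_PER_DAY), 1))  # days per trillion tokens -> 3.7
print(int(total))                         # -> 2788000 GPU hours, i.e. "2.788M"
print(round(total * PRICE / 1e6, 3))      # rental cost in $M -> 5.576
```

Each printed value reproduces a figure in the text: 3.7 days per trillion tokens, 2.788M total GPU hours, and $5.576M at the assumed rental price.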