
DeepSeek_V3

Information Technology | 2025-03-06 | DeepSeek | 淘金 | 曹艳平

DeepSeek-AI
research@deepseek.com

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.

Contents

2. Architecture
3. Infrastructures
4. Pre-Training
C. Expert Specialization Patterns of the 16B Aux-Loss-Based and Aux-Loss-Free Models

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training.
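As a toy illustration of the sparse-activation idea behind MoE, here is a minimal top-k router: each token is sent to only its k highest-scoring experts, so only a small fraction of the layer's parameters participates per token (37B of 671B, about 5.5%, in DeepSeek-V3's case). The function name, dimensions, and plain softmax-and-renormalize gating below are illustrative assumptions, not DeepSeekMoE's actual routing, which additionally uses fine-grained and shared experts.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Toy top-k Mixture-of-Experts layer (illustrative sketch only).

    Each token activates only k of the n experts, so the per-token
    compute touches a small fraction of the layer's total parameters.
    """
    logits = x @ router_weights                      # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax gate
    top_k = np.argsort(probs, axis=-1)[:, -k:]       # k chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, top_k[t]]
        gates = gates / gates.sum()                  # renormalize over chosen experts
        for e, g in zip(top_k[t], gates):
            out[t] += g * (x[t] @ expert_weights[e])
    return out, top_k

rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
out, chosen = moe_layer(
    x,
    expert_weights=rng.normal(size=(n_experts, d, d)),
    router_weights=rng.normal(size=(d, n_experts)),
    k=2,
)
print(out.shape, chosen.shape)   # (4, 16) (4, 2)
```

With k=2 of 8 experts, each token exercises a quarter of the expert parameters while the output keeps the full hidden dimension, which is the efficiency tradeoff the paragraph above describes.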
These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing; secondly, it sets a multi-token prediction training objective for stronger performance.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable: throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), carefully maintaining the balance between model accuracy and generation length.

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks.
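The pipeline bubbles that DualPipe reduces are easy to quantify for the baseline case. In a simple GPipe-style synchronous schedule with p stages and m microbatches, the idle ("bubble") fraction is (p-1)/(m+p-1). The sketch below computes that baseline figure only; it is not DualPipe itself, and the stage/microbatch counts are arbitrary examples.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style synchronous pipeline.

    The m microbatches need (m + p - 1) time slots to drain through
    p stages, of which (p - 1) are ramp-up/ramp-down bubbles.
    """
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)

# More microbatches amortize the bubble; schedulers like DualPipe
# instead restructure the schedule and overlap communication with
# computation to shrink the overhead further.
print(bubble_fraction(stages=16, microbatches=16))   # ~0.48
print(bubble_fraction(stages=16, microbatches=64))   # ~0.19
```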
Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to that of leading closed-source models.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, the total training costs amount to only $5.576M.

Our main contributions include:

Architecture: Innovative Load Balancing Strategy and Training Objective

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
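The cost accounting above is plain arithmetic and can be checked directly; all inputs are the report's own numbers (180K GPU hours per trillion tokens, 14.8T tokens, a 2048-GPU cluster, 119K + 5K GPU hours of extra stages, $2/GPU-hour H800 rental):

```python
# Sanity-checking the quoted training-cost figures.
GPUS, HOURS_PER_DAY, PRICE = 2048, 24, 2.0

per_trillion = 180_000                    # GPU hours per trillion tokens
pretrain = per_trillion * 14.8            # 14.8T tokens -> 2,664,000 GPU hours
total = pretrain + 119_000 + 5_000        # + context extension + post-training

print(round(per_trillion / (GPUS * HOURS_PER_DAY), 1))  # days per trillion tokens -> 3.7
print(int(total))                         # -> 2788000 GPU hours, i.e. "2.788M"
print(round(total * PRICE / 1e6, 3))      # rental cost in $M -> 5.576
```

Each printed value reproduces a figure in the text: 3.7 days per trillion tokens, 2.788M total GPU hours, and $5.576M at the assumed rental price.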