DeepSeek-AI
research@deepseek.com

Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state of the art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

Contents

1 Introduction
2 Architecture
  2.1 Designs Inherited from DeepSeek-V3
  2.2 Manifold-Constrained Hyper-Connections
  2.3 Hybrid Attention with CSA and HCA
    2.3.1 Compressed Sparse Attention
    2.3.2 Heavily Compressed Attention
    2.3.3 Other Details
    2.3.4 Efficiency Discussion
  2.4 Muon Optimizer
3 General Infrastructures
  3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
  3.2 Flexible and Efficient Kernel Development with TileLang
  3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
  3.4 FP4 Quantization-Aware Training
  3.5 Training Framework
    3.5.1 Efficient Implementation of Muon
    3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
    3.5.3 Contextual Parallelism for Long-Context Attention
    3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
  3.6 Inference Framework
    3.6.1 KV Cache Structure and Management
    3.6.2 On-Disk KV Cache Storage
4 Pre-Training
  4.1 Data Construction
  4.2 Pre-Training Setups
    4.2.1 Model Setups
    4.2.2 Training Setups
    4.2.3 Mitigating Training Instability
  4.3 Evaluations
    4.3.1 Evaluation Benchmarks
    4.3.2 Evaluation Results
5 Post-Training
  5.1 Post-Training Pipeline
    5.1.1 Specialist Training
    5.1.2 On-Policy Distillation
  5.2 RL and OPD Infrastructures
    5.2.1 FP4 Quantization Integration
    5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
    5.2.3 Preemptible and Fault-Tolerant Rollout Service
    5.2.4 Scaling RL Framework for Million-Token Context
    5.2.5 Sandbox Infrastructure for Agentic AI
  5.3 Standard Benchmark Evaluation
    5.3.1 Eval