
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Information Technology 2026-04-24 DeepSeek-AI yuAner

Key Takeaways

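The DeepSeek-V4 series comprises two strong MoE language models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, with 1.6T and 284B parameters respectively; both support a context length of one million tokens.
The series introduces several architecture and optimization upgrades, including a hybrid attention architecture, Manifold-Constrained Hyper-Connections (mHC), and the Muon optimizer, to improve long-context efficiency, modeling capability, and training stability.
The DeepSeek-V4 series is highly efficient in long-context scenarios: in the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
DeepSeek-V4-Pro-Max outperforms its predecessors on core tasks and leads among open-source models.
Through the hybrid attention architecture, precision optimizations, and related techniques, the DeepSeek-V4 series substantially reduces inference FLOPs and KV cache size, with the advantage most pronounced in long-context settings.
DeepSeek-V4-Flash-Base, despite having fewer parameters, outperforms DeepSeek-V3.2-Base on most benchmarks, reflecting a more efficient architecture design.
DeepSeek-V4-Pro-Base sets a new performance standard on reasoning, coding, long-context, and world-knowledge tasks, making it the strongest DeepSeek base model.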
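
The efficiency figures above are stated as fractions of the DeepSeek-V3.2 baseline (27% of single-token inference FLOPs, 10% of KV cache at a one-million-token context). As a rough illustration only, the short sketch below converts those fractions into "times lower" factors and percentage savings. The function names are hypothetical helpers introduced for this example, and the only inputs are the two ratios reported in the summary.

```python
# Ratios reported in the summary for DeepSeek-V4-Pro relative to
# DeepSeek-V3.2 at a one-million-token context.
FLOPS_RATIO = 0.27     # single-token inference FLOPs, V4-Pro / V3.2
KV_CACHE_RATIO = 0.10  # KV cache size, V4-Pro / V3.2


def reduction_factor(ratio: float) -> float:
    """Convert a 'fraction of baseline' into an 'N times lower' factor."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return 1.0 / ratio


def percent_saved(ratio: float) -> float:
    """Percentage of the baseline cost that is avoided."""
    return (1.0 - ratio) * 100.0


flops_factor = reduction_factor(FLOPS_RATIO)
kv_factor = reduction_factor(KV_CACHE_RATIO)

print(f"FLOPs:    {flops_factor:.1f}x lower ({percent_saved(FLOPS_RATIO):.0f}% saved)")
print(f"KV cache: {kv_factor:.1f}x smaller ({percent_saved(KV_CACHE_RATIO):.0f}% saved)")
```

Under these reported ratios, the KV cache reduction (about 10x) is considerably larger than the FLOPs reduction (about 3.7x), which suggests that memory footprint is where the long-context savings are most pronounced.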

Key Data

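DeepSeek-V4-Pro has 1.6T total parameters, of which 49B are activated.
DeepSeek-V4-Flash has 284B total parameters, of which 13B are activated.
The DeepSeek-V4 series is pre-trained on more than 32T tokens of high-quality data, including mathematics, code, web pages, and long documents.
DeepSeek-V4-Pro-Max significantly outperforms other open-source models on the SimpleQA and Chinese-SimpleQA benchmarks.
DeepSeek-V4-Pro-Max performs strongly on reasoning benchmarks, approaching the level of GPT-5.4 and Gemini-3.1-Pro.
DeepSeek-V4-Pro-Max performs strongly on agentic coding tasks, reaching the level of Claude Opus 4.5.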
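
The total and activated parameter counts listed above imply how sparse each MoE model is per token. The sketch below is a simple arithmetic illustration, not anything taken from the report itself: it computes the share of parameters activated per token from the reported figures. The dictionary layout and the `activation_ratio` helper are assumptions made for this example.

```python
# Parameter counts as reported in the summary: (total, activated per token).
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def activation_ratio(total_params: float, activated_params: float) -> float:
    """Fraction of all parameters that are activated for a single token."""
    if total_params <= 0 or activated_params < 0:
        raise ValueError("parameter counts must be positive")
    return activated_params / total_params


ratios = {
    name: activation_ratio(total, activated)
    for name, (total, activated) in MODELS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%} of parameters activated per token")
```

On these numbers, both models activate only a few percent of their parameters per token, with the larger Pro model being the sparser of the two.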

Research Conclusions

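Through architectural innovation and infrastructure optimization, the DeepSeek-V4 series achieves highly efficient long-context processing, laying a foundation for future research on test-time scaling, long-horizon tasks, and online learning.
DeepSeek-V4-Pro-Max leads among open-source models, with notable results in knowledge, reasoning, coding, and long-context capability.
DeepSeek-V4-Flash-Max matches leading closed-source models in reasoning performance while remaining highly cost-effective.
The DeepSeek-V4 series opens a new era of million-token contexts and points to new directions for the efficiency, scale, and intelligence of large language models.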

DeepSeek-AI
research@deepseek.com

Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

Contents

1 Introduction
2 Architecture
  2.1 Designs Inherited from DeepSeek-V3
  2.2 Manifold-Constrained Hyper-Connections
  2.3 Hybrid Attention with CSA and HCA
    2.3.1 Compressed Sparse Attention
    2.3.2 Heavily Compressed Attention
    2.3.3 Other Details
    2.3.4 Efficiency Discussion
  2.4 Muon Optimizer
3 General Infrastructures
  3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
  3.2 Flexible and Efficient Kernel Development with TileLang
  3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
  3.4 FP4 Quantization-Aware Training
  3.5 Training Framework
    3.5.1 Efficient Implementation of Muon
    3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
    3.5.3 Contextual Parallelism for Long-Context Attention
    3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
  3.6 Inference Framework
    3.6.1 KV Cache Structure and Management
    3.6.2 On-Disk KV Cache Storage
4 Pre-Training
  4.1 Data Construction
  4.2 Pre-Training Setups
    4.2.1 Model Setups
    4.2.2 Training Setups
    4.2.3 Mitigating Training Instability
  4.3 Evaluations
    4.3.1 Evaluation Benchmarks
    4.3.2 Evaluation Results
5 Post-Training
  5.1 Post-Training Pipeline
    5.1.1 Specialist Training
    5.1.2 On-Policy Distillation
  5.2 RL and OPD Infrastructures
    5.2.1 FP4 Quantization Integration
    5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
    5.2.3 Preemptible and Fault-Tolerant Rollout Service
    5.2.4 Scaling RL Framework for Million-Token Context
    5.2.5 Sandbox Infrastructure for Agentic AI
  5.3 Standard Benchmark Evaluation
    5.3.1 Eval

