DeepSeek-AI
research@deepseek.com

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results
2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation
4 Discussion
  4.1 Distillation vs. Reinforcement Learning
  4.2 Unsuccessful Attempts
5 Conclusion, Limitations, and Future Work
A Contributions and Acknowledgments

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared with pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning.
However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superb performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL.
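To make the reported metrics concrete, the following is a minimal sketch of how pass@1 over k sampled generations and a majority-vote (consensus) score are typically computed for a single question; the helper names, the 16-sample setup, and the toy answers are illustrative assumptions, not the paper's evaluation code.

from collections import Counter
from typing import List

# Illustrative helpers (assumed names, not the paper's evaluation code).
# For one question we assume k sampled answers plus a reference answer.

def pass_at_1(correct_flags: List[bool]) -> float:
    # pass@1 estimated as the mean correctness over the k samples
    return sum(correct_flags) / len(correct_flags)

def majority_vote_correct(answers: List[str], reference: str) -> bool:
    # consensus scoring: take the most frequent final answer and
    # check it against the reference
    voted, _ = Counter(answers).most_common(1)[0]
    return voted == reference

# Toy example: 16 samples for one AIME-style question (integer answers).
reference = "042"
samples = ["042", "042", "105", "042", "042", "017", "042", "042",
           "042", "105", "042", "042", "042", "042", "017", "042"]
flags = [ans == reference for ans in samples]

print(f"pass@1 over 16 samples: {pass_at_1(flags):.3f}")                      # 0.750
print(f"majority vote correct: {majority_vote_correct(samples, reference)}")  # True

Benchmark-level figures then average these per-question scores; because majority voting only requires the correct answer to be the most frequent one rather than the first one sampled, it can sit well above pass@1, consistent with the 71.0% versus 86.7% AIME 2024 numbers above.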