DeepSeek-AI
research@deepseek.com

Abstract

General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) (Brown et al., 2020; OpenAI, 2023) and chain-of-thought prompting (Wei et al., 2022b), have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities remain insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.

1. Introduction

Reasoning capability, the cornerstone of human intelligence, enables complex cognitive tasks ranging from mathematical problem-solving to logical deduction and programming. Recent advances in artificial intelligence have demonstrated that large language models (LLMs) can exhibit emergent behaviors, including reasoning abilities, when scaled to a sufficient size (Kaplan et al., 2020; Wei et al., 2022a). However, achieving such capabilities during pre-training typically demands substantial computational resources. In parallel, a complementary line of research has demonstrated that large language models can be effectively augmented through chain-of-thought (CoT) prompting. This technique, which involves either providing carefully designed few-shot examples or using minimalistic prompts such as "Let's think step by step" (Kojima et al., 2022; Wei et al., 2022b), enables models to produce intermediate reasoning steps, thereby substantially enhancing their performance on complex tasks. Similarly, further performance gains have been observed when models learn high-quality, multi-step reasoning trajectories during the post-training phase (Chung et al., 2024; OpenAI, 2023). Despite their effectiveness, these approaches exhibit notable limitations. Their dependence on human-annotated reasoning traces hinders scalability and introduces cognitive biases. Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways.

To tackle these issues, we aim to explore the potential of LLMs for developing reasoning abilities through self-evolution in an RL framework, with minimal reliance on human labeling efforts. Specifically, we build upon DeepSeek-V3-Base (DeepSeek-AI, 2024b) and employ Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as our RL framework. The reward signal is based solely on the correctness of final predictions against ground-truth answers, without imposing constraints on the reasoning process itself.
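To make this setup concrete, the sketch below is a minimal illustration under assumed interfaces, not the authors' implementation: helper names such as `extract_final_answer` are hypothetical. It shows an outcome-only correctness reward and the group-relative advantage normalization that GRPO-style training applies within each group of sampled responses.

```python
# Minimal sketch (not the authors' code): an outcome-only reward and a
# GRPO-style group-relative advantage. `extract_final_answer` is a
# hypothetical placeholder for however the final answer is parsed.
from statistics import mean, pstdev


def extract_final_answer(response: str) -> str:
    """Hypothetical parser: take the text after the last 'Answer:' marker."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()


def correctness_reward(response: str, ground_truth: str) -> float:
    """Reward depends only on whether the final prediction matches the
    ground-truth answer; the reasoning process itself is not scored."""
    return 1.0 if extract_final_answer(response) == ground_truth.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled response's reward by the mean and standard
    deviation of its own group, with no separate learned value model."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


# Example: a group of 4 sampled responses to the same prompt.
rewards = [correctness_reward(r, "42") for r in [
    "... Answer: 42", "... Answer: 41", "... Answer: 42", "no marker here",
]]
print(group_relative_advantages(rewards))  # correct responses get positive advantage
```

In this sketch, responses within a group compete only against one another, which is what removes the need for a separately trained value model.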
Notably, we bypass the conventional supervised fine-tuning (SFT) phase before RL training. This design choice stems from our hypothesis that human-defined reasoning patterns may limit model exploration, whereas unrestricted RL training can better incentivize the emergence of novel reasoning capabilities in LLMs. Through this process, detailed in Section 2, our model (referred to as DeepSeek-R1-Zero) naturally developed diverse and sophisticated reasoning behaviors. When solving reasoning problems, the model tends to generate longer responses, incorporating verification, reflection, and the exploration of alternative approaches within each response. Although we do not explicitly teach the model how to reason, it successfully learns improved reasoning strategies through reinforcement learning.

Although DeepSeek-R1-Zero demonstrates excellent reasoning capabilities, it faces challenges such as poor readability and language mixing, occasionally combining English and Chinese within a single chain-of-thought response. Furthermore, the rule-based RL training stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited performance in broader areas such as writing and open-domain question answering. To address these challenges, we introduce DeepSeek-R1, a model trained through a multi-stage learning framework that integrates rejection sampling, reinforcement learning, and supervised fine-tuning, detailed in Section 3. This training pipeline enables DeepSeek-R1 to inherit the reasoning capabilities of its predecessor, DeepSeek-R1-Zero.
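As a rough sketch of the rejection-sampling stage in such a pipeline (assumed interfaces only: `generate` and `is_correct` are hypothetical stand-ins for a sampling API and a rule-based answer verifier, not part of the described method), one samples several candidate responses per prompt, keeps only those whose final answer verifies as correct, and reuses the survivors as supervised fine-tuning data.

```python
# Minimal sketch of rejection sampling for curating SFT data. The callables
# `generate` and `is_correct` are hypothetical stand-ins for a model's
# sampling API and a rule-based answer verifier.
from typing import Callable


def rejection_sample(
    prompts: list[str],
    answers: list[str],
    generate: Callable[[str, int], list[str]],  # (prompt, n) -> n sampled responses
    is_correct: Callable[[str, str], bool],     # (response, ground truth) -> verdict
    samples_per_prompt: int = 8,
) -> list[dict]:
    """Keep only (prompt, response) pairs whose final answer verifies as
    correct; the retained pairs can serve as supervised fine-tuning data."""
    kept = []
    for prompt, answer in zip(prompts, answers):
        for response in generate(prompt, samples_per_prompt):
            if is_correct(response, answer):
                kept.append({"prompt": prompt, "response": response})
    return kept
```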