
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

2025-04-10 ByteDance

ByteDance Seed
Full author list in Contributions

Abstract

We introduce Seed1.5-Thinking, a model capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research.

Model trial link: https://www.volcengine.com/experience/ark.
Date: April 10, 2025

1 Introduction

Driven by large-scale reinforcement learning on large language models, reasoning models have seen significant advancements. Notably, OpenAI's o1 series [1], DeepSeek's R1 [2], Google's Gemini 2.5 [3], and Anthropic's Claude 3.7 [4] have emerged as state-of-the-art models, each making substantial progress in logical reasoning, mathematical problem-solving, and code generation. These advancements underscore a shift toward more structured, efficient and scalable reasoning models, with ongoing research focusing on training efficiency, long chain-of-thought, and large-scale reinforcement learning.

In this work, we present a new reasoning model, called Seed1.5-Thinking, which achieves strong performance in both reasoning and non-reasoning tasks.

Mathematical Reasoning: For math competitions, Seed1.5-Thinking achieves 86.7 on AIME 2024, matching the performance of o3-mini-high and significantly outperforming o1 and DeepSeek R1, demonstrating competitive strength.
Since AIME 2024 no longer provides sufficient discrimination, we construct a more challenging evaluation set named BeyondAIME. All problems in BeyondAIME are newly curated by human experts and designed to minimize the chance of being solved through memorization or guessing. While Seed1.5-Thinking surpasses both o1 and R1, a performance gap remains compared to o3 and Gemini 2.5 Pro, which further demonstrates the discriminative power of the new evaluation set.

Competitive Programming: For the evaluation of competitive programming, we adopt Codeforces as our benchmark. Unlike some prior works that rely on Elo scores, which involve estimation and are not directly comparable, we adopt a concrete evaluation protocol based on the most recent 12 Codeforces contests. Specifically, we report pass@1 and pass@8 metrics, where pass@k indicates whether the model solves the problem within k attempts, i.e., selecting the best result from k generated submissions. We choose to report pass@8 because it provides more stable results and aligns more closely with actual user submission patterns. Seed1.5-Thinking outperforms DeepSeek R1 on both metrics, though a performance gap remains compared to o3. The evaluation set will be made publicly available in a future release.

Science: Seed1.5-Thinking reaches a score of 77.3 on GPQA, close to o3-level performance. Importantly, this gain is largely attributed to improved generalization from mathematical training rather than an increase in domain-specific science data.

Non-reasoning Tasks: For non-reasoning tasks, Seed1.5-Thinking is evaluated on a test set designed to replicate real-world user needs. In human evaluations conducted against DeepSeek R1 across diverse scenarios, Seed1.5-Thinking demonstrates significant advancements: it attains an 8.0% overall rise in users' positive feedback, highlighting its improved ability to handle intricate user scenarios.
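The pass@k protocol described for the Codeforces evaluation can be sketched in a few lines. This is an illustrative sketch, not the authors' evaluation code: the function name and the toy data below are hypothetical, and we assume per-problem verdicts are available as ordered booleans, one per generated submission.

```python
def pass_at_k(results, k):
    """Fraction of problems solved within the first k attempts.

    results: list of per-problem lists of booleans (one per submission,
    in generation order). A problem counts as solved if any of its
    first k submissions passes, i.e., the best of k is taken.
    """
    solved = sum(1 for subs in results if any(subs[:k]))
    return solved / len(results)

# Toy example: 3 problems, 8 submissions each (assumed data, not real scores).
results = [
    [False, True] + [False] * 6,   # solved on the 2nd attempt
    [False] * 8,                   # never solved
    [True] + [False] * 7,          # solved on the 1st attempt
]
print(pass_at_k(results, 1))  # 1/3 of problems solved within 1 attempt
print(pass_at_k(results, 8))  # 2/3 of problems solved within 8 attempts
```

Because pass@8 aggregates over more attempts per problem, it is less sensitive to sampling noise in any single generation, which matches the stability argument above.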
There are three key points in the development of high-quality reasoning models: training data, the RL algorithm, and RL infrastructure. We have devoted considerable effort to these three areas, and we discuss them in detail below.

Data: For SFT training, unlike conventional post-training data, reasoning models rely on chain-of-thought data, which explicitly outlines the step-by-step reasoning process. Our preliminary experiments showed that too much non-CoT SFT data can significantly reduce the model's ability to explore. For RL training, we incorporate four categories of data: STEM problems, code-related tasks, logic reasoning, and non-reasoning data such as creative writing and dialogue. Among these, the logic reasoning data contributes significantly to performance improvements on the ARC-AGI benchmark. The math data exhibits strong generalization capabilities and can lead to broad performance improvements across tasks.

RL Algorithm: RL training of reasoning models is highly unstable and often crashes, especially for models without SFT. Sometimes, the score difference betwee