
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

2025-04-10 ByteDance

ByteDance Seed
Full author list in Contributions

Abstract

We introduce Seed1.5-Thinking, a model capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research.

Model trial link: https://www.volcengine.com/experience/ark.
Date: April 10, 2025

1 Introduction

Driven by large-scale reinforcement learning on large language models, reasoning models have seen significant advancements. Notably, OpenAI's o1 series [1], DeepSeek's R1 [2], Google's Gemini 2.5 [3], and Anthropic's Claude 3.7 [4] have emerged as state-of-the-art models, each making substantial progress in logical reasoning, mathematical problem-solving, and code generation. These advancements underscore a shift toward more structured, efficient and scalable reasoning models, with ongoing research focusing on training efficiency, long chain-of-thought, and large-scale reinforcement learning.

In this work, we present a new reasoning model, called Seed1.5-Thinking, which achieves strong performance in both reasoning and non-reasoning tasks.

Mathematical Reasoning: For math competitions, Seed1.5-Thinking achieves 86.7 on AIME 2024, matching the performance of o3-mini-high and significantly outperforming o1 and DeepSeek R1, demonstrating competitive strength.
Since AIME 2024 no longer provides sufficient discrimination, we construct a more challenging evaluation set named BeyondAIME. All problems in BeyondAIME are newly curated by human experts and designed to minimize the chance of being solved through memorization or guessing. While Seed1.5-Thinking surpasses both o1 and R1, a performance gap remains compared to o3 and Gemini 2.5 Pro, which further demonstrates the discriminative power of the new evaluation set.

Competitive Programming: For the evaluation of competitive programming, we adopt Codeforces as our benchmark. Unlike some prior works that rely on Elo scores, which involve estimation and are not directly comparable, we adopt a concrete evaluation protocol based on the most recent 12 Codeforces contests. Specifically, we report pass@1 and pass@8 metrics, where pass@k indicates whether the model solves the problem within k attempts, i.e., selecting the best result from k generated submissions. We choose to report pass@8 because it provides more stable results and aligns more closely with actual user submission patterns. Seed1.5-Thinking outperforms DeepSeek R1 on both metrics, though a performance gap remains compared to o3. The evaluation set will be made publicly available in a future release.

Science: Seed1.5-Thinking reaches a score of 77.3 on GPQA, close to o3-level performance. Importantly, this gain is largely attributed to improved generalization from mathematical training rather than an increase in domain-specific science data.

Non-reasoning Tasks: For non-reasoning tasks, Seed1.5-Thinking is evaluated on a test set designed to replicate real-world user needs. In human evaluations conducted against DeepSeek R1 across diverse scenarios, Seed1.5-Thinking demonstrates significant advancements: it attains an 8.0% overall rise in users' positive feedback, highlighting its improved ability to handle intricate user scenarios.
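The pass@k protocol described for the Codeforces evaluation can be sketched in a few lines. This is an illustrative sketch, not the authors' evaluation code: the function name and the toy data below are hypothetical, and we assume per-problem verdicts are available as ordered booleans, one per generated submission.

```python
def pass_at_k(results, k):
    """Fraction of problems solved within the first k attempts.

    results: list of per-problem lists of booleans (one per submission,
    in generation order). A problem counts as solved if any of its
    first k submissions passes, i.e., the best of k is taken.
    """
    solved = sum(1 for subs in results if any(subs[:k]))
    return solved / len(results)

# Toy example: 3 problems, 8 submissions each (assumed data, not real scores).
results = [
    [False, True] + [False] * 6,   # solved on the 2nd attempt
    [False] * 8,                   # never solved
    [True] + [False] * 7,          # solved on the 1st attempt
]
print(pass_at_k(results, 1))  # 1/3 of problems solved within 1 attempt
print(pass_at_k(results, 8))  # 2/3 of problems solved within 8 attempts
```

Because pass@8 aggregates over more attempts per problem, it is less sensitive to sampling noise in any single generation, which matches the stability argument above.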
There are three key points in the development of high-quality reasoning models: training data, the RL algorithm, and RL infrastructure. We have devoted considerable effort to these three areas, and we discuss them in detail below.

Data: For SFT training, unlike conventional post-training data, reasoning models rely on chain-of-thought data, which explicitly outlines the step-by-step reasoning process. Our preliminary experiments showed that too much non-CoT SFT data can significantly reduce the model's ability to explore. For RL training, we incorporate four categories of data: STEM problems, code-related tasks, logic reasoning, and non-reasoning data such as creative writing and dialogue. Among these, the logic reasoning data contributes significantly to performance improvements on the ARC-AGI benchmark. The math data exhibits strong generalization capabilities and can lead to broad performance improvements across tasks.

RL Algorithm: RL training of reasoning models is highly unstable and often crashes, especially for models without SFT. Sometimes, the score difference betwee