
Reinforcement Learning Environments and RL for Science: Data Foundries and Multi-Agent Architectures


Worker Automation, RL as a Service, Anthropic's next big bet, GDPval and Utility Evals, Computer Use Agents, LLMs in Biology, Mid-Training, Lab Procurement Patterns, Platform Politics and Access

Last June, we argued that scaling RL is the critical path to unlocking further AI capabilities. As we will show, the past several months have affirmed our thesis: major capability gains are coming from ramping RL compute. Pre-training continues to see further optimizations, but the labs are laser focused on scaling compute for RL.

Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data

The best example of this is demonstrated by OpenAI. The company has used the same base model, GPT-4o, for all their recent flagship models: o1, o3, and the GPT-5 series. Gains in the performance of OpenAI's models for 18 months were driven by post-training and scaling up RL compute alone. OpenAI has now fixed their pretraining problems, so with that vector of scaling unlocked, progress will be even more rapid.

This is not to say that pre-training is dead: Anthropic, xAI, and especially Google all derived significant gains from scaling up pre-training. But OpenAI's progress last year, and its ability to keep up while using an older base model, was existence proof of the efficacy of post-training.

Scaling up RL is difficult, as it requires a steady stream of tasks the model needs to solve and learn from. Pre-training had the entire internet to train on, but an equivalent corpus for RL has yet to be fully created. Most RL data and tasks must be constructed from scratch, which can be quite labor intensive.

Making the models "do the homework" started with math problems, which are easy to grade. Methods have since advanced, branching out to newer domains like healthcare and financial modelling. To do this, models are placed in increasingly specialized "environments" that require the model to carry out these tasks.
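To make the notion of a gradeable RL "environment" concrete, here is a minimal sketch of a verifiable-reward task loop in the spirit of the math-problem setup described above. It is our own illustration under stated assumptions, not any lab's actual training code; the names MathTask, grade, and rollout are hypothetical.

```python
# Minimal sketch of a verifiable-reward RL "environment" for math-style tasks.
# All class and function names are illustrative, not any lab's real API.
from dataclasses import dataclass


@dataclass
class MathTask:
    prompt: str            # the problem shown to the model
    reference_answer: str  # ground-truth answer used by the grader


def grade(task: MathTask, model_answer: str) -> float:
    """Return a binary reward: 1.0 if the model's final answer matches the reference."""
    normalize = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(task.reference_answer) else 0.0


def rollout(task: MathTask, policy) -> float:
    """One RL episode: sample a completion from the policy, grade it, return the reward."""
    completion = policy(task.prompt)      # e.g., a call to the model being trained
    return grade(task, completion)


if __name__ == "__main__":
    task = MathTask(prompt="What is 17 * 24?", reference_answer="408")
    dummy_policy = lambda prompt: "408"   # stand-in for a real model
    print(rollout(task, dummy_policy))    # -> 1.0
```

More specialized environments for domains like healthcare or financial modelling follow the same loop, but swap the exact-match grader for domain-specific checkers or rubric-based judges.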
Aggregating tasks and data can be done manually, or through the curation of high-signal user data. The latter is what gives companies like Windsurf and Cursor the ability to post-train their own competitive models despite not having the resources of a lab.

These post-training efforts have improved model capability in domains like coding, but also model utility: models are more usable in everyday tools like Excel and PowerPoint.

To measure how much models are improving in utility and capability, OpenAI created an eval called GDPval. This eval covers 1,000+ tasks across 44 occupations, picked from sectors that represent more than 5% of the economy. Many of these tasks are digital but require several hours for a human to complete. The tasks were created in conjunction with experts averaging 14 years of experience.

Models are asked to solve these problems given a prompt and a set of supporting documents. Tasks include filing a tax return for a fictitious human, creating slides as a client advisor for a resort, and creating commercials from a given set of stock footage. Grading is done by experts picking between a model's answer and a human expert's answer; a win rate of 50% would mean the model's performance is at parity with a human expert. The best current model, GPT-5.2, scores around 71%, meaning its work is tied with or preferred over the human output 71% of the time.

Sample task set from GDPval. Source: OpenAI

While GDPval has some issues (e.g., it is skewed toward unusually specific digital work), it is the best example of how evaluations are shifting from measuring abstract intelligence to real-world utility. This stands in contrast to most previous model evaluations, which focused on things like mathematical knowledge or PhD-level scientific questions.
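For a concrete sense of the pairwise grading behind the GDPval number cited above, here is a small back-of-the-envelope sketch, ours rather than OpenAI's: each task comparison is recorded as a win, tie, or loss for the model, and the headline score is the share of comparisons where the model's output was preferred or tied. The function name and the sample counts are hypothetical, chosen only to land near the ~71% figure.

```python
# Illustrative sketch (not OpenAI's code) of the pairwise grading described above:
# an expert marks the model's output as a win, tie, or loss against a human
# expert's deliverable, and the headline number is the share of comparisons
# where the model's output was preferred or tied.
from collections import Counter


def win_or_tie_rate(judgments: list[str]) -> float:
    """judgments: one of "model", "tie", or "human" per graded task."""
    counts = Counter(judgments)
    favorable = counts["model"] + counts["tie"]
    return favorable / len(judgments)


if __name__ == "__main__":
    # Hypothetical counts chosen only to land near the ~71% cited in the text.
    sample = ["model"] * 55 + ["tie"] * 16 + ["human"] * 29
    print(f"{win_or_tie_rate(sample):.0%}")  # -> 71%
```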