Forecasting Scientific Progress with Artificial Intelligence

2026-05-21 Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu. University of Oxford & Stanford University & Allen Institute for AI & Sakana AI 顾小桶🙊

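Key Takeaways

This report introduces CUSP (Cutoff-conditioned Unseen Scientific Progress), a benchmark for evaluating how well AI systems forecast scientific progress. The study finds that although current AI models can identify plausible research directions, they cannot reliably predict whether a scientific advance will occur or when. Performance varies significantly across domains: the timing of AI progress is more predictable than the timing of advances in biology, chemistry, and physics.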

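Key Data

The CUSP benchmark contains 4,760 scientific events spanning nine top-level scientific domains and 4,245 distinct subcategories.
The study evaluates several frontier models, including GPT-5.4, GPT-4o, Claude Sonnet 4.5, LLaMA-3.3, GPT-OSS, and DeepSeek R1.
Models perform well on multiple-choice questions but near chance level on binary prediction and date prediction.
Models perform poorly on free-response questions, indicating that they struggle to generate solutions consistent with the methods behind actual scientific advances.
Models exhibit systematic overconfidence and strong response biases, indicating that their uncertainty estimates are unreliable when forecasting scientific progress.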
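
The overconfidence finding above concerns calibration: whether stated probabilities match observed frequencies. The summary does not say which calibration metrics the study uses, so the following is only a minimal sketch of two standard ones, the Brier score and expected calibration error (ECE), applied to binary feasibility forecasts. The function names, the 10-bin default, and the sample numbers are assumptions made for illustration and do not come from the report.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and observed frequency per bin."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    # Equal-width bins over [0, 1]; a probability of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - observed)
    return ece


if __name__ == "__main__":
    # Hypothetical forecasts: the model says 0.9 four times, but only one event occurs.
    forecasts = [0.9, 0.9, 0.9, 0.9]
    realized = [1, 0, 0, 0]
    print("Brier score:", brier_score(forecasts, realized))
    print("ECE:", expected_calibration_error(forecasts, realized))
```

A well-calibrated forecaster has an ECE near zero. An overconfident one, as in the hypothetical sample, shows a large gap between stated confidence and the realized frequency.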

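Research Conclusions

Access to prior knowledge does not translate into reliable scientific forecasting.
Model performance benefits mainly from post-event information rather than from genuine ex ante forecasting.
Forecasting scientific progress requires capabilities beyond knowledge retrieval, including reasoning under uncertainty about how scientific discoveries unfold over time.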
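
The distinction between post-event information and ex ante forecasting rests on splitting events around a model's knowledge cutoff. The sketch below shows one simple way to make that split and compare accuracy on each side. It is an illustration under stated assumptions, not the benchmark's actual protocol: the `Event` fields, the `pre_post_gap` helper, and the example dates are all hypothetical, and the gap computed here (pre-cutoff accuracy minus post-cutoff accuracy) is a simplified stand-in for the comparisons the study reports.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """A hypothetical scored benchmark item (field names are assumptions)."""
    event_id: str
    event_date: date   # when the scientific milestone was realized
    correct: bool      # whether the model's answer matched the ground truth


def split_by_cutoff(events: List[Event], cutoff: date) -> Tuple[List[Event], List[Event]]:
    """Partition events into those on/before the cutoff and those strictly after it."""
    pre = [e for e in events if e.event_date <= cutoff]
    post = [e for e in events if e.event_date > cutoff]
    return pre, post


def accuracy(events: List[Event]) -> Optional[float]:
    """Fraction of correct answers, or None when there are no events."""
    if not events:
        return None
    return sum(e.correct for e in events) / len(events)


def pre_post_gap(events: List[Event], cutoff: date) -> Optional[float]:
    """Pre-cutoff accuracy minus post-cutoff accuracy; None if either side is empty."""
    pre, post = split_by_cutoff(events, cutoff)
    acc_pre, acc_post = accuracy(pre), accuracy(post)
    if acc_pre is None or acc_post is None:
        return None
    return acc_pre - acc_post


if __name__ == "__main__":
    # Hypothetical scored events around an assumed cutoff of 2025-01-01.
    scored = [
        Event("e1", date(2024, 3, 1), True),
        Event("e2", date(2024, 9, 15), False),
        Event("e3", date(2025, 6, 1), False),
        Event("e4", date(2026, 1, 20), False),
    ]
    print("gap:", pre_post_gap(scored, date(2025, 1, 1)))
```

A positive gap means the model does better on events it could have seen during training than on events that occurred after its cutoff.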

Sean Wu¹,*, Pan Lu²,*, Yupeng Chen¹, Jonathan Bragg³, Yutaro Yamada⁴, Peter Clark³, David Clifton¹, Philip Torr¹,†, James Zou²,†, Junchi Yu¹,†

¹University of Oxford, ²Stanford University, ³Allen Institute for AI, ⁴Sakana AI

arXiv:2605.22681v1 [cs.AI] 21 May 2026

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting performance in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Model performance is highly heterogeneous across domains, with the timing of AI progress being more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether […]

1 Introduction

Scientific progress is often assumed to follow structured patterns [1, 2], with empirical regularities such as Moore's Law [3] in semiconductors and scaling relationships [4] in deep learning providing quantitative expectations about future developments. These patterns emerge from accumulated scientific progress [5] and have long informed research roadmaps, funding priorities, and techno- […] [8, 9, 10, 11, 12], a question arises: can AI systems forecast the trajectory of scientific progress?

Recent advances in large language models suggest that AI systems can act as general-purpose scientific assistants and support tasks ranging from hypothesis generation to experiment design [13, 14]. A growing body of work has evaluated their capabilities in scientific reasoning [15, 16], problem-solving [17, 18], and impact prediction [19] across scientific domains. While these studies demonstrate broad proficiency, they do not evaluate whether AI systems can reliably forecast scientific progress under temporal knowledge constraints. Evaluating such capabilities is […]

To address this gap, we introduce CUSP (Cutoff-conditioned Unseen Scientific Progress), an event-level, multi-disciplinary, and temporally grounded framework for evaluating scientific forecasting in AI systems. CUSP is constructed from 4,760 verifiable scientific milestones extracted from top-tier publications and community-driven repositories across multiple disciplines. Each event is associated with a precise temporal reference to enable controlled access to prior knowledge. Crucially, CUSP operationalizes scientific forecasting as a measurable capability across four com- […]

We use CUSP to evaluate frontier models under controlled temporal constraints and find a consistent pattern of limitations. While models can identify plausible technical approaches from competing candidates, they struggle to generate solutions that align with the methods underlying realized scientific advances. In feasibility assessment and temporal prediction, models perform near chance in predicting whether scientific advances will be realized and exhibit a strong bias toward delayed outcomes when estimating when such advances will occur. Moreover, models are systematically overconfident and display strong response biases in feasibility assessment, indicat- […]

To further understand these limitations, we analyze model performance across pre- and post-cutoff events under controlled information access.
Providing additional pre-cutoff knowledge improves performance on both pre-cutoff and post-cutoff events, indicating a knowledge gap in how models access and utilize available information. However, a substantial forecasting gap remains, as models perform significantly worse on post-cutoff events than in full-information settings with post-event […]

Taken together, these results indicate that while current AI systems can identify plausible scientific approaches and benefit from additional knowledge, they lack grounded and well-calibrated scientific forecasting. They fail to accurately predict whether scientific advances will be realized and when they will occur, with these errors becoming more pronounced for high-impact discoveries.

2 The CUSP Benchmark

We develop CUSP using a temporally stratified corpus of scientific milestones, spanning January 2024 to March 2026, to evaluate scientific forecasting in current AI systems under controlled temporal knowledge constraints. CUSP is designed to rigorously evaluate predictive performance and calibrated expectation on scientific development across a broad spectrum of scientific dis- […]

We source natural science milestones from Nature […]
