by technicalities, 8th Dec 2025

This is the editorial for this year's "Shallow Review of AI Safety". (It got long enough to stand alone.)

Epistemic status: subjective impressions plus one new graph plus 300 links.

Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

tl;dr

- Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But the famous ones with a book to promote at least agree that we're 2-20 years off (allowing for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.

- My view: compared to last year, AI is much more impressive but not proportionally more useful. Models improved on some things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on anything else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.

- Pretraining (GPT-4.5, Grok 3/4, but also the counterfactual large runs which weren't done) disappointed people this year. It's probably not because it didn't or wouldn't work; it was just too hard to serve the big models, and ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse.
  Edit: See this amazing comment for the hardware reasons behind this, and reasons to think that pretraining will struggle for years.

- True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc.) and by a few unreleased models/modes.

- Most benchmarks are weak predictors of even the rank order of models' capabilities. I distrust ECI, ADeLe, and HCAST the least (see graph below or this notebook). ECI and ADeLe show a linear improvement, while HCAST finds an exponential improvement on greenfield software engineering.

- The world's de facto strategy remains "iterative alignment": optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.

- Early claims that reasoning models are safer turned out to be a mixed bag (see below).

- We already knew from jailbreaks that current alignment methods were brittle. The great safety discovery of the year is that bad things are correlated in current models. (And on net this is good news.) "Emergent misalignment" arises from finetuning on one malign task; and in the wild from reward hacking; and it happens by strengthening specific bad personas; and there is at least one positive generalisation too (from honesty about silly errors to honesty about hidden objectives).
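The ECI/ADeLe-versus-HCAST contrast above is really a claim about curve shape. Below is a minimal sketch of how one would distinguish a linear from an exponential trend in benchmark scores: fit ordinary least squares in raw space and in log space and compare fit quality. The numbers are entirely invented for illustration and are not real ECI, ADeLe, or HCAST data.

```python
import math

# Hypothetical yearly scores on an HCAST-style task-horizon metric.
# All numbers are invented; they roughly double each year.
years = [0, 1, 2, 3, 4]
scores = [1.0, 2.1, 3.9, 8.2, 15.8]

def ols(xs, ys):
    """Ordinary least squares y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sst = sum((y - my) ** 2 for y in ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1 - sse / sst

# Linear hypothesis: fit the scores directly.
_, _, r2_linear = ols(years, scores)

# Exponential hypothesis: fit log2(scores); the slope is doublings per year.
_, slope, r2_exp = ols(years, [math.log2(s) for s in scores])
doubling_time = 1 / slope

print(f"linear fit R^2      = {r2_linear:.3f}")
print(f"exponential fit R^2 = {r2_exp:.3f}")
print(f"doubling time = {doubling_time:.2f} years")
```

On data like this, the log-space fit wins decisively and yields a doubling time near one year; on a genuinely linear trend (as the ECI and ADeLe results would suggest) the comparison flips.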
- Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure.

- Welcome to the many new people in AI Safety and Security and Assurance and so on. In the Shallow Review, out soon, I added a new, sprawling top-level category for one large trend among them, which is to treat the multi-agent lens as primary in various ways.

- Overall I wish I could tell you some number, the net expected safety change (this year's improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques). But I can't.

Capabilities in 2025

Better, but how much?

Arguments against 2025 capabilities growth being above-trend

Apparent progress is an unknown mixture of real general capability increase, hidden contamination increase, benchmaxxing (nailing a small set of static examples instead of generalisation) and usemaxxing (nailing a small set of narrow tasks with RL instead of de