
An Approach to Technical AGI Safety and Security

Google DeepMind · 2025-04-01

Rohin Shah¹, Alex Irpan*,¹, Alexander Matt Turner*,¹, Anna Wang*,¹, Arthur Conmy*,¹, David Lindner*,¹, Jonah Brown-Cohen*,¹, Lewis Ho*,¹, Neel Nanda*,¹, Raluca Ada Popa*,¹, Rishub Jain*,¹, Rory Greig*,¹, Samuel Albanie*,¹, Scott Emmons*,¹, Sebastian Farquhar*,¹, Sébastien Krier*,¹, Senthooran Rajamanoharan*,¹, Sophie Bridgers*,¹, Tobi Ijitoye*,¹, Tom Everitt*,¹, Victoria Krakovna*,¹, Vikrant Varma*,¹, Vladimir Mikulik*,², Zachary Kenton*,¹, Dave Orr¹, Shane Legg¹, Noah Goodman¹, Allan Dafoe¹, Four Flynn¹ and Anca Dragan¹

¹Google DeepMind, ²Work done while at Google DeepMind, *Core contributor, alphabetical order

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.

Extended Abstract

AI, and particularly AGI, will be a transformative technology. As with any transformative technology, AGI will provide significant benefits while posing significant risks. This includes risks of severe harm: incidents consequential enough to significantly harm humanity. This paper outlines our approach to building AGI that avoids severe harm.¹

Since AGI safety research is advancing quickly, our approach should be taken as exploratory. We expect it to evolve in tandem with the AI ecosystem to incorporate new ideas and evidence.

Severe harms necessitate a precautionary approach, subjecting them to an evidence dilemma: research and preparation of risk mitigations must occur before we have clear evidence of the capabilities underlying those risks. We believe in being proactive, taking a cautious approach by anticipating potential risks even before they start to appear likely. This allows us to develop a more exhaustive and informed strategy in the long run.

Nonetheless, we still prioritize those risks for which we can foresee how the requisite capabilities may arise, while deferring even more speculative risks to future research. Specifically, we focus on capabilities in foundation models that are enabled through learning via gradient descent, and consider Exceptional AGI (Level 4) from Morris et al. (2023), defined as an AI system whose capability matches or exceeds that of the 99th percentile of skilled adults on a wide range of non-physical tasks. This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.
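The abstract above describes a second line of defense against misalignment: system-level measures such as monitoring and access control that can limit harm even if the model itself is misaligned. The snippet below is a minimal sketch of that defense-in-depth pattern, not the paper's implementation; `untrusted_model`, `monitor_score`, and the thresholds are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's implementation) of the "second line of defense"
# from the abstract: system-level monitoring that limits harm even if the
# underlying model is misaligned. All names, models, and thresholds here are
# hypothetical stand-ins.

from dataclasses import dataclass


@dataclass
class Decision:
    allow: bool
    detail: str


def untrusted_model(prompt: str) -> str:
    """Stand-in for a capable but potentially misaligned model."""
    return f"<model response to: {prompt}>"


def monitor_score(prompt: str, response: str) -> float:
    """Stand-in for a trusted monitor estimating how likely the response is to
    contribute to severe harm (0.0 = clearly benign, 1.0 = clearly harmful)."""
    return 0.1  # placeholder score


def guarded_generate(prompt: str,
                     block_threshold: float = 0.9,
                     review_threshold: float = 0.5) -> Decision:
    """Defense in depth: generate, then gate the response on the monitor.
    High-scoring responses are blocked outright; mid-scoring ones are
    escalated for human review instead of being served directly."""
    response = untrusted_model(prompt)
    score = monitor_score(prompt, response)
    if score >= block_threshold:
        return Decision(False, "blocked by monitor")
    if score >= review_threshold:
        return Decision(False, "escalated to human review")
    return Decision(True, response)


if __name__ == "__main__":
    print(guarded_generate("Summarise this incident report."))
```

The point of the sketch is structural: the monitor and the gating logic sit outside the model, so their effect does not depend on the model itself being aligned.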
We focus on technical research areas that can provide solutions that would mitigate severe harm. However, this is only half of the picture: technical solutions should be complemented by effective governance. It is especially important to have broader consensus on appropriate standards and best practices, to prevent a potential race to the bottom on safety due to competitive pressures. We hope that this paper takes a meaningful step towards building this consensus.

Background assumptions

In developing our approach, we weighed the advantages and disadvantages of different options. For example, some safety approaches could provide more robust, general theoretical guarantees, but it is unclear that they will be ready in time. Other approaches are more ad hoc, empirical, and ready sooner, but with rough edges.

To make these tradeoffs, we rely significantly on a few background assumptions and beliefs about how AGI will be developed:

1. No human ceiling: Under the current paradigm (broadly interpreted), we do not see any fundamental blockers that limit AI systems to human-level capabilities. We thus treat even more powerful capabilities as a serious possibility to prepare for.²
   • Implication: Supervising a system with capabilities beyond those of the overseer is difficult, with the difficulty increasing as the capability gap widens, especially at machine scale and speed. So, for sufficiently powerful AI systems, our approach does not rely purely on human overseers, and instead leverages AI capabilities themselves for oversight (see the sketch below).
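The implication above motivates oversight that leans on AI assistance. The following is a minimal sketch of that idea under assumed interfaces; `model_answer`, `ai_critique`, and `human_verdict` are hypothetical stand-ins rather than an API or protocol from the paper. A human judge evaluates the strong model's answer together with an AI-generated critique, rather than unaided.

```python
# Minimal sketch, under assumed interfaces, of using AI capabilities to assist
# oversight when the overseen system may exceed the human reviewer's abilities.
# `model_answer`, `ai_critique`, and `human_verdict` are hypothetical stand-ins,
# not an API or protocol from the paper.


def model_answer(task: str) -> str:
    """Stand-in for the strong model being overseen."""
    return f"<proposed answer to: {task}>"


def ai_critique(task: str, answer: str) -> str:
    """Stand-in for an AI assistant that surfaces flaws a human might miss."""
    return f"<potential issues with the answer to: {task}>"


def human_verdict(task: str, answer: str, critique: str) -> bool:
    """Stand-in for a human judge who evaluates the answer together with the
    critique, rather than having to assess the raw answer unaided."""
    return "unsupported claim" not in critique  # placeholder decision rule


def assisted_oversight(task: str) -> bool:
    """The human's judgement is assisted by an AI-generated critique."""
    answer = model_answer(task)
    critique = ai_critique(task, answer)
    return human_verdict(task, answer, critique)


if __name__ == "__main__":
    print(assisted_oversight("Check whether this change is safe to deploy."))
```

The intent is that the AI assistance can scale with the capability of the system being overseen, so the human is not asked to evaluate raw outputs far beyond their own ability.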