
GPT-5.3-Codex System Card

2026-02-05 OpenAI 土豆不吃泥

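Key Points

GPT-5.3-Codex is OpenAI's most capable agentic coding model. It combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge of GPT-5.2, and can carry out long-running tasks that involve research, tool use, and complex execution. The model is assessed as High capability in biology and chemistry and is deployed with the corresponding safeguards. In cybersecurity, GPT-5.3-Codex is the first OpenAI model to be treated as High capability, which activates the associated safeguards.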

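Key Data

Disallowed content evaluations: GPT-5.3-Codex performs well, on par with or close to GPT-5.2-Thinking.
Destructive action avoidance: GPT-5.3-Codex scores 0.885 on the destructive action avoidance evaluation.
Biological and chemical capability evaluations: GPT-5.3-Codex performs similarly to GPT-5.2-Codex.
Cybersecurity capability evaluations: GPT-5.3-Codex improves markedly, with strong results on Capture-the-flag (professional), CVE-Bench, and Cyber Range, demonstrating capability in autonomous operation, vulnerability discovery and exploitation, and operational consistency.
AI self-improvement capability evaluations: GPT-5.3-Codex performs similarly to GPT-5.2-Codex and GPT-5.2-Thinking.
Sandbagging evaluation: Apollo Research found that GPT-5.3-Codex's sabotage capability has increased, while its covert action remains at a low level.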

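Research Conclusions

GPT-5.3-Codex is OpenAI's most capable agentic coding model and shows marked gains across several evaluations, particularly in cybersecurity. To reduce potential risk, OpenAI has deployed layered safeguards, including model safety training, conversation monitoring, actor-level enforcement, and trust-based access. OpenAI is also working to support cyber defenders, offering high-risk assessment capabilities through programs such as Trusted Access for Cyber (TAC).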

OpenAI
February 5, 2026

Contents

2 Baseline Model Safety Evaluations
  2.1 Disallowed Content Evaluations
3 Product-Specific Risk Mitigations
  3.1 Agent sandbox
  3.2 Network access
4 Model-Specific Risk Mitigations
  4.1 Avoid data-destructive actions
    4.1.1 Risk description
    4.1.2 Mitigation: Safety training
5 Preparedness
  5.1 Capabilities Assessment
      5.1.1.1 Tacit Knowledge and Troubleshooting
      5.1.1.2 ProtocolQA Open-Ended
      5.1.1.3 Multimodal Troubleshooting Virology
      5.1.1.4 TroubleshootingBench
      5.1.2.1 Capture-the-flag (professional)
      5.1.2.2 CVE-Bench
      5.1.2.3 Cyber Range
      5.1.2.4 External Evaluations by Irregular
      5.1.3.1 Monorepo-Bench
      5.1.3.2 OpenAI-Proof Q&A
  5.2 Safeguards Assessment
    5.2.1 Cyber Safeguards
      5.2.1.1 Threat Model and Scenarios
      5.2.1.2 Cyber Threat Taxonomy
      5.2.1.3 Safeguards
        5.2.1.3.1 Model Safety Training
        5.2.1.3.2 Conversation monitor
        5.2.1.3.3 Expert Red Teaming
        5.2.1.3.4 Actor Level Enforcement
        5.2.1.3.5 Trust-based access
      5.2.1.4 Security Controls
      5.2.1.5 Misalignment risks and internal deployment
      5.2.1.6 Sufficiency of Risk Mitigation Measures

1 Introduction

GPT-5.3-Codex is the most capable agentic coding model to date, combining the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT-5.2. This enables it to take on long-running tasks that involve research, tool use, and complex execution. Much like a colleague, you can steer and interact with GPT-5.3-Codex while it's working, without losing context.

Like other recent models, it is being treated as High capability on biology, and is being deployed with the corresponding suite of safeguards we use for other models in the GPT-5 family. It does not reach High capability on AI self-improvement.

This is the first launch we are treating as High capability in the Cybersecurity domain under our Preparedness Framework, activating the associated safeguards. We do not have definitive evidence that this model reaches our High threshold, but are taking a precautionary approach because we cannot rule out the possibility that it may be capable enough to reach the threshold. Our safeguards for high capability in cybersecurity rely on a layered safety stack designed to impede and disrupt threat actors, while we work to make these same capabilities as easily available as possible for cyber defenders.

2 Baseline Model Safety Evaluations

2.1 Disallowed Content Evaluations

To further ecosystem insight into frontier AI, and for consistency with our general practice, we provide the following benchmark evaluations of GPT-5.3-Codex across disallowed content categories tested in a conversational setting. A subset of these evals are displayed below. We do not believe these conversational evals are reflective of real-world risk in the context of this coding-focused model.1

We report here on our Production Benchmarks, an evaluation set with conversations representative of challenging examples from production data. As we noted in previous system cards, we introduced these Production Benchmarks to help us measure continuing progress given that our earlier Standard evaluations for these categories had become relatively saturated.

These evaluations were deliberately created to be difficult. They were built around cases in which our existing models were not yet giving ideal responses, and this is reflected in the scores below. Error rates are not representative of average production traffic. The primary metric is not_unsafe, checking that the model did not produce output that is disallowed under the relevant OpenAI p
