Who supports AI Agents?

What are AI Agents?
- Perception: multimodal inputs, including text, image, audio, video, touch, etc.
- Planning (Inner Monologue): Chain-of-Thought reasoning over tokens, powered by LLMs
- Reflection: meta-reasoning at every step
- Actions: function/tool calling, embodied actions

AI Agent Deployment Considerations

Overview
1. Model self-improvement with LLMs (Yu et al., NAACL 2024, Outstanding Paper)
2. Eliciting stronger model ability via tree search (Yu et al., EMNLP 2023)
3. AI agent self-improvement via tree search (Yu et al., ICLR 2025)

Background: In-Context Self-Improvement

Input: Q: Calculate (4 * 1) - (2 * 3) = ?

Self-Improvement Prompting (Madaan et al., 2023):

Attempt:
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -3
Ans: -3

Feedback: In step 2 the part "4 - 6 = -3" is incorrect. This is because …

Update:
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -2
Ans: -2

Problem 1: a small LM cannot self-improve via prompting!

Problem 2: a small LM cannot learn "self-improvement" from LLM demonstrations!

Motivation
- Prior work shows that self-improvement (S.I.) is useful for task performance/generalization (Madaan et al., 2023)
- We find that prompt-based S.I. and simple distillation methods fail with small LMs
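The attempt → feedback → update loop above can be sketched as a short driver that refines until the feedback step finds no error. This is only an illustration, not the Self-Refine implementation: a toy rule-based checker stands in for the LLM's feedback and refine prompts, and all function names (`attempt`, `feedback`, `update`, `self_improve`) are hypothetical.

```python
# Minimal sketch of the prompt-based self-improvement loop (attempt -> feedback
# -> update), iterated until the feedback step finds no error. A toy arithmetic
# checker stands in for the LLM's feedback/refine prompts.

def attempt(question):
    # Toy initial attempt reproducing the slide's mistake: 4 - 6 = -3.
    return {"steps": ["(4 * 1) - (2 * 3) = 4 - 6", "4 - 6 = -3"], "answer": -3}

def feedback(question, answer):
    # Stand-in for the feedback prompt: compare against the true value.
    truth = eval(question)  # acceptable here: we control the toy input
    return None if answer == truth else f"The answer {answer} is incorrect."

def update(question, prior, fb):
    # Stand-in for the refine prompt: redo the last step.
    truth = eval(question)
    return {"steps": prior["steps"][:-1] + [f"4 - 6 = {truth}"], "answer": truth}

def self_improve(question, max_rounds=3):
    out = attempt(question)
    for _ in range(max_rounds):
        fb = feedback(question, out["answer"])
        if fb is None:   # no error found -> stop refining
            break
        out = update(question, out, fb)
    return out

print(self_improve("(4 * 1) - (2 * 3)")["answer"])  # -> -2
```

With a capable LLM this loop converges; the point of Problems 1 and 2 above is that a small LM's own feedback step is too weak to drive it.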
Approach:
1. Treat "self-improvement" as a task to learn: (attempt) -> (feedback, update)
2. But learn "self-improvement" online:
- use LLMs/python scripts as teacher/edit models to modify the small LM's attempts
- replay this interaction experience to train the small LM

Madaan, A. et al. (2023) 'Self-Refine: Iterative Refinement with Self-Feedback'

TriPosT
1. Interactive trajectory editing
- uses LLM/python scripts as edit models
- gathers interaction records between the small LM and the LLM

Example:
Q: Calculate (4 * 1) - (2 * 3) = ?
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -3
Ans: -3
Feedback: In step 1 the part "2 * 3 = 6" is incorrect. This is because …
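Interactive trajectory editing can be sketched with a python-script edit model, which the slides name as one option alongside an LLM. The step checker, the record format, and the function names below are illustrative assumptions, not the TriPosT implementation.

```python
# Sketch of interactive trajectory editing: a scripted edit model checks each
# step of the small LM's attempt and, at the first wrong step, appends a
# (feedback, update) pair to the interaction record used later for training.

def check_step(step):
    # Verify an equation step like "4 - 6 = -3"; return (ok, corrected step).
    lhs, rhs = step.rsplit("=", 1)
    value = eval(lhs)                      # acceptable here: toy arithmetic only
    return value == eval(rhs), f"{lhs.strip()} = {value}"

def edit_trajectory(question, steps):
    record = [("question", question)] + [("attempt", s) for s in steps]
    for i, step in enumerate(steps, start=1):
        ok, fix = check_step(step)
        if not ok:
            record.append(("feedback",
                           f'In step {i} the part "{step.strip()}" is incorrect.'))
            record.append(("update", fix))
            break                          # edit only the first error
    return record

traj = edit_trajectory("(4 * 1) - (2 * 3)",
                       ["(4 * 1) - (2 * 3) = 4 - 6", "4 - 6 = -3"])
print(traj[-1])  # -> ('update', '4 - 6 = -2')
```

Replaying such records gives the small LM training examples of its own mistakes being caught and fixed, rather than demonstrations produced entirely by the LLM.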
TriPosT (continued)
2. Data post-processing
- weighted SFT, with more emphasis on the feedback and update tokens
- trains on this "on-policy" data

Model self-improvement with LLMs

Main Idea: Prior work shows that LLMs can be prompted to self-improve. Explicitly craft "self-improvement" data with LLMs to train/enhance this ability.
1. Let a weak LLM attempt self-improvement
2. Use a stronger LLM to perform "process supervision"
3. Train the LM with the improved data

Evaluation: Big Bench Hard
- tasks where the small LM struggles
- split tasks into easy (seen) and harder (unseen) subtasks to measure generalization

Can TriPosT…
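The weighted-SFT idea can be sketched as a per-token weighted negative log-likelihood. The weight values and segment labels below are illustrative assumptions; the slides only state that feedback and update tokens receive more emphasis than the rest of the trajectory.

```python
import math

# Sketch of a weighted SFT objective: standard token-level NLL, but tokens in
# the feedback/update segments of a trajectory are upweighted relative to the
# attempt tokens. The 2.0 / 1.0 weights below are illustrative, not TriPosT's.

WEIGHTS = {"attempt": 1.0, "feedback": 2.0, "update": 2.0}

def weighted_nll(token_probs, segments):
    """token_probs: model probability of each target token;
    segments: per-token label ('attempt', 'feedback', or 'update')."""
    total = norm = 0.0
    for p, seg in zip(token_probs, segments):
        w = WEIGHTS[seg]
        total += -w * math.log(p)   # upweight feedback/update tokens
        norm += w
    return total / norm             # weight-normalized average loss

probs    = [0.9, 0.8, 0.5, 0.6]    # toy per-token probabilities
segments = ["attempt", "attempt", "feedback", "update"]
loss = weighted_nll(probs, segments)
```

In this toy example the feedback/update tokens are the ones the model is least confident on, so upweighting them raises their share of the loss, pushing training toward exactly the self-improvement behavior the small LM lacks.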