Who supports AI Agents?

What are AI Agents?
- Perception: multimodal inputs, including text, image, audio, video, touch, etc.
- Planning (Inner Monologue): Chain-of-Thought reasoning over tokens, powered by LLMs
- Reflection: meta-reasoning at every step
- Actions: function/tool calling, embodied actions

AI Agent Deployment Considerations

Overview
1. Model self-improvement with LLMs (Yu et al., NAACL 2024, Outstanding Paper)
2. Eliciting stronger model ability via tree search (Yu et al., EMNLP 2023)
3. AI agent self-improvement via tree search (Yu et al., ICLR 2025)

Background: In-Context Self-Improvement

Input: Q: Calculate (4 * 1) - (2 * 3) = ?

Self-Improvement Prompting (Madaan et al., 2023):

Attempt:
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -3
Ans: -3

Feedback: In step 2 the part "4 - 6 = -3" is incorrect. This is because …

Update:
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -2
Ans: -2

Problem 1: a small LM cannot self-improve via prompting!

Problem 2: a small LM cannot learn "self-improvement" from LLM demonstrations!

Motivation
- Prior work shows that self-improvement (S.I.) is useful for task performance/generalization (Madaan et al., 2023)
- We find that prompt-based S.I. and simple distillation methods fail with small LMs
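The attempt → feedback → update loop above can be sketched as a short driver that refines until the feedback step finds no error. This is only an illustration, not the Self-Refine implementation: a toy rule-based checker stands in for the LLM's feedback and refine prompts, and all function names (`attempt`, `feedback`, `update`, `self_improve`) are hypothetical.

```python
# Minimal sketch of the prompt-based self-improvement loop (attempt -> feedback
# -> update), iterated until the feedback step finds no error. A toy arithmetic
# checker stands in for the LLM's feedback/refine prompts.

def attempt(question):
    # Toy initial attempt reproducing the slide's mistake: 4 - 6 = -3.
    return {"steps": ["(4 * 1) - (2 * 3) = 4 - 6", "4 - 6 = -3"], "answer": -3}

def feedback(question, answer):
    # Stand-in for the feedback prompt: compare against the true value.
    truth = eval(question)  # acceptable here: we control the toy input
    return None if answer == truth else f"The answer {answer} is incorrect."

def update(question, prior, fb):
    # Stand-in for the refine prompt: redo the last step.
    truth = eval(question)
    return {"steps": prior["steps"][:-1] + [f"4 - 6 = {truth}"], "answer": truth}

def self_improve(question, max_rounds=3):
    out = attempt(question)
    for _ in range(max_rounds):
        fb = feedback(question, out["answer"])
        if fb is None:   # no error found -> stop refining
            break
        out = update(question, out, fb)
    return out

print(self_improve("(4 * 1) - (2 * 3)")["answer"])  # -> -2
```

With a capable LLM this loop converges; the point of Problems 1 and 2 above is that a small LM's own feedback step is too weak to drive it.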
Approach:
1. Treat "self-improvement" as a task to learn: (attempt) -> (feedback, update)
2. But learn "self-improvement" online:
- use LLMs/python scripts as teacher/edit models to modify the small LM's attempts
- replay this interaction experience to train the small LM

Madaan, A. et al. (2023) 'Self-Refine: Iterative Refinement with Self-Feedback'

TriPosT
1. Interactive trajectory editing
- uses LLM/python scripts as edit models
- gathers interaction records between the small LM and the LLM

Example:
Q: Calculate (4 * 1) - (2 * 3) = ?
Step 1: (4 * 1) - (2 * 3) = 4 - 6
Step 2: 4 - 6 = -3
Ans: -3
Feedback: In step 1 the part "2 * 3 = 6" is incorrect. This is because …
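Interactive trajectory editing can be sketched with a python-script edit model, which the slides name as one option alongside an LLM. The step checker, the record format, and the function names below are illustrative assumptions, not the TriPosT implementation.

```python
# Sketch of interactive trajectory editing: a scripted edit model checks each
# step of the small LM's attempt and, at the first wrong step, appends a
# (feedback, update) pair to the interaction record used later for training.

def check_step(step):
    # Verify an equation step like "4 - 6 = -3"; return (ok, corrected step).
    lhs, rhs = step.rsplit("=", 1)
    value = eval(lhs)                      # acceptable here: toy arithmetic only
    return value == eval(rhs), f"{lhs.strip()} = {value}"

def edit_trajectory(question, steps):
    record = [("question", question)] + [("attempt", s) for s in steps]
    for i, step in enumerate(steps, start=1):
        ok, fix = check_step(step)
        if not ok:
            record.append(("feedback",
                           f'In step {i} the part "{step.strip()}" is incorrect.'))
            record.append(("update", fix))
            break                          # edit only the first error
    return record

traj = edit_trajectory("(4 * 1) - (2 * 3)",
                       ["(4 * 1) - (2 * 3) = 4 - 6", "4 - 6 = -3"])
print(traj[-1])  # -> ('update', '4 - 6 = -2')
```

Replaying such records gives the small LM training examples of its own mistakes being caught and fixed, rather than demonstrations produced entirely by the LLM.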
TriPosT (continued)
2. Data post-processing
- weighted SFT, with more emphasis on the feedback and update tokens
- trains on this "on-policy" data

Model self-improvement with LLMs

Main Idea: Prior work shows that LLMs can be prompted to self-improve. Explicitly craft "self-improvement" data with LLMs to train/enhance this ability.
1. Let a weak LLM attempt self-improvement
2. Use a stronger LLM to perform "process supervision"
3. Train the LM with the improved data

Evaluation: Big Bench Hard
- tasks where the small LM struggles
- split tasks into easy (seen) and harder (unseen) subtasks to measure generalization

Can TriPosT…
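The weighted-SFT idea can be sketched as a per-token weighted negative log-likelihood. The weight values and segment labels below are illustrative assumptions; the slides only state that feedback and update tokens receive more emphasis than the rest of the trajectory.

```python
import math

# Sketch of a weighted SFT objective: standard token-level NLL, but tokens in
# the feedback/update segments of a trajectory are upweighted relative to the
# attempt tokens. The 2.0 / 1.0 weights below are illustrative, not TriPosT's.

WEIGHTS = {"attempt": 1.0, "feedback": 2.0, "update": 2.0}

def weighted_nll(token_probs, segments):
    """token_probs: model probability of each target token;
    segments: per-token label ('attempt', 'feedback', or 'update')."""
    total = norm = 0.0
    for p, seg in zip(token_probs, segments):
        w = WEIGHTS[seg]
        total += -w * math.log(p)   # upweight feedback/update tokens
        norm += w
    return total / norm             # weight-normalized average loss

probs    = [0.9, 0.8, 0.5, 0.6]    # toy per-token probabilities
segments = ["attempt", "attempt", "feedback", "update"]
loss = weighted_nll(probs, segments)
```

In this toy example the feedback/update tokens are the ones the model is least confident on, so upweighting them raises their share of the loss, pushing training toward exactly the self-improvement behavior the small LM lacks.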