From Player to Master: Strengthening Memory with Reinforcement Learning to Improve the Test-Time Learning Efficiency of LLM Agents

Culture and Media 2026-06-07 - 哪开不壶提哪开

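Key Points

The report proposes MEMOPILOT, a dynamic experience-memory model trained end-to-end with reinforcement learning to improve the test-time learning (TTL) ability of large language models (LLMs) in repeated-interaction environments. Existing methods typically rely on hand-designed prompting rules to update an explicit memory, which makes it hard to keep memory updates aligned with downstream objectives over multi-step horizons. MEMOPILOT treats memory updating as a trainable multi-turn decision problem and optimizes the update process directly with multi-turn GRPO, improving a frozen LLM's performance across sequential interactions.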
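The test-time learning setting described above can be sketched as a loop in which a frozen player plays a sequence of games and only the memory changes between them. This is a minimal illustration, not the paper's implementation: `run_test_time_learning`, `toy_player`, and `toy_updater` are hypothetical names, the memory is simplified to a plain string, and the toy opponent model is invented for the example. The running average follows the "cumulative performance" definition given in the Figure 1 caption.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch (not taken from the paper).
Player = Callable[[str, int], float]              # (memory, game_index) -> game score
MemoryUpdater = Callable[[str, int, float], str]  # (memory, game_index, score) -> new memory


def run_test_time_learning(
    player: Player, update_memory: MemoryUpdater, num_games: int
) -> Tuple[str, List[float]]:
    """Play num_games in sequence; only the memory changes between games."""
    memory = ""
    total = 0.0
    cumulative: List[float] = []
    for game_index in range(num_games):
        score = player(memory, game_index)                 # frozen player reads the memory
        memory = update_memory(memory, game_index, score)  # memory model rewrites it
        total += score
        cumulative.append(total / (game_index + 1))        # running average of scores
    return memory, cumulative


def toy_player(memory: str, game_index: int) -> float:
    # Assumption for illustration: the player wins only once the memory holds the hint.
    return 1.0 if "opponent favors rock" in memory else 0.0


def toy_updater(memory: str, game_index: int, score: float) -> str:
    # Assumption for illustration: the pattern is "discovered" after the second game.
    if game_index >= 1 and "opponent favors rock" not in memory:
        return memory + "opponent favors rock; play paper. "
    return memory


final_memory, curve = run_test_time_learning(toy_player, toy_updater, 4)
print(curve)
```

In this toy run the cumulative curve stays at zero until the memory records the opponent's tendency and rises afterwards, which is the kind of learning dynamic the report attributes to a trained memory model.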

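Method

MEMOPILOT models memory updating as a multi-turn decision problem and optimizes it end-to-end with multi-turn GRPO. For more stable credit assignment, training introduces a turn-wise reward signal and turn-level advantage estimation. The memory is structured into three components, a reasoning layer, a knowledge base, and a final strategy prompt, which together support an iterative update process.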
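One plausible reading of turn-level advantage estimation is to apply GRPO's group-relative normalization separately at each turn, across the rollouts in a group. The sketch below follows that reading; it is an assumption, since the summary does not give the exact formula. The function name `turn_level_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all choices made for this example.

```python
from statistics import mean, pstdev
from typing import List


def turn_level_advantages(
    rewards: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Normalize rewards per turn across a group of rollouts.

    rewards[i][t] is the turn-wise reward of rollout i at turn t.
    Returns advantages with the same shape.
    """
    if not rewards:
        return []
    num_turns = len(rewards[0])
    if any(len(rollout) != num_turns for rollout in rewards):
        raise ValueError("all rollouts must have the same number of turns")

    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]  # rewards of every rollout at turn t
        mu = mean(column)
        sigma = pstdev(column)                        # population std within the group
        for i, reward in enumerate(column):
            advantages[i][t] = (reward - mu) / (sigma + eps)
    return advantages


# Two rollouts, two turns: they differ at turn 0 and tie at turn 1.
adv = turn_level_advantages([[1.0, 0.0], [-1.0, 0.0]])
print(adv)
```

Under this assumption, a turn where every rollout earns the same reward contributes zero advantage, so the policy gradient concentrates on the turns where the memory updates actually led to different outcomes.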

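Experiments

MEMOPILOT was evaluated in two test environments: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold’em (LHE). It significantly outperforms strong baselines in both games and ranks first by Elo rating (1762 in LHE, 1590 in RPS), ahead of all baseline memory methods and proprietary models, including DeepSeek-V3.2.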
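For readers unfamiliar with Elo ratings, the standard pairwise update is sketched below. The report does not say how its Elo scores were computed, so this is the textbook formula only; the K-factor of 32 and the starting rating of 1500 are illustrative assumptions.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return the new ratings after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated agents; A wins one match.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

Ratings are only comparable within one pool of opponents, so the LHE and RPS figures quoted above should be read as separate leaderboards.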

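Conclusion

By treating memory updating as a trainable decision process and optimizing it with multi-turn GRPO, MEMOPILOT significantly improves the test-time learning ability of a frozen LLM in repeated-interaction environments. The method performs strongly in both RPS and LHE and generalizes well, adapting to unseen opponents and to larger player models.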

Yishuo Cai1 Xingyu Guo2 Xuancheng Huang3 Jinhua Du4 Can Huang3 Wenxuan Huang5 Wenhan Ma

arXiv:2606.08656v1 [cs.CL] 7 Jun 2026

Abstract

Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MEMOPILOT, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM’s performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained [...]

Figure 1. Test-time learning dynamics in Limit Texas Hold’em (LHE) of our memory model (MEMOPILOT) compared to baseline memory models across sequential games, evaluated with two frozen players. Cumulative performance denotes the running average of per-game scores up to the current round. Left: the player used during memory training (Qwen2.5-14B-Instruct). Right: zero-shot generalization to a stronger frozen player (Qwen3- [...]

1. Introduction

Large language model (LLM) agents are increasingly used in settings that involve repeated interactions with related tasks, users, or environments. In such settings, a key capability is test-time learning (TTL), where an agent improves over a sequence of interactions by leveraging experience accumulated during deployment. Recent benchmarks and analyses have begun to systematically evaluate such learning capability and efficiency in LLMs and agents (Dou et al., 2025; Zheng et al., 2025b; Wang et al., 2025a), highlighting [...]

A growing line of work attempts to realize TTL via explicit memory and experience-driven adaptation. Early approaches such as Reflexion (Shinn et al., 2024) and ExpeL (Zhao et al., 2024) demonstrate that agents can itera[...] storage or naive history reuse and start to incorporate dynamic updates: Dynamic Cheatsheet (Suzgun et al., 2025) maintains an evolving memory for test-time adaptation; ReasoningBank (Ouyang et al., 2026) distills reusable reasoning strategies from an agent’s successes and failures and closes the loop via retrieval and consolidation. Together, these [...]

[...] date through an iterative “hypothesize-and-verify” cycle: observes evidence from the current experience, proposes or refines hypotheses, verifies them against accumulated [...]

We evaluate MEMOPILOT on two strategic games including multi-round Rock–Paper–Scissors (RPS) (Guertler et al., 2025) and Limit Texas Hold’em (LHE) (Zha et al., 2019) because they closely match the TTL setting and satisfy three desiderata: (i) learnability under cross-game interaction: there exists exploitable, opponent-specific behavioral structure that can be discovered from multi-game experience; (ii) controllability: opponents can be specified by explicit strategies, enabling reproducible interactions and systematic coverage/generalization tests; and (iii) challenge with measurable reward: both environments provide clear outcome rewards suitable for end-to-end optimization, yet require non-trivial adaptation. LHE introduces imperfect information and rich hand-level variation that acts as natural probes of opponent behavior. While RPS has a small action space, [...]

However, despite these advances, most existing approaches rely on hand-designed or prompt-based memory update rules, rather than end-to-end optimization of the memory update policy (Suzgun et al., 2025; Ouyang et al., 2026). In our pilot observations, even strong instruction-following LLMs fail to consistently improve across repeated interactions when memory updates are driven only by such heuristic mechanisms, motivating a training signal that directly [...]

To address this gap, we propose MEMOPILOT, a plug-in Memory Copilot that explicitly trains the memory update process to improve the performance of a frozen LLM in multi-turn interactions. Inspired by Suzgun et al. (2025), we view memory as an evolving artifact that refines across multiple interactions. We treat memory updating as a trainable multi-turn decision problem and optimize it end-to-end with multi-turn GRPO (Shao et al., 2024). Concretely, we introduce a turn-wise reward signal and a turn-level advantage estimation across rollouts, which provides finer-grained [...]

Our main contributions are: (1) We propose MEMOPILOT, a plug-in memory pilot that improves a frozen LLM player’s test-time learning behavior across repeated interactions by training the memory update process end-to-end. (2) We introduce a multi-turn GRPO training recipe for memory updating with turn-wise rewards and turn-level advantage [...] test-time learning rollouts. (3) We validate MEMOPILOT on co[...]
