
From Player to Master: Strengthening Memory with Reinforcement Learning to Improve the Test-Time Learning Efficiency of LLM Agents


Yishuo Cai¹ Xingyu Guo² Xuancheng Huang³ Jinhua Du⁴ Can Huang³ Wenxuan Huang⁵ Wenhan Ma

Abstract

Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MEMOPILOT, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM’s performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained [...]

Figure 1. Test-time learning dynamics in Limit Texas Hold’em (LHE) of our memory model (MEMOPILOT) compared to baseline memory models across sequential games, evaluated with two frozen players. Cumulative performance denotes the running average of per-game scores up to the current round. Left: the player used during memory training (Qwen2.5-14B-Instruct). Right: zero-shot generalization to a stronger frozen player (Qwen3-[...]

1. Introduction

Large language model (LLM) agents are increasingly used in settings that involve repeated interactions with related tasks, users, or environments. In such settings, a key capability is test-time learning (TTL), where an agent improves over a sequence of interactions by leveraging experience accumulated during deployment.
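The turn-level advantage estimation named in the abstract can be pictured with a small sketch. This is only one plausible reading, not the authors' implementation: it assumes that each of several rollouts yields one scalar reward per turn, and that the advantage at turn t is that reward standardized against the rewards of all rollouts at the same turn index, regardless of how their memory contexts differ. The function name `turn_level_advantages`, the `eps` constant, and the toy reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def turn_level_advantages(rewards: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Group-relative advantages computed separately for every turn index.

    rewards[g][t] is the turn-wise reward of rollout g at turn t (assumed layout).
    Returns advantages[g][t] = (rewards[g][t] - mean_t) / (std_t + eps), where
    mean_t and std_t are taken over all rollouts at the same turn t.
    """
    num_turns = len(rewards[0])
    if any(len(rollout) != num_turns for rollout in rewards):
        raise ValueError("all rollouts must have the same number of turns")

    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]  # rewards of every rollout at turn t
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation over the group
        for g, reward in enumerate(column):
            advantages[g][t] = (reward - mu) / (sigma + eps)
    return advantages


# Toy example: 3 rollouts, 3 turns each (made-up per-game scores).
rollout_rewards = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
adv = turn_level_advantages(rollout_rewards)
```

Compared with a single trajectory-level advantage, this per-turn normalization gives each memory update its own learning signal. A turn where all rollouts earn the same reward contributes zero advantage.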
arXiv:2606.08656v1 [cs.CL] 7 Jun 2026

Recent benchmarks and analyses have begun to systematically evaluate such learning capability and efficiency in LLMs and agents (Dou et al., 2025; Zheng et al., 2025b; Wang et al., 2025a), highlighting [...]

A growing line of work attempts to realize TTL via explicit memory and experience-driven adaptation. Early approaches such as Reflexion (Shinn et al., 2024) and ExpeL (Zhao et al., 2024) demonstrate that agents can itera[...] storage or naive history reuse and start to incorporate dynamic updates: Dynamic Cheatsheet (Suzgun et al., 2025) maintains an evolving memory for test-time adaptation; ReasoningBank (Ouyang et al., 2026) distills reusable reasoning strategies from an agent’s successes and failures and closes the loop via retrieval and consolidation. Together, these [...] date through an iterative “hypothesize-and-verify” cycle: observes evidence from the current experience, proposes or refines hypotheses, verifies them against accumulated [...]

We evaluate MEMOPILOT on two strategic games, multi-round Rock–Paper–Scissors (RPS) (Guertler et al., 2025) and Limit Texas Hold’em (LHE) (Zha et al., 2019), because they closely match the TTL setting and satisfy three desiderata: (i) learnability under cross-game interaction: there exists exploitable, opponent-specific behavioral structure that can be discovered from multi-game experience; (ii) controllability: opponents can be specified by explicit strategies, enabling reproducible interactions and systematic coverage/generalization tests; and (iii) challenge with measurable reward: both environments provide clear outcome rewards suitable for end-to-end optimization, yet require non-trivial adaptation. LHE introduces imperfect information and rich hand-level variation that acts as natural probes of opponent behavior.
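To make desiderata (i) and (ii) concrete, the following toy sketch plays repeated RPS games against an opponent defined by an explicit, seeded strategy and keeps a simple memory of that opponent's past moves. Everything here is a hypothetical stand-in: the 60% rock bias, the count-based memory, and the rule-based player are illustrative assumptions, not the paper's opponents, MEMOPILOT's memory, or an LLM player. The returned curve is the running average of per-game scores.

```python
import random
from collections import Counter
from typing import List

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {loser: winner for winner, loser in BEATS.items()}  # move -> move that beats it


def biased_opponent(rng: random.Random) -> str:
    """Hypothetical explicit strategy: rock 60% of the time, otherwise paper or scissors."""
    return rng.choices(MOVES, weights=[0.6, 0.2, 0.2])[0]


def score(player: str, opponent: str) -> int:
    """Per-game score from the player's side: +1 win, 0 draw, -1 loss."""
    if player == opponent:
        return 0
    return 1 if BEATS[player] == opponent else -1


def choose_move(memory: Counter, rng: random.Random) -> str:
    """Stand-in for a frozen player: counter the most frequent opponent move in memory."""
    if not memory:
        return rng.choice(MOVES)  # no experience yet, so play uniformly at random
    most_frequent = memory.most_common(1)[0][0]
    return COUNTER[most_frequent]


def run_games(num_games: int, seed: int = 0) -> List[float]:
    """Play sequential games and return the running average of per-game scores."""
    rng = random.Random(seed)  # a fixed seed keeps the interaction reproducible
    memory: Counter = Counter()
    total = 0
    cumulative = []
    for game in range(1, num_games + 1):
        move = choose_move(memory, rng)
        opponent_move = biased_opponent(rng)
        total += score(move, opponent_move)
        memory[opponent_move] += 1  # memory update after each interaction
        cumulative.append(total / game)
    return cumulative


curve = run_games(500)
```

Because the opponent's bias is fixed and explicit, the same seed reproduces the same sequence of games, and a player that reads the memory can exploit the bias once enough games have been observed.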
While RPS has a small action space, [...]

However, despite these advances, most existing approaches rely on hand-designed or prompt-based memory update rules, rather than end-to-end optimization of the memory update policy (Suzgun et al., 2025; Ouyang et al., 2026). In our pilot observations, even strong instruction-following LLMs fail to consistently improve across repeated interactions when memory updates are driven only by such heuristic mechanisms, motivating a training signal that directly [...]

To address this gap, we propose MEMOPILOT, a plug-in Memory Copilot that explicitly trains the memory update process to improve the performance of a frozen LLM in multi-turn interactions. Inspired by Suzgun et al. (2025), we view memory as an evolving artifact that is refined across multiple interactions. We treat memory updating as a trainable multi-turn decision problem and optimize it end-to-end with multi-turn GRPO (Shao et al., 2024). Concretely, we introduce a turn-wise reward signal and a turn-level advantage estimation across rollouts, which provide finer-grained [...]

Our main contributions are: (1) We propose MEMOPILOT, a plug-in memory copilot that improves a frozen LLM player’s test-time learning behavior across repeated interactions by training the memory update process end-to-end. (2) We introduce a multi-turn GRPO training recipe for memory updating with turn-wise rewards and turn-level advantage [...] test-time learning rollouts. (3) We validate MEMOPILOT on co