MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

1Rice University, 2The University of Texas at Austin, 3Princeton University, 4A*STAR, 5Good Start Labs, 6xAI
(a) Performance and stability comparison: using GPT-4o-mini, MEMO achieves the highest mean win rate (49.5%) with the lowest RSE (6.4%).
(b) Learning efficiency on KuhnPoker: using Qwen2.5-7B-Instruct, MEMO reaches a 60% win rate with only 2,000 games.

MEMO couples memory retention and exploration to optimize LLM context for robust multi-agent gameplay.

Abstract

Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling, biasing win rate estimates and destabilizing comparative rankings across repeated tournaments. Prompt choice exacerbates this by inducing different effective policies and interaction dynamics.

We address both instability and underperformance in interactive games with MEMO (Memory-augmented MOdel context optimization), a self-play framework that treats inference-time context as an optimizable, agentic object by coupling retention and exploration. Retention maintains a persistent memory bank that distills self-play trajectories into structured insights, consolidates them via CRUD-style updates, and injects them as priors during subsequent play. Exploration performs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit vital states for sample-efficient coverage.

Across five text-based games, MEMO raises the mean win rate from 24.9% → 49.5% for GPT-4o-mini and from 21.7% → 44.3% for Qwen-2.5-7B-Instruct with a budget of only 2,000 self-play games per task, while reducing run-to-run dispersion of end-to-end outcomes and yielding more reliable rankings under prompt stratification.

The MEMO Framework

MEMO is an iterative procedure that optimizes prompts and game context to maximize performance and stability in two-player Markov games. A key goal is to make prompt optimization less run-dependent by accumulating reusable priors in a shared memory bank and reusing them across generations.

MEMO Framework Pipeline

At each optimization generation, new candidate contexts are proposed through two strategies: random proposals and memory-augmented updates. These candidates are then evaluated via self-play, and the best-performing candidates are used to update the pool for the next generation. To encourage exploration and mitigate redundant early moves, a prioritized replay module is introduced, enabling efficient search for robust prompts and priors within a single game.
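
The loop below is a minimal Python sketch of one such generation, with the proposal, evaluation, and selection steps passed in as callables; all names and defaults here are illustrative assumptions, not the MEMO implementation.

```python
from typing import Callable, Dict, List

Context = str  # a candidate context/prompt is treated as plain text here


def run_generation(
    pool: List[Context],
    propose_random: Callable[[Context], Context],         # random playstyle + small edits
    propose_with_memory: Callable[[Context], Context],     # edits guided by the memory bank
    evaluate: Callable[[List[Context]], Dict[Context, float]],  # self-play score per context
    keep: int = 4,
) -> List[Context]:
    """One generation: propose candidates, evaluate them via self-play, keep the best."""
    candidates = list(pool)
    for parent in pool:
        candidates.append(propose_random(parent))       # exploration
        candidates.append(propose_with_memory(parent))   # reuse of distilled insights

    candidates = list(dict.fromkeys(candidates))  # de-duplicate while keeping order
    scores = evaluate(candidates)  # e.g. conservative TrueSkill scores from self-play games

    # The highest-scoring candidates form the pool for the next generation.
    return sorted(candidates, key=lambda c: scores[c], reverse=True)[:keep]
```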

Key Components

1. Context Optimization Loop

MEMO maintains a population of candidate contexts that is evaluated through self-play. Selection uses a conservative TrueSkill objective that favors contexts that win reliably over those whose wins come from noisy trajectories. The population evolves through two proposal operators (a sketch of the selection step follows the list below):

  • Random proposals: Introduce novel variations to encourage exploration by sampling a playstyle and applying small edits to the base context
  • Memory-augmented updates: Incorporate insights extracted from trajectory reflections into targeted prompt edits
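
The exact pairing and selection scheme is not spelled out here, so the sketch below is one plausible instantiation using the open-source trueskill package: every pair of candidate contexts plays a few self-play matches, ratings are updated per game, and candidates are ranked by the conservative score mu - 3*sigma. The `play_match` callback and `matches_per_pair` value are assumptions.

```python
from itertools import combinations

import trueskill  # pip install trueskill


def rank_contexts(contexts, play_match, matches_per_pair=5):
    """Rate candidate contexts via head-to-head self-play, then rank them conservatively.

    `contexts` are assumed to be hashable prompt strings; `play_match(a, b)` is an
    assumed callback that plays one game and returns True if context `a` wins.
    """
    ratings = {c: trueskill.Rating() for c in contexts}

    for a, b in combinations(contexts, 2):
        for _ in range(matches_per_pair):
            if play_match(a, b):
                ratings[a], ratings[b] = trueskill.rate_1vs1(ratings[a], ratings[b])
            else:
                ratings[b], ratings[a] = trueskill.rate_1vs1(ratings[b], ratings[a])

    # Conservative objective: mu - 3*sigma rewards contexts that win reliably,
    # not ones whose high mean rating comes with high uncertainty.
    def conservative(c):
        r = ratings[c]
        return r.mu - 3 * r.sigma

    return sorted(contexts, key=conservative, reverse=True)
```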

2. Memory Bank (Retention)

MEMO maintains a shared memory bank that persists across optimization generations. After each generation, completed self-play trajectories are analyzed to extract structured insights such as rule clarifications, legality constraints, and strategy priors. These insights are merged into the bank using three database-style operations (a sketch follows the list below):

  • Add: If a new insight is not similar to any existing insight, it is added directly
  • Remove: If a new insight conflicts with an existing one, both are removed to avoid misleading the agent
  • Edit: If a new insight is similar to an existing one, they are merged to be more actionable
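
A minimal sketch of this consolidation step is shown below; the `is_similar`, `conflicts`, and `merge` callbacks stand in for whatever similarity check and LLM-based rewriting the system actually uses, and are assumptions rather than the paper's implementation.

```python
from typing import Callable, List


def consolidate(
    memory: List[str],
    new_insights: List[str],
    is_similar: Callable[[str, str], bool],
    conflicts: Callable[[str, str], bool],
    merge: Callable[[str, str], str],
) -> List[str]:
    """Merge newly extracted insights into the memory bank with Add/Remove/Edit."""
    bank = list(memory)
    for insight in new_insights:
        handled = False
        for i, existing in enumerate(bank):
            if conflicts(insight, existing):
                # Remove: contradictory insights could mislead the agent, so drop both.
                bank.pop(i)
                handled = True
                break
            if is_similar(insight, existing):
                # Edit: merge similar insights into a single, more actionable entry.
                bank[i] = merge(existing, insight)
                handled = True
                break
        if not handled:
            # Add: a genuinely new insight enters the memory bank directly.
            bank.append(insight)
    return bank
```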

3. Prioritized Replay (Exploration)

To improve trajectory coverage, MEMO maintains a replay buffer that stores trajectory prefixes together with environment seeds for reproduction. The buffer biases sampling toward infrequently encountered trajectories using inverse-frequency scoring, encouraging a more diverse and balanced pool of prompt-level insights. A gating parameter determines how often games are initialized from the replay buffer rather than played afresh.
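
A minimal sketch of such a buffer follows, assuming inverse-frequency weights of the form 1/(1 + visit count) and a gating probability of 0.3; both choices are illustrative, not values from the paper.

```python
import random
from collections import defaultdict
from typing import Optional, Tuple


class ReplayBuffer:
    """Stores (trajectory prefix, environment seed) pairs and favors rarely revisited prefixes."""

    def __init__(self, replay_prob: float = 0.3):
        self.replay_prob = replay_prob   # gating parameter: chance of resuming from the buffer
        self.visits = defaultdict(int)   # how often each prefix has been replayed
        self.entries = {}                # prefix -> env seed, so the state can be reproduced

    def add(self, prefix: Tuple[str, ...], seed: int) -> None:
        self.entries[prefix] = seed

    def sample(self) -> Optional[Tuple[Tuple[str, ...], int]]:
        """Return a (prefix, seed) pair to resume from, or None to start a fresh game."""
        if not self.entries or random.random() > self.replay_prob:
            return None
        prefixes = list(self.entries)
        # Inverse-frequency scoring: prefixes seen less often get proportionally more weight.
        weights = [1.0 / (1 + self.visits[p]) for p in prefixes]
        prefix = random.choices(prefixes, weights=weights, k=1)[0]
        self.visits[prefix] += 1
        return prefix, self.entries[prefix]
```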

Results

MEMO achieves substantial improvements across five text-based games with both GPT-4o-mini and Qwen-2.5-7B-Instruct models. The framework demonstrates strong gains in negotiation games and competitive results in imperfect-information settings.

Comparison with Baselines

Type    Optimizer         SimpleNegotiation  TwoDollar  KuhnPoker  Briscola  SimpleTak  Mean Win Rate  Mean RSE

GPT-4o-mini
Static  baseline          31.3%              32.2%      39.1%      0.3%      21.4%      25.1%          44.9%
        CoT               27.8%              25.7%      46.5%      30.4%     24.8%      31.1%          28.7%
        ToT               26.3%              27.0%      51.7%      45.1%     23.8%      34.8%          36.5%
Prompt  TextGrad          42.0%              44.6%      55.6%      7.1%      23.6%      34.6%          18.4%
        MIPRO             38.4%              50.9%      55.1%      19.7%     19.1%      36.7%          12.4%
        GEPA              36.8%              40.4%      52.2%      3.3%      26.9%      32.0%          11.3%
Ours    MEMO              54.9%              52.4%      55.6%      42.7%     41.8%      49.5%          6.4%

Qwen2.5-7B-Instruct
Static  baseline          24.0%              17.1%      45.3%      2.8%      15.1%      20.9%          30.1%
        CoT               23.8%              18.7%      42.0%      25.8%     13.6%      24.8%          43.4%
        ToT               27.1%              20.7%      42.2%      22.7%     15.1%      25.6%          40.2%
Prompt  TextGrad          37.1%              29.3%      52.8%      7.1%      22.4%      29.9%          21.7%
        MIPRO             42.4%              47.5%      53.8%      2.2%      20.9%      33.4%          7.3%
        GEPA              34.4%              31.7%      55.8%      3.3%      19.3%      28.8%          14.8%
RL      UnstableBaseline  41.1%              30.4%      52.7%      53.3%     47.3%      45.0%          43.3%
        SPIRAL            45.7%              --         56.7%      --        32.7%      --             --
Ours    MEMO              48.0%              48.4%      60.0%      31.1%     34.0%      44.3%          6.1%

Benchmark results across five games. Each win rate is the mean across three evaluation models. Rows are grouped by optimizer type: static prompting, prompt optimization, RL pipelines, and our method (MEMO).

Key Findings

  • Large improvements: MEMO raises mean win rate from 24.9% → 49.5% (GPT-4o-mini) and 21.7% → 44.3% (Qwen-2.5-7B-Instruct)
  • Sample efficiency: MEMO requires only 2,000 games per task, 19× fewer than UnstableBaseline's 38,000 games
  • Reduced variance: MEMO achieves a mean RSE of 6.4% versus MIPRO's 12.4%, significantly reducing run-to-run dispersion (see the RSE sketch after this list)
  • Outperforms prompt optimization: with GPT-4o-mini, MEMO outperforms TextGrad, MIPRO, and GEPA by 14.9, 12.8, and 17.5 percentage points, respectively
  • Competitive with RL: Achieves comparable performance to RL methods while being far more efficient
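
The text does not define RSE explicitly; assuming it denotes the standard relative standard error of the mean win rate across repeated runs, it can be computed as in the short sketch below.

```python
import statistics


def relative_standard_error(win_rates):
    """RSE in percent: standard error of the mean divided by the mean win rate."""
    mean = statistics.mean(win_rates)
    sem = statistics.stdev(win_rates) / len(win_rates) ** 0.5
    return 100.0 * sem / mean


# Example: five repeated evaluation runs of the same optimized context.
print(relative_standard_error([0.48, 0.51, 0.50, 0.47, 0.52]))  # ~1.9
```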

Cross-Model Context Transfer

We test whether a context learned via self-play with GPT-4o-mini can generalize across models. Specifically, we apply the learned prompts and retained experience produced by MEMO for GPT-4o-mini to Gemini-2.5-Flash-Lite and Grok-4-fast-non-reasoning.


Context learned from GPT-4o-mini transfers effectively to Gemini-2.5-Flash-Lite, yielding consistent improvements across the three environments. The gains are most pronounced in TwoDollar, where Gemini exhibits a substantial increase in win rate when augmented with the 4o-mini context.

In contrast, the same context does not yield consistent benefits for Grok-4-fast-non-reasoning, as seen in the performance drops in Briscola and KuhnPoker. This suggests that learned contexts capture reusable strategic structures that can generalize across certain model families, but effectiveness depends on how the target model processes and acts on the contextual information for gameplay.

Token Efficiency

Optimizer    SimpleNegotiation  KuhnPoker  SimpleTak  Avg. Tokens
TextGrad     842                986        938        922
MIPRO        145,864            162,084    754,534    354,161
GEPA         110,325            119,365    111,907    113,865
MEMO (Ours)  87,364             94,160     89,152     90,575

Output token cost for each prompt optimization method. MEMO uses only about 91K output tokens on average, roughly one-quarter of MIPRO's 354K and about 20% fewer than GEPA's 113K.