MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

1Rice University, 2The University of Texas at Austin, 3Princeton University, 4A*STAR, 5Good Start Labs, 6xAI
(a) Performance and stability comparison: using GPT-4o-mini, MEMO achieves the highest mean win rate (49.5%) with the lowest RSE (6.4%).
(b) Learning efficiency on KuhnPoker: using Qwen2.5-7B-Instruct, MEMO reaches a 60% win rate with only 2,000 games.

MEMO couples memory retention and exploration to optimize LLM context for robust multi-agent gameplay.

Abstract

Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling, biasing win rate estimates and destabilizing comparative rankings across repeated tournaments. Prompt choice exacerbates this by inducing different effective policies and interaction dynamics.

We address both instability and underperformance in interactive games with MEMO (Memory-augmented MOdel context optimization), a self-play framework that treats inference-time context as an optimizable, agentic object by coupling retention and exploration. Retention maintains a persistent memory bank that distills self-play trajectories into structured insights, consolidates them via CRUD-style updates, and injects them as priors during subsequent play. Exploration performs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit vital states for sample-efficient coverage.

Across five text-based games, MEMO raises the mean win rate from 24.9% → 49.5% for GPT-4o-mini and from 21.7% → 44.3% for Qwen-2.5-7B-Instruct with a budget of only 2,000 self-play games per task, while reducing run-to-run dispersion of end-to-end outcomes and yielding more reliable rankings under prompt stratification.

The MEMO Framework

MEMO is an iterative procedure that optimizes prompts and game context to maximize performance and stability in two-player Markov games. A key goal is to make prompt optimization less run-dependent by accumulating reusable priors in a shared memory bank and reusing them across generations.

MEMO Framework Pipeline

At each optimization generation, new candidate contexts are proposed through two strategies: random proposals and memory-augmented updates. These candidates are then evaluated via self-play, and the best-performing candidates are used to update the pool for the next generation. To encourage exploration and mitigate redundant early moves, a prioritized replay module is introduced, enabling efficient search for robust prompts and priors within a single game.
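
The loop below is a minimal Python sketch of one such generation, with the proposal, evaluation, and selection steps passed in as callables; all names and defaults here are illustrative assumptions, not the MEMO implementation.

```python
from typing import Callable, Dict, List

Context = str  # a candidate context/prompt is treated as plain text here


def run_generation(
    pool: List[Context],
    propose_random: Callable[[Context], Context],         # random playstyle + small edits
    propose_with_memory: Callable[[Context], Context],     # edits guided by the memory bank
    evaluate: Callable[[List[Context]], Dict[Context, float]],  # self-play score per context
    keep: int = 4,
) -> List[Context]:
    """One generation: propose candidates, evaluate them via self-play, keep the best."""
    candidates = list(pool)
    for parent in pool:
        candidates.append(propose_random(parent))       # exploration
        candidates.append(propose_with_memory(parent))   # reuse of distilled insights

    candidates = list(dict.fromkeys(candidates))  # de-duplicate while keeping order
    scores = evaluate(candidates)  # e.g. conservative TrueSkill scores from self-play games

    # The highest-scoring candidates form the pool for the next generation.
    return sorted(candidates, key=lambda c: scores[c], reverse=True)[:keep]
```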

Key Components

1. Context Optimization Loop

MEMO maintains a population of candidate contexts that is evaluated through self-play. Selection uses a conservative TrueSkill objective that favors contexts that win reliably over those whose wins come from noisy trajectories. The population evolves through two proposal operators (a sketch of the selection step follows the list below):

  • Random proposals: Introduce novel variations to encourage exploration by sampling a playstyle and applying small edits to the base context
  • Memory-augmented updates: Incorporate insights extracted from trajectory reflections into targeted prompt edits
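
The exact pairing and selection scheme is not spelled out here, so the sketch below is one plausible instantiation using the open-source trueskill package: every pair of candidate contexts plays a few self-play matches, ratings are updated per game, and candidates are ranked by the conservative score mu - 3*sigma. The `play_match` callback and `matches_per_pair` value are assumptions.

```python
from itertools import combinations

import trueskill  # pip install trueskill


def rank_contexts(contexts, play_match, matches_per_pair=5):
    """Rate candidate contexts via head-to-head self-play, then rank them conservatively.

    `contexts` are assumed to be hashable prompt strings; `play_match(a, b)` is an
    assumed callback that plays one game and returns True if context `a` wins.
    """
    ratings = {c: trueskill.Rating() for c in contexts}

    for a, b in combinations(contexts, 2):
        for _ in range(matches_per_pair):
            if play_match(a, b):
                ratings[a], ratings[b] = trueskill.rate_1vs1(ratings[a], ratings[b])
            else:
                ratings[b], ratings[a] = trueskill.rate_1vs1(ratings[b], ratings[a])

    # Conservative objective: mu - 3*sigma rewards contexts that win reliably,
    # not ones whose high mean rating comes with high uncertainty.
    def conservative(c):
        r = ratings[c]
        return r.mu - 3 * r.sigma

    return sorted(contexts, key=conservative, reverse=True)
```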

2. Memory Bank (Retention)

MEMO maintains a shared memory bank that persists across optimization generations. After each generation, completed self-play trajectories are analyzed to extract structured insights such as rule clarifications, legality constraints, and strategy priors. These insights are merged into the bank using three database-style operations (a sketch follows the list below):

  • Add: If a new insight is not similar to any existing insight, it is added directly
  • Remove: If a new insight conflicts with an existing one, both are removed to avoid misleading the agent
  • Edit: If a new insight is similar to an existing one, they are merged to be more actionable
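
A minimal sketch of this consolidation step is shown below; the `is_similar`, `conflicts`, and `merge` callbacks stand in for whatever similarity check and LLM-based rewriting the system actually uses, and are assumptions rather than the paper's implementation.

```python
from typing import Callable, List


def consolidate(
    memory: List[str],
    new_insights: List[str],
    is_similar: Callable[[str, str], bool],
    conflicts: Callable[[str, str], bool],
    merge: Callable[[str, str], str],
) -> List[str]:
    """Merge newly extracted insights into the memory bank with Add/Remove/Edit."""
    bank = list(memory)
    for insight in new_insights:
        handled = False
        for i, existing in enumerate(bank):
            if conflicts(insight, existing):
                # Remove: contradictory insights could mislead the agent, so drop both.
                bank.pop(i)
                handled = True
                break
            if is_similar(insight, existing):
                # Edit: merge similar insights into a single, more actionable entry.
                bank[i] = merge(existing, insight)
                handled = True
                break
        if not handled:
            # Add: a genuinely new insight enters the memory bank directly.
            bank.append(insight)
    return bank
```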

3. Prioritized Replay (Exploration)

To improve trajectory coverage, MEMO maintains a replay buffer that stores trajectory prefixes together with environment seeds for reproduction. The buffer biases sampling toward infrequently encountered trajectories using inverse-frequency scoring, encouraging a more diverse and balanced pool of prompt-level insights. A gating parameter determines how often games are initialized from the replay buffer rather than played afresh.
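
A minimal sketch of such a buffer follows, assuming inverse-frequency weights of the form 1/(1 + visit count) and a gating probability of 0.3; both choices are illustrative, not values from the paper.

```python
import random
from collections import defaultdict
from typing import Optional, Tuple


class ReplayBuffer:
    """Stores (trajectory prefix, environment seed) pairs and favors rarely revisited prefixes."""

    def __init__(self, replay_prob: float = 0.3):
        self.replay_prob = replay_prob   # gating parameter: chance of resuming from the buffer
        self.visits = defaultdict(int)   # how often each prefix has been replayed
        self.entries = {}                # prefix -> env seed, so the state can be reproduced

    def add(self, prefix: Tuple[str, ...], seed: int) -> None:
        self.entries[prefix] = seed

    def sample(self) -> Optional[Tuple[Tuple[str, ...], int]]:
        """Return a (prefix, seed) pair to resume from, or None to start a fresh game."""
        if not self.entries or random.random() > self.replay_prob:
            return None
        prefixes = list(self.entries)
        # Inverse-frequency scoring: prefixes seen less often get proportionally more weight.
        weights = [1.0 / (1 + self.visits[p]) for p in prefixes]
        prefix = random.choices(prefixes, weights=weights, k=1)[0]
        self.visits[prefix] += 1
        return prefix, self.entries[prefix]
```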

Results

MEMO achieves substantial improvements across five text-based games with both GPT-4o-mini and Qwen-2.5-7B-Instruct models. The framework demonstrates strong gains in negotiation games and competitive results in imperfect-information settings.

Comparison with Baselines

Type    Optimizer         SimpleNegotiation  TwoDollar  KuhnPoker  Briscola  SimpleTak  Mean Win Rate  Mean RSE

GPT-4o-mini
Static  baseline          31.3%              32.2%      39.1%      0.3%      21.4%      25.1%          44.9%
        CoT               27.8%              25.7%      46.5%      30.4%     24.8%      31.1%          28.7%
        ToT               26.3%              27.0%      51.7%      45.1%     23.8%      34.8%          36.5%
Prompt  TextGrad          42.0%              44.6%      55.6%      7.1%      23.6%      34.6%          18.4%
        MIPRO             38.4%              50.9%      55.1%      19.7%     19.1%      36.7%          12.4%
        GEPA              36.8%              40.4%      52.2%      3.3%      26.9%      32.0%          11.3%
Ours    MEMO              54.9%              52.4%      55.6%      42.7%     41.8%      49.5%          6.4%

Qwen2.5-7B-Instruct
Static  baseline          24.0%              17.1%      45.3%      2.8%      15.1%      20.9%          30.1%
        CoT               23.8%              18.7%      42.0%      25.8%     13.6%      24.8%          43.4%
        ToT               27.1%              20.7%      42.2%      22.7%     15.1%      25.6%          40.2%
Prompt  TextGrad          37.1%              29.3%      52.8%      7.1%      22.4%      29.9%          21.7%
        MIPRO             42.4%              47.5%      53.8%      2.2%      20.9%      33.4%          7.3%
        GEPA              34.4%              31.7%      55.8%      3.3%      19.3%      28.8%          14.8%
RL      UnstableBaseline  41.1%              30.4%      52.7%      53.3%     47.3%      45.0%          43.3%
        SPIRAL            45.7%              --         56.7%      --        32.7%      --             --
Ours    MEMO              48.0%              48.4%      60.0%      31.1%     34.0%      44.3%          6.1%

Benchmark results across five games. Each win rate is the mean across three evaluation models. Rows are grouped by optimizer type: static prompting, prompt optimization, RL pipelines, and our method (MEMO).

Key Findings

  • Large improvements: MEMO raises mean win rate from 24.9% → 49.5% (GPT-4o-mini) and 21.7% → 44.3% (Qwen-2.5-7B-Instruct)
  • Sample efficiency: MEMO requires only 2,000 games per task, 19× fewer than UnstableBaseline's 38,000 games
  • Reduced variance: MEMO achieves a mean RSE of 6.4% versus MIPRO's 12.4%, significantly reducing run-to-run dispersion (see the RSE sketch after this list)
  • Outperforms prompt optimization: with GPT-4o-mini, MEMO outperforms TextGrad, MIPRO, and GEPA by 14.9, 12.8, and 17.5 percentage points, respectively
  • Competitive with RL: Achieves comparable performance to RL methods while being far more efficient
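
The text does not define RSE explicitly; assuming it denotes the standard relative standard error of the mean win rate across repeated runs, it can be computed as in the short sketch below.

```python
import statistics


def relative_standard_error(win_rates):
    """RSE in percent: standard error of the mean divided by the mean win rate."""
    mean = statistics.mean(win_rates)
    sem = statistics.stdev(win_rates) / len(win_rates) ** 0.5
    return 100.0 * sem / mean


# Example: five repeated evaluation runs of the same optimized context.
print(relative_standard_error([0.48, 0.51, 0.50, 0.47, 0.52]))  # ~1.9
```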

Cross-Model Context Transfer

We test whether a context learned via self-play with GPT-4o-mini can generalize across models. Specifically, we apply the learned prompts and retained experience produced by MEMO for GPT-4o-mini to Gemini-2.5-Flash-Lite and Grok-4-fast-non-reasoning.


Context learned from GPT-4o-mini transfers effectively to Gemini-2.5-Flash-Lite, yielding consistent improvements across the three environments. The gains are most pronounced in TwoDollar, where Gemini exhibits a substantial increase in win rate when augmented with the 4o-mini context.

In contrast, the same context does not yield consistent benefits for Grok-4-fast-non-reasoning, as seen in the performance drops in Briscola and KuhnPoker. This suggests that learned contexts capture reusable strategic structures that can generalize across certain model families, but effectiveness depends on how the target model processes and acts on the contextual information for gameplay.

Token Efficiency

Optimizer    SimpleNegotiation  KuhnPoker  SimpleTak  Avg. Tokens
TextGrad     842                986        938        922
MIPRO        145,864            162,084    754,534    354,161
GEPA         110,325            119,365    111,907    113,865
MEMO (Ours)  87,364             94,160     89,152     90,575

Output token cost for each prompt optimization method. MEMO uses only about 91K output tokens on average, roughly one-quarter of MIPRO's 354K and about 20% fewer than GEPA's 113K.