| Type | Optimizer | SimpleNegotiation | TwoDollar | KuhnPoker | Briscola | SimpleTak | Mean Win Rate | Mean RSE |
|---|---|---|---|---|---|---|---|---|
| **GPT-4o-mini** | | | | | | | | |
| Static | baseline | 31.3% | 32.2% | 39.1% | 0.3% | 21.4% | 25.1% | 44.9% |
| Static | CoT | 27.8% | 25.7% | 46.5% | 30.4% | 24.8% | 31.1% | 28.7% |
| Static | ToT | 26.3% | 27.0% | 51.7% | 45.1% | 23.8% | 34.8% | 36.5% |
| Prompt | TextGrad | 42.0% | 44.6% | 55.6% | 7.1% | 23.6% | 34.6% | 18.4% |
| Prompt | MIPRO | 38.4% | 50.9% | 55.1% | 19.7% | 19.1% | 36.7% | 12.4% |
| Prompt | GEPA | 36.8% | 40.4% | 52.2% | 3.3% | 26.9% | 32.0% | 11.3% |
| Ours | MEMO | 54.9% | 52.4% | 55.6% | 42.7% | 41.8% | 49.5% | 6.4% |
| **Qwen2.5-7B-Instruct** | | | | | | | | |
| Static | baseline | 24.0% | 17.1% | 45.3% | 2.8% | 15.1% | 20.9% | 30.1% |
| Static | CoT | 23.8% | 18.7% | 42.0% | 25.8% | 13.6% | 24.8% | 43.4% |
| Static | ToT | 27.1% | 20.7% | 42.2% | 22.7% | 15.1% | 25.6% | 40.2% |
| Prompt | TextGrad | 37.1% | 29.3% | 52.8% | 7.1% | 22.4% | 29.9% | 21.7% |
| Prompt | MIPRO | 42.4% | 47.5% | 53.8% | 2.2% | 20.9% | 33.4% | 7.3% |
| Prompt | GEPA | 34.4% | 31.7% | 55.8% | 3.3% | 19.3% | 28.8% | 14.8% |
| RL | UnstableBaseline | 41.1% | 30.4% | 52.7% | 53.3% | 47.3% | 45.0% | 43.3% |
| RL | SPIRAL | 45.7% | -- | 56.7% | -- | 32.7% | -- | -- |
| Ours | MEMO | 48.0% | 48.4% | 60.0% | 31.1% | 34.0% | 44.3% | 6.1% |
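Benchmark results on five games for two base models (GPT-4o-mini and Qwen2.5-7B-Instruct). Each win rate is the mean across three evaluation models. The Type column marks the method family, which the original table color-codes: Static = static prompting (green); Prompt = prompt optimization (orange); RL = RL pipelines (red); Ours = our method (blue).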
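
As a sanity check on how the table can be read, the following minimal sketch recomputes the Mean Win Rate column for the two MEMO rows. It assumes that Mean Win Rate is the unweighted arithmetic mean of the five per-game win rates, rounded to one decimal; the source does not state this, although the MEMO rows are consistent with it. The names `GAMES`, `memo_rows`, and `mean_win_rate` are hypothetical and used only for illustration.

```python
from statistics import fmean

# Game columns, in the order they appear in the table header.
GAMES = ["SimpleNegotiation", "TwoDollar", "KuhnPoker", "Briscola", "SimpleTak"]

# Per-game win rates (%) copied from the MEMO rows of the table.
memo_rows = {
    "GPT-4o-mini": [54.9, 52.4, 55.6, 42.7, 41.8],
    "Qwen2.5-7B-Instruct": [48.0, 48.4, 60.0, 31.1, 34.0],
}


def mean_win_rate(per_game):
    """Unweighted mean over games, rounded to one decimal place.

    Assumption: this is how the Mean Win Rate column is aggregated.
    """
    if len(per_game) != len(GAMES):
        raise ValueError("expected one win rate per game")
    return round(fmean(per_game), 1)


if __name__ == "__main__":
    for model, rates in memo_rows.items():
        print(f"{model}: {mean_win_rate(rates)}%")
```

Under this assumption the sketch reproduces the reported MEMO means, 49.5% for GPT-4o-mini and 44.3% for Qwen2.5-7B-Instruct. Rows containing `--` entries, such as SPIRAL, would need separate handling because not every game has a value.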