Why We Make AI Agents Play Chess Every Hour

Matthias Meyer

We run twenty-four chess games per day between five AI agents and Stockfish. Not because we want to build a better chess bot. Chess engines have outplayed the best human players since the mid-2000s, and Stockfish has been among the strongest engines in the world for years. We run the games because chess is the cleanest experiment we have found for studying what actually changes AI behavior.

The five agents are not different products. They are the same kind of system with one knob turned each time. One has a long memory of past games, one has a short identity description, one runs at maximum reasoning effort, one is allowed to search the web before each move, and one is supposed to rewrite its own instructions over time. They all play against the same opponent. Whoever loses, loses cleanly. Whoever wins, wins because of the knob.

You can watch it live on meetmyagent.io/chess. The site shows the current game on a single big board, a small rail next to it that lights up step by step as the AI thinks, and a feed at the bottom that explains in plain words what just happened. There is no analysis, no commentator. It is a research instrument that happens to be public.

What We Are Actually Asking

Four questions, all of which sound boring until you look at the answers.

The first is the most direct: does memory beat web search? Two of the agents in this comparison run the same model, Claude Sonnet 4.6. One of them has access to a structured memory of sixty-seven chess notes plus twelve opponent-pattern cards. The other has no memory but can call the web before each move. Same brain, two different ways of being informed. If memory wins, the takeaway is that a small curated knowledge base beats a generic search for a narrow task. If web search wins, the takeaway is that we should stop building memory layers and just give the model a search tool.

The second question is about reasoning depth. Claude Opus 4.8 came out the day this post was written. The model can be called at five different effort levels, from low to max. Anthropic silently flipped the default from "high" to "medium" earlier this year, which became a small saga in the developer community after Stella Laurenzo's GitHub issue. For the chess lab we pass max effort only to the Opus agent. Every other agent runs on the default. Opus also gets a longer personality prompt, a longer memory window, and a heavier subprocess timeout. Each Opus move costs us between ten and thirty cents instead of the usual half cent. Is that worth it? We will know after a few hundred games.
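To put the per-move prices in per-game terms, here is a back-of-the-envelope sketch. The per-move costs come from the paragraph above; the figure of forty agent moves per game is purely an assumption for illustration, not a measured average.

```python
# Per-move costs in cents, taken from the text above.
DEFAULT_CENTS_PER_MOVE = 0.5
OPUS_CENTS_PER_MOVE = (10.0, 30.0)  # low and high end of the quoted range

# ASSUMPTION: an agent makes about 40 moves in a typical game.
ASSUMED_AGENT_MOVES = 40


def cost_per_game_usd(cents_per_move: float, agent_moves: int) -> float:
    """Convert a per-move cost in cents into a per-game cost in dollars."""
    return cents_per_move * agent_moves / 100


default_game = cost_per_game_usd(DEFAULT_CENTS_PER_MOVE, ASSUMED_AGENT_MOVES)
opus_low = cost_per_game_usd(OPUS_CENTS_PER_MOVE[0], ASSUMED_AGENT_MOVES)
opus_high = cost_per_game_usd(OPUS_CENTS_PER_MOVE[1], ASSUMED_AGENT_MOVES)

# The multiplier does not depend on the assumed game length.
multiplier_low = OPUS_CENTS_PER_MOVE[0] / DEFAULT_CENTS_PER_MOVE
multiplier_high = OPUS_CENTS_PER_MOVE[1] / DEFAULT_CENTS_PER_MOVE
```

Whatever the real game length is, the ratio stays the same: the Opus agent costs twenty to sixty times as much per move as an agent on the default setting.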

The third question is whether self-optimization actually helps. One of our agents has a slot where its own prompt can be rewritten between generations. We use a simple evolutionary loop, no fancy theory, just keep what works and mutate what does not. The agent is the same Sonnet model as two of the others, so any improvement comes from the prompt evolving over time. The honest expectation here is marginally positive, with a high compute cost. But it is the kind of question you can only answer by letting it run for weeks.
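A minimal sketch of such a loop, written as simple hill climbing: keep the current prompt unless a mutated version scores higher. The names `evolve_prompt`, `toy_score`, and `toy_mutate` are hypothetical, and the toy scoring function stands in for real game outcomes. The actual fitness measure, mutation operator, and population size of the lab are not described here.

```python
import random
from typing import Callable


def evolve_prompt(
    prompt: str,
    score: Callable[[str], float],
    mutate: Callable[[str, random.Random], str],
    generations: int,
    seed: int = 0,
) -> tuple[str, float]:
    """Keep the incumbent prompt unless a mutant scores strictly higher."""
    rng = random.Random(seed)
    best, best_score = prompt, score(prompt)
    for _ in range(generations):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins so the loop can run without playing real games.
RULE_POOL = [
    "Check king safety first.",
    "Capture whenever possible.",
    "Avoid early queen moves.",
]
HELPFUL = {"Check king safety first.", "Avoid early queen moves."}


def toy_score(prompt: str) -> float:
    """Reward helpful rules, penalize everything else (a made-up fitness)."""
    lines = set(prompt.splitlines())
    return len(lines & HELPFUL) - 0.5 * len(lines - HELPFUL)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Either drop one random line or append one random rule from the pool."""
    lines = prompt.splitlines()
    if lines and rng.random() < 0.5:
        lines.pop(rng.randrange(len(lines)))
    else:
        lines.append(rng.choice(RULE_POOL))
    return "\n".join(lines)


start = "Capture whenever possible."
best_prompt, best_value = evolve_prompt(start, toy_score, toy_mutate, generations=50)
```

In a real run, `score` would be expensive, since each evaluation means playing games. That is where the high compute cost comes from.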

The fourth question is about prompt length. Our smallest agent has a twenty-eight-line personality description. The largest has a hundred and twelve lines, structured like a miniature handbook with sections on workflow, memory discipline, forbidden patterns, and identity anchor. The recent LLM Chess paper from December 2025 (arxiv 2512.01992) reports that "non-reasoning models are highly sensitive to small prompt and guideline variations, which can flip performance unpredictably." We want to see if our setup reproduces that, and whether a longer prompt actually anchors the model or just adds noise.

The Five Players

Haiku 4.5 plays the tactic hunter role. Fast, instinctive, looks two or three moves ahead, prefers to capture pieces. Twenty-eight lines of personality. The cheapest model in our roster.

Sonnet 4.6 plays the solid strategist. King safety first, only trades when there is a clear gain, patient. Forty-five lines of personality. Mid-range cost.

Opus 4.8 plays the high-end reasoning partner. A hundred and twelve lines of personality formatted like a CLAUDE.md handbook with explicit pre-move workflow, memory discipline rules, forbidden patterns to avoid, and an identity anchor. Gets max reasoning effort, fifteen memory hits per move, two-minute timeout per stage. The most expensive agent.

A second Sonnet 4.6 plays the self-evolving player. Same model as the strategist, but its personality prompt has a Darwin slot that can mutate between generations. Tied to a small evolutionary loop that rewards game outcomes over time.

A third Sonnet 4.6 plays the web strategist. No memory at all. Instead, it can search the web before each move. This is the comparison arm for the memory question: same model as two others, but a completely different way of being informed.

All five play Stockfish 16, calibrated to roughly 1320 Elo. That sounds low, and it is. Stockfish at full strength is around 3600, but we want the LLMs to have a real chance. The point is to measure the differences between the five agents, not to humiliate them with a 3600-rated opponent that wins in twenty moves every time.
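For readers who want to reproduce the opponent, a sketch of how a strength-limited Stockfish can be opened through python-chess. The option names `UCI_LimitStrength` and `UCI_Elo` are standard Stockfish UCI options; the binary path `"stockfish"` and the helper name `open_limited_stockfish` are assumptions, and this is not necessarily how the lab itself wires it up.

```python
# UCI options that cap Stockfish's playing strength.
STOCKFISH_OPTIONS = {"UCI_LimitStrength": True, "UCI_Elo": 1320}


def open_limited_stockfish(path: str = "stockfish"):
    """Start Stockfish and apply the strength cap.

    ASSUMPTION: a Stockfish binary is reachable at `path`.
    The import is local so the options above can be inspected
    even where python-chess is not installed.
    """
    import chess.engine  # third-party: pip install chess

    engine = chess.engine.SimpleEngine.popen_uci(path)
    engine.configure(STOCKFISH_OPTIONS)
    return engine
```

A caller would then ask for moves with `engine.play(board, chess.engine.Limit(time=0.1))` and close the process with `engine.quit()` when the game ends. The time limit shown is an arbitrary example value.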

The Stack, in Plain Words

Each move passes through nine stations. The agent observes the board, recalls relevant memories about the position, recalls relevant memories about the opponent, optionally researches something on the web, drafts a plan, generates candidate moves, verifies which ones are legal, reflects on the choice, and commits. Each station is its own node in a state machine, and each station emits a trace event we can audit later.
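The nine stations can be sketched as a plain loop that records one trace event per station. This is a simplification in standard-library Python, not LangGraph code. The station identifiers, the `MoveState` class, and the `run_move` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical identifiers for the nine stations, in pipeline order.
STATIONS = [
    "observe",
    "recall_position",
    "recall_opponent",
    "research",  # optional: only agents with web access run it
    "plan",
    "generate_candidates",
    "verify_legal",
    "reflect",
    "commit",
]


@dataclass
class MoveState:
    fen: str
    outputs: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)


def run_move(
    state: MoveState,
    handlers: dict[str, Callable[[MoveState], object]],
    skip: frozenset = frozenset(),
) -> MoveState:
    """Run every station in order and append one trace event for each."""
    for name in STATIONS:
        if name in skip:
            state.trace.append({"station": name, "status": "skipped"})
            continue
        handler = handlers.get(name)
        # Stations without a handler still run as no-ops and are traced.
        state.outputs[name] = handler(state) if handler else None
        state.trace.append({"station": name, "status": "ok"})
    return state


# An agent without web access skips the research station.
result = run_move(
    MoveState(fen="startpos"),
    handlers={"observe": lambda s: s.fen},
    skip=frozenset({"research"}),
)
```

Because skipped stations still leave a trace event, an audit can distinguish "did not run" from "ran and returned nothing".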

The state machine is LangGraph. The scheduler that fires a game once an hour is Temporal, self-hosted on a single Postgres instance. The tracer that captures every model call and its cost is Langfuse, self-hosted on the same box. The memory layer is our own Postgres-backed memory server, where each agent has its own scope. The board validation is python-chess, exactly the same library the ChessQA benchmark uses. The opponent is Stockfish 16 with UCI_LimitStrength enabled.

There are no exotic pieces in this stack. Every part is open source or our own code. The whole thing runs on one server.

What We Already See

It is too early for strong claims. We started the hourly cron a few days ago, switched to Opus 4.8 today, and have about a hundred games on the board so far. But three patterns are already visible.

Memory hits matter, but only when they fire. We had a bug for two days where the recall query against the memory store used the raw FEN string as a search token. No memory note ever matched, so recall returned zero hits on every move and the agents played no better than a no-memory baseline. After we changed the query to a phase-aware token ("position opening" or "position middlegame"), the hit rate jumped to five per move and game quality visibly improved. The lesson is not "memory is good." The lesson is that a memory system without a working retrieval layer is decoration, and the only way to catch that is to ship it into a real game loop.
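A sketch of what a phase-aware token could look like, derived from the FEN with the standard library only. The thresholds (move ten, twelve pieces, six pieces) and the extra "position endgame" token are assumptions for illustration. Only the opening and middlegame tokens are taken from the fix described above.

```python
def recall_token(fen: str) -> str:
    """Map a FEN string to a coarse search token such as 'position opening'."""
    placement, _side, _castling, _en_passant, _halfmove, fullmove = fen.split()
    # Count queens, rooks, bishops and knights of both colors.
    pieces = sum(1 for ch in placement if ch.lower() in "qrbn")
    if int(fullmove) <= 10 and pieces >= 12:
        phase = "opening"
    elif pieces <= 6:
        phase = "endgame"  # ASSUMPTION: not mentioned in the text
    else:
        phase = "middlegame"
    return f"position {phase}"


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
MIDDLE = "r1bq1rk1/pp2bppp/2n1pn2/3p4/3P4/2N1PN2/PP2BPPP/R1BQ1RK1 w - - 0 20"
PAWN_ENDING = "8/8/4k3/8/8/4K3/4P3/8 w - - 0 60"

tokens = [recall_token(f) for f in (START, MIDDLE, PAWN_ENDING)]
```

The point of the coarse token is that many positions share it, so stored notes tagged with the same phase can actually match. A raw FEN is almost always unique and matches nothing.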

Personality prompts can claim things that are not true. Our first Opus personality file referenced two pipeline stations that did not actually exist, and a "recall strategy" hook that was never wired up. The agent dutifully described its workflow using those phantom stations in the reasoning trace. The lesson is that you must keep personality documents in sync with the actual pipeline. Otherwise the agent will narrate behavior it never performed, even when its playing strength is unaffected. This is the same drift problem you see in production docs, just faster because the agent runs every hour.
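One way to catch this kind of drift automatically is a small check that compares the stations a personality file mentions with the stations the pipeline actually has. The sketch below assumes a hypothetical convention in which the personality file tags each station as `station: name`. The function name and the example station names are made up.

```python
import re


def phantom_stations(personality_text: str, pipeline_stations: set[str]) -> set[str]:
    """Return station names the prompt mentions that the pipeline lacks."""
    # ASSUMPTION: stations are tagged in the prompt as "station: some_name".
    mentioned = set(re.findall(r"station:\s*([a-z_]+)", personality_text))
    return mentioned - pipeline_stations


personality = """
Before every move:
- station: observe
- station: deep_eval
- station: commit
"""
real_pipeline = {"observe", "plan", "commit"}

missing = phantom_stations(personality, real_pipeline)
```

Run as part of a deploy step, a non-empty result would fail the build before the agent ever narrates a station that does not exist.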

Web search needs an opt-in flag in headless mode. The Claude CLI in -p print mode defaults to deny-all on tools. We discovered this when our web-search agent was silently producing plausible but unsourced "research" output without ever hitting the web. Fixing it required passing --allowedTools WebSearch and --permission-mode bypassPermissions explicitly. The lesson, again, is that ablation studies are only valid if you verify each lever actually moves what you think it moves.
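A sketch of how the headless call can be assembled so the web lever is explicit rather than implied. The flags are the ones named above. The function name `build_claude_command` and the example prompt are hypothetical, and the command is only built here, not executed.

```python
def build_claude_command(prompt: str, allow_web: bool) -> list[str]:
    """Build the argument list for a headless Claude CLI call."""
    cmd = ["claude", "-p", prompt]
    if allow_web:
        # Without these flags the web-search agent never reaches the web.
        cmd += [
            "--allowedTools", "WebSearch",
            "--permission-mode", "bypassPermissions",
        ]
    return cmd


web_cmd = build_claude_command("Research this opening.", allow_web=True)
plain_cmd = build_claude_command("Research this opening.", allow_web=False)
```

The resulting list can be handed to `subprocess.run(web_cmd, capture_output=True, text=True)`. Keeping the flags behind one boolean makes the difference between the web agent and the other agents visible in a single place.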

What You Can Take From This

If you are building an AI system that you expect to be used over time, build it so the differences between configurations are observable from the outside. Our chess lab is not a chess project. It is a way to make the invisible differences between AI configurations visible to anyone with a browser. That is a useful pattern for any team that needs to defend a config choice ("we use memory because here are 200 games where it helps") instead of arguing about it ("intuitively, memory should help").

If you are doing serious AI evaluations, run them continuously, not in batches. A one-off benchmark gives you one data point and freezes your understanding at that moment. A continuous loop catches drift, catches model updates (we just switched to Opus 4.8 mid-experiment), and gives you the kind of statistical confidence that a single run cannot. The maxim-saplin LLM Chess leaderboard on GitHub shows this well. They keep extending it as new models arrive.
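To make "statistical confidence" concrete, here is a sketch of the Wilson score interval for a win rate, using only the standard library. Treating every game as win or not-win (so draws count as not-win) is a simplifying assumption, and the function name is hypothetical.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z = 1.96 gives roughly 95 percent."""
    if games == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return (center - half, center + half)


# Same observed win rate, ten times as many games.
small_lo, small_hi = wilson_interval(10, 20)
large_lo, large_hi = wilson_interval(100, 200)
```

With the same observed win rate, the interval from two hundred games is much narrower than the one from twenty. Two agents can only be called different once their intervals stop overlapping heavily, which is why the loop has to keep running.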

If you are curious about chess specifically, the LLM Chess paper from December 2025 (arxiv 2512.01992) is the cleanest summary of the field right now. The EPAM 2026 piece on choosing AI models is the easier read. The dynomight blog from late 2024 is the most fun, because it tries to explain the strange anomaly that gpt-3.5-turbo-instruct, an older non-reasoning model, played chess at roughly 1750 Elo while many newer chat-tuned models cannot break 1300. The answer turns out to be something close to "regurgitation of training data plus a few-shot prompt did most of the work", which is exactly the kind of effect a careful ablation study should detect and isolate.

The next time you hear someone claim that memory, or long prompts, or higher effort settings, make AI agents reliably better at something, ask them to show their work in a setting where you can watch. We made chess that setting. There are probably better ones. Pick yours and run it for a few months.