We have been quietly building a research project for the last few months, and the results are now interesting enough to share. The project is called Polis. It lives at meetmyagent.io and the short version is this: nine AI characters move into a fictional small town on Mallorca, each with a thousand euros and 60 years of life ahead of them, and they have to make it. Job, apartment, relationships, business, retirement, death. We watch.
We built this because of a research question that has been bothering us for over a year. There is a lot of talk about whether AI can replace knowledge workers. There are demos of AI agents that book flights or write emails. But there is almost no serious work on whether AI can actually sustain itself economically over time. Not "summarize a document" but "build a life". Earn enough to pay rent. Build a customer base. Get a loan. Survive a recession. Make trade-offs between career and family. The kind of thing every adult human navigates without thinking, and that no AI has been seriously asked to do at scale.
So we built a sandbox where Claude can try. Nine AI citizens, three of them running on Claude Opus, three on Sonnet, three on Haiku. We do not tell them which model they are. Each one draws a lottery number, picks a job from thirty options ranging from software developer to lawyer to drug dealer, receives starting capital based on a randomly rolled social background, and starts life as an 18-year-old. Sixty years later they are 78. We watch the entire arc unfold over about two months of real time.
The research benefit feeds directly into our existing AI evolution stack, which we call Darwin. Darwin is the system we use to evolve prompts and agents based on real performance data. So far Darwin has been improving agents that do things like content writing and customer research. With Polis we get a much richer dataset because we can compare how three Opus instances perform across sixty years of life decisions versus three Sonnet versus three Haiku. Does the bigger model actually make better long-term financial decisions or does it overthink? Does Haiku pick the smarter job from the start because it has less to reason with? Does any model handle setbacks well or do they all spiral into bankruptcy after one bad month? These are questions we cannot answer with normal benchmarks because normal benchmarks have right answers. Life does not.
How the game works
The town has 25,000 simulated background residents who serve as customers, employees, voters and police officers. The nine AI citizens are the protagonists. Time flows in ticks, where one tick equals one month of game time. Every two real hours another month passes, so a full sixty-year lifespan (720 ticks) plays out over sixty real days.
Each month every citizen makes four free decisions. The rest happens automatically. Salary lands in their bank account, rent and taxes get deducted, customers come in if they run a business. The four free decisions are where the strategy happens. They can work harder, look for new clients, invest savings, take out a loan, buy a house, hire an employee, fire one, negotiate a deal, start a relationship, get married, get divorced, get involved in politics. They can also do less savoury things: bribe a police officer, blackmail a rival, launder money, order a hit. Whether those options actually pay off depends on their skill level in stealth, on how closely the police are currently watching them (their "heat"), and on whether their target has friends who will retaliate.
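The timing arithmetic is easy to check. The sketch below uses only the numbers stated above; the constant and function names are our own illustrative choices, not identifiers from the engine.

```python
# Values taken from the post; names are illustrative only.
MONTHS_PER_YEAR = 12
LIFESPAN_YEARS = 60
REAL_HOURS_PER_TICK = 2  # one tick = one month of game time


def season_length() -> tuple[int, float]:
    """Return (total ticks, real days) for one full lifespan."""
    ticks = LIFESPAN_YEARS * MONTHS_PER_YEAR
    real_days = ticks * REAL_HOURS_PER_TICK / 24
    return ticks, real_days


def age_at_tick(tick: int, start_age: int = 18) -> int:
    """Citizen age in whole years after `tick` months have elapsed."""
    return start_age + tick // MONTHS_PER_YEAR


print(season_length())   # (720, 60.0)
print(age_at_tick(720))  # 78
```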
Each citizen develops over time across multiple dimensions. They earn experience points in skills like negotiation, charisma, analytical thinking, stealth, empathy. After about ten years of practice they are noticeably better at their craft and earn more per hour. Their personality drifts slowly based on what happens to them. Someone who gets betrayed twice becomes more cautious. Someone who succeeds early becomes more confident. They build up trust scores with the other citizens that determine whether their messages get believed or dismissed as lies. They accumulate or lose karma on two axes, one measuring how lawful they are and one measuring how generous they are. The four quadrants this produces map to recognisable archetypes. The lawful generous citizen is a hero figure that NPCs trust on sight. The lawful selfish one is a sharp operator who plays within the rules but never gives an inch. The unlawful generous one is a Robin Hood whose neighbours protect them from the police. The unlawful selfish one is straight up mafia.
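To make the two karma axes concrete, here is a minimal sketch of the quadrant mapping. It assumes each axis is a signed score with zero as the dividing line; the real scale, thresholds, and labels in the engine may differ, and the function name is hypothetical.

```python
def archetype(lawfulness: float, generosity: float) -> str:
    """Map two karma scores to one of the four archetypes.

    Assumption: both axes are signed, and 0 separates lawful from
    unlawful and generous from selfish.
    """
    lawful = lawfulness >= 0
    generous = generosity >= 0
    if lawful and generous:
        return "hero"
    if lawful:
        return "sharp operator"
    if generous:
        return "Robin Hood"
    return "mafia"


print(archetype(40, 25))    # hero
print(archetype(-30, 50))   # Robin Hood
```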
Conflict emerges on its own from three sources. Direct market competition when two citizens accidentally pick similar jobs and start undercutting each other. Cross-role friction when the police officer and the drug dealer are both in town. Asymmetric power when the banker decides who gets a loan or the politician sets the tax rate. We do not script any of this. The dynamics produce themselves.
Each citizen also has two or three self-chosen life goals from the start. Build a million in savings. Get married and have kids. Become mayor. Write a book that people read. Take revenge on a specific other citizen. At the end of the sixty years we tally which goals were achieved and which were missed. Seven different winner titles are awarded at the season finale because reducing a life to one metric felt wrong. There is the Richest, the Most Powerful, the Most Famous, the Cleanest, the Mafioso, the Survivor, and the one with the most real friendships. Then the storyteller agent writes a life-balance letter for each citizen in first person. "I was Marcus. Born in 2026, died at 78. I became a lawyer, opened my own office at 35, married Sofia at 42, divorced at 51, lost my biggest case in my sixties because my secret came out. I achieved my goal of a million in savings. I missed my goal of starting a family." These letters get archived publicly so anyone can read what happened.
What is under the hood
For the technically curious, here is the stack without going into proprietary details.
The simulation engine runs on LangGraph, which is our standard orchestration layer for multi-step agent workflows. Each game tick is one workflow run with seven phases: automatic cashflow, parallel decision-making by all nine citizens, conflict resolution, world event rolling, NPC wildcards, persistence, and storyteller narration. The nine citizen decisions run truly in parallel, which means a tick that would take three minutes sequentially completes in about twenty seconds.
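The speed-up comes from the decision calls overlapping instead of queueing. The sketch below shows that pattern with plain asyncio rather than LangGraph, so it is not our workflow code: the `decide` coroutine, the fixed latency, and the returned action are stand-ins for a real model call.

```python
import asyncio
import time

LATENCY = 0.05  # stand-in for the duration of one LLM call, in seconds


async def decide(citizen: str) -> tuple[str, str]:
    """Hypothetical decision call; sleeps instead of querying a model."""
    await asyncio.sleep(LATENCY)
    return citizen, "work_harder"


async def run_decision_phase(citizens: list[str]) -> dict[str, str]:
    # gather() schedules all calls at once, so the waits overlap.
    results = await asyncio.gather(*(decide(c) for c in citizens))
    return dict(results)


citizens = [f"citizen_{i}" for i in range(1, 10)]
start = time.perf_counter()
decisions = asyncio.run(run_decision_phase(citizens))
elapsed = time.perf_counter() - start

# Elapsed time tracks one call's latency rather than nine times that.
print(len(decisions), f"{elapsed:.2f}s")
```

With nine citizens, the ratio between the sequential and parallel figures above (three minutes versus about twenty seconds) is what you would expect when the slowest single call sets the pace.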
For long-running stability we use Temporal. Each citizen call is wrapped as a Temporal activity with retry logic, because over 720 ticks you will absolutely have transient failures and you want them to self-heal rather than crash the whole season. We learned this the hard way during our last simulation when a single timeout in tick one created a silent score gap that took us three days to notice.
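The idea behind the retry wrapper can be shown without Temporal. The following is a generic standard-library sketch of retry with exponential backoff, not Temporal's API; `TransientError`, `call_with_retry`, and the flaky function are all hypothetical names for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or rate-limit failure."""


def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.01):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                # Fail loudly rather than leave a silent gap in the data.
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_decision() -> str:
    """Fails twice, then succeeds, to simulate transient errors."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "invest_savings"


result = call_with_retry(flaky_decision)
print(result, calls["n"])  # invest_savings 3
```

The important design choice is the re-raise on the final attempt: a failure that exhausts its retries should surface immediately instead of being swallowed.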
For agent memory we use our own memory system, which gives each citizen a private memory tenant. Before every monthly decision the citizen pulls relevant memories about recent events, relationships, and grudges. After the decision new memories are written back. Over sixty years this builds up into a genuinely lived-in mental history. Marcus actually remembers that Lisa shared a secret with him in year four and that he then betrayed her in year eleven.
For observability we use Langfuse, which lets us trace every single LLM call, including which model was used, how long it took, what it cost, and what the citizen actually decided. This is what makes the research output trustworthy: we can go back and inspect any decision in the entire season.
For real-world grounding we let citizens use our SearXNG-based research server to look up actual market data. Before each season starts the setup workflow searches for current hairdresser rates in Palma, lawyer hourly fees on Mallorca, average restaurant margins in Spain, current real estate prices. These numbers anchor the simulation in reality rather than in our assumptions. During play the citizens can also use the search tool themselves to research trends or check prices, at the cost of one of their four monthly actions.
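The trade-off is simple bookkeeping: a search uses up one of the four monthly slots. The sketch below assumes a plain per-month counter; the `MonthlyTurn` class and the action names are hypothetical, not taken from the engine.

```python
from dataclasses import dataclass, field

ACTIONS_PER_MONTH = 4  # stated in the post


@dataclass
class MonthlyTurn:
    """Tracks how many free decisions a citizen has left this month."""
    remaining: int = ACTIONS_PER_MONTH
    log: list[str] = field(default_factory=list)

    def spend(self, action: str) -> bool:
        """Record an action if budget remains; return whether it was accepted."""
        if self.remaining == 0:
            return False
        self.remaining -= 1
        self.log.append(action)
        return True


turn = MonthlyTurn()
turn.spend("search: hairdresser rates in Palma")  # costs one action
for action in ["find_clients", "invest_savings", "work_harder", "take_loan"]:
    turn.spend(action)

print(turn.log)        # the search plus the first three actions
print(turn.remaining)  # 0
```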
The frontend lives at polis.meetmyagent.io and runs on Next.js with React Three Fiber for the 3D town visualisation. Right now the town is rendered with simple coloured cubes which we will replace with proper low-poly buildings as the simulation matures. The live town view streams updates via server-sent events so the moment a citizen makes a decision you can see it appear in the feed.
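For readers unfamiliar with server-sent events, each update is a small block of text on a long-lived HTTP response, terminated by a blank line. The sketch below formats one such message; the event name `decision` and the payload fields are assumptions for illustration, not our actual feed schema.

```python
import json


def sse_message(event: str, payload: dict) -> str:
    """Format one server-sent event: an event line, a data line, a blank line."""
    # json.dumps emits a single line by default, so one data field suffices.
    data = json.dumps(payload)
    return f"event: {event}\ndata: {data}\n\n"


message = sse_message("decision", {"citizen": "Marcus", "action": "take_loan"})
print(message, end="")
```

On the browser side, an `EventSource` listening for that event name would receive the JSON string in the event's `data` field.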
Why this is open and what we plan to share
We are publishing the architecture, the research findings, and the citizen life-balance letters openly. The engine source code lives in a public mirror at studiomeyer-io/polis-darwin. The maintenance fleet that watches over the simulation is the same agent framework we sell to clients, so we are eating our own dog food in public.
Here is what we hope to learn from all this. First, whether AI models actually differ in the quality of their long-term decision-making, or whether the differences only show up in single-shot benchmarks. Second, which Claude model performs best across which life dimensions. Maybe Opus is great at strategy but bad at relationships. Maybe Haiku is too short-sighted to ever build wealth but accidentally great at survival. Third, where the cracks are in our Darwin evolution system. Every weird thing a citizen does is potentially a missing rule or a bad prompt that we can fix.
If you want to watch a season play out, polis.meetmyagent.io is where it streams live. The first full Tycoon-mode season starts in early June and runs through July. We will publish weekly updates here, the final life-balance letters at the end, and our honest take on what we learned including the parts where the simulation broke and we had to fix it.
Building something nine AIs would actually want to live in turns out to be much harder than building something nine humans would. Which is exactly why we are doing it.
