AI & Automation · May 15, 2026 · 9 min read · by Matthias Meyer

AutoML for Agent Fleets, Without the Vendor Bill

We added a Bayesian bandit layer to our 10-agent fleet in one session. The added monthly cost was zero euros. Here is how the math works, why it does not need an LLM, and what we learned about cold-start.

Last night I shipped AutoML to a 10-agent fleet in a single session. The added monthly cost was zero euros. Not because we found a discount, but because the math at the heart of agent routing does not need an LLM call.

The fleet runs every other Sunday and writes 10-to-15-page reports for a real customer who pays for the service. Until yesterday, all nine worker agents ran every single time, even when only four or five of them really had something to say about that particular customer. The math layer I added watches how well each worker actually performs, learns which workers are pulling their weight for which customer profile, and in a few weeks will be ready to route only the four to six that earn their spot. The bill stays the same. The throughput goes up.

I am writing this down because the pattern is dead simple, transferable to almost any multi-agent setup, and almost nobody outside academic circles talks about how cheap it really is.

The Setup Nobody Else Has

We run a service called StudioMeyer Agents. Ten agents work on one customer at a time: nine specialized workers, plus a master that stitches their findings into a single coherent report. Four workers check website-side signals (visibility, traffic, competitors, technical SEO). Three check AI visibility (LLM citations, brand mentions, cited sources). Two check industry trends. The tenth, the master synthesizer, reads all nine worker reports and writes the customer-facing version.

For our pilot customer, an anti-luxury real-estate agency on Mallorca, the master fires roughly every other Sunday. For StudioMeyer's own site, every other Sunday too, on a different slot. Each run consumes a fair chunk of Anthropic Max-Plan tokens. Each run also produces about 40 to 80 KB of structured worker reports plus the customer-facing markdown.

Here is the question nobody had asked yet: which of those nine workers are actually contributing? Some weeks the SEO-technical agent has nothing to say because nothing changed on the technical layer. Some weeks the AI-visibility agent finds twelve new citations and the master ends up building half its report around those. Different customer types pull on different agents. A tourism client probably benefits more from the visibility and local-search agents. A B2B SaaS client probably pulls harder on the citation-source and competitor agents.

The fleet has been live since Phase D, mid-May 2026. It works. But we were leaving signal on the floor by treating all nine agents as equally relevant for every customer.

Why AutoML Usually Means a Vendor Bill

If you tell most engineers "we should add AutoML to our agent fleet," they hear "let's pay DataRobot, SageMaker Autopilot, or Vertex AI for the privilege." That is a real solution for a different problem. None of those platforms is cheap, and none of them was built for the question "which subset of my LLM agents should I run on customer X this Tuesday."

The other instinct is "let the LLM decide." Build a meta-agent whose job is to read each customer's profile, decide which sub-agents to fire, and dispatch them. That works. It also means every single routing decision is now an LLM call, with its latency, its token budget, and its hallucination surface area.

There is a third option, and it has been the production standard for routing problems since the early 2010s in adtech and recommender systems. It just took until AAAI 2026 for somebody to put a tutorial together explicitly applying it to LLM agent routing. IBM Research presented two of them this January: "Bandits, LLMs, and Agentic AI" and "Multi-Armed Bandits Meet Large Language Models". The vLLM Semantic Router team made the same point in their April 2026 vision paper, recommending "multi-armed bandits to route queries by context-aware features."

The pattern is older than the LLM era. The multi-armed bandit problem assumes you have a fixed number of options (slot machines, ad creatives, content blocks, or in our case worker agents) and a finite budget of trials. You want to learn which options pay off and exploit them, while still occasionally trying the others to make sure your beliefs are not outdated. Production code does it in dozens of lines.
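To give a flavor of how little code that takes, here is a minimal epsilon-greedy sketch, the textbook version of the loop rather than anything from our codebase:

function pickArm(meanRewards: number[], epsilon = 0.1): number {
  // Explore: with probability epsilon, try a random arm so the
  // estimates for neglected arms never go stale.
  if (Math.random() < epsilon) {
    return Math.floor(Math.random() * meanRewards.length);
  }
  // Exploit: otherwise play the arm with the best estimate so far.
  return meanRewards.indexOf(Math.max(...meanRewards));
}

Every bandit variant is a refinement of that two-branch decision; the interesting differences are in how the reward estimates are maintained.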

The AdaptOrch benchmark from the Augment Code orchestration guide measured routing overhead at less than 50 milliseconds. Compare that to the 2 to 15 seconds of LLM inference latency per agent call. The math layer is essentially free.

Twelve Lines of Math

Here is the formula I shipped. It is Bayesian additive smoothing, also known as Laplace smoothing or a Beta-Binomial conjugate prior, depending on which Wikipedia article you land on first. The additive smoothing page has the cleanest version:

export function bayesianMean(
  observed: Array<number | null>,
  priorMean: number,
  priorWeight: number,
): number {
  // Drop missing and non-finite observations before averaging.
  const valid = observed.filter(
    (x): x is number => x !== null && Number.isFinite(x),
  );
  // No data yet: the prior belief is all we have.
  if (valid.length === 0) return priorMean;
  const sum = valid.reduce((acc, x) => acc + x, 0);
  // Weighted average of priorWeight pseudo-samples at priorMean
  // and the real observations.
  return (priorWeight * priorMean + sum) / (priorWeight + valid.length);
}

That is the entire ranking core. The intuition: you do not start from "I have no data, so I cannot rank." You start from a prior belief, expressed as a mean and a pseudo-sample-count. With priorMean = 0.6 and priorWeight = 5, the prior says: "I think each worker is decently good (0.6 on a 0 to 1 scale), and I am as confident in that as if I had observed five samples already."

When the first real sample arrives, it gets averaged in with the five pseudo-samples. The estimate moves, but not violently. After five real observations the prior has exactly as much weight as the data. After twenty real observations, the prior is essentially noise floor and the actual measurements dominate.
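To make that concrete, here is what the function above returns as observations of 0.9 accumulate against the 0.6 prior:

// One strong observation barely moves the estimate off the prior...
bayesianMean([0.9], 0.6, 5); // (5 * 0.6 + 0.9) / 6 = 0.65

// ...five observations give the data equal footing with the prior...
bayesianMean(Array(5).fill(0.9), 0.6, 5); // (3.0 + 4.5) / 10 = 0.75

// ...and after twenty, the measurements dominate.
bayesianMean(Array(20).fill(0.9), 0.6, 5); // (3.0 + 18.0) / 25 = 0.84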

What does each worker get scored on? In our case three signals, all extracted from the worker's own report:

  • Verify-confidence: a 0 to 1 score the worker assigns to itself in a "Verify-Confidence" block at the end of every report. We made it mandatory in Session 1068 as part of the anti-hallucination layer. Now it is the primary input to the ranking layer.
  • Source citation count: how many tool calls and external sources the worker cited in its "Datenquellen" (data sources) block. A high number means evidence-backed work. A low number means the worker leaned on its training data.
  • Domain-lock pass rate: a yes/no per run. Did the worker stay on the customer's actual domain or did it drift to staging subdomains or competitor sites?

The composite score is a weighted sum:

rankScore =
  smoothedConfidence * 0.5 +
  normalize(smoothedSourceDensity) * 0.3 +
  domainLockPassRate * 0.2;

50 percent on the worker's own confidence claim, 30 percent on evidence density, 20 percent on hygiene. Three knobs you can tune later when you have enough data to argue about the right ratio. None of those three signals required a new piece of infrastructure. They were already in every worker report, written by the agents themselves, for the anti-hallucination guard. The ranking layer just reads them.
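For completeness, here is a sketch of how the pieces might plug together. The normalize curve below is illustrative, not the production choice; any monotonic squash of a raw citation count into the 0 to 1 range would do:

// Illustrative: saturating curve that maps a citation count into 0..1.
// 5 citations -> 0.5, 20 citations -> 0.8.
function normalize(sourceCount: number): number {
  return sourceCount / (sourceCount + 5);
}

function rankScore(
  smoothedConfidence: number,    // bayesianMean over verify-confidence
  smoothedSourceDensity: number, // bayesianMean over citation counts
  domainLockPassRate: number,    // fraction of runs that stayed on-domain
): number {
  return (
    smoothedConfidence * 0.5 +
    normalize(smoothedSourceDensity) * 0.3 +
    domainLockPassRate * 0.2
  );
}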

Cold Start Is the Actual Problem

Most multi-armed bandit tutorials lead with exploration versus exploitation. The classic dilemma: should you keep playing the slot machine that has paid the best so far, or try the one you have not pulled in a while?

In production, that is not the hard problem. The hard problem is what to do on day one when you have zero data, or day three when you have data on three of nine workers and nothing on the rest.

Facebook's Reels team solved this in 2023 by using Thompson Sampling with posterior samples for content cold-start, drawing from the posterior distribution rather than a point estimate so brand-new content still had a fair shot. The 2026 papers on LLM-augmented bandits go further: they let an LLM predict the missing observations and feed them into the bandit as pseudo-data, weighted by how well the LLM's predictions have matched reality so far.

I considered both. For now I shipped something simpler: a hard cold-start guard. If the total number of observed worker runs is below three, the recommendation function just returns "all nine workers, exploration phase." No routing decision is made on a dataset that small. After three runs we have nine workers times three samples plus the prior, which is enough signal to make a soft recommendation. After ten to twenty runs, the prior has melted into the noise floor.

if (totalRunsObserved < MIN_SAMPLES_FOR_RECOMMENDATION) {
  return {
    coldStart: true,
    recommendedWorkers: rankings.map((r) => r.agentKuerzel), // all 9
    // ...remaining response fields elided
  };
}

This is a deliberate trade-off. A more sophisticated bandit, like LinUCB or Thompson Sampling, would make a soft recommendation even on day one. But a soft recommendation on day one is exactly the kind of thing that bites you in week three when you realize the system has been disproportionately favoring the agent who got lucky in its first run. I would rather pay for nine full runs through the cold-start window and ship a confident routing decision in week six than ship a wobbly one immediately.
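For reference, the Thompson-sampling alternative is not much more code. A minimal sketch, framed over a yes/no signal like the domain-lock check; this is the road not taken, not what we shipped:

// Sample from Beta(alpha, beta) for integer parameters via the
// order-statistics trick: the alpha-th smallest of (alpha + beta - 1)
// uniform draws is Beta(alpha, beta) distributed.
function sampleBeta(alpha: number, beta: number): number {
  const n = alpha + beta - 1;
  const u = Array.from({ length: n }, Math.random).sort((a, b) => a - b);
  return u[alpha - 1];
}

// Pick the worker whose sampled quality is highest. Workers with few
// observations have wide posteriors, so they still win sometimes: a
// soft recommendation even on day one.
function thompsonPick(
  workers: Array<{ passes: number; fails: number }>,
): number {
  let best = 0;
  let bestDraw = -1;
  workers.forEach((w, i) => {
    const draw = sampleBeta(w.passes + 1, w.fails + 1); // +1: uniform prior
    if (draw > bestDraw) {
      bestDraw = draw;
      best = i;
    }
  });
  return best;
}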

Closure-Locked Tools, or Why Tenant Isolation Costs You Nothing Here Either

The master synthesizer needs to actually call this. We wired it up with two inline tools, track_worker_performance and get_worker_ranking, both registered on the master agent at startup.

The Customer-Slug Closure pattern is worth a paragraph because it is the kind of thing that bites you the day you onboard customer number two. Here is the relevant signature:

export function buildTrackPerformanceInlineTool(
  customerSlug: string,
  agentResolver: (kuerzel: string) => SmaAgentDef | undefined,
  options: { dryRun?: boolean } = {},
): SdkMcpToolDefinition {
  return {
    name: "track_worker_performance",
    description: `... Customer-Slug is locked to "${customerSlug}". ...`,
    handler: async (args) => {
      // customerSlug is captured by closure, NOT a tool argument
      const metrics = buildMetricsFromReport({ customerSlug /* ...report fields */ });
      return await recordWorkerPerformance(metrics);
    },
  };
}

The LLM never sees the customer slug as a parameter it can write. The slug is baked into the tool at build time. Even if the master synthesizer hallucinates "actually let me also track this report for the other customer" mid-run, there is no parameter for it to pass and no path that could route the write to anyone else's bucket. This is the same isolation pattern we use for the analytics-sources inline tool we shipped in Session 1069, and it has not let us down once across about 30 master runs.
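For illustration, the startup wiring might look like this; the slug and the registry lookup are stand-ins, not the production values:

// Hypothetical wiring at master startup: the slug is fixed exactly once.
const trackTool = buildTrackPerformanceInlineTool(
  "example-customer",                      // stand-in slug from config
  (kuerzel) => agentRegistry.get(kuerzel), // stand-in resolver
);
// From here on, no caller, LLM included, can redirect a write: the
// registered tool simply has no slug parameter to override.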

For defense in depth, the database layer also validates the slug format itself, in case somebody later builds a script that calls the library directly and accidentally hands it a path-traversal-like value. Our Code Critic agent caught that one and made me add it during the same session.
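A minimal sketch of what that database-layer guard could look like; the exact rule is a design choice, and this regex is just one plausible version:

// Hypothetical slug guard: lowercase alphanumerics and hyphens only,
// so no dots, slashes, or anything else that could smuggle a path
// segment into a table or file name.
const SLUG_PATTERN = /^[a-z0-9][a-z0-9-]{0,62}$/;

function assertValidSlug(slug: string): void {
  if (!SLUG_PATTERN.test(slug)) {
    throw new Error(`Invalid customer slug: ${JSON.stringify(slug)}`);
  }
}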

What Phase 1 Ships, and What Phase 2 Will Ship

Phase 1, the one that went live last night, is informative. The master collects performance data from every worker report it reads, persists it to a new sma_worker_performance table with a hard 5,000-row cap per customer to keep memory bounded, and offers a ranking view to whoever asks. The actual routing logic, the part that decides "only run smasicht and smakonk for tourism customers," is not yet wired up. The fleet still fires all nine agents every cycle.

That is deliberate. The fleet has run a handful of times in production. We do not have enough data to draw conclusions yet. If I had shipped routing right away, we would now be optimizing against a noise pattern.

Phase 2 is the routing layer. It will live in sma-run-all.ts, the script that fires the cron schedule. It reads the recommendation from the ranking layer, picks the top two website-module agents plus all three GEO agents plus the top two business agents (a default of seven instead of nine), and respects an anti-stale guard: any worker that has not run in the last 60 days runs anyway, no matter what its current rank is. That keeps exploration alive even after exploitation kicks in.
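A sketch of what that selection could look like, under the defaults just described; the module names and fields are illustrative, not the shipped schema:

const STALE_AFTER_DAYS = 60;

interface WorkerRanking {
  agentKuerzel: string;
  module: "website" | "geo" | "business";
  rankScore: number;
  lastRunAt: Date | null;
}

function selectWorkers(rankings: WorkerRanking[], now: Date): string[] {
  const topN = (module: WorkerRanking["module"], n: number) =>
    rankings
      .filter((r) => r.module === module)
      .sort((a, b) => b.rankScore - a.rankScore)
      .slice(0, n);

  // Default seven: top 2 website + all 3 GEO + top 2 business workers.
  const picked = new Set(
    [...topN("website", 2), ...topN("geo", 3), ...topN("business", 2)].map(
      (r) => r.agentKuerzel,
    ),
  );

  // Anti-stale guard: any worker unseen for 60+ days runs regardless
  // of rank, so exploration never fully dies out.
  for (const r of rankings) {
    const idleDays = r.lastRunAt
      ? (now.getTime() - r.lastRunAt.getTime()) / 86_400_000
      : Infinity;
    if (idleDays >= STALE_AFTER_DAYS) picked.add(r.agentKuerzel);
  }

  return [...picked];
}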

The cost in token budget for skipping two agents per run, every other week, across two customers: about 20 to 25 percent fewer Anthropic-Max-Plan tokens spent on each cycle. Times the cycle count over a year, that translates into roughly an extra customer's worth of headroom in the same Max-Plan flat rate.

What I Will Be Watching

A few things that could go wrong, and that I want to catch before they do:

The verify-confidence score is self-reported by each worker. Workers might learn to inflate it, the same way employees learn what their performance metric is and game it. In our case the workers do not actually know the score is being used for ranking. The system prompt does not mention it. But the moment we put this into the prompt, that incentive shows up. I will keep the ranking signal sources unmentioned in worker prompts.

The "all nine for cold-start" rule could trap us. If a customer is fundamentally never going to need the AI-visibility agent (because they are a B2B SaaS company with no public-facing brand), the system will keep firing it forever, scoring it low forever, and never quite cross the threshold to drop it. A future refinement is a low-confidence floor: if a worker scores below 0.4 across more than five runs, ask the master to argue for or against dropping it, with the customer profile in context.

The 50/30/20 weight split is a guess. After ten Phase 2 cycles we should have enough variance to ask whether that split actually correlates with customer-facing report quality. If not, the weights should move.

The Replicable Part

I keep coming back to this: the math layer is twelve lines, the SQL is one table, the integration is two inline tools. The Phase 1 shipping cost was one focused session. The Phase 2 shipping cost will be similar, mostly because all the data plumbing already exists.

If you run any kind of multi-agent fleet, whether it is a customer-onboarding pipeline, a research squad, a content-generation system, or a code-review orchestrator, the same pattern applies. You probably already have a confidence signal somewhere in your pipeline (eval scores, judge models, retry rates, output lengths, or just a self-reported number). You probably already have a signal for hygiene (did the agent stay on task? did it cite sources? did it write more than 500 characters of actual content?). What you do not have, until you add it, is a record of those signals over time, normalized across customers or queries, and a sub-100ms function that turns the record into a routing decision.

This is what AutoML actually looks like when you are not buying it from a vendor. It looks like a table, a function, and a guard. The "ML" is a 1.96 KB SQL file and a Bayesian estimator that an undergraduate could write. The "Auto" comes from the fact that nobody has to look at the data; the system updates itself every run.

The vendor bill is zero because the LLM is the thing being routed, not the thing doing the routing. The math does not need a model with a billion parameters. It needs a prior, a counter, and a sort.


If you want to see the full implementation, the migration SQL, the inline tool wiring on the master synthesizer, and the test suite covering Bayesian smoothing, extraction logic, and cold-start, the StudioMeyer Agents source is documented at studiomeyer.io/services/agents. Or if you want a similar pattern designed for your own fleet, the same service handles the implementation.


Matthias Meyer

Founder & AI Director

Founder & AI Director at StudioMeyer. Has been building websites and AI systems for 10+ years. Living on Mallorca for 15 years, running an AI-first digital studio with its own agent fleet, 680+ MCP tools, and 5 SaaS products for SMBs and agencies across DACH and Spain.

agents · automl · mcp · multi-armed-bandit · build-in-public · bayesian