From robots.txt to agents.json: The Evolution of Website Discovery
AI & Automation · February 15, 2026 · 10 min read · by Matthias Meyer


1994 brought robots.txt, 2005 brought sitemap.xml, 2011 brought schema.org -- and now comes agents.json. Each era has produced a new discovery file. A look at the history, and at what comes next.

Every era of the web has produced a new file that explains to machines what can be found on a website. In 1994, it was robots.txt. In 2005, sitemap.xml arrived. In 2011, schema.org brought structured data. And now, in 2025, agents.json is knocking at the door.

This is not a coincidence. It is a pattern. And those who understand it see more clearly where the web is heading.

1994: robots.txt -- "Please Don't Go There"

In June 1994, Martijn Koster proposed the Robots Exclusion Protocol. The problem was simple: web crawlers were visiting pages that should not be visited -- admin panels, temporary files, private directories.

The solution was a text file in the root directory of a website:

User-agent: *
Disallow: /admin/
Disallow: /tmp/
Allow: /

What robots.txt Does

  • Disallow: Tells crawlers which paths they should not visit
  • Allow: Defines exceptions within disallowed areas
  • User-agent: Distinguishes between bots (Googlebot, Bingbot, etc.)

What robots.txt Does Not Do

robots.txt is a polite request, not a security mechanism. There is no technical enforcement -- any bot can ignore the instructions. Reputable search engines comply; scrapers and spam bots do not.
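
How a well-behaved crawler honors these rules can be sketched in a few lines. The following is a minimal illustration using Python's standard library; the crawler name and URLs are placeholders, not any real bot's code:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check each URL against the rules before requesting it
print(rp.can_fetch("MyCrawler", "https://example.com/admin/settings"))  # False: /admin/ is disallowed
print(rp.can_fetch("MyCrawler", "https://example.com/services"))        # True: covered by Allow: /

Whether a bot runs such a check at all remains entirely up to the bot.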

Still: today practically every website has a robots.txt. What started as an informal convention in 1994 has become a de facto standard -- for decades with no RFC and no W3C specification, just a text file that prevailed.

The Perspective

robots.txt answers a single question: "What should a machine NOT do?" It is a negative list. It says nothing about what is on a website -- only what should not be visited.

2005: sitemap.xml -- "Here Is Everything We Have"

Eleven years later, Google had a different problem: How does a crawler efficiently find all relevant pages of a website? Especially for large websites with thousands of subpages, crawling was slow and incomplete.

Google proposed sitemap.xml in 2005, Yahoo and Microsoft backed it, and in 2006 it was published jointly as the sitemaps.org protocol.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/services</loc>
    <lastmod>2026-01-15</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>

The Paradigm Shift

Where robots.txt said "don't go here," sitemap.xml says "here is everything that matters." It is the shift from a negative list to a positive list.

  • URLs: Every indexable page is listed
  • Freshness: lastmod tells the crawler when a page last changed
  • Priority: The webmaster can signal which pages are more important
  • Reference in robots.txt: Sitemap: https://example.com/sitemap.xml

What sitemap.xml Does Not Do

It describes where content is, but not what it is. A URL like /services tells a search engine nothing about the page content. For that, the crawler must visit the page and analyze the HTML code.
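
The consuming side is just as simple. As a rough sketch (not any particular crawler's code), here is how the example sitemap above could be read with Python's standard library to list every URL and its last-modified date:

import xml.etree.ElementTree as ET

# Normally fetched from https://example.com/sitemap.xml; embedded and abbreviated here
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-02-01</lastmod></url>
  <url><loc>https://example.com/services</loc><lastmod>2026-01-15</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Each <url> entry yields a location and, optionally, a last-modified date
for entry in root.findall("sm:url", NS):
    loc = entry.findtext("sm:loc", namespaces=NS)
    lastmod = entry.findtext("sm:lastmod", default="unknown", namespaces=NS)
    print(loc, lastmod)

Note what is missing from that output: anything about what actually lives at those URLs. That is the gap the next file fills.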

2011: schema.org -- "This Is What the Content Means"

In 2011, Google, Bing, Yahoo, and Yandex jointly founded schema.org. The goal: structured data that tells search engines not only where content is, but what it means.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Pizzeria Roma",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Marienplatz 1",
    "addressLocality": "Munich"
  },
  "telephone": "+49 89 12345678",
  "openingHoursSpecification": {
    "@type": "OpeningHoursSpecification",
    "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "opens": "11:00",
    "closes": "22:00"
  }
}
</script>

What schema.org Brought

  • Semantics: Not just "here is text," but "this is an address," "this is a price," "these are opening hours"
  • Rich Snippets: Google shows ratings, prices, opening hours directly in search results
  • Knowledge Graph: Structured data feeds Google's knowledge database
  • Standardized types: Over 800 defined schemas for people, organizations, events, products, recipes, and more

The Advancement Over sitemap.xml

sitemap.xml said: "These URLs exist." schema.org says: "At this URL there is a restaurant with these opening hours and this menu."

The crawler no longer needs to interpret the page. The meaning is explicitly encoded.
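
A small hypothetical sketch of the consuming side makes the difference tangible: once the JSON-LD block has been pulled out of the page's script tag (assumed already done here), reading the meaning is a key lookup rather than text interpretation:

import json

# The JSON-LD block from the page above, already extracted from its <script> tag
jsonld = """{
  "@type": "LocalBusiness",
  "name": "Pizzeria Roma",
  "openingHoursSpecification": {
    "@type": "OpeningHoursSpecification",
    "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "opens": "11:00",
    "closes": "22:00"
  }
}"""

data = json.loads(jsonld)
hours = data["openingHoursSpecification"]

# No guessing from prose: name and opening hours are addressable by key
print(data["name"], "is open", hours["opens"], "to", hours["closes"])
print("Days:", ", ".join(hours["dayOfWeek"]))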

What schema.org Does Not Do

schema.org describes content but offers no interaction. A search engine can read that a restaurant is open Monday through Friday from 11:00 to 22:00 -- but it cannot reserve a table. The data is read-only.

2025: agents.json -- "Here Is What You Can Do"

And this is where it all comes together. agents.json is the next logical step in this evolution:

File | Year | Question It Answers
robots.txt | 1994 | What should a machine NOT do?
sitemap.xml | 2005 | WHERE is the content?
schema.org | 2011 | WHAT does the content mean?
agents.json | 2025 | What CAN a machine DO here?

agents.json lives at /.well-known/agents.json and describes a website's services as machine-readable tools:

{
  "version": "1.0",
  "tools": [
    {
      "name": "make_reservation",
      "description": "Reserve a table at the restaurant",
      "endpoint": "/api/v1/reservation",
      "method": "POST",
      "parameters": {
        "date": { "type": "string", "required": true, "description": "Date (YYYY-MM-DD)" },
        "time": { "type": "string", "required": true, "description": "Time (HH:MM)" },
        "guests": { "type": "number", "required": true, "description": "Number of guests" },
        "name": { "type": "string", "required": true, "description": "Name for the reservation" }
      }
    },
    {
      "name": "get_menu",
      "description": "Retrieve the current menu",
      "endpoint": "/api/v1/menu",
      "method": "GET"
    }
  ]
}

What Makes agents.json Different

  • Interaction: Not just "read this data," but "call this endpoint and you will get a result"
  • Parameter descriptions: An AI agent knows exactly what data it needs to send
  • Methods: GET for reading, POST for actions -- clearly defined
  • Endpoints: Direct URL to the service, no HTML parsing needed

The Decisive Difference

All previous discovery files were passive: they described content that could be read. agents.json is active: it describes actions that can be executed.

robots.txt said: "Please don't go there." sitemap.xml said: "Here are our pages." schema.org said: "This is a restaurant with these opening hours." agents.json says: "You can reserve a table here. Here's how."
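
What "Here's how" might look like from the agent's side can be sketched as well. This assumes the example file above is served by a hypothetical restaurant site; the host, payload values, and response handling are illustrative, not part of any shipping agent:

import json
import urllib.request

BASE = "https://pizzeria-roma.example"  # hypothetical host

# Step 1: discover the site's tools
with urllib.request.urlopen(BASE + "/.well-known/agents.json") as resp:
    manifest = json.load(resp)

# Step 2: pick the reservation tool and build the request its parameters describe
tool = next(t for t in manifest["tools"] if t["name"] == "make_reservation")
payload = {"date": "2026-03-01", "time": "19:30", "guests": 2, "name": "Meyer"}

req = urllib.request.Request(
    BASE + tool["endpoint"],                    # /api/v1/reservation
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method=tool["method"],                      # POST, as declared in the manifest
)

# Step 3: execute the action -- something none of the earlier files could describe
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))

No HTML parsing, no form scraping: the manifest tells the agent everything it needs to know.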

The Honest Status: Where Do We Really Stand?

What agents.json Is Today

agents.json is a community proposal, not an official standard. It was developed in the community (Wildcard AI / nicepkg) and is neither a W3C nor an IETF standard. As of February 2026, no search engine and no major AI agent actively reads and uses agents.json.

The Parallel to robots.txt

This is not as dramatic as it sounds. robots.txt was also not an official standard. It was an informal convention among webmasters that established itself over years. Only in 2022 -- 28 years after its introduction -- was robots.txt formalized as RFC 9309.

sitemap.xml followed a similar path: first proposed by Google, then adopted by other search engines, today a de facto SEO requirement.

What agents.json Needs to Succeed

  1. A major player: If ChatGPT, Gemini, or Claude start actively reading agents.json, it will quickly become standard
  2. A clear benefit: Websites with agents.json must work better for AI agents than websites without
  3. Simple implementation: A JSON file is simpler than schema.org markup -- that is an advantage
  4. Tooling: Generators, validators, debugging tools need to emerge

Why the Pattern Matters

Regardless of whether agents.json itself becomes the standard or an alternative prevails, the pattern is clear:

1994: Machines read the web (robots.txt: "Not here")
2005: Machines index the web (sitemap.xml: "Here we are")
2011: Machines understand the web (schema.org: "This is what it means")
202x: Machines use the web (agents.json: "Here's what you can do")

Each step did not replace the previous ones but supplemented them. Today we still have robots.txt and sitemap.xml and schema.org. agents.json (or its successor) will join them.

A Look at Your Own Website

The question is not "Do I need agents.json?" -- the question is: "Do I have services that should be machine-readable?"

Where agents.json Makes Sense

  • Restaurants: View menu, reserve a table
  • Doctors/law firms: Check availability, book an appointment
  • Tradespeople: Display services, request a quote
  • E-commerce: Search products, check availability
  • Service providers: List services, book a consultation

Where agents.json Does Not (Yet) Make Sense

  • Pure content websites: A blog does not need agents.json. Schema.org and sitemap.xml suffice
  • Internal tools: No public services, no need
  • Websites without APIs: If there are no machine-readable endpoints, there is nothing to describe

A Closing Analogy

In 1994, one could have asked: "Do I really need a robots.txt? My website only has 5 pages." Today every website has one.

In 2005, one could have asked: "Do I really need a sitemap.xml? Google finds my pages anyway." Today it is an SEO standard.

In 2011, one could have asked: "Do I really need schema.org? My content is readable as it is." Today it determines whether you get Rich Snippets.

In 2025, one can ask: "Do I really need an agents.json?" Perhaps not yet. But the pattern suggests that the answer will be different in a few years.

Summary: 30 Years of Discovery Evolution

Year | File | Question | Format | Status Today
1994 | robots.txt | What NOT? | Plain text | RFC 9309 (2022), universal
2005 | sitemap.xml | WHERE? | XML | SEO standard, universal
2011 | schema.org | WHAT? | JSON-LD | SEO factor, widespread
2025 | agents.json | What CAN you DO? | JSON | Community proposal, early

The direction is clear: from prohibitions to listings to meaning to interaction. Whether agents.json is the final name or a different format prevails is less important than the underlying concept: websites must not only show AI agents what they have but also explain what can be done with it.

The web has always been at its best when it has welcomed new participants. First humans with HTML. Then search engines with robots.txt and sitemap.xml. Then knowledge systems with schema.org. And now AI agents with agents.json.

The next discovery file will come. The only question is whether your website is ready.

Matthias Meyer

Founder & AI Director

Founder & AI Director at StudioMeyer. Has been building websites and AI systems for 10+ years. Living on Mallorca for 15 years, running an AI-first digital studio with its own agent fleet, 680+ MCP tools and 5 SaaS products for SMBs and agencies across DACH and Spain.

Tags: robots-txt · agents-json · sitemap · schema-org · discovery · web-history