From robots.txt to agents.json: The Evolution of Website Discovery
AI & Automation · February 15, 2026 · 10 min read · by Matthias Meyer


1994 brought robots.txt, 2005 brought sitemap.xml, 2011 brought schema.org -- and now comes agents.json. Each era has produced a new discovery file. A look at the history, and at what comes next.

Every era of the web has produced a new file that explains to machines what can be found on a website. In 1994, it was robots.txt. In 2005, sitemap.xml arrived. In 2011, schema.org brought structured data. And now, in 2025, agents.json is knocking at the door.

This is not a coincidence. It is a pattern. And those who understand it see more clearly where the web is heading.

1994: robots.txt -- "Please Don't Go There"

In June 1994, Martijn Koster proposed the Robots Exclusion Protocol. The problem was simple: web crawlers were visiting pages that should not be visited -- admin panels, temporary files, private directories.

The solution was a text file in the root directory of a website:

User-agent: *
Disallow: /admin/
Disallow: /tmp/
Allow: /

What robots.txt Does

  • Disallow: Tells crawlers which paths they should not visit
  • Allow: Defines exceptions within disallowed areas
  • User-agent: Distinguishes between bots (Googlebot, Bingbot, etc.)

What robots.txt Does Not Do

robots.txt is a polite request, not a security mechanism. There is no technical enforcement -- any bot can ignore the instructions. Reputable search engines comply; scrapers and spam bots do not.
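
How a well-behaved crawler honors these rules can be sketched in a few lines. The following is a minimal illustration using Python's standard library; the crawler name and URLs are placeholders, not any real bot's code:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check each URL against the rules before requesting it
print(rp.can_fetch("MyCrawler", "https://example.com/admin/settings"))  # False: /admin/ is disallowed
print(rp.can_fetch("MyCrawler", "https://example.com/services"))        # True: covered by Allow: /

Whether a bot runs such a check at all remains entirely up to the bot.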

Still: today practically every website has a robots.txt. What started as an informal convention in 1994 has become a de facto standard -- for decades with no RFC and no W3C specification, just a text file that prevailed.

The Perspective

robots.txt answers a single question: "What should a machine NOT do?" It is a negative list. It says nothing about what is on a website -- only what should not be visited.

2005: sitemap.xml -- "Here Is Everything We Have"

Eleven years later, Google had a different problem: How does a crawler efficiently find all relevant pages of a website? Especially for large websites with thousands of subpages, crawling was slow and incomplete.

Google proposed sitemap.xml in 2005, Yahoo and Microsoft backed it, and in 2006 it was published jointly as the sitemaps.org protocol.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/services</loc>
    <lastmod>2026-01-15</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>

The Paradigm Shift

Where robots.txt said "don't go here," sitemap.xml says "here is everything that matters." It is the shift from a negative list to a positive list.

  • URLs: Every indexable page is listed
  • Freshness: lastmod tells the crawler when a page last changed
  • Priority: The webmaster can signal which pages are more important
  • Reference in robots.txt: Sitemap: https://example.com/sitemap.xml

What sitemap.xml Does Not Do

It describes where content is, but not what it is. A URL like /services tells a search engine nothing about the page content. For that, the crawler must visit the page and analyze the HTML code.
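
The consuming side is just as simple. As a rough sketch (not any particular crawler's code), here is how the example sitemap above could be read with Python's standard library to list every URL and its last-modified date:

import xml.etree.ElementTree as ET

# Normally fetched from https://example.com/sitemap.xml; embedded and abbreviated here
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-02-01</lastmod></url>
  <url><loc>https://example.com/services</loc><lastmod>2026-01-15</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Each <url> entry yields a location and, optionally, a last-modified date
for entry in root.findall("sm:url", NS):
    loc = entry.findtext("sm:loc", namespaces=NS)
    lastmod = entry.findtext("sm:lastmod", default="unknown", namespaces=NS)
    print(loc, lastmod)

Note what is missing from that output: anything about what actually lives at those URLs. That is the gap the next file fills.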

2011: schema.org -- "This Is What the Content Means"

In 2011, Google, Bing, Yahoo, and Yandex jointly founded schema.org. The goal: structured data that tells search engines not only where content is, but what it means.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Pizzeria Roma",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Marienplatz 1",
    "addressLocality": "Munich"
  },
  "telephone": "+49 89 12345678",
  "openingHoursSpecification": {
    "@type": "OpeningHoursSpecification",
    "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "opens": "11:00",
    "closes": "22:00"
  }
}
</script>

What schema.org Brought

  • Semantics: Not just "here is text," but "this is an address," "this is a price," "these are opening hours"
  • Rich Snippets: Google shows ratings, prices, opening hours directly in search results
  • Knowledge Graph: Structured data feeds Google's knowledge database
  • Standardized types: Over 800 defined schemas for people, organizations, events, products, recipes, and more

The Advancement Over sitemap.xml

sitemap.xml said: "These URLs exist." schema.org says: "At this URL there is a restaurant with these opening hours and this menu."

The crawler no longer needs to interpret the page. The meaning is explicitly encoded.
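
A small hypothetical sketch of the consuming side makes the difference tangible: once the JSON-LD block has been pulled out of the page's script tag (assumed already done here), reading the meaning is a key lookup rather than text interpretation:

import json

# The JSON-LD block from the page above, already extracted from its <script> tag
jsonld = """{
  "@type": "LocalBusiness",
  "name": "Pizzeria Roma",
  "openingHoursSpecification": {
    "@type": "OpeningHoursSpecification",
    "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "opens": "11:00",
    "closes": "22:00"
  }
}"""

data = json.loads(jsonld)
hours = data["openingHoursSpecification"]

# No guessing from prose: name and opening hours are addressable by key
print(data["name"], "is open", hours["opens"], "to", hours["closes"])
print("Days:", ", ".join(hours["dayOfWeek"]))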

What schema.org Does Not Do

schema.org describes content but offers no interaction. A search engine can read that a restaurant is open Monday through Friday from 11:00 to 22:00 -- but it cannot reserve a table. The data is read-only.

2025: agents.json -- "Here Is What You Can Do"

And this is where it all comes together. agents.json is the next logical step in this evolution:

File | Year | Question It Answers
robots.txt | 1994 | What should a machine NOT do?
sitemap.xml | 2005 | WHERE is the content?
schema.org | 2011 | WHAT does the content mean?
agents.json | 2025 | What CAN a machine DO here?

agents.json lives at /.well-known/agents.json and describes a website's services as machine-readable tools:

{
  "version": "1.0",
  "tools": [
    {
      "name": "make_reservation",
      "description": "Reserve a table at the restaurant",
      "endpoint": "/api/v1/reservation",
      "method": "POST",
      "parameters": {
        "date": { "type": "string", "required": true, "description": "Date (YYYY-MM-DD)" },
        "time": { "type": "string", "required": true, "description": "Time (HH:MM)" },
        "guests": { "type": "number", "required": true, "description": "Number of guests" },
        "name": { "type": "string", "required": true, "description": "Name for the reservation" }
      }
    },
    {
      "name": "get_menu",
      "description": "Retrieve the current menu",
      "endpoint": "/api/v1/menu",
      "method": "GET"
    }
  ]
}

What Makes agents.json Different

  • Interaction: Not just "read this data," but "call this endpoint and you will get a result"
  • Parameter descriptions: An AI agent knows exactly what data it needs to send
  • Methods: GET for reading, POST for actions -- clearly defined
  • Endpoints: Direct URL to the service, no HTML parsing needed

The Decisive Difference

All previous discovery files were passive: they described content that could be read. agents.json is active: it describes actions that can be executed.

robots.txt said: "Please don't go there." sitemap.xml said: "Here are our pages." schema.org said: "This is a restaurant with these opening hours." agents.json says: "You can reserve a table here. Here's how."
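
What "Here's how" might look like from the agent's side can be sketched as well. This assumes the example file above is served by a hypothetical restaurant site; the host, payload values, and response handling are illustrative, not part of any shipping agent:

import json
import urllib.request

BASE = "https://pizzeria-roma.example"  # hypothetical host

# Step 1: discover the site's tools
with urllib.request.urlopen(BASE + "/.well-known/agents.json") as resp:
    manifest = json.load(resp)

# Step 2: pick the reservation tool and build the request its parameters describe
tool = next(t for t in manifest["tools"] if t["name"] == "make_reservation")
payload = {"date": "2026-03-01", "time": "19:30", "guests": 2, "name": "Meyer"}

req = urllib.request.Request(
    BASE + tool["endpoint"],                    # /api/v1/reservation
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method=tool["method"],                      # POST, as declared in the manifest
)

# Step 3: execute the action -- something none of the earlier files could describe
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))

No HTML parsing, no form scraping: the manifest tells the agent everything it needs to know.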

The Honest Status: Where Do We Really Stand?

What agents.json Is Today

agents.json is a community proposal, not an official standard. It was developed in the community (Wildcard AI / nicepkg) and is neither a W3C nor an IETF standard. As of February 2026, no search engine and no major AI agent actively reads and uses agents.json.

The Parallel to robots.txt

This is not as dramatic as it sounds. robots.txt was also not an official standard. It was an informal convention among webmasters that established itself over years. Only in 2022 -- 28 years after its introduction -- was robots.txt formalized as RFC 9309.

sitemap.xml followed a similar path: first proposed by Google, then adopted by other search engines, today a de facto SEO requirement.

What agents.json Needs to Succeed

  1. A major player: If ChatGPT, Gemini, or Claude start actively reading agents.json, it will quickly become standard
  2. A clear benefit: Websites with agents.json must work better for AI agents than websites without
  3. Simple implementation: A JSON file is simpler than schema.org markup -- that is an advantage
  4. Tooling: Generators, validators, debugging tools need to emerge

Why the Pattern Matters

Regardless of whether agents.json itself becomes the standard or an alternative prevails, the pattern is clear:

1994: Machines read the web (robots.txt: "Not here")
2005: Machines index the web (sitemap.xml: "Here we are")
2011: Machines understand the web (schema.org: "This is what it means")
202x: Machines use the web (agents.json: "Here's what you can do")

Each step did not replace the previous ones but supplemented them. Today we still have robots.txt and sitemap.xml and schema.org. agents.json (or its successor) will join them.

A Look at Your Own Website

The question is not "Do I need agents.json?" -- the question is: "Do I have services that should be machine-readable?"

Where agents.json Makes Sense

  • Restaurants: View menu, reserve a table
  • Doctors/law firms: Check availability, book an appointment
  • Tradespeople: Display services, request a quote
  • E-commerce: Search products, check availability
  • Service providers: List services, book a consultation

Where agents.json Does Not (Yet) Make Sense

  • Pure content websites: A blog does not need agents.json. Schema.org and sitemap.xml suffice
  • Internal tools: No public services, no need
  • Websites without APIs: If there are no machine-readable endpoints, there is nothing to describe

A Closing Analogy

In 1994, one could have asked: "Do I really need a robots.txt? My website only has 5 pages." Today every website has one.

In 2005, one could have asked: "Do I really need a sitemap.xml? Google finds my pages anyway." Today it is an SEO standard.

In 2011, one could have asked: "Do I really need schema.org? My content is readable as it is." Today it determines whether you get Rich Snippets.

In 2025, one can ask: "Do I really need an agents.json?" Perhaps not yet. But the pattern suggests that the answer will be different in a few years.

Summary: 30 Years of Discovery Evolution

Year | File | Question | Format | Status Today
1994 | robots.txt | What NOT? | Plain text | RFC 9309 (2022), universal
2005 | sitemap.xml | WHERE? | XML | SEO standard, universal
2011 | schema.org | WHAT? | JSON-LD | SEO factor, widespread
2025 | agents.json | What CAN you DO? | JSON | Community proposal, early

The direction is clear: from prohibitions to listings to meaning to interaction. Whether agents.json is the final name or a different format prevails is less important than the underlying concept: websites must not only show AI agents what they have but also explain what can be done with it.

The web has always been at its best when it has welcomed new participants. First humans with HTML. Then search engines with robots.txt and sitemap.xml. Then knowledge systems with schema.org. And now AI agents with agents.json.

The next discovery file will come. The only question is whether your website is ready.

Matthias Meyer

Founder & AI Director

Founder & AI Director at StudioMeyer. Has been building websites and AI systems for 10+ years. Living on Mallorca for 15 years, running an AI-first digital studio with its own agent fleet, 680+ MCP tools and 5 SaaS products for SMBs and agencies across DACH and Spain.

Tags: robots-txt · agents-json · sitemap · schema-org · discovery · web-history