I have spent two decades building software that sits between organizations and the web. Marketing automation platforms, CRM systems, open source communities. Every one of those products was fundamentally a piece of infrastructure that helped one type of consumer interact with web content more effectively.
What I have come to understand, and what I want to articulate here, is that the web evolves in distinct states, each defined by the arrival of a new class of consumer that the existing infrastructure cannot adequately serve. We are entering the fourth such state, and we are profoundly unprepared for it.
The three states we have already lived through
State 1: The Human Web (1991 to present)
The web began as a system for humans to read documents through browsers. HTML was the format. It encoded structure, layout, and presentation because the consumer was a person looking at a screen. CSS added visual styling. JavaScript added interactivity. Every evolution of the web's first state was oriented around the same question: how do we make this better for human eyes?
This state produced an extraordinary ecosystem. But every piece of it, from the markup language to the rendering engine to the design patterns, assumes that the end consumer is a person who will see the result rendered as pixels.
State 2: The Index Web (1998 to present)
Search engines arrived and immediately struggled with HTML. Early crawlers did not render pages; they could not "see" what a page looked like. They needed structured signals about what a page contained, how important it was, and how it related to other pages.
The web responded by inventing a parallel infrastructure layer for this new consumer:
Sitemaps told crawlers which pages existed and how frequently they changed. This was information that humans never needed (they navigate by clicking) but that machines required.
robots.txt defined access rules. Again, this was purely a machine-to-machine protocol. No human reads robots.txt.
Structured data (Schema.org, JSON-LD, OpenGraph, meta tags) embedded machine-readable facts directly into HTML. A product page might display a price visually in styled text, but it also declared the price in a <meta> tag that crawlers could parse without rendering.
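To make that layer concrete, here is a minimal Schema.org Product declaration in JSON-LD, built and serialized in Python. The product values are invented for illustration; the vocabulary (@type, offers, price) is standard Schema.org.

```python
import json

# A minimal Schema.org "Product" declaration in JSON-LD. The values are
# invented for illustration; the field names are standard Schema.org.
product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}

# Publishers embed this in the page head; a crawler can parse the declared
# price without rendering a single pixel.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(product_jsonld)
    + "</script>"
)
print(script_tag)
```

The declaration is a contract: whatever the visual design does with the price, the machine-readable fact stays in one predictable place.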
The key insight of State 2 was that a new consumer required a new infrastructure layer. Publishers who adopted sitemaps, robots.txt, and structured data were rewarded with better search visibility. Those who did not were penalized. The incentive structure drove adoption.
State 3: The Application Web (2005 to present)
Applications needed to consume web data programmatically. Unlike search engines, they did not want to crawl and parse HTML at all. They wanted structured data delivered through purpose-built interfaces.
REST APIs emerged as the primary solution. A product catalog that had previously been accessible only through rendered HTML pages became available as a JSON endpoint. GraphQL followed, offering more flexible querying. Webhooks provided real-time event notification.
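The difference between the two delivery models is easy to see side by side. Here is the same invented product datum as rendered markup and as an API payload, with the extraction code each one forces on the consumer:

```python
import json
import re

# The same product datum delivered two ways (values invented for illustration).

# 1. As rendered HTML: the price is buried in presentation markup.
html = '<div class="pd-42 col-sm"><span class="price price--lg">$19.99</span></div>'

# 2. As an API response: the price is a named field.
api_response = '{"sku": "W-100", "price": 19.99, "currency": "USD"}'

# Extracting from HTML means pattern-matching against markup that can
# change with any redesign...
price_from_html = float(re.search(r'class="price[^>]*>\$([\d.]+)<', html).group(1))

# ...while extracting from JSON is a stable, named lookup.
price_from_api = json.loads(api_response)["price"]

print(price_from_html, price_from_api)
```

The regex works today and breaks with the next CSS rename; the JSON lookup breaks only if the publisher changes the contract.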
The critical principle remained the same: each new consumer class got infrastructure designed for its consumption model. No application developer would consider scraping rendered HTML to get data that could be served as JSON from an API. That would be absurd. The data exists; why force the consumer to extract it from a presentation format?
And yet this is exactly what we are doing with AI agents.
State 4: The Agent Web
AI agents are the fourth consumer of the web. They browse autonomously, read page content, reason about what they find, fill forms, click buttons, navigate multi-step workflows, and synthesize information across multiple sources.
They are fundamentally different from every prior consumer:
They are not rendering pixels. Agents do not see web pages. They process text. Every CSS class, every style attribute, every decorative element in HTML is noise that consumes their context window without contributing to their understanding.
They are not building an index. Search engines process billions of pages to build a ranking system. Agents process individual pages to accomplish specific tasks. Their consumption is targeted and task-oriented.
They are not calling structured APIs. Most web content does not have an API. The millions of news articles, documentation pages, product listings, government forms, and community forums that agents need to access exist only as HTML.
They reason over their input. This is the crucial difference. When an agent receives a web page, it does not simply extract and store data. It uses the page content as context for language model reasoning. The quality, structure, and token efficiency of that input directly affects the quality, speed, and cost of the agent's reasoning.
And the fourth consumer has no infrastructure designed for it.
Agents today receive raw HTML. A typical page contains 30,000 to 60,000 tokens of markup, the vast majority of which encodes visual presentation: CSS class names, inline styles, tracking scripts, advertising containers, layout dividers. The agent pays to process all of it. The model spends time reasoning about what is content versus what is chrome. And the results are degraded because the signal-to-noise ratio is terrible.
This is the equivalent of a search engine trying to build an index by rendering every page in a browser and OCR-ing the screenshots. It would work, technically. It would also be absurdly wasteful. That is the state of agent-web interaction today.
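The overhead is easy to demonstrate on a toy page. The example below wraps two sentences of content in the kind of presentation markup described above (classes, inline styles, a tracking script, all invented), then compares crude word counts as a stand-in for tokenizer tokens:

```python
from html.parser import HTMLParser

# A toy page: two sentences of content wrapped in typical presentation
# markup. Everything here is invented for illustration.
raw_html = """
<div class="grid__col grid__col--8 u-pad-md" style="margin:0 auto;max-width:960px">
  <script src="/analytics/track.js?cid=889100"></script>
  <article class="post post--featured js-hydrate" data-component="article-body">
    <h1 class="post__title type-serif-xl">Agents need a cleaner web</h1>
    <p class="post__body type-body-md">The content is two sentences long.
    Everything else on this page is presentation.</p>
  </article>
</div>
"""

class TextExtractor(HTMLParser):
    """Collect only visible text, dropping tags, attributes, and scripts."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_script = False
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed(raw_html)
text = " ".join(extractor.parts)

# Whitespace-split word counts as a crude proxy for tokenizer tokens.
markup_words = len(raw_html.split())
content_words = len(text.split())
print(markup_words, content_words)
```

Even on this deliberately small example the markup roughly doubles the count; on a production page with full stylesheets, ad containers, and hydration payloads, the ratio is far worse.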
Why this matters more than you think
The gap between what agents need and what the web provides is not just an efficiency problem. It is an architectural problem that constrains what agents can do.
The context window bottleneck
Language models have finite context windows. Every token of HTML noise that enters the context displaces a token of actual content. An agent that could analyze five pages of content in one pass can only analyze one or two if those pages are delivered as raw HTML. The representation format directly limits the agent's cognitive capacity.
This is not a temporary limitation that will disappear as context windows grow. Larger windows cost proportionally more. A 200K-token context window is not cheaper per token than a 32K-token window. The economic pressure to minimize input tokens will persist indefinitely.
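The arithmetic is worth making explicit. All figures below are assumptions chosen to match the ranges above, not measurements:

```python
# Illustrative context-budget arithmetic; every figure is an assumption.
context_window = 128_000   # tokens available to the model
reserved = 8_000           # system prompt, instructions, output headroom
budget = context_window - reserved

raw_html_page = 45_000     # one page as raw markup (mid-range of 30K-60K)
clean_page = 5_000         # the same content with presentation stripped

pages_as_html = budget // raw_html_page
pages_as_clean = budget // clean_page

print(pages_as_html, pages_as_clean)  # 2 vs 24 pages in one pass
```

Under these assumptions the same context window holds two pages of raw HTML or twenty-four pages of clean content. The representation format, not the model, sets the ceiling.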
The interaction gap
Markdown extraction (the most common workaround) solves the token problem by stripping everything. But it also strips the information agents need for interaction. A Markdown representation of a page cannot tell an agent which text is a button, which is a link, which is a form field, or what actions are available. For tasks that require the agent to do something on a page (not just read it), Markdown is blind.
This creates a bifurcation in agent architectures: one system for reading (Markdown) and another for acting (raw HTML with DOM selectors). This split is a sign that neither format is right.
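A small example shows what the Markdown side of that split throws away. Both representations below are invented for illustration: the first is what a read-oriented extractor produces for a login region, the second is a hypothetical structured form that keeps the interactive affordances.

```python
# The same login region, seen two ways. Both representations are invented
# for illustration.
as_markdown = """
## Sign in

Email
Password
Sign in
Forgot password?
"""

as_nodes = [
    {"role": "heading", "text": "Sign in"},
    {"role": "textbox", "label": "Email", "name": "email"},
    {"role": "textbox", "label": "Password", "name": "password", "secret": True},
    {"role": "button", "text": "Sign in"},
    {"role": "link", "text": "Forgot password?", "href": "/reset"},
]

# From the Markdown, an agent cannot tell which line is a field, which is
# a button, and which is plain text. From the node list, the available
# actions are explicit and enumerable.
actionable = [n for n in as_nodes if n["role"] in ("textbox", "button", "link")]
print(len(actionable))
```

Note that "Sign in" appears twice in the Markdown with no way to distinguish the heading from the button. The structured form makes the ambiguity impossible.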
The reliability problem
Every agent that parses HTML does so with its own extraction logic. Different agents interpret the same page differently. When a site redesigns, every agent's extraction breaks simultaneously. There is no canonical representation, no contract between publisher and consumer about what the page contains.
Search engines solved this with structured data. A publisher who adds Schema.org markup is declaring: "This is the product name. This is the price. This is the rating." The declaration is independent of how the page looks. We need the equivalent for agents.
What the fourth state requires
Based on building infrastructure for every prior state of the web, I believe the fourth state requires three primitives:
A semantic representation format
A structured format that preserves what agents need (content, element types, interactive affordances, page regions) while discarding what they do not (visual presentation, scripts, tracking). This is what the Semantic Object Model (SOM) provides.
SOM is not a new rendering format. It is a compilation target. HTML remains the source of truth for browsers. SOM is the derived representation optimized for machine reasoning. The relationship is analogous to how a database stores normalized data (the source of truth) while a search index stores denormalized projections (the derived representation optimized for a specific consumer).
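I will not reproduce the SOM specification here (it lives in the papers at timespent.xyz/papers), but the shape of such a compilation target can be sketched. Every field name below is invented for this sketch, not taken from the spec:

```python
import json

# A hypothetical agent-oriented page representation. Field names are
# invented for this sketch; they are NOT the actual SOM specification.
page = {
    "url": "https://example.com/product/w-100",
    "regions": [
        {
            "kind": "main",
            "nodes": [
                {"role": "heading", "text": "Example Widget"},
                {"role": "text", "text": "In stock. Ships in 2 days."},
                {"role": "button", "text": "Add to cart", "action": "add-to-cart"},
            ],
        },
        {"kind": "nav", "nodes": [{"role": "link", "text": "Home", "href": "/"}]},
    ],
}

# Compiled from HTML, content, element types, and affordances survive;
# classes, inline styles, and scripts do not.
serialized = json.dumps(page)
print(len(serialized))
```

The point of the sketch is the selection, not the syntax: what survives compilation is exactly what an agent reasons over, and nothing else.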
A semantic interaction protocol
Agents that need to act on web pages currently do so through browser automation protocols designed for debugging (CDP, WebDriver). These protocols operate at the DOM level: click at these coordinates, type into this CSS selector, wait for this element to appear.
Agents need a protocol that operates at the semantic level: click the login button, fill the search field with this query, select "Next Day Delivery" from the shipping options. Semantic interactions are more reliable (they survive page redesigns), more efficient (no coordinate calculation), and more aligned with how agents reason about tasks.
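The contrast between the two levels can be sketched as message shapes. Neither payload below is a real CDP or Agent Web Protocol message; both are invented to illustrate the difference in abstraction:

```python
# Two ways to express the same interaction. Both message shapes are
# invented for illustration; neither is a real protocol payload.

# DOM-level, in the style of today's automation protocols: brittle,
# coupled to a selector that changes with every redesign.
dom_level = {
    "method": "click",
    "selector": "#root > div.l-grid > div:nth-child(2) > form button.btn--primary",
}

# Semantic-level: expressed in the terms the agent already reasons in.
semantic_level = {
    "action": "click",
    "target": {"role": "button", "text": "Log in"},
}

# The semantic target survives a redesign as long as a login button
# exists somewhere on the page; the CSS selector survives nothing.
print(semantic_level["target"]["role"])
```

The semantic message is also what a language model naturally emits: "click the Log in button" maps to it directly, with no selector synthesis in between.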
A cooperative discovery mechanism
Publishers need a way to declare: "I serve structured representations for agents, and here is where to find them." This is the robots.txt of the agent era. Instead of the binary "allow crawling" or "block crawling," publishers need to express "here is the format I prefer you consume."
The proposed SOM-Endpoint directive in robots.txt serves this purpose. Combined with the /.well-known/som.json convention and the <link rel="alternate"> tag, publishers have three complementary discovery mechanisms.
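As a sketch of the first mechanism: here is a robots.txt carrying a SOM-Endpoint directive, with a minimal parser for it. The directive name comes from the proposal above; the example URL and the parsing code are mine, for illustration only.

```python
# A robots.txt carrying the proposed SOM-Endpoint directive alongside
# the classic crawl rules. The endpoint URL is an invented example.
robots_txt = """
User-agent: *
Disallow: /admin/

SOM-Endpoint: https://example.com/.well-known/som.json
"""

def find_som_endpoint(text):
    """Return the declared SOM endpoint, or None if none is declared."""
    for line in text.splitlines():
        # Split on the first colon only, so the colons inside the URL
        # stay in the value.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "som-endpoint":
            return value.strip()
    return None

endpoint = find_som_endpoint(robots_txt)
print(endpoint)
```

An agent that finds no directive falls back to the other two mechanisms: fetching /.well-known/som.json directly, or looking for the <link rel="alternate"> tag in the page itself.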
The economic argument
Every transition between web states was driven by economics, not altruism. Publishers adopted sitemaps because search visibility drove traffic. They built APIs because application integrations drove revenue. They will adopt agent-oriented infrastructure for the same reason: agents are becoming a primary channel for content discovery and interaction.
The publishers who serve structured representations to agents will be the preferred sources when users ask their AI assistants for information. The publishers who block agents or serve them raw HTML will be invisible to a growing segment of content consumption.
The transition is already happening. Agent traffic as a percentage of total web traffic is growing rapidly. Cloudflare's data shows bot traffic (a rough proxy that includes agents) exceeding 30% on many sites. As AI assistants become the default interface for information retrieval, the fraction of web consumption mediated by agents will only increase.
Publishers who prepare for this now will have a structural advantage. Those who wait will be caught in the same scramble that characterized the early days of SEO, when publishers who ignored search engines suddenly found themselves invisible.
What I am building
Plasmate is my attempt to build the foundational infrastructure for the web's fourth state. It is an open source compiler that converts HTML to SOM; in effect, a headless browser optimized for agent consumption rather than human rendering.
It is not the only approach that will exist. The web's second and third states produced multiple competing tools and standards before settling on conventions. The same will happen here. What matters is that the problem is identified, the primitives are defined, and the building starts.
I have published five research papers exploring these ideas: the SOM format specification, the agentic web infrastructure vision, the Agent Web Protocol, cooperative content negotiation via robots.txt, and a task-completion benchmark comparing web representations. All are available at timespent.xyz/papers.
I am participating in the W3C Web Content for Browser and AI Community Group because standards matter, and because every prior state transition of the web eventually ran through a standards body.
But I am also building, because the gap exists now and agents are consuming the web now. The infrastructure they need should not wait for a standards process to complete. Open source implementations can evolve into standards. The reverse rarely works.
The pattern
Looking back at my career, from building Mautic (open source marketing automation) to building Plasmate (open source agent infrastructure), I see the same pattern repeating. In both cases, a new class of consumer emerged that existing tools could not serve well. In both cases, the solution was a new infrastructure layer that sat between the consumer and the content, translating one into the other.
Marketing automation was infrastructure that helped marketers interact with customers through the web. Agent infrastructure helps AI systems interact with content through the web. The abstraction level is different but the structural problem is identical: a consumer needs a format and a protocol designed for its consumption model.
The web is entering its fourth state. The question is not whether agent-oriented infrastructure will be built. It is who will build it, whether it will be open, and whether it will arrive in time to shape the transition rather than react to it.
I intend for the answer to be: we will, it will be open, and it is already here.
David Hurley is the founder of Plasmate Labs and the creator of the Semantic Object Model (SOM). Previously, he founded Mautic, the world's first open source marketing automation platform. He writes about web infrastructure, AI agents, and the agentic web at dbhurley.com and publishes research at timespent.xyz/papers.