The Publisher's Dilemma: Block, Tolerate, or Cooperate

March 30th, 2026

Publishers face a binary choice with AI agents: block them entirely or serve them raw HTML at full cost. Neither option is good. There is a third path, and the economics are compelling.


If you run a website with meaningful traffic, you have already encountered this problem, whether you know it or not. AI agents are crawling your pages. They are consuming your content. And you are getting almost nothing back.

The numbers are stark. Cloudflare reported in August 2025 that Anthropic's crawlers fetch approximately 38,000 pages for every single page visit they refer back to publishers. OpenAI's ratio is better but still lopsided. Perplexity's ratio actually got worse over the course of 2025, with crawling volume increasing while referral traffic decreased.

Meanwhile, Google referrals to news sites have been declining since February 2025, coinciding with the expansion of AI Overviews. The Pew Research Center found that users are less likely to click through to a source when Google displays an AI summary at the top of search results.

The traffic pipeline that has sustained the web publishing model for two decades is weakening. And the new consumers that are replacing human visitors are not playing by the same economic rules.

The two options publishers have today

Right now, publishers face a binary choice.

Option A: Block everything. Add Disallow: / for GPTBot, ClaudeBot, and every other AI user agent in your robots.txt. The New York Times, The Atlantic, and hundreds of other publishers have taken this approach. It stops the crawling, eliminates the infrastructure cost, and sends a clear signal about content rights.

But it also makes you invisible to AI. When someone asks ChatGPT or Claude about a topic you cover deeply, your content will not be in the answer. As AI assistants become a primary interface for information discovery (and the trajectory is unmistakable), publishers who block agents are opting out of an increasingly important distribution channel. It is the equivalent of blocking Googlebot in 2005.

Option B: Allow everything. Do nothing. Let the crawlers in. Serve them the same HTML you serve to browsers, with all the CSS, JavaScript, tracking scripts, ad markup, and layout containers. Hope that being included in AI training data and retrieval results translates to some indirect benefit.

The cost of this option is concrete and measurable. Every agent request hits your CDN, your origin server, and your rendering pipeline. For sites with server-side rendering, each request triggers a full page build. The agent receives your 2.5MB page, extracts maybe 25% of the content, discards the rest, and sends no referral traffic in return.

Block everything and become invisible. Allow everything and subsidize companies that are extracting your value without compensation. Neither option is good.

What "cooperate" means

There is a third option that most publishers have not considered, because until recently it did not exist as a practical choice. Instead of blocking agents or serving them raw HTML, you can serve them a format designed for their consumption model.

The idea is content negotiation for the agent era. When a browser requests your page, you serve HTML (as you always have). When an AI agent requests your page, you serve a structured semantic representation: smaller, faster to process, and containing only the information the agent actually needs.

This is not a new concept in web architecture. HTTP content negotiation has existed since the 1990s. Servers already serve different content types based on the Accept header (JSON for APIs, HTML for browsers, images in different formats based on browser support). Extending this pattern to serve structured representations for AI agents is a natural evolution.

The structured format we have been developing is called SOM (Semantic Object Model). It is a JSON document that organizes page content into typed regions (navigation, main content, sidebar, footer) containing typed elements (headings, paragraphs, links, buttons, form fields) with explicit declarations of available actions. It preserves what agents need and discards what they do not.
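As a sketch of that shape (the field names here are illustrative assumptions, not the published SOM schema), such a document might look like:

```json
{
  "version": "1.0",
  "url": "https://example.com/article/xyz",
  "regions": [
    {
      "type": "main",
      "elements": [
        { "type": "heading", "level": 1, "text": "Example article title" },
        { "type": "paragraph", "id": "p1", "text": "First paragraph of the article." },
        { "type": "link", "text": "Related report", "href": "/reports/2026", "actions": ["navigate"] }
      ]
    },
    {
      "type": "navigation",
      "elements": [
        { "type": "link", "text": "Home", "href": "/", "actions": ["navigate"] }
      ]
    }
  ]
}
```

An agent parsing this never sees CSS, ad markup, or tracking scripts; it gets typed regions and elements, plus explicit action declarations, and nothing else.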

But the specific format matters less than the principle: publishers can choose what agents receive, rather than letting agents extract whatever they can from HTML.

The economics of cooperation

I published a detailed cost-benefit analysis as a research paper, but here is the summary.

We modeled four publisher strategies across three tiers (small blog at 10K agent requests/month, mid-size news site at 1M/month, and large publisher at 50M/month) and compared the annual infrastructure costs of each.

  • $35K: annual cost for a mid-size publisher serving raw HTML
  • $11K: annual cost for a mid-size publisher serving SOM-first
  • 68%: per-request cost reduction
  • ~50K agent requests/month: break-even point

The savings come from three sources:

Bandwidth reduction. A structured representation is 5-15KB compared to 30-100KB+ for a full HTML page with its associated resources. When you are serving millions of agent requests per month, the bandwidth difference is substantial.

Compute reduction. This is the big one for dynamic sites. Server-side rendered pages require a full render cycle per request: template compilation, database queries, component rendering, HTML serialization. A cached SOM representation is a static JSON file served from CDN with zero origin compute. For sites where SSR costs dominate (which is most modern web applications), this is where the majority of savings come from.

Bot management simplification. Publishers currently spend significant effort on WAF rules, rate limiting, and bot detection to manage agent traffic. Cooperative serving through a dedicated endpoint reduces the adversarial dynamic. You are explicitly offering content for agents through a channel you control, rather than playing defense against crawlers hitting your production infrastructure.

Small blog (10K agent requests/month): Raw HTML costs roughly $720/year. SOM-first costs roughly $380/year. Savings: $340/year. Modest, and since the break-even point is around 50K requests/month, most blogs stay below that threshold unless they are popular targets for AI research agents.

Mid-size news site (1M agent requests/month): Raw HTML costs roughly $35,000/year. SOM-first costs roughly $11,000/year. Savings: $24,000/year. This is where the economics become compelling. The upfront investment in SOM generation ($0.02/page for initial compilation, cached afterward) pays for itself within a few months.

Large publisher (50M agent requests/month): Raw HTML costs roughly $547,000/year. SOM-first costs roughly $217,000/year. Savings: $330,000/year. At this scale, the savings fund a dedicated team to manage the agent content pipeline.
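The tier arithmetic above is linear in request volume. A minimal sketch of it, with per-request costs back-derived from the mid-size figures quoted here (an illustration, not the paper's actual cost model):

```javascript
// Sketch: annual infrastructure cost as a linear function of agent
// request volume. Per-request costs are back-derived from the
// article's mid-size tier figures, not taken from the paper itself.

function annualCost(requestsPerMonth, costPerRequest) {
  return requestsPerMonth * 12 * costPerRequest;
}

// Mid-size tier: $35,000/year over 12M requests (raw HTML),
//                $11,000/year over 12M requests (SOM-first).
const RAW_HTML_PER_REQ = 35000 / (1_000_000 * 12);
const SOM_PER_REQ = 11000 / (1_000_000 * 12);

function annualSavings(requestsPerMonth) {
  return annualCost(requestsPerMonth, RAW_HTML_PER_REQ) -
         annualCost(requestsPerMonth, SOM_PER_REQ);
}

console.log(annualSavings(1_000_000)); // ~24000/year for the mid-size tier
```

A purely linear model understates economies of scale (the large-publisher tier saves less per request than naive extrapolation suggests), which is why the paper models each tier separately.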

Beyond cost: what cooperation gets you

Cost savings are the easiest benefit to quantify, but they are not the most important one.

Content control

When you serve raw HTML to agents, you have no control over what they extract. They might grab a sidebar advertisement and present it as your editorial content. They might pull a number from a related-articles widget and attribute it to your reporting. The Air Canada chatbot got a bereavement policy wrong because it could not distinguish policy text from surrounding content.

When you serve a structured representation, you define exactly what the agent receives. The main article is explicitly labeled as main content. The sidebar is labeled as complementary. Advertisements and tracking are excluded entirely. You are curating the agent's view of your page, the same way you curate the search engine's view through structured data and the API consumer's view through endpoint design.

Attribution

This is the piece that excites me most about the cooperative model. A structured representation can include provenance metadata: unique identifiers for each content element that enable agents to cite specific sources. Not just "according to nytimes.com," but "according to paragraph 3 of the main content region on nytimes.com/article/xyz, published March 15, 2026."

This creates the foundation for an attribution system that does not exist today. When agents cite structured sources with element-level provenance, publishers can:

  • Track which content elements are most frequently cited by agents
  • Measure their influence in the agent-mediated information ecosystem
  • Build a case for licensing based on demonstrated usage data
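An agent-side citation built from element-level provenance might be assembled like this; the field names (url, region, elementId, published) are illustrative assumptions, not the SOM schema:

```javascript
// Sketch: formatting an element-level citation from SOM-style
// provenance metadata. All field names are illustrative assumptions.
function formatCitation(prov) {
  return `according to ${prov.elementId} of the ${prov.region} ` +
         `region on ${prov.url}, published ${prov.published}`;
}

const cite = formatCitation({
  url: "nytimes.com/article/xyz",
  region: "main content",
  elementId: "paragraph 3",
  published: "March 15, 2026",
});
// cite: "according to paragraph 3 of the main content region on
// nytimes.com/article/xyz, published March 15, 2026"
```

The same identifiers that make the citation possible also make it auditable: a publisher can log which element IDs agents request and cite.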

Future-proofing

Agent traffic is growing at 18-30% year over year. Some individual crawlers are growing 300%+ annually. The publishers who build cooperative infrastructure now will have it in place when agent-mediated content discovery becomes the primary channel. The publishers who wait will scramble to catch up, the same way many scrambled to adopt SEO, social sharing, and mobile-responsive design after those transitions were already well underway.

How it actually works

There are three levels of implementation, from simple to comprehensive:

Level 1: Static SOM files

Generate SOM representations at build time and serve them as static files alongside your HTML. This works for any site with a build pipeline (Hugo, Next.js, Astro, Jekyll, etc.).

# Install the SOM compiler
npm install plasmate-wasm

# In your build script, after generating HTML:
shopt -s globstar  # bash: enable recursive ** globbing
for file in public/**/*.html; do
  plasmate compile "$file" > "${file%.html}.som.json"
done

Agents discover SOM files through a .well-known/som.json manifest or <link rel="alternate"> tags in your HTML. No server-side logic required.
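The `<link rel="alternate">` route is a one-line addition to each page's head, using the same `application/som+json` media type that content negotiation relies on (the href here is illustrative):

```html
<link rel="alternate" type="application/som+json" href="/article/xyz.som.json">
```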

Level 2: Content negotiation

Add a middleware that checks the Accept header or user agent and serves SOM to agents, HTML to browsers. This is a few lines in any web framework:

// Express/Next.js middleware
app.use((req, res, next) => {
  if (req.accepts('application/som+json') || isAgentUA(req)) {
    return res.sendFile(somPathFor(req.path));
  }
  next(); // serve HTML as normal
});

Level 3: Robots.txt directives

Declare your SOM endpoint in robots.txt so agents know to request it:

User-agent: *
SOM-Endpoint: /.well-known/som/{path}.json
SOM-Version: 1.0

This is the cooperative content negotiation model described in our robots.txt extension proposal.

The licensing question

I would be dishonest if I did not address the elephant in the room. Some publishers are pursuing licensing deals with AI companies (OpenAI has signed agreements with several major publishers, and Perplexity has launched a revenue-sharing program). If you can get paid directly for your content, that changes the calculus.

But licensing deals are available only to a handful of the largest publishers. The vast majority of websites (the millions of blogs, documentation sites, small news outlets, government portals, community forums, and niche publications that make up the long tail of the web) will never get a licensing call from OpenAI.

For those publishers, the choice is not "license or cooperate." It is "serve raw HTML to agents that extract your content without attribution or compensation" versus "serve structured content that costs you less, gives you more control, and positions you for attribution when the infrastructure matures."

Cooperation is not a substitute for licensing. It is the practical path for publishers who will never get a licensing deal but still want to reduce costs, maintain control, and participate in the agent-mediated web.

The precedent

Every major transition in how the web is consumed has followed the same pattern:

  1. A new consumer class emerges (search engines, applications, agents)
  2. Publishers initially resist or ignore the new consumer
  3. An economic incentive emerges (search visibility, API integrations, agent inclusion)
  4. Publishers adopt purpose-built infrastructure (sitemaps, APIs, ???)
  5. Early adopters gain structural advantages that late adopters struggle to overcome

  • 1998: Search engines emerge. Publishers resist, then adopt sitemaps and robots.txt.
  • 2005: Applications emerge. Publishers resist, then build APIs and webhooks.
  • 2024: AI agents surge. Publishers block or tolerate; the economic incentive is forming.
  • 202X: Cooperative infrastructure. Structured representations, content negotiation, attribution.

We are between steps 2 and 3 right now. Most publishers are either resisting (blocking agents) or ignoring (serving raw HTML). The economic incentive is forming: agent traffic is growing, agent-mediated discovery is replacing some search traffic, and the cost of serving raw HTML to agents is quantifiable.

The infrastructure for step 4 is being built. SOM, AWP, the robots.txt extensions, the content negotiation patterns. Whether these specific technologies become the standards or something else does, the directional trend is clear: publishers will eventually serve structured content to agents, because the economics and the incentives demand it.

The question is not whether publishers will cooperate with AI agents. It is whether they will do so on terms they define, or on terms imposed by the agents that crawl them regardless.

The publishers who engage now, while the infrastructure is still forming, get to shape those terms. The publishers who wait will accept whatever standards emerge without their input.

I know which position I would rather be in. And I suspect most publishers, once they see the numbers, will agree.

Calculate your token savings

Worked example from the interactive calculator: at 10,000 agent requests/month and $2.50 per million input tokens, agents consuming raw HTML cost roughly $830/month in token spend versus roughly $208/month for SOM. The publisher saves about $622/month (75%).
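That arithmetic is easy to reproduce. A sketch, with per-page token counts as assumptions chosen to match those defaults (roughly 33,200 tokens for a full HTML page versus 8,320 for its SOM representation):

```javascript
// Sketch of the token-cost arithmetic. Per-page token counts are
// assumptions chosen to reproduce the calculator's default figures.
function monthlyTokenCost(requestsPerMonth, tokensPerPage, dollarsPerMTok) {
  return requestsPerMonth * tokensPerPage * dollarsPerMTok / 1e6;
}

const requests = 10_000;
const price = 2.5;          // dollars per million input tokens
const htmlTokens = 33_200;  // assumed tokens per full HTML page
const somTokens = 8_320;    // assumed tokens per SOM document

const htmlCost = monthlyTokenCost(requests, htmlTokens, price); // 830
const somCost = monthlyTokenCost(requests, somTokens, price);   // 208
const saved = htmlCost - somCost;                               // 622
```

Note the roughly 4:1 token ratio between HTML and SOM lines up with the earlier observation that agents extract maybe 25% of a page's content and discard the rest.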

The full cost-benefit analysis is published as a research paper with worked examples for three publisher tiers, sensitivity analysis across traffic growth scenarios, and case studies for news, e-commerce, and documentation sites.


David Hurley is the founder of Plasmate Labs. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at dbhurley.com/blog and publishes research at dbhurley.com/papers.