Does Format Matter? WebTaskBench Results

April 8th, 2026

AI agents load hundreds of millions of web pages every day, and every one of those pages is served in a format designed for human eyes. WebTaskBench measures what happens when you fix that.


The Problem

HTML was designed for human eyes.

AI agents have no eyes.

Every web page an agent reads was built to be rendered — laid out with CSS, painted with gradients, animated with transitions. Agents skip all of that. They parse the raw markup, wade through thousands of tokens of presentation noise, and try to extract meaning from a format that was never designed for them.

The Numbers

  • 33,000 tokens per page (raw HTML average)
  • 75% is presentation (invisible to agents)
  • $1B–$5B annual waste (industry-wide estimate)
  • 400M+ agent page loads per day (growing fast)

The average web page delivers 33,000 tokens of HTML to an AI agent. Three-quarters of those tokens encode visual presentation — CSS classes, layout wrappers, decorative elements — that is completely irrelevant to agent reasoning.

At current token pricing, the industry spends between $1 billion and $5 billion per year processing tokens that carry zero information value.

Three Representations

We tested three ways to represent the same web page to an AI agent:

Raw HTML (33K tokens)

    <div class="flex items-center justify-between px-4 py-2 bg-white dark:bg-gray-900 border-b border-gray-200">
      <a href="/pricing" class="text-sm font-medium text-gray-700 hover:text-primary transition-colors duration-200">
        Pricing
      </a>
    </div>

Semantic Object Model (8K tokens)

    {
      "role": "link",
      "text": "Pricing",
      "href": "/pricing",
      "actions": ["click"]
    }

Raw HTML — what agents receive today. Full markup with every CSS class, ARIA attribute, and layout wrapper.

Markdown — a cleaned text extraction. Smaller, but loses interactive elements, page structure, and action metadata.

SOM (Semantic Object Model) — a structured representation that preserves meaning, hierarchy, and interactivity while stripping presentation. What Plasmate produces.
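To make the difference concrete, here is a minimal sketch of how a link element could be distilled into a SOM-style node. It uses Python's standard html.parser and is an illustration only; the SOMExtractor class and its behavior are assumptions for this example, not Plasmate's actual implementation.

```python
from html.parser import HTMLParser

class SOMExtractor(HTMLParser):
    """Collects link elements as SOM-style nodes, dropping presentation attributes.
    Hypothetical sketch -- not Plasmate's real pipeline."""

    def __init__(self):
        super().__init__()
        self.nodes = []       # extracted SOM-style nodes
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.nodes.append({
                "role": "link",
                "text": " ".join(t for t in self._text if t),
                "href": self._href,
                "actions": ["click"],
            })
            self._href = None

html = '<div class="flex items-center px-4"><a href="/pricing" class="text-sm">Pricing</a></div>'
p = SOMExtractor()
p.feed(html)
print(p.nodes)
# [{'role': 'link', 'text': 'Pricing', 'href': '/pricing', 'actions': ['click']}]
```

All the Tailwind-style classes vanish; what survives is exactly what an agent needs to reason about the element.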

WebTaskBench Design

  • 150 agent tasks (real-world scenarios)
  • 50 websites (diverse domains)
  • 4 LLMs tested (GPT-4o, Claude, Gemini, Llama)
  • 6 task categories (extraction to adversarial)

An open benchmark designed to isolate the effect of format on agent performance. Same tasks, same models, same websites — only the input representation changes.

Task categories:

  1. Extraction — pull specific data from a page
  2. Comparison — compare information across page sections
  3. Navigation — identify correct links and actions
  4. Summarization — produce accurate page summaries
  5. Adversarial — resist noise, deceptive markup, hidden content
  6. Interactive — identify clickable elements and form fields
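
For illustration, a single benchmark task might be represented like this. The field names and the exact-match scoring rule below are hypothetical, not the published WebTaskBench schema:

```python
# Hypothetical task record; field names are illustrative, not the real schema.
task = {
    "id": "nav-017",
    "category": "navigation",                        # one of the six categories
    "website": "example-store",
    "prompt": "Find the link that leads to the returns policy.",
    "representations": ["html", "markdown", "som"],  # same task, three inputs
    "gold": {"href": "/help/returns"},               # ground-truth answer
}

def score(prediction: dict, gold: dict) -> bool:
    """Exact-match scoring for navigation tasks (assumed metric)."""
    return prediction.get("href") == gold["href"]

print(score({"href": "/help/returns"}, task["gold"]))  # True
```

The key design property is that "representations" lists all three inputs for the same task, so any accuracy difference is attributable to format alone.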

Token Results

Average tokens per page
  • Raw HTML: ~33,000
  • SOM: ~8,000 (roughly 75% fewer)

Structured representations reduce input tokens by 4x compared to raw HTML. This translates directly to cost and latency savings — every token the model doesn't process is money not spent and time not wasted.
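A back-of-the-envelope calculation shows how the per-page savings compound at fleet scale. The $0.50 per million input tokens used here is an illustrative assumed price, not a quoted rate for any model:

```python
# Back-of-the-envelope savings. The price below is an illustrative
# assumption, not a quoted rate for any specific model.
PRICE_PER_M_TOKENS = 0.50                  # USD per million input tokens (assumed)
HTML_TOKENS, SOM_TOKENS = 33_000, 8_000    # benchmark averages per page
PAGE_LOADS_PER_DAY = 400_000_000           # from the figures above

def cost(tokens: int) -> float:
    """Input-token cost in USD at the assumed price."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

saved_per_page = cost(HTML_TOKENS) - cost(SOM_TOKENS)
annual_savings = saved_per_page * PAGE_LOADS_PER_DAY * 365
print(f"${saved_per_page:.4f} saved per page load")   # $0.0125
print(f"~${annual_savings / 1e9:.1f}B saved per year")
```

At that assumed price the annual figure lands inside the article's $1B–$5B estimate; it moves linearly with whatever token price you plug in.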

The Compression Paradox

  • GPT-4o: 2.74s average latency on HTML vs. 1.44s on SOM (47% faster)
  • Claude: 16.2s average latency on HTML vs. 8.5s on SOM (48% faster)

Structured format is faster than HTML on every model tested, and the gains are consistent across models. This suggests that explicit structure reduces model reasoning overhead: the model spends less time figuring out what the page contains and more time answering the question.

Accuracy by Category

Task accuracy by category (GPT-4o), SOM vs. HTML baseline (out of 100):

  • Extraction: 94 vs. 79
  • Comparison: 91 vs. 71
  • Navigation: 89 vs. 63
  • Summarization: 88 vs. 82
  • Adversarial: 82 vs. 54
  • Interactive: 96 vs. 61

The biggest gains come in interactive element identification (+35 points), adversarial resistance (+28 points), and navigation (+26 points): precisely the categories where presentation markup creates the most noise. The adversarial improvement follows directly from stripping out the deceptive markup that confuses agents.
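The per-category deltas quoted above can be recomputed directly from the two score tables:

```python
# Scores transcribed from the GPT-4o results above (out of 100).
som  = {"extraction": 94, "comparison": 91, "navigation": 89,
        "summarization": 88, "adversarial": 82, "interactive": 96}
html = {"extraction": 79, "comparison": 71, "navigation": 63,
        "summarization": 82, "adversarial": 54, "interactive": 61}

deltas = {k: som[k] - html[k] for k in som}
for category, gain in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{category:14s} {gain:+d}")
# interactive +35, adversarial +28, navigation +26,
# comparison +20, extraction +15, summarization +6
```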

Hallucination Taxonomy

We identified four types of agent hallucination when reading web pages:

Structural hallucination — the agent invents page elements that don't exist, like buttons or sections. Most common with HTML input where the model must infer structure from markup noise.

Content hallucination — the agent misquotes or fabricates text from the page. Occurs across all formats but is amplified by token-heavy inputs that push content outside the model's attention window.

Attribution hallucination — the agent attributes information to the wrong section or element. Common when HTML nesting is deep and relationships between elements are ambiguous.

Inference hallucination — the agent draws conclusions not supported by page content. Format-independent but triggered more often when the model has to reason about noisy inputs.

Provenance

The SOM format includes provenance metadata — every structured element traces back to its source in the original HTML. This enables something raw HTML and markdown cannot: programmatic verification of agent claims.

Agent claim: "The Enterprise plan costs $99/month and includes unlimited API calls."

Provenance check:
  "$99/month"
    source_element: #pricing-enterprise
    source_text: "$99/month"
    confidence: verified ✓
  "unlimited API calls"
    source_element: null
    confidence: unverified ✗

When an agent makes a claim about page content, the provenance chain lets you verify whether that claim is grounded in an actual page element — or hallucinated. In this example, the price is verified but "unlimited API calls" has no source element, flagging it as potentially fabricated.
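The verification step can be sketched as a lookup against an index of SOM elements and their source text. The index contents and result schema here are assumed for illustration; a real provenance chain would carry precise references into the original HTML rather than plain strings:

```python
# Assumed sample index: SOM element ids mapped to their source text.
som_index = {
    "#pricing-enterprise": "Enterprise plan: $99/month. 10,000 API calls included.",
}

def verify(claim_fragment: str, index: dict) -> dict:
    """Return the first element whose source text contains the claimed fragment."""
    for element_id, text in index.items():
        if claim_fragment.lower() in text.lower():
            return {"source_element": element_id, "confidence": "verified"}
    return {"source_element": None, "confidence": "unverified"}

print(verify("$99/month", som_index))
# {'source_element': '#pricing-enterprise', 'confidence': 'verified'}
print(verify("unlimited API calls", som_index))
# {'source_element': None, 'confidence': 'unverified'}
```

The price grounds out in a real element; the "unlimited API calls" fragment does not, so it gets flagged for review.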

The Fourth Layer

The web has three infrastructure layers: DNS resolves names, HTTP moves data, HTML renders pages. AI agents need a fourth: a semantic layer that makes pages machine-readable without making them machine-dependent.

WebTaskBench demonstrates that format is not a minor optimization — it is a fundamental infrastructure problem. The gap between HTML and structured representations is too large to close with better models. We need better formats.

The Semantic Object Model, the Agent Web Protocol, and cooperative content negotiation are the building blocks. The benchmark, evaluation harness, and all data are open source.
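Cooperative content negotiation can ride on plain HTTP. The sketch below builds a request that asks for a SOM representation first and falls back to HTML; the application/som+json media type is an assumption for illustration, not a registered type:

```python
import urllib.request

# "application/som+json" is assumed for illustration; it is not a
# registered MIME type.
def som_request(url: str) -> urllib.request.Request:
    """Build a request that prefers SOM and falls back to ordinary HTML."""
    return urllib.request.Request(url, headers={
        "Accept": "application/som+json, text/html;q=0.5",
        "User-Agent": "example-agent/0.1",
    })

req = som_request("https://example.com/pricing")
print(req.get_header("Accept"))  # application/som+json, text/html;q=0.5
```

A cooperating server sees the Accept header and returns the structured representation; a legacy server ignores it and serves HTML, so nothing breaks for existing sites.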

Open Source

  • 150 benchmark tasks (open dataset)
  • 50 test websites (diverse corpus)
  • 8 research papers (all published)
  • 100% open source (code + data + results)

Everything is available:

  • WebTaskBench — the full benchmark dataset, evaluation harness, and results
  • Plasmate — the headless browser engine that produces SOM representations
  • SOM Specification — the Semantic Object Model format definition
  • Agent Web Protocol — the communication protocol for agent-web interaction
  • All papers — published at dbhurley.com/papers