Does Format Matter? WebTaskBench Results

April 8th, 2026

AI agents load hundreds of millions of web pages every day, and every one of those pages is served in a format designed for human eyes. WebTaskBench measures what happens when you fix that.


The Problem

HTML was designed for human eyes.

AI agents have no eyes.

Every web page an agent reads was built to be rendered — laid out with CSS, painted with gradients, animated with transitions. Agents skip all of that. They parse the raw markup, wade through thousands of tokens of presentation noise, and try to extract meaning from a format that was never designed for them.

The Numbers

  • 33,000 tokens per page (raw HTML average)
  • 75% is presentation (invisible to agents)
  • $1B–$5B annual waste (industry-wide estimate)
  • 400M+ agent page loads per day (growing fast)

The average web page delivers 33,000 tokens of HTML to an AI agent. Three-quarters of those tokens encode visual presentation — CSS classes, layout wrappers, decorative elements — that is completely irrelevant to agent reasoning.

At current token pricing, the industry spends between $1 billion and $5 billion per year processing tokens that carry zero information value.

Three Representations

We tested three ways to represent the same web page to an AI agent:

Raw HTML (33K tokens)

    <div class="flex items-center justify-between px-4 py-2 bg-white dark:bg-gray-900 border-b border-gray-200">
      <a href="/pricing" class="text-sm font-medium text-gray-700 hover:text-primary transition-colors duration-200">
        Pricing
      </a>
    </div>

Semantic Object Model (8K tokens)

    {
      "role": "link",
      "text": "Pricing",
      "href": "/pricing",
      "actions": ["click"]
    }

Raw HTML — what agents receive today. Full markup with every CSS class, ARIA attribute, and layout wrapper.

Markdown — a cleaned text extraction. Smaller, but loses interactive elements, page structure, and action metadata.

SOM (Semantic Object Model) — a structured representation that preserves meaning, hierarchy, and interactivity while stripping presentation. What Plasmate produces.
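To make the difference concrete, here is a minimal sketch of how a link element could be distilled into a SOM-style node. It uses Python's standard html.parser and is an illustration only; the SOMExtractor class and its behavior are assumptions for this example, not Plasmate's actual implementation.

```python
from html.parser import HTMLParser

class SOMExtractor(HTMLParser):
    """Collects link elements as SOM-style nodes, dropping presentation attributes.
    Hypothetical sketch -- not Plasmate's real pipeline."""

    def __init__(self):
        super().__init__()
        self.nodes = []       # extracted SOM-style nodes
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.nodes.append({
                "role": "link",
                "text": " ".join(t for t in self._text if t),
                "href": self._href,
                "actions": ["click"],
            })
            self._href = None

html = '<div class="flex items-center px-4"><a href="/pricing" class="text-sm">Pricing</a></div>'
p = SOMExtractor()
p.feed(html)
print(p.nodes)
# [{'role': 'link', 'text': 'Pricing', 'href': '/pricing', 'actions': ['click']}]
```

All the Tailwind-style classes vanish; what survives is exactly what an agent needs to reason about the element.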

WebTaskBench Design

  • 150 agent tasks (real-world scenarios)
  • 50 websites (diverse domains)
  • 4 LLMs tested (GPT-4o, Claude, Gemini, Llama)
  • 6 task categories (extraction to adversarial)

An open benchmark designed to isolate the effect of format on agent performance. Same tasks, same models, same websites — only the input representation changes.

Task categories:

  1. Extraction — pull specific data from a page
  2. Comparison — compare information across page sections
  3. Navigation — identify correct links and actions
  4. Summarization — produce accurate page summaries
  5. Adversarial — resist noise, deceptive markup, hidden content
  6. Interactive — identify clickable elements and form fields
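
For illustration, a single benchmark task might be represented like this. The field names and the exact-match scoring rule below are hypothetical, not the published WebTaskBench schema:

```python
# Hypothetical task record; field names are illustrative, not the real schema.
task = {
    "id": "nav-017",
    "category": "navigation",                        # one of the six categories
    "website": "example-store",
    "prompt": "Find the link that leads to the returns policy.",
    "representations": ["html", "markdown", "som"],  # same task, three inputs
    "gold": {"href": "/help/returns"},               # ground-truth answer
}

def score(prediction: dict, gold: dict) -> bool:
    """Exact-match scoring for navigation tasks (assumed metric)."""
    return prediction.get("href") == gold["href"]

print(score({"href": "/help/returns"}, task["gold"]))  # True
```

The key design property is that "representations" lists all three inputs for the same task, so any accuracy difference is attributable to format alone.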

Token Results

Average tokens per page
  • Raw HTML: ~33,000
  • SOM: ~8,000 (roughly 75% fewer)

Structured representations reduce input tokens by 4x compared to raw HTML. This translates directly to cost and latency savings — every token the model doesn't process is money not spent and time not wasted.
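A back-of-the-envelope calculation shows how the per-page savings compound at fleet scale. The $0.50 per million input tokens used here is an illustrative assumed price, not a quoted rate for any model:

```python
# Back-of-the-envelope savings. The price below is an illustrative
# assumption, not a quoted rate for any specific model.
PRICE_PER_M_TOKENS = 0.50                  # USD per million input tokens (assumed)
HTML_TOKENS, SOM_TOKENS = 33_000, 8_000    # benchmark averages per page
PAGE_LOADS_PER_DAY = 400_000_000           # from the figures above

def cost(tokens: int) -> float:
    """Input-token cost in USD at the assumed price."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

saved_per_page = cost(HTML_TOKENS) - cost(SOM_TOKENS)
annual_savings = saved_per_page * PAGE_LOADS_PER_DAY * 365
print(f"${saved_per_page:.4f} saved per page load")   # $0.0125
print(f"~${annual_savings / 1e9:.1f}B saved per year")
```

At that assumed price the annual figure lands inside the article's $1B–$5B estimate; it moves linearly with whatever token price you plug in.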

The Compression Paradox

  • GPT-4o: 2.74s average latency on HTML vs. 1.44s on SOM (47% faster)
  • Claude: 16.2s average latency on HTML vs. 8.5s on SOM (48% faster)

Structured format is faster than HTML on every model tested, and the gains are consistent across models. This suggests that explicit structure reduces model reasoning overhead: the model spends less time figuring out what the page contains and more time answering the question.

Accuracy by Category

Task accuracy by category (GPT-4o), SOM vs. HTML baseline (out of 100):

  • Extraction: 94 vs. 79
  • Comparison: 91 vs. 71
  • Navigation: 89 vs. 63
  • Summarization: 88 vs. 82
  • Adversarial: 82 vs. 54
  • Interactive: 96 vs. 61

The biggest gains come in interactive element identification (+35 points), adversarial resistance (+28 points), and navigation (+26 points): precisely the categories where presentation markup creates the most noise. The adversarial improvement follows directly from stripping out the deceptive markup that confuses agents.
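The per-category deltas quoted above can be recomputed directly from the two score tables:

```python
# Scores transcribed from the GPT-4o results above (out of 100).
som  = {"extraction": 94, "comparison": 91, "navigation": 89,
        "summarization": 88, "adversarial": 82, "interactive": 96}
html = {"extraction": 79, "comparison": 71, "navigation": 63,
        "summarization": 82, "adversarial": 54, "interactive": 61}

deltas = {k: som[k] - html[k] for k in som}
for category, gain in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{category:14s} {gain:+d}")
# interactive +35, adversarial +28, navigation +26,
# comparison +20, extraction +15, summarization +6
```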

Hallucination Taxonomy

We identified four types of agent hallucination when reading web pages:

Structural hallucination — the agent invents page elements that don't exist, like buttons or sections. Most common with HTML input where the model must infer structure from markup noise.

Content hallucination — the agent misquotes or fabricates text from the page. Occurs across all formats but is amplified by token-heavy inputs that push content outside the model's attention window.

Attribution hallucination — the agent attributes information to the wrong section or element. Common when HTML nesting is deep and relationships between elements are ambiguous.

Inference hallucination — the agent draws conclusions not supported by page content. Format-independent but triggered more often when the model has to reason about noisy inputs.

Provenance

The SOM format includes provenance metadata — every structured element traces back to its source in the original HTML. This enables something raw HTML and markdown cannot: programmatic verification of agent claims.

Agent claim: "The Enterprise plan costs $99/month and includes unlimited API calls."

Provenance check:
  "$99/month"
    source_element: #pricing-enterprise
    source_text: "$99/month"
    confidence: verified ✓
  "unlimited API calls"
    source_element: null
    confidence: unverified ✗

When an agent makes a claim about page content, the provenance chain lets you verify whether that claim is grounded in an actual page element — or hallucinated. In this example, the price is verified but "unlimited API calls" has no source element, flagging it as potentially fabricated.
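The verification step can be sketched as a lookup against an index of SOM elements and their source text. The index contents and result schema here are assumed for illustration; a real provenance chain would carry precise references into the original HTML rather than plain strings:

```python
# Assumed sample index: SOM element ids mapped to their source text.
som_index = {
    "#pricing-enterprise": "Enterprise plan: $99/month. 10,000 API calls included.",
}

def verify(claim_fragment: str, index: dict) -> dict:
    """Return the first element whose source text contains the claimed fragment."""
    for element_id, text in index.items():
        if claim_fragment.lower() in text.lower():
            return {"source_element": element_id, "confidence": "verified"}
    return {"source_element": None, "confidence": "unverified"}

print(verify("$99/month", som_index))
# {'source_element': '#pricing-enterprise', 'confidence': 'verified'}
print(verify("unlimited API calls", som_index))
# {'source_element': None, 'confidence': 'unverified'}
```

The price grounds out in a real element; the "unlimited API calls" fragment does not, so it gets flagged for review.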

The Fourth Layer

The web has three infrastructure layers: DNS resolves names, HTTP moves data, HTML renders pages. AI agents need a fourth: a semantic layer that makes pages machine-readable without making them machine-dependent.

WebTaskBench demonstrates that format is not a minor optimization — it is a fundamental infrastructure problem. The gap between HTML and structured representations is too large to close with better models. We need better formats.

The Semantic Object Model, the Agent Web Protocol, and cooperative content negotiation are the building blocks. The benchmark, evaluation harness, and all data are open source.
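Cooperative content negotiation can ride on plain HTTP. The sketch below builds a request that asks for a SOM representation first and falls back to HTML; the application/som+json media type is an assumption for illustration, not a registered type:

```python
import urllib.request

# "application/som+json" is assumed for illustration; it is not a
# registered MIME type.
def som_request(url: str) -> urllib.request.Request:
    """Build a request that prefers SOM and falls back to ordinary HTML."""
    return urllib.request.Request(url, headers={
        "Accept": "application/som+json, text/html;q=0.5",
        "User-Agent": "example-agent/0.1",
    })

req = som_request("https://example.com/pricing")
print(req.get_header("Accept"))  # application/som+json, text/html;q=0.5
```

A cooperating server sees the Accept header and returns the structured representation; a legacy server ignores it and serves HTML, so nothing breaks for existing sites.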

Open Source

  • 150 benchmark tasks (open dataset)
  • 50 test websites (diverse corpus)
  • 8 research papers (all published)
  • 100% open source (code + data + results)

Everything is available:

  • WebTaskBench — the full benchmark dataset, evaluation harness, and results
  • Plasmate — the headless browser engine that produces SOM representations
  • SOM Specification — the Semantic Object Model format definition
  • Agent Web Protocol — the communication protocol for agent-web interaction
  • All papers — published at dbhurley.com/papers