Your Token Bill Is Two Problems ─ David Hurley

Three months ago the conversation was whether AI would replace knowledge work. Now it is whether anyone can afford to find out. The CFO panic piece has run in every business publication I read, the runaway agent loop story has shown up in three trade journals this month, and every board deck that has crossed my desk in the last four weeks has a line item labeled "model inference" that did not exist in January.

The numbers are real. A mid-size company running production agents through one of the major frontier models is now spending six figures a month on tokens, the curve is steep, and nobody on the executive team has a clean story for why. The reaction is predictable. Capability cuts. Throttling. Long memos from finance about "responsible AI use." The CFO is right to ask the question. Most of the answers being proposed are wrong.

The token bill is not one problem. It is two.

Problem one: each call costs too much

A typical agent task involves the model reading something, thinking about it, and producing an output. The reading is where most of the tokens go. If your agent is touching the web, which most useful agents do, the reading is HTML. HTML is a document format designed for a browser to paint pixels onto a screen for a human with eyes. It was not designed for a language model to extract meaning from. The agent reads ten thousand tokens of <div class="container"> and <span data-testid="..."> and inline scripts and CSS noise to find the three pieces of information it actually needed.

The waste is not subtle. We have measured it on the top one hundred websites. The typical page wastes between eighty and ninety-five percent of its tokens on formatting the model was never going to act on. A single Google Cloud documentation page, the kind your developer agent might read while resolving a bug, comes in at around 464,000 input tokens of raw HTML. After compression to a Semantic Object Model, the same page is approximately 4,000 tokens. Same content. Same meaning. One hundred and sixteen times less context window consumed.

This is the hidden tax and it is the single largest cost line in agent operations today. The fix is not "use a cheaper model." A cheaper model still reads the same bloated HTML, the bill just becomes a slightly smaller version of the same shape. The fix is upstream of the model. You make the input smaller before it ever reaches the context window.

Plasmate is the engine I built to do exactly that. It compiles HTML into a structured representation that preserves what the page means and discards what the page looks like. We tested ninety-eight of the top one hundred sites. Ninety-four parsed successfully. Median compression was nine times. Sites built as heavy single-page applications compress, in some cases, by more than fifteen hundred times, because they ship megabytes of client-side state the model never needed to see.
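To make the general idea concrete, here is a minimal sketch of the crudest version of it: throw away markup, scripts, and styles, and keep only the visible text. This is not Plasmate's implementation and it does not produce a Semantic Object Model. The class name TextOnly, the helper visible_text, and the sample page are all made up for illustration, and character counts stand in for token counts.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are not inside a script or style element.
        if text and self._skip_depth == 0:
            self.parts.append(text)


def visible_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical page: a little content wrapped in a lot of markup.
SAMPLE = """<html><head><style>.container{margin:0}</style>
<script>window.__STATE__ = {"a": 1};</script></head>
<body><div class="container"><span data-testid="price">Price: $20</span>
<p>Ships in 2 days.</p></div></body></html>"""

text = visible_text(SAMPLE)
ratio = len(SAMPLE) / len(text)  # crude size ratio, in characters
print(text)
print(f"{ratio:.1f}x smaller")
```

Even this toy version shrinks the sample several times over. A real compiler has to do much more, such as keeping links, form fields, and structure the agent can act on, which plain text extraction loses.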

Nine to fifteen hundred times less input is not a feature. It is a different cost structure.

The math is hard to argue with. If most of your inference spend is input tokens, and for agent workloads it almost always is, an order of magnitude compression on inputs is an order of magnitude cut to the line that dominates your bill. The model is the same. The agent is the same. The work is the same. What changed is that you stopped paying a frontier model to read CSS class names.
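Here is that arithmetic as a short sketch. The token counts are the ones quoted above for the documentation page. The price per million input tokens and the eighty percent input share are assumptions chosen only to make the numbers concrete, so substitute your own.

```python
# Assumed price in USD per million input tokens; not any vendor's actual rate.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD of reading `tokens` input tokens once."""
    return tokens / 1_000_000 * price_per_million


def bill_after_compression(input_share: float, compression: float) -> float:
    """Fraction of the original bill that remains when only the input
    portion (input_share of the bill) shrinks by `compression` times."""
    return (1 - input_share) + input_share / compression


raw_tokens = 464_000       # raw HTML page, figure from the text
compressed_tokens = 4_000  # same page after compression, figure from the text

raw_cost = input_cost(raw_tokens)
compressed_cost = input_cost(compressed_tokens)
ratio = raw_tokens / compressed_tokens

# Assumption: inputs are 80% of the bill; 9x is the median compression above.
remaining = bill_after_compression(input_share=0.8, compression=9)

print(f"one read: ${raw_cost:.3f} raw vs ${compressed_cost:.3f} compressed ({ratio:.0f}x)")
print(f"bill remaining at 80% input share and 9x compression: {remaining:.0%}")
```

The second function is the part worth adapting. The larger the input share of your bill, the closer the overall saving gets to the compression ratio itself.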

Problem two: a lot of those calls should not have happened

The other half of the bill is not about price per call. It is about whether the call should have been made at all.

Multi-agent systems have a runaway property that is easy to underestimate until it appears in your billing dashboard. An agent that decides what to do next can decide to make another call, that call can spawn three more, and those three can spawn nine. The system is working, in the sense that nothing crashed and the logs look normal, but it is also burning thousands of dollars an hour pursuing a task that a human, had one been watching, would have stopped at minute two.
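The one, three, nine pattern is a geometric series, and a few lines of arithmetic show how quickly it gets away from you. This sketch assumes a constant branching factor of three, which real systems will not follow exactly; the function name total_calls is made up for illustration.

```python
def total_calls(branching: int, depth: int) -> int:
    """Total calls made when every call spawns `branching` more,
    counted from the first call (level 0) down to level `depth`."""
    return sum(branching ** level for level in range(depth + 1))


for depth in range(7):
    print(depth, total_calls(3, depth))
```

Two levels down is 13 calls. Six levels down is 1,093. Nothing in the logs looks different between those two cases except the count.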

There is a familiar shape to the conversation when this happens. Engineering says the agent is doing what it was told. Finance says the agent is doing too much of what it was told. Both are correct. The agent had no concept of "should." It only had a concept of "can."

MeshGuard is the runtime I built to put a "should" between the agent's plan and the agent's action. Every action passes through a policy layer that evaluates it against rules you have defined. If the action violates a rule, MeshGuard blocks it before the tokens are spent. Not after. Not at the quarterly review. At the moment of decision.

The first three policies most teams write are not exotic. They are about cost containment. Do not call this expensive model for tasks the cheaper model has handled before. Do not loop more than five times on a planning task without escalating to a human. Do not spawn a sub-agent when the parent agent has already burned its budget for this request. These are not ethical constraints. They are operational ones, and they are exactly the kind of constraint that disappears the moment you stop watching the agent for a week.
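Two of those policies can be sketched as a small gate that runs before an action is taken. This is a hypothetical illustration, not MeshGuard's API or rule syntax. The names Action, Decision, evaluate, and MAX_PLANNING_LOOPS are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str                # hypothetical kinds: "plan", "spawn_subagent", "model_call"
    loop_count: int = 0      # planning loops already completed for this request
    spent_usd: float = 0.0   # spend so far on this request
    budget_usd: float = 0.0  # budget allotted to this request


@dataclass
class Decision:
    allowed: bool
    reason: str


MAX_PLANNING_LOOPS = 5  # "do not loop more than five times", from the text


def evaluate(action: Action) -> Decision:
    """Decide before the action runs, so a blocked action spends no tokens."""
    if action.kind == "plan" and action.loop_count >= MAX_PLANNING_LOOPS:
        return Decision(False, "planning loop limit reached; escalate to a human")
    if action.kind == "spawn_subagent" and action.spent_usd >= action.budget_usd:
        return Decision(False, "request budget exhausted; sub-agent not spawned")
    return Decision(True, "no rule violated")


print(evaluate(Action("plan", loop_count=5)))
print(evaluate(Action("spawn_subagent", spent_usd=2.0, budget_usd=10.0)))
```

The point of the shape is the ordering: the check returns a decision and a reason before any model call is issued, so the reason can be logged even when nothing was spent.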

The difference between MeshGuard and a rate limiter is that the rule is about intent, not volume. A rate limiter says "no more than ten thousand calls per hour." MeshGuard says "no calls of this kind from this agent in this context for this purpose, regardless of volume." The agent that was about to walk into a thousand-dollar loop never makes the first call. The tokens are not spent. The decision trace shows what would have happened and why it was blocked, which is also what your auditors are going to want to see when the regulators arrive.
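The contrast between a volume rule and an intent rule is easy to see side by side. Again, this is a sketch under assumptions, not MeshGuard's rule format: the Call fields, the DENIED set, and the agent and purpose names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    agent: str
    kind: str
    context: str
    purpose: str


def rate_limit_allows(calls_this_hour: int, limit: int = 10_000) -> bool:
    """Volume rule: only the count matters, not what the call is for."""
    return calls_this_hour < limit


# Hypothetical deny-list of (agent, kind, context, purpose) combinations.
DENIED = {("research-agent", "frontier_model_call", "bulk_crawl", "summarize")}


def intent_allows(call: Call, trace: list) -> bool:
    """Intent rule: match on who is calling, what, where, and why.
    Every decision is appended to `trace`, allowed or not."""
    key = (call.agent, call.kind, call.context, call.purpose)
    allowed = key not in DENIED
    trace.append({"call": key, "allowed": allowed})
    return allowed


trace = []
call = Call("research-agent", "frontier_model_call", "bulk_crawl", "summarize")

print(rate_limit_allows(calls_this_hour=1))  # the rate limiter sees no problem
print(intent_allows(call, trace))            # the intent rule blocks the first call
print(trace)
```

The rate limiter would have let this call through, because it was only the first of the hour. The intent rule stops it at call one and leaves a record of what was attempted.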

The two are not the same axis

The reason this matters, and the reason I am writing it down, is that the standard reaction to a high token bill is to pick one of these two problems and ignore the other.

Engineering teams pick efficiency. They find a way to compress the prompts or switch to a cheaper model. The bill drops twenty percent. The CFO notices. Work continues. Three months later the runaway loop problem shows up at a different scale and the bill is high again, just in a new shape.

Or the team picks effectiveness. They build a governance layer, throttle the loops, restrict the agents. The bill drops twenty percent. The CFO notices. The agents are now slower, the engineering team is grumbling about the governance system being in the way, and the underlying per-call cost is still as high as it ever was. Three months later the bill rises again because volume returned to normal.

You need both. Plasmate cuts the cost per call by a factor that is difficult to believe until you measure it on your own traffic. MeshGuard cuts the number of calls that should not have happened by putting a policy gate between the agent's planning and the agent's action. The combined effect is not additive. It is multiplicative: the bill is the number of calls times the cost per call, each fix shrinks a different factor, and the agent that does not run the loop does not pay even the reduced per-call cost for it.
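A worked example shows why the two fixes multiply. Every number here is an assumption picked for round arithmetic: 100,000 calls a month at one dollar each, a nine times cut in cost per call (which treats the whole call cost as input tokens), and a policy gate that removes thirty percent of calls.

```python
def bill(calls: int, cost_per_call: float) -> float:
    """Monthly bill as number of calls times cost per call."""
    return calls * cost_per_call


baseline = bill(100_000, 1.00)
compressed_only = bill(100_000, 1.00 / 9)  # cheaper calls, same count
gated_only = bill(70_000, 1.00)            # fewer calls, same price
both = bill(70_000, 1.00 / 9)              # fewer and cheaper

for name, value in [
    ("baseline", baseline),
    ("compression only", compressed_only),
    ("policy gate only", gated_only),
    ("both", both),
]:
    print(f"{name:>17}: ${value:,.0f} ({value / baseline:.1%} of baseline)")
```

With these assumed inputs, the fraction of the bill left after both fixes is the product of the two individual fractions, which is what it means for the two problems to sit on different axes.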

I am skeptical of any agent operations story that does not address both. A CFO who reads the token bill and asks one question is asking the wrong question. The right question is two questions. Why does each call cost what it costs? Why are we making the calls we are making? Those are different problems with different fixes. They do not collapse into each other no matter how badly anyone wants them to.

The token panic is, in part, a story about how fast the bill rose. It is also a story about how few of the answers being proposed actually attack the cause. We can argue about pricing. We can argue about model choice. We can argue about agentic versus non-agentic architectures. None of that changes the fact that most of the tokens being consumed today are being spent on inputs the agent did not need to see, by agents that did not need to take the action they were about to take.

Fix those two things and the bill is not a crisis anymore. It is a budget line.

The CFOs are not wrong to be alarmed. They are just being given the wrong tools to act on the alarm. Plasmate and MeshGuard are not the only tools. They are the two I built, because the two problems I just described are the two problems I kept seeing, in every agent deployment, every quarter, for two years running. The bill is two problems. Solve both.