What an AI Agency Actually Needs to Work

April 20th, 2026

The service layer thesis is attracting founders. The hard part is not the thesis. It is operationalizing it. Here is what you actually need to build, and why the piece that matters most is the one you are least likely to build first.


I keep having the same conversation. A founder reads The Retention Layer or The Service Layer, gets excited about the economics of delivering outcomes at software margins, and asks: what do I actually need to build?

They expect me to talk about models and frameworks and infrastructure. I do, eventually. But the first thing I tell every one of them is that the most important piece of the entire architecture is the thing they are least likely to build first.

It is the evaluation function. And if they get it wrong, nothing else matters.

The evaluation function is the product

Here is a claim I will defend: in a service-as-software business, the evaluation function is more important than the agent itself.

The agent can be rebuilt. The prompts can be rewritten. The tools can be swapped out. The foundation model will be upgraded every few months anyway. All of these components are replaceable because the ACP self-evolution loop will adapt whatever you give it toward the local optimum for each customer. That is the entire point of the retention layer.

But the evolution loop can only converge toward the right optimum if it knows what "right" looks like. That knowledge lives in the evaluation function. It answers, after every evolution cycle, a single binary question: is the candidate version better than the current version?

If the evaluation function answers that question reliably, the rest of the system self-corrects over time. If it does not, the system either stalls (because it cannot distinguish improvement from noise) or drifts (because it optimizes toward the wrong target). Classic Goodhart's Law applied to autonomous systems.

TSIA's 2026 State of Reports found consistent patterns across every major service line: organizations are investing in AI and automating core functions, but ROI measurement has not caught up. The competitive advantage now belongs to companies that can measure value, not just deliver it. For an AI agency, measuring value means building an evaluation function that measures the same outcome the customer pays for. If you charge per claim resolved, the evaluation function measures claim resolution accuracy. If you charge per lease renewed, it measures renewal completion. If you charge per load delivered, it measures on-time delivery rate.

This alignment between evaluation metric and pricing metric is not just good engineering. It is the mechanism that prevents your self-evolution loop from optimizing the wrong thing. When the evaluation function measures what the customer values, every evolution cycle makes the agent better at delivering what the customer is paying for. When it measures something else, every evolution cycle makes the agent better at something nobody cares about.
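To make the binary question concrete, here is a minimal sketch in Python. All names are hypothetical; the point is that the evaluation metric is deliberately the same claim-resolution accuracy the customer is billed on, and the gate asks one question per cycle.

```python
from dataclasses import dataclass

# Hypothetical outcome record: one resolved claim, scored against
# the same metric the customer is billed on.
@dataclass
class Outcome:
    resolved_correctly: bool

def resolution_accuracy(outcomes: list[Outcome]) -> float:
    """The evaluation metric, deliberately identical to the pricing metric."""
    if not outcomes:
        return 0.0
    return sum(o.resolved_correctly for o in outcomes) / len(outcomes)

def candidate_is_better(current: list[Outcome], candidate: list[Outcome],
                        min_gain: float = 0.01) -> bool:
    """The single binary question asked after each evolution cycle.
    min_gain guards against promoting noise as improvement."""
    return resolution_accuracy(candidate) >= resolution_accuracy(current) + min_gain

current = [Outcome(True), Outcome(True), Outcome(False), Outcome(False)]   # 50%
candidate = [Outcome(True), Outcome(True), Outcome(True), Outcome(False)]  # 75%
promote = candidate_is_better(current, candidate)
```

The `min_gain` margin is the anti-stall, anti-drift knob: too low and the loop chases noise, too high and genuine improvements never ship.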

The evaluation function is not a feature of the product. It is the product. Everything else is infrastructure that serves it.

The execution trace corpus

The second thing an AI agency needs is something that does not exist on day one and cannot be purchased: a corpus of execution traces generated by real-world agent operation.

Every task the agent performs should produce a structured record. What was the input. What tools were called, what they returned. What errors occurred. What the final output was. How long it took. How many tokens it consumed. Whether the output met the evaluation criteria.

This sounds like logging. It is not logging. Logging tells you what happened. Execution traces tell you why something worked or failed, which is what the reflection operator in the evolution loop needs to generate improvement hypotheses.

After six months of operation across twenty customers, this corpus contains something no amount of pre-launch research can replicate: a detailed record of how real workflows actually behave in practice. The edge cases nobody documented. The API responses that deviate from the spec. The data quality issues that only surface at scale. The seasonal patterns that change which approaches work and which do not.

McKinsey's research on agentic organizations found that almost 80 percent of organizations use generative AI in some way, but the same percentage see no impact on the bottom line. The reason is that most organizations deploy tools without rewiring workflows. The organizations that see bottom-line impact are the ones that embed AI into specific domains end-to-end and learn from operational data. The execution trace corpus is that operational data. It is the raw material from which the evolution loop extracts improvement, and it is a compounding asset. Every customer engagement adds to it. Every edge case enriches it. Every failure mode, once captured in a trace and addressed through evolution, becomes institutional knowledge that makes future engagements better.

Accountability before the first incident

The moment you accept payment for an outcome, you accept liability for that outcome. Most founders understand this intellectually. Very few build for it before their first major incident forces them to.

Mayer Brown's February 2026 analysis of contracting for agentic AI laid out exactly what is required. Six areas where standard SaaS contracts are inadequate: service definitions, warranties, outcome-based SLAs, indemnification, governance, and data ownership. Each area requires infrastructure the vendor must have in place before signing the contract. Not after. I wrote about the legal dimension of this shift in The Service Layer, but the operational implication deserves its own emphasis here.

The ACP architecture provides most of this infrastructure if built correctly from the start. Version lineage answers "why did the agent do that?" for any decision. Rollback capability answers "how do we fix it?" within minutes. Tenant isolation answers "is Customer A's data safe from Customer B?" definitively.

But there is one piece ACP does not provide natively and that every agency must build: the delegation of authority framework. The explicit, written definition of what the agent is and is not allowed to do autonomously. Decision boundaries. Escalation triggers. Human approval gates.
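At its core, a delegation of authority framework is a policy object that routes every proposed action to either autonomous execution or human escalation. A sketch, with hypothetical boundaries:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Decision(Enum):
    EXECUTE = auto()    # within delegated authority
    ESCALATE = auto()   # exceeds it: route to a human

# Hypothetical policy; the real boundaries come from the contract.
@dataclass
class DelegationPolicy:
    max_autonomous_value: float          # e.g. dollar cap per decision
    approval_required_actions: set[str]  # always gated, regardless of value

    def route(self, action: str, value: float) -> Decision:
        if action in self.approval_required_actions:
            return Decision.ESCALATE
        if value > self.max_autonomous_value:
            return Decision.ESCALATE
        return Decision.EXECUTE

policy = DelegationPolicy(max_autonomous_value=500.0,
                          approval_required_actions={"deny_claim"})
routine = policy.route("approve_claim", value=120.0)
risky = policy.route("deny_claim", value=120.0)
```

The point is that the boundaries are explicit and written down, so the same document can be referenced in the contract, enforced in code, and audited after an incident.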

The human escalation layer

Fully autonomous outcome delivery is the aspiration. It is not the launch state, and it should not be the steady state either.

Every AI agency needs a human layer that handles the cases the agent cannot. The edge cases that require judgment the agent has not yet evolved to handle. The novel situations where getting it wrong has consequences that exceed the agent's delegated authority. The emotionally sensitive interactions where a human touch is not a luxury but a necessity.

PwC's 2026 AI Agent Survey found that of organizations adopting AI agents, 66 percent report increased productivity and 55 percent report faster decision-making. But PwC also noted that broad adoption does not always mean deep impact. Many organizations use agents for routine tasks while complex decisions still require human involvement. The agencies that win will be the ones that design the boundary between agent and human deliberately rather than discovering it through failure.

The key insight is that the human escalation layer should shrink over time but never reach zero. The ACP evolution loop learns from escalated cases. Every case the agent could not handle becomes a training signal that, over multiple evolution cycles, teaches the agent to handle similar cases in the future. The human layer is simultaneously a safety net and a signal generator.
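In miniature, the signal-generating side of that loop might look like this (a hypothetical structure, not a real pipeline): an escalated case, once resolved by a human, is fed back as a labeled example for the next evolution cycle.

```python
# Hypothetical: escalated cases, once resolved by a human, become
# labeled training signals for the next evolution cycle.
escalation_queue: list[dict] = []
evolution_signals: list[dict] = []

def escalate(case: dict) -> None:
    """Route a case the agent cannot handle to the human layer."""
    escalation_queue.append(case)

def record_human_resolution(case: dict, resolution: str) -> None:
    # The human's answer becomes the signal: the frontier case
    # plus the correct outcome the agent should learn to produce.
    escalation_queue.remove(case)
    evolution_signals.append({**case, "correct_outcome": resolution})

case = {"claim_id": 7, "reason": "novel policy clause"}
escalate(case)
record_human_resolution(case, "approved under rider B")
```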

There is a feedback dynamic here that I find genuinely elegant. The cases that escalate to humans are precisely the cases where the agent's current capabilities are insufficient. They represent the frontier of what the agent needs to learn. By routing them to humans, you ensure the hardest problems are handled correctly right now. By feeding the outcomes back into the evolution loop, you ensure similar problems are handled by the agent in the future. The human layer makes the agent better, which makes the human layer smaller, which concentrates the remaining human effort on increasingly difficult edge cases, which makes the agent better at increasingly difficult tasks.

Over time this produces an agency where the human team is small, extremely skilled, and focused exclusively on the problems that genuinely require human judgment. The routine work has been absorbed by agents. The edge cases have been progressively automated through evolution. What remains is the genuinely hard stuff. The work that is worth paying senior people to do.

The contract framework

The fundamental shift is from selling access to selling results. This changes every element of the commercial relationship. The service definition changes from "you can use our platform" to "we will deliver these specified outcomes." The warranty changes from uptime guarantees to performance thresholds. The SLA changes from availability to accuracy, completion rate, and turnaround time. The remedy for failure changes from service credits to remediation for incorrect outcomes.

Mayer Brown draws heavily from BPO contract structures, and this is not a coincidence. When you hire a BPO provider you are hiring them to perform business functions on your behalf. When you hire an AI agency you are hiring agents to perform business functions on your behalf. The legal structures that evolved over decades to manage outsourced human labor are the natural starting point for managing outsourced agent labor.

Three contract elements matter most.

Bounded autonomy: the contract specifies exactly what decisions the agent can make without human approval. Broader autonomy means faster execution but higher liability. The right boundary depends on the vertical, the customer's risk tolerance, and the agent's evolution stage.

Performance-triggered evolution disclosure: the customer has visibility into how the agent's behavior has changed over the contract period. The ACP version lineage already captures this. Making it contractually available builds trust and differentiates you from competitors who treat their agent as a black box.

Graduated outcome commitment: do not promise 95 percent accuracy on day one. Promise a measurable starting baseline with contractual commitments to improvement milestones. "We will achieve 80 percent accuracy in month one, 88 percent in month three, and 93 percent in month six." This sets honest expectations while demonstrating confidence in the evolution loop, and it gives you a contractual framework for the cold start period where the customer is most likely to churn.
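The graduated commitment translates directly into a milestone schedule you can check compliance against. A sketch, using the example numbers above:

```python
# Milestone schedule matching the example commitments above:
# month of contract -> required accuracy threshold.
milestones = {1: 0.80, 3: 0.88, 6: 0.93}

def required_accuracy(month: int) -> float:
    """Return the accuracy threshold in force for a given contract month."""
    applicable = [m for m in milestones if m <= month]
    if not applicable:
        return 0.0  # before the first milestone, no commitment is in force
    return milestones[max(applicable)]

def in_compliance(month: int, measured_accuracy: float) -> bool:
    return measured_accuracy >= required_accuracy(month)

ok_month_four = in_compliance(4, measured_accuracy=0.90)     # threshold is 0.88
breach_month_six = in_compliance(6, measured_accuracy=0.91)  # threshold is 0.93
```

The schedule does double duty: it is the customer's contractual remedy trigger and your own evolution-loop target, which keeps the commercial promise and the evaluation function pointed at the same number.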

The stack

An evaluation function that measures the outcome you sell. An execution trace pipeline that captures the operational data your evolution loop feeds on. Accountability infrastructure including version lineage, rollback, tenant isolation, and delegation of authority. A human escalation layer that handles edge cases and generates evolution signals. Contract frameworks that allocate risk through bounded autonomy, evolution disclosure, and graduated commitments.

None of these are optional. An agency missing any one of them will either fail to improve, fail to learn from operation, fail to survive its first incident, fail on edge cases, or fail to close enterprise deals.

The good news is that if you have already built the ACP retention layer, you have most of this infrastructure. The resource registry is the version lineage. The SEPL loop is the evolution engine. The execution tracer captures the corpus. What remains is the evaluation function, the human escalation design, and the contract framework. Those are not engineering problems. They are domain expertise and business design problems.

Which means the hardest parts of building an AI agency are not technical. They are knowing your vertical deeply enough to define what "good" looks like, and being honest enough with customers to define what you are and are not accountable for. The code is the easy part. The judgment is everything.