The Five Things That Separate Good AI Agencies From Great Ones

April 20th, 2026

Everyone will be able to build an agent that delivers outcomes. When the dust settles and there are hundreds of AI agencies in every vertical, the differentiation will not come from where you expect.


Two agencies, call them Agency A and Agency B, deploy competing agents to the same customer on the same day. Both start from generic baselines. Both deliver acceptable results on day one.

Three weeks later, one agent has fully adapted to the customer's operational patterns. The other is still making basic mistakes with their vendor naming conventions. By the time Agency B catches up, Agency A has expanded into two additional workflows and the customer has stopped taking competitive calls.

What happened? The agents used comparable foundation models. The APIs were the same. The system prompts were equally well-crafted. The difference was structural, and it was invisible until the moment it mattered.

I have been thinking about this question more than any other in the agent-native conversation: once there are hundreds of AI agencies in every vertical, what separates the ones that win from the ones that get commoditized? The conventional answer is better models or better prompts or more data. I think the real differentiators are something else entirely. They compound over time. And most of them are invisible to the customer until the exact moment they matter most.

Convergence speed

The scenario I opened with is about convergence speed: how quickly the ACP evolution loop adapts the agent from a generic baseline to a customer-specific optimum. It is the single most important competitive metric in the service-as-software model, and almost nobody is measuring it.

Convergence speed comes down to three engineering decisions.

The quality of the reflection operator. When the agent produces a suboptimal result, the reflection step must diagnose why. A shallow reflection operator says "the output was wrong, try again." A deep one says "the tool returned a malformed date because Customer A's EU subsidiary uses DD/MM/YYYY instead of the expected MM/DD/YYYY, and this cascaded into an incorrect calculation in step four." The specificity of the diagnosis determines the precision of the fix. Precise fixes converge faster than broad adjustments.
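
To make the contrast concrete, here is a minimal sketch of what a reflection operator's output might look like. The ReflectionDiagnosis class and its field names are illustrative assumptions, not part of any ACP specification:

```python
from dataclasses import dataclass

@dataclass
class ReflectionDiagnosis:
    """Output of a reflection operator: the specificity here bounds the precision of the fix."""
    failing_step: int        # where in the workflow the error surfaced
    root_cause: str          # what actually went wrong, not just that it did
    affected_resource: str   # which resource the fix should target
    proposed_fix: str        # a concrete, narrow change

# A shallow operator can only say "wrong, retry":
shallow = ReflectionDiagnosis(4, "output was wrong", "system_prompt", "try again")

# A deep operator pins the failure to a specific, fixable assumption:
deep = ReflectionDiagnosis(
    failing_step=4,
    root_cause="EU subsidiary dates are DD/MM/YYYY, parser assumed MM/DD/YYYY",
    affected_resource="tool_usage/date_parsing",
    proposed_fix="Parse dates with the subsidiary's locale before the step-four calculation",
)
```

The deep diagnosis names a step, a cause, and a target resource, which is exactly what lets the evolution loop apply a narrow fix instead of a broad adjustment.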

Resource decomposition granularity. A monolithic system prompt stored as a single resource gives the evolution loop a coarse target. Any improvement risks breaking something else. Decomposed resources (one for domain terminology, one for communication style, one for tool usage, one for exception handling) provide fine-grained targets that can be evolved independently. Thirty modular resources converge faster than one monolithic prompt because each improvement is smaller, more targeted, and less likely to cause regressions.
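
The granularity difference can be pictured as a resource store keyed by concern. A hypothetical sketch (the resource names and the evolve helper are illustrative):

```python
# Monolithic: any evolution step must rewrite the whole prompt,
# so every improvement risks an unrelated regression.
monolithic = {"system_prompt": "terminology, style, tools, exceptions all interleaved"}

# Decomposed: each concern is an independent evolution target.
decomposed = {
    "domain_terminology": "...",
    "communication_style": "...",
    "tool_usage": "...",
    "exception_handling": "...",
}

def evolve(resources: dict, key: str, new_value: str) -> dict:
    """Evolve one resource; everything else is untouched by construction."""
    updated = dict(resources)
    updated[key] = new_value
    return updated

improved = evolve(
    decomposed,
    "exception_handling",
    "Route malformed vendor names to the alias table first.",
)
```

Because each cycle touches one key, the blast radius of a bad improvement is a single resource rather than the whole prompt.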

Evaluation function responsiveness. An evaluation function that takes three days to produce a signal limits the loop to one or two cycles per week. One that produces a signal within hours enables daily evolution. Weekly evolution over three months produces roughly 12 cycles. Daily evolution produces roughly 90. In a market where the first 90 days determine whether the customer stays or switches, convergence speed is survival.
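
The cycle arithmetic is simple enough to write down. A sketch, assuming one evolution cycle per evaluation signal:

```python
def cycles(window_days: int, signal_latency_days: float) -> int:
    """Evolution cycles possible in a window, at one cycle per evaluation signal."""
    return int(window_days / signal_latency_days)

weekly = cycles(90, 7)  # signal takes about a week: roughly 12 cycles in a quarter
daily = cycles(90, 1)   # signal within a day: roughly 90 cycles in the same quarter
```

Same quarter, same agent, seven times the opportunities to converge.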

Cross-tenant learning

This is the capability that separates good agencies from great ones, and it is the hardest to build.

After running for a year across fifty customers, the agency has fifty independently evolved agent states. Each is customer-specific, shaped by that customer's unique workflows and edge cases. But hidden in those fifty states are patterns that generalize.

If forty-seven out of fifty customers' agents independently evolved the same solution for handling a particular exception type, that solution should be part of the baseline for customer fifty-one. If thirty agents all evolved similar retry logic for a specific API provider's rate limiting, that logic should be pre-built into every new deployment.

The hard part is not extraction. It is data isolation. The evolved state of Customer A's agent contains operational information about Customer A's business. Vendor relationships, processing volumes, approval thresholds. This information is proprietary. You cannot merge customer states or train on raw evolution histories without violating the accountability framework your contracts guarantee.

The agencies that solve this will do it through abstraction: extracting structural patterns (this type of API error benefits from this class of retry logic) while discarding customer-specific parameters (the specific endpoint, the specific error message, the specific threshold). The result is a generalization pipeline that converts customer-specific artifacts into generic baseline improvements.
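
As a sketch of what such a pipeline might do, the function below keeps the structural fields of an evolved artifact and drops everything customer-specific. The field names and StructuralPattern class are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuralPattern:
    """Customer-free abstraction of an evolved behavior, safe to fold into the baseline."""
    error_class: str  # e.g. "rate_limit"
    fix_class: str    # e.g. "exponential_backoff_retry"

def abstract(artifact: dict) -> StructuralPattern:
    """Keep the structure, discard every customer-specific parameter."""
    # Deliberately drops: customer id, endpoint, error text, thresholds.
    return StructuralPattern(artifact["error_class"], artifact["fix_class"])

# Independently evolved artifacts from two different customers...
a = {"customer": "A", "endpoint": "https://api.a.example/v2", "threshold": 450,
     "error_class": "rate_limit", "fix_class": "exponential_backoff_retry"}
b = {"customer": "B", "endpoint": "https://api.b.example/pay", "threshold": 120,
     "error_class": "rate_limit", "fix_class": "exponential_backoff_retry"}
# ...collapse to the same generic pattern.
```

Two customers' independently evolved artifacts reduce to one generic pattern, and nothing identifying either customer survives the abstraction.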

This creates something that looks like a data network effect but operates differently. Traditional network effects aggregate customer data. Cross-tenant learning aggregates structural patterns from evolution without sharing any individual customer's data. The more customers served, the better the baseline, the faster convergence for the next customer, the better early-stage outcomes, the higher close rates.

Outcome consistency under stress

There is a moment in every customer relationship where the agent faces conditions it was not designed for. A seasonal spike triples the volume. A backend system goes down and returns error codes instead of data. A counterparty submits documentation in a format nobody has seen before.

Most agents fail gracefully in these moments. Graceful failure is table stakes. What separates great agencies is maintained performance under stress: the agent continues to deliver acceptable outcomes even when conditions deviate significantly from normal.

This resilience is a direct function of tool evolution depth. Every edge case the agent's tools have evolved to handle adds to the system's capacity to cope with the unexpected. After six months, an agent that has encountered and evolved through dozens of unusual situations has a richer set of fallback behaviors and adaptive strategies than one that has only seen clean data.
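
One way that depth shows up in practice is as an ordered chain of evolved fallbacks. A minimal sketch (the parsers are invented for illustration); each handler in the chain exists because a past edge case forced it into being:

```python
def handle(doc, handlers):
    """Try evolved parsers in order; fail gracefully only when every fallback is exhausted."""
    for parse in handlers:
        try:
            return parse(doc)
        except ValueError:
            continue  # this fallback does not apply; try the next one
    return None

def parse_standard(doc):
    if not doc.startswith("STD:"):
        raise ValueError("not a standard document")
    return doc[4:]

def parse_legacy(doc):
    # Evolved in month two, when a counterparty first sent legacy-format files.
    if not doc.startswith("LEG|"):
        raise ValueError("not a legacy document")
    return doc[4:]
```

Six months of evolution means a longer handler chain, and a longer chain means more deviation from normal that the agent absorbs before it has to give up.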

Forrester's 2026 Enterprise Software predictions describe a related dynamic, noting that enterprise applications are moving toward "role-based AI agents that orchestrate and complete tasks across multiple systems." Agents are increasingly responsible for workflows spanning multiple systems, each of which can independently fail. Resilience across system boundaries is not a nice-to-have.

The customer rarely notices resilience during normal operation. They notice its absence the first time conditions get difficult. And that moment, more than any demo or sales pitch, is what determines whether the relationship deepens or ends. The sales pitch gets you in the door. Convergence speed keeps you through the first quarter. But it is the day everything goes sideways, when your agent handles it calmly while the competitor's falls apart, that turns a customer into an advocate.

Evolution transparency

This one surprises people. They assume the evolution process should be invisible. The agent gets better quietly. The customer enjoys improved outcomes without knowing or caring about the mechanism.

I think the opposite is true.

Imagine a quarterly review where instead of showing the customer a dashboard of activity metrics, you show them the evolution history of their agent. Here is how your agent changed over the past quarter. In week three, the reflection operator identified that your accounts payable team uses a non-standard approval chain for invoices over $10,000. The agent evolved its workflow to route these invoices through the additional approval step automatically. Here is the before-and-after accuracy rate. Here is the version diff that shows exactly what changed.

This is qualitatively different from a traditional vendor update. You are not showing the customer what your engineering team built. You are showing them what the agent learned about their business. The customer sees their own operational reality reflected in the agent's evolution. They see their edge cases being addressed. They see measurable improvement attributed to specific adaptations.

The ACP version lineage makes this possible without additional engineering. Every evolution cycle already records what changed, why, and what the measured impact was. Packaging this into a customer-facing evolution report is a presentation problem, not an engineering problem.
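
Packaging that lineage might be as simple as rendering each recorded cycle as a sentence. A hypothetical sketch, assuming each record already carries what changed and the measured impact:

```python
def report_line(record: dict) -> str:
    """Render one lineage record as a customer-facing sentence."""
    return (f"Week {record['week']}: {record['what_changed']} "
            f"(accuracy {record['accuracy_before']:.0%} -> {record['accuracy_after']:.0%})")

lineage = [
    {"week": 3,
     "what_changed": "routed invoices over $10,000 through your non-standard approval chain",
     "accuracy_before": 0.81, "accuracy_after": 0.96},
]

print("\n".join(report_line(r) for r in lineage))
```

The data is already in the audit trail; the report is a projection of it.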

The quarterly evolution review is the most powerful retention conversation in the relationship.

McKinsey's research on agentic governance reinforces this from the accountability side. They argue that in agentic systems, governance must be real-time, data-driven, and embedded. Evolution transparency gives the customer exactly the governance visibility McKinsey says is essential. It transforms the ACP audit trail from an internal compliance mechanism into a customer engagement tool.

The retention effect compounds. The customer who has seen twelve months of evolution reports has twelve months of documented evidence that the agent is learning their business. Switching means abandoning all of that accumulated, demonstrated learning. Not because of a contract. Because of proof.

Domain depth in the evaluation function

I wrote in What an AI Agency Actually Needs to Work that the evaluation function is the product. Here I want to be more specific about why domain depth in the evaluation function is a differentiator rather than just a requirement.

A generic evaluation function asks "did the agent complete the task?" A domain-deep evaluation function asks "did the agent complete the task in a way that an experienced professional in this vertical would consider correct, efficient, and appropriate?"

Different questions. Different evolution trajectories.

Consider insurance claims. A generic function measures whether the claim was resolved and how quickly. A domain-deep function measures whether the coverage determination was consistent with the policy language, whether the settlement amount was within the actuarially appropriate range, whether the documentation met regulatory requirements, whether the communication to the claimant was compliant with state-specific notification rules, and whether the file would survive an audit.

An agent evolving against the generic function gets faster at closing claims. An agent evolving against the domain-deep function gets better at closing claims correctly. The first produces speed with inconsistent quality. The second produces reliable, compliant, defensible outcomes that the customer trusts with progressively less oversight.
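
The contrast between the two evaluation functions can be sketched directly. The claim fields below are hypothetical stand-ins for the checks an experienced adjuster would run:

```python
def generic_eval(claim: dict) -> float:
    """Generic: was the task completed?"""
    return 1.0 if claim["resolved"] else 0.0

def domain_deep_eval(claim: dict) -> float:
    """Domain-deep: score against what an experienced adjuster would check."""
    checks = [
        claim["coverage_matches_policy_language"],
        claim["settlement_in_actuarial_range"],
        claim["documentation_meets_regulatory_reqs"],
        claim["claimant_notice_state_compliant"],
        claim["file_would_survive_audit"],
    ]
    return sum(checks) / len(checks)

# A fast but sloppy resolution maxes the generic score while the
# domain-deep score exposes the quality gap the customer feels later.
sloppy = {"resolved": True,
          "coverage_matches_policy_language": True,
          "settlement_in_actuarial_range": False,
          "documentation_meets_regulatory_reqs": False,
          "claimant_notice_state_compliant": True,
          "file_would_survive_audit": False}
```

The same resolved claim scores a perfect 1.0 on the generic function and far lower on the domain-deep one; only the second signal steers evolution toward defensible outcomes.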

This is where the ex-operators who understand a vertical workflow cold become irreplaceable. The domain depth of the evaluation function cannot be generated by an LLM. It cannot be extracted from documentation. It comes from years of operational experience, the kind that teaches you what "good" looks like in practice, not just in theory. The agency whose evaluation function was designed by someone who spent ten years adjusting claims will out-evolve the agency whose function was designed by a software engineer who read the documentation.

The compounding picture

These five do not operate independently. They compound.

Faster convergence means more evolution cycles, which generates richer execution traces, which enables better cross-tenant learning, which improves the baseline, which makes convergence faster for the next customer. Domain depth ensures that faster convergence produces genuine quality improvement, not just faster iteration toward a local optimum. Evolution transparency converts internal improvements into customer-visible value, strengthening retention. Outcome consistency protects the relationship during the moments that matter most.

The agencies that build all five will dominate their verticals. The agencies that build three or four will survive. The agencies that build one or two will be commoditized.

The encouraging thing is that none of these require proprietary technology. They require operational excellence, domain expertise, and engineering discipline. They require knowing your vertical deeply enough to build the right evaluation function, and caring enough about quality to build the infrastructure that makes evolution visible and accountable.

The technology is new. The competitive dynamics are eternal.