The Data Advantage Is Not What You Think

April 20th, 2026

The conventional wisdom says incumbents with decades of data will build the best AI agents. The conventional wisdom is wrong in a way that matters.


Every investor pitch I hear about agent-native businesses runs into the same objection. Usually around slide nine. The founder is explaining how their agent handles insurance claims or manages property leases or processes freight bookings, and someone on the other side of the table says:

"What about Guidewire? They have twenty years of claims data across hundreds of carriers. What about Yardi? They have every lease in every major market. What about the incumbents? They have the data. You do not."

It is a reasonable objection. It is also, I believe, fundamentally wrong. And the reasons it is wrong illuminate something important about how competitive advantage works in the service-as-software era.

The data everyone is talking about

When investors say "data advantage," they mean historical record data. Past transactions, past communications, past decisions. The structured and unstructured information that accumulates in databases over years of operation.

Incumbents have vast stores of this data. Salesforce has decades of CRM interactions. UnitedHealth has decades of claims records. Workday has decades of HR transactions. This data is real, it is proprietary, and it took decades and billions of dollars to accumulate.

The assumption is that this historical data is the critical input for building AI agents. More data equals better training equals better agents equals insurmountable competitive advantage. The incumbent's data moat protects them.

This assumption made sense for the previous generation of AI, where models were trained on large datasets to perform static inference. It does not hold for self-evolving agents operating in the ACP paradigm.

The data that actually matters

The data that drives ACP evolution is not historical records. It is execution traces: real-time, structured records of what the agent did, what happened as a result, and whether the outcome met the evaluation criteria.

An execution trace captures a live interaction between the agent and the customer's actual operational environment. Which tools were called. What they returned. What errors occurred. How the agent reasoned through the problem. What the final output was. Whether that output was correct. This is observational data generated by operation, not accumulated by storage.
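The shape of such a record can be sketched in Python. This is an illustrative assumption about the schema, not a standard one; the field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ToolCall:
    name: str                     # which tool the agent invoked
    arguments: dict               # what it was called with
    result: Any = None            # what the tool returned
    error: Optional[str] = None   # any error raised during the call

@dataclass
class ExecutionTrace:
    workflow: str                                        # e.g. "claims_intake"
    tool_calls: list = field(default_factory=list)       # sequence of ToolCall records
    reasoning: list = field(default_factory=list)        # agent's intermediate steps
    output: Any = None                                   # final answer produced
    passed_eval: Optional[bool] = None                   # did the outcome meet the criteria?

# A trace only exists once the agent runs: it is generated by operation.
trace = ExecutionTrace(workflow="claims_intake")
trace.tool_calls.append(ToolCall(
    name="lookup_policy",
    arguments={"policy_id": "P-1042"},
    result={"status": "active"},
))
trace.reasoning.append("Policy active; proceed to coverage check.")
trace.output = {"decision": "approve"}
trace.passed_eval = True
```

The point the sketch makes concrete: every field is populated by a live run against the customer's environment. Nothing in a historical warehouse can fill it in retroactively.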

Here is the critical distinction: execution traces do not exist until the agent starts operating. No incumbent in the world has a corpus of execution traces for how an AI agent performs their customers' workflows, because no incumbent has been deploying self-evolving agents against those workflows. The incumbent's twenty years of historical data gives them exactly zero execution traces. That data is about how humans used their software. It says nothing about how agents perform their customers' work.

A startup that deploys an agent today begins generating execution traces today. In six months, they have six months of traces. The incumbent that deploys an agent in six months has zero traces, regardless of how much historical data sits in their warehouse.

The incumbent has twenty years of data about how humans used their software. The startup has six months of data about how agents perform their customers' work. For ACP evolution, the startup's data is more valuable.

When historical data is actually a liability

This is the part that surprises people.

Incumbents' historical data reflects how humans performed workflows using the incumbent's software. This data encodes human limitations. It encodes the workarounds people developed because the software could not handle certain edge cases. It encodes the manual steps that exist because the integration between two systems was never built properly. It encodes the judgment heuristics that experienced operators developed to compensate for inadequate tooling.

Training an agent on "how humans used our CRM" produces an agent that replicates human CRM usage patterns, including all the inefficiencies, workarounds, and process debris that accumulated over years because the software was insufficient.

An agent that starts from a clean baseline and evolves through the ACP loop against actual outcomes has no such constraints. It discovers the optimal approach to each workflow element through experimentation and evaluation, unconstrained by historical patterns that reflect legacy process assumptions.

I have seen this play out in a related context. When companies migrate from legacy systems to modern platforms, the most common failure mode is replicating the old process on the new platform. They take a workflow designed around the limitations of a system built in 2003 and implement it identically in a system built in 2024, preserving inefficiencies that the new system could easily eliminate. The historical process becomes a ceiling rather than a foundation.

The same dynamic applies to agents trained on historical operational data. The data does not just teach the agent what to do. It teaches the agent what the limitations were. And an agent constrained by historical limitations will be outperformed by an agent that discovers optimal approaches through unconstrained evolution.

Historical data is not worthless. It provides domain knowledge, vocabulary, and a rough map of the workflow landscape. It is useful for building the initial baseline, the generic agent that is good enough on day one to win the first customer. But the initial baseline is the least important component of an ACP system. It is what the agent starts from, not what it converges toward. The self-evolution loop determines the destination, and the loop operates on execution traces, not historical records.

The advantage that actually matters is customer-specific

Here is where the argument gets sharpest.

The incumbent's aggregate dataset tells you what the average customer's workflow looks like. A statistical composite of thousands of customers' behaviors, smoothed across industries, geographies, and company sizes. Useful for understanding general patterns. Nearly useless for serving a specific customer.

The customer does not care whether your agent understands the average property management workflow. They care whether it understands their portfolio, their tenants, their vendors, their lease structures, their communication preferences. These customer-specific details are not in any aggregate dataset. They are discovered through operation at that specific customer's site.

This is the core insight from The Retention Layer: the ACP evolution loop produces customer-specific adaptation that diverges from the baseline over time. Customer A's evolved agent looks different from Customer B's evolved agent even though they started from the same baseline. This divergence is not a bug. It is the mechanism through which the agent develops deep understanding of each customer's specific operational reality.

Aggregate data helps build a better baseline. Customer-specific evolution determines outcome quality. And since outcome quality is what the customer pays for in a service-as-software model, the customer-specific data generated through operation is more strategically valuable than the aggregate data accumulated over decades.

An analogy. A national real estate brokerage has transaction data from every market in the country. A local agent who has worked a specific neighborhood for fifteen years has something the national brokerage cannot match: deep, contextual knowledge of that neighborhood's dynamics. Who has the better data? It depends on what question you are answering. For broad market analysis, the brokerage. For selling a specific house in a specific neighborhood, the local agent. The ACP-powered service provider is the local agent, multiplied across every customer, with the accumulated knowledge growing deeper every month.

The startup's structural advantage

The conventional framing positions startups as underdogs fighting incumbents' data moats. I think the structural advantage actually favors startups, for reasons that go beyond data.

No cannibalization dilemma. The startup prices on outcomes from day one without cannibalizing existing revenue. The incumbent that shifts to outcome-based pricing undermines the seat-based revenue that funds current operations.

No professional services team to protect. The startup's automated customization through self-evolution directly threatens the incumbent's services revenue, which can represent 20 to 40 percent of total revenue.

No architectural constraints. The incumbent's product was designed for static behavior. Retrofitting ACP requires deep refactoring incompatible with how enterprise software ships.

No organizational inertia. The startup builds the organization for the new model from scratch.

These structural advantages do not guarantee the startup wins. The incumbent has distribution, brand trust, existing relationships, and deep integration into enterprise stacks. These are real advantages, especially for initial customer acquisition.

But they are acquisition advantages, not retention advantages. The incumbent's distribution gets them the meeting. The startup's evolution loop keeps the customer. And in a model where outcome quality determines renewal and expansion, the structural advantages compound over time in the startup's favor.

The playbook

If you are a startup competing against incumbents with large datasets, here is how I would think about the sequence.

Start with a baseline that is good enough. Not great. Good enough to deliver acceptable outcomes on day one. Use publicly available domain knowledge, the founder's operational expertise, and the best available foundation model. The baseline gets you the first customer. It is not what keeps them.

Prioritize customer diversity in your first ten accounts. An insurance agency serving only auto claims learns less than one serving auto, property, liability, and workers' comp. Diverse customers produce richer evolution signals.

Run the evolution loop aggressively in the first 90 days. Daily cycles if your evaluation function supports it. This is the convergence sprint where the customer sees the most visible improvement, and where you are most vulnerable to churn.
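One cycle of that loop can be sketched as a simple hill-climb: propose a variant of the agent's configuration, score it, keep it only if it improves. The functions `propose_variant` and `evaluate` are hypothetical stand-ins for the real mutation and evaluation machinery, and the single numeric parameter is a toy:

```python
import random

def propose_variant(agent_config: dict) -> dict:
    # Toy mutation: nudge one numeric parameter of the agent's config.
    variant = dict(agent_config)
    variant["threshold"] = round(variant["threshold"] + random.uniform(-0.05, 0.05), 3)
    return variant

def evaluate(agent_config: dict) -> float:
    # Stand-in for scoring the config against recent execution traces;
    # here, closeness to an assumed optimum of 0.7.
    return 1.0 - abs(agent_config["threshold"] - 0.7)

def evolution_cycle(agent_config: dict, generations: int = 30) -> dict:
    best, best_score = agent_config, evaluate(agent_config)
    for _ in range(generations):
        candidate = propose_variant(best)
        score = evaluate(candidate)
        if score > best_score:   # keep only measured improvements
            best, best_score = candidate, score
    return best

evolved = evolution_cycle({"threshold": 0.5})
```

The cadence matters more than the mechanism: the faster the evaluation function can score a candidate, the more of these cycles fit into the 90-day convergence sprint.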

Build the cross-tenant learning pipeline as soon as you have five customers. Extract generalizable improvements from customer-specific evolution, feed them back into the baseline. Every new customer starts from a better foundation than the previous one.
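A minimal sketch of that promotion step, assuming evolved improvements can be named and compared across tenants (the improvement names and the threshold are illustrative assumptions): only changes that recur independently at several customers are treated as generalizable and folded into the shared baseline.

```python
from collections import Counter

def promote_to_baseline(per_tenant_improvements: dict, min_tenants: int = 3) -> set:
    # Count how many tenants independently evolved each improvement.
    counts = Counter()
    for improvements in per_tenant_improvements.values():
        counts.update(improvements)
    # Generalizable = recurs at min_tenants or more customers;
    # the rest stays customer-specific.
    return {name for name, n in counts.items() if n >= min_tenants}

tenants = {
    "customer_a": {"retry_on_timeout", "normalize_addresses"},
    "customer_b": {"retry_on_timeout", "custom_lease_parser"},
    "customer_c": {"retry_on_timeout", "normalize_addresses"},
    "customer_d": {"normalize_addresses", "vendor_alias_table"},
}
baseline_updates = promote_to_baseline(tenants)
```

The design choice the threshold encodes is the one the essay argues for: customer-specific divergence is preserved, and only the intersection that proves itself across tenants raises the floor for the next deployment.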

Treat your execution trace corpus as the strategic asset it is. After twelve months across twenty customers, you have something no incumbent possesses: thousands of structured records showing how agents actually perform workflows in practice. This corpus was generated by your operation; it reflects how agents perform work, not how humans used software.

The timeline

The incumbent wakes up with a massive historical dataset and a business model they cannot afford to disrupt. The startup wakes up with no historical data, the freedom to price on outcomes, and an architecture that generates the only data that matters from the first day of operation.

After six months, the startup has six months of execution traces and a dozen customer-specific evolved agents. The incumbent is still debating how to restructure their pricing model without alarming Wall Street.

After twelve months, the startup's cross-tenant learning pipeline is producing baseline improvements that make every new deployment faster and better. The incumbent has launched a pilot program with three customers and is dealing with internal resistance from their professional services division.

After eighteen months, the startup has more operationally relevant data than the incumbent. Not more historical records. The incumbent will always have more of those. But more data about how agents perform the work that customers are now paying for. And that is the data that drives the ACP evolution loop, which determines outcome quality, which is what the customer is paying for.

The window is open. Not because the technology favors startups. Not because the models favor startups. The window is open because the business model favors startups. And the business model is the one thing incumbents cannot easily change.

The startup's advantage is not better technology. It is not better data. It is the absence of a business model worth protecting.