I have spent the last several posts describing a machine that nobody has seen run. The Retention Layer explained the theory. The Service Layer described the economic implications. The rest of the series explored what it means for agencies, differentiation, data, sovereignty, and market structure. All of that was argument. This post is evidence.
We built the full ACP architecture, pointed it at a controlled simulation of real business operations, and let it run for sixty evolution cycles across two synthetic customers. No cherry-picked demos. No curated screenshots. Every number in this post comes from held-out test data that the evolution loop never observed during training. The code is published. The methodology is reproducible.
The headline: the architecture works. Self-evolution produces measurable improvement, real customer-specific divergence, and zero regressions. The nuances matter too, and I will be honest about them. But the mechanism does what the theory predicted.
The setup
We chose waste hauling dispatch because every step has an objectively correct answer. Did the agent extract the right job details? Did it select the correct vendor? Is the margin above the threshold? Does the invoice match the quote? No subjectivity in grading.
Two synthetic hauling companies with fully specified operational profiles:
Bulldog Hauling in Atlanta. Residential and commercial. Four vendors. 35% minimum margin. Text-message quotes. English only. Business rules including vendor reliability filters and dual-bid requirements for large commercial jobs.
Costa del Sol Hauling in Miami. Residential and municipal. Three vendors. 40% minimum margin. Bilingual English and Spanish. Hurricane season surcharges. Municipal contract handling with FEMA debris codes. Two-hour follow-up windows.
Each company got approximately 90 synthetic tasks spanning routine residential pickups, commercial jobs, edge cases that trigger specific business rules, incomplete requests requiring follow-up, and adversarial inputs. Tasks were split into three sets: 60% for the evolution loop to learn from, 20% for commit/rollback decisions, and 20% held out as a test set the loop never saw.
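As a concrete sketch, a deterministic 60/20/20 split might look like the following. The proportions and the roughly-90-task count come from the text; the function name, seed, and shuffling scheme are illustrative, not the experiment's published code:

```python
import random

def split_tasks(tasks, seed=0):
    """Deterministically split tasks into train/validation/test sets.

    60% feeds the evolution loop, 20% gates commit/rollback decisions,
    and 20% is held out for final scoring. The proportions are from the
    post; everything else here is an illustrative sketch.
    """
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    train_end = int(n * 0.6)
    val_end = int(n * 0.8)
    return {
        "train": shuffled[:train_end],              # evolution loop learns from these
        "validation": shuffled[train_end:val_end],  # commit/rollback decisions
        "test": shuffled[val_end:],                 # never seen by the loop
    }

splits = split_tasks([f"task-{i}" for i in range(90)])
print(len(splits["train"]), len(splits["validation"]), len(splits["test"]))  # 54 18 18
```

With 90 tasks this yields 54 training, 18 validation, and 18 held-out tasks per customer.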
The baseline agent received generic prompts. It knew the workflow steps and had access to each customer's raw data: vendor lists, margin requirements, payment terms. What it lacked was judgment. It did not know which vendor to prefer when multiple options exist, how to communicate in each customer's voice, when to apply exceptions to general rules, or how to handle the dozen edge cases buried in each customer's business logic.
Baseline performance: a composite score of roughly 56%. Good enough to produce structured output. Not good enough to run a real business.
What the agent learned on its own
The first committed evolution cycle for Bulldog Hauling tells the story. The reflection model examined twelve failed tasks and identified three specific problems.
The system prompt told the agent to "extract job details from the customer's message" but never explained how to parse an informal text like "Hey Marcus this is Dave Patterson at 1847 Peachtree Rd need a 30 yard for a bathroom demo getting started Thursday." The agent was leaving fields empty. The fix: explicit parsing instructions for each field type, with fallback defaults for missing information.
The invoice prompt said "correct totals" but never specified decimal precision. The agent was rounding $431.03 to $431.00. One cent of error, caught by the evaluator, flagged by reflection, fixed in the prompt. The agent added a rule: "All monetary amounts must be calculated and displayed to exactly 2 decimal places. Use precise decimal arithmetic."
The vendor selection prompt said "consider vendor reliability" but never required the agent to explain its reasoning. The fix added structured reasoning requirements: which counties the vendor covers, their reliability score and response time, how specialties match the job type, which business rules exclude alternatives, and comparison with at least one other vendor.
None of these instructions existed in the baseline. No human wrote them. The evolution loop discovered each one by running tasks, examining failures, forming hypotheses about what went wrong, and proposing targeted modifications. Then it tested those modifications on a validation set, checked that they did not break anything that was already working, and committed only if both conditions held.
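That cycle can be sketched in a few lines. Everything here — `reflect`, `evaluate`, the data structures — is an illustrative stand-in for the reflection model and the grader, not the experiment's published interfaces:

```python
def evolution_cycle(agent, train_tasks, validation_tasks, regression_surface,
                    reflect, evaluate):
    """One cycle: run, reflect on failures, propose, validate, commit or roll back.

    `reflect` stands in for the reflection model (it proposes a modified agent
    from observed failures); `evaluate` stands in for the deterministic grader.
    All names are illustrative.
    """
    failures = [t for t in train_tasks if not evaluate(agent, t)]
    if not failures:
        return agent  # nothing to learn from this cycle

    candidate = reflect(agent, failures)  # targeted prompt modifications

    # Condition 1: the change must help on the validation set.
    improves = (sum(evaluate(candidate, t) for t in validation_tasks)
                > sum(evaluate(agent, t) for t in validation_tasks))
    # Condition 2: it must not break anything that was already working.
    no_regression = all(evaluate(candidate, t) for t in regression_surface)

    # Commit only if both conditions hold; otherwise roll back (keep old agent).
    return candidate if (improves and no_regression) else agent
```

The two commit conditions at the end are the heart of the mechanism: a proposed change that helps on new tasks but breaks a previously solved one never lands.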
The agent taught itself to parse informal text messages, enforce penny-level invoice precision, and explain its vendor selection reasoning. Three capabilities. Zero human prompts. One reflection cycle.
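The penny-level fix is worth one concrete illustration, because it is exactly the kind of rule engineers reach for decimal arithmetic to enforce. A minimal sketch — the line items are hypothetical; only the $431.03 figure appears in the experiment:

```python
from decimal import Decimal, ROUND_HALF_UP

def invoice_total(amounts):
    """Sum monetary amounts with exact decimal arithmetic and display to
    exactly 2 decimal places, in the spirit of the evolved rule.
    Inputs are decimal strings, never floats."""
    total = sum(Decimal(a) for a in amounts)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical line items summing to the post's $431.03 example.
print(invoice_total(["389.50", "41.53"]))  # 431.03, never rounded to 431.00
```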
Costa del Sol's agent evolved differently. Its first committed cycle focused on hurricane season surcharge calculations, bilingual response detection, and vendor reliability filtering for municipal contracts. Same architecture, same evolution loop, different knowledge. The system was adapting to what each customer's operations actually required.
Claim one: improvement
Bulldog Hauling improved from a 56.3% baseline to 61.2% on the held-out test set across three independent trials. The best trial reached 69.0%, a 22.8% relative improvement over its starting point. The mean peak across all three trials was 67.0%. Two of three trials showed clear sustained improvement; one regressed after an early peak, likely because a committed change helped on validation tasks but hurt on the held-out distribution.
Costa del Sol improved from 55.1% to 56.3%. Modest. The peak reached 59.5%, an 8.1% relative gain. The improvement is real but small, and I think I know why. Costa del Sol's operational profile is harder. Bilingual detection, municipal contract handling, FEMA debris codes, and hurricane season logic create a more complex decision space. The reflection model identified the right problems but generated fixes that were too narrow to generalize across the held-out set. The evolution loop needs more cycles, or a more capable reflection model, to crack this kind of complexity.
A third customer, Front Range Disposal in Denver, completed baseline scoring but did not finish its evolution cycles before the experiment ended. Its baseline score was 64.1%, notably higher than the other two. This is consistent with expectations: Front Range is construction-focused with simpler routing logic, while Bulldog and Costa del Sol have more complex rule sets involving vendor reliability filters, bilingual handling, and municipal contracts. Higher operational complexity means a lower starting score and more headroom for evolution to discover.
Averaged across both completed customers: a 5.5% relative improvement on held-out tasks the evolution loop never trained on. Not a controlled-environment miracle. A real, reproducible gain with honest variance across trials.
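As I read it, the headline number is the mean of the two customers' relative gains. A quick check of the arithmetic, using the rounded figures quoted above:

```python
def relative_gain(baseline, final):
    """Relative improvement over baseline, as a percentage."""
    return 100 * (final - baseline) / baseline

print(round(relative_gain(56.3, 61.2), 1))  # Bulldog: 8.7
print(round(relative_gain(55.1, 56.3), 1))  # Costa del Sol: 2.2
```

The mean of those two gains is consistent with the headline figure (small discrepancies at the last decimal place presumably reflect unrounded underlying scores).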
Claim two: divergence
This is the retention layer argument made concrete. If you deploy the same baseline agent to two different customers and let the evolution loop run independently, do the agents become measurably different?
Yes.
After ten evolution cycles, the structural divergence between Bulldog and Costa del Sol's evolved prompts averaged 43.3% across all six resource types, measured by normalized edit distance. In other words, roughly 43% of the prompt content now differs between the two agents. In a production deployment running one cycle per day, this divergence would emerge in roughly two weeks.
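Normalized edit distance here means, as I read it, Levenshtein distance divided by the longer string's length, so 0.0 is identical and 1.0 is completely different. A minimal sketch, assuming character-level distance (the experiment may operate at a different granularity):

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance between a and b, normalized by the longer length."""
    if not a and not b:
        return 0.0
    # Classic dynamic-programming edit distance, O(len(a) * len(b)) time,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

print(normalized_edit_distance("kitten", "sitting"))  # 3/7, about 0.43
```

By this measure, two prompts that share no content at all score 1.0, so the 91% figure for business rules below means the two agents' rule sets are almost entirely disjoint.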
The divergence was not uniform. Business rules prompts diverged by 91%. The system prompts diverged by 55%. Vendor selection prompts by 53%. Invoice prompts by 61%. Communication style and quote generation prompts stayed at their baselines, suggesting those resources need more targeted evolution pressure or more cycles to begin adapting.
The divergence was also substantive, not cosmetic. Bulldog's agent learned vendor reliability thresholds and dual-bid logic for commercial jobs over 30 yards. Costa del Sol's agent learned hurricane season surcharge calculations and municipal contract discount rules with Net 45 payment term overrides. These are not rephrasings of the same idea. They are different knowledge, discovered independently, from the same starting point.
This is the switching cost mechanism from the retention layer thesis, observed in a controlled setting.
Two weeks. Same starting point. Ninety-one percent divergence in business rules. The agents are no longer the same product. They are two different specialists, shaped by two different businesses, and neither one could do the other's job well.
After ten cycles, migrating from one provider to another means losing every adaptation the agent has built. The new provider's agent starts from baseline. The compounding knowledge, gone.
Claim three: regression protection
Zero regressions across all sixteen committed evolution cycles. The commit gate works exactly as designed.
The mechanism is simple but strict. Every time the evolution loop commits a change, it records the tasks the agent currently handles well. Those tasks become the regression surface. The next proposed change must pass every task on that surface before it can be committed. If any previously-passing task now fails, the change is rolled back.
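The gate can be stated compactly. All names here are an illustrative sketch of the mechanism described above, not the published code:

```python
class CommitGate:
    """Commit a change only if every task on the regression surface still passes.

    On each successful commit, the surface grows to cover everything the
    agent currently handles well, so the performance floor never drops.
    """

    def __init__(self):
        self.surface = set()  # tasks the committed agent is known to handle well

    def try_commit(self, candidate_agent, evaluate, all_tasks):
        # Gate: any previously-passing task that now fails forces a rollback.
        if any(not evaluate(candidate_agent, t) for t in self.surface):
            return False  # rolled back; the surface is unchanged
        # Commit: re-record everything the new agent handles well.
        self.surface = {t for t in all_tasks if evaluate(candidate_agent, t)}
        return True
```

Note the asymmetry: the surface only ever reflects committed agents, so a rejected candidate leaves no trace. That is what makes the 73% rollback rate below cheap — a bad proposal costs evaluation tokens, never capability.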
The cost of this strictness is visible in the data. Only 27% of evolution cycles were committed. The other 73% were rolled back because proposed improvements, while helping on new tasks, degraded performance on previously solved ones. For Bulldog Hauling, the commit rate was just 13%. The gate is aggressive.
This is a feature, not a bug. An aggressive gate means the agent's performance floor never drops. It also means progress is slow, which is exactly the tradeoff the architecture is designed to make. In a production deployment where the agent is handling real customer operations, a slow learner that never regresses is infinitely preferable to a fast learner that occasionally breaks things that were working yesterday.
What needs work
Costa del Sol's improvement was marginal. A 2.2% gain on held-out tasks is better than nothing but would not be statistically significant in a larger study. The bilingual and municipal complexity appears to exceed what ten evolution cycles with generic reflection can address. More cycles, domain-specific reflection prompts, or a more capable reflection model would likely help. We will test this.
The held-out scores are noisy. Bulldog trial 0 peaked at cycle 1 and then declined to below baseline by cycle 10. The learning curves are not smooth upward lines; they oscillate. This is partly because the held-out set is small (18 tasks) and partly because a single committed change can shift the score in either direction depending on which held-out tasks it affects.
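The granularity of an 18-task set is worth making explicit. If the composite were a simple pass rate (it is not exactly that, but the illustration holds for the order of magnitude), a single task flipping between pass and fail would move the score by:

```python
# One task flipping outcome on an 18-task held-out set moves a simple
# pass-rate score by 1/18 of 100 points -- comparable to the gaps between
# the baseline, final, and peak scores quoted earlier.
n_heldout = 18
print(round(100 / n_heldout, 1))  # 5.6 percentage points
```

That is enough to account for much of the oscillation in the learning curves without any change in underlying capability.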
The LLM judge calibration was weak. Quote delivery scoring used a lightweight calibration with low correlation. We fell back to deterministic heuristics for most style evaluation. A production system needs a well-calibrated judge with human-scored reference examples.
None of these failures invalidate the architecture. They tell us where the engineering effort needs to go next.
What sixty cycles proved
Seven million tokens. Sixty evolution cycles. Two fully evolved agents with measurably different capabilities adapted to their respective customers' operations. All of it autonomous. No human tuned a prompt. No engineer wrote a customer-specific rule. The loop observed failures, formed hypotheses, proposed fixes, tested them, and committed only when the evidence justified it.
A human consultant achieving the same result would need to study each customer's operations, interview the owner, document business rules, write custom procedures, and revise them through trial and error. Weeks of work. And the result would be static, a snapshot of the business that begins degrading the moment operations change.
The ACP loop does this continuously, autonomously, at marginal cost. A production deployment running one cycle per day would compound customer-specific knowledge every single day without anyone asking it to.
Where this goes
This experiment tested the mechanism, not the market. We now know the self-evolution loop produces measurable improvement, customer-specific divergence, and regression-protected stability. The three pillars of the retention layer thesis hold under controlled conditions.
The next steps are engineering, not research. A more capable reflection model that proposes tighter modifications. Larger regression surfaces that grow faster. Domain-specific evaluation functions that catch the nuances the current scorer misses. And then the step that matters most: deploying this on live customer operations where the tasks are not synthetic and the ground truth is not precomputed.
The gap between this simulation and production is real, but it is an engineering gap. The architecture works. The mechanism produces the effects the theory predicted. What remains is building it well enough to trust with someone's actual business.
The code is at experiments/acp-proof. Run it yourself.