The experiment ran sixty self-evolution cycles in fourteen days. The agent improved every cycle. The eval function held. The architecture worked.
Production is not that.
In production, the customer watches every output. The first eval function is always wrong. The first evolution cycle commits something the customer did not ask for. The first regression scare wakes you up at four in the morning. The agent handles the easy cases on day one and the hard cases on day forty-two, and what happens between those two dates is the only thing that matters.
I am writing about Reeve because Reeve is the deployment I know. Waste hauling. Eight trucks. Twelve drivers. Forty-three commercial accounts. One dispatcher named Maria who has been doing this for nineteen years and corrects the agent's mistakes by hand the way she used to correct her son's algebra homework.
Maria is the eval function. She just does not know it yet.
Day zero: the cold-start threshold
The baseline agent has to be good enough on day one that the customer does not throw it out before week two. This is the cold-start problem and it is the entire shape of the first 100 days.
Good enough does not mean good. It means: gets enough cases right that the operator can correct the rest faster than they could have started from scratch. For Reeve, that threshold is roughly 60 percent. Get 8 of 14 schedules right and Maria is willing to fix the other 6. Get 5 of 14 right and Maria opens a spreadsheet and starts redoing the whole thing herself. Get 11 of 14 right and Maria says "huh" and stops watching as closely.
That gap, between 5 and 11, is the difference between a deployment that survives week one and a deployment that does not.
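To make the threshold concrete, here is a rough sketch of the arithmetic Maria is implicitly running. The ten-minute correction cost matches her week-one pace (ninety minutes for nine corrections, described below); the six-minute from-scratch cost is an illustrative assumption, not a measured number.

```python
# A rough sketch of the cold-start arithmetic, not code that ships anywhere.
# The 10-minute correction cost matches Maria's week-one pace (ninety minutes
# for nine corrections); the 6-minute from-scratch cost is an illustrative
# assumption chosen to make the trade-off visible.

def worth_keeping(total: int, correct: int,
                  minutes_per_fix: float = 10.0,
                  minutes_from_scratch: float = 6.0) -> bool:
    """Is fixing the agent's misses cheaper than building every schedule by hand?"""
    cost_with_agent = (total - correct) * minutes_per_fix
    cost_without_agent = total * minutes_from_scratch
    return cost_with_agent < cost_without_agent

print(worth_keeping(14, 8))   # True  -- Maria fixes the other six
print(worth_keeping(14, 5))   # False -- Maria opens the spreadsheet
```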
The baseline agent is not the differentiator. The baseline is what gets you in the door. The differentiator is what happens after.
Weeks 1 - 2: the trust deficit
Maria's first day with the agent is on a Monday. The agent generates fourteen schedules for Tuesday's pickups. Maria reviews each one. She corrects nine of them.
The corrections are the data.

Four corrections are about driver preferences the agent did not know. Marcus does not do the south route on Mondays because his daughter has therapy at four and the south route runs late. Annie prefers the bin trucks over the rolloffs because the rolloffs hurt her shoulder. The agent did not know either of these things because nobody told it. Now it knows.
Three corrections are about customer preferences. The bakery on Fourth Street wants pickup before they open at six, not after they close at three, because the bins overflow if they sit overnight. The medical office on Hawthorne wants pickup on the same day every week, not whichever day fits the optimization, because their tenant lease specifies the pickup day. The agent did not know either of these things because they live in Maria's head.
Two corrections are about traffic patterns the agent had no model of. The bridge on Route 6 is closed for construction on Tuesdays and Thursdays. The route through the industrial park is faster going in than coming out. These are not in any data the agent had access to. They are local knowledge.
Maria spends ninety minutes correcting. She is, internally, deciding whether this is going to work. The math she is running, even though she would not phrase it this way, is: did the agent save me time relative to building this from scratch? On day one, the answer is no. It cost her time. What she is actually evaluating is: can it learn? Will the same nine corrections show up tomorrow?
Tuesday morning, the agent generates Wednesday's schedules. Maria reviews them. She corrects four. The driver preferences are absorbed. The customer preferences are mostly absorbed. The traffic patterns are still wrong. Maria notices that some of the corrections she made on Monday have been integrated and some have not. She does not fully trust the system yet.
Wednesday, three corrections. Thursday, two. Friday, one.
By the end of week one, Maria has stopped correcting in batches and started just spot-checking. She has also started suggesting things to the system that the system was not asking about. She wants to know which routes are running late this week. She wants to know which drivers have been late three or more times in the last month. She wants the system to tell her when a customer has not had a pickup in fourteen days because that is a sign they are about to call complaining.
These are not corrections. They are extensions of the eval function. The customer is teaching the agent what good means.
The eval function gets rewritten three times. This is normal and expected and the only honest way to ship.
The first eval function is the one we wrote based on what we thought "good schedule" meant. Number of trucks used, total distance driven, time-in-window, customer-preference-honored. It scored each schedule 0 to 100. It worked in the simulator.
Day three. Maria says "the schedule is technically optimal but Marcus is going to refuse the south route." The eval function did not know about driver preferences as a hard constraint. We add a "driver preferences honored" axis. The score now weights it at 30 percent, ahead of distance.
Day eight. The agent generates a schedule that scores 92 on our function and Maria still rejects it. We sit with her. We ask her to walk us through why. She points at three pickups that are scheduled for Thursday and says "all three of these are going to call us on Wednesday because Wednesday is when their bins are full." The eval function did not have a "predicted complaint" signal. We add one. We weight it at 25 percent because Maria says nine out of ten of her firefights are predictable complaints.
Day twelve. The agent has been improving. The schedules it generates are scoring 88, 91, 89 on our function. Maria approves them with one or two corrections. But she mentions, in passing, that Annie has been driving the rolloff truck three days in a row this week and "her shoulder is going to give out." The eval function does not have a "rotation across crews" signal. We add it. We weight it lower, 10 percent, because rotation is desirable but not constraining.
By day fourteen, the eval function has three axes that did not exist on day one. Each one was discovered by a customer correction. In the experiment, the eval function was fixed and we measured how the agent improved against a static rubric. In production the rubric is the thing that improves first. The agent improves second.
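For concreteness, here is a minimal sketch of an eval function shaped the way we have been describing it: a weighted 0-to-100 score over named axes, with new axes bolted on as corrections reveal them. The 30, 25, and 10 percent weights are the ones above; the day-one weights and the scoring lambdas are illustrative assumptions, not the production values.

```python
# A minimal sketch of an eval function shaped like the one described above.
# The axis names mirror the text; the 0.30 / 0.25 / 0.10 weights come from the
# text, and everything else (day-one weights, scoring lambdas) is an assumption.

from typing import Callable, Dict, Tuple

Axis = Callable[[dict], float]   # scores a proposed schedule from 0.0 to 1.0

class EvalFunction:
    def __init__(self) -> None:
        self.axes: Dict[str, Tuple[Axis, float]] = {}

    def add_axis(self, name: str, scorer: Axis, weight: float) -> None:
        """Axes arrive one customer correction at a time."""
        self.axes[name] = (scorer, weight)

    def score(self, schedule: dict) -> float:
        """Weighted score on a 0-100 scale, normalized over whatever axes exist today."""
        total_weight = sum(w for _, w in self.axes.values())
        if total_weight == 0:
            return 0.0
        raw = sum(scorer(schedule) * w for scorer, w in self.axes.values())
        return 100.0 * raw / total_weight

ev = EvalFunction()
# Day one: what we thought "good schedule" meant.
ev.add_axis("trucks_used",     lambda s: 1.0 - s["trucks"] / s["fleet_size"], 0.15)
ev.add_axis("total_distance",  lambda s: 1.0 - s["miles"] / s["worst_case_miles"], 0.20)
ev.add_axis("time_in_window",  lambda s: s["on_time_stops"] / s["stops"], 0.15)
ev.add_axis("customer_prefs",  lambda s: s["cust_prefs_honored"] / s["cust_prefs_total"], 0.15)
# Days three, eight, and twelve: the axes Maria's corrections added.
ev.add_axis("driver_prefs",        lambda s: s["drv_prefs_honored"] / s["drv_prefs_total"], 0.30)
ev.add_axis("predicted_complaint", lambda s: 1.0 - s["predicted_complaints"] / s["stops"], 0.25)
ev.add_axis("crew_rotation",       lambda s: s["rotation_score"], 0.10)
```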
Weeks 3 - 6: the first committed cycles
By week three, the agent is generating schedules that Maria approves without corrections most days. She is no longer the bottleneck. She has time again.
She uses the time to look at things she could not look at before. She notices that one of our larger commercial accounts has been receiving service eight days into a seven-day cycle, consistently, for the last six weeks. She had never noticed because she was always too busy fighting the next dispatch fire. The agent now does the dispatch. She has time to look at the patterns.
This is when the conversation shifts.
For the first two weeks, every conversation between Maria and the system is correction. After week three, conversations become questions. Why did you put both rolloffs on the north route Tuesday? Why is Annie scheduled for four overtime hours next Friday? Why is the medical office getting Tuesday pickup when they asked for Wednesday?
The agent has answers. They live in the decision log. For each scheduling decision, the agent records what it considered, what it weighted, and what it chose. Maria can pull up the reasoning for any pickup on any day.
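Here is a sketch of what such a log entry could look like, assuming a simple append-only store. The field names and the lookup are assumptions; the commitment is only that each entry records what was considered, what was weighted, and what was chosen.

```python
# A sketch of a decision-log entry. The text says only that each entry records
# what the agent considered, weighted, and chose; field names are assumptions.

from dataclasses import dataclass
from datetime import date
from typing import Dict, List

@dataclass
class Option:
    description: str                 # e.g. "both rolloffs on the north route"
    axis_scores: Dict[str, float]    # per-axis scores at decision time
    weighted_score: float

@dataclass
class DecisionLogEntry:
    day: date
    pickup_id: str                   # which stop this decision is about
    considered: List[Option]         # everything the agent evaluated
    chosen: Option                   # what it committed to the schedule
    reason: str                      # the one-sentence answer Maria reads

class DecisionLog:
    def __init__(self) -> None:
        self._entries: List[DecisionLogEntry] = []

    def record(self, entry: DecisionLogEntry) -> None:
        self._entries.append(entry)

    def reasoning_for(self, pickup_id: str, day: date) -> List[DecisionLogEntry]:
        """What Maria pulls up: the reasoning for any pickup on any day."""
        return [e for e in self._entries
                if e.pickup_id == pickup_id and e.day == day]
```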
Three things happen because of this.
If you cannot describe what you are doing as a process, you do not know what you are doing.
- W. Edwards Deming
First, Maria starts trusting the agent more, not less. The instinct is to expect that visibility into the system's reasoning would erode trust by exposing weakness. The opposite happens. The exposed reasoning is mostly defensible. Maria says "huh, OK, that makes sense" more often than she says "no, that is wrong." When she does say "that is wrong," she has a specific argument, which becomes a data point, which feeds back into the eval function. Trust grows because the system is legible.
Second, Maria starts using the decision log to train new dispatchers. We have a junior dispatcher coming on in week five to backfill Maria during her vacation. Maria does not write training documents. She points the new person at the decision log and says "read fifty schedules. You will start to see how it thinks." The agent has become a teaching tool, not just a working tool.
Third, in week four, the agent commits its first self-evolution cycle.
The cycle is small. The agent has noticed that the "predicted complaint" axis is correlating poorly with actual complaints: only 60 percent of predicted complaints become real complaints, and 30 percent of real complaints were not predicted. It proposes refining the predicted-complaint model to include a customer's seasonal pattern. Some accounts overflow in summer, some in winter, some never. The proposal is in the evolution report format from the experiment. Maria approves it.
The next week, predicted-complaint accuracy moves to 71 percent.
The cycle is undramatic. Nobody throws a party. Maria does not even tell anyone. But internally, this is the moment the deployment crosses from "tool we are using" to "system we are operating." The system improved itself, the customer accepted the improvement, and the world is slightly better. This is the retention layer at the smallest visible scale.
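Mechanically, the check behind this cycle is small. "Accuracy" above is not precisely defined; the sketch below reads the 60 percent figure as precision over a week of flagged accounts and the 30 percent figure as the miss rate, and the function name and set-based representation are assumptions.

```python
# A sketch of the before/after check for the week-four cycle. "Accuracy" in
# the text is not pinned down; here it is read as precision over a week of
# flagged accounts, with the companion miss rate the cycle report mentions.

def complaint_prediction_quality(predicted: set, actual: set) -> dict:
    """Compare accounts the agent flagged against accounts that actually called."""
    hits = predicted & actual
    precision = len(hits) / len(predicted) if predicted else 0.0          # 0.60 before the cycle
    miss_rate = len(actual - predicted) / len(actual) if actual else 0.0  # 0.30 before the cycle
    return {"precision": precision, "miss_rate": miss_rate}

# After the seasonal-pattern refinement, the first number moved to roughly 0.71.
```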
What goes wrong
The regression scare happens in week six. The agent commits an evolution cycle that adjusts how it weights customer preferences against driver preferences. The next morning, three customers call. All three are complaining about pickup time changes. Maria thinks she has just shipped a regression to her customers.
We pull the decision log. The three pickup time changes were not from the cycle that committed last night. They were from a manual change one of the route supervisors made two days ago that was scheduled to take effect this morning. The agent did exactly what it was supposed to do. The cycle had nothing to do with the calls.
But for thirty minutes that morning, Maria thought she had broken her customer relationships by trusting the agent. She did not throw the agent out, but she stopped trusting the evolution cycles for two weeks. We had to demonstrate, by walking through three subsequent cycles in detail, that the cycles were doing what they claimed.
Trust is fragile in the first 100 days. It compounds slowly when things work and collapses fast when something goes wrong, even when "something goes wrong" is not the agent's fault.

The governance drift is more subtle. In week eight, the agent starts handling a class of cases nobody told it about. Specifically: when a driver calls in sick, the agent automatically reshuffles the day's routes to absorb the absence. This is good. Maria did not ask for it. The agent figured out, from watching Maria's manual reshuffles in weeks one and two, that this was a thing that needed to happen, and built it into its dispatch logic.
The problem is that nobody told us the agent was doing this.
We learned about it on day fifty-six when one of the drivers asked Maria why "the system" had assigned him an extra stop, and Maria said "what extra stop?" The agent had been doing it for ten days and it was working fine, but it had crossed from "doing what we built it to do" to "doing what it figured out it should do" without us noticing. This is exactly what we wanted, in principle. But in practice, the governance question gets sharper: how do we know what the agent is doing now that we are not the ones who decided what it should do?
The answer is the decision log and the evolution reports. Both of them recorded the change. Neither of us was reading them carefully enough. We added a weekly review where Maria and one of us walks through the cycle reports together.
The funding scare comes in week ten, when we have a board meeting. We show the metrics. Schedule accuracy is up. Customer complaints are down. Maria is operating eight trucks at the throughput she used to operate five at. Driver satisfaction scores (we added these) are up. By every visible metric, the deployment is going well.
A board member asks: "what is the recurring revenue trajectory?"
We tell him. He pauses. He says, "this looks like a services business. I thought we were building software."
We spend forty-five minutes explaining that the revenue per customer is high, the retention is going to be very high, the deployment time is six weeks not six months, and the CAC is the founder's referral network plus one trade publication. He nods. He is not unhappy. But the gap between what venture capital is shaped for and what we are shaped for becomes a real conversation that day, and we have to know how to answer it. This is, candidly, why we wrote Who Funds the Unfundable.
Weeks 7 - 12: the inflection
By week seven, the agent is handling the easy cases on its own and the medium cases with light supervision. Maria's workday looks different. She still arrives at six. She still leaves around four. But the seven-to-noon block, which used to be dispatch firefighting, is now strategic. She is calling customers proactively because the agent flagged accounts that are trending toward overflow. She is interviewing two new drivers we are about to hire. She is sitting with the route supervisors planning the December schedule, which has Christmas Eve and New Year's Eve falling on Tuesdays and is going to be a logistical mess. The agent does the dispatch. Maria does the things that needed her judgment all along.
The inflection point is when Maria stops thinking about the agent as a tool she is using and starts thinking about it as a colleague she is working with. This is not anthropomorphism. It is workflow vocabulary. She uses words like "let me check with the system" and "the system noticed" and, occasionally, "the system was wrong about this one." She is no longer evaluating the agent every day. She is using it.
By week ten, the agent is handling cases Maria forgot to mention.
The agent did not arrive on day one. It arrived on day one hundred. Everything between those two dates was the deployment.
This is the line that matters. The first ninety days were about catching up to what Maria knew. After day ninety, the agent starts to do things that Maria knew but had stopped explaining because she had stopped thinking of them as decisions. A particular customer who only allows pickups when their security guard is on shift. A particular intersection where the truck has to take a U-turn because a left turn is illegal during morning rush. A particular driver who needs a ten-minute buffer between his last stop and the depot return because his dog has a vet appointment on Thursdays. Maria stopped thinking of any of these as scheduling constraints. They were just how she did her job. The agent rediscovered them by watching how she made micro-corrections to its proposals.
This is the retention layer at scale. Operational knowledge that lived in Maria's head for nineteen years is now in a queryable system. If Maria leaves, the next dispatcher can read the decision log and pick up where she left off. The institutional memory that walked out the door whenever a senior dispatcher retired now stays.
This is also when Maria starts requesting new metrics.
By week eleven, she wants to know which customer accounts have the highest service variance, places where the same account gets very different schedules across weeks for no clear reason. By week twelve, she wants the agent to flag drivers whose performance is drifting in a way that might predict turnover (which costs us $12,000 per driver to replace). By week thirteen, she is asking whether the agent can suggest pricing adjustments for accounts whose service profile has changed since contract signing.
We did not anticipate any of these requests. They are emergent from her using the system in her actual job for ten weeks. Each one becomes a new evaluation axis, a new resource the agent learns to optimize for, a new dimension along which the customer is shaping the product.
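To show how one of these requests becomes an axis: service variance can be read as the spread in the gaps between an account's consecutive pickups. That reading, and the helper below, are assumptions; the text only says the same account gets very different schedules across weeks.

```python
# A sketch of a service-variance metric, reading "variance" as the spread in
# days between an account's consecutive pickups. The definition is an
# assumption; the text says only that schedules differ across weeks.

from datetime import date
from statistics import pvariance

def service_variance(pickup_dates: list) -> float:
    """Variance of the gaps (in days) between consecutive pickups for one account."""
    ordered = sorted(pickup_dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps) if len(gaps) > 1 else 0.0

# The accounts with the highest variance are the ones Maria wants flagged first.
```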
This is what we meant when we wrote that the eval function is the product. After 100 days, the eval function has eleven axes. It started with four. The seven new axes are the product Maria built with us, by using us, in the only way agent-native products actually get built.
Day 100
Day 100 is on a Wednesday in late August. Nothing particular happens that day.
Maria approves the morning schedules in seven minutes, where it used to take her ninety. She spends the rest of the morning talking to the new junior dispatcher about what to look for in next week's routes. She has lunch. She handles three customer service calls. She leaves at four.
On her way out, she stops by the dispatcher console and pulls up the agent's evolution log. The agent has committed forty-three cycles in 100 days. Most of them are small. A handful are significant. None of them have caused a regression that Maria could not have caught and rolled back from the decision log. Maria scrolls through them the way you would scroll through your own commit history at the end of a long project.
She stops on cycle 28, which was the one in week six where the agent figured out how to handle driver call-outs. She had been worried about that one. It had worked. She moves on.
This is the deployment that we set out to ship on day one. It does not feel like a finish line. There is no party. There is just a dispatcher who can now do her job in three hours instead of nine, with an agent that does the rest, with a decision log that explains every choice, and an eval function that the customer authored by correcting the agent's mistakes for fourteen weeks.
Maria does not know the agent has now self-evolved more times than the experiment ever ran. She does not know the eval function has tripled. She does not know that what we shipped to her is, by every technical measure, a different system than the one she started with on day one.
What she knows is that the schedules are usually right, the system tells her when they are not, and her shoulders do not hurt at the end of the day the way they used to.

That is the outcome that matters.
The experiment proved the mechanism. The math is clean. In production the eval function is not fixed, the customer is not patient, and the deployment timeline is measured in whether Maria is still here in week thirteen.
She is. The system she operates is one she helped build, by correcting it, by extending it, by trusting it incrementally as it earned the trust. The agent did not arrive on day one. It arrived on day one hundred. Everything between those two dates was the deployment.
If you are building a self-evolving agent for a real business, this is the work. The mechanism is the easy part. The first 100 days are the hard part. They are also the only part that decides whether the next thousand days happen.