The experiment we published two weeks ago showed a self-evolving agent improving its performance on tasks it had never seen. The headline was the improvement. The mechanism was more interesting.
The agent did not improve by getting access to more data. It did not improve because someone tuned the prompt. It improved because there was a scoring rubric that defined, for each task, what a correct answer looked like. Vendor selection: did it pick the right vendor for the right reasons? Margin calculation: is the number accurate to the penny? Invoicing: do the line items match? Communication: does the tone match this customer's style?
The rubric was six functions. About four hundred lines of code. It was the least glamorous part of the entire system. It was also the only part that mattered.
Everything the agent learned, it learned because the evaluation function told it what "better" meant. Every improvement the reflection model proposed, it proposed because the evaluation function identified what was failing. Every commit gate decision, every rollback, every growth of the regression surface happened because the evaluation function provided a signal that the rest of the architecture could act on.
Change the evaluation function and you change what the agent becomes. Not gradually. Immediately. The agent optimizes for whatever you measure. If you measure the wrong thing, the agent gets very good at the wrong thing, very fast, with no human in the loop to notice until the customer does.

What an evaluation function actually is
Strip away the terminology and an evaluation function is a precise answer to one question: what does a good job look like for this customer?
Not a general answer. Not "the agent should be helpful and accurate." A specific answer, for a specific workflow, in a specific vertical, for a specific customer's operational reality. The waste hauling evaluation function we built had six components, one per workflow step. Intake scored whether the agent correctly extracted eight fields from an informal text message. Vendor selection scored whether the agent picked the optimal vendor given coverage areas, reliability thresholds, and business rules. Margin calculation scored whether the arithmetic was correct to the cent. And so on.
Each component encoded judgment. Not abstract judgment about what AI should do. Operational judgment about what a good dispatcher does when a customer texts "need a 30 yard for a bathroom demo Thursday." A good dispatcher knows that "bathroom demo" means residential, that a 30-yard container in Fulton County has three possible vendors, that the cheapest vendor has a reliability score too low for a job that will exceed $500 after margin, and that the customer prefers text messages signed "Thanks, Bulldog Hauling."
All of that judgment lived in the evaluation function. The agent started with none of it. After ten cycles, it had discovered most of it. But it could only discover what the evaluation function was able to recognize. This is the problem Reeve is solving in the waste hauling vertical, and why the evaluation function had to come before the agent.
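The shape of such a function is simple: one scorer per workflow step, combined into a weighted composite. Here is a minimal sketch; the function names, field lists, and weights are illustrative stand-ins, not the actual Reeve implementation.

```python
from typing import Callable

# One scorer per workflow step; each returns a score in [0.0, 1.0].

def score_intake(output: dict, expected: dict) -> float:
    """Fraction of required intake fields extracted correctly."""
    fields = ["size", "material", "address", "date"]  # illustrative subset of the eight
    correct = sum(1 for f in fields if output.get(f) == expected.get(f))
    return correct / len(fields)

def score_margin(output: dict, expected: dict) -> float:
    """Arithmetic must be exact to the cent: all or nothing."""
    return 1.0 if round(output["margin"], 2) == round(expected["margin"], 2) else 0.0

SCORERS: dict[str, Callable[[dict, dict], float]] = {
    "intake": score_intake,
    "margin": score_margin,
    # ...plus vendor_selection, quote_delivery, dispatch, invoicing
}

def evaluate(outputs: dict, expecteds: dict, weights: dict[str, float]) -> float:
    """Weighted composite across workflow steps."""
    return sum(weights[step] * SCORERS[step](outputs[step], expecteds[step])
               for step in SCORERS)
```

The important property is that every component is a plain function over the agent's output, so the evolution loop can call it thousands of times without a human in the path.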
The domain expert writes the best one
This is the counterintuitive claim that the agency post made and that I believe more strongly after running the experiment: the most important person in an agent-native company is not the engineer who builds the agent. It is the person who has done the job the agent is replacing.
A generic evaluation function asks "did the agent complete the task?" A domain-deep evaluation function asks "did the agent complete the task the way a ten-year veteran would complete it, accounting for the fifteen things that can go wrong that only someone who has done this a thousand times would know to check?"
The difference is not a matter of degree. It is a matter of kind. The generic function produces a generic agent. The domain-deep function produces an agent that learns the operational nuances that make the difference between a competent system and one that a customer trusts with their business.
In waste hauling, the nuance is: Quick Drop Containers has the lowest price but a 0.78 reliability score, and Marcus Johnson's rule is never to dispatch them for jobs over $500. In home care, the nuance might be: Mrs. Chen's daughter calls every Tuesday at 3pm and the aide should be available to give an update, and the chart note should be written before the call, not after. In construction lending, the nuance might be: draw requests from this general contractor always include a 10% materials contingency that the previous lender rejected, but it is actually standard practice in this market.
No model knows this. No prompt contains this. The evaluation function encodes this, and the person who writes it is the person who lived it.
The most important hire in an agent-native company is not an ML engineer. It is someone who has done the job the agent is learning to do, and who can articulate what good looks like with enough precision that a machine can be graded against it.
The Monday Morning post described the skilled-trades shortage: six in ten Gen Z entering trades, $145 million in DOL apprenticeship grants, electricians under 30 making $240,000 on data-center builds. The same dynamic applies here. The people who can write evaluation functions are the people who have deep operational experience in specific verticals. They are not looking for jobs in AI. They are dispatchers, nurses, adjusters, loan officers, project managers, and tradespeople. The companies that figure out how to hire them, or partner with them, will build the best agents. Everyone else will build generic ones.
The Goodhart problem is not theoretical
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. In agent-native systems, this is not a philosophy-of-science concern. It is a product failure mode.
Our experiment included a quote delivery step scored partly by an LLM judge evaluating communication style. The judge calibration was weak. We acknowledged this in the results. But the implication is worth stating plainly: if the style-matching score had been the dominant factor in the composite, the agent would have evolved to produce messages that scored well on style while potentially sacrificing accuracy on the vendor selection and margin calculation that actually matter to the business.
The evaluation function is not a test the agent takes. It is the objective the agent pursues. If the objective is misaligned with what the customer values, the agent will get very good at something the customer does not care about. And the self-evolution loop will compound that misalignment every cycle.
The mitigation is the one I described in The Retention Layer: align the evaluation metric with the pricing metric. If the customer pays for correct dispatches, measure correct dispatches. If the customer pays for resolved insurance claims, measure resolved claims. If the customer pays for booked appointments, measure booked appointments. The evaluation function should measure the outcome the customer is buying, not a proxy for it.
This sounds obvious. In practice, it is the hardest product decision you will make, because the outcome the customer is buying is often ambiguous, delayed, or difficult to attribute. Did the dispatch go well? You will not know until the container is delivered, the vendor is paid, and the customer does not complain. That feedback loop is days, not seconds. The evaluation function has to score a prediction of that outcome at the moment the agent acts, not after the outcome is known.
Getting this right is the product. Everything else is infrastructure.
How to write your first one
If you are starting an agent-native company and reading this, here is what I would do on day one.
Start with the workflow, not the model. Write down every step of the process your agent will handle. For each step, write down what a correct output looks like. Be as specific as the waste hauling example: not "select a vendor" but "select the lowest-cost vendor that covers the service area, meets the reliability threshold for this job size, and is available on the requested date."
Find the person who has done this job for a decade. Sit with them. Watch them work. Ask them what they check, what they worry about, what the new person always gets wrong. The answers are your evaluation criteria. The things the veteran checks that the rookie does not are the highest-value components of your evaluation function.
Score on a continuum, not a binary. "Did the agent pick the right vendor?" is a useful question. "Did the agent pick the right vendor, explain why, reference the relevant business rules, and compare against at least one alternative?" is a better one. Our vendor selection scoring gave 0.0 for the wrong vendor, 0.7 for the right vendor without reasoning, and 1.0 for the right vendor with correct reasoning. The gradient matters because it gives the evolution loop a signal about how to improve, not just whether to improve.
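The tiered vendor-selection score could look like the sketch below. The reasoning check is simplified to a substring match for illustration; in practice it might be an LLM judge or a structured-field comparison.

```python
# Graded scoring matching the 0.0 / 0.7 / 1.0 tiers described in the text.
# The reasoning check here is a simplified substring match (illustrative).

def score_vendor_selection(chosen: str, reasoning: str,
                           correct_vendor: str, required_rules: list[str]) -> float:
    if chosen != correct_vendor:
        return 0.0   # wrong vendor: no credit
    if all(rule in reasoning for rule in required_rules):
        return 1.0   # right vendor, cited the relevant business rules
    return 0.7       # right vendor, but the reasoning is missing or incomplete
```

The 0.7 tier is what gives the reflection model something to work with: it can see that the answer was right but the justification was not, and propose changes targeted at the gap.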
Weight by business impact. Not all steps are equal. In waste hauling, a margin calculation error costs real money. A communication style mismatch costs customer satisfaction. We weighted margin and vendor selection at 20% each, intake and quote delivery at 15%, invoicing at 20%, and dispatch at 10%. Your weights should reflect what your customer would fire you for getting wrong.
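Those weights sum to exactly 1.0, which is worth asserting in code so a later edit cannot silently skew the composite. A small sketch using the weights stated above:

```python
# The step weights from the waste hauling experiment, as stated in the text.
WEIGHTS = {
    "intake": 0.15,
    "vendor_selection": 0.20,
    "margin": 0.20,
    "quote_delivery": 0.15,
    "dispatch": 0.10,
    "invoicing": 0.20,
}
# Guard against silent drift when someone edits a weight later.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def composite(step_scores: dict[str, float]) -> float:
    """Weighted composite; a perfect run scores 1.0."""
    return sum(WEIGHTS[s] * step_scores[s] for s in WEIGHTS)
```

With these weights, botching the margin calculation costs twice as much as botching dispatch, which is exactly the "what would the customer fire you for" ordering.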
Accept that version one will be wrong. Our evaluation function had a weak judge for communication style. We knew it. We shipped it. The experiment still produced meaningful results on the five deterministic components. Your first evaluation function will have gaps. Ship it, measure what you can measure well, and iterate. The eval function evolves too. The difference between a good team and a great one is how fast they close the gaps between what they are measuring and what actually matters.
The product is the judgment
Every agent-native company will have access to the same models. The same APIs. The same orchestration frameworks. The same MCP tools. The marginal cost of building an agent that does a thing approaches zero.
The evaluation function is the part that does not commoditize. It is the encoded judgment of someone who knows what good looks like in a specific domain, translated into a form that a machine can optimize against. For web-content tasks, infrastructure like Plasmate makes structured evaluation possible by giving the agent a semantic representation to score against instead of raw HTML. It is the reason two companies building waste hauling agents will produce agents that evolve differently, serve differently, and retain customers differently.
It is, as the agency post argued, not a feature of the product. It is the product. The model is the engine. The prompt is the steering wheel. The evaluation function is the destination. Build accordingly.