I bet that a lightweight, text-based eval system would outperform purpose-built simulation platforms for pre-deployment quality gates. That bet unlocked regulated verticals, converted 75% of enterprise pilots to annual contracts, and became the backbone of how the team ships.
* To comply with my non-disclosure agreement, I have omitted or modified confidential information. All views are my own.
GoodCall deploys AI phone agents that handle millions of inbound calls for enterprises. Two of our largest prospects, a Fortune 500 logistics company and a prescription benefits manager operating in a HIPAA-compliant environment, were ready to sign contracts that would represent roughly half the company's revenue at 9-figure scale.
But their procurement teams had the same question: how do you prove the agent won't say the wrong thing?
We didn't have a good answer. Manual QA took 3 to 5 days per deployment cycle and couldn't keep pace with the growing number of agent configurations. Every new feature launch was bottlenecked by it. Agents that passed spot checks would fail on real calls: giving a driver incorrect instructions, skipping required legal language, dropping data from a service ticket. Each incident eroded trust. Some cost us clients entirely.
The sales team was fielding objections we couldn't overcome with demos. The engineering team was spending more time on manual testing than on building. And I was watching the company's entire enterprise strategy stall on a problem that was fundamentally about tooling, not about model quality.
The problem wasn't that our agents were bad. It was that we had no way to prove they were good before they touched a live caller.
I saw a two-week window. Our lead AI engineer had bandwidth. The enterprise contracts were stalling. And I had a hypothesis: we didn't need a high-fidelity simulation platform. We needed a fast, lightweight system that could tell us whether an agent passed or failed across hundreds of scenarios in minutes, not days.
I co-led this sprint with our lead AI engineer. I owned the framework design, the JTBD-based test taxonomy, the build-vs-buy recommendation, the remediation loop protocol, and the eval UI. Engineering owned the cloud orchestration, voice pipeline integration, judge prompt implementation, and telemetry.
The system works like this: you define a scenario (what the simulated caller will say), an expected outcome (what should happen), and verdict criteria (the pass/fail rules). A simulated caller LLM follows the script, the agent responds through the real stack, and a separate judge LLM scores the transcript. Two LLM calls per eval (agent and judge). 30 to 60 seconds per eval, running in parallel across hundreds of scenarios. No voice simulation, no personality sliders. Just: did the agent do the job or not?
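A minimal sketch of that flow, assuming nothing about the production code: `Scenario`, `EvalResult`, `run_eval`, and `run_suite` are hypothetical names, and the two stub functions stand in for the real caller-plus-agent stack and the judge LLM. It only illustrates the shape: scenario in, transcript out, one pass/fail verdict, many scenarios in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Scenario:
    name: str
    caller_script: str           # what the simulated caller will say
    expected_outcome: str        # what should happen
    verdict_criteria: List[str]  # the pass/fail rules


@dataclass(frozen=True)
class EvalResult:
    scenario: str
    passed: bool
    transcript: str


# Hypothetical interfaces: a real system would wrap the simulated caller and
# the agent stack behind the first, and the judge LLM behind the second.
RunConversation = Callable[[Scenario], str]
Judge = Callable[[Scenario, str], bool]


def run_eval(scenario: Scenario, run_conversation: RunConversation, judge: Judge) -> EvalResult:
    """Produce a transcript for one scenario, then ask the judge for a verdict."""
    transcript = run_conversation(scenario)
    return EvalResult(scenario.name, judge(scenario, transcript), transcript)


def run_suite(scenarios: List[Scenario], run_conversation: RunConversation,
              judge: Judge, max_workers: int = 32) -> List[EvalResult]:
    """Run every scenario in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_eval(s, run_conversation, judge), scenarios))


# Stubs for illustration only: no LLM is called here.
def stub_conversation(scenario: Scenario) -> str:
    return f"caller: {scenario.caller_script}\nagent: ticket created"


def stub_judge(scenario: Scenario, transcript: str) -> bool:
    return all(rule.lower() in transcript.lower() for rule in scenario.verdict_criteria)


if __name__ == "__main__":
    suite = [
        Scenario("tire_blowout", "I have a tire blowout", "ticket created", ["ticket created"]),
        Scenario("wants_human", "Get me a person", "call escalated", ["transferring you"]),
    ]
    for result in run_suite(suite, stub_conversation, stub_judge):
        print(result.scenario, "PASS" if result.passed else "FAIL")
```

Passing the conversation runner and the judge in as plain callables keeps the harness independent of any model or provider, which is what makes a swap to real LLM calls a one-line change in this sketch.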
We also baked in latency telemetry from the start: average time to first token, P50 and P95 TTFT, tokens per second, broken down by model and provider. When a voice agent takes too long to respond, the caller thinks it's broken. Latency isn't a quality-of-life metric. It's a behavioral one. Surfacing it alongside pass/fail means you catch regressions before they sound like silence on the other end of the phone.
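The telemetry roll-up can be sketched as follows. This is an illustration under assumptions, not the production implementation: `LatencySample` and `summarize` are hypothetical names, the field layout is invented, and the percentile method (Python's `statistics.quantiles` with `method="inclusive"`) is my choice for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, quantiles
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LatencySample:
    model: str
    provider: str
    ttft_ms: float       # time to first token, in milliseconds
    output_tokens: int   # tokens generated in the response
    generation_s: float  # wall-clock generation time, in seconds


def summarize(samples: Iterable[LatencySample]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate TTFT and throughput per (model, provider) pair."""
    groups: Dict[Tuple[str, str], List[LatencySample]] = defaultdict(list)
    for sample in samples:
        groups[(sample.model, sample.provider)].append(sample)

    report = {}
    for key, group in groups.items():
        ttfts = [s.ttft_ms for s in group]
        # quantiles() needs at least two points; with one sample every
        # percentile is that sample.
        cuts = quantiles(ttfts, n=100, method="inclusive") if len(ttfts) > 1 else [ttfts[0]] * 99
        total_tokens = sum(s.output_tokens for s in group)
        total_time = sum(s.generation_s for s in group)
        report[key] = {
            "avg_ttft_ms": mean(ttfts),
            "p50_ttft_ms": cuts[49],  # 50th percentile
            "p95_ttft_ms": cuts[94],  # 95th percentile
            "tokens_per_s": total_tokens / total_time if total_time else 0.0,
        }
    return report
```

Reporting P95 next to the average matters for voice: a healthy mean can hide the slow tail that a caller experiences as dead air.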
The market was building toward high-fidelity simulation: realistic voices, personality modeling, nuanced conversation flows. I went the opposite direction. Every architectural decision was about stripping to essentials and optimizing for the question that actually gated our deployments.
Every agent is decomposed into jobs it was hired to do. A trucking breakdown agent has jobs like "collect vehicle information," "create a service ticket," "escalate when the issue is outside scope." Each job gets eval coverage across test types: clear intent, vague start, escalation, boundary, and system tests. The taxonomy ensures we're testing behaviors, not just happy paths.
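One way to picture the taxonomy is as a jobs-by-test-types grid where any empty cell is a coverage gap. The sketch below assumes a hypothetical eval record with `job` and `test_type` keys; the identifiers are illustrative, not the real schema.

```python
from typing import Dict, Iterable, List, Tuple

# The five test types named above, as hypothetical identifiers.
TEST_TYPES = ("clear_intent", "vague_start", "escalation", "boundary", "system")


def coverage_gaps(jobs: Iterable[str], evals: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return every (job, test_type) cell that has no eval yet."""
    covered = {(e["job"], e["test_type"]) for e in evals}
    return [(job, t) for job in jobs for t in TEST_TYPES if (job, t) not in covered]


if __name__ == "__main__":
    jobs = ["collect vehicle information", "create a service ticket"]
    evals = [
        {"job": "collect vehicle information", "test_type": "clear_intent"},
        {"job": "create a service ticket", "test_type": "clear_intent"},
    ]
    for job, test_type in coverage_gaps(jobs, evals):
        print(f"missing: {job} / {test_type}")
```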
Initial test cases come from call data analytics. We know what callers actually say, so the scenarios start grounded in reality. From there, every new feature gets corresponding evals before it ships. The eval is the spec.
When evals fail, the system classifies the failure (config issue, eval too strict, fatal error, scenario needs rewrite), proposes a targeted fix, and re-runs. This is the part that makes it a development tool, not just a testing tool. Max 5 iterations before a human steps in. The loop closes itself, but with guardrails.
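The loop's control flow can be sketched like this. The four failure classes and the cap of 5 come from the description above; everything else is an assumption for illustration: `remediate`, `RemediationResult`, and the three injected callables are hypothetical, and `candidate` is a stand-in for whatever bundle of agent config and eval definition a fix may touch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, List, Tuple


class FailureKind(Enum):
    CONFIG_ISSUE = "config issue"
    EVAL_TOO_STRICT = "eval too strict"
    FATAL_ERROR = "fatal error"
    SCENARIO_NEEDS_REWRITE = "scenario needs rewrite"


MAX_ITERATIONS = 5  # after this many failed runs, a human steps in


@dataclass
class RemediationResult:
    candidate: Any
    passed: bool
    iterations: int
    history: List[FailureKind] = field(default_factory=list)


def remediate(
    candidate: Any,
    run_eval: Callable[[Any], Tuple[bool, str]],           # returns (passed, transcript)
    classify: Callable[[str], FailureKind],                # labels a failed transcript
    propose_fix: Callable[[Any, FailureKind, str], Any],   # returns a revised candidate
    max_iterations: int = MAX_ITERATIONS,
) -> RemediationResult:
    """Run, classify the failure, apply a targeted fix, and re-run, up to the cap."""
    history: List[FailureKind] = []
    for attempt in range(1, max_iterations + 1):
        passed, transcript = run_eval(candidate)
        if passed:
            return RemediationResult(candidate, True, attempt, history)
        kind = classify(transcript)
        history.append(kind)
        if attempt < max_iterations:  # no point proposing a fix that will never be run
            candidate = propose_fix(candidate, kind, transcript)
    # Guardrail reached: hand the last candidate and its failure history to a human.
    return RemediationResult(candidate, False, max_iterations, history)
```

Returning the failure history alongside the verdict gives the human reviewer the sequence of classifications the loop worked through, not just the final state.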
Coval is the best example of the fidelity-first direction in conversational AI eval tooling. We trialed it alongside our internal build during the same sprint: full voice simulation, personality sliders, rich testing features. It's a genuinely good product.
But I recommended we build. The reason wasn't technical capability. It was strategic fit. Coval was optimized for the question "how realistic can the test be?" We needed the answer to "can we ship today?" Hundreds of scenarios in parallel. Results in minutes. Latency telemetry baked in. A remediation loop that iterates autonomously. None of that existed in the market because the market was solving a different problem.
| Factor | Coval | What We Built |
|---|---|---|
| Execution model | Full conversational + voice simulation | Text-based endpoint, stripped to essentials |
| Throughput | Rich but slower per-eval | High parallelism, 30 to 60s per eval |
| Fidelity | Personality sliders, voice sim, caller modeling | Scenario-driven, controlled inputs |
| What we needed | Fast pass/fail gates across 150+ test cases | |
| Cost at scale | Higher per-eval (more LLM calls, voice infra) | 2 LLM calls per eval (agent + judge) |
This was a bet against the market direction. The market was building toward simulation fidelity. I bet that volume and speed would matter more for deployment gates. That bet has held. As the system matures, revisiting richer simulation tooling remains an option, but the core insight (strip to essentials, optimize for throughput) has not been wrong yet.
One of the largest trucking companies in the US routes its breakdown calls through our agents. A driver calls in: tire blowout, coolant leak, truck won't start. The agent collects information, creates a service ticket, and dispatches help. The data has to flow end-to-end or someone is stranded on the side of a highway.
When we ran our eval suite against the initial configuration, ticket creation was inconsistent. Not dramatically broken. It worked often enough that a few manual test calls would have looked fine. But across hundreds of parallel evals, the pattern was unmistakable: the agent wasn't reliably calling its logging tools. It was being asked to call them one by one, and the model was getting overloaded.
We changed the architecture to batch everything into a single tool call. Ticket creation accuracy went to 95%+.
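To make the change concrete, here is a rough before-and-after sketch. The field names, tool names, and handler are all hypothetical; the point is only the shape: several one-field logging tools replaced by one tool whose arguments carry the whole ticket.

```python
from typing import Any, Dict

# Hypothetical ticket fields, for illustration only.
REQUIRED_FIELDS = ("driver_name", "vehicle_id", "location", "issue")

# Before (illustrative): one logging tool per field. The model has to remember
# to make every one of these calls, in sequence, during a live conversation.
PER_FIELD_TOOLS = [f"log_{name}" for name in REQUIRED_FIELDS]

# After: a single tool definition in JSON Schema style. One call, all fields.
CREATE_TICKET_TOOL: Dict[str, Any] = {
    "name": "create_service_ticket",
    "description": "Create a breakdown service ticket with all collected details.",
    "parameters": {
        "type": "object",
        "properties": {name: {"type": "string"} for name in REQUIRED_FIELDS},
        "required": list(REQUIRED_FIELDS),
    },
}


def handle_create_ticket(args: Dict[str, str]) -> Dict[str, Any]:
    """Accept the batched call only when every required field is present."""
    missing = [name for name in REQUIRED_FIELDS if not args.get(name)]
    if missing:
        # Report the gaps back so the agent can ask the caller for them.
        return {"ok": False, "missing": missing}
    return {"ok": True, "ticket": {name: args[name] for name in REQUIRED_FIELDS}}
```

With a single call, a partially filled ticket becomes an explicit, checkable failure instead of a silently skipped logging step.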
Without the eval system, this would have been caught in production. By a dispatcher wondering where the ticket went. By a driver still waiting.
The eval system screens for API data connections, behavioral guardrails, and tool orchestration. All of these pass a manual spot check. None of them survive at scale without automation.
The system started as something I ran from a terminal during a two-week sprint. Within months, it was the backbone of how the entire team ships. That trajectory wasn't accidental. I designed the system to be lightweight enough that non-engineers could use it, because I knew the bottleneck wasn't building evals. It was writing them fast enough to keep pace with how quickly we were shipping agent configs.
For self-serve users, we now generate default evals out of the box: new lead capture, caller wants a human, caller stops talking. You get a baseline quality gate just by creating an agent. For power users, we built an MCP-based authoring system where you can write, edit, and run evals at scale, then pipe results directly into your development pipeline.
That authoring layer is now part of the agent development studio, a visual UI where users seed an agent from a website URL, extend it with API connections, deploy to a phone number, and write evals against it. The eval system went from a tool I built to unblock a deployment to a product feature that ships with the platform.
The quality gate unlocked regulated verticals (healthcare and logistics) and the contracts that came with them.
Same-day iteration on agent configs with automated pass/fail feedback.
Eval-first development shifted agent tuning from engineers to PMs and product leaders.
The eval system is load-bearing infrastructure for contracts that represent roughly half the company's revenue. It's what made it possible to enter regulated verticals like healthcare and logistics, where procurement teams need systematic proof of agent reliability before signing. 75% of enterprise pilots converted to annual contracts. The top 4 clients roughly doubled their contract value. The quality gate didn't just protect existing revenue. It created new revenue that wasn't accessible before.
A deployment cycle that used to take 3 to 5 days now takes one. Same-day iteration on agent configs with automated pass/fail feedback. 150+ evals covering key enterprise workflows. The constraint on shipping is no longer QA bandwidth. It's how fast you can write the eval.
This is the impact I care about most. Evals aren't a QA step anymore. They're the development driver. When you build a new feature, you start by writing the eval. Then you iterate on the config until it passes. The eval is the spec, and the spec runs itself.
This shifted agent behavior tuning from engineers to PMs and product leaders. That's a multiplier. Instead of hiring more engineers to handle more agent configs, we gave non-engineers the tools to ship with confidence. The system scales with the team, not against it.
It's not foolproof. LLM non-determinism means occasional false signals. An eval might pass today and fail tomorrow on the same config. We haven't fully solved that. But it has changed how the team ships in a way that would be hard to walk back, and nobody wants to.
The architecture already supports the next step. A post-call QA layer (a multi-agent system I also designed) processes every call, tags outcomes, rates quality, and flags issues like loops, verbosity, or incorrect escalations. That's the monitoring half. The eval system is the action half.
Connecting them means: take flagged issues from live calls, automatically generate evals from that feedback, and iterate on agent configs until the new evals pass. Every client interaction makes the agent better without a human writing the eval.
We haven't shipped this yet. The engineering is straightforward. The product judgment is not. How much feedback do you need before generating a reliable eval? How much autonomy should the loop have before requiring human review? Get those thresholds wrong and the system optimizes for the wrong thing. These are the decisions I'm working through now, and they're the reason this is a product leadership problem, not an engineering one.
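To show where those two thresholds would sit, here is a sketch of the unshipped hand-off from monitoring to eval generation. Everything in it is hypothetical: the `Flag` and `DraftEval` shapes, the function name, and above all the default values of `min_occurrences` and `auto_approve_at`, which are placeholders and not recommendations.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Flag:
    call_id: str
    agent: str
    issue: str  # e.g. "loop", "verbosity", "incorrect_escalation"


@dataclass(frozen=True)
class DraftEval:
    agent: str
    issue: str
    evidence_calls: int
    needs_human_review: bool


def draft_evals(flags: Iterable[Flag], min_occurrences: int = 3,
                auto_approve_at: int = 10) -> List[DraftEval]:
    """Turn repeated post-call flags into draft evals.

    min_occurrences: how much feedback is needed before drafting an eval.
    auto_approve_at: how much evidence lets the loop proceed without review.
    Both defaults are placeholders chosen only for this example.
    """
    calls: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for flag in flags:
        calls[(flag.agent, flag.issue)].add(flag.call_id)  # count distinct calls

    drafts = []
    for (agent, issue), call_ids in sorted(calls.items()):
        n = len(call_ids)
        if n >= min_occurrences:
            drafts.append(DraftEval(agent, issue, n, needs_human_review=n < auto_approve_at))
    return drafts
```

The code is a few lines either way. The open decision is what values those two parameters should take for each client and failure type.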