Featured Case Study

Eval-Driven Agent Development

I bet that a lightweight, text-based eval system would outperform purpose-built simulation platforms for pre-deployment quality gates. That bet unlocked regulated verticals, converted 75% of enterprise pilots to annual contracts, and became the backbone of how the team ships.

75%
Pilot → Annual Contract Conversion
~2x
Contract Expansion, Top 4 Clients
1 day
Feature Release Cycle

* To comply with my non-disclosure agreement, I have omitted or modified confidential information. All views are my own.

The Problem

The trust gap

GoodCall deploys AI phone agents that handle millions of inbound calls for enterprises. Two of our largest prospects, a Fortune 500 logistics company and a prescription benefits manager operating in a HIPAA-compliant environment, were ready to sign contracts that would represent roughly half the company's revenue at 9-figure scale.

But their procurement teams had the same question: how do you prove the agent won't say the wrong thing?

We didn't have a good answer. Manual QA took 3 to 5 days per deployment cycle and couldn't keep pace with the growing number of agent configurations. Every new feature launch was bottlenecked by it. Agents that passed spot checks would fail on real calls: giving a driver incorrect instructions, skipping required legal language, dropping data from a service ticket. Each incident eroded trust. Some cost us clients entirely.

The sales team was fielding objections we couldn't overcome with demos. The engineering team was spending more time on manual testing than on building. And I was watching the company's entire enterprise strategy stall on a problem that was fundamentally about tooling, not about model quality.

The problem wasn't that our agents were bad. It was that we had no way to prove they were good before they touched a live caller.

Before / After: The QA Bottleneck
Side-by-side showing manual QA cycle (3 to 5 days, spot checks, issues found in prod) vs. automated eval pipeline (same-day, 150+ scenarios, issues caught pre-deploy).

The Bet

A flight simulator for agents

I saw a two-week window. Our lead AI engineer had bandwidth. The enterprise contracts were stalling. And I had a hypothesis: we didn't need a high-fidelity simulation platform. We needed a fast, lightweight system that could tell us whether an agent passed or failed across hundreds of scenarios in minutes, not days.

I co-led this sprint with our lead AI engineer. I owned the framework design, the JTBD-based test taxonomy, the build-vs-buy recommendation, the remediation loop protocol, and the eval UI. Engineering owned the cloud orchestration, voice pipeline integration, judge prompt implementation, and telemetry.

I Owned
  • Framework design and system architecture
  • JTBD-based test taxonomy
  • Build-vs-buy recommendation to leadership
  • Remediation loop protocol
  • Eval UI on the platform

Engineering Owned
  • Cloud function orchestration
  • Voice pipeline integration
  • Judge prompt implementation
  • Telemetry pipeline

The system works like this: you define a scenario (what the simulated caller will say), an expected outcome (what should happen), and verdict criteria (the pass/fail rules). A simulated caller LLM follows the script, the agent responds through the real stack, and a separate judge LLM scores the transcript. Two LLM calls total. 30 to 60 seconds per eval, running in parallel across hundreds of scenarios. No voice simulation, no personality sliders. Just: did the agent do the job or not?
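A minimal sketch of that flow, assuming a single caller turn for brevity. The names here (`EvalScenario`, `run_eval`, `run_suite`, and the `stub_*` functions) are illustrative and not the production API; the stubs stand in for the simulated caller LLM, the agent stack, and the judge LLM so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalScenario:
    name: str
    caller_script: str      # what the simulated caller will say
    expected_outcome: str   # what should happen
    verdict_criteria: str   # the pass/fail rule given to the judge


@dataclass(frozen=True)
class EvalResult:
    name: str
    passed: bool            # binary verdict, no rubric score
    transcript: str


Caller = Callable[[str], str]
Agent = Callable[[str], str]
Judge = Callable[[str, EvalScenario], bool]


def run_eval(scenario: EvalScenario, caller: Caller, agent: Agent, judge: Judge) -> EvalResult:
    """One eval: the caller speaks, the agent answers, the judge scores the transcript."""
    caller_turn = caller(scenario.caller_script)
    agent_turn = agent(caller_turn)
    transcript = f"CALLER: {caller_turn}\nAGENT: {agent_turn}"
    return EvalResult(scenario.name, judge(transcript, scenario), transcript)


def run_suite(scenarios, caller, agent, judge, max_workers: int = 32) -> List[EvalResult]:
    """Run evals concurrently; results come back in scenario order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_eval(s, caller, agent, judge), scenarios))


# Hypothetical stand-ins. Real versions would call LLMs and the deployed agent endpoint.
def stub_caller(script: str) -> str:
    return script


def stub_agent(utterance: str) -> str:
    if "blowout" in utterance:
        return "I have created a service ticket."
    return "Sorry, I did not catch that."


def stub_judge(transcript: str, scenario: EvalScenario) -> bool:
    # Toy rule: the criteria string must appear somewhere in the transcript.
    return scenario.verdict_criteria.lower() in transcript.lower()
```

Threads fit here because each eval spends its time waiting on network calls, so hundreds of scenarios can overlap without much local compute.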

Eval Authoring and Results UI
Screenshot or stylized mockup showing the eval creation form (4 fields) alongside a results view with pass/fail verdicts, transcript drill-down, and latency telemetry.

We also baked in latency telemetry from the start: average time to first token, P50 and P95 TTFT, tokens per second, broken down by model and provider. When a voice agent takes too long to respond, the caller thinks it's broken. Latency isn't a quality-of-life metric. It's a behavioral one. Surfacing it alongside pass/fail means you catch regressions before they sound like silence on the other end of the phone.
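As an illustration of how those numbers could be rolled up, here is a small sketch that groups latency samples by model and provider. The `LatencySample` fields and the nearest-rank percentile method are assumptions made for the example, not a description of the actual telemetry pipeline.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LatencySample:
    model: str
    provider: str
    ttft_ms: float        # time to first token, in milliseconds
    output_tokens: int
    generation_s: float   # time spent generating the response, in seconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Return TTFT and throughput stats keyed by (model, provider)."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.model, s.provider)].append(s)

    report = {}
    for key, group in groups.items():
        ttfts = [s.ttft_ms for s in group]
        total_tokens = sum(s.output_tokens for s in group)
        total_time = sum(s.generation_s for s in group)
        report[key] = {
            "avg_ttft_ms": mean(ttfts),
            "p50_ttft_ms": percentile(ttfts, 50),
            "p95_ttft_ms": percentile(ttfts, 95),
            "tokens_per_s": total_tokens / total_time if total_time > 0 else 0.0,
        }
    return report
```

Reporting P95 next to the average matters for voice: a handful of slow responses can hide behind a healthy mean while still sounding like dead air to a caller.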

The decisions that mattered

The market was building toward high-fidelity simulation: realistic voices, personality modeling, nuanced conversation flows. I went the opposite direction. Every architectural decision was about stripping to essentials and optimizing for the question that actually gated our deployments.

Text endpoints, not voice simulation.
Does the agent get the right outcome given this input? That's the pre-deployment question. Voice quality is a separate concern tested in production.
Binary verdicts, not rubrics.
Pass or fail against explicit criteria. If a deployment gate requires human interpretation, it's not a gate. It's a suggestion.
Scenario-driven, not free-form.
Each eval tests one behavior in isolation. When something fails, you know exactly what broke. Emergent conversation testing sounds elegant but it's expensive, flaky, and impossible to debug at scale.
Anti-gaming by default.
The system defaults to fixing the agent, not loosening the test. Every eval modification requires transcript evidence. The same eval can't be weakened twice without human review. A system that can be gamed will be gamed, especially under deadline pressure.
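The anti-gaming rules above can be sketched as a small guard around eval edits. This is an illustration under assumptions: `EvalChangeGuard`, `EvalChangeRejected`, and the in-memory counter are hypothetical names, and a real system would persist this history rather than hold it in a dict.

```python
from dataclasses import dataclass, field
from typing import Dict


class EvalChangeRejected(Exception):
    """Raised when an eval modification violates the anti-gaming rules."""


@dataclass
class EvalChangeGuard:
    # Number of times each eval has already been loosened.
    weakened_count: Dict[str, int] = field(default_factory=dict)

    def request_change(
        self,
        eval_id: str,
        weakens: bool,
        transcript_evidence: str,
        human_approved: bool = False,
    ) -> None:
        # Rule 1: every modification needs transcript evidence.
        if not transcript_evidence or not transcript_evidence.strip():
            raise EvalChangeRejected("transcript evidence is required")

        # Rule 2: the same eval cannot be weakened twice without human review.
        if weakens:
            prior = self.weakened_count.get(eval_id, 0)
            if prior >= 1 and not human_approved:
                raise EvalChangeRejected("repeat weakening requires human review")
            self.weakened_count[eval_id] = prior + 1
```

Edits that tighten or merely reword an eval pass through as long as evidence is attached; only repeated loosening is stopped for review.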

System Architecture
Agent Config → Test Cases (JTBD) → Eval Scenarios → Parallel Execution → Judge Verdicts → Remediation Loop
Architecture visual showing the three layers: JTBD test taxonomy feeding eval scenarios, parallel execution engine with simulated caller + judge LLMs, and the autonomous remediation loop with anti-gaming guardrails.

The JTBD taxonomy

Every agent is decomposed into jobs it was hired to do. A trucking breakdown agent has jobs like "collect vehicle information," "create a service ticket," "escalate when the issue is outside scope." Each job gets eval coverage across test types: clear intent, vague start, escalation, boundary, and system tests. The taxonomy ensures we're testing behaviors, not just happy paths.
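One way to picture the taxonomy is as a job-by-test-type grid, where any empty cell is a coverage gap. The sketch below assumes evals are tagged with a `job` and a `test_type`; the dictionary shape and the `coverage_gaps` helper are hypothetical.

```python
# The five test types named above, as identifiers.
TEST_TYPES = ("clear_intent", "vague_start", "escalation", "boundary", "system")


def coverage_gaps(jobs, evals):
    """Return every (job, test_type) pair that has no eval yet."""
    covered = {(e["job"], e["test_type"]) for e in evals}
    return [(job, t) for job in jobs for t in TEST_TYPES if (job, t) not in covered]


jobs = [
    "collect vehicle information",
    "create a service ticket",
    "escalate when the issue is outside scope",
]

evals = [
    {"job": "collect vehicle information", "test_type": "clear_intent"},
    {"job": "collect vehicle information", "test_type": "vague_start"},
    {"job": "create a service ticket", "test_type": "clear_intent"},
    {"job": "create a service ticket", "test_type": "system"},
]

gaps = coverage_gaps(jobs, evals)
```

Listing the gaps makes "not just happy paths" checkable: a job covered only by clear-intent evals shows up immediately.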

Initial test cases come from call data analytics. We know what callers actually say, so the scenarios start grounded in reality. From there, every new feature gets corresponding evals before it ships. The eval is the spec.

The remediation loop

When evals fail, the system classifies the failure (config issue, eval too strict, fatal error, scenario needs rewrite), proposes a targeted fix, and re-runs. This is the part that makes it a development tool, not just a testing tool. Max 5 iterations before a human steps in. The loop closes itself, but with guardrails.
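A rough sketch of the loop's control flow, with the eval runner, classifier, and fix proposer passed in as plain functions. Those three callables and the returned dictionary are assumptions for illustration; only the four failure classes and the five-iteration cap come from the description above.

```python
from enum import Enum


class FailureClass(Enum):
    CONFIG_ISSUE = "config issue"
    EVAL_TOO_STRICT = "eval too strict"
    FATAL_ERROR = "fatal error"
    SCENARIO_NEEDS_REWRITE = "scenario needs rewrite"


MAX_ITERATIONS = 5


def remediate(config, run_evals, classify, propose_fix, max_iterations=MAX_ITERATIONS):
    """Classify failures, apply a targeted fix, and re-run until green or capped.

    run_evals(config) returns a list of failures (empty means all evals passed).
    classify(failures) returns a FailureClass.
    propose_fix(config, failures, failure_class) returns a new config.
    """
    history = []
    failures = run_evals(config)
    iterations = 0
    while failures and iterations < max_iterations:
        failure_class = classify(failures)
        config = propose_fix(config, failures, failure_class)
        failures = run_evals(config)
        history.append(failure_class)
        iterations += 1

    return {
        "status": "passed" if not failures else "needs_human",
        "config": config,
        "iterations": iterations,
        "history": history,
    }
```

The cap is what keeps the loop honest: if five targeted fixes have not turned the suite green, the result is handed to a person instead of being retried indefinitely.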

Build vs. Buy

Why not just buy Coval?

The market for conversational AI eval tooling was moving toward fidelity: realistic voices, personality modeling, nuanced caller simulation. Coval is the best example. We trialed it alongside our internal build during the same sprint. Full voice simulation, personality sliders, rich testing features. It's a genuinely good product.

But I recommended we build. The reason wasn't technical capability. It was strategic fit. Coval was optimized for the question "how realistic can the test be?" We needed the answer to "can we ship today?" Hundreds of scenarios in parallel. Results in minutes. Latency telemetry baked in. A remediation loop that iterates autonomously. None of that existed in the market because the market was solving a different problem.

| Factor | Coval | What We Built |
| --- | --- | --- |
| Execution model | Full conversational + voice simulation | Text-based endpoint, stripped to essentials |
| Throughput | Rich but slower per-eval | High parallelism, 30 to 60s per eval |
| Fidelity | Personality sliders, voice sim, caller modeling | Scenario-driven, controlled inputs |
| Cost at scale | Higher per-eval (more LLM calls, voice infra) | 2 LLM calls per eval (agent + judge) |

What we needed: fast pass/fail gates across 150+ test cases.

This was a bet against the market direction. The market was building toward simulation fidelity. I bet that volume and speed would matter more for deployment gates. That bet has held. As the system matures, revisiting richer simulation tooling remains an option, but the core insight (strip to essentials, optimize for throughput) has not been wrong yet.

In Practice

The dispatcher who never got the ticket

One of the largest trucking companies in the US routes their breakdown calls through our agents. A driver calls in: tire blowout, coolant leak, truck won't start. The agent collects information, creates a service ticket, and dispatches help. The data has to flow end-to-end or someone is stranded on the side of a highway.

When we ran our eval suite against the initial configuration, ticket creation was inconsistent. Not dramatically broken. It worked often enough that a few manual test calls would have looked fine. But across hundreds of parallel evals, the pattern was unmistakable: the agent wasn't reliably calling its logging tools. It was being asked to call them one by one, and the model was getting overloaded.

We changed the architecture to batch everything into a single tool call. Ticket creation accuracy went to 95%+.
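To make the change concrete, here is a sketch of what a single batched tool might look like: one call that carries every field, instead of one logging call per field. The tool name `create_service_ticket` and the field names are hypothetical, chosen only to illustrate the shape.

```python
# Hypothetical fields a breakdown ticket needs before it can be dispatched.
REQUIRED_FIELDS = ("driver_name", "vehicle_id", "location", "issue")


def create_service_ticket(payload: dict) -> dict:
    """Single batched tool: accept all ticket fields in one call.

    Returns the ticket on success, or the list of missing fields so the
    agent can ask the caller for them and retry with a complete payload.
    """
    missing = [f for f in REQUIRED_FIELDS if not payload.get(f)]
    if missing:
        return {"ok": False, "missing": missing}
    return {"ok": True, "ticket": {f: payload[f] for f in REQUIRED_FIELDS}}


complete = create_service_ticket({
    "driver_name": "Sam",
    "vehicle_id": "TRK-1042",
    "location": "I-80 mile marker 212",
    "issue": "tire blowout",
})

partial = create_service_ticket({"driver_name": "Sam", "issue": "coolant leak"})
```

With one call, the ticket is either written whole or rejected with an explicit list of gaps, which is also an easy outcome for a judge to verify.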

Without the eval system, this would have been caught in production. By a dispatcher wondering where the ticket went. By a driver still waiting.

The eval system screens for API data connections, behavioral guardrails, and tool orchestration. All of these pass a manual spot check. None of them survive at scale without automation.

From a deployment tool to a platform feature

The system started as something I ran from a terminal during a two-week sprint. Within months, it was the backbone of how the entire team ships. That trajectory wasn't accidental. I designed the system to be lightweight enough that non-engineers could use it, because I knew the bottleneck wasn't building the eval system. It was writing evals fast enough to keep pace with how quickly we were shipping agent configs.

For self-serve users, we now generate default evals out of the box: new lead capture, caller wants a human, caller stops talking. You get a baseline quality gate just by creating an agent. For power users, we built an MCP-based authoring system where you can write, edit, and run evals at scale, then pipe results directly into your development pipeline.
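A sketch of how default evals could be stamped out when an agent is created. The three eval names come from the list above; the caller scripts, criteria text, and the `default_evals` helper are invented for illustration.

```python
# Hypothetical templates: eval name -> (caller script, verdict criteria).
DEFAULT_TEMPLATES = {
    "new lead capture": (
        "Hi, I'd like to get a quote.",
        "Agent collects the caller's name and a way to reach them.",
    ),
    "caller wants a human": (
        "Can I talk to a real person?",
        "Agent offers a handoff instead of continuing the script.",
    ),
    "caller stops talking": (
        "",
        "Agent checks whether the caller is still there, then ends the call politely.",
    ),
}


def default_evals(agent_name: str) -> list:
    """Return the baseline eval set for a newly created agent."""
    return [
        {
            "agent": agent_name,
            "name": name,
            "caller_script": script,
            "verdict_criteria": criteria,
        }
        for name, (script, criteria) in DEFAULT_TEMPLATES.items()
    ]


baseline = default_evals("front-desk-agent")
```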

That authoring layer is now part of the agent development studio, a visual UI where users seed an agent from a website URL, extend it with API connections, deploy to a phone number, and write evals against it. The eval system went from a tool I built to unblock a deployment to a product feature that ships with the platform.

Impact

What changed

0
Revenue Stream Enabled

The quality gate that unlocked regulated verticals (healthcare and logistics) and the contracts that came with them.

1 day
Release Cycle

Same-day iteration on agent configs with automated pass/fail feedback.

PMs ship
Not just engineers

Eval-first development shifted agent tuning from engineers to PMs and product leaders.

Revenue protection

The eval system is load-bearing infrastructure for contracts that represent roughly half the company's revenue. It's what made it possible to enter regulated verticals like healthcare and logistics, where procurement teams need systematic proof of agent reliability before signing. 75% of enterprise pilots converted to annual contracts. The top 4 clients roughly doubled their contract value. The quality gate didn't just protect existing revenue. It created new revenue that wasn't accessible before.

Velocity

A deployment cycle that used to take 3 to 5 days now takes one. Same-day iteration on agent configs with automated pass/fail feedback. 150+ evals covering key enterprise workflows. The constraint on shipping is no longer QA bandwidth. It's how fast you can write the eval.

Organizational leverage

This is the impact I care about most. Evals aren't a QA step anymore. They're the development driver. When you build a new feature, you start by writing the eval. Then you iterate on the config until it passes. The eval is the spec, and the spec runs itself.

This shifted agent behavior tuning from engineers to PMs and product leaders. That's a multiplier. Instead of hiring more engineers to handle more agent configs, we gave non-engineers the tools to ship with confidence. The system scales with the team, not against it.

It's not foolproof. LLM non-determinism means occasional false signals. An eval might pass today and fail tomorrow on the same config. We haven't fully solved that. But it's changed how the team ships in a way that's hard to walk back from, and nobody wants to.

What's Next

The loop that closes itself

The architecture already supports the next step. A post-call QA layer (a multi-agent system I also designed) processes every call, tags outcomes, rates quality, and flags issues like loops, verbosity, or incorrect escalations. That's the monitoring half. The eval system is the action half.

Connecting them means: take flagged issues from live calls, automatically generate evals from that feedback, and iterate on agent configs until the new evals pass. Every client interaction makes the agent better without a human writing the eval.
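Since this part has not shipped, the following is only a thought sketch of the first step: turning repeated flags into draft evals. The flag shape, the `min_occurrences` threshold of 3, and the `draft_needs_review` status are all placeholder assumptions, not decided values.

```python
from collections import Counter


def draft_evals_from_flags(flags, min_occurrences=3):
    """Group flagged calls by (agent, issue) and draft an eval for repeat offenders.

    min_occurrences is a placeholder threshold; the right value is an open question.
    """
    counts = Counter((f["agent"], f["issue"]) for f in flags)
    drafts = []
    for (agent, issue), n in sorted(counts.items()):
        if n >= min_occurrences:
            drafts.append({
                "agent": agent,
                "issue": issue,
                "occurrences": n,
                # Drafts are not auto-activated in this sketch; a human reviews them first.
                "status": "draft_needs_review",
            })
    return drafts


flags = [
    {"agent": "breakdown", "issue": "loop"},
    {"agent": "breakdown", "issue": "loop"},
    {"agent": "breakdown", "issue": "loop"},
    {"agent": "breakdown", "issue": "verbosity"},
    {"agent": "intake", "issue": "incorrect escalation"},
]

drafts = draft_evals_from_flags(flags)
```

Lowering the threshold generates evals from thinner evidence, and raising it delays the fix, which is the tradeoff the questions below are about.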

We haven't shipped this yet. The engineering is straightforward. The product judgment is not. How much feedback do you need before generating a reliable eval? How much autonomy should the loop have before requiring human review? Get those thresholds wrong and the system optimizes for the wrong thing. These are the decisions I'm working through now, and they're the reason this is a product leadership problem, not an engineering one.