The TDD loop works because assertEqual(2 + 2, 4) returns the same answer every time. Agents don't work that way. Ask Claude to "summarize this document" ten times and you'll get ten different summaries. Traditional unit tests can't handle that. Eval-driven development can.

Anthropic's engineering team published their eval framework in early 2026, and the core recommendation is blunt: practice eval-driven development. Build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well. Start with 20–50 simple tasks drawn from real failures. That's your minimum viable eval suite.

Deterministic scoring, where the same input produces the same result every time with no LLM in the loop, sits at one end of the eval spectrum. Most teams need the full range.

Why unit tests break down

A unit test asserts an exact output for a given input. Agent outputs are non-deterministic by nature, and the same prompt can produce structurally different but equally valid responses. An academic framework for eval-driven development (EDDOps) makes this precise: unlike TDD and BDD, EDDOps must address agents that pursue under-specified goals, produce non-deterministic outputs, and continue to adapt after deployment.

The practical consequence: you can't write assert agent.respond("summarize X") == "expected summary". You need grading functions that evaluate properties of the output rather than its exact text.
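A property-based grader for a summarization task might look like the sketch below. The specific checks (word budget, refusal phrase) are illustrative assumptions, not a fixed recipe:

```python
def grade_summary(output: str, max_words: int = 200) -> dict:
    """Score properties of a summary rather than its exact text."""
    checks = {
        "non_empty": len(output.strip()) > 0,
        "under_word_budget": len(output.split()) <= max_words,
        "no_refusal": "I cannot" not in output,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Each check is deterministic, so the same output always earns the same score, even though the agent that produced it is non-deterministic.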

The eval spectrum

Every eval falls somewhere on a spectrum from fully deterministic to fully human. Anthropic's framework breaks grading into three categories, and their recommendation is clear: use deterministic graders wherever possible, supplement with LLM-based graders when necessary, use human graders primarily for calibration.

| Grader type | Speed | Cost | Best for |
| --- | --- | --- | --- |
| Code-based (deterministic) | Milliseconds | Free | String matching, JSON validity, tool call verification, regex checks |
| LLM-as-judge | Seconds | $0.01–0.10/eval | Rubric-based scoring, pairwise comparison, open-ended quality |
| Human eval | Minutes to hours | $1–10+/eval | Calibrating automated graders, subjective quality, edge cases |

The mistake most teams make is jumping straight to LLM-as-judge for everything. That's expensive, slow, and introduces its own non-determinism. Code-based graders cover more ground than people think.

[Figure: the eval grading spectrum — code-based evals (deterministic, fast, cheap) → LLM-as-judge (flexible, moderate cost) → human review (gold standard, expensive, slow)]

Deterministic evals: more powerful than you'd expect

With deterministic scoring, when a score drops from 82 to 71, you know exactly which dimension changed and why. There's no "the judge model was having an off day." Deterministic evals give you:

  • Reproducibility. Same input, same score. Always. Run it in CI, run it locally, run it air-gapped. Identical results.
  • Debuggability. When something fails, you trace the code path, not a prompt chain.
  • Speed. Milliseconds per eval. You can run thousands in a CI pipeline without hitting rate limits.
  • Zero cost. No API calls, no token budgets, no billing surprises.

Deterministic evals are right for anything with a verifiable ground truth: did the agent call the correct tool? Did the output contain required fields? Did the code compile? Did the response stay under the token budget? OpenAI's open-source evals framework supports this pattern. You provide data in JSON, specify parameters in YAML, and many basic evals need zero evaluation code.
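Two of those checks can be sketched in a few lines. The trace format here (a list of step dicts with a "tool" key) is a hypothetical shape, not a real framework's schema:

```python
import json

def grade_tool_call(trace: list[dict], expected_tool: str) -> bool:
    """Did the agent call the expected tool at least once in its trace?"""
    return any(step.get("tool") == expected_tool for step in trace)

def grade_json_fields(output: str, required: set[str]) -> bool:
    """Is the output valid JSON containing every required field?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return required <= data.keys()
```

Both graders run in microseconds, cost nothing, and return the same verdict every time.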

LLM-as-judge: when you need it

Deterministic checks hit a wall when the success criterion is subjective. "Was the summary accurate?" "Was the tone professional?" "Did the agent handle the edge case gracefully?" These require judgment, and that's where you bring in a model as the grader.

The key is calibration. An LLM judge is only as good as its rubric, and you validate rubrics by comparing LLM grades against human grades on the same tasks. Anthropic recommends reserving human evaluation specifically for this calibration step, not as a primary grading mechanism, but as the ground truth that keeps your automated graders honest.
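The calibration step itself can be deterministic. A minimal sketch, assuming you have binary pass/fail labels from both the judge and humans on the same tasks (the 0.9 threshold is an illustrative choice, not a published recommendation):

```python
def judge_agreement(llm_grades: list[bool], human_grades: list[bool]) -> float:
    """Fraction of tasks where the LLM judge matches the human label."""
    assert len(llm_grades) == len(human_grades) and llm_grades
    matches = sum(l == h for l, h in zip(llm_grades, human_grades))
    return matches / len(llm_grades)

def judge_is_calibrated(llm_grades, human_grades, threshold: float = 0.9) -> bool:
    """Trust the judge only if it agrees with humans often enough."""
    return judge_agreement(llm_grades, human_grades) >= threshold
```

If agreement drops below the threshold, the fix is usually a sharper rubric, not more human grading.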

Platforms like Braintrust and LangSmith have built this into their workflows. Braintrust offers trajectory-level scoring for multi-step agent evaluations. LangSmith provides step-level scoring optimized for LangGraph's node/edge structure. Both let you mix deterministic and model-based graders in the same eval suite.

The practical advice from The Pragmatic Engineer's guide to evals: use code-based evals for deterministic failures, LLM-as-judge for subjective cases, and average scores across 3+ runs to absorb non-deterministic variance.

Two eval categories you need

Anthropic distinguishes between two categories that serve fundamentally different purposes:

Capability evals start at low pass rates. They target things the agent can't do yet, "a hill to climb." You write them before the capability exists and iterate until the agent passes. This is the eval-driven development loop: define the behavior, measure the gap, close it.

Regression evals maintain near-100% pass rates. They protect against performance degradation when you change prompts, swap models, or update tools. If a regression eval drops below threshold, something broke. These run in CI on every change.

The distinction matters because the response to failure is different. A failing capability eval means "keep iterating." A failing regression eval means "revert and investigate."
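That branching response is simple enough to encode directly in a CI gate. A sketch under assumed names (the eval kinds and thresholds here are illustrative):

```python
def gate(pass_rate: float, kind: str, threshold: float) -> str:
    """Map an eval failure to the right response based on its category."""
    if pass_rate >= threshold:
        return "ok"
    # Capability evals are expected to fail early; regression evals are not.
    return "keep iterating" if kind == "capability" else "revert and investigate"
```

A capability eval at 40% against a 90% target says "keep iterating"; a regression eval at 85% against a 99% floor says "revert and investigate".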


Metrics for non-deterministic systems

Traditional tests are binary: pass or fail. Agent evals need probabilistic metrics because the same agent can succeed on one run and fail on the next.

Pass@k: probability the agent achieves at least one correct solution in k attempts. Useful for capability evals where you care whether the agent can do something.

Pass^k: probability all k trials succeed. This is the reliability metric, and it gets increasingly stringent as k increases. If your agent passes 80% of the time, Pass^3 is 51.2%. For production systems, Pass^k is what matters.

The gap between Pass@1 and Pass^k tells you how reliable your agent actually is. A 90% Pass@1 sounds great. A 73% Pass^3 sounds like production incidents.
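Under the simplifying assumption of independent trials with a fixed per-run success rate p, both metrics reduce to one-line formulas:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (Pass^k)."""
    return p ** k
```

Plugging in the numbers from above: pass_hat_k(0.8, 3) gives 0.512 and pass_hat_k(0.9, 3) gives 0.729, which is where the 51.2% and 73% figures come from. (In practice you estimate p empirically from repeated runs rather than assuming it.)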

A decision framework

After building deterministic eval systems and watching teams adopt model-based ones, here's the framework I use for deciding which eval type to apply:

| Question | If yes | If no |
| --- | --- | --- |
| Is there a verifiable ground truth? | Code-based grader | Next question |
| Can you write an unambiguous rubric? | LLM-as-judge with rubric | Next question |
| Can you compare two outputs and pick the better one? | LLM pairwise comparison | Human eval |

Start from the top. Use the simplest grader that can distinguish good from bad. Code-based graders are fast, free, and deterministic. Use them for everything they can cover. Reach for LLM-as-judge only for the parts where code can't express the success criteria. Reserve human eval for calibrating the judges.

The bottom line

You wouldn't merge a PR without tests passing. The same standard should apply to agent changes. Define the behavior with evals. Run them in CI. Block deploys on regressions. Use Pass^k for reliability targets, not Pass@1 for vibes.

It's the most reliable way I've found to ship agents that don't silently degrade. Write the eval first. Then make it pass.