Part 7 — Evaluation and Benchmarking
How to Evaluate an Agent
Sections in this chapter
1. Why evaluation is engineering, not QA
2. What to evaluate
3. Golden datasets
4. LLM-as-judge
5. Regression testing in CI
6. Statistical rigor
7. Eval-as-a-service
8. Benchmarks to know by name and by mechanism
9. The eval-debugging discipline
Key Takeaways
Insight
A useful way to frame the shift: in traditional software, tests tell you whether you built the thing right. In agent software, evals tell you whether the thing you built is still right. Evals are continuous, because the models, the data, and the users keep changing underneath a system that passed every test at ship time.
Insight
The useful thing to say about any benchmark is not its leaderboard number but what it measures and what it doesn't. A 70% on SWE-bench Verified does not mean 70% of software-engineering problems are solved; it means 70% of one curated set of GitHub issues were resolved under that benchmark's specific harness and grading rules.
Common Trap
"Average score improved by 3%" is not a ship signal if any important slice dropped by more than that. The model provider's headline benchmark improvements are average-over-slice numbers; the regressions that matter to your product hide inside those averages.
Interview Questions
1. Design an eval for a customer-support agent where "correct" is subjective.
Frame: golden dataset from production traffic, with expert-annotated canonical replies. Scorers: an LLM judge in pairwise mode against the canonical reply, scoring helpfulness, factuality, and tone compliance in three separate calls, with an ensemble of three judges per criterion. Human rubric sampling on 5% of cases for ground-truth calibration. Slice by intent category and difficulty so a regression in one slice cannot hide behind the average.
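A minimal sketch of the pairwise judge described above, assuming a hypothetical `complete(prompt)` helper in place of a real provider client; the criteria and verdict labels are illustrative:

```python
import collections

CRITERIA = ["helpfulness", "factuality", "tone_compliance"]

JUDGE_PROMPT = """You are grading a support reply against an expert canonical reply.
Criterion: {criterion}
User message: {user_msg}
Canonical reply: {canonical}
Candidate reply: {candidate}
Answer with exactly one word: WIN (candidate better), LOSE, or TIE."""

def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    raise NotImplementedError

def judge_pairwise(user_msg: str, canonical: str, candidate: str,
                   ensemble: int = 3) -> dict:
    """One call per criterion per ensemble member; majority vote per criterion."""
    verdicts = {}
    for criterion in CRITERIA:
        votes = collections.Counter()
        for _ in range(ensemble):
            prompt = JUDGE_PROMPT.format(criterion=criterion, user_msg=user_msg,
                                         canonical=canonical, candidate=candidate)
            votes[complete(prompt).strip().upper()] += 1
        verdicts[criterion] = votes.most_common(1)[0][0]
    return verdicts
```

Separate calls per criterion keep each judgment narrow; the ensemble vote damps single-call noise.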
2. Your eval says 85% success; production says 60%. Debug the gap.
Frame: the five causes. Distribution mismatch, evaluator mismatch, silent downstream regression, user behaviour difference, infrastructure noise. Diagnose each systematically rather than guessing.
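The first cause, distribution mismatch, is checkable mechanically. A sketch under the assumption that each case carries an intent label (helper names are illustrative):

```python
from collections import Counter

def slice_shares(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distribution_gap(eval_labels: list[str], prod_labels: list[str]) -> list[tuple]:
    """Rank slices by how over- or under-represented they are in the eval set
    relative to production traffic. Large gaps are the first suspect when
    eval and production scores diverge."""
    e, p = slice_shares(eval_labels), slice_shares(prod_labels)
    keys = set(e) | set(p)
    gaps = [(k, e.get(k, 0.0) - p.get(k, 0.0)) for k in keys]
    return sorted(gaps, key=lambda kv: abs(kv[1]), reverse=True)

# Example: refund requests dominate production but are rare in the eval set.
print(distribution_gap(
    ["faq"] * 80 + ["refund"] * 20,
    ["faq"] * 40 + ["refund"] * 60,
))
```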
3. A model upgrade scores +3% overall but regresses on one important slice. What do you do?
Frame: don't ship. Average improvement cannot excuse a critical-slice regression. Either fix the regression (prompt tuning, routing the slice to the old model, targeted retraining) or reject the upgrade. Raise the eval slice thresholds to make this automatic next time.
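A sketch of the per-slice gate, with slice names, scores, and the tolerance all illustrative:

```python
def slice_gate(old_scores: dict[str, float], new_scores: dict[str, float],
               max_slice_drop: float = 0.01) -> tuple[bool, list[str]]:
    """Reject a candidate model if any slice regresses beyond the tolerance,
    no matter how much the average improves."""
    regressions = [
        name for name, old in old_scores.items()
        if new_scores.get(name, 0.0) < old - max_slice_drop
    ]
    return (not regressions, regressions)

# +3 points on average, but 'billing' drops 8 points: the gate fails the upgrade.
old = {"faq": 0.90, "billing": 0.80, "refunds": 0.70}
new = {"faq": 0.96, "billing": 0.72, "refunds": 0.81}
ok, bad = slice_gate(old, new)
print(ok, bad)  # False ['billing']
```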
4. Trajectory eval vs. final-answer eval: when does each matter?
Frame: final-answer when the task has a canonical answer and the path doesn't matter. Trajectory when the path matters for cost, safety, auditability, or when the answer is subjective (then the path is evidence for the answer's trustworthiness). Most production agents need both.
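A toy sketch of running both scorers on the same run; the trajectory properties (tool allow-list, cost budget) and penalty weights are assumptions, not a prescribed scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str
    cost_usd: float

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def final_answer_score(t: Trajectory, expected: str) -> float:
    """Final-answer eval: did we land on the canonical answer?"""
    return 1.0 if t.final_answer.strip() == expected.strip() else 0.0

def trajectory_score(t: Trajectory, allowed_tools: set[str],
                     budget_usd: float) -> float:
    """Trajectory eval: was the path safe and within budget?
    Each violated property costs a fraction of the score."""
    score = 1.0
    if any(s.tool not in allowed_tools for s in t.steps):
        score -= 0.5  # unauthorized tool use is a safety violation
    if sum(s.cost_usd for s in t.steps) > budget_usd:
        score -= 0.5  # blew the cost budget even with a correct answer
    return max(score, 0.0)

t = Trajectory([Step("search", 0.02), Step("db_write", 0.01)], "42")
print(final_answer_score(t, "42"))                       # 1.0: the answer is right
print(trajectory_score(t, {"search"}, budget_usd=0.05))  # 0.5: the path was not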
5. Walk me through validating an LLM-as-judge.
Frame: golden-pair set, an agreement metric (Cohen's κ or direct agreement, depending on rubric), a pass threshold on that metric, and iteration on the rubric or judge model if the score falls below it. A validated judge goes in CI; an unvalidated judge is decoration.
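A self-contained sketch of the agreement check via Cohen's κ on a small golden-pair set; the labels and dataset are illustrative, and the pass threshold is whatever you preset:

```python
def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement beyond chance between human labels and judge labels
    on the same golden-pair set."""
    assert len(human) == len(judge)
    n = len(human)
    labels = set(human) | set(judge)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    expected = sum(
        (human.count(l) / n) * (judge.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

human = ["WIN", "LOSE", "WIN", "TIE", "WIN", "LOSE"]
judge = ["WIN", "LOSE", "WIN", "WIN", "WIN", "LOSE"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")  # ~0.70; gate on your threshold
```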
6. How do you keep eval costs bounded at scale?
Frame: tiered evals (smoke on PR, full nightly); prompt caching on shared prefixes; batch APIs for non-real-time runs; eval-as-a-service so infrastructure is shared across teams. Track eval spend as a line item.
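A sketch of tiered evals as data; sample sizes, triggers, and the batch-API discount are assumptions to be replaced with your own numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTier:
    name: str
    trigger: str         # when this tier runs
    sample_size: int     # cases drawn from the golden dataset
    use_batch_api: bool  # cheaper, non-real-time completion pricing

TIERS = [
    EvalTier("smoke",   trigger="pull_request", sample_size=50,   use_batch_api=False),
    EvalTier("full",    trigger="nightly",      sample_size=2000, use_batch_api=True),
    EvalTier("release", trigger="pre_deploy",   sample_size=5000, use_batch_api=True),
]

def estimate_cost(tier: EvalTier, cost_per_case: float,
                  batch_discount: float = 0.5) -> float:
    """Rough spend per run; track this as a budget line item."""
    rate = cost_per_case * (batch_discount if tier.use_batch_api else 1.0)
    return tier.sample_size * rate

for t in TIERS:
    print(t.name, f"${estimate_cost(t, cost_per_case=0.02):.2f}")
```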