Part 7 — Evaluation and Benchmarking
How to Evaluate an Agent
Sections in this chapter
1. Why evaluation is engineering, not QA
2. What to evaluate
3. Golden datasets
4. LLM-as-judge
5. Regression testing in CI
6. Statistical rigor
7. Eval-as-a-service
8. Benchmarks to know by name and by mechanism
9. The eval-debugging discipline
Key Takeaways
Insight
A useful way to frame the shift: in traditional software, tests tell you whether you built the thing right. In agent software, evals tell you whether the thing you built is still right. Evals are continuous, because the models, the data, and the users keep changing underneath a system that passed every test at ship time.
Insight
The useful thing to say about any benchmark is not its leaderboard number but what it measures and what it doesn't. A 70% on SWE-bench Verified does not mean 70% of software-engineering problems are solved; it means 70% of one curated set of GitHub issues were resolved under that benchmark's specific harness and grading rules.
Common Trap
"Average score improved by 3%" is not a ship signal if any important slice dropped by more than that. The model provider's headline benchmark improvements are average-over-slice numbers; the regressions that matter to your product hide inside those averages.
Interview Questions
1. Design an eval for a customer-support agent where "correct" is subjective.
Frame: golden dataset from production traffic, with expert-annotated canonical replies. Scorers: an LLM judge in pairwise mode against the canonical reply, scoring helpfulness, factuality, and tone compliance in three separate calls, with an ensemble of three judges per criterion. Human rubric sampling on 5% of cases for ground-truth calibration. Slice by intent category and difficulty so a regression in one slice cannot hide behind the average.
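A minimal sketch of the pairwise judge described above, assuming a hypothetical `complete(prompt)` helper in place of a real provider client; the criteria and verdict labels are illustrative:

```python
import collections

CRITERIA = ["helpfulness", "factuality", "tone_compliance"]

JUDGE_PROMPT = """You are grading a support reply against an expert canonical reply.
Criterion: {criterion}
User message: {user_msg}
Canonical reply: {canonical}
Candidate reply: {candidate}
Answer with exactly one word: WIN (candidate better), LOSE, or TIE."""

def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    raise NotImplementedError

def judge_pairwise(user_msg: str, canonical: str, candidate: str,
                   ensemble: int = 3) -> dict:
    """One call per criterion per ensemble member; majority vote per criterion."""
    verdicts = {}
    for criterion in CRITERIA:
        votes = collections.Counter()
        for _ in range(ensemble):
            prompt = JUDGE_PROMPT.format(criterion=criterion, user_msg=user_msg,
                                         canonical=canonical, candidate=candidate)
            votes[complete(prompt).strip().upper()] += 1
        verdicts[criterion] = votes.most_common(1)[0][0]
    return verdicts
```

Separate calls per criterion keep each judgment narrow; the ensemble vote damps single-call noise.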
2. Your eval says 85% success; production says 60%. Debug the gap.
Frame: the five causes. Distribution mismatch, evaluator mismatch, silent downstream regression, user behaviour difference, infrastructure noise. Diagnose each systematically rather than guessing.
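The first cause, distribution mismatch, is checkable mechanically. A sketch under the assumption that each case carries an intent label (helper names are illustrative):

```python
from collections import Counter

def slice_shares(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distribution_gap(eval_labels: list[str], prod_labels: list[str]) -> list[tuple]:
    """Rank slices by how over- or under-represented they are in the eval set
    relative to production traffic. Large gaps are the first suspect when
    eval and production scores diverge."""
    e, p = slice_shares(eval_labels), slice_shares(prod_labels)
    keys = set(e) | set(p)
    gaps = [(k, e.get(k, 0.0) - p.get(k, 0.0)) for k in keys]
    return sorted(gaps, key=lambda kv: abs(kv[1]), reverse=True)

# Example: refund requests dominate production but are rare in the eval set.
print(distribution_gap(
    ["faq"] * 80 + ["refund"] * 20,
    ["faq"] * 40 + ["refund"] * 60,
))
```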
3. A model upgrade scores +3% overall but regresses on one important slice. What do you do?
Frame: don't ship. Average improvement cannot excuse a critical-slice regression. Either fix the regression (prompt tuning, routing the slice to the old model, targeted retraining) or reject the upgrade. Raise the eval slice thresholds to make this automatic next time.
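A sketch of the per-slice gate, with slice names, scores, and the tolerance all illustrative:

```python
def slice_gate(old_scores: dict[str, float], new_scores: dict[str, float],
               max_slice_drop: float = 0.01) -> tuple[bool, list[str]]:
    """Reject a candidate model if any slice regresses beyond the tolerance,
    no matter how much the average improves."""
    regressions = [
        name for name, old in old_scores.items()
        if new_scores.get(name, 0.0) < old - max_slice_drop
    ]
    return (not regressions, regressions)

# +3 points on average, but 'billing' drops 8 points: the gate fails the upgrade.
old = {"faq": 0.90, "billing": 0.80, "refunds": 0.70}
new = {"faq": 0.96, "billing": 0.72, "refunds": 0.81}
ok, bad = slice_gate(old, new)
print(ok, bad)  # False ['billing']
```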
4. Trajectory eval vs. final-answer eval: when does each matter?
Frame: final-answer when the task has a canonical answer and the path doesn't matter. Trajectory when the path matters for cost, safety, auditability, or when the answer is subjective (then the path is evidence for the answer's trustworthiness). Most production agents need both.
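A toy sketch of running both scorers on the same run; the trajectory properties (tool allow-list, cost budget) and penalty weights are assumptions, not a prescribed scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str
    cost_usd: float

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def final_answer_score(t: Trajectory, expected: str) -> float:
    """Final-answer eval: did we land on the canonical answer?"""
    return 1.0 if t.final_answer.strip() == expected.strip() else 0.0

def trajectory_score(t: Trajectory, allowed_tools: set[str],
                     budget_usd: float) -> float:
    """Trajectory eval: was the path safe and within budget?
    Each violated property costs a fraction of the score."""
    score = 1.0
    if any(s.tool not in allowed_tools for s in t.steps):
        score -= 0.5  # unauthorized tool use is a safety violation
    if sum(s.cost_usd for s in t.steps) > budget_usd:
        score -= 0.5  # blew the cost budget even with a correct answer
    return max(score, 0.0)

t = Trajectory([Step("search", 0.02), Step("db_write", 0.01)], "42")
print(final_answer_score(t, "42"))                       # 1.0: the answer is right
print(trajectory_score(t, {"search"}, budget_usd=0.05))  # 0.5: the path was not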
5. Walk me through validating an LLM-as-judge.
Frame: golden-pair set, an agreement metric (Cohen's κ or direct agreement, depending on rubric), a pass threshold on that metric, and iteration on the rubric or judge model if the score falls below it. A validated judge goes in CI; an unvalidated judge is decoration.
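A self-contained sketch of the agreement check via Cohen's κ on a small golden-pair set; the labels and dataset are illustrative, and the pass threshold is whatever you preset:

```python
def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement beyond chance between human labels and judge labels
    on the same golden-pair set."""
    assert len(human) == len(judge)
    n = len(human)
    labels = set(human) | set(judge)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    expected = sum(
        (human.count(l) / n) * (judge.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

human = ["WIN", "LOSE", "WIN", "TIE", "WIN", "LOSE"]
judge = ["WIN", "LOSE", "WIN", "WIN", "WIN", "LOSE"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")  # ~0.70; gate on your threshold
```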
6. How do you keep eval costs bounded at scale?
Frame: tiered evals (smoke on PR, full nightly); prompt caching on shared prefixes; batch APIs for non-real-time runs; eval-as-a-service so infrastructure is shared across teams. Track eval spend as a line item.
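A sketch of tiered evals as data; sample sizes, triggers, and the batch-API discount are assumptions to be replaced with your own numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTier:
    name: str
    trigger: str         # when this tier runs
    sample_size: int     # cases drawn from the golden dataset
    use_batch_api: bool  # cheaper, non-real-time completion pricing

TIERS = [
    EvalTier("smoke",   trigger="pull_request", sample_size=50,   use_batch_api=False),
    EvalTier("full",    trigger="nightly",      sample_size=2000, use_batch_api=True),
    EvalTier("release", trigger="pre_deploy",   sample_size=5000, use_batch_api=True),
]

def estimate_cost(tier: EvalTier, cost_per_case: float,
                  batch_discount: float = 0.5) -> float:
    """Rough spend per run; track this as a budget line item."""
    rate = cost_per_case * (batch_discount if tier.use_batch_api else 1.0)
    return tier.sample_size * rate

for t in TIERS:
    print(t.name, f"${estimate_cost(t, cost_per_case=0.02):.2f}")
```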