Part 7 — Evaluation and Benchmarking
Failure Modes and Reliability Engineering
Sections in this chapter
- 1A taxonomy of agent failures
- 2Diagnosing from traces
- 3Recovery strategies
- 4Eval-driven debugging
- 5Chaos engineering for agents
- 6Blast radius minimisation
- 7Partially-applied state
- 8The on-call runbook
- 9Reliability metrics that matter
Key Takeaways
Insight
The debugging workflow for agents is closer to debugging distributed systems than to debugging traditional monoliths. You are reconstructing what happened across an asynchronous sequence of components
Common Trap
A common early-stage failure: the agent has no explicit "do nothing and escalate" action in its design, so it treats every task as producing-an-output-or-failing. Given a task it should refuse, it p
Interview Questions
1Describe the worst agent bug you've debugged. What observability let you find it?
▲
Frame: if you have one, walk through it with the nine-category taxonomy and the trace-walk-backward discipline. If you don't, construct a plausible one (context poisoning from a past run manifesting weeks later, diagnosed by walking memory reads back to the poisoning episode). Emphasise the role of trace c
2Design a chaos-testing harness for an agent. What failure modes do you inject?
▲
Frame: the list in 14.6. Bad/truncated/delayed/noisy/poisoned tool responses, injected instructions, partial success. Measure recovery rate per injection class. Run scheduled against the full eval dataset.
3Your agent timed out mid-task and left the system in a partially-modified state. How do you design for this?
▲
Frame: the four mitigations. All-or-nothing coordination where possible (durable workflow engines), explicit rollback plans, checkpointing, escalation-on-partial-state rather than blind continuation. Reversibility ordering in planning.
4What's the difference between an escalation and a failure?
▲
Frame: an escalation is a success of the harness — the agent recognised its limits and handed off with full context. A failure is the opposite — the agent produced wrong output confidently, or silently terminated. Escalation-on-uncertainty is a positive metric.
5Name five plausible causes of an agent's success rate dropping 10% week-over-week.
▲
Frame: (1) silent model provider update; (2) retrieval corpus drift (a re-index that introduced errors); (3) a tool dependency changed its output shape; (4) an upstream input distribution change (different user segment or new feature); (5) accumulated memory poisoning surfacing at retrieval time. Diagnose
6What's the difference between blast radius and reversibility?
▲
Frame: blast radius is how much damage a failure can cause (how many resources, users, records affected). Reversibility is how recoverable the damage is after the fact. Both are design variables. Prefer small blast radius and high reversibility; irreversible actions with large blast radius are the ones tha