Part 7 — Evaluation and Benchmarking

Failure Modes and Reliability Engineering

Sections in this chapter

1A taxonomy of agent failures
2Diagnosing from traces
3Recovery strategies
4Eval-driven debugging
5Chaos engineering for agents
6Blast radius minimisation
7Partially-applied state
8The on-call runbook
9Reliability metrics that matter

Key Takeaways

Insight

The debugging workflow for agents is closer to debugging distributed systems than to debugging traditional monoliths. You are reconstructing what happened across an asynchronous sequence of components

Common Trap

A common early-stage failure: the agent has no explicit "do nothing and escalate" action in its design, so it treats every task as producing-an-output-or-failing. Given a task it should refuse, it p

Interview Questions

Describe the worst agent bug you've debugged. What observability let you find it?

▲

Frame: if you have one, walk through it with the nine-category taxonomy and the trace-walk-backward discipline. If you don't, construct a plausible one (context poisoning from a past run manifesting weeks later, diagnosed by walking memory reads back to the poisoning episode). Emphasise the role of trace c

Design a chaos-testing harness for an agent. What failure modes do you inject?

▲

Frame: the list in 14.6. Bad/truncated/delayed/noisy/poisoned tool responses, injected instructions, partial success. Measure recovery rate per injection class. Run scheduled against the full eval dataset.

Your agent timed out mid-task and left the system in a partially-modified state. How do you design for this?

▲

Frame: the four mitigations. All-or-nothing coordination where possible (durable workflow engines), explicit rollback plans, checkpointing, escalation-on-partial-state rather than blind continuation. Reversibility ordering in planning.

What's the difference between an escalation and a failure?

▲

Frame: an escalation is a success of the harness — the agent recognised its limits and handed off with full context. A failure is the opposite — the agent produced wrong output confidently, or silently terminated. Escalation-on-uncertainty is a positive metric.

Name five plausible causes of an agent's success rate dropping 10% week-over-week.

▲

Frame: (1) silent model provider update; (2) retrieval corpus drift (a re-index that introduced errors); (3) a tool dependency changed its output shape; (4) an upstream input distribution change (different user segment or new feature); (5) accumulated memory poisoning surfacing at retrieval time. Diagnose

What's the difference between blast radius and reversibility?

▲

Frame: blast radius is how much damage a failure can cause (how many resources, users, records affected). Reversibility is how recoverable the damage is after the fact. Both are design variables. Prefer small blast radius and high reversibility; irreversible actions with large blast radius are the ones tha