AI Harness EngineeringChapter 15 of 19

Part 8Safety, Guardrails, and Governance

15

Agent Safety and Guardrails

Sections in this chapter

  1. 1What guardrails actually are
  2. 2Input, inline, and output guardrails
  3. 3Prompt injection: the central threat
  4. 4PII and data-handling
  5. 5Secret handling
  6. 6Policy engines
  7. 7Refusal, and how to do it well
  8. 8Content filtering
  9. 9Regional and regulatory compliance
  10. 10A worked example: guardrail stack for an enterprise support agent

Key Takeaways

Insight

The three positions exist because the three failure modes differ. Input guardrails catch the adversary at the door. Inline catch mid-session drift. Output catch the final mistake. A system with only o

Insight

A calibration exercise worth running: produce a set of 30 edge-case requests — some legitimately in scope, some out, some ambiguous. Evaluate an agent on all 30. Count over-refusals (should have answe

Common Trap

A specific anti-pattern worth naming: an agent with a broad exfiltration-capable tool ("send an email on the user's behalf") that also retrieves content from untrusted sources (web pages, user docum

Interview Questions

1

Design the guardrail stack for an enterprise support agent handling regulated data.

Frame: the three positions (input, inline, output) with 2–4 specific guardrails at each. A policy engine for declarative rules. Audit for every decision. Red-team as ongoing discipline. The worked example in 15.9 is close enough to many real scenarios.

2

Prompt injection is still a real threat. How do you defend against it in a system that retrieves from untrusted sources?

Frame: the six-layer defence — spotlighting, instruction hierarchy, injection classifier, output guardrails sensitive to exfiltration, scope restriction on sensitive tools, architectural isolation. State honestly that no single defence is sufficient and describe the combined defence.

3

Your output guardrail flagged a false positive and the customer complained. What's your response?

Frame: false positives are expected and acceptable up to a threshold; calibrate the guardrail on a representative set, measure precision/recall, tune the threshold where the false-negative cost dominates the false-positive cost. A false positive is a tuning question, not a "remove the guardrail" question

4

Design a refusal policy for an agent. What's the difference between over-refusing and under-refusing?

Frame: a clean refusal names the category, offers an alternative, doesn't lecture, doesn't over-apologise. Over-refuse (refusing things that were in scope) damages product value; under-refuse (answering things that should have been refused) causes incidents. Both measured; both tuned; both with budget in e

5

Policy engine vs. hardcoded rules — when do you need each?

Frame: hardcode the universal invariants (no secrets in output, no CSAM, no cross-tenant leakage). Use a policy engine for rules that vary by tenant, region, deployment, or regulatory regime. The engine lets compliance teams iterate without code changes; it costs complexity and latency, so it isn't the ans

6

PII redaction vs. PII rejection — when is each appropriate?

Frame: redaction for legitimate workflows where the PII was incidental (a user mentioned their email while asking a question); rejection when no legitimate flow should include the PII at all (a user sending a whole customer list for unrelated processing). Both log; neither drops the audit.