AI Harness Engineering, Chapter 3 of 19

Part 1: Foundations

03

Anatomy of a Harness — The Seven Layers

Sections in this chapter

  1. Why seven
  2. Layer 1: Instruction
  3. Layer 2: Tools
  4. Layer 3: Memory and retrieval
  5. Layer 4: Execution
  6. Layer 5: Policy and approval
  7. Layer 6: Observability
  8. Layer 7: Evaluation
  9. The dependency graph
  10. Which layer to investigate first
  11. Which layer to build first
  12. A worked example: applying the seven layers

Key Takeaways

Insight

Memorise the order. It is not alphabetical and not accidental. It mirrors the lifecycle of a single request: the agent reads instructions, discovers tools, retrieves memory, executes actions, is check

Common Trap

Investigating agent failures by staring at individual failed conversations is a classic junior-engineer mistake. One conversation is an anecdote; ten thousand slices are data. Agents fail in distribut

Interview Questions

1

Draw the seven layers of a production harness on a whiteboard and explain each one.

Frame: draw the agent loop in the centre, surround with the layers, use the dependency graph in 3.8, state one sentence per layer about what it owns and what breaks without it.

2

An agent is failing 40% of the time in production. Which layer do you investigate first and why?

Frame: observability first, because every other answer is uninformed. Slice the failure rate along the five axes (mode, task type, tool, model version, input characteristic) and let the slice point at the layer.
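
The slicing step in that frame can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: it assumes each run has already been logged as a dict with one field per axis plus a boolean `success`, and the field names (`mode`, `task_type`, `tool`, `model_version`, `input_characteristic`) and helper names are hypothetical.

```python
from collections import defaultdict

# The five slicing axes named in the frame. Field names are hypothetical;
# a real trace schema will use its own.
AXES = ("mode", "task_type", "tool", "model_version", "input_characteristic")


def failure_rates(runs, axis):
    """Return {axis value: (failure rate, run count)} for one axis."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        key = run[axis]
        totals[key] += 1
        if not run["success"]:
            failures[key] += 1
    return {key: (failures[key] / totals[key], totals[key]) for key in totals}


def worst_slices(runs, min_count=1):
    """For each axis, return (value, failure rate) of the worst slice.

    Slices with fewer than min_count runs are ignored so that a single
    unlucky run does not dominate the ranking.
    """
    worst = {}
    for axis in AXES:
        rates = {
            key: stats
            for key, stats in failure_rates(runs, axis).items()
            if stats[1] >= min_count
        }
        if rates:
            value = max(rates, key=lambda key: rates[key][0])
            worst[axis] = (value, rates[value][0])
    return worst
```

If one tool or one model version stands far above the overall failure rate, that slice is the pointer to the layer worth opening first. Raising `min_count` keeps tiny slices from producing false alarms.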

3

For a brand-new coding agent, which of the seven layers would you build first?

Frame: observability, then execution (sandbox), then tools, instruction, and memory together, then policy (dry-runs and approval gates on writes), then evaluation by month two. Justify each step with what it unblocks and why skipping it is unsafe.
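
The policy step in that sequence, dry-runs and approval gates on writes, can be illustrated with a small gate that sits between the agent and its tools. This is a sketch under assumptions: `Tool`, `ApprovalRequired`, `call_tool`, and the `approve` callback are hypothetical names, and a production gate would also log every decision to the observability layer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    writes: bool  # True if the tool mutates state


class ApprovalRequired(Exception):
    """Raised when a write is attempted without approval."""


def call_tool(
    tool: Tool,
    args: dict,
    *,
    dry_run: bool = True,
    approve: Optional[Callable[[Tool, dict], bool]] = None,
) -> Any:
    """Execute a tool call through a policy gate.

    Read-only tools run directly. Write tools are described instead of
    executed while dry_run is True; otherwise they run only if
    approve(tool, args) returns True.
    """
    if not tool.writes:
        return tool.run(**args)
    if dry_run:
        return {"dry_run": True, "tool": tool.name, "args": args}
    if approve is None or not approve(tool, args):
        raise ApprovalRequired(f"write via {tool.name!r} was not approved")
    return tool.run(**args)
```

Defaulting `dry_run` to `True` means a forgotten flag fails safe: the agent sees what the write would have done, and nothing changes until a caller opts in and an approver agrees.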

4

Which layer is most often neglected in early deployments?

Frame: evaluation. Teams focus on shipping behaviour, treat evals as QA, and regret it on the first silent regression. Observability is a close second.

5

If a single layer had to be outsourced to a vendor, which one and why?

Frame: execution (sandboxing) is the layer most commonly and most safely bought. Building fast-start Firecracker microVM infrastructure in-house is expensive, and the managed services (E2B, Modal, Daytona) are well differentiated. Observability is the second most often bought (LangSmith, Langfuse, Braintrust, Arize).

6

Describe one failure mode per layer.

Frame: instruction (drift across model upgrades); tools (malformed destructive calls); memory (memory poisoning); execution (sandbox escape); policy (indirect prompt injection bypass); observability (PII in traces); evaluation (judge-model miscalibration).