AI Harness Engineering

Chapter 19 of 19

Part 10: Multi-Agent Systems and Orchestration

19

Multi-Agent Design and Durable Orchestration

Sections in this chapter

  1. The single-agent default
  2. Agent composition patterns
  3. Handoff protocols
  4. Shared state versus message passing
  5. Durable orchestration
  6. Coordination failure modes
  7. Observability across agents
  8. A worked example: multi-agent incident triage

Key Takeaways

Insight

A useful test: can you name three specific engineering costs a multi-agent system adds? State management across agents, failure semantics across agents, coordination protocol design. If any of these c

Interview Questions

1

Supervisor, pipeline, peer-handoff — when does each fit?

Frame: supervisor when subtasks are independent; pipeline when stages differ and flow is linear; peer-handoff when control must pass based on runtime evaluation. Each has a failure mode: supervisor bottleneck, pipeline brittleness, peer-handoff ping-pong.

2

When do you move from single-agent to multi-agent?

Frame: the five cases (long-horizon, distinct subdomains, parallel exploration, specialised tool surfaces, security isolation). Default is single; multi-agent is a justified deviation. Name three specific coordination costs any multi-agent system pays.

3

Design a durable workflow for a coding agent that might take days to complete.

Frame: Temporal or equivalent workflow engine at the root. Agent runs are activities; tool calls are sub-activities; human approvals are signals. State persisted at every step; retries configured; timeouts mapped to escalation paths. Observability tied to workflow ID.
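The checkpoint-and-resume idea behind this frame can be sketched with the Python standard library alone. This is not Temporal or any real workflow engine: `DurableWorkflow`, the step names, and the JSON checkpoint file are hypothetical names chosen for illustration, and signals for human approval and timeouts are not modeled.

```python
import json
import os
import tempfile
from typing import Callable

# Hypothetical step type: in a real system each step would be an agent run
# or tool call executed as a workflow-engine activity.
Step = Callable[[dict], dict]


class DurableWorkflow:
    """Runs named steps in order, checkpointing state to disk after each one."""

    def __init__(self, workflow_id: str, state_dir: str, max_retries: int = 3):
        self.workflow_id = workflow_id
        self.path = os.path.join(state_dir, f"{workflow_id}.json")
        self.max_retries = max_retries

    def _load(self) -> dict:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed": [], "data": {}}

    def _save(self, state: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def run(self, steps: list) -> dict:
        state = self._load()
        for name, fn in steps:
            if name in state["completed"]:
                continue  # finished in an earlier run; skip on resume
            for attempt in range(1, self.max_retries + 1):
                try:
                    state["data"] = fn(dict(state["data"]))
                    break
                except Exception:
                    if attempt == self.max_retries:
                        raise  # out of retries: escalate to the caller
            state["completed"].append(name)
            self._save(state)  # persist after every step
        return state["data"]


calls = []

def plan(data: dict) -> dict:
    calls.append("plan")
    return {**data, "plan": "fix bug"}

def flaky_edit(data: dict) -> dict:
    calls.append("edit")
    if calls.count("edit") < 2:
        raise RuntimeError("transient failure")  # first attempt fails
    return {**data, "edited": True}

state_dir = tempfile.mkdtemp()
steps = [("plan", plan), ("edit", flaky_edit)]
result = DurableWorkflow("wf-1", state_dir).run(steps)
# A fresh instance with the same workflow ID resumes from the checkpoint
# and re-executes nothing.
resumed = DurableWorkflow("wf-1", state_dir).run(steps)
```

The workflow ID doubles as the checkpoint key here, which is also what makes it a natural anchor for observability. A production engine adds what this sketch leaves out: durable timers, signals, and replay-safe determinism.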

4

Your multi-agent system is in a ping-pong loop. Diagnose and fix.

Frame: check handoff depth (should have a cap); examine the handoff payloads for missing authority or ambiguous scope; add loop detection at the orchestration layer; reconsider whether the decomposition is correct — two agents ping-ponging is sometimes one agent trying to emerge.
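A handoff depth cap and simple loop detection at the orchestration layer might look like the following sketch. `HandoffGuard`, `HandoffRejected`, the agent names, and the default limits are assumptions made for illustration, not values taken from the chapter.

```python
from collections import Counter
from dataclasses import dataclass, field


class HandoffRejected(Exception):
    """Raised when the orchestrator refuses to perform a handoff."""


@dataclass
class HandoffGuard:
    max_depth: int = 8      # assumed cap on total handoffs per task
    max_repeats: int = 2    # how often the same (src, dst) edge may recur
    chain: list = field(default_factory=list)

    def record(self, src: str, dst: str) -> None:
        # Depth cap: bound the total number of handoffs in one task.
        if len(self.chain) >= self.max_depth:
            raise HandoffRejected(f"handoff depth cap {self.max_depth} reached")
        # Loop detection: the same directed edge recurring too often
        # is the signature of a ping-pong.
        if Counter(self.chain)[(src, dst)] >= self.max_repeats:
            raise HandoffRejected(f"loop detected on {src} -> {dst}")
        self.chain.append((src, dst))


guard = HandoffGuard(max_depth=8, max_repeats=2)
outcome = "ok"
try:
    for _ in range(5):
        guard.record("triage", "billing")
        guard.record("billing", "triage")
except HandoffRejected as exc:
    outcome = str(exc)
```

The guard only stops the loop. The recorded `chain` is what to inspect next, alongside the handoff payloads, to decide whether scope or authority was ambiguous or whether the two agents should be merged.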

5

Shared state vs. message passing between agents — when do you use each?

Frame: shared state for long-horizon task context with clear ownership; message passing for handoffs and day-to-day coordination. Most systems blend: small persistent state for goals/context, messages for control flow.
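One way to picture the blend is a small, single-owner context object alongside a queue of immutable handoff messages. The sketch below is illustrative only: `TaskContext`, `Handoff`, the agent names, and the in-memory `deque` are hypothetical stand-ins for whatever store and transport a real system uses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    """Small persistent shared state: one owner writes, every agent may read."""
    goal: str
    owner: str
    notes: dict = field(default_factory=dict)

    def write(self, agent: str, key: str, value: str) -> None:
        if agent != self.owner:
            raise PermissionError(f"{agent} does not own this context")
        self.notes[key] = value


@dataclass(frozen=True)
class Handoff:
    """Message used for control flow; immutable once sent."""
    sender: str
    recipient: str
    instruction: str


context = TaskContext(goal="resolve the open ticket", owner="coordinator")
inbox: deque = deque()

# The owner records long-lived context in shared state...
context.write("coordinator", "customer_tier", "enterprise")
# ...and passes control with a message.
inbox.append(Handoff("coordinator", "billing", "check last invoice"))

msg = inbox.popleft()
try:
    # The recipient may read the context but is not its owner.
    context.write(msg.recipient, "status", "done")
    write_allowed = True
except PermissionError:
    write_allowed = False
```

Enforcing ownership in the write path keeps the shared state small and predictable; anything a non-owner needs to change goes back as a message.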

6

How do you instrument a multi-agent system for observability?

Frame: correlation ID across all agents; trace tree rooted at the workflow/coordinator; nested spans per agent, per tool call, per guardrail evaluation. Without correlation, debugging is infeasible; with it, the same tracing discipline used for a single agent carries over.
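The trace-tree idea can be sketched without a tracing library. The example below uses `contextvars` to propagate a correlation ID and parent span to nested spans. The `span` helper, the in-memory `SPANS` list, and the span names are assumptions for illustration; a real deployment would more likely use an established tracing SDK and exporter.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

_current_span = contextvars.ContextVar("current_span", default=None)
SPANS = []  # in-memory sink; a real system would export these records


@contextmanager
def span(name, correlation_id=None):
    """Open a span nested under the current one, inheriting its correlation ID."""
    parent = _current_span.get()
    if correlation_id is None:
        # Inherit from the parent, or mint a new ID at the root.
        correlation_id = parent["correlation_id"] if parent else uuid.uuid4().hex
    record = {
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent["span_id"] if parent else None,
        "correlation_id": correlation_id,
        "name": name,
    }
    started = time.monotonic()
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - started
        _current_span.reset(token)  # restore the parent as current
        SPANS.append(record)


# Root span at the workflow, nested spans per agent, tool call, and guardrail.
with span("workflow:triage", correlation_id="wf-123"):
    with span("agent:diagnoser"):
        with span("tool:query_logs"):
            pass
        with span("guardrail:pii_check"):
            pass

by_name = {s["name"]: s for s in SPANS}
```

Every record carries the same `correlation_id`, and `parent_id` links reconstruct the tree, so a single query by workflow ID returns the whole multi-agent run.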