Part 10 — Multi-Agent Systems and Orchestration
Multi-Agent Design and Durable Orchestration
Sections in this chapter
- 1. The single-agent default
- 2. Agent composition patterns
- 3. Handoff protocols
- 4. Shared state versus message passing
- 5. Durable orchestration
- 6. Coordination failure modes
- 7. Observability across agents
- 8. A worked example: multi-agent incident triage
Key Takeaways
Insight
A useful test: can you name three specific engineering costs a multi-agent system adds? State management across agents, failure semantics across agents, coordination protocol design. If any of these c
Interview Questions
1. Supervisor, pipeline, peer-handoff: when does each fit?
Frame: supervisor when subtasks are independent; pipeline when stages differ and flow is linear; peer-handoff when control must pass based on runtime evaluation. Each has a failure mode: supervisor bottleneck, pipeline brittleness, peer-handoff ping-pong.
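The three shapes can be sketched in a few lines each. This is a minimal illustration, assuming agents are plain callables over strings; the names `supervisor`, `pipeline`, `peer_handoff`, and `max_hops` are hypothetical and not taken from any framework.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# Assumption for illustration: an agent is a function from task text to result text.
Agent = Callable[[str], str]
# A peer returns (updated_task, next_agent_name), or None as the name when finished.
Peer = Callable[[str], Tuple[str, Optional[str]]]


def supervisor(task: str, split: Callable[[str], List[str]],
               worker: Agent, merge: Callable[[List[str]], str]) -> str:
    """Fan independent subtasks out to workers, then merge the results."""
    subtasks = split(task)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, subtasks))  # map preserves input order
    return merge(results)


def pipeline(task: str, stages: List[Agent]) -> str:
    """Run distinct stages in a fixed linear order."""
    for stage in stages:
        task = stage(task)
    return task


def peer_handoff(task: str, peers: Dict[str, Peer], start: str,
                 max_hops: int = 5) -> str:
    """Let each peer decide at runtime who acts next, up to max_hops turns."""
    current = start
    for _ in range(max_hops):
        task, next_agent = peers[current](task)
        if next_agent is None:
            return task
        current = next_agent
    raise RuntimeError("handoff cap reached without completion")
```

The failure modes show up in the structure: the supervisor's `merge` step is the single point every result funnels through, the pipeline has no way to skip or reorder a stage, and the peer loop needs a cap because nothing else stops two peers from handing back and forth.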
2. When do you move from single-agent to multi-agent?
Frame: the five cases (long-horizon, distinct subdomains, parallel exploration, specialised tool surfaces, security isolation). Default is single; multi-agent is a justified deviation. Name three specific coordination costs any multi-agent system pays.
3. Design a durable workflow for a coding agent that might take days to complete.
Frame: Temporal or equivalent workflow engine at the root. Agent runs are activities; tool calls are sub-activities; human approvals are signals. State persisted at every step; retries configured; timeouts mapped to escalation paths. Observability tied to workflow ID.
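To make "state persisted at every step" concrete, here is a toy checkpointing runner. It is a standard-library stand-in for a workflow engine, not Temporal's API: `DurableRun`, its JSON file layout, and `max_attempts` are assumptions made for illustration, and human-approval signals are not modelled.

```python
import json
import os
from typing import Callable, Dict, List, Tuple

# Assumption: a step takes the current context and returns JSON-serialisable updates.
Step = Tuple[str, Callable[[dict], dict]]


class DurableRun:
    """Hypothetical runner that checkpoints after every completed step."""

    def __init__(self, workflow_id: str, store_dir: str, max_attempts: int = 3):
        self.path = os.path.join(store_dir, f"{workflow_id}.json")
        self.max_attempts = max_attempts
        self.state: Dict[str, object] = self._load()

    def _load(self) -> dict:
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed": [], "context": {}}

    def _save(self) -> None:
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    def run(self, steps: List[Step]) -> dict:
        for name, fn in steps:
            if name in self.state["completed"]:
                continue  # finished in an earlier process; do not redo the work
            last_error = None
            for _ in range(self.max_attempts):
                try:
                    result = fn(dict(self.state["context"]))
                    break
                except Exception as exc:  # retry any step failure
                    last_error = exc
            else:
                # Retries exhausted: surface the failure so it can be escalated.
                raise RuntimeError(f"step {name!r} failed; escalate") from last_error
            self.state["context"].update(result)
            self.state["completed"].append(name)
            self._save()
        return self.state["context"]
```

Constructing a second `DurableRun` with the same workflow ID and directory resumes from the last checkpoint, which is the property a multi-day run depends on. A real engine adds what this sketch leaves out: timers, signals, and step-level timeouts.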
4. Your multi-agent system is in a ping-pong loop. Diagnose and fix.
Frame: check handoff depth (should have a cap); examine the handoff payloads for missing authority or ambiguous scope; add loop detection at the orchestration layer; reconsider whether the decomposition is correct — two agents ping-ponging is sometimes one agent trying to emerge.
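The depth cap and loop detection described above might look like the following at the orchestration layer. This is a sketch under assumptions: `Handoff`, `run_with_loop_guard`, `max_depth`, and `max_visits` are hypothetical names, and an agent is assumed to be a function that returns the next target and payload.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


class HandoffLoopError(RuntimeError):
    """Raised when the handoff chain exceeds its depth or revisit budget."""


@dataclass
class Handoff:
    target: Optional[str]  # name of the next agent, or None when the task is done
    payload: dict


AgentFn = Callable[[dict], Handoff]


def run_with_loop_guard(agents: Dict[str, AgentFn], start: str, payload: dict,
                        max_depth: int = 8,
                        max_visits: int = 2) -> Tuple[dict, List[str]]:
    """Follow handoffs, failing fast on deep chains or repeated visits."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        path.append(current)
        if len(path) > max_depth:
            raise HandoffLoopError(f"handoff depth exceeded: {' -> '.join(path)}")
        if path.count(current) > max_visits:
            raise HandoffLoopError(f"agent {current!r} revisited: {' -> '.join(path)}")
        handoff = agents[current](payload)
        current, payload = handoff.target, handoff.payload
    return payload, path
```

Including the full path in the error message makes the diagnosis step easier: the payloads at the repeated hop are where missing authority or ambiguous scope usually shows up.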
5. Shared state vs. message passing between agents: when do you use each?
Frame: shared state for long-horizon task context with clear ownership; message passing for handoffs and day-to-day coordination. Most systems blend: small persistent state for goals/context, messages for control flow.
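A blended design can be sketched as a small owned key-value context plus per-agent mailboxes. The class names `SharedContext`, `Message`, and `Mailboxes`, and the rule that each key has exactly one writer, are assumptions chosen to illustrate "clear ownership", not a prescribed API.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, Optional


class SharedContext:
    """Small persistent state: anyone may read, only the owner of a key may write."""

    def __init__(self, owners: Dict[str, str]) -> None:
        self._owners = owners            # key -> name of the owning agent
        self._data: Dict[str, Any] = {}

    def write(self, agent: str, key: str, value: Any) -> None:
        if self._owners.get(key) != agent:
            raise PermissionError(f"{agent} does not own {key!r}")
        self._data[key] = value

    def read(self, key: str) -> Any:
        return self._data.get(key)


@dataclass(frozen=True)
class Message:
    sender: str
    recipient: str
    kind: str    # e.g. "handoff" or "status"
    body: dict


class Mailboxes:
    """Control flow travels as messages, one FIFO queue per recipient."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Message]] = {}

    def send(self, message: Message) -> None:
        self._queues.setdefault(message.recipient, deque()).append(message)

    def receive(self, agent: str) -> Optional[Message]:
        queue = self._queues.get(agent)
        return queue.popleft() if queue else None
```

In this sketch the goal lives in shared state because every agent needs to read it for the life of the task, while the handoff itself is a message because it is consumed once by one recipient.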
6. How do you instrument a multi-agent system for observability?
Frame: correlation ID across all agents; trace tree rooted at the workflow/coordinator; nested spans per agent, per tool call, per guardrail evaluation. Without correlation, debugging is infeasible; with it, the same debugging discipline used for a single agent applies.
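The correlation ID and nested spans can be illustrated with the standard library alone. This is a hand-rolled sketch, not a tracing library's API: `span`, `SPANS`, and the record fields are hypothetical, and a production system would export spans to a tracing backend instead of a list.

```python
import contextvars
import uuid
from contextlib import contextmanager
from typing import Iterator, List, Optional

# The currently open span for this execution context (None at the root).
_current_span: contextvars.ContextVar = contextvars.ContextVar(
    "current_span", default=None
)

# In-memory sink for illustration only.
SPANS: List[dict] = []


@contextmanager
def span(name: str, correlation_id: Optional[str] = None) -> Iterator[dict]:
    """Open a span nested under the current one, inheriting its correlation ID."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        # Children always inherit; only a root span may set or mint the ID.
        "correlation_id": (parent["correlation_id"] if parent
                           else correlation_id or uuid.uuid4().hex),
        "name": name,
    }
    SPANS.append(record)
    token = _current_span.set(record)
    try:
        yield record
    finally:
        _current_span.reset(token)
```

A usage pattern that matches the frame above: open one root span per workflow with the workflow ID as the correlation ID, then wrap each agent run, tool call, and guardrail evaluation in its own `with span(...)` block so the parent links form the trace tree.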