Part 10 — Multi-Agent Systems and Orchestration

Multi-Agent Design and Durable Orchestration

Sections in this chapter

1. The single-agent default
2. Agent composition patterns
3. Handoff protocols
4. Shared state versus message passing
5. Durable orchestration
6. Coordination failure modes
7. Observability across agents
8. A worked example: multi-agent incident triage

Key Takeaways

Insight

A useful test: can you name three specific engineering costs a multi-agent system adds? State management across agents, failure semantics across agents, coordination protocol design. If any of these c

Interview Questions

Supervisor, pipeline, peer-handoff — when does each fit?

Frame: supervisor when subtasks are independent; pipeline when stages differ and flow is linear; peer-handoff when control must pass based on runtime evaluation. Each has a failure mode: supervisor bottleneck, pipeline brittleness, peer-handoff ping-pong.
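
The three shapes can be contrasted in a few lines. This is a minimal sketch, assuming each agent is just a Python callable; the function names (`run_supervisor`, `run_pipeline`, `run_peer_handoff`) and the `max_hops` cap are illustrative, not taken from any framework.

```python
def run_supervisor(subtasks, worker):
    """Supervisor: independent subtasks are fanned out and results gathered."""
    return [worker(subtask) for subtask in subtasks]


def run_pipeline(task, stages):
    """Pipeline: each stage consumes the previous stage's output, in order."""
    for stage in stages:
        task = stage(task)
    return task


def run_peer_handoff(task, agents, start, max_hops=5):
    """Peer handoff: each agent returns (result, next_agent_name or None).

    The next agent is chosen at runtime; max_hops (a hypothetical cap)
    stops a ping-pong loop from running forever.
    """
    name = start
    for _ in range(max_hops):
        task, name = agents[name](task)
        if name is None:          # the current agent declared the task done
            return task
    raise RuntimeError(f"handoff cap of {max_hops} reached")
```

In the supervisor shape everything funnels through one caller, which is where the bottleneck comes from; in the pipeline one failing stage stops everything downstream; in the peer shape nothing but the cap prevents two agents from bouncing a task back and forth.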

When do you move from single-agent to multi-agent?

Frame: the five cases (long-horizon, distinct subdomains, parallel exploration, specialised tool surfaces, security isolation). Default is single; multi-agent is a justified deviation. Name three specific coordination costs any multi-agent system pays.

Design a durable workflow for a coding agent that might take days to complete.

Frame: Temporal or an equivalent workflow engine at the root. Each agent run is an activity (or a child workflow when it needs durable steps of its own); tool calls are the finer-grained activities beneath it; human approvals are signals. State is persisted at every step, retries are configured per step, and timeouts map to escalation paths. Observability is tied to the workflow ID.
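
The core idea, persist every step's result so a restart resumes instead of redoing work, can be sketched with the standard library alone. This is not Temporal's API: `DurableWorkflow`, its JSON-file store, and the `max_attempts` parameter are assumptions made for illustration, and signals, timeouts, and backoff are left out.

```python
import json
import tempfile
from pathlib import Path


class DurableWorkflow:
    """Runs named steps, checkpointing each result under the workflow ID."""

    def __init__(self, workflow_id, store_dir, max_attempts=3):
        self.path = Path(store_dir) / f"{workflow_id}.json"
        self.max_attempts = max_attempts
        # Load earlier progress if this workflow ID has run before.
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def step(self, name, fn):
        if name in self.state:            # completed in an earlier run: skip
            return self.state[name]
        last_error = None
        for _ in range(self.max_attempts):
            try:
                result = fn()
                break
            except Exception as exc:      # a real engine would back off here
                last_error = exc
        else:
            raise RuntimeError(
                f"step {name!r} failed after {self.max_attempts} attempts"
            ) from last_error
        self.state[name] = result         # result must be JSON-serialisable
        self.path.write_text(json.dumps(self.state))
        return result


# Usage: the second instance simulates a process restart.
store = Path(tempfile.mkdtemp())          # stand-in for a real durable store
calls = []


def plan():
    calls.append("plan")
    return {"files": ["a.py"]}


first = DurableWorkflow("wf-1", store).step("plan", plan)    # runs plan()
second = DurableWorkflow("wf-1", store).step("plan", plan)   # served from checkpoint
```

A production engine adds what this sketch omits: durable timers, signal delivery for human approvals, and retry policies with backoff.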

Your multi-agent system is in a ping-pong loop. Diagnose and fix.

Frame: check handoff depth (should have a cap); examine the handoff payloads for missing authority or ambiguous scope; add loop detection at the orchestration layer; reconsider whether the decomposition is correct — two agents ping-ponging is sometimes one agent trying to emerge.
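
The depth cap and loop detection could live in a small guard at the orchestration layer. A minimal sketch, assuming each handoff is described by a sender, a recipient, and a scope string; `HandoffGuard`, `HandoffLoopError`, and the default cap of 5 are hypothetical names and values.

```python
from dataclasses import dataclass, field


class HandoffLoopError(RuntimeError):
    """Raised when a handoff chain exceeds its cap or repeats itself."""


@dataclass
class HandoffGuard:
    max_depth: int = 5                        # hypothetical default cap
    depth: int = 0
    seen: set = field(default_factory=set)    # (sender, recipient, scope) triples

    def record(self, sender: str, recipient: str, scope: str) -> None:
        """Register one handoff; raise if it looks like a loop."""
        self.depth += 1
        if self.depth > self.max_depth:
            raise HandoffLoopError(
                f"handoff depth {self.depth} exceeds cap {self.max_depth}"
            )
        key = (sender, recipient, scope)
        if key in self.seen:                  # same handoff for the same scope
            raise HandoffLoopError(
                f"repeated handoff {sender} -> {recipient} for scope {scope!r}"
            )
        self.seen.add(key)
```

Calling `record("triage", "billing", "refund")`, then `record("billing", "triage", "refund")`, then the first handoff again raises `HandoffLoopError` on the third call, well before the depth cap is hit. The guard only stops the loop; the payload and decomposition checks above are still what fix it.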

Shared state vs. message passing between agents — when do you use each?

Frame: shared state for long-horizon task context with clear ownership; message passing for handoffs and day-to-day coordination. Most systems blend: small persistent state for goals/context, messages for control flow.
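
The blend might look like the following sketch: a small shared context where every key has one owning agent, plus a queue of immutable handoff messages for control flow. `TaskContext`, `Handoff`, and the agent names are illustrative assumptions, and the in-process `queue.Queue` stands in for whatever transport a real system uses.

```python
import queue
from dataclasses import dataclass


class TaskContext:
    """Small shared state; the first writer of a key becomes its owner."""

    def __init__(self):
        self._values = {}
        self._owners = {}

    def write(self, agent, key, value):
        owner = self._owners.setdefault(key, agent)
        if owner != agent:                # clear ownership: one writer per key
            raise PermissionError(f"{key!r} is owned by {owner}, not {agent}")
        self._values[key] = value

    def read(self, key):
        return self._values[key]


@dataclass(frozen=True)
class Handoff:
    """A control-flow message; frozen so it cannot change in transit."""
    sender: str
    recipient: str
    action: str


# Usage: the planner owns the goal; the executor is told what to do by message.
ctx = TaskContext()
inbox = queue.Queue()

ctx.write("planner", "goal", "restore checkout latency")
inbox.put(Handoff("planner", "executor", "roll back last deploy"))

msg = inbox.get_nowait()
goal_seen_by_executor = ctx.read("goal")  # reads are open; writes are owned
```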

How do you instrument a multi-agent system for observability?

Frame: correlation ID across all agents; trace tree rooted at the workflow/coordinator; nested spans per agent, per tool call, per guardrail evaluation. Without a shared correlation ID, debugging across agents is close to infeasible; with one, the same tracing discipline used for a single agent applies.