Part 9 — Observability, Runtime, and Cost
Runtime Engineering and Cost Control
Sections in this chapter
- 1. Runtime engineering as a discipline
- 2. Model routing
- 3. Prompt caching
- 4. Batch APIs
- 5. Token budgeting
- 6. Concurrency control
- 7. Latency optimisation
- 8. Caching beyond prompts
- 9. Fallback and degraded modes
- 10. Operational runbook
- 11. The cost model
Key Takeaways
Insight
The 2024–2026-era feature that changed agent economics more than any single model upgrade is prompt caching: a well-cached agent is 3–5x cheaper than an uncached equivalent on the same prompt.
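A back-of-envelope check of the 3–5x claim. The sketch below assumes a provider that charges cached input tokens at 10% of the uncached rate (a common pattern, though multipliers vary by provider); the rates and token counts are illustrative, not real price sheets.

```python
# Hypothetical pricing: cached input tokens at 10% of the uncached rate.
UNCACHED_RATE = 3.00 / 1_000_000        # $ per input token (illustrative)
CACHED_RATE = 0.10 * UNCACHED_RATE

def prompt_cost(total_tokens: int, hit_ratio: float) -> float:
    """Cost of one prompt given the fraction of tokens served from cache."""
    cached = total_tokens * hit_ratio
    uncached = total_tokens - cached
    return cached * CACHED_RATE + uncached * UNCACHED_RATE

cold = prompt_cost(50_000, hit_ratio=0.0)     # nothing cached
warm = prompt_cost(50_000, hit_ratio=0.85)    # a stable, well-structured prefix
print(f"savings factor: {cold / warm:.1f}x")  # → 4.3x, inside the 3-5x band
```

At an 85% hit ratio the savings factor lands at roughly 4.3x, which is why the hit ratio itself becomes a first-class metric later in this chapter.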
Interview Questions
1. Your agent costs $0.10. Where do you look first?
Frame: the cost decomposition. How much goes to model tokens? Cached vs. uncached? How many tool calls, and how large are the tool results? How much of the context is retrieved, and how large is it? Then apply the six levers in order of expected impact: prompt caching, model routing, tool-result summarisation, retrieval.
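The decomposition can be sketched as a per-request breakdown that makes the biggest lever obvious at a glance. The field names and rates below are hypothetical; map them to whatever your provider's usage object actually reports.

```python
from dataclasses import dataclass

RATES = {                      # $ per token, illustrative only
    "input_uncached": 3.0e-6,
    "input_cached": 0.3e-6,
    "output": 15.0e-6,
}

@dataclass
class RequestUsage:
    cached_input: int
    uncached_input: int
    output: int
    tool_result_tokens: int    # subset of uncached input that came from tools
    retrieved_tokens: int      # subset of uncached input that came from retrieval

def decompose(u: RequestUsage) -> dict[str, float]:
    """Return a $-by-bucket breakdown of a single request."""
    return {
        "cached_input": u.cached_input * RATES["input_cached"],
        "uncached_input": u.uncached_input * RATES["input_uncached"],
        "output": u.output * RATES["output"],
        # Attribution within uncached input, to target summarisation/retrieval:
        "of_which_tool_results": u.tool_result_tokens * RATES["input_uncached"],
        "of_which_retrieval": u.retrieved_tokens * RATES["input_uncached"],
    }

usage = RequestUsage(cached_input=30_000, uncached_input=12_000, output=2_000,
                     tool_result_tokens=8_000, retrieved_tokens=3_000)
breakdown = decompose(usage)
total = (breakdown["cached_input"] + breakdown["uncached_input"]
         + breakdown["output"])
```

In this synthetic example the uncached input dominates, and two thirds of it is tool results — pointing at tool-result summarisation before anything else.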
2. Design the runtime stack for a pipeline-embedded agent that must handle 500 concurrent invocations.
Frame: per-tenant concurrency caps, per-provider concurrency caps, sandbox pool sizing, rate-limit-aware model routing, graceful degradation on provider outage, batch APIs for any non-interactive work (reports, summaries, evals).
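The layered caps can be sketched with nested semaphores: a per-tenant gate inside a per-provider gate, so a saturated tenant queues at its own limit instead of holding provider slots. The limits and names here are invented for illustration.

```python
import asyncio
from collections import defaultdict

PROVIDER_LIMIT = 200   # concurrent requests the provider tolerates (assumed)
TENANT_LIMIT = 20      # fair-share cap per tenant (assumed)

provider_sem = asyncio.Semaphore(PROVIDER_LIMIT)
tenant_sems: dict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(TENANT_LIMIT))

async def invoke(tenant: str, call):
    # Tenant cap first: a noisy tenant blocks on its own gate and never
    # consumes more than TENANT_LIMIT of the shared provider budget.
    async with tenant_sems[tenant]:
        async with provider_sem:
            return await call()

async def main():
    async def fake_model_call():        # stand-in for a real provider call
        await asyncio.sleep(0.01)
        return "ok"
    return await asyncio.gather(
        *(invoke(f"tenant-{i % 3}", fake_model_call) for i in range(30)))

results = asyncio.run(main())
```

The same shape extends to sandbox pool slots: add a third semaphore for the pool, acquired innermost, so a sandbox is only claimed once a model slot is guaranteed.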
3. The primary model provider is down. What happens?
Frame: fallback routing to a secondary provider, but only for Skills whose eval on the fallback model exceeds a threshold. For others, surface a degraded state to users. Communicate clearly; don't silently substitute. Post-incident, decide whether deeper multi-provider coverage is justified.
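The eval-gated failover reduces to a small routing function. The Skill names, scores, and the 0.90 threshold below are hypothetical; the point is that the routing decision consults recorded eval results rather than assuming interchangeability.

```python
# Per-Skill eval scores measured offline against the fallback model
# (hypothetical values).
FALLBACK_EVAL_SCORES: dict[str, float] = {
    "summarise_ticket": 0.94,
    "generate_sql": 0.71,
}
THRESHOLD = 0.90   # minimum acceptable eval score on the fallback (assumed)

def route(skill: str, primary_up: bool) -> str:
    """Pick a target: primary, eval-approved fallback, or explicit degraded state."""
    if primary_up:
        return "primary"
    if FALLBACK_EVAL_SCORES.get(skill, 0.0) >= THRESHOLD:
        return "fallback"
    return "degraded"   # surfaced to the user; never silently substituted
```

During the outage, `summarise_ticket` fails over while `generate_sql` reports degraded — exactly the "communicate, don't substitute" behaviour the answer calls for.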
4. Explain prompt caching and why its hit ratio is a first-class metric.
Frame: providers charge a fraction of the normal input rate for tokens served from cache. Structure prompts with the stable prefix at the top. Hit ratio is both a cost metric and a prompt-structure quality metric: a drop indicates prefix rotation or dynamic content leaking into the prefix, and is diagnosable from observability.
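A minimal version of the metric itself, assuming the provider's usage object distinguishes cache-read tokens from fresh input tokens (the field names here are invented; map them to whatever your provider returns).

```python
def cache_hit_ratio(cache_read_tokens: int, fresh_input_tokens: int) -> float:
    """Fraction of input tokens served from the provider's prompt cache."""
    total = cache_read_tokens + fresh_input_tokens
    return cache_read_tokens / total if total else 0.0

# A shift like this across deploys is the diagnostic signal: inspect the
# prompt prefix for dynamic content (timestamps, request IDs) leaking in.
before_deploy = cache_hit_ratio(45_000, 5_000)   # 0.9
after_deploy = cache_hit_ratio(10_000, 40_000)   # 0.2
```

Tracked per Skill and per deploy, the ratio doubles as a regression test on prompt structure: any change that reorders or parameterises the prefix shows up immediately.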
5. What's your latency target and how do you hit it?
Frame: pick a defensible target for the product (e.g. time to first token under 1 s; time to useful response under 10 s for most Skills). Levers: streaming, parallelism for independent tool calls, a smaller model in the critical path for the acknowledgement, prefetching of likely context. Measure p50 and p95; optimise p95 (that's the user-facing tail).
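The p50/p95 measurement the answer relies on can be sketched with a simple nearest-rank percentile; the latencies below are synthetic.

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a sample (no interpolation)."""
    s = sorted(values)
    k = min(len(s) - 1, round(p * (len(s) - 1)))
    return s[k]

# Synthetic per-request latencies: a healthy body with one slow outlier.
latencies_ms = [420, 460, 470, 480, 490, 500, 510, 530, 950, 3100]
p50 = percentile(latencies_ms, 0.50)   # the typical request
p95 = percentile(latencies_ms, 0.95)   # the tail worth optimising
```

Here p50 looks healthy (490 ms) while p95 is 3100 ms — the mean would hide the outlier, which is exactly why the tail, not the average, drives the optimisation work.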
6. When is a batch API appropriate and when not?
Frame: batch for non-real-time, non-sequential workloads: evals, bulk enrichment, scheduled reports, training data generation. Not for user-facing interactive requests or agent-loop sequential calls. Savings are roughly 50%; the latency cost is up to 24 hours.
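The decision reduces to a small predicate over job metadata. The attributes below are hypothetical; the rule encodes the three conditions from the answer: nobody waiting, nothing downstream blocking, and the deadline absorbing the up-to-24-hour batch latency.

```python
from dataclasses import dataclass

@dataclass
class Job:
    interactive: bool      # a user is waiting on the result
    sequential: bool       # the output feeds the next step of an agent loop
    deadline_hours: float  # how stale the result is allowed to be

BATCH_LATENCY_HOURS = 24   # worst-case batch turnaround (assumed)

def use_batch_api(job: Job) -> bool:
    """Route to the batch API only when its latency cost is invisible."""
    return (not job.interactive
            and not job.sequential
            and job.deadline_hours >= BATCH_LATENCY_HOURS)

nightly_eval = Job(interactive=False, sequential=False, deadline_hours=24)
chat_turn = Job(interactive=True, sequential=True, deadline_hours=0.01)
```

Nightly eval runs qualify; a chat turn never does, regardless of how generous its deadline looks on paper.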