Part 9 — Observability, Runtime, and Cost
Runtime Engineering and Cost Control
Sections in this chapter
- 1. Runtime engineering as a discipline
- 2. Model routing
- 3. Prompt caching
- 4. Batch APIs
- 5. Token budgeting
- 6. Concurrency control
- 7. Latency optimisation
- 8. Caching beyond prompts
- 9. Fallback and degraded modes
- 10. Operational runbook
- 11. The cost model
Key Takeaways
Insight
The 2024–2026-era feature that changed agent economics more than any single model upgrade is prompt caching: a well-cached agent is 3–5x cheaper than an uncached equivalent on the same prompt.
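A back-of-envelope check of the 3–5x claim. The sketch below assumes a provider that charges cached input tokens at 10% of the uncached rate (a common pattern, though multipliers vary by provider); the rates and token counts are illustrative, not real price sheets.

```python
# Hypothetical pricing: cached input tokens at 10% of the uncached rate.
UNCACHED_RATE = 3.00 / 1_000_000        # $ per input token (illustrative)
CACHED_RATE = 0.10 * UNCACHED_RATE

def prompt_cost(total_tokens: int, hit_ratio: float) -> float:
    """Cost of one prompt given the fraction of tokens served from cache."""
    cached = total_tokens * hit_ratio
    uncached = total_tokens - cached
    return cached * CACHED_RATE + uncached * UNCACHED_RATE

cold = prompt_cost(50_000, hit_ratio=0.0)     # nothing cached
warm = prompt_cost(50_000, hit_ratio=0.85)    # a stable, well-structured prefix
print(f"savings factor: {cold / warm:.1f}x")  # → 4.3x, inside the 3-5x band
```

At an 85% hit ratio the savings factor lands at roughly 4.3x, which is why the hit ratio itself becomes a first-class metric later in this chapter.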
Interview Questions
1. Your agent costs $0.10. Where do you look first?
Frame: the cost decomposition. How much goes to model tokens? Cached vs. uncached? How many tool calls, and how large are the tool results? How much of the context is retrieved, and how large is it? Then apply the six levers in order of expected impact: prompt caching, model routing, tool-result summarisation, retrieval.
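The decomposition can be sketched as a per-request breakdown that makes the biggest lever obvious at a glance. The field names and rates below are hypothetical; map them to whatever your provider's usage object actually reports.

```python
from dataclasses import dataclass

RATES = {                      # $ per token, illustrative only
    "input_uncached": 3.0e-6,
    "input_cached": 0.3e-6,
    "output": 15.0e-6,
}

@dataclass
class RequestUsage:
    cached_input: int
    uncached_input: int
    output: int
    tool_result_tokens: int    # subset of uncached input that came from tools
    retrieved_tokens: int      # subset of uncached input that came from retrieval

def decompose(u: RequestUsage) -> dict[str, float]:
    """Return a $-by-bucket breakdown of a single request."""
    return {
        "cached_input": u.cached_input * RATES["input_cached"],
        "uncached_input": u.uncached_input * RATES["input_uncached"],
        "output": u.output * RATES["output"],
        # Attribution within uncached input, to target summarisation/retrieval:
        "of_which_tool_results": u.tool_result_tokens * RATES["input_uncached"],
        "of_which_retrieval": u.retrieved_tokens * RATES["input_uncached"],
    }

usage = RequestUsage(cached_input=30_000, uncached_input=12_000, output=2_000,
                     tool_result_tokens=8_000, retrieved_tokens=3_000)
breakdown = decompose(usage)
total = (breakdown["cached_input"] + breakdown["uncached_input"]
         + breakdown["output"])
```

In this synthetic example the uncached input dominates, and two thirds of it is tool results — pointing at tool-result summarisation before anything else.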
2. Design the runtime stack for a pipeline-embedded agent that must handle 500 concurrent invocations.
Frame: per-tenant concurrency caps, per-provider concurrency caps, sandbox pool sizing, rate-limit-aware model routing, graceful degradation on provider outage, batch APIs for any non-interactive work (reports, summaries, evals).
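The layered caps can be sketched with nested semaphores: a per-tenant gate inside a per-provider gate, so a saturated tenant queues at its own limit instead of holding provider slots. The limits and names here are invented for illustration.

```python
import asyncio
from collections import defaultdict

PROVIDER_LIMIT = 200   # concurrent requests the provider tolerates (assumed)
TENANT_LIMIT = 20      # fair-share cap per tenant (assumed)

provider_sem = asyncio.Semaphore(PROVIDER_LIMIT)
tenant_sems: dict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(TENANT_LIMIT))

async def invoke(tenant: str, call):
    # Tenant cap first: a noisy tenant blocks on its own gate and never
    # consumes more than TENANT_LIMIT of the shared provider budget.
    async with tenant_sems[tenant]:
        async with provider_sem:
            return await call()

async def main():
    async def fake_model_call():        # stand-in for a real provider call
        await asyncio.sleep(0.01)
        return "ok"
    return await asyncio.gather(
        *(invoke(f"tenant-{i % 3}", fake_model_call) for i in range(30)))

results = asyncio.run(main())
```

The same shape extends to sandbox pool slots: add a third semaphore for the pool, acquired innermost, so a sandbox is only claimed once a model slot is guaranteed.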
3. The primary model provider is down. What happens?
Frame: fallback routing to a secondary provider, but only for Skills whose eval on the fallback model exceeds a threshold. For others, surface a degraded state to users. Communicate clearly; don't silently substitute. Post-incident, decide whether deeper multi-provider coverage is justified.
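The eval-gated failover reduces to a small routing function. The Skill names, scores, and the 0.90 threshold below are hypothetical; the point is that the routing decision consults recorded eval results rather than assuming interchangeability.

```python
# Per-Skill eval scores measured offline against the fallback model
# (hypothetical values).
FALLBACK_EVAL_SCORES: dict[str, float] = {
    "summarise_ticket": 0.94,
    "generate_sql": 0.71,
}
THRESHOLD = 0.90   # minimum acceptable eval score on the fallback (assumed)

def route(skill: str, primary_up: bool) -> str:
    """Pick a target: primary, eval-approved fallback, or explicit degraded state."""
    if primary_up:
        return "primary"
    if FALLBACK_EVAL_SCORES.get(skill, 0.0) >= THRESHOLD:
        return "fallback"
    return "degraded"   # surfaced to the user; never silently substituted
```

During the outage, `summarise_ticket` fails over while `generate_sql` reports degraded — exactly the "communicate, don't substitute" behaviour the answer calls for.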
4. Explain prompt caching and why its hit ratio is a first-class metric.
Frame: providers charge a fraction of the normal input rate for tokens served from cache. Structure prompts with the stable prefix at the top. Hit ratio is both a cost metric and a prompt-structure quality metric: a drop indicates prefix rotation or dynamic content leaking into the prefix, and is diagnosable from observability.
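A minimal version of the metric itself, assuming the provider's usage object distinguishes cache-read tokens from fresh input tokens (the field names here are invented; map them to whatever your provider returns).

```python
def cache_hit_ratio(cache_read_tokens: int, fresh_input_tokens: int) -> float:
    """Fraction of input tokens served from the provider's prompt cache."""
    total = cache_read_tokens + fresh_input_tokens
    return cache_read_tokens / total if total else 0.0

# A shift like this across deploys is the diagnostic signal: inspect the
# prompt prefix for dynamic content (timestamps, request IDs) leaking in.
before_deploy = cache_hit_ratio(45_000, 5_000)   # 0.9
after_deploy = cache_hit_ratio(10_000, 40_000)   # 0.2
```

Tracked per Skill and per deploy, the ratio doubles as a regression test on prompt structure: any change that reorders or parameterises the prefix shows up immediately.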
5. What's your latency target and how do you hit it?
Frame: pick a defensible target for the product (e.g. time to first token under 1 s; time to useful response under 10 s for most Skills). Levers: streaming, parallelism for independent tool calls, a smaller model in the critical path for the acknowledgement, prefetching of likely context. Measure p50 and p95; optimise p95 (that's the user-facing tail).
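The p50/p95 measurement the answer relies on can be sketched with a simple nearest-rank percentile; the latencies below are synthetic.

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a sample (no interpolation)."""
    s = sorted(values)
    k = min(len(s) - 1, round(p * (len(s) - 1)))
    return s[k]

# Synthetic per-request latencies: a healthy body with one slow outlier.
latencies_ms = [420, 460, 470, 480, 490, 500, 510, 530, 950, 3100]
p50 = percentile(latencies_ms, 0.50)   # the typical request
p95 = percentile(latencies_ms, 0.95)   # the tail worth optimising
```

Here p50 looks healthy (490 ms) while p95 is 3100 ms — the mean would hide the outlier, which is exactly why the tail, not the average, drives the optimisation work.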
6. When is a batch API appropriate and when not?
Frame: batch for non-real-time, non-sequential workloads: evals, bulk enrichment, scheduled reports, training data generation. Not for user-facing interactive requests or agent-loop sequential calls. Savings are roughly 50%; the latency cost is up to 24 hours.
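The decision reduces to a small predicate over job metadata. The attributes below are hypothetical; the rule encodes the three conditions from the answer: nobody waiting, nothing downstream blocking, and the deadline absorbing the up-to-24-hour batch latency.

```python
from dataclasses import dataclass

@dataclass
class Job:
    interactive: bool      # a user is waiting on the result
    sequential: bool       # the output feeds the next step of an agent loop
    deadline_hours: float  # how stale the result is allowed to be

BATCH_LATENCY_HOURS = 24   # worst-case batch turnaround (assumed)

def use_batch_api(job: Job) -> bool:
    """Route to the batch API only when its latency cost is invisible."""
    return (not job.interactive
            and not job.sequential
            and job.deadline_hours >= BATCH_LATENCY_HOURS)

nightly_eval = Job(interactive=False, sequential=False, deadline_hours=24)
chat_turn = Job(interactive=True, sequential=True, deadline_hours=0.01)
```

Nightly eval runs qualify; a chat turn never does, regardless of how generous its deadline looks on paper.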