Part 9 — Observability, Runtime, and Cost
Tracing and Observability
Sections in this chapter
1. Why traditional observability isn't enough
2. GenAI OpenTelemetry semantics
3. Agent run as a trace
4. Cost and budget visibility
5. Prompt and completion capture
6. Anomaly detection on traces
7. Feedback signal integration
8. Sampling and retention
9. A worked example: the observability stack for a 40-Skill platform
Key Takeaways
Common Trap
The failure mode where captured prompts contain secrets and those secrets propagate into observability vendors' backends is a recurring incident class. A single unsanitised debug run can expose an API key.
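A cheap first line of defence is to scrub secret-shaped strings at capture time, inside the producing process, before any exporter runs. A minimal sketch in Python; the regexes are illustrative assumptions, not a complete ruleset:

```python
import re

# Illustrative patterns only; a real deployment would use a maintained
# secret-scanning ruleset rather than these three regexes.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),             # OpenAI-style API key
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key ID
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]{16,}"),  # bearer token
]

def scrub(text: str) -> str:
    """Redact secret-shaped substrings before the text is attached to a
    span attribute. This runs at capture time, inside the producing
    process: the secret must never reach the exporter, let alone the vendor."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(scrub("Call the API with key sk-abcdefghijklmnopqrstuvwx"))
# -> Call the API with key [REDACTED]
```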
Interview Questions
1. What does an agent trace look like, and why does it differ from a traditional microservice trace?
Frame: three differences — non-determinism requires comparing paths across runs; semantic content (prompts, tool args/results) is the primary evidence, not metadata; cost is a first-class metric that must decompose across many dimensions. Spans are structured as a tree from agent run to steps to LLM/tool calls; the GenAI OpenTelemetry conventions name the spans and attributes.
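To make the tree concrete, here is a minimal sketch with the OpenTelemetry Python SDK: one root span for the agent run, a child span per step, and LLM/tool calls as leaves. The gen_ai.* attribute names follow the OpenTelemetry GenAI semantic conventions; the span names, model, and token counts are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-harness")

# Root span: the whole agent run. Children: one span per step, with the
# LLM and tool calls as leaves under the step that made them.
with tracer.start_as_current_span("agent_run") as run:
    run.set_attribute("gen_ai.operation.name", "invoke_agent")
    with tracer.start_as_current_span("step.plan"):
        with tracer.start_as_current_span("llm.chat") as llm:
            llm.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative
            llm.set_attribute("gen_ai.usage.input_tokens", 812)
            llm.set_attribute("gen_ai.usage.output_tokens", 96)
    with tracer.start_as_current_span("step.act"):
        with tracer.start_as_current_span("tool.search") as tool:
            tool.set_attribute("gen_ai.tool.name", "search")
```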
2. Debug this: your agent's success rate dropped 10% overnight. What's your first five minutes?
Frame: pull the dashboards. Is the drop concentrated on a specific Skill, model, tenant, or tool? Check the deploy log for changes in the last 24 hours. Check the model provider's status page. If nothing is obvious: sample 20 failed traces, classify them via the nine-category taxonomy, and let the concentration tell you what changed.
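The concentration check itself is mechanical once failed traces carry their labels. A sketch, assuming each failed run has already been reduced to a dict of skill/model/tenant/tool fields (the row shape is an assumption, not a standard schema):

```python
from collections import Counter

# Sampled failed traces, each reduced to its labels.
failed = [
    {"skill": "invoice_match", "model": "gpt-4o", "tenant": "acme",   "tool": "erp_lookup"},
    {"skill": "invoice_match", "model": "gpt-4o", "tenant": "globex", "tool": "erp_lookup"},
    {"skill": "ticket_triage", "model": "gpt-4o", "tenant": "acme",   "tool": "crm_search"},
]

def concentration(traces, dimension):
    """Share of failures held by the most common value of one dimension.
    A share near 1.0 says the regression lives there."""
    counts = Counter(t[dimension] for t in traces)
    value, count = counts.most_common(1)[0]
    return value, count / len(traces)

for dim in ("skill", "model", "tenant", "tool"):
    value, share = concentration(failed, dim)
    print(f"{dim:>6}: {value} holds {share:.0%} of failures")
```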
3. Cost per run vs. cost per successful outcome — when does each mislead?
Frame: cost per run misleads when you compare cheap-but-unreliable Skills against expensive-but-reliable ones; cost per success misleads when most of the cost is fixed regardless of outcome. Track both, and understand what each one optimises for.
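A worked sketch with made-up numbers shows exactly where the two metrics diverge:

```python
def cost_metrics(runs):
    """runs: list of (cost_usd, succeeded) pairs for one Skill."""
    total = sum(cost for cost, _ in runs)
    successes = sum(1 for _, ok in runs if ok)
    per_run = total / len(runs)
    per_success = total / successes if successes else float("inf")
    return per_run, per_success

# Skill A: cheap but unreliable. Skill B: expensive but reliable.
skill_a = [(0.02, ok) for ok in (True, False, False, False)]  # 25% success
skill_b = [(0.05, ok) for ok in (True, True, True, False)]    # 75% success

for name, runs in (("A", skill_a), ("B", skill_b)):
    per_run, per_success = cost_metrics(runs)
    print(f"Skill {name}: ${per_run:.3f}/run, ${per_success:.3f}/success")
# Skill A: $0.020/run, $0.080/success
# Skill B: $0.050/run, $0.067/success
```

Per-run flatters Skill A; per-success shows Skill B is the cheaper way to actually get an outcome.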
4. Design a capture policy for prompts and tool results in an enterprise deployment.
Frame: tiered — 100% capture with PII redaction at source for 7 days, then truncated, then sampled. Secrets dropped entirely via pattern detection. Failure-biased retention for escalations and guardrail trips. Compliance-required retention on separate immutable storage.
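One way to make such a policy enforceable rather than conventional is to express it as data the capture pipeline reads. A minimal sketch; the schema, tier names, and numbers are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaptureTier:
    name: str
    sample_rate: float           # fraction of runs kept in this tier
    retention_days: int
    truncate_chars: int | None   # None = keep full payloads

# Tiers echo the policy above; the exact numbers are illustrative.
POLICY = [
    CaptureTier("hot",     sample_rate=1.00, retention_days=7,   truncate_chars=None),
    CaptureTier("warm",    sample_rate=1.00, retention_days=30,  truncate_chars=2_000),
    CaptureTier("sampled", sample_rate=0.05, retention_days=365, truncate_chars=500),
]

# Applied before tiering:
# - secret-shaped content is dropped entirely at capture time
# - failures, escalations, and guardrail trips are retained at 100%
# - compliance-mandated records go to separate immutable storage
FAILURE_BIAS_SAMPLE_RATE = 1.0
```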
5. How do you detect a prompt-injection attack from traces?
Frame: patterns across multiple runs. A cluster of guardrail trips in a short window. Unusual tool-call sequences inconsistent with the Skill's typical trajectory. Content in retrieved documents matching known injection patterns. Output guardrail trips on exfiltration patterns. Alert on the cluster, not the individual trip.
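A sketch of "alert on the cluster": count trips in a sliding window and fire only past a threshold. The window size and threshold are assumptions you would tune against your baseline trip rate:

```python
from collections import deque

class GuardrailTripAlarm:
    """Fire when guardrail trips cluster in a short window,
    not on every individual trip."""

    def __init__(self, window_seconds: float = 300, threshold: int = 5):
        self.window = window_seconds
        self.threshold = threshold
        self.trips = deque()  # timestamps of recent trips

    def record(self, timestamp: float) -> bool:
        """Record one trip; return True when the cluster alarm fires."""
        self.trips.append(timestamp)
        while self.trips and timestamp - self.trips[0] > self.window:
            self.trips.popleft()
        return len(self.trips) >= self.threshold

alarm = GuardrailTripAlarm()
# Five trips inside two minutes: fires on the fifth, not the first.
print([alarm.record(t) for t in (0, 20, 45, 80, 110)])
# [False, False, False, False, True]
```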
6. What's the role of OpenTelemetry in agent observability?
Frame: standard semantic conventions (gen_ai.*) so the data is portable across backends. Instrumentation lives in the harness, not in each Skill. A collector aggregates. Backends can differ per team; the instrumentation does not.
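A sketch of "instrumentation in the harness": the harness wraps every Skill call in a span, so Skill authors never write tracing code. The decorator is a hypothetical helper; the gen_ai.* attribute follows the OpenTelemetry GenAI conventions:

```python
import functools
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_skill(fn):
    """Harness-level wrapper: every Skill gets a span with the same
    conventions, with no tracing code inside the Skill itself."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(f"skill.{fn.__name__}") as span:
            span.set_attribute("gen_ai.operation.name", "execute_tool")
            return fn(*args, **kwargs)
    return wrapper

@traced_skill
def summarise_ticket(ticket_id: str) -> str:
    return f"summary of {ticket_id}"  # the Skill body stays tracing-free

summarise_ticket("T-1234")
```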