Part 9 — Observability, Runtime, and Cost
Tracing and Observability
Sections in this chapter
1. Why traditional observability isn't enough
2. GenAI OpenTelemetry semantics
3. Agent run as a trace
4. Cost and budget visibility
5. Prompt and completion capture
6. Anomaly detection on traces
7. Feedback signal integration
8. Sampling and retention
9. A worked example: the observability stack for a 40-Skill platform
Key Takeaways
Common Trap
The failure mode where captured prompts contain secrets and those secrets propagate into observability vendors' backends is a recurring incident class. A single unsanitised debug run can expose an API key.
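A cheap first line of defence is to scrub secret-shaped strings at capture time, inside the producing process, before any exporter runs. A minimal sketch in Python; the regexes are illustrative assumptions, not a complete ruleset:

```python
import re

# Illustrative patterns only; a real deployment would use a maintained
# secret-scanning ruleset rather than these three regexes.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),             # OpenAI-style API key
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key ID
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]{16,}"),  # bearer token
]

def scrub(text: str) -> str:
    """Redact secret-shaped substrings before the text is attached to a
    span attribute. This runs at capture time, inside the producing
    process: the secret must never reach the exporter, let alone the vendor."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(scrub("Call the API with key sk-abcdefghijklmnopqrstuvwx"))
# -> Call the API with key [REDACTED]
```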
Interview Questions
1. What does an agent trace look like, and why does it differ from a traditional microservice trace?
Frame: three differences — non-determinism requires comparing paths across runs; semantic content (prompts, tool args/results) is the primary evidence, not metadata; cost is a first-class metric that must decompose across many dimensions. Spans are structured as a tree from agent run to steps to LLM/tool calls; the GenAI OpenTelemetry conventions name the spans and attributes.
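To make the tree concrete, here is a minimal sketch with the OpenTelemetry Python SDK: one root span for the agent run, a child span per step, and LLM/tool calls as leaves. The gen_ai.* attribute names follow the OpenTelemetry GenAI semantic conventions; the span names, model, and token counts are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-harness")

# Root span: the whole agent run. Children: one span per step, with the
# LLM and tool calls as leaves under the step that made them.
with tracer.start_as_current_span("agent_run") as run:
    run.set_attribute("gen_ai.operation.name", "invoke_agent")
    with tracer.start_as_current_span("step.plan"):
        with tracer.start_as_current_span("llm.chat") as llm:
            llm.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative
            llm.set_attribute("gen_ai.usage.input_tokens", 812)
            llm.set_attribute("gen_ai.usage.output_tokens", 96)
    with tracer.start_as_current_span("step.act"):
        with tracer.start_as_current_span("tool.search") as tool:
            tool.set_attribute("gen_ai.tool.name", "search")
```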
2. Debug this: your agent's success rate dropped 10% overnight. What's your first five minutes?
Frame: pull the dashboards. Is the drop concentrated on a specific Skill, model, tenant, or tool? Check the deploy log for changes in the last 24 hours. Check the model provider's status page. If nothing is obvious: sample 20 failed traces, classify them via the nine-category taxonomy, and let the concentration tell you what changed.
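The concentration check itself is mechanical once failed traces carry their labels. A sketch, assuming each failed run has already been reduced to a dict of skill/model/tenant/tool fields (the row shape is an assumption, not a standard schema):

```python
from collections import Counter

# Sampled failed traces, each reduced to its labels.
failed = [
    {"skill": "invoice_match", "model": "gpt-4o", "tenant": "acme",   "tool": "erp_lookup"},
    {"skill": "invoice_match", "model": "gpt-4o", "tenant": "globex", "tool": "erp_lookup"},
    {"skill": "ticket_triage", "model": "gpt-4o", "tenant": "acme",   "tool": "crm_search"},
]

def concentration(traces, dimension):
    """Share of failures held by the most common value of one dimension.
    A share near 1.0 says the regression lives there."""
    counts = Counter(t[dimension] for t in traces)
    value, count = counts.most_common(1)[0]
    return value, count / len(traces)

for dim in ("skill", "model", "tenant", "tool"):
    value, share = concentration(failed, dim)
    print(f"{dim:>6}: {value} holds {share:.0%} of failures")
```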
3. Cost per run vs. cost per successful outcome — when does each mislead?
Frame: cost per run misleads when you compare cheap-but-unreliable Skills against expensive-but-reliable ones; cost per success misleads when most of the cost is fixed regardless of outcome. Track both, and understand what each one optimises for.
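A worked sketch with made-up numbers shows exactly where the two metrics diverge:

```python
def cost_metrics(runs):
    """runs: list of (cost_usd, succeeded) pairs for one Skill."""
    total = sum(cost for cost, _ in runs)
    successes = sum(1 for _, ok in runs if ok)
    per_run = total / len(runs)
    per_success = total / successes if successes else float("inf")
    return per_run, per_success

# Skill A: cheap but unreliable. Skill B: expensive but reliable.
skill_a = [(0.02, ok) for ok in (True, False, False, False)]  # 25% success
skill_b = [(0.05, ok) for ok in (True, True, True, False)]    # 75% success

for name, runs in (("A", skill_a), ("B", skill_b)):
    per_run, per_success = cost_metrics(runs)
    print(f"Skill {name}: ${per_run:.3f}/run, ${per_success:.3f}/success")
# Skill A: $0.020/run, $0.080/success
# Skill B: $0.050/run, $0.067/success
```

Per-run flatters Skill A; per-success shows Skill B is the cheaper way to actually get an outcome.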
4. Design a capture policy for prompts and tool results in an enterprise deployment.
Frame: tiered — 100% capture with PII redaction at source for 7 days, then truncated, then sampled. Secrets dropped entirely via pattern detection. Failure-biased retention for escalations and guardrail trips. Compliance-required retention on separate immutable storage.
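One way to make such a policy enforceable rather than conventional is to express it as data the capture pipeline reads. A minimal sketch; the schema, tier names, and numbers are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaptureTier:
    name: str
    sample_rate: float           # fraction of runs kept in this tier
    retention_days: int
    truncate_chars: int | None   # None = keep full payloads

# Tiers echo the policy above; the exact numbers are illustrative.
POLICY = [
    CaptureTier("hot",     sample_rate=1.00, retention_days=7,   truncate_chars=None),
    CaptureTier("warm",    sample_rate=1.00, retention_days=30,  truncate_chars=2_000),
    CaptureTier("sampled", sample_rate=0.05, retention_days=365, truncate_chars=500),
]

# Applied before tiering:
# - secret-shaped content is dropped entirely at capture time
# - failures, escalations, and guardrail trips are retained at 100%
# - compliance-mandated records go to separate immutable storage
FAILURE_BIAS_SAMPLE_RATE = 1.0
```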
5. How do you detect a prompt-injection attack from traces?
Frame: patterns across multiple runs. A cluster of guardrail trips in a short window. Unusual tool-call sequences inconsistent with the Skill's typical trajectory. Content in retrieved documents matching known injection patterns. Output guardrail trips on exfiltration patterns. Alert on the cluster, not the individual trip.
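A sketch of "alert on the cluster": count trips in a sliding window and fire only past a threshold. The window size and threshold are assumptions you would tune against your baseline trip rate:

```python
from collections import deque

class GuardrailTripAlarm:
    """Fire when guardrail trips cluster in a short window,
    not on every individual trip."""

    def __init__(self, window_seconds: float = 300, threshold: int = 5):
        self.window = window_seconds
        self.threshold = threshold
        self.trips = deque()  # timestamps of recent trips

    def record(self, timestamp: float) -> bool:
        """Record one trip; return True when the cluster alarm fires."""
        self.trips.append(timestamp)
        while self.trips and timestamp - self.trips[0] > self.window:
            self.trips.popleft()
        return len(self.trips) >= self.threshold

alarm = GuardrailTripAlarm()
# Five trips inside two minutes: fires on the fifth, not the first.
print([alarm.record(t) for t in (0, 20, 45, 80, 110)])
# [False, False, False, False, True]
```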
6. What's the role of OpenTelemetry in agent observability?
Frame: standard semantic conventions (gen_ai.*) so the data is portable across backends. Instrumentation lives in the harness, not in each Skill. A collector aggregates. Backends can differ per team; the instrumentation does not.
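A sketch of "instrumentation in the harness": the harness wraps every Skill call in a span, so Skill authors never write tracing code. The decorator is a hypothetical helper; the gen_ai.* attribute follows the OpenTelemetry GenAI conventions:

```python
import functools
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_skill(fn):
    """Harness-level wrapper: every Skill gets a span with the same
    conventions, with no tracing code inside the Skill itself."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(f"skill.{fn.__name__}") as span:
            span.set_attribute("gen_ai.operation.name", "execute_tool")
            return fn(*args, **kwargs)
    return wrapper

@traced_skill
def summarise_ticket(ticket_id: str) -> str:
    return f"summary of {ticket_id}"  # the Skill body stays tracing-free

summarise_ticket("T-1234")
```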