AI Harness Engineering
Chapter 5 of 19

Part 2: Context and Instruction Engineering

05

Skills and Reusable Behaviors

Sections in this chapter

  1. From prompt strings to Skills
  2. The anatomy of a Skill
  3. A taxonomy of Skills
  4. Editor Skills versus pipeline Skills
  5. Skill discovery and selection
  6. Testing Skills: regression suites for prompt templates
  7. Skill drift and maintenance
  8. A/B testing Skill versions
  9. Skills versus tools: a distinction worth learning
  10. Harness Skills: the canonical catalogue

Key Takeaways

Insight

A mature team's Skill catalogue reads like a product roadmap, because it is one. When a new capability is wanted, the question becomes "what Skill do we need to author, test, and ship?" — the same discipline applied to any other shipped feature.

Common Trap

The failure mode of Skill selection is picking the wrong Skill confidently. A classifier that routes a billing question to the technical-support Skill produces a plausible-but-wrong answer that is harder to catch than an outright failure.

Interview Questions

1

How do you prevent a Skill from drifting out of date as the codebase evolves?

Frame: scheduled drift detection (daily golden eval, production sampling, outcome signals), alerting on regression, a maintenance response that adds failing cases to the eval. Culturally: ownership — every Skill has a named owner and a review cadence.
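The maintenance loop in this frame can be sketched in a few lines. All names here (`check_golden`, `absorb_failure`, the `Skill` fields) are illustrative, not from the book:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    prompt: str
    expected: str

@dataclass
class Skill:
    name: str
    owner: str  # every Skill has a named owner
    golden: list[GoldenCase] = field(default_factory=list)

def check_golden(skill, run_skill, threshold=0.9, alert=print):
    """Daily drift check: replay the golden eval; alert the owner on regression."""
    fails = [c for c in skill.golden if run_skill(c.prompt) != c.expected]
    rate = 1 - len(fails) / len(skill.golden)
    if rate < threshold:
        alert(f"{skill.name} regressed to {rate:.0%}; paging {skill.owner}")
    return rate, fails

def absorb_failure(skill, prompt, expected):
    """Maintenance response: a failure caught by production sampling
    becomes a permanent golden case so the regression cannot recur."""
    skill.golden.append(GoldenCase(prompt, expected))
```

The same check can run against sampled production traffic; the key property is that every confirmed failure flows back into the golden set.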

2

Design a Skill-selection system for an agent with 50 domain Skills.

Frame: two-stage selection. Stage 1: cheap classifier or semantic search narrows to 3–5 candidates. Stage 2: structured selection by the main agent over full descriptions. Add a confidence threshold and a "none applies" option. Evaluate the classifier against a golden routing dataset as a first-class metric.
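The two stages can be sketched as follows. Keyword overlap stands in for real semantic search, and `llm_choose` is a hypothetical hook for the main agent's structured choice:

```python
from dataclasses import dataclass

@dataclass
class SkillCard:
    name: str
    keywords: set[str]  # stand-in for an embedding index
    description: str

def stage1_narrow(query, catalogue, k=5):
    """Stage 1: cheap retrieval narrows the catalogue to top-k candidates."""
    words = set(query.lower().split())
    ranked = sorted(catalogue, key=lambda s: len(words & s.keywords), reverse=True)
    return ranked[:k]

def stage2_select(query, candidates, llm_choose, min_confidence=0.6):
    """Stage 2: the main agent picks over full descriptions.
    `llm_choose` returns (skill_name_or_None, confidence); below the
    threshold we answer 'none applies' instead of guessing."""
    name, confidence = llm_choose(query, candidates)
    if name is None or confidence < min_confidence:
        return None  # route to a clarifying or generic path
    return name
```

The "none applies" branch is what prevents the confident-misroute trap: an uncertain selection escalates rather than producing a plausible-but-wrong answer.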

3

What's the difference between a Skill and a tool? When does something become one versus the other?

Frame: a tool is a callable; a Skill is a curated behaviour that usually orchestrates tools. A tool's interface is its schema; a Skill's includes its prompt template, context policy, and eval suite. The litmus test: if the implementation has an LLM call in it, it is a Skill.
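The distinction is easy to make concrete in types. This is an illustrative sketch, not an API from the book:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    # A tool is a callable; its interface is its schema.
    name: str
    schema: dict               # JSON-schema-style parameter spec
    fn: Callable[..., object]  # deterministic body, no LLM call inside

@dataclass
class Skill:
    # A Skill's interface also carries a prompt template, a context
    # policy, and an eval suite; its implementation typically
    # orchestrates tools through LLM calls.
    name: str
    prompt_template: str
    context_policy: str
    tools: list[Tool] = field(default_factory=list)
    eval_suite: list[tuple[str, str]] = field(default_factory=list)
```

Notice the asymmetry: a `Tool` is fully described by schema plus function, while a `Skill` cannot be described without its eval suite.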

4

How do you version and deploy a Skill?

Frame: semantic versioning, changelog, source control, eval gate in CI, A/B test for material changes, rollback is a config change not a deploy, deprecation state tracked in the Skill metadata.
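The lifecycle in this frame reduces to metadata plus a routing config. A minimal sketch, assuming hypothetical names (`SkillVersion`, `active`, a "summarise-ticket" Skill):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillVersion:
    skill: str
    semver: str        # e.g. "2.1.0"
    changelog: str
    eval_passed: bool  # CI eval gate: False blocks release
    deprecated: bool = False  # deprecation state lives in the metadata

# The routing config pins which version serves traffic.
active: dict[str, str] = {"summarise-ticket": "2.1.0"}

def rollback(skill, to_version, registry):
    """Rollback is a config change, not a deploy: re-pin an
    already-shipped version in the routing config."""
    if to_version not in {v.semver for v in registry[skill]}:
        raise ValueError(f"unknown version {to_version}")
    active[skill] = to_version
```

Because every shipped version stays in the registry, rolling back is as cheap as re-pinning, which is what makes aggressive A/B testing of material changes safe.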

5

What's the value of editor-integrated Skills versus pipeline Skills?

Frame: editor Skills optimise for interactive iteration with a human in the loop; pipeline Skills optimise for programmatic, non-interactive use. A mature catalogue exposes both variants for the same underlying Skill, with different failure behaviours.
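The different failure behaviours are the crux, and they can be sketched as two wrappers around the same underlying Skill. Function names and the confidence threshold are illustrative:

```python
def run_editor_variant(skill_fn, prompt, ask_human):
    """Editor variant: optimised for interactive iteration. On low
    confidence, surface the draft to the human instead of failing."""
    draft, confidence = skill_fn(prompt)
    return draft if confidence >= 0.7 else ask_human(draft)

def run_pipeline_variant(skill_fn, prompt, max_retries=2):
    """Pipeline variant: non-interactive. Retry a bounded number of
    times, then fail loudly so the orchestrator can escalate."""
    for _ in range(max_retries + 1):
        result, confidence = skill_fn(prompt)
        if confidence >= 0.7:
            return result
    raise RuntimeError("Skill stayed below confidence threshold")
```

Same `skill_fn` underneath; only the edge behaviour differs, which is why a mature catalogue can expose both variants without duplicating the Skill itself.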

6

What does a Skill's eval suite contain, at minimum?

Frame: golden dataset (20–200 pairs), scorers (exact match / LLM judge / structured equality / human rubric, as appropriate), regression thresholds, slice analysis, adversarial cases. Every commit runs the suite; a failing suite blocks merge.
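The minimum contents in this frame map directly onto a data structure and a CI gate. A minimal sketch with illustrative names:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalSuite:
    golden: list[tuple[str, str]]       # 20-200 prompt/expected pairs
    scorer: Callable[[str, str], bool]  # exact match, LLM judge, rubric...
    threshold: float                    # regression gate
    adversarial: list[tuple[str, str]] = field(default_factory=list)
    slices: dict[str, list[int]] = field(default_factory=dict)  # indices into golden

def ci_gate(suite, run_skill):
    """Runs on every commit; returning False blocks the merge."""
    cases = suite.golden + suite.adversarial
    passed = sum(bool(suite.scorer(run_skill(p), e)) for p, e in cases)
    return passed / len(cases) >= suite.threshold
```

Slice analysis reuses the same scorer over the index subsets in `slices`, so a Skill that passes in aggregate but fails on one customer segment still shows up.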