Part 2 — Context and Instruction Engineering
Skills and Reusable Behaviors
Sections in this chapter
1. From prompt strings to Skills
2. The anatomy of a Skill
3. A taxonomy of Skills
4. Editor Skills versus pipeline Skills
5. Skill discovery and selection
6. Testing Skills: regression suites for prompt templates
7. Skill drift and maintenance
8. A/B testing Skill versions
9. Skills versus tools: a distinction worth learning
10. Harness Skills: the canonical catalogue
Key Takeaways
Insight
A mature team's Skill catalogue reads like a product roadmap, because it is one. When a new capability is wanted, the question becomes "what Skill do we need to author, test, and ship?" That is the same decision a product team makes when scoping a feature.
Common Trap
The failure mode of Skill selection is picking the wrong Skill confidently. A classifier that routes a billing question to the technical-support Skill produces a plausible-but-wrong answer that is harder to catch than an outright failure.
Interview Questions
1. How do you prevent a Skill from drifting out of date as the codebase evolves?
Frame: scheduled drift detection (daily golden eval, production sampling, outcome signals), alerting on regression, a maintenance response that adds failing cases to the eval. Culturally: ownership — every Skill has a named owner and a review cadence.
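The drift-detection loop above can be sketched in a few lines. This is a minimal illustration, not a real framework: `GoldenCase`, `run_drift_check`, and the tolerance value are all assumptions, and the "skill" is any callable standing in for an LLM-backed behaviour.

```python
# Minimal drift-detection sketch. All names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    prompt: str
    expected: str


@dataclass
class DriftReport:
    pass_rate: float
    baseline: float
    regressed: bool
    failures: list = field(default_factory=list)


def run_drift_check(skill, cases, baseline: float, tolerance: float = 0.05) -> DriftReport:
    """Run the golden eval and flag a regression if the pass rate
    drops more than `tolerance` below the recorded baseline."""
    failures = [c for c in cases if skill(c.prompt) != c.expected]
    pass_rate = 1 - len(failures) / len(cases)
    return DriftReport(pass_rate, baseline, pass_rate < baseline - tolerance, failures)


def add_failures_to_golden_set(report: DriftReport, cases: list) -> list:
    """Maintenance response: each failing case becomes a permanent
    regression test in the golden set."""
    return cases + [c for c in report.failures if c not in cases]
```

In practice this check would run on a schedule (the daily golden eval), with the alerting and ownership handled outside the code.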
2. Design a Skill-selection system for an agent with 50 domain Skills.
Frame: two-stage selection. Stage 1: cheap classifier or semantic search narrows to 3–5 candidates. Stage 2: structured selection by the main agent over full descriptions. Add a confidence threshold and a "none applies" option. Evaluate the classifier against a golden routing dataset as a first-class metric.
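A sketch of the two-stage shape, with heavy simplifications: a word-overlap score stands in for the Stage 1 classifier or embedding search, and a score threshold models the Stage 2 confidence gate. The names and scoring are assumptions, not a real routing API.

```python
# Two-stage Skill selection, sketched with a toy lexical scorer.
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str


def _overlap(query: str, text: str) -> int:
    """Toy relevance score: shared words between query and description."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def stage1_narrow(query: str, skills: list, k: int = 3) -> list:
    """Stage 1: a cheap scorer narrows many Skills to a few candidates.
    In a real system this is a classifier or semantic search."""
    return sorted(skills, key=lambda s: _overlap(query, s.description), reverse=True)[:k]


def stage2_select(query: str, candidates: list, threshold: int = 1):
    """Stage 2: choose over full descriptions, with an explicit
    confidence threshold and None as the 'no Skill applies' outcome."""
    best, best_score = None, 0
    for skill in candidates:
        score = _overlap(query, skill.description)
        if score > best_score:
            best, best_score = skill, score
    return best if best_score >= threshold else None
```

The "none applies" branch matters: without it, the selector is forced to route every query somewhere, which is exactly the confident-wrong-pick trap.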
3. What's the difference between a Skill and a tool? When does something become one versus the other?
Frame: a tool is a callable; a Skill is a curated behaviour that usually orchestrates tools. A tool's interface is its schema; a Skill's interface includes its prompt template, context policy, and eval suite. A useful litmus test: if the implementation contains an LLM call, it is a Skill.
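The contrast can be made concrete with two toy types. These dataclasses are illustrative, not a standard API; the point is only what each one carries.

```python
# Illustrative contrast between a tool and a Skill. Types are assumptions.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    """A tool is just a callable plus its schema. The interface IS the schema."""
    name: str
    schema: dict
    fn: Callable


@dataclass
class Skill:
    """A Skill curates behaviour around model calls: a prompt template,
    a context policy, the tools it orchestrates, and its eval suite."""
    name: str
    prompt_template: str
    context_policy: str
    tools: list = field(default_factory=list)
    eval_suite: list = field(default_factory=list)

    def render(self, **kwargs) -> str:
        return self.prompt_template.format(**kwargs)
```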
4. How do you version and deploy a Skill?
Frame: semantic versioning, changelog, source control, eval gate in CI, A/B test for material changes, rollback as a config change rather than a deploy, deprecation state tracked in the Skill metadata.
5. What's the value of editor-integrated Skills versus pipeline Skills?
Frame: editor Skills optimise for interactive iteration with a human in the loop; pipeline Skills optimise for programmatic, non-interactive use. A mature catalogue exposes both variants for the same underlying Skill, with different failure behaviours.
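The "different failure behaviours" point can be shown with one underlying skill wrapped two ways. This is a sketch under assumed names; the 0.8 confidence cutoff is arbitrary.

```python
# One underlying skill, two wrappers with different failure behaviours.
# The skill callable returns (output, confidence); names are illustrative.

def run_editor_variant(skill, inp, ask_human):
    """Editor variant: a human is in the loop, so on low confidence we
    hand the draft to them instead of failing."""
    output, confidence = skill(inp)
    return output if confidence >= 0.8 else ask_human(inp, output)


def run_pipeline_variant(skill, inp):
    """Pipeline variant: non-interactive, so fail closed with a
    machine-readable error rather than emitting a low-confidence guess."""
    output, confidence = skill(inp)
    if confidence < 0.8:
        raise ValueError(f"low-confidence output for {inp!r}")
    return output
```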
6. What does a Skill's eval suite contain, at minimum?
Frame: golden dataset (20–200 pairs), scorers (exact match / LLM judge / structured equality / human rubric, as appropriate), regression thresholds, slice analysis, adversarial cases. Every commit runs the suite; a failing suite blocks merge.
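The minimum suite can be sketched as a golden dataset plus one scorer, a regression threshold, and per-slice rates. Everything here is illustrative: `EvalCase`, the `slice_tag` field, and the 0.95 threshold are assumptions.

```python
# Minimal eval-suite sketch: golden cases, a scorer, a gate, slice analysis.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    inp: str
    expected: str
    slice_tag: str = "default"   # label for slice analysis (e.g. "adversarial")


def exact_match(output: str, expected: str) -> bool:
    """Simplest scorer; LLM judges or structured equality slot in here."""
    return output.strip() == expected.strip()


def run_suite(skill: Callable, cases: list, scorer=exact_match, threshold: float = 0.95):
    """Return (overall pass rate, per-slice rates, gate result).
    A failing gate is what would block the merge in CI."""
    slices: dict = {}
    passed = 0
    for c in cases:
        ok = scorer(skill(c.inp), c.expected)
        passed += ok
        hit, total = slices.get(c.slice_tag, (0, 0))
        slices[c.slice_tag] = (hit + ok, total + 1)
    rate = passed / len(cases)
    slice_rates = {tag: hit / total for tag, (hit, total) in slices.items()}
    return rate, slice_rates, rate >= threshold
```

Slice rates are the part teams most often skip: an overall pass rate can stay green while one slice (say, the adversarial cases) quietly collapses.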