AI Harness Engineering
Chapter 5 of 19

Part 2: Context and Instruction Engineering

05

Skills and Reusable Behaviors

Sections in this chapter

  1. From prompt strings to Skills
  2. The anatomy of a Skill
  3. A taxonomy of Skills
  4. Editor Skills versus pipeline Skills
  5. Skill discovery and selection
  6. Testing Skills: regression suites for prompt templates
  7. Skill drift and maintenance
  8. A/B testing Skill versions
  9. Skills versus tools: a distinction worth learning
  10. Harness Skills: the canonical catalogue

Key Takeaways

Insight

A mature team's Skill catalogue reads like a product roadmap, because it is one. When a new capability is wanted, the question becomes "what Skill do we need to author, test, and ship?" — the same discipline applied to any other shipped feature.

Common Trap

The failure mode of Skill selection is picking the wrong Skill confidently. A classifier that routes a billing question to the technical-support Skill produces a plausible-but-wrong answer that is harder to catch than an outright failure.

Interview Questions

1

How do you prevent a Skill from drifting out of date as the codebase evolves?

Frame: scheduled drift detection (daily golden eval, production sampling, outcome signals), alerting on regression, a maintenance response that adds failing cases to the eval. Culturally: ownership — every Skill has a named owner and a review cadence.
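The maintenance loop in this frame can be sketched in a few lines. All names here (`check_golden`, `absorb_failure`, the `Skill` fields) are illustrative, not from the book:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    prompt: str
    expected: str

@dataclass
class Skill:
    name: str
    owner: str  # every Skill has a named owner
    golden: list[GoldenCase] = field(default_factory=list)

def check_golden(skill, run_skill, threshold=0.9, alert=print):
    """Daily drift check: replay the golden eval; alert the owner on regression."""
    fails = [c for c in skill.golden if run_skill(c.prompt) != c.expected]
    rate = 1 - len(fails) / len(skill.golden)
    if rate < threshold:
        alert(f"{skill.name} regressed to {rate:.0%}; paging {skill.owner}")
    return rate, fails

def absorb_failure(skill, prompt, expected):
    """Maintenance response: a failure caught by production sampling
    becomes a permanent golden case so the regression cannot recur."""
    skill.golden.append(GoldenCase(prompt, expected))
```

The same check can run against sampled production traffic; the key property is that every confirmed failure flows back into the golden set.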

2

Design a Skill-selection system for an agent with 50 domain Skills.

Frame: two-stage selection. Stage 1: cheap classifier or semantic search narrows to 3–5 candidates. Stage 2: structured selection by the main agent over full descriptions. Add a confidence threshold and a "none applies" option. Evaluate the classifier against a golden routing dataset as a first-class metric.
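The two stages can be sketched as follows. Keyword overlap stands in for real semantic search, and `llm_choose` is a hypothetical hook for the main agent's structured choice:

```python
from dataclasses import dataclass

@dataclass
class SkillCard:
    name: str
    keywords: set[str]  # stand-in for an embedding index
    description: str

def stage1_narrow(query, catalogue, k=5):
    """Stage 1: cheap retrieval narrows the catalogue to top-k candidates."""
    words = set(query.lower().split())
    ranked = sorted(catalogue, key=lambda s: len(words & s.keywords), reverse=True)
    return ranked[:k]

def stage2_select(query, candidates, llm_choose, min_confidence=0.6):
    """Stage 2: the main agent picks over full descriptions.
    `llm_choose` returns (skill_name_or_None, confidence); below the
    threshold we answer 'none applies' instead of guessing."""
    name, confidence = llm_choose(query, candidates)
    if name is None or confidence < min_confidence:
        return None  # route to a clarifying or generic path
    return name
```

The "none applies" branch is what prevents the confident-misroute trap: an uncertain selection escalates rather than producing a plausible-but-wrong answer.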

3

What's the difference between a Skill and a tool? When does something become one versus the other?

Frame: a tool is a callable; a Skill is a curated behaviour that usually orchestrates tools. A tool's interface is its schema; a Skill's includes its prompt template, context policy, and eval suite. The litmus test: if the implementation has an LLM call in it, it is a Skill.
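The distinction is easy to make concrete in types. This is an illustrative sketch, not an API from the book:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    # A tool is a callable; its interface is its schema.
    name: str
    schema: dict               # JSON-schema-style parameter spec
    fn: Callable[..., object]  # deterministic body, no LLM call inside

@dataclass
class Skill:
    # A Skill's interface also carries a prompt template, a context
    # policy, and an eval suite; its implementation typically
    # orchestrates tools through LLM calls.
    name: str
    prompt_template: str
    context_policy: str
    tools: list[Tool] = field(default_factory=list)
    eval_suite: list[tuple[str, str]] = field(default_factory=list)
```

Notice the asymmetry: a `Tool` is fully described by schema plus function, while a `Skill` cannot be described without its eval suite.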

4

How do you version and deploy a Skill?

Frame: semantic versioning, changelog, source control, eval gate in CI, A/B test for material changes, rollback is a config change not a deploy, deprecation state tracked in the Skill metadata.
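The lifecycle in this frame reduces to metadata plus a routing config. A minimal sketch, assuming hypothetical names (`SkillVersion`, `active`, a "summarise-ticket" Skill):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillVersion:
    skill: str
    semver: str        # e.g. "2.1.0"
    changelog: str
    eval_passed: bool  # CI eval gate: False blocks release
    deprecated: bool = False  # deprecation state lives in the metadata

# The routing config pins which version serves traffic.
active: dict[str, str] = {"summarise-ticket": "2.1.0"}

def rollback(skill, to_version, registry):
    """Rollback is a config change, not a deploy: re-pin an
    already-shipped version in the routing config."""
    if to_version not in {v.semver for v in registry[skill]}:
        raise ValueError(f"unknown version {to_version}")
    active[skill] = to_version
```

Because every shipped version stays in the registry, rolling back is as cheap as re-pinning, which is what makes aggressive A/B testing of material changes safe.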

5

What's the value of editor-integrated Skills versus pipeline Skills?

Frame: editor Skills optimise for interactive iteration with a human in the loop; pipeline Skills optimise for programmatic, non-interactive use. A mature catalogue exposes both variants for the same underlying Skill, with different failure behaviours.
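The different failure behaviours are the crux, and they can be sketched as two wrappers around the same underlying Skill. Function names and the confidence threshold are illustrative:

```python
def run_editor_variant(skill_fn, prompt, ask_human):
    """Editor variant: optimised for interactive iteration. On low
    confidence, surface the draft to the human instead of failing."""
    draft, confidence = skill_fn(prompt)
    return draft if confidence >= 0.7 else ask_human(draft)

def run_pipeline_variant(skill_fn, prompt, max_retries=2):
    """Pipeline variant: non-interactive. Retry a bounded number of
    times, then fail loudly so the orchestrator can escalate."""
    for _ in range(max_retries + 1):
        result, confidence = skill_fn(prompt)
        if confidence >= 0.7:
            return result
    raise RuntimeError("Skill stayed below confidence threshold")
```

Same `skill_fn` underneath; only the edge behaviour differs, which is why a mature catalogue can expose both variants without duplicating the Skill itself.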

6

What does a Skill's eval suite contain, at minimum?

Frame: golden dataset (20–200 pairs), scorers (exact match / LLM judge / structured equality / human rubric, as appropriate), regression thresholds, slice analysis, adversarial cases. Every commit runs the suite; a failing suite blocks merge.
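The minimum contents in this frame map directly onto a data structure and a CI gate. A minimal sketch with illustrative names:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalSuite:
    golden: list[tuple[str, str]]       # 20-200 prompt/expected pairs
    scorer: Callable[[str, str], bool]  # exact match, LLM judge, rubric...
    threshold: float                    # regression gate
    adversarial: list[tuple[str, str]] = field(default_factory=list)
    slices: dict[str, list[int]] = field(default_factory=dict)  # indices into golden

def ci_gate(suite, run_skill):
    """Runs on every commit; returning False blocks the merge."""
    cases = suite.golden + suite.adversarial
    passed = sum(bool(suite.scorer(run_skill(p), e)) for p, e in cases)
    return passed / len(cases) >= suite.threshold
```

Slice analysis reuses the same scorer over the index subsets in `slices`, so a Skill that passes in aggregate but fails on one customer segment still shows up.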