Agentic AI Glossary

Evaluation, Guardrails & Safety

Evaluation, Guardrails & Safety terms and explanations from the Agentic AI Glossary.

61 terms in this chapter

Agent Evaluation

Definition

The process of testing an agent's reasoning, tool calls, final answers, safety, cost, and latency before or during production use.

Benchmarking

Definition

Comparing an AI system against baselines, alternative models, datasets, or performance targets.

Confidence Scoring

Definition

Estimating how reliable an output, classification, tool call, or decision is likely to be.

Cost per Task

Definition

The average spend required for one completed task, including model tokens, tool calls, infrastructure, retries, and human review when needed.

Escalation Rate

Definition

The percentage of tasks or conversations that must be handed to a human, specialist agent, or higher-trust workflow.

Eval Dataset

Definition

A curated set of prompts, scenarios, expected behaviors, and edge cases used to test an AI system repeatedly.

Evaluation (Eval)

Definition

A structured process for measuring quality, safety, correctness, and business value of AI behavior.

Failure Rate

Definition

The share of tasks where the agent gives a wrong answer, calls the wrong tool, violates policy, or fails to finish.

False Negative

Definition

A missed detection, such as failing to flag a risky output, bad retrieval result, policy violation, or defect.

False Positive

Definition

An incorrect alert or block, such as flagging safe content, valid tool use, or correct output as unsafe.

Final Answer Evaluation

Definition

Reviewing the final response for correctness, groundedness, completeness, tone, policy compliance, and usefulness to the user.

Goal Completion Rate

Definition

The percentage of tasks where the agent reaches the requested outcome without unnecessary failure, escalation, or user rework.

Golden Dataset

Definition

A trusted set of examples, expected outputs, or human-labeled judgments used for evaluation.

Groundedness

Definition

How strongly an answer is supported by retrieved documents, verified data, tool results, or other trusted evidence.

Helpfulness

Definition

How well the response solves the user's real problem with clear, relevant, and actionable information.

Human Evaluation

Definition

Quality review performed by people who judge usefulness, correctness, safety, tone, and real-world task success.

Latency per Task

Definition

The total time required to finish one task, including model calls, retrieval, tool execution, retries, and final response generation.

LLM-as-Judge

Definition

Using a language model to assess outputs, often with rubrics, references, or pairwise comparisons.

Multi-Turn Evaluation

Definition

Testing whether an agent stays accurate, safe, and context-aware across a conversation or long-running workflow.

Pairwise Evaluation

Definition

Comparing two outputs side by side so reviewers or judge models can select the better answer or behavior.

Plan Accuracy

Definition

How well the generated plan matches the task requirements, dependencies, constraints, and expected order of execution.

Plan Quality

Definition

The usefulness, feasibility, and ordering of an agent's proposed steps.

RAG Evaluation

Definition

Testing retrieval and answer quality together, including context relevance, citation accuracy, faithfulness, and answer completeness.

Regression Test

Definition

A repeatable test that catches quality drops after model, prompt, retrieval, or tool changes.

Relevance

Definition

The degree to which retrieved context, tool output, or generated text directly answers the user's request.

Safety Score

Definition

A metric summarizing whether outputs and actions comply with safety and policy expectations.

Scenario Test

Definition

An evaluation case built around a realistic user situation, including inputs, constraints, expected behavior, and pass criteria.

Simulation

Definition

Controlled testing of agent behavior in representative scenarios before production exposure.

Simulation Test

Definition

Running agents inside controlled mock environments to test behavior before exposing them to live users or real systems.

Step-Level Evaluation

Definition

Checking each intermediate plan step, tool call, observation, and decision instead of only judging the final answer.

Task Success Rate

Definition

The percentage of tasks an agent completes according to predefined success criteria.

Token Usage

Definition

The number of input and output tokens consumed by a request, conversation, or task, used for cost and latency control.

Tool-Call Accuracy

Definition

How often an agent chooses the correct tool and passes the correct arguments.

Tool Selection Accuracy

Definition

How often the agent chooses the correct tool, with the correct arguments, for the user's intent and system constraints.

Trajectory Evaluation

Definition

Assessing the full sequence of agent thoughts, tool calls, observations, and revisions.

Approval Gate

Definition

A required human or policy checkpoint before the agent performs a risky, expensive, or irreversible action.

Auditability

Definition

The ability to reconstruct what the agent saw, decided, called, and produced for review or compliance.

Compliance Check

Definition

A validation step that verifies output or action meets legal, regulatory, contractual, or internal policy requirements.

Content Filter

Definition

A rule or model that blocks, labels, or redirects content that violates safety, quality, or policy standards.

Data Loss Prevention

Definition

Controls that detect and prevent sensitive data from being exposed, copied, logged, or sent to unsafe destinations.

Escalation

Definition

Routing a case to a human, specialist agent, or safer workflow when confidence, risk, or complexity requires it.

Escalation Path

Definition

A route from the agent to a human, specialist, or safer workflow when automation should not continue alone.

Fallback Response

Definition

A safe alternative answer used when the model is uncertain, retrieval fails, tools are unavailable, or policy blocks completion.

Fallback Strategy

Definition

A predefined alternative path when an agent has low confidence, fails, times out, or reaches a safety boundary.

Guardrail

Definition

A rule, check, model, or workflow constraint that keeps AI behavior safe, compliant, and aligned with expectations.

Guardrails

Definition

Rules, checks, filters, permissions, and approval gates that keep agent behavior safe and compliant.

Human Review

Definition

Manual inspection of an AI output, decision, or planned action before it is approved, revised, or rejected.

Input Guardrail

Definition

A check applied to user input before model processing, often detecting harmful requests, prompt injection, or sensitive data.

Jailbreak Detection

Definition

The ability to identify jailbreak signals in inputs, outputs, logs, retrieved content, or system behavior.

Moderation

Definition

Classifying content for safety categories so the system can allow, block, transform, or escalate it appropriately.

Output Guardrail

Definition

A check applied after generation to catch unsafe, incorrect, private, or non-compliant output before delivery.

PII Detection

Definition

The ability to identify pii signals in inputs, outputs, logs, retrieved content, or system behavior.

PII Redaction

Definition

Removing or masking personally identifiable information so it is not exposed to users, logs, models, or downstream tools.

Policy-as-Code

Definition

Representing rules and compliance logic in executable configuration so checks are consistent and auditable.

Policy Check

Definition

A validation step that compares a request, plan, tool call, or answer against approved rules.

Prompt Injection Detection

Definition

The ability to identify prompt injection signals in inputs, outputs, logs, retrieved content, or system behavior.

Refusal

Definition

A safe response that declines to help with disallowed or harmful requests while keeping the tone professional.

Risk Score

Definition

A numeric or labeled measure that estimates risk for an output, action, user experience, or workflow result.

Safe Completion

Definition

A response that answers within allowed boundaries while avoiding unsafe instructions, private data, or unsupported claims.

Safety Check

Definition

A pre- or post-processing validation that looks for harm, misuse, policy violation, or high-risk behavior.

Tool Permission Check

Definition

A verification that the agent is allowed to use a specific tool, data source, action, or permission scope.

Explore more chapters or test your knowledge with quizzes.

Back to Agentic AI Glossary All glossary chapters Practice quizzes