Agent Evaluation
Definition
The process of testing an agent's reasoning, tool calls, final answers, safety, cost, and latency before or during production use.
Evaluation, Guardrails & Safety terms and explanations from the Agentic AI Glossary.

Benchmarking
Definition
Comparing an AI system against baselines, alternative models, datasets, or performance targets.

Confidence Estimation
Definition
Estimating how reliable an output, classification, tool call, or decision is likely to be.

Cost per Task
Definition
The average spend required for one completed task, including model tokens, tool calls, infrastructure, retries, and human review when needed.
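As a sketch of the arithmetic (the prices and field names below are illustrative assumptions, not real vendor rates):

```python
def cost_per_task(tasks, price_per_1k_tokens=0.01, price_per_tool_call=0.002):
    """Average spend per completed task; all prices are hypothetical."""
    total = 0.0
    for t in tasks:
        total += t["tokens"] / 1000 * price_per_1k_tokens  # model tokens
        total += t["tool_calls"] * price_per_tool_call     # tool invocations
        total += t.get("retry_cost", 0.0)                  # retries
        total += t.get("human_review_cost", 0.0)           # human review when needed
    return total / len(tasks)
```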

Escalation Rate
Definition
The percentage of tasks or conversations that must be handed to a human, specialist agent, or higher-trust workflow.

Eval Set
Definition
A curated set of prompts, scenarios, expected behaviors, and edge cases used to test an AI system repeatedly.

Evaluation
Definition
A structured process for measuring quality, safety, correctness, and business value of AI behavior.

Failure Rate
Definition
The share of tasks where the agent gives a wrong answer, calls the wrong tool, violates policy, or fails to finish.

False Negative
Definition
A missed detection, such as failing to flag a risky output, bad retrieval result, policy violation, or defect.

False Positive
Definition
An incorrect alert or block, such as flagging safe content, valid tool use, or correct output as unsafe.
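The two error types above are usually reported together. A minimal sketch, assuming each evaluated item is labeled with whether the system flagged it and whether it was actually unsafe:

```python
def confusion_rates(records):
    """records: list of (flagged_by_system, actually_unsafe) boolean pairs."""
    fp = sum(1 for flagged, unsafe in records if flagged and not unsafe)
    fn = sum(1 for flagged, unsafe in records if not flagged and unsafe)
    negatives = sum(1 for _, unsafe in records if not unsafe)  # truly safe items
    positives = sum(1 for _, unsafe in records if unsafe)      # truly unsafe items
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }
```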

Final Answer Evaluation
Definition
Reviewing the final response for correctness, groundedness, completeness, tone, policy compliance, and usefulness to the user.

Goal Completion Rate
Definition
The percentage of tasks where the agent reaches the requested outcome without unnecessary failure, escalation, or user rework.

Golden Dataset
Definition
A trusted set of examples, expected outputs, or human-labeled judgments used for evaluation.

Groundedness
Definition
How strongly an answer is supported by retrieved documents, verified data, tool results, or other trusted evidence.

Helpfulness
Definition
How well the response solves the user's real problem with clear, relevant, and actionable information.

Human Evaluation
Definition
Quality review performed by people who judge usefulness, correctness, safety, tone, and real-world task success.

Latency
Definition
The total time required to finish one task, including model calls, retrieval, tool execution, retries, and final response generation.

LLM-as-a-Judge
Definition
Using a language model to assess outputs, often with rubrics, references, or pairwise comparisons.
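A judge prompt typically bundles a rubric, the output under review, and an optional reference answer. A minimal sketch of prompt assembly (the rubric text and section layout are illustrative assumptions):

```python
RUBRIC = (
    "Score the ANSWER from 1-5 on: correctness, groundedness, helpfulness.\n"
    'Return only a JSON object like {"correctness": 5, "groundedness": 5, "helpfulness": 5}.'
)

def build_judge_prompt(question, answer, reference=None):
    """Assemble a rubric-based judging prompt; the model call itself is out of scope."""
    parts = [RUBRIC, f"QUESTION: {question}", f"ANSWER: {answer}"]
    if reference is not None:
        parts.append(f"REFERENCE: {reference}")  # enables reference-based judging
    return "\n\n".join(parts)
```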

Multi-turn Evaluation
Definition
Testing whether an agent stays accurate, safe, and context-aware across a conversation or long-running workflow.

Pairwise Comparison
Definition
Comparing two outputs side by side so reviewers or judge models can select the better answer or behavior.
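Pairwise judgments are often aggregated into per-candidate win rates. A minimal sketch:

```python
from collections import Counter

def win_rates(judgments):
    """judgments: list of (candidate_a, candidate_b, winner) tuples."""
    wins, appearances = Counter(), Counter()
    for a, b, winner in judgments:
        appearances[a] += 1
        appearances[b] += 1
        wins[winner] += 1
    # Fraction of comparisons each candidate appeared in and won.
    return {c: wins[c] / appearances[c] for c in appearances}
```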

Plan Accuracy
Definition
How well the generated plan matches the task requirements, dependencies, constraints, and expected order of execution.

Plan Quality
Definition
The usefulness, feasibility, and ordering of an agent's proposed steps.

RAG Evaluation
Definition
Testing retrieval and answer quality together, including context relevance, citation accuracy, faithfulness, and answer completeness.

Regression Test
Definition
A repeatable test that catches quality drops after model, prompt, retrieval, or tool changes.
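A regression test can be sketched as replaying a small golden set after every change (the cases and the `agent` callable here are hypothetical):

```python
GOLDEN_CASES = [  # hypothetical golden examples
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def run_regression(agent, cases=GOLDEN_CASES, min_pass_rate=1.0):
    """Replay golden cases through `agent` and flag any quality drop."""
    passed = sum(1 for c in cases if c["expected"] in agent(c["prompt"]))
    pass_rate = passed / len(cases)
    return pass_rate >= min_pass_rate, pass_rate
```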

Relevance
Definition
The degree to which retrieved context, tool output, or generated text directly answers the user's request.

Safety Score
Definition
A metric summarizing whether outputs and actions comply with safety and policy expectations.

Scenario
Definition
An evaluation case built around a realistic user situation, including inputs, constraints, expected behavior, and pass criteria.

Scenario Testing
Definition
Controlled testing of agent behavior in representative scenarios before production exposure.

Simulation
Definition
Running agents inside controlled mock environments to test behavior before exposing them to live users or real systems.

Step-level Evaluation
Definition
Checking each intermediate plan step, tool call, observation, and decision instead of only judging the final answer.

Task Success Rate
Definition
The percentage of tasks an agent completes according to predefined success criteria.

Token Usage
Definition
The number of input and output tokens consumed by a request, conversation, or task, used for cost and latency control.

Tool Call Accuracy
Definition
How often an agent chooses the correct tool and passes the correct arguments.

Tool Selection Accuracy
Definition
How often the agent chooses the correct tool, with the correct arguments, for the user's intent and system constraints.
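Both tool-accuracy metrics reduce to comparing the calls the agent made against the calls an expert would have made. A minimal sketch that requires an exact match on tool name and arguments:

```python
def tool_call_accuracy(expected, actual):
    """Fraction of calls where both the tool name and its arguments match."""
    correct = sum(
        1 for e, a in zip(expected, actual)
        if e["tool"] == a["tool"] and e["args"] == a["args"]
    )
    return correct / len(expected)
```

Real evaluations often relax the argument match (e.g. ignoring optional parameters), which this exact-match sketch does not attempt.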

Trajectory Evaluation
Definition
Assessing the full sequence of agent thoughts, tool calls, observations, and revisions.

Approval Gate
Definition
A required human or policy checkpoint before the agent performs a risky, expensive, or irreversible action.

Auditability
Definition
The ability to reconstruct what the agent saw, decided, called, and produced for review or compliance.

Compliance Check
Definition
A validation step that verifies output or action meets legal, regulatory, contractual, or internal policy requirements.

Content Filter
Definition
A rule or model that blocks, labels, or redirects content that violates safety, quality, or policy standards.

Data Loss Prevention (DLP)
Definition
Controls that detect and prevent sensitive data from being exposed, copied, logged, or sent to unsafe destinations.

Escalation
Definition
Routing a case to a human, specialist agent, or safer workflow when confidence, risk, or complexity requires it.

Escalation Path
Definition
A route from the agent to a human, specialist, or safer workflow when automation should not continue alone.

Fallback Response
Definition
A safe alternative answer used when the model is uncertain, retrieval fails, tools are unavailable, or policy blocks completion.

Fallback Strategy
Definition
A predefined alternative path when an agent has low confidence, fails, times out, or reaches a safety boundary.
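A minimal sketch of a confidence-threshold fallback (the threshold value and fallback message are illustrative assumptions):

```python
FALLBACK_MESSAGE = "I'm not confident enough to answer this reliably; routing to a human."

def respond(answer, confidence, threshold=0.7):
    """Return the model's answer only when confidence clears the threshold."""
    if confidence >= threshold:
        return answer
    return FALLBACK_MESSAGE  # predefined safe path on low confidence
```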

Guardrail
Definition
A rule, check, model, or workflow constraint that keeps AI behavior safe, compliant, and aligned with expectations.

Guardrails
Definition
Rules, checks, filters, permissions, and approval gates that keep agent behavior safe and compliant.

Human Review
Definition
Manual inspection of an AI output, decision, or planned action before it is approved, revised, or rejected.

Input Guardrail
Definition
A check applied to user input before model processing, often detecting harmful requests, prompt injection, or sensitive data.

Jailbreak Detection
Definition
The ability to identify jailbreak signals in inputs, outputs, logs, retrieved content, or system behavior.

Moderation
Definition
Classifying content for safety categories so the system can allow, block, transform, or escalate it appropriately.

Output Guardrail
Definition
A check applied after generation to catch unsafe, incorrect, private, or non-compliant output before delivery.

PII Detection
Definition
The ability to identify PII signals in inputs, outputs, logs, retrieved content, or system behavior.

PII Redaction
Definition
Removing or masking personally identifiable information so it is not exposed to users, logs, models, or downstream tools.
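A minimal regex-based sketch (the two patterns are illustrative; production redaction typically relies on dedicated PII detectors rather than hand-written regexes):

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # simplistic email pattern
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US SSN shape
}

def redact(text):
    """Replace each PII match with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```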

Policy-as-Code
Definition
Representing rules and compliance logic in executable configuration so checks are consistent and auditable.

Policy Check
Definition
A validation step that compares a request, plan, tool call, or answer against approved rules.
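Such a check can be sketched as comparing a proposed tool call against a rule table (the policy contents, tool names, and limits below are hypothetical):

```python
POLICY = {  # hypothetical approved rules
    "allowed_tools": {"search", "calculator", "refund"},
    "max_refund_usd": 100,
}

def policy_check(tool_call):
    """Return (allowed, reason) for a proposed tool call."""
    if tool_call["tool"] not in POLICY["allowed_tools"]:
        return False, "tool not approved"
    amount = tool_call.get("args", {}).get("amount_usd", 0)
    if tool_call["tool"] == "refund" and amount > POLICY["max_refund_usd"]:
        return False, "refund exceeds policy limit"
    return True, "ok"
```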

Prompt Injection Detection
Definition
The ability to identify prompt injection signals in inputs, outputs, logs, retrieved content, or system behavior.

Refusal
Definition
A safe response that declines to help with disallowed or harmful requests while keeping the tone professional.

Risk Score
Definition
A numeric or labeled measure that estimates risk for an output, action, user experience, or workflow result.

Safe Completion
Definition
A response that answers within allowed boundaries while avoiding unsafe instructions, private data, or unsupported claims.

Safety Check
Definition
A pre- or post-processing validation that looks for harm, misuse, policy violation, or high-risk behavior.

Tool Authorization
Definition
A verification that the agent is allowed to use a specific tool, data source, action, or permission scope.
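Such a verification can be sketched as a scope comparison, with unknown tools denied by default (the scope names and tool table are hypothetical):

```python
AGENT_SCOPES = {"crm.read", "tickets.write"}  # hypothetical granted scopes

TOOL_REQUIRED_SCOPES = {
    "lookup_customer": {"crm.read"},
    "delete_customer": {"crm.read", "crm.write"},
}

def authorized(tool_name, granted=AGENT_SCOPES):
    """Allow a tool only when every required scope has been granted."""
    required = TOOL_REQUIRED_SCOPES.get(tool_name)
    if required is None:
        return False  # deny-by-default for unknown tools
    return required <= granted  # subset check
```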