Evaluation & MLOps Deployment Failures: terms and explanations from the AI Failure Dictionary.
Weak Test Coverage
Definition
Evaluation does not cover enough real-world cases.
Solution
Expand the test set with edge cases, production examples, and subgroup coverage.
Definition
The trusted evaluation set is incomplete or low quality.
Solution
Use expert review, regular refreshes, and quality audits.
Definition
Human evaluators label similar examples differently.
Solution
Improve guidelines, reviewer training, and agreement checks.
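As a quick agreement check, two annotators' labels on the same items can be compared with Cohen's kappa; a minimal sketch using scikit-learn, with placeholder labels:

```python
# Minimal agreement check between two annotators on the same items.
# The labels below are illustrative placeholders, not real review data.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["helpful", "unhelpful", "helpful", "helpful", "unhelpful"]
annotator_b = ["helpful", "helpful",   "helpful", "helpful", "unhelpful"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below ~0.6 suggest unclear guidelines
```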
Definition
Reviewers become inconsistent over time.
Solution
Run calibration sessions and include benchmark examples.
Definition
The test set or metric favors certain outputs, users, or cases.
Solution
Audit evaluation data across scenarios, languages, and user groups.
Definition
The model improves the metric without improving real quality.
Solution
Use multiple metrics, human review, and outcome-based evaluation.
Definition
The metric does not match actual user or business value.
Solution
Choose metrics that reflect usefulness, safety, reliability, and business impact.
Definition
The model incorrectly predicts something is present.
Solution
Tune thresholds, add hard negative examples, and review precision errors.
Definition
The model fails to detect something that is present.
Solution
Improve recall with more positive examples, threshold tuning, and feature improvements.
Definition
Too many predicted positives are wrong.
Solution
Raise thresholds, improve features, or add stronger negative examples.
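One way to act on this is to raise the decision threshold until a target precision is met; a minimal sketch using scikit-learn's precision_recall_curve on illustrative scores (the 0.8 target is an assumed product requirement):

```python
# Pick the lowest threshold that reaches a target precision (illustrative data).
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true   = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.7, 0.2, 0.85])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
target_precision = 0.8  # assumed product requirement

# thresholds has one fewer element than precision/recall.
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    if p >= target_precision:
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
        break
```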
Definition
Too many real positives are missed.
Solution
Lower thresholds, add positive data, and improve retrieval or feature coverage.
Definition
A single F1 score hides important precision-recall tradeoffs.
Solution
Review precision and recall separately by class and use case.
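A per-class report makes the tradeoff visible; a small sketch with scikit-learn and toy labels:

```python
# Per-class precision and recall instead of a single aggregate F1 (toy labels).
from sklearn.metrics import classification_report

y_true = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham",  "ham", "ham", "spam", "spam", "spam"]

print(classification_report(y_true, y_pred, digits=2))
```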
Definition
Confidence scores do not reflect actual correctness.
Solution
Use calibration methods and evaluate confidence reliability.
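A reliability check can compare predicted confidence to observed accuracy per bin; a minimal sketch with scikit-learn's calibration_curve and Brier score on synthetic data:

```python
# Check whether predicted confidences match observed accuracy (illustrative data).
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.3, 0.8, 0.7, 0.4, 0.2, 0.6, 0.5,
                   0.95, 0.1, 0.85, 0.75, 0.35, 0.25, 0.65, 0.45])

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=4)
print("mean predicted prob per bin:", mean_predicted)
print("observed fraction positive: ", frac_positive)
print("Brier score:", brier_score_loss(y_true, y_prob))
```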
Definition
Offline evaluation results do not match production behavior.
Solution
Use production-like evaluation data and online testing.
Definition
A new model version performs worse than the previous version.
Solution
Use regression tests, release gates, and rollback plans.
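A release gate can be as simple as refusing to ship a candidate that regresses beyond an agreed margin; the metric values and the 1% margin below are hypothetical:

```python
# A minimal release gate: block deployment if the candidate model is worse
# than the current model by more than an allowed margin.
def release_gate(current_score: float, candidate_score: float,
                 max_regression: float = 0.01) -> bool:
    """Return True if the candidate may ship."""
    return candidate_score >= current_score - max_regression

current_f1, candidate_f1 = 0.84, 0.81  # hypothetical evaluation results
if not release_gate(current_f1, candidate_f1):
    raise SystemExit("Release blocked: candidate regresses beyond the allowed margin.")
```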
Definition
Experiment groups affect each other or are not separated correctly.
Solution
Use clean randomization, isolation, and experiment monitoring.
Definition
The test does not include enough examples to support a conclusion.
Solution
Increase sample size and check statistical power.
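Statistical power can be checked before the experiment; a sketch using statsmodels to estimate the per-group sample size needed to detect a lift from a 10% to an 11% success rate (the rates, alpha, and power are illustrative):

```python
# Estimate the per-group sample size for an A/B test on success rates.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.10, 0.11)   # baseline vs. hoped-for rate
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                           power=0.8, ratio=1.0)
print(f"~{int(round(n_per_group))} samples per group")
```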
Definition
Benchmark or test data appears in the model's training data.
Solution
Use private, fresh, or carefully controlled evaluation sets.
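A basic leakage check looks for verbatim overlap between training and evaluation text; a minimal sketch using hashing after light normalization (it only catches exact duplicates, and the example strings are made up):

```python
# Detect exact overlap between training text and evaluation text.
import hashlib

def fingerprint(text: str) -> str:
    # Lowercase and collapse whitespace before hashing.
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

train_texts = ["The cat sat on the mat.", "Paris is the capital of France."]
eval_texts  = ["Paris is the capital of France.", "Water boils at 100 C."]

train_hashes = {fingerprint(t) for t in train_texts}
leaked = [t for t in eval_texts if fingerprint(t) in train_hashes]
print(f"{len(leaked)} of {len(eval_texts)} eval examples appear verbatim in training data")
```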
Definition
An LLM evaluator favors certain writing styles, lengths, or model outputs.
Solution
Calibrate judges, use multiple judges, and validate with humans.
Definition
Evaluation checks final answers but ignores retrieval quality or citation accuracy.
Solution
Measure retrieval, grounding, answer correctness, and citation support separately.
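Retrieval can be scored on its own, independent of how the final answer is graded; a minimal recall@k sketch with placeholder document ids:

```python
# Score retrieval separately: recall@k over gold document ids.
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    hits = gold_ids & set(retrieved_ids[:k])
    return len(hits) / len(gold_ids) if gold_ids else 0.0

examples = [
    {"retrieved": ["d3", "d7", "d1"], "gold": {"d1", "d9"}},
    {"retrieved": ["d2", "d4", "d5"], "gold": {"d4"}},
]
scores = [recall_at_k(e["retrieved"], e["gold"], k=3) for e in examples]
print(f"mean recall@3 = {sum(scores) / len(scores):.2f}")
```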
Definition
The system is not tested against harmful, adversarial, or policy-sensitive cases.
Solution
Use red-team prompts, adversarial tests, and safety rubrics.
Definition
Rare but important cases are missing from evaluation.
Solution
Add edge-case suites and scenario-based tests.
Definition
The system is not tested across demographic or user groups.
Solution
Measure subgroup performance and fairness metrics.
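Breaking a metric down by group is often enough to surface gaps; a small pandas sketch with toy records and illustrative group labels:

```python
# Break overall accuracy down by user group instead of reporting one global mean.
import pandas as pd

df = pd.DataFrame({
    "group":   ["en", "en", "en", "es", "es", "es"],
    "correct": [1,    1,    0,    1,    0,    0],
})
print(df.groupby("group")["correct"].mean())  # per-group accuracy
```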
Definition
The system is not tested against noisy, adversarial, or unusual inputs.
Solution
Add perturbation tests, adversarial inputs, and stress testing.
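A perturbation test re-runs the model on lightly corrupted inputs and counts prediction flips; a minimal sketch where classify is a hypothetical stand-in for the model under test:

```python
# Check prediction stability under simple input perturbations.
import random

random.seed(0)

def perturb(text: str) -> str:
    i = random.randrange(len(text))
    return text[:i] + text[i + 1:]          # drop one character, a crude typo

def classify(text: str) -> str:             # placeholder model
    return "positive" if "good" in text.lower() else "negative"

text = "This product is good value"
baseline = classify(text)
flips = sum(classify(perturb(text)) != baseline for _ in range(100))
print(f"{flips} of 100 perturbed inputs changed the prediction")
```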
Definition
The team cannot clearly explain why the model made a decision.
Solution
Use interpretable features, explanations, decision records, and audit trails.
Definition
A model release breaks or behaves incorrectly in production.
Solution
Use staging tests, canary releases, release gates, and rollback plans.
Definition
The team cannot track which model version is running.
Solution
Use a model registry, version tags, and deployment metadata.
Definition
The team cannot reproduce which data was used for training.
Solution
Version datasets, snapshots, transformations, and training data references.
Definition
Feature definitions are not tracked across training and serving.
Solution
Version feature definitions and connect them to model artifacts.
Definition
Training logic differs from production inference logic.
Solution
Share preprocessing and feature logic between training and serving.
Definition
Development, staging, and production environments behave differently.
Solution
Use containers, pinned dependencies, and infrastructure-as-code.
Definition
Library or package versions break the ML system.
Solution
Pin dependencies and test environments before release.
Definition
The model does not run correctly inside its deployment container.
Solution
Test containers with production-like inputs before deployment.
Definition
Automated testing, build, or deployment pipelines break.
Solution
Add pipeline tests, clear release gates, and rollback procedures.
Definition
The team cannot safely return to a previous working version.
Solution
Keep versioned artifacts and automate rollback workflows.
Definition
A small rollout does not detect issues before full deployment.
Solution
Use better canary metrics, traffic segmentation, and quality monitoring.
Definition
A shadow model is tested incorrectly or not monitored properly.
Solution
Compare shadow predictions against real outcomes and baseline models.
Definition
Models are not approved, tagged, or stored correctly.
Solution
Use registry governance, approval workflows, and artifact validation.
Definition
Model files, tokenizer files, or configuration files are damaged or mismatched.
Solution
Use checksums, artifact validation, and compatibility tests.
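A checksum check before loading catches damaged or mismatched files; a minimal sketch with a hypothetical artifact path and a placeholder expected digest:

```python
# Verify a model artifact against a recorded SHA-256 checksum before loading it.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

artifact = Path("model.bin")                  # hypothetical artifact path
expected = "replace-with-recorded-checksum"   # stored alongside the artifact at publish time
if artifact.exists() and sha256_of(artifact) != expected:
    raise RuntimeError("Artifact checksum mismatch: refuse to load the model.")
```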
Definition
The production prediction service becomes unavailable.
Solution
Use health checks, autoscaling, failover, and incident runbooks.
Definition
The model endpoint input or output format changes unexpectedly.
Solution
Use API contracts, backward compatibility tests, and schema validation.
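Schema validation can reject malformed payloads before they reach the model; a minimal sketch using pydantic with illustrative field names:

```python
# Validate request payloads against an explicit contract before inference.
from typing import List
from pydantic import BaseModel, ValidationError

class PredictRequest(BaseModel):
    user_id: str
    features: List[float]

payload = {"user_id": "u-123", "features": ["not-a-number"]}
try:
    PredictRequest(**payload)
except ValidationError as exc:
    print("Rejected malformed request:", exc.errors()[0]["loc"])
```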
Definition
Infrastructure cannot scale fast enough for traffic.
Solution
Use load testing, scaling policies, queueing, and capacity planning.
Definition
The first request is slow because infrastructure or the model is not warmed up.
Solution
Use warm pools, caching, optimized model loading, and pre-warming.
Definition
GPU memory, scheduling, or availability problems break inference or training.
Solution
Use resource limits, monitoring, optimized batch sizes, and fallback capacity.
Definition
Model usage becomes too expensive because of traffic, tokens, compute, or inefficient design.
Solution
Use caching, model routing, prompt optimization, budgets, and usage limits.
Definition
Prompts or responses consume too many tokens.
Solution
Shorten prompts, filter retrieval context, set output limits, and summarize.
Definition
The model takes too long to generate a response.
Solution
Use smaller models, caching, batching, streaming, optimized serving, or model routing.
Definition
The system cannot handle the required request volume.
Solution
Use batching, autoscaling, queue management, and performance testing.
Definition
The system fails reliability, speed, or uptime requirements.
Solution
Use SLOs, error budgets, monitoring, and reliability engineering.
Definition
Models are shipped without proper review, approval, or documentation.
Solution
Use release checklists, approvals, model cards, and audit trails.
Definition
Production settings become different from approved settings.
Solution
Use config versioning, drift detection, and infrastructure-as-code.
Definition
Model performance degrades over time.
Solution
Monitor performance, collect labels, and retrain or refresh the model.
Definition
The relationship between inputs and labels changes.
Solution
Detect drift and retrain with newer labeled data.
Definition
Input feature distribution changes.
Solution
Track feature distributions and adapt data, features, or model strategy.
Definition
Output class distribution changes.
Solution
Monitor class distribution and recalibrate or retrain.
Definition
The distribution of model predictions changes.
Solution
Compare prediction trends against baselines and investigate anomalies.
Definition
Input features change in meaning or distribution.
Solution
Monitor feature statistics and trigger alerts for major shifts.
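A simple drift monitor compares a feature's live distribution to its training baseline; a sketch using a two-sample Kolmogorov-Smirnov test on synthetic data (the 0.01 alert level is arbitrary):

```python
# Compare a feature's live distribution to its training baseline.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_values     = rng.normal(loc=0.4, scale=1.0, size=5_000)   # shifted on purpose

result = ks_2samp(training_values, live_values)
if result.pvalue < 0.01:
    print(f"Drift alert: KS statistic={result.statistic:.3f}, p={result.pvalue:.2e}")
```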
Definition
Production data differs from training data.
Solution
Monitor data distributions and retrain when shifts affect quality.
Definition
The system produces bad results without visible errors.
Solution
Use quality checks, anomaly alerts, sampled human review, and outcome monitoring.
Definition
Important model, data, or system metrics are not tracked.
Solution
Monitor data, model quality, latency, cost, safety, and business outcomes together.
Definition
Too many alerts cause teams to ignore important signals.
Solution
Tune alert thresholds, deduplicate alerts, and prioritize severity.
Definition
A serious issue occurs without triggering an alert.
Solution
Add alerts for critical failure modes and test alert coverage.
Definition
Logs, metrics, traces, and examples are insufficient for debugging.
Solution
Use structured logging, distributed tracing, metrics, and request-level audit records.
Definition
The system does not store enough information to investigate failures.
Solution
Log inputs, outputs, model versions, data versions, errors, and decisions safely.
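One structured record per request is usually enough to trace a failure back to the exact model and data versions involved; a minimal sketch with illustrative field names and version tags:

```python
# Emit one structured record per request for later debugging and audits.
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_prediction(features: dict, prediction: str, latency_ms: float) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": "2024-06-01-a",     # assumed version tag
        "dataset_version": "v12",            # assumed training-data snapshot id
        "input": features,                   # redact sensitive fields before logging
        "output": prediction,
        "latency_ms": latency_ms,
    }
    logging.info(json.dumps(record))

log_prediction({"country": "DE", "items": 3}, prediction="approve", latency_ms=41.7)
```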
Definition
User feedback is not collected or connected to improvement.
Solution
Connect feedback to evaluation, labeling, retraining, and product decisions.
Definition
True labels arrive too late to monitor performance quickly.
Solution
Use proxy metrics and delayed performance tracking.
Definition
A new version makes inference slower.
Solution
Run performance tests before deployment and monitor latency after release.
Definition
The system consumes its allowed failure budget too quickly.
Solution
Pause risky releases and prioritize reliability fixes.
Definition
Dashboards show system health but miss model quality issues.
Solution
Add model quality, retrieval quality, safety, and user outcome dashboards.
Definition
The system does not detect stale input data.
Solution
Add freshness metrics and stale-data alerts.
Definition
The system monitors uptime but not answer or prediction quality.
Solution
Add quality sampling, human review, and automated evaluation.
Definition
Users change how they interact with the system, reducing performance.
Solution
Monitor usage patterns and update prompts, UX, or models.
Definition
Feedback comes mostly from certain user groups, creating misleading signals.
Solution
Analyze feedback coverage and balance feedback sources.