AI Failure Dictionary

ML & Deep Learning Training Failures

Terms and explanations for machine learning and deep learning training failures, from the AI Failure Dictionary.

76 terms in this chapter
01

Overfitting

Definition

The model memorizes training data and performs poorly on new data.

Solution

Use more data, regularization, simpler models, dropout, cross-validation, or early stopping.
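
Early stopping is often the cheapest of these to try. A minimal sketch using scikit-learn's built-in validation-based stopping (parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# SGDClassifier holds out 10% of the training data and stops once the
# validation score fails to improve for n_iter_no_change epochs.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, max_iter=1000, random_state=0)
clf.fit(X, y)
print("stopped after", clf.n_iter_, "epochs")
```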

02

Underfitting

Definition

The model is too simple to learn useful patterns.

Solution

Use better features, a more expressive model, longer training, or less regularization.

03

Poor Generalization

Definition

The model works in development but fails on real-world examples.

Solution

Use realistic validation sets, production-like test data, and edge-case coverage.

04

Low Generalization

Definition

The model does not transfer well to unseen examples.

Solution

Improve data diversity and evaluate on data that better represents deployment conditions.

05

Data Leakage During Training

Definition

The model learns from information that will not be available at prediction time.

Solution

Audit features, joins, target timing, and train-test boundaries.

06

Train-Test Contamination

Definition

The same or very similar examples appear in both training and test sets.

Solution

Use duplicate detection, grouped splitting, and leakage checks.
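
A short sketch of grouped splitting with scikit-learn, on toy data: all rows sharing a group id (here, a user) land on the same side of the split, so near-duplicate examples from one user cannot leak from train into test.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 rows belonging to 4 users.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
user_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))
print("train users:", set(user_ids[train_idx]))
print("test users: ", set(user_ids[test_idx]))  # disjoint from train users
```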

07

Test Set Leakage

Definition

Evaluation examples accidentally influence training.

Solution

Keep test sets locked, private, and separate from training decisions.

08

Bad Split Strategy

Definition

Training, validation, and test sets are divided incorrectly.

Solution

Use stratified, grouped, time-based, or user-based splits depending on the problem.

09

Temporal Split Error

Definition

Time-based data is split randomly, causing future information to leak.

Solution

Train on the past and validate on future time windows.
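
A minimal example of time-aware validation with scikit-learn's TimeSeriesSplit, assuming rows are already sorted by time: each fold trains on the past and validates on a later window, so future information never leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(10, 2)  # rows assumed sorted by time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices.
    print("train:", train_idx, "-> validate:", val_idx)
```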

10

Class Imbalance Failure

Definition

The model ignores minority classes because majority classes dominate training.

Solution

Use class weights, resampling, threshold tuning, and minority-class data collection.
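
Class weights are often the one-line starting point. A sketch on a synthetic 95/5 imbalanced problem, where class_weight="balanced" upweights the minority class instead of letting the majority dominate the loss:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```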

11

Hyperparameter Instability

Definition

Small hyperparameter changes cause large performance differences.

Solution

Use systematic search, robust validation, and multiple random seeds.

12

Poor Regularization

Definition

The model is not constrained enough, increasing overfitting risk.

Solution

Use L1/L2 regularization, dropout, pruning, simpler models, or early stopping.

13

Optimization Failure

Definition

The learning algorithm fails to find a good solution.

Solution

Tune learning rate, optimizer, loss function, initialization, and preprocessing.

14

Local Minimum

Definition

Training gets stuck in a poor solution.

Solution

Use better initialization, adaptive optimizers, learning-rate schedules, or restarts.

15

High Variance

Definition

Model performance changes significantly across datasets or runs.

Solution

Use more data, cross-validation, ensembling, and regularization.

16

High Bias

Definition

The model is too limited and consistently misses important patterns.

Solution

Use richer features, a more expressive model, or reduce excessive regularization.

17

Poor Calibration

Definition

Predicted probabilities do not match real-world likelihoods.

Solution

Use calibration methods such as Platt scaling, isotonic regression, or temperature scaling.
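
A minimal sketch with scikit-learn's CalibratedClassifierCV, which fits a calibration map on held-out folds; method="isotonic" gives isotonic regression, while method="sigmoid" gives Platt scaling:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap an uncalibrated model; calibration is fit on cross-validation folds.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=3)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]  # calibrated probabilities
```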

18

Random Seed Instability

Definition

Results change significantly when the random seed changes.

Solution

Run multiple seeds and report mean, variance, and confidence intervals.
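
A small sketch of the practice: retrain with several seeds and report the spread rather than a single lucky run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
scores = []
for seed in range(5):  # vary both the split and the model seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print(f"accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```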

19

Reproducibility Failure

Definition

The team cannot reproduce training results.

Solution

Version code, data, features, configs, environments, and random seeds.

20

Experiment Tracking Failure

Definition

Training parameters, datasets, metrics, and artifacts are not recorded properly.

Solution

Use experiment tracking tools such as MLflow, Weights & Biases, or a model registry.

21

Baseline Failure

Definition

The team does not compare the model against a simple baseline.

Solution

Build a simple baseline first and require new models to beat it meaningfully.

22

Accuracy Paradox

Definition

A model has high accuracy but performs poorly on the important class.

Solution

Use precision, recall, F1, ROC-AUC, PR-AUC, and class-specific metrics.
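
The paradox is easy to demonstrate: on a 95/5 class split, a model that always predicts the majority class scores 95% accuracy while catching none of the class that matters.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5   # 95 negatives, 5 positives
y_pred = [0] * 100            # model always predicts "negative"

print("accuracy:", accuracy_score(y_true, y_pred))                  # 0.95
print("recall:  ", recall_score(y_true, y_pred, zero_division=0))   # 0.0
print("f1:      ", f1_score(y_true, y_pred, zero_division=0))       # 0.0
```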

23

Spurious Correlation

Definition

The model learns a pattern that correlates with the outcome but does not cause it.

Solution

Run stress tests, causal review, subgroup analysis, and out-of-distribution evaluation.

24

Distribution Shift

Definition

Production data differs from training data.

Solution

Monitor data distributions and retrain or adapt when shifts are detected.
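
One simple monitoring sketch: compare a feature's training distribution against a recent production sample with a two-sample Kolmogorov-Smirnov test (thresholds here are illustrative; the "drifted" data is simulated).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
prod_feature = rng.normal(0.4, 1.0, size=5000)  # simulated drifted data

# A small p-value flags a distribution shift worth investigating.
stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible drift: KS={stat:.3f}, p={p_value:.2e}")
```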

25

Covariate Shift

Definition

Input feature distribution changes between training and production.

Solution

Track feature distributions and update data, features, or models as needed.

26

Label Shift

Definition

The distribution of output classes changes over time.

Solution

Monitor class distribution and recalibrate or retrain.

27

Concept Drift

Definition

The relationship between inputs and outputs changes over time.

Solution

Detect drift and retrain with newer labeled data.

28

Model Decay

Definition

Model quality degrades over time as the environment changes.

Solution

Monitor performance and schedule retraining or model refreshes.

29

Out-of-Distribution Input

Definition

The model receives data very different from what it saw during training.

Solution

Detect OOD inputs, reject uncertain cases, or route them to human review.

30

Weak Signal

Definition

The input data does not contain enough predictive information.

Solution

Improve features, collect stronger signals, or reconsider whether the task is learnable.

31

Vanishing Gradient

Definition

Gradients become too small, so early layers stop learning.

Solution

Use better activations, normalization, residual connections, and architecture changes.

32

Exploding Gradient

Definition

Gradients become too large, causing unstable training.

Solution

Use gradient clipping, normalization, lower learning rates, and stable initialization.
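
Gradient clipping is a one-line fix in most frameworks. A minimal PyTorch training step showing where the clip goes (between backward and step):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global norm never exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```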

33

Dead Neuron

Definition

A neural unit stops activating and contributes little or nothing.

Solution

Adjust initialization, learning rate, architecture, or activation function.

34

Dead ReLU

Definition

A ReLU neuron outputs zero for most or all inputs.

Solution

Use Leaky ReLU, GELU, better initialization, or a lower learning rate.
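
Swapping the activation is usually the smallest change. A PyTorch sketch: Leaky ReLU keeps a small gradient (slope 0.01 here) for negative inputs, so units that drift negative can still recover, unlike plain ReLU, which is zero and gradient-free below zero.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.LeakyReLU(negative_slope=0.01),  # or nn.GELU()
    nn.Linear(128, 10),
)
```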

35

Saturation

Definition

Activation functions enter flat regions where gradients are very small.

Solution

Use modern activations, normalization, and better initialization.

36

Poor Weight Initialization

Definition

Initial weights make training slow, unstable, or ineffective.

Solution

Use initialization methods such as Xavier, He initialization, or pretrained weights.
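
A short PyTorch sketch applying He (Kaiming) initialization, which is suited to ReLU-family activations:

```python
import torch.nn as nn

def init_weights(module):
    # He initialization for linear layers; biases start at zero.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)  # runs init_weights on every submodule
```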

37

Learning Rate Too High

Definition

The loss oscillates or diverges, and training fails to converge.

Solution

Lower the learning rate or use a learning-rate scheduler.
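
A PyTorch scheduler sketch: ReduceLROnPlateau halves the learning rate after a few epochs without validation improvement. The loss values here are a stand-in for a real validation metric.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(20):
    val_loss = max(0.2, 1.0 - 0.1 * epoch)  # stand-in: plateaus at 0.2
    scheduler.step(val_loss)                # scheduler watches the metric
    print(epoch, optimizer.param_groups[0]["lr"])
```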

38

Learning Rate Too Low

Definition

Training is extremely slow or gets stuck.

Solution

Increase the learning rate or use adaptive optimizers.

39

Batch Size Instability

Definition

An ill-chosen batch size causes noisy gradients or poor generalization.

Solution

Tune batch size, use gradient accumulation, and scale learning rates carefully.

40

Mode Collapse

Definition

A generative model produces limited or repetitive outputs.

Solution

Use better objectives, diversity penalties, training stabilization, and diversity-aware evaluation.

41

Catastrophic Forgetting

Definition

A model forgets old knowledge when trained on new data.

Solution

Use replay data, regularization, frozen layers, or parameter-efficient fine-tuning.
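
A minimal replay sketch, one of several possible strategies: mix a fixed fraction of old training examples into each new batch so the model keeps seeing the original distribution while adapting. The data and batching here are illustrative.

```python
import random

def mixed_batches(new_data, old_data, batch_size=32, replay_frac=0.25):
    """Yield batches that mix fresh examples with replayed old ones."""
    n_old = int(batch_size * replay_frac)
    n_new = batch_size - n_old
    for i in range(0, len(new_data), n_new):
        batch = new_data[i:i + n_new] + random.sample(old_data, n_old)
        random.shuffle(batch)  # avoid a fixed old/new ordering per batch
        yield batch

# Toy usage: 25% of every batch comes from the original training set.
old = [("old", i) for i in range(100)]
new = [("new", i) for i in range(100)]
for batch in mixed_batches(new, old, batch_size=8):
    pass  # feed `batch` to the usual training step
```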

42

Representation Collapse

Definition

Learned representations become too similar and lose useful distinctions.

Solution

Use contrastive objectives, normalization, better negatives, and representation diagnostics.

43

Embedding Collapse

Definition

Embeddings lose semantic diversity and become less useful.

Solution

Improve training objectives, data diversity, and embedding evaluation.

44

Gradient Noise

Definition

Training updates are unstable because gradients are too noisy.

Solution

Tune batch size, optimizer, gradient accumulation, and learning rate.

45

Loss Plateau

Definition

The loss stops improving before reaching good performance.

Solution

Adjust learning rate, architecture, data quality, optimizer, or schedule.

46

Convergence Failure

Definition

The model never reaches a stable or useful solution.

Solution

Debug data, labels, loss function, optimizer, architecture, and preprocessing.

47

Attention Collapse

Definition

Attention concentrates on irrelevant parts of the input, or on too narrow a slice of it.

Solution

Use better data, architecture tuning, attention diagnostics, and regularization.

48

Poor Transfer Learning

Definition

A pretrained model does not adapt well to the target task.

Solution

Use domain data, careful fine-tuning, and task-specific validation.

49

Fine-Tuning Instability

Definition

Fine-tuning damages useful pretrained behavior.

Solution

Use lower learning rates, LoRA or PEFT, frozen layers, and validation gates.
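
A freeze-then-train sketch in PyTorch: keep the pretrained backbone fixed and train only a new task head at a conservative learning rate. The two-layer "backbone" here is a stand-in for a real pretrained model.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU())  # stand-in
head = nn.Linear(128, 5)

for param in backbone.parameters():
    param.requires_grad = False          # pretrained weights stay intact

model = nn.Sequential(backbone, head)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # low LR for stability
```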

50

Catastrophic Interference

Definition

New learning disrupts previously learned patterns.

Solution

Use continual-learning strategies and mixed old/new training data.

51

Capacity Bottleneck

Definition

The model is too small or constrained for the task.

Solution

Increase model capacity, improve architecture, or simplify the task.

52

Overparameterization Risk

Definition

The model has more parameters than needed, increasing cost and overfitting risk.

Solution

Use smaller models, regularization, pruning, distillation, or model selection.

53

Training Instability

Definition

Loss, gradients, or metrics fluctuate unpredictably.

Solution

Inspect data, reduce learning rate, stabilize optimization, and monitor gradients.

54

Detection Failure

Definition

The model fails to identify an object that is present.

Solution

Add more labeled examples, tune thresholds, and improve detection architecture.

55

False Detection

Definition

The model detects something that is not actually there.

Solution

Add hard negative examples and tune confidence thresholds.

56

Misclassification

Definition

The image is assigned to the wrong class.

Solution

Improve labels, augmentation, class balance, and confusion-matrix analysis.

57

Localization Error

Definition

The object is detected, but the bounding box is inaccurate.

Solution

Improve annotations and use localization-focused loss functions.

58

Segmentation Failure

Definition

The model fails to correctly separate object regions.

Solution

Improve masks, add training examples, and evaluate segmentation metrics.

59

Occlusion Failure

Definition

The model fails when objects are partially hidden.

Solution

Use occlusion augmentation and realistic training data.

60

Scale Sensitivity

Definition

The model fails when objects are too small, too large, or at unusual distances.

Solution

Use multi-scale training and feature pyramid networks.

61

Rotation Sensitivity

Definition

The model fails when objects appear at different angles.

Solution

Use rotation augmentation and rotation-invariant architectures when needed.

62

Lighting Sensitivity

Definition

Shadows, glare, darkness, or brightness reduce performance.

Solution

Use lighting augmentation and collect diverse image data.

63

Blur Sensitivity

Definition

Motion blur or low image quality causes mistakes.

Solution

Use blur augmentation and image quality checks.

64

Background Bias

Definition

The model relies on background patterns instead of the main object.

Solution

Train with diverse backgrounds and object-focused augmentation.

65

Texture Bias

Definition

The model, typically a CNN, relies on texture more than shape.

Solution

Use shape-focused augmentation, diverse data, and robustness testing.

66

Domain Shift in Vision

Definition

Production images differ from training images in camera, lighting, angle, or environment.

Solution

Collect production samples and use domain adaptation.

67

Adversarial Patch Attack

Definition

A visual pattern tricks the model into a wrong prediction.

Solution

Use adversarial testing, robust training, and input monitoring.

68

Small Object Failure

Definition

The model misses tiny objects in the image.

Solution

Use higher-resolution inputs, tiling, and small-object-focused training.

69

Class Confusion

Definition

Visually similar classes are repeatedly confused.

Solution

Collect more examples, improve labels, and analyze the confusion matrix.

70

Data Augmentation Failure

Definition

Augmentation creates unrealistic or harmful training examples.

Solution

Review augmentations against real-world conditions and remove harmful transforms.

71

Annotation Box Error

Definition

Bounding boxes or masks in training data are incorrect.

Solution

Audit labels and improve reviewer quality control.

72

Image Preprocessing Skew

Definition

Image resizing, cropping, or normalization differs between training and inference.

Solution

Share the same preprocessing pipeline across training and production.
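
A torchvision sketch of the idea: define preprocessing once, in one module, and import it from both the training pipeline and the inference service so resize and normalize settings can never silently diverge. The statistics shown are the common ImageNet values; substitute your own.

```python
from torchvision import transforms

PREPROCESS = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```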

73

OCR Failure

Definition

Text in images is read incorrectly.

Solution

Improve image quality, tune OCR models, and validate extracted text.

74

Pose Variation Failure

Definition

The model fails when objects or people appear in unusual positions.

Solution

Add pose-diverse data and augmentation.

75

Camera Quality Shift

Definition

A model trained on high-quality images fails on low-quality production images.

Solution

Train and evaluate with production-quality images.

76

Frame Sampling Failure

Definition

A video model misses important moments because frames are sampled poorly.

Solution

Tune sampling strategy and evaluate temporal coverage.
