Class Imbalance
Situation where one class in the training dataset occurs significantly more frequently than others.
Class imbalance arises when one class heavily dominates the training data, and standard models then tend to ignore the rare classes. Common remedies include resampling (e.g., SMOTE), class weighting, and evaluating with F1 rather than accuracy.
Explanation
Models trained on imbalanced data tend to predict the majority class and ignore minority classes. Common countermeasures: resampling (over- or undersampling), class weighting, and synthetic minority oversampling (SMOTE).
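Class weighting can be sketched in a few lines of pure Python. The sketch below uses the common "balanced" heuristic (the same formula scikit-learn uses for class_weight='balanced'); the helper name balanced_class_weights is illustrative:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Compute per-class weights with the 'balanced' heuristic:
    weight_c = n_samples / (n_classes * n_c).
    Rare classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# 95 negatives vs. 5 positives: the minority class is weighted ~19x higher
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)  # class 1 gets weight 10.0, class 0 gets ~0.53
```

These weights would then be passed to the loss function or the model's sample-weight argument, so that each minority error counts as much as roughly nineteen majority errors.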
Marketing Relevance
Class imbalance is the norm in real datasets – fraud detection, disease diagnosis, churn prediction often have <1% positive cases.
Common Pitfalls
Accuracy is a misleading metric under imbalance: a model that always predicts the majority class can score above 99%. Oversampling before the train/test split causes data leakage, because duplicated minority samples end up in both splits; always split first, then resample only the training data.
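The accuracy pitfall is easy to demonstrate with a trivial majority-class baseline; the metric helpers below are a minimal hand-rolled sketch, not a library API:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """F1 score for the positive (minority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 99% negatives: a "model" that always predicts 0 looks great on accuracy
y_true = [0] * 99 + [1]
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # 0.99
print(f1(y_true, y_pred))        # 0.0 -- the single positive case is missed
```

The 99% accuracy hides the fact that the model never detects a single positive case, which is exactly why F1 (or precision/recall) is preferred under imbalance.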
Origin & History
The problem was formalized in the early 2000s by Japkowicz & Stephen. SMOTE (Chawla et al., 2002) was a milestone; modern approaches include focal loss (Lin et al., 2017) and cost-sensitive methods.
Comparisons & Differences
Class Imbalance vs. Data Augmentation
Data augmentation expands all classes evenly through transformations; class imbalance techniques specifically target the minority class.
Class Imbalance vs. Cost-Sensitive Learning
Resampling changes the data distribution itself; cost-sensitive learning instead modifies the loss function so that errors on the minority class are penalized more heavily.
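The cost-sensitive idea can be sketched as a weighted binary cross-entropy, where the data stays untouched and only the loss changes (the function name weighted_bce and the pos_weight parameter are illustrative; deep learning frameworks expose the same idea, e.g. a positive-class weight in their BCE losses):

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0):
    """Binary cross-entropy in which errors on the positive (minority)
    class are scaled by pos_weight; the data distribution is untouched."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        if t == 1:
            total += -pos_weight * math.log(p)       # positive-class term, upweighted
        else:
            total += -math.log(1 - p)                # negative-class term, unchanged
    return total / len(y_true)

# The same poor prediction on the one positive case costs 10x more
y_true = [0, 0, 1]
p_pred = [0.1, 0.2, 0.3]
loss_plain = weighted_bce(y_true, p_pred, pos_weight=1.0)
loss_weighted = weighted_bce(y_true, p_pred, pos_weight=10.0)
print(loss_plain, loss_weighted)
```

During training, the larger gradient from the upweighted positive term pushes the model to stop ignoring the minority class, which is the cost-sensitive counterpart to duplicating minority samples.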