Data Leakage
A situation in which information from the test set or from the future leaks into training, producing unrealistically good results.
Data leakage means that test data or future information enters training: the model looks perfect in evaluation but fails in production. It is avoidable through correct pipeline ordering.
Explanation
Data leakage produces models that look perfect in training but are worthless in production. Common causes include features computed from future information and preprocessing applied before the train/test split.
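The preprocessing-before-split cause can be made concrete with a minimal sketch (synthetic data, plain NumPy rather than any particular ML library): scaling statistics must come from the training split only, never from the full dataset.

```python
import numpy as np

# Hypothetical dataset: 10 samples, the last 3 held out as a test set.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=10)
train, test = X[:7], X[7:]

# WRONG: scaling statistics computed on the full dataset leak
# test-set information into the training features.
leaky_train = (train - X.mean()) / X.std()

# RIGHT: statistics come from the training split only; the same
# transform is then applied unchanged to the test split.
mu, sigma = train.mean(), train.std()
clean_train = (train - mu) / sigma
clean_test = (test - mu) / sigma

# The two versions of the training features differ; that difference
# is exactly the leaked test-set information.
print(np.allclose(leaky_train, clean_train))  # False in general
```

The same principle applies to any fitted preprocessing step (imputation, encoding, feature selection): fit on the training split, then only transform the test split.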
Marketing Relevance
Data leakage is one of the most common and most expensive mistakes in ML projects; it is often discovered only in production.
Common Pitfalls
Applying normalization or scaling before the split. Including the target variable (or a proxy of it) as a feature. Temporal leakage in time-series data, e.g. random splits that mix past and future observations.
Origin & History
The problem was popularized through Kaggle competitions, where leakage often led to unrealistic leaderboard scores. Kaufman et al. (2012) formalized the concept in "Leakage in Data Mining: Formulation, Detection, and Avoidance".
Comparisons & Differences
Data Leakage vs. Overfitting
Overfitting means the model learns noise in the training data; data leakage means it uses information it should not have. Overfitting shows up in validation; leakage often surfaces only in production.
Data Leakage vs. Feature Engineering
Good feature engineering uses information that is available at prediction time; data leakage uses information that would not be available at prediction time.
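The prediction-time criterion can be illustrated with a small sketch (invented churn-style example with a hypothetical purchase log): a feature is only valid if it could have been computed on the prediction date.

```python
from datetime import date

# Hypothetical purchase log for one customer.
purchases = [date(2023, 1, 5), date(2023, 2, 10), date(2023, 4, 1)]
prediction_date = date(2023, 3, 1)

# LEAKY: counts all purchases, including one that happens after the
# prediction date and therefore could not be known at that point.
leaky_count = len(purchases)

# VALID: uses only purchases known on the prediction date.
valid_count = sum(1 for p in purchases if p < prediction_date)

print(leaky_count, valid_count)  # 3 2
```

A useful rule of thumb: for every feature, ask "could I have computed this value at the moment the prediction is made?" If not, it is leakage, not feature engineering.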