
    Data Leakage

    Also known as:
    Information Leakage
    Target Leakage
    Train-Test Contamination
    Updated: 2/10/2026

    A situation in which information from the test set or from the future leaks into training, producing unrealistically good evaluation results.

    Quick Summary

    Data leakage means that test data or future information enters training: the model looks excellent in evaluation but fails in production. It is avoidable through correct pipeline ordering.

    Explanation

    Data leakage produces models that appear perfect during evaluation but are worthless in production. Common causes include features computed from future information and preprocessing (such as scaling or imputation) fitted on the full dataset before the train/test split.
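The correct pipeline ordering can be sketched as follows. This is a minimal illustration assuming scikit-learn and NumPy; the toy data and variable names are assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (shapes and labels are illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# WRONG ordering: fitting the scaler on ALL rows lets test-set statistics
# (mean, std) leak into the training transform:
#   X_scaled = StandardScaler().fit_transform(X)   # leaky
#   X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y)

# Correct ordering: split first, then fit all preprocessing on the
# training fold only, ideally bundled into a single pipeline.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)  # the scaler sees only the training rows
print(round(model.score(X_te, y_te), 2))
```

Bundling preprocessing and model into one pipeline also keeps cross-validation leak-free, since each fold refits the scaler on its own training portion.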

    Marketing Relevance

    Data leakage is one of the most common and most expensive mistakes in ML projects, and it is often discovered only in production.

    Common Pitfalls

    Fitting normalization or scaling before the train/test split. Including the target variable (or a proxy for it) as a feature. Temporal leakage when time-series data is split randomly instead of chronologically.
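The temporal pitfall in particular has a standard remedy: split time-series data chronologically rather than at random. A minimal sketch assuming scikit-learn's `TimeSeriesSplit`; the ten-observation series is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten chronological observations (toy data for illustration)
X = np.arange(10).reshape(-1, 1)

# A random shuffle would train on the future and test on the past.
# TimeSeriesSplit keeps every training index strictly before every
# test index, so the model never "sees the future".
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx.tolist(), "->", test_idx.tolist())
```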

    Origin & History

    The problem was popularized through Kaggle competitions, where leakage often led to unrealistic leaderboard scores. Kaufman et al. (2012) formalized the concept in "Leakage in Data Mining: Formulation, Detection, and Avoidance".

    Comparisons & Differences

    Data Leakage vs. Overfitting

    Overfitting means the model learns noise in the training data; data leakage means it uses information it should not have. Overfitting shows up in validation scores, while leakage often surfaces only in production.

    Data Leakage vs. Feature Engineering

    Good feature engineering uses available information; data leakage uses information that wouldn't be available at prediction time.
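The difference can be made concrete with a target-leakage demo: a "feature" derived from the target yields a near-perfect validation score, yet that feature would not exist at prediction time. A sketch assuming scikit-learn; the data is synthetic and the names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
X_ok = rng.normal(size=(300, 3))       # legitimate features (no signal here)
X_leak = np.column_stack([X_ok, y])    # extra column derived from the target

ok = cross_val_score(LogisticRegression(), X_ok, y, cv=5).mean()
leaky = cross_val_score(LogisticRegression(), X_leak, y, cv=5).mean()
# The leaky score is near-perfect; the honest one hovers around chance.
print(round(ok, 2), round(leaky, 2))
```

A suspiciously perfect validation score is therefore a classic symptom worth auditing: check whether any feature could only be computed with knowledge of the outcome.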
