Data Validation (ML)
Automated checking of data quality, schema conformity, and statistical properties in ML pipelines.
Data validation automatically checks data quality and schema conformity in ML pipelines; Great Expectations and TensorFlow Data Validation (TFDV) are among the most widely used tools.
Explanation
Data validation in ML includes schema validation (column types, nullability), statistical tests (distribution changes, outliers), completeness checks, and referential integrity. Tools like Great Expectations and TensorFlow Data Validation (TFDV) automate these checks.
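The schema and completeness checks described above can be sketched in plain Python without any particular library; the column names, types, and sample rows here are purely illustrative, not from any real pipeline:

```python
# Hypothetical schema for an incoming batch: column -> (expected type, nullable?)
SCHEMA = {
    "user_id": (int, False),
    "channel": (str, False),
    "spend": (float, True),
}

def validate_rows(rows, schema):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, nullable) in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
            elif row[col] is None:
                # Completeness check: nulls only allowed where the schema permits them.
                if not nullable:
                    errors.append(f"row {i}: '{col}' is null but non-nullable")
            elif not isinstance(row[col], typ):
                # Type check: the value must match the declared column type.
                errors.append(
                    f"row {i}: '{col}' expected {typ.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return errors

batch = [
    {"user_id": 1, "channel": "email", "spend": 12.5},
    {"user_id": 2, "channel": None, "spend": None},
]
print(validate_rows(batch, SCHEMA))
# → ["row 1: 'channel' is null but non-nullable"]
```

Tools like Great Expectations and TFDV provide the same kind of checks declaratively, plus reporting, versioned schemas, and pipeline integration.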
Marketing Relevance
Data validation guards against one of the most common ML failure modes: bad data silently reaching production models.
Common Pitfalls
Checking only the schema, not statistical distributions; no integration with alerting; validating only at training time, not at serving time.
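A statistical check that addresses the first pitfall can be quite simple. The sketch below flags a batch whose mean drifts far from a training baseline; the threshold, feature values, and the scenario (a dollars-to-cents unit change) are illustrative assumptions, and a schema check alone would pass both batches:

```python
import statistics

def check_distribution(baseline, batch, max_z=3.0):
    """Pass if the batch mean lies within max_z baseline standard
    deviations of the baseline mean (a crude but serviceable check)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(batch) - mu) / sigma
    return z <= max_z

# Baseline computed from training data; batches arrive at serving time.
train_spend = [10.0, 12.0, 11.0, 13.0, 9.0, 12.5]
ok_batch = [11.0, 10.5, 12.0]
bad_batch = [110.0, 95.0, 120.0]  # e.g. a silent switch from dollars to cents

print(check_distribution(train_spend, ok_batch))   # → True
print(check_distribution(train_spend, bad_batch))  # → False
```

Running the same check at both training and serving time, and wiring its failures into alerting, addresses the other two pitfalls.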
Origin & History
Google released TensorFlow Data Validation (TFDV) in 2018 as part of TFX. Great Expectations started in 2018 as an open-source project for expectation-based data validation. Both tools formalized data validation as an MLOps discipline.
Comparisons & Differences
Data Validation (ML) vs. Data Quality
Data quality is the concept; data validation is the automated checking with concrete tests and assertions.
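To make "concrete tests and assertions" tangible, here is a minimal expectation-style check written from scratch; it mirrors the shape of expectation-based tools like Great Expectations without reproducing any real API, and the channel values are hypothetical:

```python
def expect_values_in_set(values, allowed):
    """Expectation-style assertion: every value must come from an allowed set.
    Returns a result dict instead of raising, so failures can be reported."""
    unexpected = sorted({v for v in values if v not in allowed})
    return {"success": not unexpected, "unexpected_values": unexpected}

channels = ["email", "search", "social", "fax"]
result = expect_values_in_set(channels, {"email", "search", "social"})
print(result)
# → {'success': False, 'unexpected_values': ['fax']}
```

The general data-quality goal ("channel values should be valid") becomes an executable test that passes or fails on every pipeline run.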
Data Validation (ML) vs. Data Drift
Data drift detects distribution changes over time; data validation checks data against defined expectations at each pipeline run.