DVC (Data Version Control)
Open-source tool for data and model versioning that extends Git workflows to ML artifacts.
On top of versioning, it provides pipeline tracking, experiment comparison, and cloud storage integration for ML projects.
Explanation
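DVC keeps large files (datasets, models) out of the Git repository: the file contents go into a cache and remote storage, while Git tracks only small metafiles (.dvc files) that record each file's hash. DVC also manages ML pipelines as directed acyclic graphs (DAGs) of stages and supports comparing experiments. Supported storage backends include Amazon S3, Google Cloud Storage, and Azure Blob Storage.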
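The idea of versioning data "separately from Git" can be illustrated with a minimal sketch of a content-addressed cache plus a pointer file. This is not DVC's implementation or API: the function names (track, restore), the ".pointer.json" suffix, and the cache layout are hypothetical and chosen only for illustration. The use of an MD5 content hash mirrors what DVC records in its metafiles.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper: the small pointer file is what would be
    committed to Git instead of the data itself.
    """
    checksum = file_md5(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    if not cached.exists():  # identical content is stored only once
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": checksum, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer refers to from the cache."""
    meta = json.loads(pointer.read_text())
    checksum = meta["md5"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / "cache"
        dataset = workspace / "train.csv"
        dataset.write_text("x,y\n1,2\n")
        pointer = track(dataset, cache)
        dataset.unlink()  # simulate a fresh clone that has only the pointer
        restore(pointer, cache)
        print(dataset.read_text())
```

Because the pointer holds only a hash and a file name, checking out an older Git commit yields an older pointer, and the matching data version can then be fetched from the cache or a remote.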
Marketing Relevance
DVC is one of the most widely used open-source tools for Git-based versioning of ML data and experiments.
Common Pitfalls
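Large datasets incur storage costs, because every tracked version of a file has to be kept somewhere. Data scientists without Git experience face a learning curve. Remote storage must be configured before data can be shared across machines or team members.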
Origin & History
Iterative.ai released DVC in 2017 as "Git for Data." CML (Continuous Machine Learning) was released in 2020 as a CI/CD companion. DVC Studio followed as a web UI. Today DVC has over 13,000 GitHub stars.
Comparisons & Differences
DVC (Data Version Control) vs. Git LFS
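Git LFS replaces large files in a Git repository with small pointers and stores the actual content on an LFS server; DVC takes a similar pointer-based approach but additionally offers ML pipelines, experiment tracking, and flexible storage backends.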
DVC (Data Version Control) vs. MLflow
DVC focuses on data versioning within a Git-based workflow; MLflow focuses on experiment tracking and a model registry.