Apache Airflow
Open-source platform for orchestrating complex data and ML workflows as DAGs (Directed Acyclic Graphs).
Apache Airflow orchestrates data and ML workflows as Python-defined DAGs with scheduling, monitoring, and cloud integration.
Explanation
Airflow defines workflows as Python code (DAGs) and provides scheduling, monitoring, retry logic, and a web UI. Operators connect tasks to cloud services, databases, and ML frameworks.
Marketing Relevance
Apache Airflow is the most widely used workflow orchestrator for data engineering and ML pipelines.
Common Pitfalls
Not suitable for real-time streaming. The scheduler can become a bottleneck with thousands of DAGs. The choice between the TaskFlow API and classic operators confuses newcomers.
Origin & History
Airbnb developed Airflow internally in 2014. It entered the Apache Incubator in 2016 and became a top-level Apache project in 2019. Airflow 2.0 (2020) introduced the TaskFlow API and a new scheduler. Managed services include Astronomer, Google Cloud Composer, and Amazon MWAA.
Comparisons & Differences
Apache Airflow vs. Kubeflow Pipelines
Kubeflow Pipelines is ML-specialized and runs on Kubernetes; Airflow is a general-purpose orchestrator for both data and ML workflows.
Apache Airflow vs. Prefect
Prefect offers a more modern, Python-native orchestration model; Airflow has the larger ecosystem and broader community support.