Apache Airflow: A One-Stop Guide for Junior Developers
Apache Airflow is a powerful open-source tool for orchestrating jobs and managing data workflows. This guide covers everything you need—history, features, and a practical example—all explained simply.
History and Evolution
Workflow orchestration evolved over time. Here’s the journey:
- Non/Quasi-Programmable Tools (e.g., Informatica, Talend):
Early tools like Informatica and Talend offered graphical interfaces for ETL workflows. While powerful for simple tasks, they weren’t fully programmable, which limited flexibility, dependency management, and version control.
- cron and Task Scheduler:
Basic scheduling tools like cron (Linux) and Task Scheduler (Windows) ran jobs at fixed times but couldn’t handle dependencies between jobs or track job status.
- Celery:
A step up, Celery provided a distributed task queue with workers, but building full workflows on top of it required custom logic.
- Apache Airflow (2014):
Created at Airbnb in 2014 and open-sourced in 2015, Airflow introduced code-defined workflows with dependency management. It entered the Apache Incubator in 2016 and became a top-level Apache project in 2019.
What is Airflow?
Airflow lets you programmatically define, schedule, and monitor workflows using Python. Workflows are represented as Directed Acyclic Graphs (DAGs)—tasks with a defined order and no loops.
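To make "a defined order and no loops" concrete, here is a minimal sketch that uses only the Python standard library (graphlib, Python 3.9+), not Airflow itself. The task names and the helper functions run_order and has_cycle are made up for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_order(deps):
    """Return one valid execution order, upstream tasks first."""
    return list(TopologicalSorter(deps).static_order())

def has_cycle(deps):
    """Return True if the dependencies contain a loop (so they are not a DAG)."""
    try:
        run_order(deps)
        return False
    except CycleError:
        return True

order = run_order(dependencies)
print(order)

# Making "extract" depend on "notify" closes a loop, which a DAG forbids.
looped = {**dependencies, "extract": {"notify"}}
print(has_cycle(looped))
```

Airflow performs a comparable check when it loads a DAG file: a workflow whose dependencies loop back on themselves is rejected rather than scheduled.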
Architecture
Airflow is made up of the following components:
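- Web Server: Hosts the UI for monitoring and managing DAGs.
- Scheduler: Parses DAGs and decides when each task is ready to run, based on schedules and dependencies.
- Executor: Decides how and where ready tasks are run (locally or on remote workers).
- Workers: Run the tasks.
- Metadata Database: Stores DAG runs, task states, and other metadata. Task logs are usually written to files or remote storage rather than to this database.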
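The way these components hand work to each other can be sketched as a toy loop in plain Python. This is only an illustration of the division of labor, not Airflow’s real API: the dictionary metadata_db, the two-task dag, and the functions worker, executor, and scheduler are all hypothetical stand-ins.

```python
# Toy stand-ins for Airflow's components; none of this is Airflow's real API.
metadata_db = {}  # task_id -> state, playing the role of the metadata database

# Hypothetical DAG: task_id -> (callable, list of upstream task_ids)
dag = {
    "print_hello": (lambda: "Hello", []),
    "print_world": (lambda: "World!", ["print_hello"]),
}

def worker(task_callable):
    """Worker: actually runs the task's code."""
    return task_callable()

def executor(task_id):
    """Executor: hands a task to a worker and records the outcome."""
    task_callable, _ = dag[task_id]
    metadata_db[task_id] = "running"
    try:
        result = worker(task_callable)
        metadata_db[task_id] = "success"
        return result
    except Exception:
        metadata_db[task_id] = "failed"
        return None

def scheduler():
    """Scheduler: repeatedly picks tasks whose upstream tasks all succeeded."""
    results = []
    progressed = True
    while progressed:
        progressed = False
        for task_id, (_, upstream) in dag.items():
            not_started = task_id not in metadata_db
            ready = all(metadata_db.get(u) == "success" for u in upstream)
            if not_started and ready:
                results.append(executor(task_id))
                progressed = True
    return results

results = scheduler()
print(results)
print(metadata_db)
```

In this picture the web server would simply read metadata_db to show which tasks are running, succeeded, or failed.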
Executor Modes
LocalExecutor
Runs tasks on the same machine as the scheduler—simple but not scalable.
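As a rough analogy in plain Python (this is not how Airflow implements LocalExecutor, which runs tasks as local processes), a fixed-size pool on one machine behaves similarly: tasks run in parallel up to a limit, but every task lands on the same host. The task ids and the limit of 2 are arbitrary values chosen for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    """Pretend task: report its id and the host it ran on."""
    return task_id, socket.gethostname()

# Hypothetical, mutually independent task ids.
task_ids = ["task_a", "task_b", "task_c", "task_d"]

# max_workers stands in for a parallelism limit; the value 2 is arbitrary.
with ThreadPoolExecutor(max_workers=2) as pool:
    finished = list(pool.map(run_task, task_ids))

hosts = {host for _, host in finished}
print([task_id for task_id, _ in finished])
print(len(hosts))  # every task ran on this one machine
```

Raising the limit helps only until the machine runs out of CPU or memory, which is the scalability ceiling mentioned above.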
CeleryExecutor
Distributes tasks across workers using Celery—scalable but needs a broker like Redis.
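The broker-and-workers pattern can be sketched with the standard library alone. This is a simplified model, not Celery or Airflow code: queue.Queue stands in for a broker such as Redis, threads stand in for worker machines, and the task names are invented.

```python
import queue
import threading

broker = queue.Queue()   # stands in for a broker such as Redis
results = {}             # stands in for a result backend
results_lock = threading.Lock()

def worker(name):
    """Worker loop: pull task messages from the broker until told to stop."""
    while True:
        message = broker.get()
        if message is None:          # sentinel: shut this worker down
            broker.task_done()
            break
        task_id, func, arg = message
        value = func(arg)
        with results_lock:
            results[task_id] = (name, value)
        broker.task_done()

# Hypothetical pool of three workers; in a real deployment these would be
# separate machines or containers, not threads.
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()

# The "scheduler" side only enqueues messages; it never runs the tasks itself.
for n in range(6):
    broker.put((f"square_{n}", lambda x: x * x, n))

# One stop signal per worker, queued behind the real tasks.
for _ in workers:
    broker.put(None)
for w in workers:
    w.join()

print({task_id: value for task_id, (_, value) in sorted(results.items())})
```

Because the sender and the workers share nothing but the queue, capacity grows by adding workers, and that is also why a broker becomes a required piece of infrastructure.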
Sample Python DAG Code
Here’s a Python DAG example with two tasks that print "Hello" and "World!" in order:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_hello():
    print("Hello")


def print_world():
    print("World!")


# Note: schedule_interval is the Airflow 2.x spelling; Airflow 2.4+ prefers
# the equivalent `schedule` argument.
with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',  # run once a day
    catchup=False,  # do not backfill runs between start_date and today
) as dag:
    task1 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello,
    )
    task2 = PythonOperator(
        task_id='print_world',
        python_callable=print_world,
    )

    task1 >> task2  # task1 runs before task2
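The >> in the last line is ordinary Python operator overloading; in Airflow it is shorthand for task1.set_downstream(task2). A minimal sketch of the idea, using a made-up ToyTask class rather than Airflow’s real operator classes:

```python
class ToyTask:
    """Minimal stand-in for an Airflow task; not Airflow's real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a ...
        self.downstream.append(other)
        # ... and returns b so that `a >> b >> c` chains left to right.
        return other

extract = ToyTask("extract")
transform = ToyTask("transform")
load = ToyTask("load")

extract >> transform >> load

print([t.task_id for t in extract.downstream])    # ['transform']
print([t.task_id for t in transform.downstream])  # ['load']
```

Returning the right-hand task is what lets longer pipelines be written as a single chained line.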