Saturday, March 22, 2025

Apache Airflow notes

Apache Airflow: A One-Stop Guide for Junior Developers

Apache Airflow is a powerful open-source tool for orchestrating jobs and managing data workflows. This guide covers the essentials: history, architecture, and a practical example, all explained simply.

History and Evolution

Workflow orchestration evolved over time. Here’s the journey:

  • Non/Quasi-Programmable Tools (e.g., Informatica, Talend):
    Early tools like Informatica and Talend offered graphical interfaces for ETL workflows. While powerful for simple tasks, they weren’t fully programmable, limiting flexibility, dependency management, and version control.
  • cron and Task Scheduler:
    Basic scheduling tools like cron (Linux) and Task Scheduler (Windows) ran jobs at fixed times but couldn't handle dependencies or track job status.
  • Celery:
    A step up, Celery provided a task queue with workers but required custom logic for workflows.
  • Apache Airflow (2014):
    Created at Airbnb in 2014 and open-sourced in 2015, Airflow introduced code-defined workflows with dependency management, becoming an Apache project in 2016.

What is Airflow?

Airflow lets you programmatically define, schedule, and monitor workflows using Python. Workflows are represented as Directed Acyclic Graphs (DAGs)—tasks with a defined order and no loops.
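To make the DAG idea concrete, here is a minimal pure-Python sketch (using the standard library's `graphlib`, not Airflow) of tasks with dependencies being resolved into a valid execution order. The task names are illustrative:

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on.
# Because the graph is acyclic, it resolves to a valid order.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

Airflow performs the same kind of resolution for you: it inspects task dependencies and only runs a task once everything upstream has succeeded.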

Architecture

Here’s how Airflow’s components connect:

[Diagram: Scheduler → Executor → Workers; the Web Server, Scheduler, and Workers all read from and write to the Metadata Database.]

  • Web Server: Hosts the UI for monitoring.
  • Scheduler: Schedules tasks based on DAGs.
  • Executor: Manages task execution.
  • Workers: Run the tasks.
  • Metadata Database: Stores task states and logs.
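The interplay of these components can be sketched as a toy pipeline (illustrative pure Python, not Airflow internals): the scheduler records and enqueues tasks, a worker pulls them off the queue, and both report state to the metadata store.

```python
from queue import Queue

metadata = {}          # stand-in for the metadata database
task_queue = Queue()   # stand-in for the executor's task queue

def schedule(tasks):
    # Scheduler: mark each task as queued and hand it to the executor.
    for t in tasks:
        metadata[t] = "queued"
        task_queue.put(t)

def run_worker():
    # Worker: pull tasks off the queue and report their final state.
    while not task_queue.empty():
        t = task_queue.get()
        metadata[t] = "success"

schedule(["extract", "transform", "load"])
run_worker()
print(metadata)  # every task ends up marked 'success'
```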

Executor Modes

LocalExecutor

[Diagram: Scheduler → LocalExecutor → Task 1, Task 2]

Runs tasks on the same machine as the scheduler—simple but not scalable.

CeleryExecutor

[Diagram: Scheduler → CeleryExecutor → Celery Worker 1 → Task 1, Celery Worker 2 → Task 2]

Distributes tasks across workers using Celery—scalable but needs a broker like Redis.
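Switching executors is a configuration change. Here is a sketch of the relevant `airflow.cfg` entries for the CeleryExecutor; the connection strings below are placeholders for your own broker and result backend:

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder URLs; point these at your actual Redis broker and database.
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```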

Sample Python DAG Code

Here’s a simple Python DAG that prints "Hello" and "World!":

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def print_hello():
    print("Hello")

def print_world():
    print("World!")

with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    task1 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello
    )
    task2 = PythonOperator(
        task_id='print_world',
        python_callable=print_world
    )
    task1 >> task2  # Task1 runs before Task2
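
The `task1 >> task2` line is ordinary Python operator overloading. Here is a toy sketch (a hypothetical `Task` class, not Airflow's real implementation) of how `>>` can record a dependency and allow chaining:

```python
class Task:
    """Toy stand-in for an Airflow operator (illustrative only)."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a, then returns b,
        # so longer chains like a >> b >> c also work.
        self.downstream.append(other)
        return other

t1 = Task("print_hello")
t2 = Task("print_world")
t1 >> t2
print([t.task_id for t in t1.downstream])  # ['print_world']
```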
    

