Saturday, March 22, 2025

Apache Airflow notes

Apache Airflow: A One-Stop Guide for Junior Developers

Apache Airflow is a powerful open-source tool for orchestrating jobs and managing data workflows. This guide covers everything you need—history, features, and a practical example—all explained simply.

History and Evolution

Workflow orchestration evolved over time. Here’s the journey:

  • Non/Quasi-Programmable Tools (e.g., Informatica, Talend):
    Early tools like Informatica and Talend offered graphical interfaces for ETL workflows. While powerful for simple tasks, they weren’t fully programmable, limiting flexibility, dependency management, and version control.
  • cron and Task Scheduler:
    Basic scheduling tools like cron (Linux) and Task Scheduler (Windows) ran jobs at fixed times but couldn’t handle dependencies or track job status.
  • Celery:
    A step up, Celery provided a task queue with workers but required custom logic for workflows.
  • Apache Airflow (2014):
    Created at Airbnb in 2014 and open-sourced in 2015, Airflow introduced code-defined workflows with dependency management, becoming an Apache project in 2016.

What is Airflow?

Airflow lets you programmatically define, schedule, and monitor workflows using Python. Workflows are represented as Directed Acyclic Graphs (DAGs)—tasks with a defined order and no loops.

Architecture

Here’s how Airflow’s components connect:

[Diagram: Web Server, Scheduler, Executor, Workers, Metadata Database]

  • Web Server: Hosts the UI for monitoring.
  • Scheduler: Schedules tasks based on DAGs.
  • Executor: Manages task execution.
  • Workers: Run the tasks.
  • Metadata Database: Stores task states and logs.
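
To make the picture concrete, here is a minimal, hypothetical airflow.cfg sketch showing where the executor choice and the metadata database connection live. It assumes a recent Airflow 2.x release and a local PostgreSQL database; the connection string is only a placeholder, not a recommendation:

[core]
# which executor the scheduler hands tasks to
executor = LocalExecutor

[database]
# where task states, DAG runs, and other metadata are stored
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow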

Executor Modes

LocalExecutor

[Diagram: Scheduler → LocalExecutor → Task 1, Task 2]

Runs tasks on the same machine as the scheduler—simple but not scalable.

CeleryExecutor

[Diagram: Scheduler → CeleryExecutor → Celery Worker 1 / Celery Worker 2 → Task 1 / Task 2]

Distributes tasks across workers using Celery—scalable but needs a broker like Redis.
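
As a minimal sketch, switching to CeleryExecutor mainly means pointing airflow.cfg at a message broker and a result backend. The values below are placeholders, assuming Redis as the broker and PostgreSQL as the result backend:

[core]
executor = CeleryExecutor

[celery]
# queue the scheduler pushes task messages to
broker_url = redis://localhost:6379/0
# where Celery stores task results
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow

Each worker machine then runs the command airflow celery worker to start picking up tasks from the queue.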

Sample Python DAG Code

Here’s a simple Python DAG that prints "Hello" and "World!":

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def print_hello():
    print("Hello")

def print_world():
    print("World!")

with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',  # run once per day
    catchup=False  # don't backfill runs for dates between start_date and now
) as dag:
    task1 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello
    )
    task2 = PythonOperator(
        task_id='print_world',
        python_callable=print_world
    )
    task1 >> task2  # Task1 runs before Task2
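
To try the DAG without waiting for the scheduler, the Airflow CLI can run a single DAG run directly (assuming Airflow 2.x and that this file sits in your dags folder):

airflow dags test hello_world_dag 2023-01-01

A single task can also be exercised in isolation with airflow tasks test hello_world_dag print_hello 2023-01-01.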
    


Saturday, March 15, 2025

Introduction - Full-Stack Developer / Data Engineer

Why I'm the Best Fit for This Role

Thank you for the briefing. I now understand the position much better, and after listening to the conversation, I am confident in saying that I am the best fit for this role.

Introduction

I am a Full-Stack Python Developer, with around 80% of my experience in backend work and 20% in frontend.

Backend Expertise

In Backend, I worked on:

  • Creating standalone scripts for automation, scheduled jobs, ETL jobs, and data pipelines.
  • Building RESTful APIs and web applications.

Frontend Expertise

In Frontend, I worked on:

  • Creating dashboards, graphs, and charts using d3.js or Highcharts libraries.
  • Building tables using ag-grid.
  • Working with JavaScript, jQuery, and React.

Databases

In databases:

  • Relational Databases: MySQL, MS SQL, Oracle DB, PostgreSQL.
  • NoSQL Databases: MongoDB, Cassandra.
  • Data Warehousing: Snowflake schema.

Python Expertise

In Python:

  • Creating web applications and RESTful APIs using Django, Flask, and FastAPI frameworks.
  • Process automation and integrating with infrastructure (Linux/Windows).
  • Data gathering from:
    1. Structured datasources like REST APIs (internal or external), databases, etc.
    2. Unstructured datasources like web scraping using Beautiful Soup.
    3. Structured, semi-structured, or unstructured file types like CSV, Excel, JSON, YAML, Parquet, etc.
  • Following TDD (Test-Driven Development) by creating unit tests and integration tests using unittest or pytest modules.

Caching and Scheduling

In caching:

  • Worked with Redis and Memcached.

In scheduling:

  • Worked with Celery.
  • Integrated Celery with Django applications for periodic jobs.

For data job orchestration, I worked with Airflow.

Public Cloud Experience

In the public cloud, I have mostly worked with AWS. Within AWS, I have worked with:

Server-Based Architectures

  • EC2 Instances
  • Elastic Beanstalk
  • Elastic Load Balancer
  • Auto-Scaling
  • Route53

Storage Solutions

  • S3 Bucket for file storage
  • S3 Glacier for archival storage

Databases

  • AWS RDS & Redshift for relational databases
  • AWS DynamoDB for NoSQL

Serverless Architectures

  • AWS Lambda (time-triggered or event-triggered)
  • AWS API Gateway (HTTP-triggered)
  • Amazon EventBridge Scheduler for scheduling Lambda invocations

Container-based Environments

  • Created Dockerfiles and docker-compose.yaml files for building containers
  • AWS EKS (Elastic Kubernetes Service) for orchestrating the pods
  • Also wrote Helm charts for deployments

For OAuth, I worked with AWS Cognito.

Also worked with SQS, SNS, and SES.

For big data processing, I worked with EMR clusters for PySpark.

For ETL, I worked with AWS Glue jobs using the Glue Data Catalog and PySpark.

CI/CD and Infrastructure as Code

Experience in creating CI/CD setups using Jenkins and Groovy scripts.

In terms of Infrastructure as Code, I worked mainly with:

  • CloudFormation templates
  • Terraform
  • Pulumi Python module

Agile Methodologies

Experience with agile methodologies like Scrum and Kanban, including facilitating agile ceremonies such as:

  • Daily standups
  • Sprint reviews
  • Planning sessions
  • Retrospectives

Familiar with Agile tools like Jira and skilled in working with cross-functional teams including Dev teams, QA testers, and PM/POs.
