Saturday, March 22, 2025

Apache Airflow notes

Apache Airflow: A One-Stop Guide for Junior Developers

Apache Airflow is a powerful open-source tool for orchestrating jobs and managing data workflows. This guide covers everything you need—history, features, and a practical example—all explained simply.

History and Evolution

Workflow orchestration evolved over time. Here’s the journey:

  • Non/Quasi-Programmable Tools (e.g., Informatica, Talend):
    Early tools like Informatica and Talend offered graphical interfaces for ETL workflows. While powerful for simple tasks, they weren’t fully programmable, limiting flexibility, dependency management, and version control.
  • cronTab and Event Scheduler:
    Basic scheduling tools like cronTab (Linux) and Event Scheduler (Windows) ran jobs at fixed times but couldn’t handle dependencies or track job status.
  • Celery:
    A step up, Celery provided a task queue with workers but required custom logic for workflows.
  • Apache Airflow (2014):
    Created at Airbnb in 2014 and open-sourced in 2015, Airflow introduced code-defined workflows with dependency management, becoming an Apache project in 2016.

What is Airflow?

Airflow lets you programmatically define, schedule, and monitor workflows using Python. Workflows are represented as Directed Acyclic Graphs (DAGs)—tasks with a defined order and no loops.

Architecture

Here’s how Airflow’s components connect:

Web Server → Scheduler → Executor → Workers → Metadata Database

  • Web Server: Hosts the UI for monitoring.
  • Scheduler: Schedules tasks based on DAGs.
  • Executor: Manages task execution.
  • Workers: Run the tasks.
  • Metadata Database: Stores task states and logs.

Executor Modes

LocalExecutor

Scheduler → LocalExecutor → (Task 1, Task 2)

Runs tasks on the same machine as the scheduler—simple but not scalable.

CeleryExecutor

Scheduler → CeleryExecutor → Celery Worker 1 / Celery Worker 2 → (Task 1, Task 2)

Distributes tasks across workers using Celery—scalable but needs a broker like Redis.
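As a rough sketch of what switching to the CeleryExecutor involves (the broker and result-backend URLs below are placeholder assumptions, not values from this guide), the relevant airflow.cfg entries look like:

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder URLs: point these at your own Redis broker and metadata database
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```

The same settings can also be supplied through environment variables such as AIRFLOW__CORE__EXECUTOR.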

Sample Python DAG Code

Here’s a simple Python DAG example that prints "Hello" and "World!":

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def print_hello():
    print("Hello")

def print_world():
    print("World!")

with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    task1 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello
    )
    task2 = PythonOperator(
        task_id='print_world',
        python_callable=print_world
    )
    task1 >> task2  # Task1 runs before Task2
    


Saturday, March 15, 2025

Introduction - FullStack Developer/ Data Engineer

Why I'm the Best Fit for This Role

Thank you for the briefing — now I understand the position much better. After listening to the conversation, I am confident in saying that I am the best fit for this role.

Introduction

I am a Full-Stack Python Developer, with around 80% of my experience in backend work and 20% in frontend.

Backend Expertise

In Backend, I worked on:

  • Creating standalone scripts for automation, scheduled jobs, ETL jobs, and data pipelines.
  • Building RESTful APIs and web applications.

Frontend Expertise

In Frontend, I worked on:

  • Creating dashboards, graphs, and charts using d3.js or Highcharts libraries.
  • Building tables using ag-grid.
  • Working with JavaScript, jQuery, and React.

Databases

In databases:

  • Relational Databases: MySQL, MS SQL, Oracle DB, PostgreSQL.
  • NoSQL Databases: MongoDB, Cassandra.
  • Data Warehousing: Snowflake schema.

Python Expertise

In Python:

  • Creating web applications and RESTful APIs using Django, Flask, and FastAPI frameworks.
  • Process automation and integrating with infrastructure (Linux/Windows).
  • Data gathering from:
    1. Structured datasources like REST APIs (internal or external), databases, etc.
    2. Unstructured datasources like web scraping using Beautiful Soup.
    3. Structured, semi-structured, or unstructured file types like CSV, Excel, JSON, YAML, Parquet, etc.
  • Following TDD (Test-Driven Development) by creating unit tests and integration tests using unittest or pytest modules.

Caching and Scheduling

In caching:

  • Worked with Redis and Memcache.

In scheduling:

  • Worked with Celery.
  • Integrated Celery with Django applications for periodic jobs.

For data job orchestration, I worked with Airflow.

Public Cloud Experience

In Public Cloud, I am mostly associated with AWS Cloud. In AWS Cloud, I worked with:

Server-Based Architectures

  • EC2 Instances
  • Elastic Beanstalk
  • Elastic Load Balancer
  • Auto-Scaling
  • Route53

Storage Solutions

  • S3 Bucket for file storage
  • S3 Glacier for archival storage

Databases

  • AWS RDS & Redshift for relational databases
  • AWS DynamoDB for NoSQL

Serverless Architectures

  • AWS Lambda (time-triggered or event-triggered)
  • AWS API Gateway (HTTP-triggered)
  • AWS Event Scheduler for scheduling Lambda

Container-based Environments

  • Created Dockerfiles and docker-compose.yaml files for building containers
  • AWS EKS (Elastic Kubernetes Service) for orchestrating the pods
  • Also wrote Helm charts

For OAuth, I worked with AWS Cognito.

Also worked with SQS, SNS, and SES.

For big data processing, I worked with EMR clusters for PySpark.

For ETLs, I worked with AWS Glue Jobs with DataCatalog and PySpark.

CI/CD and Infrastructure as Code

Experience in creating CI/CD setups using Jenkins and Groovy scripts.

In terms of Infrastructure as Code, I worked mainly with:

  • CloudFormation templates
  • Terraform
  • Pulumi Python module

Agile Methodologies

Experience in agile methodologies like Scrum and Kanban in facilitating agile ceremonies like:

  • Daily standups
  • Sprint reviews
  • Planning sessions
  • Retrospectives

Familiar with Agile tools like Jira and skilled in working with cross-functional teams including Dev teams, QA testers, and PM/POs.

Wednesday, September 6, 2023

Python Interview Questions

Python Interview Quiz

Friday, June 5, 2020

getattr, setattr, hasattr and delattr in python

# `getattr(object, name[, default])` Function in Python

The `getattr(object, name[, default])` function returns the value of a named attribute of an object, where `name` must be a string. If the object has an attribute with the specified `name`, then the value of that attribute is returned. On the other hand, if the object does not have an attribute with `name`, then the value of `default` is returned, or `AttributeError` is raised if `default` is not provided.

```python
>>> t = ('This', 'is', 'a', 'tuple')
>>> t.index('is')
1
>>> getattr(t, 'index')
<built-in method index of tuple object at 0x10c15e680>
>>> getattr(t, 'index')('is')
1
```

when the attribute is not defined,
```python
>>> getattr(t, 'len')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'len'
>>> getattr(t, 'len', t.count)
<built-in method count of tuple object at 0x10c15e680>
```

Besides "normal" objects like tuples, lists, and class instances, getattr also accepts modules as arguments. Since modules are also objects in Python, the attributes of modules can be retrieved just like any attribute in an object.

```python
>>> import uuid
>>> getattr(uuid, 'UUID')
<class 'uuid.UUID'>
>>> type(getattr(uuid, 'UUID'))
<class 'type'>
>>> isinstance(getattr(uuid, 'UUID'), type)
True
>>> callable(getattr(uuid, 'UUID'))
True
>>> getattr(uuid, 'UUID')('12345678123456781234567812345678')
UUID('12345678-1234-5678-1234-567812345678')
```

*****
# Check for the existence of an attribute: hasattr, the pythonic "Look Before You Leap" (LBYL) style

```python
>>> hasattr('abc', 'upper')
True
>>> hasattr('abc', 'lower')
True
>>> hasattr('abc', 'convert')
False
```

Using the try-except way, the "Easier to Ask for Forgiveness than Permission" (EAFP) style:

```python
>>> try:
...     'abc'.upper()
... except AttributeError:
...     print("abc does not have attribute 'upper'")
...
'ABC'
>>> try:
...     'abc'.convert()
... except AttributeError:
...     print("abc does not have attribute 'convert'")
...
abc does not have attribute 'convert'
```

**********

hasattr vs __dict__

hasattr() searches the instance, its class, and the whole inheritance chain, while an instance's __dict__ only holds attributes set directly on that instance:

```python
>>> class A(object):
...     foo = 1
...
>>> class B(A):
...     pass
...
>>> b = B()
>>> hasattr(b, 'foo')
True
>>> 'foo' in b.__dict__
False
```


# setattr -- Assigns a value to the object’s attribute given its name.

Ex: setattr(x, 'foobar', 123) is equivalent to x.foobar = 123

Example 1

```python
>>> class Foo:
...     def __init__(self, x):
...         self.x = x
...
>>> f = Foo(10)
>>> f.x
10
>>> setattr(f, 'x', 20)
>>> f.x
20
>>> setattr(f, 'y', 10)
>>> f.y
10
>>> f.y = 100
>>> f.y
100
```
You can dynamically add a function as a method to a class:

```python
>>> def b(self):
...     print('bar')
...
>>> class Foo:
...     pass
...
>>> f = Foo()
>>> 'bar' in dir(f)
False
>>> setattr(Foo, 'bar', b)
>>> 'bar' in dir(f)
True
>>> f.bar()
bar
```
You can setattr() on an instance of a class that inherits from "object", but you can't on an instance of "object" itself:

```python
>>> o = object()
>>> setattr(o, "x", 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute 'x'
>>> class Object(object):
...     pass
...
>>> o = Object()
>>> setattr(o, "x", 1000)
>>> o.x
1000
```
If __setattr__() wants to assign to an instance attribute, it should not simply execute self.name = value — this would cause a recursive call to itself. Instead, it should insert the value in the dictionary of instance attributes, e.g., self.__dict__[name] = value. For new-style classes, rather than accessing the instance dictionary, it should call the base class method with the same name, for example, object.__setattr__(self, name, value).
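A minimal sketch of that advice (the class name and attributes here are illustrative, not from the original notes):

```python
class Tracked:
    """Logs every attribute assignment."""
    def __setattr__(self, name, value):
        print(f"setting {name} = {value}")
        # self.name = value here would recurse back into __setattr__;
        # delegating to the base class performs the real assignment safely.
        object.__setattr__(self, name, value)

t = Tracked()
t.x = 10       # prints: setting x = 10
print(t.x)     # 10
```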

The __slots__ declaration takes a sequence of instance variables and reserves just enough space in each instance to hold a value for each variable. Space is saved because __dict__ is not created for each instance.

When there is '__slots__', there won't be '__dict__' and '__weakref__'.
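A quick demonstration (the Point class is illustrative):

```python
class Point:
    __slots__ = ('x', 'y')  # only these attributes can exist on an instance

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(hasattr(p, '__dict__'))   # False: no per-instance dict is created

try:
    p.z = 3                     # 'z' is not in __slots__
except AttributeError:
    print("no attribute 'z' allowed")
```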


delattr will delete the attribute.

Example:

```python
class Box:
    pass

box = Box()

# Create a width attribute.
setattr(box, "width", 15)

# The attribute exists.
if hasattr(box, "width"):
    print(True)

# Delete the width attribute.
delattr(box, "width")

# Width no longer exists.
if not hasattr(box, "width"):
    print(False)
```

Python tips


In [1]: import os

In [2]: os.listdir(os.getcwd()) == os.listdir(os.curdir)
Out[2]: True

In [3]: os.listdir(os.getcwd()) == os.listdir('.')
Out[3]: True


*******
WAP to print each character of a word as many times as its position.

word = input("Enter the word: ")
for pos, ch in enumerate(word):
    print(ch * pos)

******
WAP to print each character of a word as many times as its ASCII value.

word = input("Enter the word: ")
for ch in word:
    print(ch * ord(ch))
# alternatively: print([ch * ord(ch) for ch in word])

Sunday, December 17, 2017

Initiating static local HTTP Server using different scripting languages


Python

    python 2.x        - python -m SimpleHTTPServer 8000

    python 3.x        - python -m http.server 8000

http-server (Node.js)

    npm install -g http-server # install dependency
    http-server -p 8000

node-static (Node.js)

    npm install -g node-static # install dependency
    static -p 8000

Ruby

    ruby -rwebrick -e 'WEBrick::HTTPServer.new(:Port => 3000, :DocumentRoot => Dir.pwd).start'

    Ruby 1.9.2+
        - ruby -run -ehttpd . -p3000

    adsf (Ruby)
        - gem install adsf # install dependency
        - adsf -p 8000
Perl
----
    cpan HTTP::Server::Brick # install dependency
        - $ perl -MHTTP::Server::Brick -e '$s=HTTP::Server::Brick->new(port=>8000); $s->mount("/"=>{path=>"."}); $s->start'

    Plack (Perl)
        - cpan Plack # install dependency
        - plackup -MPlack::App::Directory -e 'Plack::App::Directory->new(root=>".");' -p 8000

PHP (>= 5.4)
------------
    php -S 127.0.0.1:8000

busybox httpd
-------------
    busybox httpd -f -p 8000

webfs
-----
    webfsd -F -p 8000

IIS Express
-----------
    C:\> "C:\Program Files (x86)\IIS Express\iisexpress.exe" /path:C:\MyWeb /port:8000

AWS CLI installation procedure in centos/ RHEL

1)  Make sure you have the epel repository installed
sudo yum install -y epel-release 

2)  Install needed packages
sudo yum install -y python2-pip

3)  Install awscli
sudo yum install -y awscli

4) Configure awscli.
If you do not add --profile <profile_name>, it will use the default profile.

aws configure --profile <profile_name>

Using the --profile allows you to create profiles for other accounts using different keys.
You can also create profiles for different regions.
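For illustration, a profile named dev (the profile name, keys, and region below are placeholders) ends up stored in the standard AWS CLI files:

```ini
# ~/.aws/credentials
[dev]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>

# ~/.aws/config
[profile dev]
region = us-east-1
```

Any command can then target that profile with --profile dev, or by exporting AWS_PROFILE=dev for the session.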
