Saturday, November 1, 2025

Turn a blocking API into an event-driven firehose by letting Kafka do the “streaming” while the HTTP tier stays stateless.

From Ad-Hoc Abuse to Enterprise Pseudo-Streaming – A STAR Story

STAR Narrative (Gap-Free), with the Critic’s Questions & Reasoning (why each requirement exists) interleaved after every stage
SITUATION
Business context – Internal analytics team needed real-time visibility into high-volume operational data via self-service dashboards.
Initial contract – “≤ 2-day window, ≤ 1 000 rows” → sub-second REST GET.
Reality – Users began POST-ing custom filters (dimensions, regex patterns, time-range up to 90 days).
Pain point – Queries took 30–180 s → dashboard gateway timeout (60 s) → blank screens.
Q1: Why allow custom filters at all?
A: Flexibility was a hard stakeholder requirement from Day-1: business wanted “self-service” without ticket-based schema changes.

Q2: SQL injection risk?
A: We never concatenated user input. All filters parsed into a safe AST → translated to search engine DSL. 100 % unit test coverage + external pentest.

Q3: Why did abuse explode?
A: Early wins → viral adoption → 50 % of analysts using API in 2 weeks. Feature creep became a formal requirement – leadership mandated “no restrictions” to preserve velocity.
TASK
  1. Zero timeouts for any valid query.
  2. No long-lived HTTP connections (global concurrent users > 2 000).
  3. Incremental results – first page in < 2 s.
  4. Security & governance – audit every query, enforce retention, prevent storage bloat.
Q4: Why not just increase timeout to 300 s?
A: Dashboard platform hard-coded 60 s gateway + UX degrades > 5 s wait.

Q5: Why not cache everything?
A: 90 % of queries were unique (different regex/time-range). Cache hit < 5 %.
ACTION
Step 1 – Input validation & job creation
• POST → 202 Accepted + jobId + offset (round-robin from fixed pool of 1 000 offsets).
• Filter AST → query fingerprint (SHA-256) → deduplication cache (Redis, 5 min TTL).
• Identical fingerprints → reuse same offset (instant sharing); see the sketch below.
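
This step is easiest to see in code. Below is a minimal sketch of Step 1, assuming FastAPI and redis-py; the endpoint path, key names, and the cursor-based round-robin are illustrative placeholders rather than the original implementation, and the filter payload is assumed to have already passed AST validation.

import hashlib
import json
import uuid

import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

POOL_SIZE = 1_000          # fixed pool of result "offsets" (slots)
FINGERPRINT_TTL_S = 300    # 5-minute deduplication window

@app.post("/analytics/queries")
def create_query_job(filters: dict):
    # Canonicalise the already-validated filter payload and fingerprint it.
    fingerprint = hashlib.sha256(
        json.dumps(filters, sort_keys=True).encode()
    ).hexdigest()

    # Identical fingerprints share the same slot for the TTL window.
    slot = cache.get(f"fp:{fingerprint}")
    if slot is None:
        slot = str(cache.incr("offset_cursor") % POOL_SIZE)   # round-robin allocation
        cache.set(f"fp:{fingerprint}", slot, ex=FINGERPRINT_TTL_S)

    job_id = str(uuid.uuid4())
    # Enqueue the job for a backend worker here (omitted), then acknowledge.
    return JSONResponse(status_code=202,
                        content={"jobId": job_id, "offset": int(slot)})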

Step 2 – Backend micro-batches
• Worker executes in 10-row chunks → JSON → Kafka ResultTopic at assigned offset.
• Each message carries jobId, chunkSeq, isLast (producer sketch below).
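
A hedged sketch of the Step 2 worker-side producer, assuming kafka-python; the post's assigned "offset" is modelled here as the target partition of ResultTopic (one plausible reading), and publish_result, the topic name, and the broker address are placeholders.

import json
from kafka import KafkaProducer

CHUNK_SIZE = 10  # 10 rows ≈ one dashboard screen page (see Q7)

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_result(job_id: str, slot: int, rows: list) -> None:
    chunks = [rows[i:i + CHUNK_SIZE] for i in range(0, len(rows), CHUNK_SIZE)]
    chunks = chunks or [[]]                 # always emit at least an isLast marker
    for seq, chunk in enumerate(chunks):
        producer.send(
            "ResultTopic",
            partition=slot,                 # one partition per pooled slot
            value={
                "jobId": job_id,
                "chunkSeq": seq,
                "isLast": seq == len(chunks) - 1,
                "rows": chunk,
            },
        )
    producer.flush()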

Step 3 – Consumer experience
• Dashboard connector polls from offset (or uses reactive client).
• First chunk arrives in a median of 1.1 s (consumer sketch below).
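
For completeness, a sketch of a polling connector for Step 3 under the same assumptions (kafka-python, slot = ResultTopic partition); fetch_result and the timeout value are illustrative. A reactive client would do the same thing without blocking a thread.

import json
from kafka import KafkaConsumer, TopicPartition

def fetch_result(job_id: str, slot: int, timeout_ms: int = 120_000):
    consumer = KafkaConsumer(
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
        consumer_timeout_ms=timeout_ms,     # stop iterating if nothing arrives
    )
    consumer.assign([TopicPartition("ResultTopic", slot)])
    try:
        for message in consumer:            # chunks arrive as the worker produces them
            chunk = message.value
            if chunk["jobId"] != job_id:    # a different job reusing the slot
                continue
            yield chunk["rows"]             # first page can render immediately
            if chunk["isLast"]:
                break
    finally:
        consumer.close()

The HTTP tier never holds a connection open here: the dashboard talks to Kafka (or to a thin connector in front of it), which is what keeps the REST layer stateless.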

Step 4 – Governance
• Offset TTL = 5 min → auto-cleanup consumer → log compaction → offset returned to pool.
• Audit log: jobId, fingerprint, user, offset, row count (see the sketch below).
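
A small illustrative sketch of the audit half of Step 4, assuming structured JSON log lines shipped to the existing log pipeline; the field names mirror the bullet above. Slot reclamation itself falls out of the 5-minute Redis TTL from Step 1 plus the cleanup consumer.

import json
import logging
import time

audit_log = logging.getLogger("query_audit")

def audit(job_id: str, fingerprint: str, user: str, slot: int, row_count: int) -> None:
    # One line per completed job; retention and compaction are handled downstream.
    audit_log.info(json.dumps({
        "ts": time.time(),
        "jobId": job_id,
        "fingerprint": fingerprint,
        "user": user,
        "offset": slot,
        "rowCount": row_count,
    }))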

Step 5 – POC & rollout
• 3-day POC: 100 concurrent 90 s queries → 100 % success.
• Load test: 2 500 concurrent users → CPU < 70 %, no socket exhaustion.
• Security sign-off: verified no injection path.
Q6: Why manual offset pool instead of consumer groups?
A: Consumer groups require unique group.id per dashboard instance → dashboard engine cannot coordinate. Fixed offset pool gives deterministic URL (offset=42) embeddable in config.

Q7: Why 10-row chunks?
A: Balances Kafka throughput (≤ 1 ms per message) and UI rendering (10 rows = one screen page).

Q8: Why 5 min TTL?
A: 95th percentile query finishes in 110 s; 5 min covers stragglers + network retries.

Q9: Why not Server-Sent Events (SSE)?
A: SSE = one TCP socket per dashboard → 2 000 dashboards = 2 000 open sockets → kernel file-descriptor (epoll) exhaustion. Kafka moves the fan-out load to the brokers.
RESULT
  • Timeouts eliminated – 0 % post-rollout.
  • First-chunk latency: 95th percentile 1.2 s.
  • Scalability: 200 → 2 500 concurrent users (12×) on same Kafka cluster.
  • Cost: +1 microservice (offset-manager, 2 vCPU), no new hardware.
  • Adoption: 3 additional teams adopted pattern in < 90 days.
  • Governance: 100 % queries audited; zero security incidents.
  • Feature status: “Pseudo-streaming API” is now official product requirement in Analytics Roadmap 2026.
Q10: Any regression?
A: None. Added SLA dashboard showing chunk latency distribution – used in quarterly reviews.

Q11: User abuse still possible?
A: Rate-limited to 1 job/5 s per user + max 500 k rows per job → enforced in API gateway.

One-Liner for Resume / Interview

“Converted an abused ad-hoc REST API into a Kafka-mediated pseudo-streaming platform that scaled 12×, eliminated timeouts, and turned analyst flexibility into a governed enterprise feature – all while keeping HTTP stateless and injection-free.”


Friday, October 31, 2025

Decoding the Hunt: How to Land Your Dream Quant Data Engineer / MLOps Role

The world of quantitative finance is a thrilling intersection of complex mathematics, bleeding-edge technology, and immense financial impact. At the heart of this world sits a role that has become absolutely critical: the Quant Data Engineer / MLOps Engineer. We are the architects who build the high-performance data pipelines, the guardians who manage the lifecycle of multi-million dollar models, and the engineers who ensure that every trading decision is backed by clean, fast, and reliable data.

But having the skills is only half the battle. The top hedge funds and quant trading firms aren't just posting jobs on a board and waiting. They are hunting for talent in a sophisticated, multi-channel process. This guide will pull back the curtain on that process and give you an actionable playbook to not just be found, but to be sought after.

The Recruiter's Playbook: How You're Actually Found

First, let's dispel a myth: the best candidates are rarely the ones applying through a portal. They are the ones who are discovered. A senior recruiter at a London-based financial firm operates like a skilled intelligence agent. Their toolkit goes far beyond a simple LinkedIn search. They use a combination of platforms to build a 360-degree view of a candidate's capabilities and passion.

Your goal is to make your professional signal so strong and clear across these platforms that you become impossible to ignore.

The "Be Found" Strategy: A Practical Guide

Here’s how you can optimize your presence across the key hunting grounds.

1. LinkedIn: Your Digital Handshake

LinkedIn is the command center. To be found here, your profile must be a keyword-rich, achievement-oriented document. More importantly, you need to understand how recruiters search. They don't just type "Quant Data Engineer"; they use complex Boolean strings.

Here are some examples of the strings they use, and how you can align your profile:

  • Broad Search for Core Skills:
    ("Quant Data Engineer" OR "MLOps Engineer" OR "Quantitative Developer") AND ("Python" OR "Golang") AND ("Kafka" OR "Kubernetes")
    Your Focus: Ensure these exact terms are liberally and naturally sprinkled throughout your profile's headline, summary, and experience sections.

  • Targeting Finance & Cloud Expertise:
    ("Quant Data Engineer" OR "MLOps") AND ("Risk Modeling" OR "IPV" OR "Derivatives") AND ("AWS" OR "Azure") AND ("Databricks" OR "BigQuery")
    Your Focus: If you have any experience with financial products (even academic) or specific cloud platforms, list them explicitly.

  • Excluding Irrelevant Profiles:
    ("Quant Data Engineer" OR "MLOps") -recruiter -hr -"human resources" -"looking for work"
    Your Focus: This shows how recruiters filter noise. You don't need to do anything here, but it's good to know they are actively trying to find you, not other recruiters.

2. GitHub: Your Living, Breathing Resume

For a data engineer, your GitHub profile is more important than your CV. It's tangible proof of your skills. Recruiters aren't just looking for green squares; they are assessing:

  • Code Quality: Is your code clean, well-documented, and following best practices (e.g., PEP 8 for Python)?
  • Activity: Are you actively contributing? Do you have meaningful commit messages? Do you contribute to major open-source projects like pandas, scikit-learn, or Apache Airflow? Even small, well-thought-out contributions are huge signals.
  • Followers/Following: Who do you follow? Following key figures in the data engineering and MLOps space shows you are engaged with the community.

3. eFinancialCareers & Niche Boards

For finance roles, generalist job boards are too noisy. Recruiters live on eFinancialCareers, QuantStack, and Otta.

  • How to Search: Set up highly specific job alerts. Use keywords like "MLOps," "Kafka," "Python," "C++," and "Quantitative." Keep your CV on these platforms updated and tailored to the finance domain, highlighting any experience with risk, trading systems, or financial data.

4. Kaggle: Your Proving Ground

A strong Kaggle profile is a powerful, objective signal of talent. Recruiters analyze:

  • Competition Rankings: A high rank (Grandmaster, Master) immediately puts you on the map.
  • Notebook Quality: This is crucial. A great notebook isn't just code that wins; it's a story. It has clear explanations, insightful visualizations, and well-structured code. It demonstrates your ability to communicate complex results.
  • Profile: A complete profile with a professional picture and a link to your LinkedIn/GitHub creates a cohesive, professional brand.

5. Stack Overflow: Building Your Reputation

Are you a true expert? Prove it. Recruiters search for high-reputation users in tags that are critical for our roles: python, pandas, apache-kafka, kubernetes, and golang. Providing thoughtful, detailed answers to complex questions is one of the best ways to demonstrate deep, practical knowledge.

The 8 AM Habit: Treat Your Job Search Like a High-Frequency Strategy

Landing a top-tier role isn't about a frantic month-long sprint; it's about consistent, disciplined effort. The best candidates treat their career development like a high-frequency trading strategy: small, consistent gains that compound over time.

This is where the idea of a daily routine comes in. It’s not about applying for jobs every day. It’s about a daily 8 AM habit:

  • 15 minutes: Scan your LinkedIn for new connections or messages from recruiters. Engage with a post from a leader in the field.
  • 15 minutes: Find one question on Stack Overflow or a discussion on GitHub that you can contribute to.
  • 15 minutes: Review a recent project on your GitHub. Can you improve the documentation? Refactor one function?

This daily discipline keeps you sharp, grows your network, and steadily builds your professional reputation. It ensures that when a top recruiter finally comes calling, your public profile is a powerful testament to your expertise and passion.

The hunt for top talent is relentless, but by building a strong, multi-channel presence, you can shift from being the hunter to the hunted. Good luck.

Saturday, March 22, 2025

Apache Airflow notes

Apache Airflow: A One-Stop Guide for Junior Developers

Apache Airflow is a powerful open-source tool for orchestrating jobs and managing data workflows. This guide covers everything you need—history, features, and a practical example—all explained simply.

History and Evolution

Workflow orchestration evolved over time. Here’s the journey:

  • Non/Quasi-Programmable Tools (e.g., Informatica, Talend):
    Early tools like Informatica and Talend offered graphical interfaces for ETL workflows. While powerful for simple tasks, they weren’t fully programmable, limiting flexibility, dependency management, and version control.
  • crontab and Event Scheduler:
    Basic scheduling tools like crontab (Linux) and Event Scheduler (Windows) ran jobs at fixed times but couldn’t handle dependencies or track job status.
  • Celery:
    A step up, Celery provided a task queue with workers but required custom logic for workflows.
  • Apache Airflow (2014):
    Created at Airbnb in 2014 and open-sourced in 2015, Airflow introduced code-defined workflows with built-in dependency management; it entered the Apache Incubator in 2016 and became a top-level Apache project in 2019.

What is Airflow?

Airflow lets you programmatically define, schedule, and monitor workflows using Python. Workflows are represented as Directed Acyclic Graphs (DAGs)—tasks with a defined order and no loops.

Architecture with Mermaid Diagram

Here’s how Airflow’s components connect:

graph LR
    A[Web Server] --> B[Scheduler]
    B --> C[Executor]
    C --> D[Workers]
    B --> E[Metadata Database]
    D --> E
  • Web Server: Hosts the UI for monitoring.
  • Scheduler: Schedules tasks based on DAGs.
  • Executor: Manages task execution.
  • Workers: Run the tasks.
  • Metadata Database: Stores task states and logs.

Executor Modes with Mermaid Diagrams

LocalExecutor

graph LR
    A[Scheduler] --> B[LocalExecutor]
    B --> C[Task 1]
    B --> D[Task 2]

Runs tasks on the same machine as the scheduler—simple but not scalable.

CeleryExecutor

graph LR
    A[Scheduler] --> B[CeleryExecutor]
    B --> C[Celery Worker 1]
    B --> D[Celery Worker 2]
    C --> E[Task 1]
    D --> F[Task 2]

Distributes tasks across workers using Celery—scalable but needs a broker like Redis.

Sample Python DAG Code

Here’s a simple Python DAG example that prints "Hello" and "World!":

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def print_hello():
    print("Hello")

def print_world():
    print("World!")

with DAG(
    dag_id='hello_world_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',   # run once per day
    catchup=False                 # don't backfill runs between start_date and today
) as dag:
    task1 = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello
    )
    task2 = PythonOperator(
        task_id='print_world',
        python_callable=print_world
    )
    task1 >> task2  # Task1 runs before Task2
    


Saturday, March 15, 2025

Introduction - Full-Stack Developer / Data Engineer

Why I'm the Best Fit for This Role

Thank you for the briefing; I now understand the position much better. After listening to the conversation, I am even more confident in saying, “I am the best fit for this role.”

Introduction

I am a Full-Stack Python Developer; around 80% of my experience is in backend work and 20% in frontend.

Backend Expertise

In Backend, I worked on:

  • Creating standalone scripts for automation, scheduled jobs, ETL jobs, and data pipelines.
  • Building RESTful APIs and web applications.

Frontend Expertise

In Frontend, I worked on:

  • Creating dashboards, graphs, and charts using d3.js or Highcharts libraries.
  • Building tables using ag-grid.
  • Working with JavaScript, jQuery, and React.

Databases

In databases:

  • Relational Databases: MySQL, MS SQL, Oracle DB, PostgreSQL.
  • NoSQL Databases: MongoDB, Cassandra.
  • Data Warehousing: Snowflake schema.

Python Expertise

In Python:

  • Creating web applications and RESTful APIs using Django, Flask, and FastAPI frameworks.
  • Process automation and integrating with infrastructure (Linux/Windows).
  • Data gathering from:
    1. Structured datasources like REST APIs (internal or external), databases, etc.
    2. Unstructured datasources like web scraping using Beautiful Soup.
    3. Structured, semi-structured, or unstructured file types like CSV, Excel, JSON, YAML, Parquet, etc.
  • Following TDD (Test-Driven Development) by creating unit tests and integration tests using unittest or pytest modules.

Caching and Scheduling

In caching:

  • Worked with Redis and Memcached.

In scheduling:

  • Worked with Celery.
  • Integrated Celery with Django applications for periodic jobs.

For data job orchestration, I worked with Airflow.

Public Cloud Experience

In Public Cloud, I am mostly associated with AWS Cloud. In AWS Cloud, I worked with:

Server-Based Architectures

  • EC2 Instances
  • Elastic Beanstalk
  • Elastic Load Balancer
  • Auto-Scaling
  • Route53

Storage Solutions

  • S3 Bucket for file storage
  • S3 Glacier for archival storage

Databases

  • AWS RDS & Redshift for relational databases
  • AWS DynamoDB for NoSQL

Serverless Architectures

  • AWS Lambda (time-triggered or event-triggered)
  • AWS API Gateway (HTTP-triggered)
  • Amazon EventBridge (scheduled rules) for triggering Lambda on a schedule

Container-based Environments

  • Wrote Dockerfiles and docker-compose.yml files to build and run containers
  • AWS EKS (Elastic Kubernetes Service) for orchestrating pods
  • Also wrote Helm charts for deployments

For OAuth, I worked with AWS Cognito.

Also worked with SQS, SNS, and SES.

For big data processing, I worked with EMR clusters for PySpark.

For ETLs, I worked with AWS Glue Jobs with DataCatalog and PySpark.

CI/CD and Infrastructure as Code

Experience in creating CI/CD setups using Jenkins and Groovy scripts.

In terms of Infrastructure as Code, I worked mainly with:

  • CloudFormation templates
  • Terraform
  • Pulumi Python module

Agile Methodologies

Experience with agile methodologies like Scrum and Kanban, including facilitating agile ceremonies such as:

  • Daily standups
  • Sprint reviews
  • Planning sessions
  • Retrospectives

Familiar with Agile tools like Jira and skilled in working with cross-functional teams including Dev teams, QA testers, and PM/POs.