Saturday, November 1, 2025

Turn a blocking API into an event-driven firehose by letting Kafka do the “streaming” while the HTTP tier stays stateless.

From Ad-Hoc Abuse to Enterprise Pseudo-Streaming – A STAR Story

STAR Narrative (Gap-Free) – with Critic’s Questions & Reasoning (Why Each Requirement Exists)
SITUATION
Business context – Internal analytics team needed real-time visibility into high-volume operational data via self-service dashboards.
Initial contract – “≤ 2-day window, ≤ 1 000 rows” → sub-second REST GET.
Reality – Users began POST-ing custom filters (dimensions, regex patterns, time ranges up to 90 days).
Pain point – Queries took 30–180 s → dashboard gateway timeout (60 s) → blank screens.
Q1: Why allow custom filters at all?
A: Flexibility was a hard stakeholder requirement from day one: the business wanted “self-service” without ticket-based schema changes.

Q2: SQL injection risk?
A: We never concatenated user input. Every filter was parsed into a safe AST, then translated to the search engine’s DSL. 100 % unit-test coverage + external pentest.
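
To make that concrete, here is a minimal sketch of the whitelist-plus-AST idea; the field names, operators, and DSL shapes are illustrative stand-ins rather than the real schema. The point is that user input never reaches the query as a string.

```python
# Illustrative only: whitelist-validated filter AST -> structured search DSL.
from dataclasses import dataclass

ALLOWED_FIELDS = {"region", "status", "event_time"}   # assumed whitelist
ALLOWED_OPS = {"eq", "regex", "range"}                # assumed whitelist

@dataclass
class FilterNode:
    field: str
    op: str
    value: object

def to_dsl(node: FilterNode) -> dict:
    """Reject anything outside the whitelist, then emit a structured DSL clause."""
    if node.field not in ALLOWED_FIELDS or node.op not in ALLOWED_OPS:
        raise ValueError(f"unsupported filter: {node.field!r} {node.op!r}")
    if node.op == "eq":
        return {"term": {node.field: node.value}}
    if node.op == "regex":
        return {"regexp": {node.field: str(node.value)}}
    lo, hi = node.value                               # a range arrives as a pair
    return {"range": {node.field: {"gte": lo, "lte": hi}}}

# to_dsl(FilterNode("region", "eq", "EMEA")) -> {"term": {"region": "EMEA"}}
```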

Q3: Why did abuse explode?
A: Early wins → viral adoption → 50 % of analysts were using the API within 2 weeks. Feature creep became a formal requirement – leadership mandated “no restrictions” to preserve velocity.
TASK
  1. Zero timeouts for any valid query.
  2. No long-lived HTTP connections (global concurrent users > 2 000).
  3. Incremental results – first page in < 2 s.
  4. Security & governance – audit every query, enforce retention, prevent storage bloat.
Q4: Why not just increase timeout to 300 s?
A: The dashboard platform hard-codes a 60 s gateway timeout, and UX degrades noticeably beyond a ~5 s wait.

Q5: Why not cache everything?
A: 90 % of queries were unique (different regex/time range), so the cache hit rate stayed below 5 %.
ACTION
Step 1 – Input validation & job creation
• POST → 202 Accepted + jobId + offset (round-robin from fixed pool of 1 000 offsets).
• Filter AST → query fingerprint (SHA-256) → deduplication cache (Redis, 5 min TTL).
• Identical fingerprints → reuse the same offset (instant share) – see the sketch below.
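
A minimal sketch of this job-creation path, assuming FastAPI for the HTTP tier and Redis for both the fingerprint cache and the round-robin counter; the route, key names, and response shape are illustrative stand-ins rather than the real service:

```python
# Sketch of Step 1: validate, fingerprint, deduplicate, assign an offset from the pool.
import hashlib
import json
import uuid

import redis
from fastapi import FastAPI

app = FastAPI()
r = redis.Redis()
POOL_SIZE = 1000        # fixed pool of result "offsets"
FPRINT_TTL_S = 300      # 5 min deduplication window

def fingerprint(filter_ast: dict) -> str:
    """Canonical JSON -> SHA-256, so identical filters map to the same job."""
    canonical = json.dumps(filter_ast, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

@app.post("/queries", status_code=202)
def create_query(filter_ast: dict):
    fp = fingerprint(filter_ast)
    cached = r.get(f"fp:{fp}")
    if cached:                                    # identical query already running
        job_id, offset = json.loads(cached)
        return {"jobId": job_id, "offset": offset, "deduplicated": True}

    job_id = str(uuid.uuid4())
    offset = r.incr("offset:rr") % POOL_SIZE      # round-robin assignment
    r.set(f"fp:{fp}", json.dumps([job_id, offset]), ex=FPRINT_TTL_S)
    # hand the job to a backend worker here (queueing mechanics omitted)
    return {"jobId": job_id, "offset": offset, "deduplicated": False}
```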

Step 2 – Backend micro-batches
• Worker executes the query in 10-row chunks → JSON → Kafka ResultTopic at the assigned offset.
• Each message carries jobId, chunkSeq, isLast – worker sketch below.
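
One liberty is worth flagging in the worker sketch: Kafka producers address partitions, not offsets, so the code treats the pooled number as the partition of a pre-partitioned ResultTopic. The chunk schema (jobId, chunkSeq, isLast) matches the description above; the topic layout, keying by jobId (used for compaction in Step 4), and client library are assumptions.

```python
# Sketch of the Step 2 worker: stream query results as 10-row JSON chunks.
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
CHUNK_SIZE = 10

def publish_results(job_id: str, partition: int, rows):
    """Publish one job's rows in fixed-size chunks; the final chunk sets isLast."""
    chunk, seq = [], 0
    for row in rows:
        chunk.append(row)
        if len(chunk) == CHUNK_SIZE:
            producer.send("ResultTopic", partition=partition, key=job_id.encode(),
                          value={"jobId": job_id, "chunkSeq": seq,
                                 "isLast": False, "rows": chunk})
            seq, chunk = seq + 1, []
    # the final (possibly short or empty) chunk carries the isLast marker
    producer.send("ResultTopic", partition=partition, key=job_id.encode(),
                  value={"jobId": job_id, "chunkSeq": seq,
                         "isLast": True, "rows": chunk})
    producer.flush()
```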

Step 3 – Consumer experience
• Dashboard connector polls from offset (or uses reactive client).
• First chunk arrives in a median of 1.1 s – connector sketch below.
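
The matching connector sketch uses manual assignment instead of a consumer group (the deterministic offset=42 addressing discussed in Q6 below) and reuses the partition-as-offset assumption and chunk schema from the Step 2 sketch.

```python
# Sketch of the Step 3 connector: read one job's chunks, no consumer group needed.
import json

from kafka import KafkaConsumer, TopicPartition  # kafka-python

def read_job(job_id: str, partition: int, bootstrap="kafka:9092"):
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap,
        value_deserializer=lambda b: None if b is None else json.loads(b.decode()),
        enable_auto_commit=False,              # stateless: nothing to commit
    )
    tp = TopicPartition("ResultTopic", partition)
    consumer.assign([tp])
    consumer.seek_to_beginning(tp)             # partition stays short (see Step 4)

    done = False
    while not done:
        for records in consumer.poll(timeout_ms=1000).values():
            for msg in records:
                if msg.value is None or msg.value["jobId"] != job_id:
                    continue                   # skip tombstones and other jobs
                yield msg.value["rows"]        # one 10-row page for the UI
                done = done or msg.value["isLast"]
    consumer.close()

# Usage: for page in read_job(job_id, offset): render(page)
```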

Step 4 – Governance
• Offset TTL = 5 min → auto-cleanup consumer → log compaction → offset returned to pool.
• Audit log: jobId, fingerprint, user, offset, row count – reaper sketch below.
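
One way the reaper could look, sketched under the same assumptions: lease bookkeeping lives in Redis, each lease JSON already carries the audit fields (jobId, fingerprint, user, offset, row count), tombstones keyed by jobId let log compaction purge expired chunks, and freed offsets go onto an explicit free list (the Step 1 sketch used a bare round-robin counter; either pooling strategy fits the description above).

```python
# Sketch of the Step 4 reaper: expire leases, tombstone chunks, audit, free the offset.
import json
import time

import redis
from kafka import KafkaProducer  # no value_serializer: None must stay a real tombstone

r = redis.Redis()
producer = KafkaProducer(bootstrap_servers="kafka:9092")
LEASE_TTL_S = 300  # 5 min

def reap_expired_jobs():
    now = time.time()
    # leases kept in a sorted set: member = lease JSON, score = expiry timestamp
    for raw in r.zrangebyscore("job:leases", 0, now):
        lease = json.loads(raw)
        # a null value keyed by jobId lets log compaction purge that job's chunks
        producer.send("ResultTopic", partition=lease["offset"],
                      key=lease["jobId"].encode(), value=None)
        r.rpush("offset:free", lease["offset"])                       # back to the pool
        r.rpush("audit:log", json.dumps({**lease, "reapedAt": now}))  # audit record
        r.zrem("job:leases", raw)
    producer.flush()
```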

Step 5 – POC & rollout
• 3-day POC: 100 concurrent 90 s queries → 100 % success.
• Load test: 2 500 concurrent users → CPU < 70 %, no socket exhaustion.
• Security sign-off: verified no injection path.
Q6: Why manual offset pool instead of consumer groups?
A: Consumer groups would require a unique group.id per dashboard instance, which the dashboard engine cannot coordinate. A fixed offset pool gives a deterministic address (offset=42) that can be embedded in dashboard config.

Q7: Why 10-row chunks?
A: Balances Kafka throughput (≤ 1 ms per message) and UI rendering (10 rows = one screen page).

Q8: Why 5 min TTL?
A: 95th percentile query finishes in 110 s; 5 min covers stragglers + network retries.

Q9: Why not Server-Sent Events (SSE)?
A: SSE = one TCP socket per dashboard → 2 000 dashboards = 2 000 long-lived sockets pinned to the HTTP tier → file-descriptor and epoll pressure. Kafka moves that load to the brokers.
RESULT
  • Timeouts eliminated – 0 % post-rollout.
  • First-chunk latency: 95th percentile 1.2 s.
  • Scalability: 200 → 2 500 concurrent users (12×) on same Kafka cluster.
  • Cost: +1 microservice (offset-manager, 2 vCPU), no new hardware.
  • Adoption: 3 additional teams adopted the pattern in < 90 days.
  • Governance: 100 % queries audited; zero security incidents.
  • Feature status: “Pseudo-streaming API” is now official product requirement in Analytics Roadmap 2026.
Q10: Any regression?
A: None. We added an SLA dashboard showing the chunk-latency distribution, which is used in quarterly reviews.

Q11: User abuse still possible?
A: Rate-limited to 1 job per 5 s per user, plus a max of 500 k rows per job – enforced in the API gateway.
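
As an illustration, a gateway-side admission check along those lines can stay tiny; the key naming and the row estimator are assumptions, not the real gateway policy.

```python
# Sketch of the Q11 limits: 1 job per 5 s per user, 500 k-row cap per job.
import redis

r = redis.Redis()
MAX_ROWS_PER_JOB = 500_000
MIN_INTERVAL_S = 5

def admit(user_id: str, estimated_rows: int) -> bool:
    """Return True if this user may start this job right now."""
    if estimated_rows > MAX_ROWS_PER_JOB:
        return False
    # SET NX EX acts as a one-slot token: key present = a job started < 5 s ago
    return bool(r.set(f"ratelimit:{user_id}", 1, nx=True, ex=MIN_INTERVAL_S))
```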

One-Liner for Resume / Interview

“Converted an abused ad-hoc REST API into a Kafka-mediated pseudo-streaming platform that scaled 12×, eliminated timeouts, and turned analyst flexibility into a governed enterprise feature – all while keeping HTTP stateless and injection-free.”
