From Ad-Hoc Abuse to Enterprise Pseudo-Streaming – A STAR Story
| STAR | Narrative (Gap-Free) | Critic’s Question & Reasoning (Why the Requirement Exists) |
|---|---|---|
| SITUATION | Business context – Internal analytics team needed real-time visibility into high-volume operational data via self-service dashboards. Initial contract – “≤ 2-day window, ≤ 1 000 rows” → sub-second REST GET. Reality – Users began POST-ing custom filters (dimensions, regex patterns, time ranges up to 90 days). Pain point – Queries took 30–180 s → dashboard gateway timeout (60 s) → blank screens. | Q1: Why allow custom filters at all? A: Flexibility was a hard stakeholder requirement from Day 1: the business wanted “self-service” without ticket-based schema changes. Q2: SQL injection risk? A: We never concatenated user input. All filters were parsed into a safe AST → translated to the search engine DSL. 100 % unit-test coverage + external pentest. Q3: Why did abuse explode? A: Early wins → viral adoption → 50 % of analysts using the API within 2 weeks. Feature creep became a formal requirement – leadership mandated “no restrictions” to preserve velocity. |
| TASK | Deliver results for 30–180 s queries inside the dashboard platform’s fixed 60 s gateway window, without restricting the self-service filters that leadership had made a formal requirement, and while keeping the HTTP tier stateless. | Q4: Why not just increase the timeout to 300 s? A: The dashboard platform hard-codes a 60 s gateway, and UX degrades beyond a 5 s wait. Q5: Why not cache everything? A: 90 % of queries were unique (different regex/time range), so the cache hit rate was < 5 %. |
| ACTION | Step 1 – Input validation & job creation • POST → 202 Accepted + jobId + offset (round-robin from a fixed pool of 1 000 offsets). • Filter AST → query fingerprint (SHA-256) → deduplication cache (Redis, 5 min TTL). • Identical fingerprints → reuse the same offset (instant share). Step 2 – Backend micro-batches • Worker executes in 10-row chunks → JSON → Kafka ResultTopic at the assigned offset. • Each message carries jobId, chunkSeq, isLast. Step 3 – Consumer experience • Dashboard connector polls from the offset (or uses a reactive client). • First chunk arrives in a median of 1.1 s. Step 4 – Governance • Offset TTL = 5 min → auto-cleanup consumer → log compaction → offset returned to the pool. • Audit log: jobId, fingerprint, user, offset, row count. Step 5 – POC & rollout • 3-day POC: 100 concurrent 90 s queries → 100 % success. • Load test: 2 500 concurrent users → CPU < 70 %, no socket exhaustion. • Security sign-off: verified no injection path. (Illustrative code sketches for Steps 1–3 follow the table.) | Q6: Why a manual offset pool instead of consumer groups? A: Consumer groups require a unique group.id per dashboard instance, which the dashboard engine cannot coordinate. A fixed offset pool gives a deterministic URL (offset=42) embeddable in config. Q7: Why 10-row chunks? A: Balances Kafka throughput (≤ 1 ms per message) against UI rendering (10 rows = one screen page). Q8: Why a 5 min TTL? A: The 95th-percentile query finishes in 110 s; 5 min covers stragglers plus network retries. Q9: Why not Server-Sent Events (SSE)? A: SSE = one TCP socket per dashboard → 2 000 dashboards = 2 000 open sockets → kernel-level socket/file-descriptor exhaustion on the gateway. Kafka moves that load to the brokers. |
| RESULT | Timeouts eliminated: the first chunk arrives in a median of 1.1 s instead of a blank screen at 60 s. Load-tested to 2 500 concurrent users at < 70 % CPU with no socket exhaustion – roughly 12× the original capacity – while analyst flexibility was preserved under governance (rate limits, per-job row caps, audit log). | Q10: Any regression? A: None. We added an SLA dashboard showing the chunk-latency distribution, now used in quarterly reviews. Q11: Is user abuse still possible? A: Rate-limited to 1 job per 5 s per user + a 500 k-row cap per job, enforced in the API gateway (see the rate-limit sketch after the table). |
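Implementation Sketches (Illustrative)
The original code is not public, so the following Python sketches only illustrate the shape of each step; library choices (redis, confluent_kafka), key names, pool handling, and function names are all assumptions, not the team’s actual implementation.
First, Step 1: fingerprint the already-validated filter AST (never raw user input) and deduplicate identical in-flight queries via Redis so they share one offset.

```python
# Sketch of Step 1: fingerprint + deduplication. Key names, pool handling,
# and the audit list are illustrative assumptions.
import hashlib
import itertools
import json
import uuid

import redis

r = redis.Redis()
OFFSET_POOL = itertools.cycle(range(1000))  # fixed pool of 1 000 offsets, round-robin
DEDUP_TTL_S = 300                           # 5 min TTL, as in the article

def fingerprint(filter_ast: dict) -> str:
    """SHA-256 over a canonical JSON form of the validated filter AST.
    The AST comes from the parser; raw input is never concatenated."""
    canonical = json.dumps(filter_ast, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def create_job(filter_ast: dict, user: str) -> dict:
    """Create (or reuse) a job for this query; the result is returned
    to the client alongside HTTP 202 Accepted."""
    fp = fingerprint(filter_ast)
    cached = r.get(f"job:fp:{fp}")
    if cached:
        return json.loads(cached)           # identical query already running: share its offset

    job = {"jobId": str(uuid.uuid4()), "offset": next(OFFSET_POOL), "fingerprint": fp}
    # SET NX EX guards against two identical queries racing for a new offset.
    if not r.set(f"job:fp:{fp}", json.dumps(job), nx=True, ex=DEDUP_TTL_S):
        return json.loads(r.get(f"job:fp:{fp}"))
    r.lpush("audit", json.dumps({"user": user, **job}))  # audit trail: user, jobId, offset, fingerprint
    return job
```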
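Step 2 could look roughly like the producer below. The topic name matches the article’s ResultTopic; modeling the “assigned offset” as a Kafka partition index is my assumption, since a producer cannot pick offsets directly.

```python
# Sketch of Step 2: a worker publishes query results as 10-row micro-batches.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
CHUNK_ROWS = 10  # one UI page per message, per the article

def stream_results(job: dict, rows: list) -> None:
    """Publish result rows as micro-batches keyed by jobId."""
    chunks = [rows[i:i + CHUNK_ROWS] for i in range(0, len(rows), CHUNK_ROWS)] or [[]]
    for seq, chunk in enumerate(chunks):
        msg = {
            "jobId": job["jobId"],
            "chunkSeq": seq,
            "isLast": seq == len(chunks) - 1,
            "rows": chunk,
        }
        producer.produce(
            "ResultTopic",
            key=job["jobId"].encode(),
            value=json.dumps(msg).encode(),
            partition=job["offset"],  # the article's "assigned offset", modeled as a partition (assumption)
        )
    producer.flush()
```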
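Step 3, the consumer side: a dashboard connector reads only its deterministic slot until it sees isLast. Again, treating the slot as a partition of ResultTopic, the group.id, and the timeout are illustrative assumptions.

```python
# Sketch of Step 3: poll chunks for one job from its assigned slot.
import json

from confluent_kafka import Consumer, TopicPartition

def poll_chunks(job_id: str, offset_slot: int, timeout_s: float = 1.0):
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": f"dashboard-{offset_slot}",  # illustrative; no coordinated consumer group is used
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    })
    # Deterministic assignment: read the single partition for this slot.
    consumer.assign([TopicPartition("ResultTopic", offset_slot)])
    try:
        while True:
            rec = consumer.poll(timeout_s)
            if rec is None or rec.error():
                continue
            chunk = json.loads(rec.value())
            if chunk["jobId"] != job_id:
                continue                 # another job sharing this slot
            yield chunk["rows"]
            if chunk["isLast"]:
                break
    finally:
        consumer.close()
```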
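Finally, the Q11 guardrail. The article places this in the API gateway; the Redis key scheme and return convention below are illustrative only.

```python
# Sketch of the Q11 rate limit: at most 1 job per user every 5 s,
# plus a 500 k-row cap per job.
import redis

r = redis.Redis()
MAX_ROWS_PER_JOB = 500_000

def allow_job(user: str, requested_rows: int) -> bool:
    """Return True if the job may be submitted, False if it should be rejected (e.g. HTTP 429)."""
    if requested_rows > MAX_ROWS_PER_JOB:
        return False
    # SET NX EX acts as a 5-second token: if the key already exists, the user just submitted a job.
    return bool(r.set(f"ratelimit:{user}", 1, nx=True, ex=5))
```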
One-Liner for Resume / Interview
“Converted an abused ad-hoc REST API into a Kafka-mediated pseudo-streaming platform that scaled 12×, eliminated timeouts, and turned analyst flexibility into a governed enterprise feature – all while keeping HTTP stateless and injection-free.”
Turn a blocking API into an event-driven firehose by letting Kafka do the “streaming” while the HTTP tier stays stateless.