Week 3 Capstone: Event-Driven Food Delivery Order System
A Real-World Problem Covering Everything You've Learned
This capstone integrates all Week 3 concepts into a single, realistic system design interview. You'll apply:
| Day | Concept | How It's Used |
|---|---|---|
| Day 1 | Queue vs Stream | Choosing the right messaging pattern for different event types |
| Day 2 | Transactional Outbox | Guaranteeing order events are never lost |
| Day 3 | Backpressure | Handling dinner rush without system collapse |
| Day 4 | Dead Letters | Managing failed orders and payment issues |
| Day 5 | Audit Log | Tracking every order state change for disputes |
The Interview Begins
You walk into the interview room at a fast-growing food delivery startup. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions β this is meant to be collaborative."
They write on the whiteboard:
+--------------------------------------------------------------------------+
|                                                                          |
|   Design an Event-Driven Order Fulfillment System                        |
|                                                                          |
|   Context:                                                               |
|   We're a food delivery platform operating in 50 cities. When a          |
|   customer places an order, we need to:                                  |
|                                                                          |
|   - Process the payment                                                  |
|   - Notify the restaurant                                                |
|   - Match and dispatch a driver                                          |
|   - Track the order through delivery                                     |
|   - Handle failures gracefully                                           |
|   - Maintain an audit trail for disputes                                 |
|                                                                          |
|   Key challenges:                                                        |
|   - Dinner rush: 10x normal traffic                                      |
|   - Payment/restaurant failures happen                                   |
|   - Customers dispute orders (need history)                              |
|   - Multiple services need order events                                  |
|                                                                          |
+--------------------------------------------------------------------------+
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial - never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, what's our current order volume and expected growth?"
Interviewer: "We process about 500,000 orders per day currently. During dinner rush (6-9 PM), we see about 60% of daily volume in those 3 hours. We're growing 100% year over year."
You: "So roughly 300,000 orders in 3 hours during peak β that's about 28 orders per second average, probably 50+ per second at the absolute peak. Got it. What about the order lifecycle? What states does an order go through?"
Interviewer: "An order goes through: placed β payment_processing β payment_confirmed β restaurant_notified β restaurant_accepted β driver_assigned β picked_up β delivered. It can also go to payment_failed, restaurant_rejected, or cancelled at various points."
You: "How many services need to react to order events? And do they all need the same events?"
Interviewer: "We have: Payment Service, Restaurant Service, Driver Matching, Notifications (push/SMS), Analytics, Customer Support dashboard, and a new Fraud Detection service. They have different needs β some need real-time, some can be slightly delayed."
You: "What about failures? If payment fails or the restaurant rejects, what's the expected behavior?"
Interviewer: "Payment failures should retry 3 times, then notify customer. Restaurant rejections should try to find an alternative restaurant or refund. We need to track all of this for customer disputes β we get about 2% of orders disputed."
You: "Last question β what's the latency requirement for the customer? How fast should they see their order confirmed?"
Interviewer: "Customer should see payment confirmed within 3 seconds. Restaurant notification should happen within 10 seconds. Full end-to-end from order to driver assignment should be under 2 minutes in normal conditions."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. ORDER LIFECYCLE MANAGEMENT
- Accept order placement from customers
- Process payments with retry logic
- Notify restaurants and handle acceptance/rejection
- Match and assign drivers
- Track order through pickup and delivery
- Handle cancellations at any stage
2. EVENT DISTRIBUTION
- Multiple services consume order events
- Different services need different event types
- Some services need real-time, others can be delayed
- Events must be reliably delivered
3. FAILURE HANDLING
- Payment retries with intelligent backoff
- Restaurant rejection β alternative matching or refund
- Driver no-show β reassignment
- All failures tracked and recoverable
4. AUDIT AND DISPUTE RESOLUTION
- Complete history of every order state change
- Who changed what, when, and why
- Queryable for customer support
- Retained for 2 years (legal requirement)
Non-Functional Requirements
1. SCALE
- 500,000 orders/day (current)
- 50 orders/second peak
- 1,000,000 orders/day (1 year target)
- 7 consumer services per order event
2. RELIABILITY
- Zero order loss (payment taken = order tracked)
- At-least-once delivery for all events
- Exactly-once processing for payments
3. LATENCY
- Payment confirmation: < 3 seconds (p99)
- Restaurant notification: < 10 seconds
- Driver assignment: < 2 minutes (p95)
- Audit query: < 500ms for recent orders
4. AVAILABILITY
- 99.9% uptime (8.7 hours downtime/year max)
- Graceful degradation during partial failures
- No data loss during outages
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale we're designing for."
Traffic Estimation
ORDER VOLUME
Current daily orders: 500,000
Peak hours (6-9 PM): 60% of daily = 300,000 orders
Peak duration: 3 hours = 10,800 seconds
Average during peak: 300,000 / 10,800 = ~28 orders/sec
Peak of peak (2x average): ~56 orders/sec
Design target (headroom): 100 orders/sec
EVENTS PER ORDER
Order lifecycle events:
- order.placed 1
- payment.processing 1
- payment.confirmed 1 (or payment.failed)
- restaurant.notified 1
- restaurant.accepted 1 (or restaurant.rejected)
- driver.assigned 1
- order.picked_up 1
- order.delivered 1
  ---------------------------------
  Total: ~8 events per order
Events per second (peak): 100 orders × 8 events = 800 events/sec
With consumer fanout (7x): 800 × 7 = 5,600 event deliveries/sec
Storage Estimation
EVENT STORAGE
Event size (average):
- Event metadata: 200 bytes
- Order snapshot: 500 bytes
- Actor/context: 150 bytes
- Total per event: ~850 bytes ≈ 1 KB
Daily events: 500,000 orders × 8 = 4,000,000 events
Daily storage: 4M × 1 KB = 4 GB/day
Monthly storage: 4 GB × 30 = 120 GB/month
2-year retention: 120 GB × 24 = 2.88 TB
With replication (3x): ~9 TB total
AUDIT LOG STORAGE
Audit events (more detailed): ~2 KB per event
Daily audit storage: 4M × 2 KB = 8 GB/day
2-year audit retention: 8 GB × 730 = 5.8 TB
Key Metrics Summary
ESTIMATION SUMMARY

TRAFFIC
- Peak orders:      100/second (design target)
- Peak events:      800/second
- Peak deliveries:  5,600/second (with fanout)

STORAGE
- Events per day:   4 GB
- Audit per day:    8 GB
- 2-year total:     ~15 TB (with replication)

INFRASTRUCTURE (rough)
- Kafka partitions:    32 (for parallelism)
- Consumer instances:  7 services × 4 instances = 28
- Database:            PostgreSQL + TimescaleDB
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture."
System Architecture
HIGH-LEVEL ARCHITECTURE

  Customer (Mobile/Web)
          |
          v
    API Gateway -------> Order Service
                              |
                 +------------+------------+
                 |      Same Transaction   |
                 v                         v
           Orders Table             Outbox Table   <-- Day 2: Transactional Outbox
                                          |
                                          | CDC (Debezium)
                                          v
  +----------------------------- Kafka Cluster -----------------------------+
  |                                                                         |
  |   orders.events      orders.retry      orders.dlq      audit.events     |
  |         |                  ^                ^                |          |
  +---------|------------------|----------------|----------------|----------+
            |                  +---- Day 4: Dead Letters         |
            v                                                    v
  +---------------- Consumer Group ----------------+       Audit Consumer
  |                                                |             |
  |   Payment      Restaurant      Driver          |             |
  |   Service      Service         Matcher         |             |
  |                                                |             |
  |   Notific-     Analytics       Fraud           |             |
  |   ations                       Detection       |             |
  |                                                |             |
  |   Day 3: Backpressure Handling                 |             |
  +------------------------+-----------------------+             |
                           |                                     |
                           v                                     v
                   Service Databases             Audit Store (TimescaleDB) <-- Day 5

  Day 1: Queue vs Stream (Kafka for multi-consumer replay)
Component Breakdown
You: "Let me walk through each component..."
1. Order Service
Purpose: Accepts orders, orchestrates the lifecycle, maintains order state.
Key responsibilities:
- Validate incoming orders
- Create order record + outbox event in single transaction
- Expose order status API
- Handle cancellations
Why transactional outbox: If we write order to database and separately publish to Kafka, we risk inconsistency. The outbox pattern (Day 2) guarantees the event is published if and only if the order is saved.
2. Kafka Cluster
Purpose: Reliable event distribution to all consumers.
Topics:
- orders.events - Main order lifecycle events (32 partitions)
- orders.retry - Delayed retry for transient failures
- orders.dlq - Dead letter queue for investigation
- audit.events - Separate topic for audit trail
Why Kafka (not RabbitMQ): Day 1 taught us Kafka is better when:
- Multiple consumers need the same events (7 services)
- We need replay capability (debugging, reprocessing)
- We need ordering per order (partition by order_id)
- High throughput with durability
3. Consumer Services
Purpose: React to order events and perform their specific functions.
Services and their patterns:
- Payment Service: Needs exactly-once, uses idempotency keys
- Restaurant Service: Needs ordering per restaurant, partitioned accordingly
- Driver Matcher: Can handle eventual consistency, more latency-tolerant
- Notifications: Fire-and-forget for push, retry for SMS
- Analytics: Batch-friendly, can lag behind
- Fraud Detection: Needs real-time, dedicated consumer group
4. Audit Store
Purpose: Immutable record of all order state changes for disputes.
Technology: TimescaleDB (PostgreSQL extension for time-series)
Why: Day 5 taught us audit logs need fast time-range queries, immutability, and long retention. TimescaleDB gives us automatic partitioning and compression.
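As a sketch of what that looks like in practice - the table layout mirrors the AuditEvent fields used later in the implementation, while the compression window and index names are assumptions, not stated requirements:

-- Hypothetical audit_events hypertable (illustrative, not the production schema)
CREATE TABLE audit_events (
    event_id       TEXT        NOT NULL,
    order_id       TEXT        NOT NULL,
    action         TEXT        NOT NULL,
    actor_type     TEXT        NOT NULL,
    actor_id       TEXT        NOT NULL,
    previous_state TEXT,
    new_state      TEXT        NOT NULL,
    details        JSONB,
    timestamp      TIMESTAMPTZ NOT NULL
);

-- Partition by time; TimescaleDB manages chunks automatically
SELECT create_hypertable('audit_events', 'timestamp');

-- Support queries are almost always "all events for one order, in order"
CREATE INDEX idx_audit_order ON audit_events (order_id, timestamp);

-- Compress older chunks to control the multi-TB retention cost
ALTER TABLE audit_events SET (timescaledb.compress);
SELECT add_compression_policy('audit_events', INTERVAL '7 days');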
Data Flow
You: "Let me trace through a typical order placement..."
ORDER PLACEMENT FLOW

Step 1: Customer places order
        Mobile App --> API Gateway --> Order Service

Step 2: Order Service creates order + outbox event
        +------------------------------------------+
        | BEGIN TRANSACTION                        |
        |   INSERT INTO orders (...)               |
        |   INSERT INTO outbox (order.placed)      |
        | COMMIT                                   |
        +------------------------------------------+
        Response returned to customer immediately

Step 3: CDC publishes outbox to Kafka
        Outbox --> Debezium --> Kafka (orders.events)
        Latency: ~100ms

Step 4: Payment Service consumes event
        Kafka --> Payment Consumer --> Process Payment
          |- Success: Publish payment.confirmed
          '- Failure: Retry or publish payment.failed

Step 5: Restaurant Service consumes payment.confirmed
        Kafka --> Restaurant Consumer --> Notify Restaurant
          |- Accepted: Publish restaurant.accepted
          '- Rejected: Publish restaurant.rejected

Step 6: Driver Matcher consumes restaurant.accepted
        Kafka --> Driver Matcher --> Find & Assign Driver
                  Publish driver.assigned

Step 7: All events also consumed by Audit Service
        Every event --> Audit Consumer --> TimescaleDB
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into a few areas. Tell me more about how you'd handle the event publishing reliably."
Deep Dive 1: Transactional Outbox Pattern (Day 2)
You: "This is critical for our system. The challenge is: how do we ensure an order is saved AND its event is published atomically? Let me explain the problem first."
The Problem
THE DUAL-WRITE PROBLEM
Without outbox pattern:
Order Service:
1. Save order to database -> Success
2. Publish to Kafka -> Failure (network issue)
Result:
- Order exists in database
- No event published
- Payment never charged
- Restaurant never notified
- Customer waiting forever
Or worse:
1. Publish to Kafka -> Success
2. Save order to database -> Failure
Result:
- Event published
- Order not in database
- Payment charged, but no order record!
- Refund nightmare
The Solution
You: "I'd use the transactional outbox pattern from Day 2. Here's how it works:"
TRANSACTIONAL OUTBOX SOLUTION
Order Service:
BEGIN TRANSACTION
INSERT INTO orders (id, customer_id, items, status, ...)
INSERT INTO outbox (event_type, payload, created_at)
COMMIT
Both writes succeed or both fail - guaranteed by database ACID.
Separately, Outbox Publisher:
1. Poll outbox table (or use CDC)
2. Publish events to Kafka
3. Mark as published (or delete)
If publisher crashes:
- Unpublished events remain in outbox
- Publisher resumes on restart
- No events lost
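A sketch of the outbox table this relies on - the columns match the INSERT statements in the implementation below, while the partial index is an assumption about how the publisher polls:

-- Outbox table: written in the same transaction as the order
CREATE TABLE outbox (
    id             BIGSERIAL PRIMARY KEY,
    event_id       TEXT        NOT NULL UNIQUE,
    event_type     TEXT        NOT NULL,
    aggregate_id   TEXT        NOT NULL,   -- order_id
    aggregate_type TEXT        NOT NULL,   -- 'order'
    payload        JSONB       NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL,
    published_at   TIMESTAMPTZ             -- NULL until published to Kafka
);

-- The publisher polls unpublished rows in creation order
CREATE INDEX idx_outbox_unpublished
    ON outbox (created_at)
    WHERE published_at IS NULL;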
Implementation
# Transactional Outbox Implementation for Order Service
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict, Any
import json
import uuid
@dataclass
class OrderEvent:
"""Event to be published via outbox."""
event_id: str
event_type: str
order_id: str
payload: Dict[str, Any]
created_at: datetime
class OrderService:
"""
Order service with transactional outbox.
Applies Day 2: Transactional Outbox Pattern.
"""
def __init__(self, db_pool, audit_client):
self.db = db_pool
self.audit = audit_client
async def place_order(
self,
customer_id: str,
restaurant_id: str,
items: list,
delivery_address: dict,
payment_method_id: str
) -> dict:
"""
Place a new order.
CRITICAL: Order creation and event publishing happen in
the same database transaction. This guarantees:
- If order is saved, event WILL be published
- If transaction fails, neither is persisted
"""
order_id = f"ord_{uuid.uuid4().hex[:12]}"
# Calculate totals
subtotal = sum(item['price'] * item['quantity'] for item in items)
delivery_fee = 4.99
total = subtotal + delivery_fee
async with self.db.transaction() as tx:
# 1. Create the order
await tx.execute("""
INSERT INTO orders (
id, customer_id, restaurant_id, items,
delivery_address, payment_method_id,
subtotal, delivery_fee, total,
status, created_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
""",
order_id, customer_id, restaurant_id, json.dumps(items),
json.dumps(delivery_address), payment_method_id,
subtotal, delivery_fee, total,
'placed', datetime.utcnow()
)
# 2. Write event to outbox (SAME TRANSACTION)
event = OrderEvent(
event_id=f"evt_{uuid.uuid4().hex[:12]}",
event_type="order.placed",
order_id=order_id,
payload={
"order_id": order_id,
"customer_id": customer_id,
"restaurant_id": restaurant_id,
"items": items,
"total": total,
"payment_method_id": payment_method_id,
"delivery_address": delivery_address
},
created_at=datetime.utcnow()
)
await tx.execute("""
INSERT INTO outbox (
event_id, event_type, aggregate_id,
aggregate_type, payload, created_at
) VALUES ($1, $2, $3, $4, $5, $6)
""",
event.event_id, event.event_type, order_id,
'order', json.dumps(event.payload), event.created_at
)
            # 3. Record the audit event. For true atomicity this must also
            # ride the same transaction (e.g. as a second outbox row); here
            # the audit client is assumed to write through `tx`.
            await self.audit.log(
                action="order.placed",
                resource_type="order",
                resource_id=order_id,
                actor_id=customer_id,
                details={"total": total, "item_count": len(items)}
            )
        # Transaction committed - order and event are guaranteed persisted
return {"order_id": order_id, "status": "placed", "total": total}
class OutboxPublisher:
"""
Background process that publishes outbox events to Kafka.
Uses CDC (Debezium) for low-latency, or polling as fallback.
"""
def __init__(self, db_pool, kafka_producer):
self.db = db_pool
self.kafka = kafka_producer
async def publish_pending(self, batch_size: int = 100) -> int:
"""Publish pending outbox events."""
# Fetch unpublished events
rows = await self.db.fetch("""
SELECT id, event_id, event_type, aggregate_id, payload
FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT $1
FOR UPDATE SKIP LOCKED
""", batch_size)
if not rows:
return 0
published_ids = []
for row in rows:
# Partition by order_id for ordering guarantee
partition_key = row['aggregate_id'].encode('utf-8')
await self.kafka.send_and_wait(
'orders.events',
key=partition_key,
value=row['payload'].encode('utf-8'),
headers=[
('event_type', row['event_type'].encode()),
('event_id', row['event_id'].encode())
]
)
published_ids.append(row['id'])
# Mark as published
await self.db.execute("""
UPDATE outbox
SET published_at = NOW()
WHERE id = ANY($1)
""", published_ids)
return len(published_ids)
Edge Cases Handled
Interviewer: "What if the outbox publisher crashes after publishing to Kafka but before marking as published?"
You: "Great question. The event gets published again when the publisher restarts. This is why consumers must be idempotent β they use the event_id to deduplicate. At-least-once delivery is the guarantee; exactly-once is the consumer's responsibility.
For payments specifically, we use idempotency keys tied to the order_id. Stripe won't charge twice for the same idempotency key."
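A minimal sketch of that consumer-side deduplication, assuming a processed_events table with a unique event_id and a Stripe-like gateway client that accepts an idempotency key (both names are illustrative, not production code):

# Idempotent payment processing - a sketch, not the full consumer
import json

class PaymentProcessor:
    def __init__(self, db_pool, payment_gateway):
        self.db = db_pool
        self.gateway = payment_gateway  # assumed Stripe-like client

    async def process(self, message):
        event = json.loads(message.value)
        event_id = event['event_id']
        order_id = event['order_id']

        async with self.db.transaction() as tx:
            # Dedup on event_id: the INSERT returns nothing if this
            # event was already processed by an earlier delivery
            inserted = await tx.fetchval("""
                INSERT INTO processed_events (event_id)
                VALUES ($1)
                ON CONFLICT (event_id) DO NOTHING
                RETURNING event_id
            """, event_id)
            if inserted is None:
                return  # duplicate delivery - already handled

            # The idempotency key ties the charge to the order, so even
            # a retried call cannot charge twice at the gateway
            await self.gateway.charge(
                amount=event['total'],
                payment_method=event['payment_method_id'],
                idempotency_key=f"charge_{order_id}",
            )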
Deep Dive 2: Backpressure During Dinner Rush (Day 3)
Interviewer: "You mentioned dinner rush is 10x normal traffic. How do you prevent the system from collapsing?"
You: "This is where Day 3's backpressure patterns come in. Let me show you the strategy."
The Problem
DINNER RUSH SCENARIO
Normal traffic: 5 orders/second
Dinner rush: 50 orders/second (10x)
Super peak: 100 orders/second
What can fail:
- Payment service overwhelmed -> timeouts
- Restaurant service backed up -> notifications delayed
- Driver matching overloaded -> long wait times
- Database connections exhausted
- Kafka consumers fall behind
Symptoms:
- Consumer lag growing
- Processing latency increasing
- Memory pressure on consumers
- Eventual OOM crashes
- Cascading failures
The Solution
You: "I'd implement a multi-layer backpressure strategy:"
BACKPRESSURE LAYERS
Layer 1: DETECTION
Monitor consumer lag, queue depth, processing latency
Thresholds: Warning at 10K lag, Critical at 50K
Layer 2: RATE LIMITING AT INGESTION
API Gateway limits orders/second per city
Protects downstream from unbounded load
Layer 3: PRIORITY QUEUES
Critical: Payment confirmations, driver assignments
Normal: Analytics, notifications
Low: Marketing events, non-urgent updates
Layer 4: ADAPTIVE BATCHING
Low load: Process immediately
High load: Batch for efficiency
Layer 5: LOAD SHEDDING (last resort)
Never shed payment or critical events
Can delay analytics, marketing
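Layer 2 isn't covered by the consumer code below, so here is a minimal token-bucket sketch for per-city rate limiting at the gateway; the rate and burst numbers are placeholder assumptions:

# Token-bucket rate limiter for Layer 2 (per-city order ingestion)
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # burst allowance
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per city; the gateway rejects with 429 when empty
city_buckets: dict = {}

def check_order_allowed(city_id: str, rate: float = 20, burst: float = 40) -> bool:
    bucket = city_buckets.setdefault(city_id, TokenBucket(rate, burst))
    return bucket.allow()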
Implementation
# Backpressure-Aware Consumer
# Applies Day 3: Backpressure and Flow Control
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import asyncio
import time
class Priority(Enum):
CRITICAL = 1 # Payments, order state changes
HIGH = 2 # Restaurant notifications, driver assignment
NORMAL = 3 # Customer notifications
LOW = 4 # Analytics, marketing
class BackpressureLevel(Enum):
NONE = "none"
WARNING = "warning"
CRITICAL = "critical"
EMERGENCY = "emergency"
@dataclass
class BackpressureConfig:
"""Configuration for backpressure handling."""
lag_warning_threshold: int = 10_000
lag_critical_threshold: int = 50_000
lag_emergency_threshold: int = 100_000
# Batch sizes by pressure level
batch_size_normal: int = 10
batch_size_warning: int = 50
batch_size_critical: int = 200
class BackpressureAwareConsumer:
"""
Kafka consumer with backpressure detection and response.
Key behaviors:
- Monitors consumer lag continuously
- Adjusts batch size based on pressure
- Sheds low-priority work under extreme pressure
- Never sheds critical events
"""
def __init__(
self,
kafka_consumer,
processor,
config: BackpressureConfig,
metrics_client
):
self.consumer = kafka_consumer
self.processor = processor
self.config = config
self.metrics = metrics_client
self._current_level = BackpressureLevel.NONE
self._current_batch_size = config.batch_size_normal
async def start(self):
"""Main consumer loop with backpressure handling."""
# Start background monitoring
asyncio.create_task(self._monitor_loop())
batch = []
async for message in self.consumer:
batch.append(message)
if len(batch) >= self._current_batch_size:
await self._process_batch(batch)
batch = []
async def _monitor_loop(self):
"""Continuously monitor lag and adjust strategy."""
while True:
lag = await self._get_consumer_lag()
# Determine pressure level
if lag >= self.config.lag_emergency_threshold:
new_level = BackpressureLevel.EMERGENCY
elif lag >= self.config.lag_critical_threshold:
new_level = BackpressureLevel.CRITICAL
elif lag >= self.config.lag_warning_threshold:
new_level = BackpressureLevel.WARNING
else:
new_level = BackpressureLevel.NONE
# Adjust batch size
if new_level != self._current_level:
self._current_level = new_level
self._adjust_batch_size()
self.metrics.gauge('consumer.backpressure_level', new_level.value)
await asyncio.sleep(1)
def _adjust_batch_size(self):
"""Adjust batch size based on pressure level."""
if self._current_level == BackpressureLevel.NONE:
self._current_batch_size = self.config.batch_size_normal
elif self._current_level == BackpressureLevel.WARNING:
self._current_batch_size = self.config.batch_size_warning
else: # CRITICAL or EMERGENCY
self._current_batch_size = self.config.batch_size_critical
async def _process_batch(self, messages: list):
"""Process a batch of messages with priority awareness."""
        # Under emergency pressure, filter by priority
        if self._current_level == BackpressureLevel.EMERGENCY:
            original_count = len(messages)
            messages = self._filter_by_priority(messages, min_priority=Priority.HIGH)
            self.metrics.increment('consumer.messages_shed',
                                   original_count - len(messages))
# Process remaining messages
for msg in messages:
try:
await self.processor.process(msg)
except Exception as e:
# Send to DLQ (Day 4)
await self._send_to_dlq(msg, e)
await self.consumer.commit()
def _filter_by_priority(self, messages: list, min_priority: Priority) -> list:
"""Filter messages by priority. Never filter CRITICAL."""
return [
msg for msg in messages
if self._get_priority(msg).value <= min_priority.value
]
    def _get_priority(self, message) -> Priority:
        """Determine message priority from event type."""
        # Assumes the consumer wrapper exposes headers as a dict;
        # raw Kafka clients return a list of (key, value) tuples.
        event_type = message.headers.get('event_type', b'').decode()
# CRITICAL: Never shed
if event_type in ('payment.confirmed', 'payment.failed',
'order.placed', 'order.cancelled'):
return Priority.CRITICAL
# HIGH: Only shed in emergency
if event_type in ('restaurant.notified', 'driver.assigned'):
return Priority.HIGH
# NORMAL: Shed when critical
if event_type in ('notification.sent', 'driver.location_update'):
return Priority.NORMAL
# LOW: Shed when warning or above
return Priority.LOW
async def _get_consumer_lag(self) -> int:
"""Get current consumer lag."""
# Implementation depends on Kafka client
return await self.consumer.get_lag()
async def _send_to_dlq(self, message, error: Exception):
"""Send failed message to dead letter queue (Day 4)."""
# Implementation in Deep Dive 3
pass
Scaling Response
You: "Beyond the consumer-level backpressure, we also have infrastructure-level responses:"
SCALING RESPONSES BY PRESSURE LEVEL

WARNING (lag > 10K):
- Alert ops team (Slack)
- Increase batch sizes
- Pre-warm additional consumer instances

CRITICAL (lag > 50K):
- Auto-scale consumers (Kubernetes HPA)
    kubectl scale deployment order-consumer --replicas=8
- Shed LOW priority events
- Page on-call engineer

EMERGENCY (lag > 100K):
- Max consumer replicas
- Shed NORMAL and LOW priority
- Enable API rate limiting (upstream)
- Consider temporary feature degradation
Deep Dive 3: Dead Letter Queue for Failed Orders (Day 4)
Interviewer: "What happens when an order can't be processed? Payment fails, restaurant rejects, etc.?"
You: "Day 4's dead letter queue pattern is essential here. Let me show you how we handle failures without losing orders."
The Problem
ORDER FAILURE SCENARIOS
1. PAYMENT FAILURES
- Card declined
- Insufficient funds
- Payment gateway timeout
- Fraud detected
2. RESTAURANT FAILURES
- Restaurant closed unexpectedly
- Item out of stock
- Restaurant rejects order
- No response (timeout)
3. PROCESSING FAILURES
- Invalid message format (poison pill)
- Database temporarily unavailable
- Downstream service down
- Bug in consumer code
What we CANNOT do:
- Lose the order (customer already waiting)
- Retry forever (wastes resources)
- Block other orders (one failure shouldn't affect all)
The Solution
DLQ STRATEGY FOR ORDERS

                orders.events
                      |
                      v
                Order Consumer
                      |
               +------+------+
               |             |
            Success       Failure
               |             |
               v             v
            Commit        Classify
                             |
           +-----------------+-----------------+
           |                 |                 |
       Transient         Permanent          Poison
        (retry)           (route)        (quarantine)
           |                 |                 |
           v                 v                 v
     orders.retry      Compensation     orders.dlq.poison
           |               Logic               |
      Wait & retry      +----+----+      Immediate alert
           |            |         |
           v          Notify   Alternative
       Main topic    customer   restaurant

RETRY TOPICS WITH DELAYS:
  orders.retry.1m   (1 minute delay)
  orders.retry.5m   (5 minute delay)
  orders.retry.15m  (15 minute delay)

After 3 retries: --> orders.dlq
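Kafka has no native delayed delivery, so each retry topic needs a small forwarder that waits out the delay before republishing to the main topic. A sketch, assuming an aiokafka-style consumer whose message.timestamp is epoch milliseconds:

# Retry-topic forwarder: run one instance per delay tier (e.g. orders.retry.5m)
import asyncio
import time

class RetryForwarder:
    def __init__(self, consumer, producer, delay_seconds: int):
        self.consumer = consumer   # subscribed to a single retry topic
        self.producer = producer
        self.delay = delay_seconds

    async def run(self):
        async for message in self.consumer:
            # Kafka preserves per-partition order, so waiting on the head
            # message also delays everything behind it by at least `delay`
            ready_at = message.timestamp / 1000 + self.delay
            wait = ready_at - time.time()
            if wait > 0:
                await asyncio.sleep(wait)

            # Republish to the main topic for normal processing
            await self.producer.send_and_wait(
                'orders.events',
                key=message.key,
                value=message.value,
            )
            await self.consumer.commit()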
Implementation
# Dead Letter Queue Handler for Orders
# Applies Day 4: Dead Letters and Poison Pills
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict, Any
from enum import Enum
import json
import traceback
class ErrorCategory(Enum):
TRANSIENT = "transient" # Retry will likely help
PERMANENT = "permanent" # Retry won't help, need action
POISON = "poison" # Crashes consumer, quarantine
class FailureType(Enum):
# Payment failures
PAYMENT_DECLINED = "payment_declined"
PAYMENT_TIMEOUT = "payment_timeout"
PAYMENT_FRAUD = "payment_fraud"
# Restaurant failures
RESTAURANT_CLOSED = "restaurant_closed"
RESTAURANT_REJECTED = "restaurant_rejected"
ITEM_UNAVAILABLE = "item_unavailable"
# Processing failures
INVALID_MESSAGE = "invalid_message"
SERVICE_UNAVAILABLE = "service_unavailable"
UNKNOWN = "unknown"
@dataclass
class DeadLetter:
"""A failed order event with context for investigation."""
event_id: str
order_id: str
event_type: str
original_payload: Dict[str, Any]
failure_type: FailureType
error_category: ErrorCategory
error_message: str
stack_trace: Optional[str]
retry_count: int
first_failed_at: datetime
last_failed_at: datetime
# For resolution
resolved: bool = False
resolution: Optional[str] = None
resolved_at: Optional[datetime] = None
class OrderDLQHandler:
"""
Handles failed order events with intelligent routing.
Key behaviors:
- Classifies errors into transient/permanent/poison
- Transient: Retry with exponential backoff
- Permanent: Route to compensation logic
- Poison: Quarantine immediately
"""
# Error classification rules
TRANSIENT_ERRORS = [
FailureType.PAYMENT_TIMEOUT,
FailureType.SERVICE_UNAVAILABLE,
]
PERMANENT_ERRORS = [
FailureType.PAYMENT_DECLINED,
FailureType.PAYMENT_FRAUD,
FailureType.RESTAURANT_CLOSED,
FailureType.RESTAURANT_REJECTED,
FailureType.ITEM_UNAVAILABLE,
]
POISON_ERRORS = [
FailureType.INVALID_MESSAGE,
]
def __init__(
self,
kafka_producer,
db_pool,
notification_service,
config
):
self.kafka = kafka_producer
self.db = db_pool
self.notifications = notification_service
self.config = config
async def handle_failure(
self,
event: Dict[str, Any],
error: Exception,
retry_count: int = 0
) -> None:
"""
Handle a failed order event.
Routes to appropriate handling based on error type.
"""
# Classify the error
failure_type = self._classify_error(error)
category = self._get_category(failure_type)
order_id = event.get('order_id', 'unknown')
if category == ErrorCategory.POISON:
# POISON: Quarantine immediately, never retry
await self._quarantine_poison(event, error, failure_type)
await self._alert_oncall(event, error, "Poison pill detected")
elif category == ErrorCategory.PERMANENT:
# PERMANENT: Don't retry, execute compensation
await self._handle_permanent_failure(event, error, failure_type)
elif retry_count < self.config.max_retries:
# TRANSIENT: Retry with backoff
await self._schedule_retry(event, retry_count + 1)
else:
# Exhausted retries: Move to DLQ
await self._send_to_dlq(event, error, failure_type, retry_count)
await self._notify_customer_failure(order_id)
def _classify_error(self, error: Exception) -> FailureType:
"""Classify exception into failure type."""
error_str = str(error).lower()
# Payment errors
if 'card_declined' in error_str:
return FailureType.PAYMENT_DECLINED
if 'timeout' in error_str and 'payment' in error_str:
return FailureType.PAYMENT_TIMEOUT
if 'fraud' in error_str:
return FailureType.PAYMENT_FRAUD
# Restaurant errors
if 'restaurant_closed' in error_str:
return FailureType.RESTAURANT_CLOSED
if 'rejected' in error_str:
return FailureType.RESTAURANT_REJECTED
if 'out_of_stock' in error_str or 'unavailable' in error_str:
return FailureType.ITEM_UNAVAILABLE
# Processing errors
if isinstance(error, (json.JSONDecodeError, KeyError, TypeError)):
return FailureType.INVALID_MESSAGE
if 'service unavailable' in error_str or '503' in error_str:
return FailureType.SERVICE_UNAVAILABLE
return FailureType.UNKNOWN
def _get_category(self, failure_type: FailureType) -> ErrorCategory:
"""Get error category for failure type."""
if failure_type in self.TRANSIENT_ERRORS:
return ErrorCategory.TRANSIENT
if failure_type in self.PERMANENT_ERRORS:
return ErrorCategory.PERMANENT
if failure_type in self.POISON_ERRORS:
return ErrorCategory.POISON
return ErrorCategory.TRANSIENT # Default to retry for unknown
async def _schedule_retry(self, event: Dict, retry_count: int):
"""Schedule retry with exponential backoff."""
# Determine delay based on retry count
delays = ['1m', '5m', '15m']
delay = delays[min(retry_count - 1, len(delays) - 1)]
retry_topic = f"orders.retry.{delay}"
# Add retry metadata
event['_retry_count'] = retry_count
event['_retry_at'] = datetime.utcnow().isoformat()
await self.kafka.send(
retry_topic,
key=event['order_id'].encode(),
value=json.dumps(event).encode()
)
async def _handle_permanent_failure(
self,
event: Dict,
error: Exception,
failure_type: FailureType
):
"""Handle permanent failures with compensation logic."""
order_id = event.get('order_id')
if failure_type == FailureType.PAYMENT_DECLINED:
# Notify customer, mark order as payment_failed
await self._update_order_status(order_id, 'payment_failed')
await self.notifications.send(
user_id=event['customer_id'],
template='payment_declined',
data={'order_id': order_id}
)
elif failure_type == FailureType.RESTAURANT_REJECTED:
# Try alternative restaurant or refund
alternative = await self._find_alternative_restaurant(event)
if alternative:
await self._reroute_order(order_id, alternative)
else:
await self._refund_order(order_id)
elif failure_type == FailureType.ITEM_UNAVAILABLE:
# Notify customer, offer alternatives
await self.notifications.send(
user_id=event['customer_id'],
template='item_unavailable',
data={'order_id': order_id, 'items': event.get('unavailable_items')}
)
# Always record in DLQ for tracking
await self._send_to_dlq(event, error, failure_type, 0, resolved=True)
async def _quarantine_poison(
self,
event: Dict,
error: Exception,
failure_type: FailureType
):
"""Quarantine poison pill - never retry."""
await self.kafka.send(
'orders.dlq.poison',
key=event.get('order_id', 'unknown').encode(),
value=json.dumps({
'event': event,
'error': str(error),
'stack_trace': traceback.format_exc(),
'quarantined_at': datetime.utcnow().isoformat()
}).encode()
)
async def _send_to_dlq(
self,
event: Dict,
error: Exception,
failure_type: FailureType,
retry_count: int,
resolved: bool = False
):
"""Send to dead letter queue for investigation."""
dead_letter = DeadLetter(
event_id=event.get('event_id', 'unknown'),
order_id=event.get('order_id', 'unknown'),
event_type=event.get('event_type', 'unknown'),
original_payload=event,
failure_type=failure_type,
error_category=self._get_category(failure_type),
error_message=str(error),
stack_trace=traceback.format_exc(),
retry_count=retry_count,
first_failed_at=datetime.utcnow(),
last_failed_at=datetime.utcnow(),
resolved=resolved
)
await self.kafka.send(
'orders.dlq',
key=event.get('order_id', 'unknown').encode(),
value=json.dumps(dead_letter.__dict__, default=str).encode()
)
Deep Dive 4: Audit Trail for Disputes (Day 5)
Interviewer: "You mentioned 2% of orders are disputed. How do you support customer support investigations?"
You: "This is where Day 5's audit log pattern comes in. Every order state change is immutably recorded for investigation."
The Problem
DISPUTE SCENARIOS
Customer: "I was charged but never got my food!"
Support needs to answer:
- Did the order actually go through?
- Was payment confirmed?
- Did the restaurant receive the notification?
- Was a driver assigned?
- What happened at each step?
- Who/what caused the failure?
Without audit log:
- Check order table: only current state
- Check logs: distributed across services, hard to correlate
- Check payments: separate system, might be inconsistent
Result: "We can't determine what happened"
Customer: Unhappy, posts on social media
Company: Reputation damage, potential lawsuit
With audit log:
- Query: SELECT * FROM audit_events WHERE order_id = 'xyz' ORDER BY timestamp
- See complete timeline of every state change
- See who/what triggered each change
- Resolve in minutes, not hours
The Solution
AUDIT LOG ARCHITECTURE

Every service publishes audit events:

  Order Service ---+---> orders.events (for processing)
                   +---> audit.events  (for audit trail)

Audit events are:
- Immutable (no UPDATE, no DELETE)
- Complete (all context included)
- Queryable (indexed by order_id, actor, time)
- Retained (2 years for legal)

Audit Consumer:

  audit.events ---> Audit Consumer ---> TimescaleDB
                                             |
                                   +---------+---------+
                                   |                   |
                                90 days             Archive
                              (hot query)         (S3 Parquet)
Implementation
# Audit Log System for Order Disputes
# Applies Day 5: Audit Log System
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional, Dict, Any, List
from enum import Enum
import json
import hashlib
import uuid
class AuditAction(Enum):
# Order lifecycle
ORDER_PLACED = "order.placed"
ORDER_CANCELLED = "order.cancelled"
ORDER_COMPLETED = "order.completed"
# Payment
PAYMENT_INITIATED = "payment.initiated"
PAYMENT_CONFIRMED = "payment.confirmed"
PAYMENT_FAILED = "payment.failed"
PAYMENT_REFUNDED = "payment.refunded"
# Restaurant
RESTAURANT_NOTIFIED = "restaurant.notified"
RESTAURANT_ACCEPTED = "restaurant.accepted"
RESTAURANT_REJECTED = "restaurant.rejected"
# Driver
DRIVER_ASSIGNED = "driver.assigned"
DRIVER_PICKED_UP = "driver.picked_up"
DRIVER_DELIVERED = "driver.delivered"
# Support
SUPPORT_INTERVENTION = "support.intervention"
DISPUTE_OPENED = "dispute.opened"
DISPUTE_RESOLVED = "dispute.resolved"
@dataclass
class AuditEvent:
"""
Immutable audit event for order tracking.
Contains all context needed to understand what happened,
who did it, and when.
"""
# Identity
event_id: str
timestamp: datetime
# What happened
action: AuditAction
order_id: str
# Who did it
actor_type: str # 'customer', 'system', 'restaurant', 'driver', 'support'
actor_id: str
# State change
previous_state: Optional[str]
new_state: str
# Details
details: Dict[str, Any]
# Context for debugging
service_name: str
correlation_id: str
# Integrity
previous_hash: Optional[str] = None
event_hash: Optional[str] = None
def compute_hash(self) -> str:
"""Compute hash for tamper detection."""
content = json.dumps({
'event_id': self.event_id,
'timestamp': self.timestamp.isoformat(),
'action': self.action.value,
'order_id': self.order_id,
'actor_id': self.actor_id,
'previous_hash': self.previous_hash
}, sort_keys=True)
return hashlib.sha256(content.encode()).hexdigest()
class AuditService:
"""
Service for recording and querying audit events.
Key guarantees:
- Events are immutable once written
- Hash chain provides tamper evidence
- Fast queries by order_id for support
"""
def __init__(self, kafka_producer, db_pool):
self.kafka = kafka_producer
self.db = db_pool
        # Last hash per order, so each order forms its own tamper-evident chain
        self._last_hash: Dict[str, Optional[str]] = {}
async def record(
self,
action: AuditAction,
order_id: str,
actor_type: str,
actor_id: str,
previous_state: Optional[str],
new_state: str,
details: Dict[str, Any],
service_name: str,
correlation_id: str
) -> str:
"""
Record an audit event.
Returns the event_id.
"""
event = AuditEvent(
event_id=f"aud_{uuid.uuid4().hex[:12]}",
timestamp=datetime.utcnow(),
action=action,
order_id=order_id,
actor_type=actor_type,
actor_id=actor_id,
previous_state=previous_state,
new_state=new_state,
details=details,
service_name=service_name,
correlation_id=correlation_id,
            previous_hash=self._last_hash.get(order_id)
)
# Compute hash for chain
event.event_hash = event.compute_hash()
        self._last_hash[order_id] = event.event_hash
# Publish to Kafka
await self.kafka.send(
'audit.events',
key=order_id.encode(),
value=json.dumps(asdict(event), default=str).encode()
)
return event.event_id
async def get_order_timeline(self, order_id: str) -> List[Dict]:
"""
Get complete timeline for an order.
Used by customer support for dispute resolution.
"""
rows = await self.db.fetch("""
SELECT
event_id,
timestamp,
action,
actor_type,
actor_id,
previous_state,
new_state,
details
FROM audit_events
WHERE order_id = $1
ORDER BY timestamp ASC
""", order_id)
return [dict(row) for row in rows]
async def verify_integrity(self, order_id: str) -> Dict[str, Any]:
"""
Verify hash chain integrity for an order.
Used for dispute verification - proves logs weren't tampered.
"""
rows = await self.db.fetch("""
SELECT event_id, event_hash, previous_hash
FROM audit_events
WHERE order_id = $1
ORDER BY timestamp ASC
""", order_id)
issues = []
prev_hash = None
for row in rows:
if row['previous_hash'] != prev_hash:
issues.append({
'event_id': row['event_id'],
'issue': 'Chain break detected',
'expected': prev_hash,
'actual': row['previous_hash']
})
prev_hash = row['event_hash']
return {
'order_id': order_id,
'events_checked': len(rows),
'integrity': 'valid' if not issues else 'compromised',
'issues': issues
}
class SupportDashboard:
"""
Support dashboard for dispute resolution.
Provides easy access to order history and investigation tools.
"""
def __init__(self, audit_service: AuditService, order_service):
self.audit = audit_service
self.orders = order_service
async def investigate_order(self, order_id: str) -> Dict[str, Any]:
"""
Get complete investigation package for an order.
Used by support agents during disputes.
"""
# Get current order state
order = await self.orders.get_order(order_id)
# Get complete timeline
timeline = await self.audit.get_order_timeline(order_id)
# Verify integrity
integrity = await self.audit.verify_integrity(order_id)
# Get related DLQ entries if any
dlq_entries = await self._get_dlq_entries(order_id)
return {
'order': order,
'timeline': timeline,
'integrity': integrity,
'failures': dlq_entries,
'summary': self._generate_summary(timeline)
}
def _generate_summary(self, timeline: List[Dict]) -> str:
"""Generate human-readable summary of order events."""
if not timeline:
return "No events found for this order."
summary_lines = []
for event in timeline:
ts = event['timestamp'].strftime('%H:%M:%S')
action = event['action'].replace('.', ' ').title()
actor = f"{event['actor_type']}:{event['actor_id'][:8]}"
summary_lines.append(f"{ts} - {action} by {actor}")
return "\n".join(summary_lines)
Query Examples for Support
You: "Here's how support would use this for common disputes:"
-- "I never got my food but was charged"
SELECT
timestamp,
action,
actor_type,
new_state,
details
FROM audit_events
WHERE order_id = 'ord_abc123'
ORDER BY timestamp;
-- Result shows:
-- 18:30:01 - order.placed (customer)
-- 18:30:02 - payment.confirmed (system)  <- Payment went through
-- 18:30:05 - restaurant.notified (system)
-- 18:31:00 - restaurant.accepted (restaurant)
-- 18:35:00 - driver.assigned (system)
-- 18:50:00 - driver.picked_up (driver)
-- 19:15:00 - ??? (no delivery event!)
--
-- Gap after pickup suggests driver issue
-- "The restaurant says they never got my order"
SELECT * FROM audit_events
WHERE order_id = 'ord_xyz789'
AND action = 'restaurant.notified';
-- Shows notification was sent at 18:30:05
-- Check details for delivery confirmation
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale to 10x the current load?"
Scaling Strategy
You: "There are several scaling vectors to consider..."
CURRENT → 10X SCALE
| Metric | Current | 10x target |
|---|---|---|
| Orders/second (peak) | 50 | 500 |
| Events/second | 800 | 8,000 |
| Consumer instances | 28 | 100+ |
| Kafka partitions | 32 | 128 |
| Database writes/sec | 1,000 | 10,000 |
Horizontal Scaling
COMPONENT SCALING
1. KAFKA
Current: 32 partitions, 3 brokers
10X: 128 partitions, 10+ brokers
Why: More partitions = more parallel consumers
2. CONSUMERS
Current: 7 services Γ 4 instances = 28
10X: 7 services Γ 16 instances = 112
Limit: Instances β€ partitions per consumer group
3. ORDER DATABASE
Current: Single primary + replicas
10X: Sharded by customer_id or city
Why: Write throughput is the bottleneck
4. AUDIT DATABASE (TimescaleDB)
Current: Single instance, 4TB
10X: Distributed TimescaleDB or ClickHouse
Why: Write throughput + query performance
5. OUTBOX PUBLISHER
Current: CDC with Debezium
10X: Still CDC, but watch for lag
May need: Multiple Debezium instances per shard
Bottleneck Analysis
| Component | Current Limit | First to Break | Solution |
|---|---|---|---|
| API Gateway | 10,000 req/s | No | Already scaled |
| Order DB writes | 2,000/s | Yes, at 5x | Shard by city |
| Kafka ingestion | 100,000 msg/s | No | Plenty of headroom |
| Consumer processing | 500 orders/s | Yes, at 10x | Add instances |
| Audit DB writes | 5,000/s | Yes, at 6x | Batch inserts, shard |
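For the audit-DB row, batching is the first lever before sharding. A sketch of a batched audit writer (asyncpg-style executemany; the batch size, flush interval, and column subset are assumptions):

# Batched audit writer - amortizes per-row insert cost into TimescaleDB
import asyncio
import json

class BatchedAuditWriter:
    """Buffers audit events and writes them to the audit store in batches."""

    def __init__(self, db_pool, batch_size: int = 500, flush_interval: float = 1.0):
        self.db = db_pool
        self.batch_size = batch_size          # flush when buffer reaches this size
        self.flush_interval = flush_interval  # ...or when this many seconds pass
        self.buffer = []

    async def run(self, consumer):
        """Consume audit.events and flush in batches."""
        loop = asyncio.get_event_loop()
        last_flush = loop.time()
        async for message in consumer:
            self.buffer.append(json.loads(message.value))
            now = loop.time()
            if len(self.buffer) >= self.batch_size or now - last_flush >= self.flush_interval:
                await self._flush()
                await consumer.commit()  # commit offsets only after a durable flush
                last_flush = now

    async def _flush(self):
        if not self.buffer:
            return
        # Column subset is illustrative; the real table carries full context
        rows = [
            (e['event_id'], e['order_id'], e['action'], e['timestamp'])
            for e in self.buffer
        ]
        await self.db.executemany("""
            INSERT INTO audit_events (event_id, order_id, action, timestamp)
            VALUES ($1, $2, $3, $4)
        """, rows)
        self.buffer.clear()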
Edge Cases
Interviewer: "What are some edge cases we should handle?"
Edge Case 1: Customer Places Order During Restaurant Close
Scenario:
Customer orders at 9:59 PM
Restaurant closes at 10:00 PM
Order reaches restaurant at 10:01 PM
Restaurant system auto-rejects
Handling:
1. Check restaurant hours during order placement
2. If within 15 min of close, warn customer
3. If rejected due to close, auto-refund immediately
4. Audit: Record "auto_rejected_closing_time"
Edge Case 2: Payment Confirmed, Then Fraud Detected
Scenario:
Payment confirmed at 18:30
Fraud detection flags at 18:35
Order already sent to restaurant
Handling:
1. Fraud service publishes fraud.detected event
2. Order service cancels order
3. Restaurant notified of cancellation
4. Payment reversed (refund)
5. Customer notified of security hold
6. Audit: Full trace for investigation
Edge Case 3: Kafka Cluster Unavailable
Scenario:
Kafka down for 30 minutes
Orders still being placed
Handling:
1. Outbox accumulates events (DB still working)
2. Alert fired at 1 minute of lag
3. API continues accepting orders (degraded mode)
4. When Kafka recovers, outbox publisher catches up
5. No orders lost due to transactional outbox
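The outbox-depth signal referenced here (and in the alerting thresholds later) is a single query against the outbox table - a sketch:

-- Outbox backlog: rows written but not yet published to Kafka.
-- A sustained spike usually means the publisher or Kafka is down.
SELECT count(*)        AS unpublished,
       min(created_at) AS oldest_pending
FROM outbox
WHERE published_at IS NULL;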
Failure Scenarios
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Kafka broker down | Health check | Reduced throughput | Replication handles it |
| Payment service down | Circuit breaker | Orders queue up | Retry topic with backoff |
| Database primary down | Connection errors | Orders fail | Failover to replica |
| Consumer crash | Kubernetes | Temporary lag | Auto-restart, resume from offset |
| Outbox publisher down | Lag monitoring | Events delayed | Restart, outbox catches up |
Phase 6: Monitoring and Operations
Interviewer: "How would you monitor this system in production?"
Key Metrics
You: "I would track metrics at multiple levels..."
Business Metrics
ORDER HEALTH
- Orders placed per minute: Target 50/min normal, 200/min peak
- Order completion rate: Target > 95%
- Average order-to-delivery time: Target < 45 min
- Dispute rate: Target < 2%

PAYMENT HEALTH
- Payment success rate: Target > 98%
- Payment latency p99: Target < 3s
- Refund rate: Target < 5%

RESTAURANT HEALTH
- Restaurant acceptance rate: Target > 90%
- Notification latency p99: Target < 10s
- Restaurant response time: Target < 5 min
System Metrics
MONITORING DASHBOARD

KAFKA HEALTH
- Consumer lag:             target < 1,000     current 800
- Messages/sec:             target > 500       current 850
- Under-replicated:         target 0           current 0

ORDER SERVICE
- API latency p99:          target < 200ms     current 180ms
- Error rate:               target < 1%        current 0.3%
- Outbox depth:             target < 100       current 25

CONSUMERS
- Payment processing:       target < 3s        current 2.1s
- Restaurant notification:  target < 10s       current 4s
- DLQ depth (all topics):   target < 50        current 12

AUDIT
- Write latency:            target < 100ms     current 45ms
- Query latency:            target < 500ms     current 200ms
- Storage used:             target < 80%       current 62%
Alerting Strategy
CRITICAL (PagerDuty - Wake someone up):
- Consumer lag > 50,000 (falling behind badly)
- DLQ depth > 100 (many failures)
- Payment success rate < 90% (payments broken)
- Outbox depth > 10,000 (Kafka might be down)
- Kafka under-replicated partitions > 0
WARNING (Slack - Investigate soon):
- Consumer lag > 10,000
- DLQ depth > 20
- Payment latency p99 > 5s
- Restaurant acceptance rate < 80%
- Order completion rate < 90%
INFO (Dashboard only):
- New deployment detected
- Consumer rebalance occurred
- Traffic spike (but within limits)
Runbook Snippet
RUNBOOK: Consumer Lag Spike
SYMPTOMS:
- Alert: "Consumer lag > 10,000"
- Dashboard shows growing lag
- Order processing delays reported
DIAGNOSIS:
1. Check consumer health:
kubectl get pods -l app=order-consumer
2. Check Kafka broker health:
kafka-broker-api-versions.sh --bootstrap-server kafka:9092
3. Check for poison pills:
kafka-console-consumer --topic orders.dlq --from-beginning | head -10
4. Check consumer logs:
kubectl logs -l app=order-consumer --tail=100 | grep ERROR
RESOLUTION:
1. If consumers crashed:
kubectl rollout restart deployment/order-consumer
2. If poison pills detected:
# Skip the message (requires offset management)
# Or fix and replay from DLQ
3. If sustained high load:
kubectl scale deployment/order-consumer --replicas=8
4. If Kafka broker issues:
Escalate to infrastructure team
POST-RESOLUTION:
- Monitor lag returning to normal
- Check DLQ for any messages that need replay
- Document incident in postmortem
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of event-driven systems, handled the messaging patterns well, and showed good operational thinking. Any questions for me?"
You: "Thank you! I'd love to hear how your current system handles the dinner rush - do you use the backpressure patterns we discussed, or have you found different strategies that work better at your scale?"
Interviewer: "Great question. We actually faced exactly the problems you described with Kafka lag during peaks. We ended up implementing priority queues similar to what you proposed. The trickiest part was getting the load shedding priorities right - we had some debates about whether driver assignment notifications should be shedable. In the end, we decided they're critical because a delayed driver assignment is worse than a slightly delayed analytics event."
You: "That makes sense. I'd also be curious about your DLQ investigation workflow - is it automated or do you have engineers reviewing manually?"
Interviewer: "Mostly automated for known failure types, manual review for unknowns. We've built up a library of auto-fix rules over time. Thanks for a great discussion!"
Summary: Week 3 Concepts Applied
Week 3 Concepts Integration
| Day | Concept | Application in This Design |
|---|---|---|
| Day 1 | Queue vs Stream | Chose Kafka for multi-consumer fanout, replay capability, and ordering per order_id |
| Day 2 | Transactional Outbox | Order creation + event publishing in same DB transaction, CDC for low-latency publishing |
| Day 3 | Backpressure | Consumer lag monitoring, priority-based shedding, adaptive batch sizing during dinner rush |
| Day 4 | Dead Letters | Categorized DLQs for transient/permanent/poison, retry topics with delays, compensation logic |
| Day 5 | Audit Log | Hash-chained immutable events, TimescaleDB for queries, support dashboard for disputes |
How Concepts Connect
ORDER FLOW WITH ALL WEEK 3 CONCEPTS

Customer places order
        |
        v
+--------------------------------------------------------------+
| ORDER SERVICE                                                 |
|                                                               |
|   BEGIN TRANSACTION                   <-- Day 2: Outbox       |
|     INSERT INTO orders                                        |
|     INSERT INTO outbox (order.placed)                         |
|     INSERT INTO audit (order.placed)  <-- Day 5: Audit        |
|   COMMIT                                                      |
+--------------------------------------------------------------+
        |
        | CDC (Debezium)
        v
+--------------------------------------------------------------+
| KAFKA                                 <-- Day 1: Stream       |
|                                                               |
|   orders.events --+--> Payment Consumer                       |
|                   +--> Restaurant Consumer                    |
|                   +--> Driver Consumer                        |
|                   +--> Notification Consumer                  |
|                   +--> Analytics Consumer                     |
|                   +--> Audit Consumer <-- Day 5               |
|                                                               |
|   Backpressure: lag monitoring, batch sizing  <-- Day 3       |
+--------------------------------------------------------------+
        |
        | On failure
        v
+--------------------------------------------------------------+
| FAILURE HANDLING                      <-- Day 4: DLQ          |
|                                                               |
|   orders.retry.1m   --> Delayed retry                         |
|   orders.retry.5m   --> Longer delay                          |
|   orders.dlq        --> Investigation + replay                |
|   orders.dlq.poison --> Quarantine + alert                    |
+--------------------------------------------------------------+
Code Patterns Demonstrated
1. TRANSACTIONAL OUTBOX (Day 2)
- Single transaction for order + event
- CDC-based publishing
- Key insight: Never dual-write to DB and queue
2. BACKPRESSURE HANDLING (Day 3)
- Consumer lag monitoring
- Priority-based load shedding
- Key insight: Graceful degradation over crash
3. ERROR CLASSIFICATION (Day 4)
- Transient vs Permanent vs Poison
- Delayed retry topics
- Key insight: Different errors need different handling
4. AUDIT TRAIL (Day 5)
- Hash-chained events
- Immutable storage
- Key insight: Complete context for investigation
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Choose messaging patterns: know when Kafka vs RabbitMQ fits, and how to pick partition strategies
- Implement the transactional outbox: guarantee no lost events despite the dual-write problem
- Design for backpressure: detect lag, adjust batching, shed load gracefully
- Handle failures with a DLQ: classify errors, retry appropriately, quarantine poison pills
- Build audit trails: immutable events, hash chains, queryable history
- Estimate scale: calculate events/sec, storage, consumer instances
- Draw architecture diagrams: show event flow, failure paths, storage tiers
- Discuss trade-offs: latency vs throughput, consistency vs availability
- Design for operations: metrics, alerts, runbooks, investigation tools
- Conduct mock interviews: walk through the design systematically, handle follow-ups
Key Interview Takeaways
WHEN DESIGNING EVENT-DRIVEN SYSTEMS:
1. START WITH DELIVERY GUARANTEES
   "What happens if this event is lost?"
   If critical -> Transactional outbox
2. CHOOSE MESSAGING PATTERN BY CONSUMPTION
   Multiple consumers? -> Kafka (stream)
   Single consumer, task distribution? -> RabbitMQ (queue)
3. PLAN FOR FAILURE FROM DAY ONE
   Transient -> Retry with backoff
   Permanent -> Compensation logic
   Poison -> Quarantine immediately
4. PROTECT AGAINST OVERLOAD
Monitor lag continuously
Have shedding priorities defined
Never shed critical events
5. TRACK EVERYTHING FOR INVESTIGATION
Audit every state change
Include full context
Make it queryable
This capstone problem integrates all concepts from Week 3 of the System Design Mastery Series. Use this as a template for approaching similar interview problems involving event-driven architectures.