Capstone

Week 3 Capstone: Event-Driven Food Delivery Order System

🎯 A Real-World Problem Covering Everything You've Learned

This capstone integrates all Week 3 concepts into a single, realistic system design interview. You'll apply:

Day	Concept	How It's Used
Day 1	Queue vs Stream	Choosing the right messaging pattern for different event types
Day 2	Transactional Outbox	Guaranteeing order events are never lost
Day 3	Backpressure	Handling dinner rush without system collapse
Day 4	Dead Letters	Managing failed orders and payment issues
Day 5	Audit Log	Tracking every order state change for disputes

The Interview Begins

You walk into the interview room at a fast-growing food delivery startup. The interviewer smiles and gestures to the whiteboard.

Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions — this is meant to be collaborative."

They write on the whiteboard:

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║              Design an Event-Driven Order Fulfillment System             ║
║                                                                          ║
║   Context:                                                               ║
║   We're a food delivery platform operating in 50 cities. When a          ║
║   customer places an order, we need to:                                  ║
║                                                                          ║
║   - Process the payment                                                  ║
║   - Notify the restaurant                                                ║
║   - Match and dispatch a driver                                          ║
║   - Track the order through delivery                                     ║
║   - Handle failures gracefully                                           ║
║   - Maintain audit trail for disputes                                    ║
║                                                                          ║
║   Key challenges:                                                        ║
║   - Dinner rush: 10x normal traffic                                      ║
║   - Payment/restaurant failures happen                                   ║
║   - Customers dispute orders (need history)                              ║
║   - Multiple services need order events                                  ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."

Phase 1: Requirements Clarification (5 minutes)

Before diving in, you take a breath and start asking questions. This is crucial — never assume.

Your Questions

You: "Before I start designing, I'd like to clarify a few requirements. First, what's our current order volume and expected growth?"

Interviewer: "We process about 500,000 orders per day currently. During dinner rush (6-9 PM), we see about 60% of daily volume in those 3 hours. We're growing 100% year over year."

You: "So roughly 300,000 orders in 3 hours during peak — that's about 28 orders per second average, probably 50+ per second at the absolute peak. Got it. What about the order lifecycle? What states does an order go through?"

Interviewer: "An order goes through: placed → payment_processing → payment_confirmed → restaurant_notified → restaurant_accepted → driver_assigned → picked_up → delivered. It can also go to payment_failed, restaurant_rejected, or cancelled at various points."

You: "How many services need to react to order events? And do they all need the same events?"

Interviewer: "We have: Payment Service, Restaurant Service, Driver Matching, Notifications (push/SMS), Analytics, Customer Support dashboard, and a new Fraud Detection service. They have different needs — some need real-time, some can be slightly delayed."

You: "What about failures? If payment fails or the restaurant rejects, what's the expected behavior?"

Interviewer: "Payment failures should retry 3 times, then notify customer. Restaurant rejections should try to find an alternative restaurant or refund. We need to track all of this for customer disputes — we get about 2% of orders disputed."

You: "Last question — what's the latency requirement for the customer? How fast should they see their order confirmed?"

Interviewer: "Customer should see payment confirmed within 3 seconds. Restaurant notification should happen within 10 seconds. Full end-to-end from order to driver assignment should be under 2 minutes in normal conditions."

You: "Perfect. Let me summarize the requirements as I understand them."

Functional Requirements

1. ORDER LIFECYCLE MANAGEMENT
   - Accept order placement from customers
   - Process payments with retry logic
   - Notify restaurants and handle acceptance/rejection
   - Match and assign drivers
   - Track order through pickup and delivery
   - Handle cancellations at any stage

2. EVENT DISTRIBUTION
   - Multiple services consume order events
   - Different services need different event types
   - Some services need real-time, others can be delayed
   - Events must be reliably delivered

3. FAILURE HANDLING
   - Payment retries with intelligent backoff
   - Restaurant rejection → alternative matching or refund
   - Driver no-show → reassignment
   - All failures tracked and recoverable

4. AUDIT AND DISPUTE RESOLUTION
   - Complete history of every order state change
   - Who changed what, when, and why
   - Queryable for customer support
   - Retained for 2 years (legal requirement)

Non-Functional Requirements

1. SCALE
   - 500,000 orders/day (current)
   - 50 orders/second peak
   - 1,000,000 orders/day (1 year target)
   - 7 consumer services per order event

2. RELIABILITY
   - Zero order loss (payment taken = order tracked)
   - At-least-once delivery for all events
   - Exactly-once processing for payments

3. LATENCY
   - Payment confirmation: < 3 seconds (p99)
   - Restaurant notification: < 10 seconds
   - Driver assignment: < 2 minutes (p95)
   - Audit query: < 500ms for recent orders

4. AVAILABILITY
   - 99.9% uptime (8.7 hours downtime/year max)
   - Graceful degradation during partial failures
   - No data loss during outages

Phase 2: Back of the Envelope Estimation (5 minutes)

You: "Let me work through the numbers to understand the scale we're designing for."

Traffic Estimation

ORDER VOLUME

Current daily orders:          500,000
Peak hours (6-9 PM):          60% of daily = 300,000 orders
Peak duration:                3 hours = 10,800 seconds

Average during peak:          300,000 / 10,800 = ~28 orders/sec
Peak of peak (2x average):    ~56 orders/sec
Design target (headroom):     100 orders/sec


EVENTS PER ORDER

Order lifecycle events:
  - order.placed                 1
  - payment.processing           1
  - payment.confirmed            1 (or payment.failed)
  - restaurant.notified          1
  - restaurant.accepted          1 (or restaurant.rejected)
  - driver.assigned              1
  - order.picked_up              1
  - order.delivered              1
  ─────────────────────────────────
  Total:                        ~8 events per order

Events per second (peak):     100 orders × 8 events = 800 events/sec
With consumer fanout (7x):    800 × 7 = 5,600 event deliveries/sec

Storage Estimation

EVENT STORAGE

Event size (average):
  - Event metadata:             200 bytes
  - Order snapshot:             500 bytes
  - Actor/context:              150 bytes
  - Total per event:            ~850 bytes ≈ 1 KB

Daily events:                   500,000 orders × 8 = 4,000,000 events
Daily storage:                  4M × 1 KB = 4 GB/day
Monthly storage:                4 GB × 30 = 120 GB/month
2-year retention:               120 GB × 24 = 2.88 TB

With replication (3x):          ~9 TB total


AUDIT LOG STORAGE

Audit events (more detailed):   ~2 KB per event
Daily audit storage:            4M × 2 KB = 8 GB/day
2-year audit retention:         8 GB × 730 = 5.8 TB

Key Metrics Summary

┌─────────────────────────────────────────────────────────────────────────┐
│                         ESTIMATION SUMMARY                              │
│                                                                         │
│  TRAFFIC                                                                │
│  ├── Peak orders:            100 /second (design target)                │
│  ├── Peak events:            800 /second                                │
│  └── Peak deliveries:        5,600 /second (with fanout)                │
│                                                                         │
│  STORAGE                                                                │
│  ├── Events per day:         4 GB                                       │
│  ├── Audit per day:          8 GB                                       │
│  └── 2-year total:           ~15 TB (with replication)                  │
│                                                                         │
│  INFRASTRUCTURE (rough)                                                 │
│  ├── Kafka partitions:       32 (for parallelism)                       │
│  ├── Consumer instances:     7 services × 4 instances = 28              │
│  └── Database:               PostgreSQL + TimescaleDB                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Phase 3: High-Level Design (10 minutes)

You: "Now let me sketch out the high-level architecture."

System Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                           HIGH-LEVEL ARCHITECTURE                               │
│                                                                                 │
│    ┌──────────────┐                                                             │
│    │   Customer   │                                                             │
│    │  Mobile/Web  │                                                             │
│    └──────┬───────┘                                                             │
│           │                                                                     │
│           ▼                                                                     │
│    ┌──────────────┐      ┌──────────────┐                                       │
│    │ API Gateway  │─────▶│    Order     │                                       │
│    │              │      │   Service    │                                       │
│    └──────────────┘      └──────┬───────┘                                       │
│                                 │                                               │
│                    ┌────────────┴────────────┐                                  │
│                    │   Same Transaction      │                                  │
│                    │                         │                                  │
│                    ▼                         ▼                                  │
│             ┌──────────────┐          ┌──────────────┐                          │
│             │    Orders    │          │    Outbox    │◄── Day 2: Transactional  │
│             │    Table     │          │    Table     │         Outbox           │
│             └──────────────┘          └──────┬───────┘                          │
│                                              │                                  │
│                                              │ CDC (Debezium)                   │
│                                              ▼                                  │
│    ┌────────────────────────────────────────────────────────────────────────┐   │
│    │                                                                        │   │
│    │                         Kafka Cluster                                  │   │
│    │                                                                        │   │
│    │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐        │   │
│    │  │   orders   │  │  orders    │  │  orders    │  │   audit    │        │   │
│    │  │  .events   │  │  .retry    │  │  .dlq      │  │  .events   │        │   │
│    │  └─────┬──────┘  └────────────┘  └────────────┘  └─────┬──────┘        │   │
│    │        │              ▲               ▲                │               │   │
│    │        │              │               │                │               │   │
│    └────────┼──────────────┼───────────────┼────────────────┼───────────────┘   │
│             │              │               │                │                   │
│             │         Day 4: Dead Letters  │                │                   │
│             │                              │                │                   │
│             ▼                              │                ▼                   │
│    ┌─────────────────────────────────────────────┐  ┌──────────────┐            │
│    │              Consumer Group                 │  │    Audit     │            │
│    │                                             │  │   Consumer   │            │
│    │  ┌─────────┐ ┌─────────┐ ┌─────────┐        │  └──────┬───────┘            │
│    │  │ Payment │ │Restaurant│ │ Driver  │       │         │                    │
│    │  │ Service │ │ Service │ │ Matcher │        │         │                    │
│    │  └────┬────┘ └────┬────┘ └────┬────┘        │         │                    │
│    │       │           │           │             │         │                    │
│    │  ┌────┴───┐  ┌────┴───┐  ┌────┴───┐         │         │                    │
│    │  │Notific-│  │Analytics│ │ Fraud  │         │         │                    │
│    │  │ations  │  │        │  │Detection│        │         │                    │
│    │  └────────┘  └────────┘  └────────┘         │         │                    │
│    │                                             │         │                    │
│    │       Day 3: Backpressure Handling ───────  |         │                    │
│    │                                             │         │                    │
│    └─────────────────────────────┬───────────────┘         │                    │
│                                  │                         │                    │
│                                  ▼                         ▼                    │
│                         ┌──────────────┐          ┌──────────────┐              │
│                         │   Service    │          │  Audit Store │              │
│                         │  Databases   │          │ (TimescaleDB)│◄── Day 5     │
│                         └──────────────┘          └──────────────┘              │
│                                                                                 │
│                                         Day 1: Queue vs Stream ─────────────┤   |
│                                         (Kafka for multi-consumer replay)       │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

Component Breakdown

You: "Let me walk through each component..."

1. Order Service

Purpose: Accepts orders, orchestrates the lifecycle, maintains order state.

Key responsibilities:

Validate incoming orders
Create order record + outbox event in single transaction
Expose order status API
Handle cancellations

Why transactional outbox: If we write order to database and separately publish to Kafka, we risk inconsistency. The outbox pattern (Day 2) guarantees the event is published if and only if the order is saved.

2. Kafka Cluster

Purpose: Reliable event distribution to all consumers.

Topics:

orders.events — Main order lifecycle events (32 partitions)
orders.retry — Delayed retry for transient failures
orders.dlq — Dead letter queue for investigation
audit.events — Separate topic for audit trail

Why Kafka (not RabbitMQ): Day 1 taught us Kafka is better when:

Multiple consumers need the same events (7 services)
We need replay capability (debugging, reprocessing)
We need ordering per order (partition by order_id)
High throughput with durability

3. Consumer Services

Purpose: React to order events and perform their specific functions.

Services and their patterns:

Payment Service: Needs exactly-once, uses idempotency keys
Restaurant Service: Needs ordering per restaurant, partitioned accordingly
Driver Matcher: Can handle eventual consistency, more latency-tolerant
Notifications: Fire-and-forget for push, retry for SMS
Analytics: Batch-friendly, can lag behind
Fraud Detection: Needs real-time, dedicated consumer group

4. Audit Store

Purpose: Immutable record of all order state changes for disputes.

Technology: TimescaleDB (PostgreSQL extension for time-series)

Why: Day 5 taught us audit logs need fast time-range queries, immutability, and long retention. TimescaleDB gives us automatic partitioning and compression.

Data Flow

You: "Let me trace through a typical order placement..."

ORDER PLACEMENT FLOW

Step 1: Customer places order
        Mobile App ──▶ API Gateway ──▶ Order Service

Step 2: Order Service creates order + outbox event
        ┌─────────────────────────────────────────┐
        │  BEGIN TRANSACTION                      │
        │    INSERT INTO orders (...)             │
        │    INSERT INTO outbox (order.placed)    │
        │  COMMIT                                 │
        └─────────────────────────────────────────┘
        Response returned to customer immediately

Step 3: CDC publishes outbox to Kafka
        Outbox ──▶ Debezium ──▶ Kafka (orders.events)
        Latency: ~100ms

Step 4: Payment Service consumes event
        Kafka ──▶ Payment Consumer ──▶ Process Payment
        ├── Success: Publish payment.confirmed
        └── Failure: Retry or publish payment.failed

Step 5: Restaurant Service consumes payment.confirmed
        Kafka ──▶ Restaurant Consumer ──▶ Notify Restaurant
        ├── Accepted: Publish restaurant.accepted  
        └── Rejected: Publish restaurant.rejected

Step 6: Driver Matcher consumes restaurant.accepted
        Kafka ──▶ Driver Matcher ──▶ Find & Assign Driver
        Publish driver.assigned

Step 7: All events also consumed by Audit Service
        Every event ──▶ Audit Consumer ──▶ TimescaleDB

Phase 4: Deep Dives (20 minutes)

Interviewer: "Great high-level design. Let's dive deeper into a few areas. Tell me more about how you'd handle the event publishing reliably."

Deep Dive 1: Transactional Outbox Pattern (Day 2)

You: "This is critical for our system. The challenge is: how do we ensure an order is saved AND its event is published atomically? Let me explain the problem first."

The Problem

THE DUAL-WRITE PROBLEM

Without outbox pattern:

  Order Service:
    1. Save order to database     ✓ Success
    2. Publish to Kafka           ✗ Failure (network issue)
    
  Result:
    - Order exists in database
    - No event published
    - Payment never charged
    - Restaurant never notified
    - Customer waiting forever
    
  Or worse:
    1. Publish to Kafka           ✓ Success
    2. Save order to database     ✗ Failure
    
  Result:
    - Event published
    - Order not in database
    - Payment charged, but no order record!
    - Refund nightmare

The Solution

You: "I'd use the transactional outbox pattern from Day 2. Here's how it works:"

TRANSACTIONAL OUTBOX SOLUTION

Order Service:
  BEGIN TRANSACTION
    INSERT INTO orders (id, customer_id, items, status, ...)
    INSERT INTO outbox (event_type, payload, created_at)
  COMMIT

Both writes succeed or both fail — guaranteed by database ACID.

Separately, Outbox Publisher:
  1. Poll outbox table (or use CDC)
  2. Publish events to Kafka
  3. Mark as published (or delete)

If publisher crashes:
  - Unpublished events remain in outbox
  - Publisher resumes on restart
  - No events lost

Implementation

# Transactional Outbox Implementation for Order Service

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict, Any
import json
import uuid


@dataclass
class OrderEvent:
    """Event to be published via outbox."""
    event_id: str
    event_type: str
    order_id: str
    payload: Dict[str, Any]
    created_at: datetime


class OrderService:
    """
    Order service with transactional outbox.
    Applies Day 2: Transactional Outbox Pattern.
    """
    
    def __init__(self, db_pool, audit_client):
        self.db = db_pool
        self.audit = audit_client
    
    async def place_order(
        self,
        customer_id: str,
        restaurant_id: str,
        items: list,
        delivery_address: dict,
        payment_method_id: str
    ) -> dict:
        """
        Place a new order.
        
        CRITICAL: Order creation and event publishing happen in
        the same database transaction. This guarantees:
        - If order is saved, event WILL be published
        - If transaction fails, neither is persisted
        """
        order_id = f"ord_{uuid.uuid4().hex[:12]}"
        
        # Calculate totals
        subtotal = sum(item['price'] * item['quantity'] for item in items)
        delivery_fee = 4.99
        total = subtotal + delivery_fee
        
        async with self.db.transaction() as tx:
            # 1. Create the order
            await tx.execute("""
                INSERT INTO orders (
                    id, customer_id, restaurant_id, items,
                    delivery_address, payment_method_id,
                    subtotal, delivery_fee, total,
                    status, created_at
                ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
            """, 
                order_id, customer_id, restaurant_id, json.dumps(items),
                json.dumps(delivery_address), payment_method_id,
                subtotal, delivery_fee, total,
                'placed', datetime.utcnow()
            )
            
            # 2. Write event to outbox (SAME TRANSACTION)
            event = OrderEvent(
                event_id=f"evt_{uuid.uuid4().hex[:12]}",
                event_type="order.placed",
                order_id=order_id,
                payload={
                    "order_id": order_id,
                    "customer_id": customer_id,
                    "restaurant_id": restaurant_id,
                    "items": items,
                    "total": total,
                    "payment_method_id": payment_method_id,
                    "delivery_address": delivery_address
                },
                created_at=datetime.utcnow()
            )
            
            await tx.execute("""
                INSERT INTO outbox (
                    event_id, event_type, aggregate_id, 
                    aggregate_type, payload, created_at
                ) VALUES ($1, $2, $3, $4, $5, $6)
            """,
                event.event_id, event.event_type, order_id,
                'order', json.dumps(event.payload), event.created_at
            )
            
            # 3. Write audit event (SAME TRANSACTION)
            await self.audit.log(
                action="order.placed",
                resource_type="order",
                resource_id=order_id,
                actor_id=customer_id,
                details={"total": total, "item_count": len(items)}
            )
        
        # Transaction committed — order and event are guaranteed persisted
        return {"order_id": order_id, "status": "placed", "total": total}


class OutboxPublisher:
    """
    Background process that publishes outbox events to Kafka.
    
    Uses CDC (Debezium) for low-latency, or polling as fallback.
    """
    
    def __init__(self, db_pool, kafka_producer):
        self.db = db_pool
        self.kafka = kafka_producer
    
    async def publish_pending(self, batch_size: int = 100) -> int:
        """Publish pending outbox events."""
        
        # Fetch unpublished events
        rows = await self.db.fetch("""
            SELECT id, event_id, event_type, aggregate_id, payload
            FROM outbox
            WHERE published_at IS NULL
            ORDER BY created_at
            LIMIT $1
            FOR UPDATE SKIP LOCKED
        """, batch_size)
        
        if not rows:
            return 0
        
        published_ids = []
        
        for row in rows:
            # Partition by order_id for ordering guarantee
            partition_key = row['aggregate_id'].encode('utf-8')
            
            await self.kafka.send_and_wait(
                'orders.events',
                key=partition_key,
                value=row['payload'].encode('utf-8'),
                headers=[
                    ('event_type', row['event_type'].encode()),
                    ('event_id', row['event_id'].encode())
                ]
            )
            
            published_ids.append(row['id'])
        
        # Mark as published
        await self.db.execute("""
            UPDATE outbox 
            SET published_at = NOW()
            WHERE id = ANY($1)
        """, published_ids)
        
        return len(published_ids)

Edge Cases Handled

Interviewer: "What if the outbox publisher crashes after publishing to Kafka but before marking as published?"

You: "Great question. The event gets published again when the publisher restarts. This is why consumers must be idempotent — they use the event_id to deduplicate. At-least-once delivery is the guarantee; exactly-once is the consumer's responsibility.

For payments specifically, we use idempotency keys tied to the order_id. Stripe won't charge twice for the same idempotency key."

Deep Dive 2: Backpressure During Dinner Rush (Day 3)

Interviewer: "You mentioned dinner rush is 10x normal traffic. How do you prevent the system from collapsing?"

You: "This is where Day 3's backpressure patterns come in. Let me show you the strategy."

The Problem

DINNER RUSH SCENARIO

Normal traffic:       5 orders/second
Dinner rush:          50 orders/second (10x)
Super peak:           100 orders/second

What can fail:
  - Payment service overwhelmed → timeouts
  - Restaurant service backed up → notifications delayed
  - Driver matching overloaded → long wait times
  - Database connections exhausted
  - Kafka consumers fall behind

Symptoms:
  - Consumer lag growing
  - Processing latency increasing
  - Memory pressure on consumers
  - Eventual OOM crashes
  - Cascading failures

The Solution

You: "I'd implement a multi-layer backpressure strategy:"

BACKPRESSURE LAYERS

Layer 1: DETECTION
  Monitor consumer lag, queue depth, processing latency
  Thresholds: Warning at 10K lag, Critical at 50K

Layer 2: RATE LIMITING AT INGESTION
  API Gateway limits orders/second per city
  Protects downstream from unbounded load

Layer 3: PRIORITY QUEUES
  Critical: Payment confirmations, driver assignments
  Normal: Analytics, notifications
  Low: Marketing events, non-urgent updates

Layer 4: ADAPTIVE BATCHING
  Low load: Process immediately
  High load: Batch for efficiency

Layer 5: LOAD SHEDDING (last resort)
  Never shed payment or critical events
  Can delay analytics, marketing

Implementation

# Backpressure-Aware Consumer
# Applies Day 3: Backpressure and Flow Control

from dataclasses import dataclass
from enum import Enum
from typing import Optional
import asyncio
import time


class Priority(Enum):
    CRITICAL = 1  # Payments, order state changes
    HIGH = 2      # Restaurant notifications, driver assignment
    NORMAL = 3    # Customer notifications
    LOW = 4       # Analytics, marketing


class BackpressureLevel(Enum):
    NONE = "none"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


@dataclass
class BackpressureConfig:
    """Configuration for backpressure handling."""
    lag_warning_threshold: int = 10_000
    lag_critical_threshold: int = 50_000
    lag_emergency_threshold: int = 100_000
    
    # Batch sizes by pressure level
    batch_size_normal: int = 10
    batch_size_warning: int = 50
    batch_size_critical: int = 200


class BackpressureAwareConsumer:
    """
    Kafka consumer with backpressure detection and response.
    
    Key behaviors:
    - Monitors consumer lag continuously
    - Adjusts batch size based on pressure
    - Sheds low-priority work under extreme pressure
    - Never sheds critical events
    """
    
    def __init__(
        self,
        kafka_consumer,
        processor,
        config: BackpressureConfig,
        metrics_client
    ):
        self.consumer = kafka_consumer
        self.processor = processor
        self.config = config
        self.metrics = metrics_client
        
        self._current_level = BackpressureLevel.NONE
        self._current_batch_size = config.batch_size_normal
    
    async def start(self):
        """Main consumer loop with backpressure handling."""
        
        # Start background monitoring
        asyncio.create_task(self._monitor_loop())
        
        batch = []
        
        async for message in self.consumer:
            batch.append(message)
            
            if len(batch) >= self._current_batch_size:
                await self._process_batch(batch)
                batch = []
    
    async def _monitor_loop(self):
        """Continuously monitor lag and adjust strategy."""
        while True:
            lag = await self._get_consumer_lag()
            
            # Determine pressure level
            if lag >= self.config.lag_emergency_threshold:
                new_level = BackpressureLevel.EMERGENCY
            elif lag >= self.config.lag_critical_threshold:
                new_level = BackpressureLevel.CRITICAL
            elif lag >= self.config.lag_warning_threshold:
                new_level = BackpressureLevel.WARNING
            else:
                new_level = BackpressureLevel.NONE
            
            # Adjust batch size
            if new_level != self._current_level:
                self._current_level = new_level
                self._adjust_batch_size()
                self.metrics.gauge('consumer.backpressure_level', new_level.value)
            
            await asyncio.sleep(1)
    
    def _adjust_batch_size(self):
        """Adjust batch size based on pressure level."""
        if self._current_level == BackpressureLevel.NONE:
            self._current_batch_size = self.config.batch_size_normal
        elif self._current_level == BackpressureLevel.WARNING:
            self._current_batch_size = self.config.batch_size_warning
        else:  # CRITICAL or EMERGENCY
            self._current_batch_size = self.config.batch_size_critical
    
    async def _process_batch(self, messages: list):
        """Process a batch of messages with priority awareness."""
        
        # Under emergency pressure, filter by priority
        if self._current_level == BackpressureLevel.EMERGENCY:
            messages = self._filter_by_priority(messages, min_priority=Priority.HIGH)
            self.metrics.increment('consumer.messages_shed', 
                                   len(messages) - len(messages))
        
        # Process remaining messages
        for msg in messages:
            try:
                await self.processor.process(msg)
            except Exception as e:
                # Send to DLQ (Day 4)
                await self._send_to_dlq(msg, e)
        
        await self.consumer.commit()
    
    def _filter_by_priority(self, messages: list, min_priority: Priority) -> list:
        """Filter messages by priority. Never filter CRITICAL."""
        return [
            msg for msg in messages
            if self._get_priority(msg).value <= min_priority.value
        ]
    
    def _get_priority(self, message) -> Priority:
        """Determine message priority from event type."""
        event_type = message.headers.get('event_type', b'').decode()
        
        # CRITICAL: Never shed
        if event_type in ('payment.confirmed', 'payment.failed', 
                          'order.placed', 'order.cancelled'):
            return Priority.CRITICAL
        
        # HIGH: Only shed in emergency
        if event_type in ('restaurant.notified', 'driver.assigned'):
            return Priority.HIGH
        
        # NORMAL: Shed when critical
        if event_type in ('notification.sent', 'driver.location_update'):
            return Priority.NORMAL
        
        # LOW: Shed when warning or above
        return Priority.LOW
    
    async def _get_consumer_lag(self) -> int:
        """Get current consumer lag."""
        # Implementation depends on Kafka client
        return await self.consumer.get_lag()
    
    async def _send_to_dlq(self, message, error: Exception):
        """Send failed message to dead letter queue (Day 4)."""
        # Implementation in Deep Dive 3
        pass

Scaling Response

You: "Beyond the consumer-level backpressure, we also have infrastructure-level responses:"

SCALING RESPONSES BY PRESSURE LEVEL

WARNING (lag > 10K):
  ├── Alert ops team (Slack)
  ├── Increase batch sizes
  └── Pre-warm additional consumer instances

CRITICAL (lag > 50K):
  ├── Auto-scale consumers (Kubernetes HPA)
  │     kubectl scale deployment order-consumer --replicas=8
  ├── Shed LOW priority events
  └── Page on-call engineer

EMERGENCY (lag > 100K):
  ├── Max consumer replicas
  ├── Shed NORMAL and LOW priority
  ├── Enable API rate limiting (upstream)
  └── Consider temporary feature degradation

Deep Dive 3: Dead Letter Queue for Failed Orders (Day 4)

Interviewer: "What happens when an order can't be processed? Payment fails, restaurant rejects, etc.?"

You: "Day 4's dead letter queue pattern is essential here. Let me show you how we handle failures without losing orders."

The Problem

ORDER FAILURE SCENARIOS

1. PAYMENT FAILURES
   - Card declined
   - Insufficient funds
   - Payment gateway timeout
   - Fraud detected

2. RESTAURANT FAILURES
   - Restaurant closed unexpectedly
   - Item out of stock
   - Restaurant rejects order
   - No response (timeout)

3. PROCESSING FAILURES
   - Invalid message format (poison pill)
   - Database temporarily unavailable
   - Downstream service down
   - Bug in consumer code

What we CANNOT do:
  - Lose the order (customer already waiting)
  - Retry forever (wastes resources)
  - Block other orders (one failure shouldn't affect all)

The Solution

DLQ STRATEGY FOR ORDERS

                    orders.events
                          │
                          ▼
                    Order Consumer
                          │
                     ┌────┴────┐
                     │         │
                 Success    Failure
                     │         │
                     ▼         ▼
                  Commit    Classify
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         Transient       Permanent        Poison
         (retry)         (route)          (quarantine)
              │               │               │
              ▼               ▼               ▼
        orders.retry    Compensation    orders.dlq.poison
              │          Logic               │
              │               │               │
         Wait & retry    ┌────┴────┐     Immediate
              │          │         │       alert
              ▼     Notify     Alternative
        Main topic   customer    restaurant


RETRY TOPICS WITH DELAYS:

orders.retry.1m   (1 minute delay)
orders.retry.5m   (5 minute delay)  
orders.retry.15m  (15 minute delay)

After 3 retries: → orders.dlq

Implementation

# Dead Letter Queue Handler for Orders
# Applies Day 4: Dead Letters and Poison Pills

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Dict, Any
from enum import Enum
import json
import traceback


class ErrorCategory(Enum):
    TRANSIENT = "transient"      # Retry will likely help
    PERMANENT = "permanent"      # Retry won't help, need action
    POISON = "poison"            # Crashes consumer, quarantine


class FailureType(Enum):
    # Payment failures
    PAYMENT_DECLINED = "payment_declined"
    PAYMENT_TIMEOUT = "payment_timeout"
    PAYMENT_FRAUD = "payment_fraud"
    
    # Restaurant failures  
    RESTAURANT_CLOSED = "restaurant_closed"
    RESTAURANT_REJECTED = "restaurant_rejected"
    ITEM_UNAVAILABLE = "item_unavailable"
    
    # Processing failures
    INVALID_MESSAGE = "invalid_message"
    SERVICE_UNAVAILABLE = "service_unavailable"
    UNKNOWN = "unknown"


@dataclass
class DeadLetter:
    """A failed order event with context for investigation."""
    event_id: str
    order_id: str
    event_type: str
    original_payload: Dict[str, Any]
    
    failure_type: FailureType
    error_category: ErrorCategory
    error_message: str
    stack_trace: Optional[str]
    
    retry_count: int
    first_failed_at: datetime
    last_failed_at: datetime
    
    # For resolution
    resolved: bool = False
    resolution: Optional[str] = None
    resolved_at: Optional[datetime] = None


class OrderDLQHandler:
    """
    Handles failed order events with intelligent routing.
    
    Key behaviors:
    - Classifies errors into transient/permanent/poison
    - Transient: Retry with exponential backoff
    - Permanent: Route to compensation logic
    - Poison: Quarantine immediately
    """
    
    # Error classification rules
    TRANSIENT_ERRORS = [
        FailureType.PAYMENT_TIMEOUT,
        FailureType.SERVICE_UNAVAILABLE,
    ]
    
    PERMANENT_ERRORS = [
        FailureType.PAYMENT_DECLINED,
        FailureType.PAYMENT_FRAUD,
        FailureType.RESTAURANT_CLOSED,
        FailureType.RESTAURANT_REJECTED,
        FailureType.ITEM_UNAVAILABLE,
    ]
    
    POISON_ERRORS = [
        FailureType.INVALID_MESSAGE,
    ]
    
    def __init__(
        self,
        kafka_producer,
        db_pool,
        notification_service,
        config
    ):
        self.kafka = kafka_producer
        self.db = db_pool
        self.notifications = notification_service
        self.config = config
    
    async def handle_failure(
        self,
        event: Dict[str, Any],
        error: Exception,
        retry_count: int = 0
    ) -> None:
        """
        Handle a failed order event.
        
        Routes to appropriate handling based on error type.
        """
        # Classify the error
        failure_type = self._classify_error(error)
        category = self._get_category(failure_type)
        
        order_id = event.get('order_id', 'unknown')
        
        if category == ErrorCategory.POISON:
            # POISON: Quarantine immediately, never retry
            await self._quarantine_poison(event, error, failure_type)
            await self._alert_oncall(event, error, "Poison pill detected")
            
        elif category == ErrorCategory.PERMANENT:
            # PERMANENT: Don't retry, execute compensation
            await self._handle_permanent_failure(event, error, failure_type)
            
        elif retry_count < self.config.max_retries:
            # TRANSIENT: Retry with backoff
            await self._schedule_retry(event, retry_count + 1)
            
        else:
            # Exhausted retries: Move to DLQ
            await self._send_to_dlq(event, error, failure_type, retry_count)
            await self._notify_customer_failure(order_id)
    
    def _classify_error(self, error: Exception) -> FailureType:
        """Classify exception into failure type."""
        error_str = str(error).lower()
        
        # Payment errors
        if 'card_declined' in error_str:
            return FailureType.PAYMENT_DECLINED
        if 'timeout' in error_str and 'payment' in error_str:
            return FailureType.PAYMENT_TIMEOUT
        if 'fraud' in error_str:
            return FailureType.PAYMENT_FRAUD
        
        # Restaurant errors
        if 'restaurant_closed' in error_str:
            return FailureType.RESTAURANT_CLOSED
        if 'rejected' in error_str:
            return FailureType.RESTAURANT_REJECTED
        if 'out_of_stock' in error_str or 'unavailable' in error_str:
            return FailureType.ITEM_UNAVAILABLE
        
        # Processing errors
        if isinstance(error, (json.JSONDecodeError, KeyError, TypeError)):
            return FailureType.INVALID_MESSAGE
        if 'service unavailable' in error_str or '503' in error_str:
            return FailureType.SERVICE_UNAVAILABLE
        
        return FailureType.UNKNOWN
    
    def _get_category(self, failure_type: FailureType) -> ErrorCategory:
        """Get error category for failure type."""
        if failure_type in self.TRANSIENT_ERRORS:
            return ErrorCategory.TRANSIENT
        if failure_type in self.PERMANENT_ERRORS:
            return ErrorCategory.PERMANENT
        if failure_type in self.POISON_ERRORS:
            return ErrorCategory.POISON
        return ErrorCategory.TRANSIENT  # Default to retry for unknown
    
    async def _schedule_retry(self, event: Dict, retry_count: int):
        """Schedule retry with exponential backoff."""
        # Determine delay based on retry count
        delays = ['1m', '5m', '15m']
        delay = delays[min(retry_count - 1, len(delays) - 1)]
        
        retry_topic = f"orders.retry.{delay}"
        
        # Add retry metadata
        event['_retry_count'] = retry_count
        event['_retry_at'] = datetime.utcnow().isoformat()
        
        await self.kafka.send(
            retry_topic,
            key=event['order_id'].encode(),
            value=json.dumps(event).encode()
        )
    
    async def _handle_permanent_failure(
        self,
        event: Dict,
        error: Exception,
        failure_type: FailureType
    ):
        """Handle permanent failures with compensation logic."""
        order_id = event.get('order_id')
        
        if failure_type == FailureType.PAYMENT_DECLINED:
            # Notify customer, mark order as payment_failed
            await self._update_order_status(order_id, 'payment_failed')
            await self.notifications.send(
                user_id=event['customer_id'],
                template='payment_declined',
                data={'order_id': order_id}
            )
            
        elif failure_type == FailureType.RESTAURANT_REJECTED:
            # Try alternative restaurant or refund
            alternative = await self._find_alternative_restaurant(event)
            if alternative:
                await self._reroute_order(order_id, alternative)
            else:
                await self._refund_order(order_id)
                
        elif failure_type == FailureType.ITEM_UNAVAILABLE:
            # Notify customer, offer alternatives
            await self.notifications.send(
                user_id=event['customer_id'],
                template='item_unavailable',
                data={'order_id': order_id, 'items': event.get('unavailable_items')}
            )
        
        # Always record in DLQ for tracking
        await self._send_to_dlq(event, error, failure_type, 0, resolved=True)
    
    async def _quarantine_poison(
        self,
        event: Dict,
        error: Exception,
        failure_type: FailureType
    ):
        """Quarantine poison pill - never retry."""
        await self.kafka.send(
            'orders.dlq.poison',
            key=event.get('order_id', 'unknown').encode(),
            value=json.dumps({
                'event': event,
                'error': str(error),
                'stack_trace': traceback.format_exc(),
                'quarantined_at': datetime.utcnow().isoformat()
            }).encode()
        )
    
    async def _send_to_dlq(
        self,
        event: Dict,
        error: Exception,
        failure_type: FailureType,
        retry_count: int,
        resolved: bool = False
    ):
        """Send to dead letter queue for investigation."""
        dead_letter = DeadLetter(
            event_id=event.get('event_id', 'unknown'),
            order_id=event.get('order_id', 'unknown'),
            event_type=event.get('event_type', 'unknown'),
            original_payload=event,
            failure_type=failure_type,
            error_category=self._get_category(failure_type),
            error_message=str(error),
            stack_trace=traceback.format_exc(),
            retry_count=retry_count,
            first_failed_at=datetime.utcnow(),
            last_failed_at=datetime.utcnow(),
            resolved=resolved
        )
        
        await self.kafka.send(
            'orders.dlq',
            key=event.get('order_id', 'unknown').encode(),
            value=json.dumps(dead_letter.__dict__, default=str).encode()
        )

Deep Dive 4: Audit Trail for Disputes (Day 5)

Interviewer: "You mentioned 2% of orders are disputed. How do you support customer support investigations?"

You: "This is where Day 5's audit log pattern comes in. Every order state change is immutably recorded for investigation."

The Problem

DISPUTE SCENARIOS

Customer: "I was charged but never got my food!"

Support needs to answer:
  - Did the order actually go through?
  - Was payment confirmed?
  - Did the restaurant receive the notification?
  - Was a driver assigned?
  - What happened at each step?
  - Who/what caused the failure?

Without audit log:
  - Check order table: only current state
  - Check logs: distributed across services, hard to correlate
  - Check payments: separate system, might be inconsistent
  
  Result: "We can't determine what happened"
  Customer: Unhappy, posts on social media
  Company: Reputation damage, potential lawsuit


With audit log:
  - Query: SELECT * FROM audit_events WHERE order_id = 'xyz' ORDER BY timestamp
  - See complete timeline of every state change
  - See who/what triggered each change
  - Resolve in minutes, not hours

The Solution

AUDIT LOG ARCHITECTURE

Every service publishes audit events:
  Order Service ──┬──▶ orders.events (for processing)
                  └──▶ audit.events (for audit trail)

Audit events are:
  - Immutable (no UPDATE, no DELETE)
  - Complete (all context included)
  - Queryable (indexed by order_id, actor, time)
  - Retained (2 years for legal)

Audit Consumer:
  audit.events ──▶ Audit Consumer ──▶ TimescaleDB
                                           │
                                    ┌──────┴──────┐
                                    │             │
                                90 days         Archive
                               (hot query)     (S3 Parquet)

Implementation

# Audit Log System for Order Disputes
# Applies Day 5: Audit Log System

from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional, Dict, Any, List
from enum import Enum
import json
import hashlib


class AuditAction(Enum):
    # Order lifecycle
    ORDER_PLACED = "order.placed"
    ORDER_CANCELLED = "order.cancelled"
    ORDER_COMPLETED = "order.completed"
    
    # Payment
    PAYMENT_INITIATED = "payment.initiated"
    PAYMENT_CONFIRMED = "payment.confirmed"
    PAYMENT_FAILED = "payment.failed"
    PAYMENT_REFUNDED = "payment.refunded"
    
    # Restaurant
    RESTAURANT_NOTIFIED = "restaurant.notified"
    RESTAURANT_ACCEPTED = "restaurant.accepted"
    RESTAURANT_REJECTED = "restaurant.rejected"
    
    # Driver
    DRIVER_ASSIGNED = "driver.assigned"
    DRIVER_PICKED_UP = "driver.picked_up"
    DRIVER_DELIVERED = "driver.delivered"
    
    # Support
    SUPPORT_INTERVENTION = "support.intervention"
    DISPUTE_OPENED = "dispute.opened"
    DISPUTE_RESOLVED = "dispute.resolved"


@dataclass
class AuditEvent:
    """
    Immutable audit event for order tracking.
    
    Contains all context needed to understand what happened,
    who did it, and when.
    """
    # Identity
    event_id: str
    timestamp: datetime
    
    # What happened
    action: AuditAction
    order_id: str
    
    # Who did it
    actor_type: str  # 'customer', 'system', 'restaurant', 'driver', 'support'
    actor_id: str
    
    # State change
    previous_state: Optional[str]
    new_state: str
    
    # Details
    details: Dict[str, Any]
    
    # Context for debugging
    service_name: str
    correlation_id: str
    
    # Integrity
    previous_hash: Optional[str] = None
    event_hash: Optional[str] = None
    
    def compute_hash(self) -> str:
        """Compute hash for tamper detection."""
        content = json.dumps({
            'event_id': self.event_id,
            'timestamp': self.timestamp.isoformat(),
            'action': self.action.value,
            'order_id': self.order_id,
            'actor_id': self.actor_id,
            'previous_hash': self.previous_hash
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()


class AuditService:
    """
    Service for recording and querying audit events.
    
    Key guarantees:
    - Events are immutable once written
    - Hash chain provides tamper evidence
    - Fast queries by order_id for support
    """
    
    def __init__(self, kafka_producer, db_pool):
        self.kafka = kafka_producer
        self.db = db_pool
        self._last_hash: Optional[str] = None
    
    async def record(
        self,
        action: AuditAction,
        order_id: str,
        actor_type: str,
        actor_id: str,
        previous_state: Optional[str],
        new_state: str,
        details: Dict[str, Any],
        service_name: str,
        correlation_id: str
    ) -> str:
        """
        Record an audit event.
        
        Returns the event_id.
        """
        import uuid
        
        event = AuditEvent(
            event_id=f"aud_{uuid.uuid4().hex[:12]}",
            timestamp=datetime.utcnow(),
            action=action,
            order_id=order_id,
            actor_type=actor_type,
            actor_id=actor_id,
            previous_state=previous_state,
            new_state=new_state,
            details=details,
            service_name=service_name,
            correlation_id=correlation_id,
            previous_hash=self._last_hash
        )
        
        # Compute hash for chain
        event.event_hash = event.compute_hash()
        self._last_hash = event.event_hash
        
        # Publish to Kafka
        await self.kafka.send(
            'audit.events',
            key=order_id.encode(),
            value=json.dumps(asdict(event), default=str).encode()
        )
        
        return event.event_id
    
    async def get_order_timeline(self, order_id: str) -> List[Dict]:
        """
        Get complete timeline for an order.
        
        Used by customer support for dispute resolution.
        """
        rows = await self.db.fetch("""
            SELECT 
                event_id,
                timestamp,
                action,
                actor_type,
                actor_id,
                previous_state,
                new_state,
                details
            FROM audit_events
            WHERE order_id = $1
            ORDER BY timestamp ASC
        """, order_id)
        
        return [dict(row) for row in rows]
    
    async def verify_integrity(self, order_id: str) -> Dict[str, Any]:
        """
        Verify hash chain integrity for an order.
        
        Used for dispute verification - proves logs weren't tampered.
        """
        rows = await self.db.fetch("""
            SELECT event_id, event_hash, previous_hash
            FROM audit_events
            WHERE order_id = $1
            ORDER BY timestamp ASC
        """, order_id)
        
        issues = []
        prev_hash = None
        
        for row in rows:
            if row['previous_hash'] != prev_hash:
                issues.append({
                    'event_id': row['event_id'],
                    'issue': 'Chain break detected',
                    'expected': prev_hash,
                    'actual': row['previous_hash']
                })
            prev_hash = row['event_hash']
        
        return {
            'order_id': order_id,
            'events_checked': len(rows),
            'integrity': 'valid' if not issues else 'compromised',
            'issues': issues
        }


class SupportDashboard:
    """
    Support dashboard for dispute resolution.
    
    Provides easy access to order history and investigation tools.
    """
    
    def __init__(self, audit_service: AuditService, order_service):
        self.audit = audit_service
        self.orders = order_service
    
    async def investigate_order(self, order_id: str) -> Dict[str, Any]:
        """
        Get complete investigation package for an order.
        
        Used by support agents during disputes.
        """
        # Get current order state
        order = await self.orders.get_order(order_id)
        
        # Get complete timeline
        timeline = await self.audit.get_order_timeline(order_id)
        
        # Verify integrity
        integrity = await self.audit.verify_integrity(order_id)
        
        # Get related DLQ entries if any
        dlq_entries = await self._get_dlq_entries(order_id)
        
        return {
            'order': order,
            'timeline': timeline,
            'integrity': integrity,
            'failures': dlq_entries,
            'summary': self._generate_summary(timeline)
        }
    
    def _generate_summary(self, timeline: List[Dict]) -> str:
        """Generate human-readable summary of order events."""
        if not timeline:
            return "No events found for this order."
        
        summary_lines = []
        for event in timeline:
            ts = event['timestamp'].strftime('%H:%M:%S')
            action = event['action'].replace('.', ' ').title()
            actor = f"{event['actor_type']}:{event['actor_id'][:8]}"
            summary_lines.append(f"{ts} - {action} by {actor}")
        
        return "\n".join(summary_lines)

Query Examples for Support

You: "Here's how support would use this for common disputes:"

-- "I never got my food but was charged"
SELECT 
    timestamp,
    action,
    actor_type,
    new_state,
    details
FROM audit_events
WHERE order_id = 'ord_abc123'
ORDER BY timestamp;

-- Result shows:
-- 18:30:01 - order.placed (customer)
-- 18:30:02 - payment.confirmed (system) ← Payment went through
-- 18:30:05 - restaurant.notified (system)
-- 18:31:00 - restaurant.accepted (restaurant)
-- 18:35:00 - driver.assigned (system)
-- 18:50:00 - driver.picked_up (driver)
-- 19:15:00 - ??? (no delivery event!)
-- 
-- Gap after pickup suggests driver issue


-- "The restaurant says they never got my order"
SELECT * FROM audit_events
WHERE order_id = 'ord_xyz789'
AND action = 'restaurant.notified';

-- Shows notification was sent at 18:30:05
-- Check details for delivery confirmation

Phase 5: Scaling and Edge Cases (5 minutes)

Interviewer: "How would this system scale to 10x the current load?"

Scaling Strategy

You: "There are several scaling vectors to consider..."

CURRENT → 10X SCALE

                        Current         10X Target
Orders/second (peak):   50              500
Events/second:          800             8,000
Consumer instances:     28              100+
Kafka partitions:       32              128
Database writes/sec:    1,000           10,000

Horizontal Scaling

COMPONENT SCALING

1. KAFKA
   Current: 32 partitions, 3 brokers
   10X: 128 partitions, 10+ brokers
   Why: More partitions = more parallel consumers

2. CONSUMERS
   Current: 7 services × 4 instances = 28
   10X: 7 services × 16 instances = 112
   Limit: Instances ≤ partitions per consumer group

3. ORDER DATABASE
   Current: Single primary + replicas
   10X: Sharded by customer_id or city
   Why: Write throughput is the bottleneck

4. AUDIT DATABASE (TimescaleDB)
   Current: Single instance, 4TB
   10X: Distributed TimescaleDB or ClickHouse
   Why: Write throughput + query performance

5. OUTBOX PUBLISHER
   Current: CDC with Debezium
   10X: Still CDC, but watch for lag
   May need: Multiple Debezium instances per shard

Bottleneck Analysis

Component	Current Limit	First to Break	Solution
API Gateway	10,000 req/s	No	Already scaled
Order DB writes	2,000/s	Yes, at 5x	Shard by city
Kafka ingestion	100,000 msg/s	No	Plenty of headroom
Consumer processing	500 orders/s	Yes, at 10x	Add instances
Audit DB writes	5,000/s	Yes, at 6x	Batch inserts, shard

Edge Cases

Interviewer: "What are some edge cases we should handle?"

Edge Case 1: Customer Places Order During Restaurant Close

Scenario:
  Customer orders at 9:59 PM
  Restaurant closes at 10:00 PM
  Order reaches restaurant at 10:01 PM
  Restaurant system auto-rejects

Handling:
  1. Check restaurant hours during order placement
  2. If within 15 min of close, warn customer
  3. If rejected due to close, auto-refund immediately
  4. Audit: Record "auto_rejected_closing_time"

Edge Case 2: Payment Confirmed, Then Fraud Detected

Scenario:
  Payment confirmed at 18:30
  Fraud detection flags at 18:35
  Order already sent to restaurant

Handling:
  1. Fraud service publishes fraud.detected event
  2. Order service cancels order
  3. Restaurant notified of cancellation
  4. Payment reversed (refund)
  5. Customer notified of security hold
  6. Audit: Full trace for investigation

Edge Case 3: Kafka Cluster Unavailable

Scenario:
  Kafka down for 30 minutes
  Orders still being placed

Handling:
  1. Outbox accumulates events (DB still working)
  2. Alert fired at 1 minute of lag
  3. API continues accepting orders (degraded mode)
  4. When Kafka recovers, outbox publisher catches up
  5. No orders lost due to transactional outbox

Failure Scenarios

Failure	Detection	Impact	Recovery
Kafka broker down	Health check	Reduced throughput	Replication handles it
Payment service down	Circuit breaker	Orders queue up	Retry topic with backoff
Database primary down	Connection errors	Orders fail	Failover to replica
Consumer crash	Kubernetes	Temporary lag	Auto-restart, resume from offset
Outbox publisher down	Lag monitoring	Events delayed	Restart, outbox catches up

Phase 6: Monitoring and Operations

Interviewer: "How would you monitor this system in production?"

Key Metrics

You: "I would track metrics at multiple levels..."

Business Metrics

ORDER HEALTH
├── Orders placed per minute: Target 50/min normal, 200/min peak
├── Order completion rate: Target > 95%
├── Average order-to-delivery time: Target < 45 min
└── Dispute rate: Target < 2%

PAYMENT HEALTH  
├── Payment success rate: Target > 98%
├── Payment latency p99: Target < 3s
└── Refund rate: Target < 5%

RESTAURANT HEALTH
├── Restaurant acceptance rate: Target > 90%
├── Notification latency p99: Target < 10s
└── Restaurant response time: Target < 5 min

System Metrics

┌─────────────────────────────────────────────────────────────────────────┐
│                         MONITORING DASHBOARD                            │
│                                                                         │
│  KAFKA HEALTH                                                           │
│  ├── Consumer lag:              < 1,000          [████████░░] 800       │
│  ├── Messages/sec:              > 500            [██████████] 850       │
│  └── Under-replicated:          0                [██████████] 0         │
│                                                                         │
│  ORDER SERVICE                                                          │
│  ├── API latency p99:           < 200ms          [████████░░] 180ms     │
│  ├── Error rate:                < 1%             [█░░░░░░░░░] 0.3%      │
│  └── Outbox depth:              < 100            [██░░░░░░░░] 25        │
│                                                                         │
│  CONSUMERS                                                              │
│  ├── Payment processing:        < 3s             [██████░░░░] 2.1s      │
│  ├── Restaurant notification:   < 10s            [████░░░░░░] 4s        │
│  └── DLQ depth (all topics):    < 50             [█░░░░░░░░░] 12        │
│                                                                         │
│  AUDIT                                                                  │
│  ├── Write latency:             < 100ms          [███░░░░░░░] 45ms      │
│  ├── Query latency:             < 500ms          [████░░░░░░] 200ms     │
│  └── Storage used:              < 80%            [██████░░░░] 62%       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Alerting Strategy

CRITICAL (PagerDuty - Wake someone up):
  - Consumer lag > 50,000 (falling behind badly)
  - DLQ depth > 100 (many failures)
  - Payment success rate < 90% (payments broken)
  - Outbox depth > 10,000 (Kafka might be down)
  - Kafka under-replicated partitions > 0

WARNING (Slack - Investigate soon):
  - Consumer lag > 10,000
  - DLQ depth > 20
  - Payment latency p99 > 5s
  - Restaurant acceptance rate < 80%
  - Order completion rate < 90%

INFO (Dashboard only):
  - New deployment detected
  - Consumer rebalance occurred
  - Traffic spike (but within limits)

Runbook Snippet

RUNBOOK: Consumer Lag Spike

SYMPTOMS:
  - Alert: "Consumer lag > 10,000"
  - Dashboard shows growing lag
  - Order processing delays reported

DIAGNOSIS:
  1. Check consumer health:
     kubectl get pods -l app=order-consumer
     
  2. Check Kafka broker health:
     kafka-broker-api-versions.sh --bootstrap-server kafka:9092
     
  3. Check for poison pills:
     kafka-console-consumer --topic orders.dlq --from-beginning | head -10

  4. Check consumer logs:
     kubectl logs -l app=order-consumer --tail=100 | grep ERROR

RESOLUTION:
  1. If consumers crashed:
     kubectl rollout restart deployment/order-consumer
     
  2. If poison pills detected:
     # Skip the message (requires offset management)
     # Or fix and replay from DLQ
     
  3. If sustained high load:
     kubectl scale deployment/order-consumer --replicas=8
     
  4. If Kafka broker issues:
     Escalate to infrastructure team

POST-RESOLUTION:
  - Monitor lag returning to normal
  - Check DLQ for any messages that need replay
  - Document incident in postmortem

Interview Conclusion

Interviewer: "Excellent work. You've demonstrated strong understanding of event-driven systems, handled the messaging patterns well, and showed good operational thinking. Any questions for me?"

You: "Thank you! I'd love to hear how your current system handles the dinner rush - do you use the backpressure patterns we discussed, or have you found different strategies that work better at your scale?"

Interviewer: "Great question. We actually faced exactly the problems you described with Kafka lag during peaks. We ended up implementing priority queues similar to what you proposed. The trickiest part was getting the load shedding priorities right - we had some debates about whether driver assignment notifications should be shedable. In the end, we decided they're critical because a delayed driver assignment is worse than a slightly delayed analytics event."

You: "That makes sense. I'd also be curious about your DLQ investigation workflow - is it automated or do you have engineers reviewing manually?"

Interviewer: "Mostly automated for known failure types, manual review for unknowns. We've built up a library of auto-fix rules over time. Thanks for a great discussion!"

Summary: Week 3 Concepts Applied

Week 3 Concepts Integration

Day	Concept	Application in This Design
Day 1	Queue vs Stream	Chose Kafka for multi-consumer fanout, replay capability, and ordering per order_id
Day 2	Transactional Outbox	Order creation + event publishing in same DB transaction, CDC for low-latency publishing
Day 3	Backpressure	Consumer lag monitoring, priority-based shedding, adaptive batch sizing during dinner rush
Day 4	Dead Letters	Categorized DLQs for transient/permanent/poison, retry topics with delays, compensation logic
Day 5	Audit Log	Hash-chained immutable events, TimescaleDB for queries, support dashboard for disputes

How Concepts Connect

ORDER FLOW WITH ALL WEEK 3 CONCEPTS

Customer places order
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  ORDER SERVICE                                              │
│                                                             │
│  BEGIN TRANSACTION ◄────────────────────── Day 2: Outbox    │
│    INSERT INTO orders                                       │
│    INSERT INTO outbox (order.placed)                        │
│    INSERT INTO audit (order.placed)  ◄───── Day 5: Audit    │
│  COMMIT                                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
        │
        │ CDC (Debezium)
        ▼
┌─────────────────────────────────────────────────────────────┐
│  KAFKA ◄──────────────────────────────────── Day 1: Stream  │
│                                                             │
│  orders.events ──┬──▶ Payment Consumer                      │
│                  ├──▶ Restaurant Consumer                   │
│                  ├──▶ Driver Consumer                       │
│                  ├──▶ Notification Consumer                 │
│                  ├──▶ Analytics Consumer                    │
│                  └──▶ Audit Consumer ◄────── Day 5          │
│                                                             │
│  Backpressure: Lag monitoring, batch sizing ◄── Day 3       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
        │
        │ On failure
        ▼
┌────────────────────────────────────────────────────────────┐
│  FAILURE HANDLING ◄─────────────────────────── Day 4: DLQ  │
│                                                            │
│  orders.retry.1m ──▶ Delayed retry                         │
│  orders.retry.5m ──▶ Longer delay                          │
│  orders.dlq ──▶ Investigation + replay                     │
│  orders.dlq.poison ──▶ Quarantine + alert                  │
│                                                            │
└────────────────────────────────────────────────────────────┘

Code Patterns Demonstrated

1. TRANSACTIONAL OUTBOX (Day 2)
   - Single transaction for order + event
   - CDC-based publishing
   - Key insight: Never dual-write to DB and queue

2. BACKPRESSURE HANDLING (Day 3)
   - Consumer lag monitoring
   - Priority-based load shedding
   - Key insight: Graceful degradation over crash

3. ERROR CLASSIFICATION (Day 4)
   - Transient vs Permanent vs Poison
   - Delayed retry topics
   - Key insight: Different errors need different handling

4. AUDIT TRAIL (Day 5)
   - Hash-chained events
   - Immutable storage
   - Key insight: Complete context for investigation

Self-Assessment Checklist

After studying this capstone, you should be able to:

Key Interview Takeaways

WHEN DESIGNING EVENT-DRIVEN SYSTEMS:

1. START WITH DELIVERY GUARANTEES
   "What happens if this event is lost?"
   If critical → Transactional outbox
   
2. CHOOSE MESSAGING PATTERN BY CONSUMPTION
   Multiple consumers? → Kafka (stream)
   Single consumer, task distribution? → RabbitMQ (queue)
   
3. PLAN FOR FAILURE FROM DAY ONE
   Transient → Retry with backoff
   Permanent → Compensation logic
   Poison → Quarantine immediately
   
4. PROTECT AGAINST OVERLOAD
   Monitor lag continuously
   Have shedding priorities defined
   Never shed critical events
   
5. TRACK EVERYTHING FOR INVESTIGATION
   Audit every state change
   Include full context
   Make it queryable

This capstone problem integrates all concepts from Week 3 of the System Design Mastery Series. Use this as a template for approaching similar interview problems involving event-driven architectures.

Back to Course Overview