Himanshu Kukreja

Weeks 1-6 Capstone: Food Delivery Order System

A Complete System Design Interview Integrating All Concepts


The Interview Begins

You walk into the interview room at a food delivery company. The interviewer smiles and gestures to the whiteboard.

Interviewer: "Thanks for coming in today. We're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions — this is collaborative."

They write on the whiteboard:

╔══════════════════════════════════════════════════════════════════════╗
║                                                                      ║
║              Design a Food Delivery Order System                     ║
║                                                                      ║
║   We're building the core ordering platform for a food delivery      ║
║   service like DoorDash or Uber Eats.                                ║
║                                                                      ║
║   Key capabilities needed:                                           ║
║   - Customers browse restaurants and place orders                    ║
║   - Orders are routed to restaurants for preparation                 ║
║   - Drivers are assigned and tracked in real-time                    ║
║   - All parties receive status updates throughout                    ║
║   - Payments are processed reliably                                  ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝

Interviewer: "Take a moment to think about this. We have about 45 minutes. Where would you like to start?"


Phase 1: Requirements Clarification

Before diving in, you take a breath and start asking questions.

You: "Before I start designing, I'd like to clarify some requirements. First, what's our scale? How many orders per day are we handling?"

Interviewer: "We're one of the larger players. About 5 million orders per day, concentrated heavily during lunch and dinner peaks."

You: "So during peak hours — say 6-8 PM — we might see 3-4x the average rate. What's the geographic scope? Single country or global?"

Interviewer: "Focus on a single country for now, but the design should be extensible to multiple regions."

You: "For the order flow, once a customer places an order, what's the expected time until a driver is assigned?"

Interviewer: "We want driver assignment within 30 seconds of restaurant confirmation. The restaurant should see the order within 2 seconds of placement."

You: "What about payment? Do we handle payments ourselves or use a payment processor?"

Interviewer: "We use Stripe for payment processing, but we need to handle the complexity of: authorization at order time, capture when driver picks up, and potential refunds."

You: "Last question — how important is real-time tracking? Do customers need live driver location?"

Interviewer: "Yes, live tracking is essential. Customers expect to see driver location updating every few seconds."

You: "Perfect. Let me summarize the requirements."

Functional Requirements

1. ORDER PLACEMENT
   - Customer browses restaurants (filtered by location, cuisine, rating)
   - Customer adds items to cart and places order
   - Order is validated (restaurant open, items available, address deliverable)
   - Payment is authorized

2. ORDER PROCESSING
   - Order sent to restaurant in real-time
   - Restaurant confirms and provides prep time estimate
   - System assigns optimal driver
   - Driver accepts/rejects assignment

3. ORDER FULFILLMENT
   - Track order status: placed → confirmed → preparing → ready → picked up → delivered
   - Real-time driver location tracking
   - ETA updates throughout

4. NOTIFICATIONS
   - Customer: order confirmed, driver assigned, driver arriving, delivered
   - Restaurant: new order, driver arriving for pickup
   - Driver: new assignment, navigation updates

5. PAYMENTS
   - Authorize at order placement
   - Capture when driver picks up
   - Handle refunds for cancellations or issues

Non-Functional Requirements

1. SCALE
   - 5M orders/day → ~60 orders/sec average
   - Peak: 200-300 orders/sec during dinner rush
   - 100K concurrent users browsing
   - 50K active drivers at peak

2. LATENCY
   - Restaurant sees order: < 2 seconds
   - Driver assignment: < 30 seconds from restaurant confirmation
   - Location updates: every 3-5 seconds
   - Order status API: < 100ms p99

3. RELIABILITY
   - Orders must never be lost (durability)
   - Payments must be exactly-once (no double charges)
   - Driver assignment must be atomic (no double assignments)

4. AVAILABILITY
   - 99.9% uptime (8.7 hours downtime/year max)
   - Graceful degradation during partial failures

Phase 2: Back-of-Envelope Estimation

You: "Let me work through the numbers to understand what we're building."

Traffic Estimation

ORDER TRAFFIC

Daily orders:           5,000,000
Average rate:           5M / 86,400 = ~58 orders/sec
Peak multiplier:        4x during dinner (6-8 PM)
Peak rate:              ~230 orders/sec

Per order, we have:
- 1 order creation
- 3-5 status updates
- 10-20 location updates (30 min delivery, every 3 sec = 600 updates, 
  but batched to 10-20 meaningful ones)
- 5-10 notifications

Total writes at peak:   230 × 30 = ~7,000 writes/sec

READ TRAFFIC

Browsing:               100K concurrent users
                        Each browsing = 10 requests/min
                        100K × 10 / 60 = ~17,000 reads/sec

Order tracking:         Active orders at any time ≈ 100K
                        Each polling every 10 sec
                        100K / 10 = 10,000 reads/sec

Total reads at peak:    ~30,000 reads/sec
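
These figures are easy to sanity-check; a few lines of Python reproduce the arithmetic above (every input is an assumption already stated, nothing new):

# estimation_check.py
# Back-of-envelope check for the traffic numbers above

DAILY_ORDERS = 5_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 4
WRITES_PER_ORDER = 30            # creation + status changes + batched locations + notifications

avg_orders = DAILY_ORDERS / SECONDS_PER_DAY          # ~58/sec
peak_orders = avg_orders * PEAK_MULTIPLIER            # ~230/sec
peak_writes = peak_orders * WRITES_PER_ORDER          # ~7,000/sec

browse_reads = 100_000 * 10 / 60                      # 100K users x 10 req/min -> ~17K/sec
tracking_reads = 100_000 / 10                         # 100K active orders polling every 10s
peak_reads = browse_reads + tracking_reads            # ~27K/sec, rounded up to ~30K

print(f"{avg_orders:.0f}/s avg, {peak_orders:.0f}/s peak, "
      f"{peak_writes:.0f} writes/s, {peak_reads:.0f} reads/s")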

Storage Estimation

ORDER DATA

Per order:
- Order record:         2 KB (items, addresses, metadata)
- Status history:       500 bytes (10 status changes × 50 bytes)
- Location history:     2 KB (20 points × 100 bytes)
- Total per order:      ~5 KB

Daily storage:          5M × 5 KB = 25 GB/day
Yearly storage:         25 GB × 365 = ~9 TB/year
With 3x replication:    ~27 TB/year

ACTIVE DATA (hot)
- Last 24 hours orders: 5M × 5 KB = 25 GB
- Active orders (in-flight): 100K × 5 KB = 500 MB
- Driver locations: 50K × 100 bytes = 5 MB

Infrastructure Estimation

┌─────────────────────────────────────────────────────────────────────────┐
│                    ESTIMATION SUMMARY                                   │
│                                                                         │
│  TRAFFIC                                                                │
│  ├── Peak orders:           230/sec                                     │
│  ├── Peak writes:           7,000/sec                                   │
│  └── Peak reads:            30,000/sec                                  │
│                                                                         │
│  STORAGE                                                                │
│  ├── Daily new data:        25 GB                                       │
│  ├── Hot data (24h):        25 GB                                       │
│  └── Yearly growth:         9 TB                                        │
│                                                                         │
│  INFRASTRUCTURE (rough)                                                 │
│  ├── API servers:           20-30 instances                             │
│  ├── Database:              Primary + 3 read replicas                   │
│  ├── Cache (Redis):         32 GB cluster                               │
│  └── Message queue:         Kafka with 32 partitions                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Phase 3: High-Level Design

You: "Now let me sketch out the high-level architecture."

System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         HIGH-LEVEL ARCHITECTURE                         │
│                                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                      │
│  │  Customer   │  │ Restaurant  │  │   Driver    │                      │
│  │    App      │  │    App      │  │    App      │                      │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                      │
│         │                │                │                             │
│         └────────────────┼────────────────┘                             │
│                          ▼                                              │
│                   ┌─────────────┐                                       │
│                   │     CDN     │                                       │
│                   └──────┬──────┘                                       │
│                          ▼                                              │
│                   ┌─────────────┐     ┌─────────────┐                   │
│                   │     ALB     │────▶│ Rate Limiter│                   │
│                   └──────┬──────┘     └─────────────┘                   │
│                          │                                              │
│         ┌────────────────┼────────────────┐                             │
│         ▼                ▼                ▼                             │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐                     │
│  │  Order API  │  │  Driver API │  │Restaurant API│                     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬───────┘                     │
│         │                │                │                             │
│         └────────────────┼────────────────┘                             │
│                          ▼                                              │
│                   ┌─────────────┐                                       │
│                   │    Kafka    │                                       │
│                   │   (Events)  │                                       │
│                   └──────┬──────┘                                       │
│                          │                                              │
│    ┌──────────┬──────────┼──────────┬──────────┐                        │
│    ▼          ▼          ▼          ▼          ▼                        │
│ ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐                        │
│ │Order │  │Driver│  │Payment│ │Notif.│  │Track │                        │
│ │Worker│  │Match │  │Worker │ │Worker│  │Worker│                        │
│ └──┬───┘  └──┬───┘  └──┬───┘  └──┬───┘  └──┬───┘                        │
│    │         │         │         │         │                            │
│    └─────────┴─────────┴────┬────┴─────────┘                            │
│                             ▼                                           │
│    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐                  │
│    │ PostgreSQL  │   │    Redis    │   │   S3/Blob   │                  │
│    │  (Orders)   │   │   (Cache)   │   │  (History)  │                  │
│    └─────────────┘   └─────────────┘   └─────────────┘                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Component Breakdown

You: "Let me walk through each component."

1. API Layer

Order API: Handles order placement, status queries, cancellations

  • Validates orders (restaurant hours, delivery address, payment)
  • Writes to transactional outbox (Week 3 pattern)
  • Returns order ID immediately

Driver API: Handles driver location updates, assignment acceptance

  • High-frequency location updates (every 3-5 seconds)
  • Assignment accept/reject
  • Status updates (picked up, delivered)

Restaurant API: Order receipt, confirmation, prep time updates

  • Real-time order notification via WebSocket
  • Prep time estimates
  • Order ready signal

2. Event-Driven Core (Kafka)

Topics:

  • orders.created - New orders for processing
  • orders.status - Status change events
  • driver.location - Location updates
  • driver.assignment - Assignment events
  • payments.process - Payment commands
  • notifications.send - Notification requests

3. Workers

Order Worker: Orchestrates order lifecycle via saga pattern (Week 5)
Driver Matcher: Assigns optimal driver based on location, rating, load
Payment Worker: Handles Stripe integration with idempotency (Week 2)
Notification Worker: Multi-channel delivery (Week 6)
Tracking Worker: Processes location updates, calculates ETAs
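
Each of these workers runs as part of a Kafka consumer group. A minimal sketch of one such consumer loop, assuming aiokafka; the broker address and handler are illustrative, not components named above:

# workers/consumer_loop.py
# Sketch: how a worker (e.g. the Order Worker) consumes from Kafka as a consumer group member

import json
from aiokafka import AIOKafkaConsumer


async def run_order_worker(handle_order_placed):
    consumer = AIOKafkaConsumer(
        "orders.created",
        bootstrap_servers="kafka:9092",
        group_id="order-workers",      # one consumer group per worker type
        enable_auto_commit=False,      # commit only after successful processing
    )
    await consumer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            await handle_order_placed(event)   # e.g. kick off the order saga
            await consumer.commit()            # at-least-once delivery
    finally:
        await consumer.stop()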


Phase 4: Deep Dives

Interviewer: "Great overview. Let's dive deeper. How do you ensure an order is never lost?"


Deep Dive 1: Reliable Order Processing (Week 3 - Transactional Outbox)

You: "This is critical. We use the transactional outbox pattern to ensure orders are never lost, even if Kafka is temporarily unavailable."

The Problem

WITHOUT OUTBOX PATTERN

Customer places order:
  1. Write to database         ✓ Success
  2. Publish to Kafka          ✗ Kafka down!
  
Result: Order in DB but never processed
        Customer charged, no food delivered
        
This is unacceptable for a food delivery system.

The Solution

# services/order_service.py
# Applies: Week 3, Day 2 - Transactional Outbox

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid
import json


class PaymentAuthorizationError(Exception):
    """Raised when upfront payment authorization fails."""


@dataclass
class Order:
    order_id: str
    customer_id: str
    restaurant_id: str
    items: list
    delivery_address: dict
    total_amount: float
    status: str
    created_at: datetime


@dataclass
class OutboxMessage:
    message_id: str
    aggregate_type: str
    aggregate_id: str
    event_type: str
    payload: dict
    created_at: datetime


class OrderService:
    """
    Order service with transactional outbox pattern.
    
    Guarantees:
    - Order and outbox message written atomically
    - Events eventually published to Kafka
    - No order is ever lost
    """
    
    def __init__(self, db_pool, payment_service):
        self.db = db_pool
        self.payments = payment_service
    
    async def place_order(self, request: dict) -> Order:
        """
        Place an order with guaranteed event publishing.
        
        Uses single transaction for order + outbox.
        """
        
        order_id = str(uuid.uuid4())
        
        # Authorize payment first (can fail, no order created yet)
        auth_result = await self.payments.authorize(
            customer_id=request["customer_id"],
            amount=request["total_amount"],
            idempotency_key=f"order:{order_id}:auth"
        )
        
        if not auth_result.success:
            raise PaymentAuthorizationError(auth_result.error)
        
        # Single transaction: order + outbox
        async with self.db.transaction() as tx:
            # Insert order
            order = Order(
                order_id=order_id,
                customer_id=request["customer_id"],
                restaurant_id=request["restaurant_id"],
                items=request["items"],
                delivery_address=request["delivery_address"],
                total_amount=request["total_amount"],
                status="placed",
                created_at=datetime.utcnow()
            )
            
            await tx.execute("""
                INSERT INTO orders (
                    order_id, customer_id, restaurant_id, items,
                    delivery_address, total_amount, status, created_at,
                    payment_auth_id
                ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
            """, order.order_id, order.customer_id, order.restaurant_id,
                json.dumps(order.items), json.dumps(order.delivery_address),
                order.total_amount, order.status, order.created_at,
                auth_result.authorization_id)
            
            # Insert outbox message (same transaction!)
            outbox_message = OutboxMessage(
                message_id=str(uuid.uuid4()),
                aggregate_type="Order",
                aggregate_id=order_id,
                event_type="OrderPlaced",
                payload={
                    "order_id": order_id,
                    "customer_id": order.customer_id,
                    "restaurant_id": order.restaurant_id,
                    "items": order.items,
                    "total_amount": order.total_amount,
                    "payment_auth_id": auth_result.authorization_id,
                },
                created_at=datetime.utcnow()
            )
            
            await tx.execute("""
                INSERT INTO outbox (
                    message_id, aggregate_type, aggregate_id,
                    event_type, payload, created_at
                ) VALUES ($1, $2, $3, $4, $5, $6)
            """, outbox_message.message_id, outbox_message.aggregate_type,
                outbox_message.aggregate_id, outbox_message.event_type,
                json.dumps(outbox_message.payload), outbox_message.created_at)
        
        # Transaction committed - order is durable
        # Outbox publisher will send to Kafka eventually
        
        return order


class OutboxPublisher:
    """
    Publishes outbox messages to Kafka.
    
    Runs as background process, polling for unpublished messages.
    """
    
    def __init__(self, db_pool, kafka_producer):
        self.db = db_pool
        self.kafka = kafka_producer
    
    async def poll_and_publish(self, batch_size: int = 100):
        """Poll outbox and publish to Kafka."""
        
        async with self.db.transaction() as tx:
            # Lock and fetch unpublished messages
            messages = await tx.fetch("""
                SELECT * FROM outbox
                WHERE published_at IS NULL
                ORDER BY created_at
                LIMIT $1
                FOR UPDATE SKIP LOCKED
            """, batch_size)
            
            for msg in messages:
                topic = self._get_topic(msg["event_type"])
                
                # Publish to Kafka
                await self.kafka.send(
                    topic=topic,
                    key=msg["aggregate_id"].encode(),
                    value=json.dumps(msg["payload"]).encode(),
                    headers=[("message_id", msg["message_id"].encode())]
                )
                
                # Mark as published
                await tx.execute("""
                    UPDATE outbox 
                    SET published_at = NOW()
                    WHERE message_id = $1
                """, msg["message_id"])
    
    def _get_topic(self, event_type: str) -> str:
        return {
            "OrderPlaced": "orders.created",
            "OrderConfirmed": "orders.status",
            "OrderReady": "orders.status",
            "OrderDelivered": "orders.status",
        }.get(event_type, "orders.events")

Interviewer: "What happens if the outbox publisher crashes?"

You: "Since we mark messages as published only after Kafka confirms receipt, a crash means the message stays in the outbox. Next poll picks it up. We might publish twice, which is why downstream consumers must be idempotent."


Deep Dive 2: Payment Processing (Week 2 - Idempotency)

Interviewer: "Speaking of payments — how do you ensure we never double-charge a customer?"

You: "This is where idempotency is critical. We use idempotency keys for every payment operation."

The Problem

DOUBLE CHARGE SCENARIO

1. Customer places order
2. Payment authorized
3. Driver picks up food  
4. We call Stripe to capture payment
5. Network timeout — did it succeed?
6. We retry capture
7. Customer charged twice!

This WILL happen at scale. We need to prevent it.

The Solution

# services/payment_service.py
# Applies: Week 2, Day 2 - Idempotency

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
from enum import Enum
import hashlib


class PaymentStatus(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    REFUNDED = "refunded"
    FAILED = "failed"


@dataclass 
class PaymentRecord:
    payment_id: str
    order_id: str
    amount: float
    status: PaymentStatus
    idempotency_key: str
    stripe_payment_intent_id: Optional[str]
    created_at: datetime
    updated_at: datetime


@dataclass
class AuthorizationResult:
    """Result of a payment authorization attempt."""
    success: bool
    authorization_id: Optional[str] = None
    amount: Optional[float] = None
    error: Optional[str] = None


@dataclass
class CaptureResult:
    """Result of a payment capture attempt."""
    success: bool
    captured: bool = False
    error: Optional[str] = None


class PaymentService:
    """
    Payment service with idempotency guarantees.
    
    Every operation uses idempotency keys to ensure:
    - Authorization happens exactly once
    - Capture happens exactly once
    - Refund happens exactly once
    """
    
    def __init__(self, db_pool, stripe_client, redis_client):
        self.db = db_pool
        self.stripe = stripe_client
        self.redis = redis_client
        
        # Idempotency window (Week 2 concept)
        self.idempotency_ttl = timedelta(hours=24)
    
    async def authorize(
        self,
        customer_id: str,
        amount: float,
        idempotency_key: str
    ) -> AuthorizationResult:
        """
        Authorize payment with idempotency.
        
        Same idempotency_key always returns same result.
        """
        
        # Check if we've seen this key before
        existing = await self._get_idempotent_result(idempotency_key)
        if existing:
            return existing
        
        # Try to acquire lock for this key
        lock_acquired = await self._acquire_idempotency_lock(idempotency_key)
        if not lock_acquired:
            # Another request is processing - wait and return their result
            return await self._wait_for_result(idempotency_key)
        
        try:
            # Create payment intent with Stripe
            # Stripe also accepts idempotency key!
            intent = await self.stripe.create_payment_intent(
                amount=int(amount * 100),  # cents
                currency="usd",
                customer=customer_id,
                capture_method="manual",  # Authorize only
                idempotency_key=idempotency_key
            )
            
            result = AuthorizationResult(
                success=True,
                authorization_id=intent.id,
                amount=amount
            )
            
            # Store result for future idempotent requests
            await self._store_idempotent_result(idempotency_key, result)
            
            return result
            
        except StripeError as e:
            result = AuthorizationResult(
                success=False,
                error=str(e)
            )
            await self._store_idempotent_result(idempotency_key, result)
            return result
            
        finally:
            await self._release_idempotency_lock(idempotency_key)
    
    async def capture(
        self,
        order_id: str,
        authorization_id: str,
        amount: float
    ) -> CaptureResult:
        """
        Capture authorized payment.
        
        Idempotency key derived from order_id ensures exactly-once capture.
        """
        
        idempotency_key = f"capture:{order_id}"
        
        existing = await self._get_idempotent_result(idempotency_key)
        if existing:
            return existing
        
        lock_acquired = await self._acquire_idempotency_lock(idempotency_key)
        if not lock_acquired:
            return await self._wait_for_result(idempotency_key)
        
        try:
            # Check current state in our database
            payment = await self._get_payment_by_order(order_id)
            
            if payment and payment.status == PaymentStatus.CAPTURED:
                # Already captured - return success
                return CaptureResult(success=True, captured=True)
            
            # Capture with Stripe (Stripe idempotency handles their side)
            intent = await self.stripe.capture_payment_intent(
                authorization_id,
                amount_to_capture=int(amount * 100),
                idempotency_key=idempotency_key
            )
            
            # Update our database
            await self._update_payment_status(
                order_id,
                PaymentStatus.CAPTURED
            )
            
            result = CaptureResult(success=True, captured=True)
            await self._store_idempotent_result(idempotency_key, result)
            
            return result
            
        except StripeError as e:
            result = CaptureResult(success=False, error=str(e))
            await self._store_idempotent_result(idempotency_key, result)
            return result
            
        finally:
            await self._release_idempotency_lock(idempotency_key)
    
    async def _acquire_idempotency_lock(self, key: str) -> bool:
        """Acquire distributed lock for idempotency key."""
        lock_key = f"idem_lock:{key}"
        return await self.redis.set(
            lock_key,
            "1",
            nx=True,  # Only if not exists
            ex=30     # 30 second timeout
        )
    
    async def _get_idempotent_result(self, key: str):
        """Get stored result for idempotency key."""
        result_key = f"idem_result:{key}"
        data = await self.redis.get(result_key)
        if data:
            return self._deserialize_result(data)
        return None
    
    async def _store_idempotent_result(self, key: str, result):
        """Store result for future idempotent requests."""
        result_key = f"idem_result:{key}"
        await self.redis.setex(
            result_key,
            int(self.idempotency_ttl.total_seconds()),
            self._serialize_result(result)
        )
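
A short usage sketch tying the two calls together; the concrete values and the surrounding wiring are illustrative:

# Example (sketch): using the same keys makes retries of either call safe
async def charge_for_order(order_id, db_pool, stripe_client, redis_client):
    payments = PaymentService(db_pool, stripe_client, redis_client)

    auth = await payments.authorize(
        customer_id="cust_123",
        amount=42.50,
        idempotency_key=f"order:{order_id}:auth",   # retrying returns the stored result
    )

    # ...later, when the driver picks up the order:
    return await payments.capture(
        order_id=order_id,
        authorization_id=auth.authorization_id,
        amount=42.50,                               # capture key is derived from order_id
    )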

Deep Dive 3: Driver Assignment (Week 5 - Distributed Coordination)

Interviewer: "How do you assign drivers without double-assigning the same driver to multiple orders?"

You: "This requires distributed coordination. We can't have two orders grab the same driver simultaneously."

The Problem

DOUBLE ASSIGNMENT RACE CONDITION

Order A needs driver:          Order B needs driver:
1. Query available drivers     1. Query available drivers
2. See Driver X is free        2. See Driver X is free
3. Assign Driver X to A        3. Assign Driver X to B
4. Driver X has two orders!

At 200 orders/sec during peak, this WILL happen.

The Solution

# services/driver_matcher.py
# Applies: Week 5, Day 5 - Leader Election & Coordination

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, List
import asyncio


@dataclass
class Driver:
    driver_id: str
    location: tuple  # (lat, lng)
    status: str      # available, assigned, busy
    rating: float
    current_order_id: Optional[str]


@dataclass
class AssignmentResult:
    success: bool
    driver_id: Optional[str] = None
    error: Optional[str] = None


class DriverMatcher:
    """
    Assigns drivers to orders with coordination guarantees.
    
    Uses optimistic locking to prevent double assignment.
    """
    
    def __init__(self, db_pool, redis_client, location_service):
        self.db = db_pool
        self.redis = redis_client
        self.locations = location_service
    
    async def assign_driver(
        self,
        order_id: str,
        restaurant_location: tuple,
        max_distance_km: float = 5.0
    ) -> AssignmentResult:
        """
        Find and assign optimal driver for an order.
        
        Uses optimistic locking to prevent race conditions.
        """
        
        # Find candidate drivers near restaurant
        candidates = await self._find_nearby_drivers(
            restaurant_location,
            max_distance_km
        )
        
        if not candidates:
            return AssignmentResult(
                success=False,
                error="No available drivers nearby"
            )
        
        # Sort by score (distance, rating, etc.)
        ranked = self._rank_drivers(candidates, restaurant_location)
        
        # Try to assign, starting with best candidate
        for driver in ranked:
            result = await self._try_assign(order_id, driver)
            if result.success:
                return result
        
        return AssignmentResult(
            success=False,
            error="All nearby drivers unavailable"
        )
    
    async def _try_assign(
        self,
        order_id: str,
        driver: Driver
    ) -> AssignmentResult:
        """
        Attempt to assign driver using optimistic locking.
        
        Only succeeds if driver is still available.
        """
        
        async with self.db.transaction() as tx:
            # Lock the driver row and check status
            current = await tx.fetchrow("""
                SELECT driver_id, status, version
                FROM drivers
                WHERE driver_id = $1
                FOR UPDATE
            """, driver.driver_id)
            
            if not current or current["status"] != "available":
                # Driver no longer available
                return AssignmentResult(success=False)
            
            # Update driver status atomically
            result = await tx.execute("""
                UPDATE drivers
                SET status = 'assigned',
                    current_order_id = $1,
                    version = version + 1,
                    assigned_at = NOW()
                WHERE driver_id = $2
                  AND version = $3
                  AND status = 'available'
            """, order_id, driver.driver_id, current["version"])
            
            if result == "UPDATE 0":
                # Concurrent modification - another order got them
                return AssignmentResult(success=False)
            
            # Create assignment record
            await tx.execute("""
                INSERT INTO assignments (
                    assignment_id, order_id, driver_id,
                    status, created_at
                ) VALUES (gen_random_uuid(), $1, $2, 'pending', NOW())
            """, order_id, driver.driver_id)
        
        # Successfully assigned!
        return AssignmentResult(
            success=True,
            driver_id=driver.driver_id
        )
    
    async def _find_nearby_drivers(
        self,
        location: tuple,
        max_distance_km: float
    ) -> List[Driver]:
        """Find available drivers within radius."""
        
        # Use Redis GEO for fast spatial query
        lat, lng = location
        
        # Get driver IDs within radius
        nearby_ids = await self.redis.georadius(
            "driver_locations",
            lng, lat,
            max_distance_km,
            unit="km",
            count=50
        )
        
        if not nearby_ids:
            return []
        
        # Fetch driver details (only available ones)
        drivers = await self.db.fetch("""
            SELECT driver_id, status, rating
            FROM drivers
            WHERE driver_id = ANY($1)
              AND status = 'available'
        """, [d.decode() for d in nearby_ids])
        
        return [
            Driver(
                driver_id=d["driver_id"],
                location=await self.locations.get(d["driver_id"]),
                status=d["status"],
                rating=d["rating"],
                current_order_id=None
            )
            for d in drivers
        ]
    
    def _rank_drivers(
        self,
        drivers: List[Driver],
        restaurant_location: tuple
    ) -> List[Driver]:
        """Rank drivers by assignment score."""
        
        def score(driver: Driver) -> float:
            distance = self._haversine(driver.location, restaurant_location)
            # Lower distance is better, higher rating is better
            return distance - (driver.rating * 0.5)
        
        return sorted(drivers, key=score)
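
The ranking above calls a `_haversine` helper that isn't shown; the usual implementation is the haversine great-circle formula, which `DriverMatcher._haversine` could simply delegate to:

# geo.py
# Sketch: the distance helper _rank_drivers relies on (not shown above)

import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: tuple, b: tuple) -> float:
    """Great-circle distance in kilometers between two (lat, lng) points."""
    lat1, lng1 = a
    lat2, lng2 = b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lng2 - lng1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))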

Deep Dive 4: Order Saga (Week 5 - Saga Pattern)

Interviewer: "Walk me through what happens if something fails mid-order — say the restaurant rejects the order after payment is authorized?"

You: "This is a classic saga pattern problem. We need compensation logic for each step."

The Order Saga

# services/order_saga.py
# Applies: Week 5, Day 2-3 - Saga Pattern

from dataclasses import dataclass
from enum import Enum
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class SagaStepError(Exception):
    """Raised when a saga step fails and compensation must run."""


class SagaStep(Enum):
    VALIDATE_ORDER = "validate"
    AUTHORIZE_PAYMENT = "authorize_payment"
    NOTIFY_RESTAURANT = "notify_restaurant"
    AWAIT_CONFIRMATION = "await_confirmation"
    ASSIGN_DRIVER = "assign_driver"
    CAPTURE_PAYMENT = "capture_payment"
    COMPLETE = "complete"


class SagaStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    FAILED = "failed"


@dataclass
class OrderSagaState:
    saga_id: str
    order_id: str
    current_step: SagaStep
    status: SagaStatus
    payment_auth_id: Optional[str] = None
    driver_id: Optional[str] = None
    customer_id: Optional[str] = None
    order_amount: Optional[float] = None
    compensation_reason: Optional[str] = None


class OrderSagaOrchestrator:
    """
    Orchestrates order lifecycle as a saga.
    
    Each step has a corresponding compensation action.
    If any step fails, we run compensations in reverse order.
    """
    
    def __init__(
        self,
        order_service,
        payment_service,
        restaurant_service,
        driver_service,
        notification_service,
        saga_repo
    ):
        self.orders = order_service
        self.payments = payment_service
        self.restaurants = restaurant_service
        self.drivers = driver_service
        self.notifications = notification_service
        self.sagas = saga_repo
    
    async def execute(self, order_id: str) -> OrderSagaState:
        """Execute order saga from start to completion."""
        
        state = OrderSagaState(
            saga_id=f"saga:{order_id}",
            order_id=order_id,
            current_step=SagaStep.VALIDATE_ORDER,
            status=SagaStatus.RUNNING
        )
        
        await self.sagas.save(state)
        
        try:
            # Step 1: Validate order
            state = await self._validate_order(state)
            
            # Step 2: Authorize payment
            state = await self._authorize_payment(state)
            
            # Step 3: Notify restaurant
            state = await self._notify_restaurant(state)
            
            # Step 4: Wait for restaurant confirmation
            # (This happens async via webhook/event)
            state.current_step = SagaStep.AWAIT_CONFIRMATION
            await self.sagas.save(state)
            
            return state
            
        except SagaStepError as e:
            logger.error(f"Saga {state.saga_id} failed at {state.current_step}: {e}")
            return await self._compensate(state, str(e))
    
    async def on_restaurant_confirmed(self, order_id: str, prep_time_minutes: int):
        """Called when restaurant confirms order."""
        
        state = await self.sagas.get_by_order(order_id)
        if not state or state.status != SagaStatus.RUNNING:
            return
        
        try:
            # Step 5: Assign driver
            state = await self._assign_driver(state)
            
            # Update order with driver info
            await self.orders.update_status(
                order_id,
                "driver_assigned",
                driver_id=state.driver_id
            )
            
            # Notify customer
            await self.notifications.send(
                user_id=state.customer_id,
                type="driver_assigned",
                template="driver_assigned",
                variables={"driver_id": state.driver_id}
            )
            
            state.current_step = SagaStep.COMPLETE
            state.status = SagaStatus.COMPLETED
            await self.sagas.save(state)
            
        except SagaStepError as e:
            await self._compensate(state, str(e))
    
    async def on_restaurant_rejected(self, order_id: str, reason: str):
        """Called when restaurant rejects order."""
        
        state = await self.sagas.get_by_order(order_id)
        if not state:
            return
        
        await self._compensate(state, f"Restaurant rejected: {reason}")
    
    async def on_driver_picked_up(self, order_id: str):
        """Called when driver picks up order — capture payment."""
        
        state = await self.sagas.get_by_order(order_id)
        if not state or not state.payment_auth_id:
            return
        
        # Capture payment (idempotent)
        result = await self.payments.capture(
            order_id=order_id,
            authorization_id=state.payment_auth_id,
            amount=state.order_amount
        )
        
        if not result.success:
            logger.error(f"Payment capture failed for {order_id}: {result.error}")
            # Don't compensate here — food is already with driver
            # Flag for manual review instead
    
    async def _compensate(
        self,
        state: OrderSagaState,
        reason: str
    ) -> OrderSagaState:
        """
        Run compensation actions in reverse order.
        
        Compensation order:
        1. Release driver (if assigned)
        2. Void payment authorization (if authorized)
        3. Update order status to cancelled
        4. Notify customer
        """
        
        state.status = SagaStatus.COMPENSATING
        state.compensation_reason = reason
        await self.sagas.save(state)
        
        logger.info(f"Starting compensation for saga {state.saga_id}")
        
        # Compensate driver assignment
        if state.driver_id:
            try:
                await self.drivers.release(state.driver_id)
                logger.info(f"Released driver {state.driver_id}")
            except Exception as e:
                logger.error(f"Failed to release driver: {e}")
        
        # Compensate payment authorization
        if state.payment_auth_id:
            try:
                await self.payments.void_authorization(
                    authorization_id=state.payment_auth_id,
                    idempotency_key=f"void:{state.order_id}"
                )
                logger.info(f"Voided payment auth {state.payment_auth_id}")
            except Exception as e:
                logger.error(f"Failed to void payment: {e}")
        
        # Update order status
        await self.orders.update_status(
            state.order_id,
            "cancelled",
            reason=reason
        )
        
        # Notify customer
        await self.notifications.send(
            user_id=state.customer_id,
            type="order_cancelled",
            template="order_cancelled",
            variables={"reason": reason}
        )
        
        state.status = SagaStatus.FAILED
        await self.sagas.save(state)
        
        return state
    
    async def _validate_order(self, state: OrderSagaState) -> OrderSagaState:
        """Validate order details."""
        order = await self.orders.get(state.order_id)
        
        # Check restaurant is open
        restaurant = await self.restaurants.get(order.restaurant_id)
        if not restaurant.is_open():
            raise SagaStepError("Restaurant is closed")
        
        # Check delivery address is in range
        if not restaurant.delivers_to(order.delivery_address):
            raise SagaStepError("Delivery address out of range")
        
        state.current_step = SagaStep.AUTHORIZE_PAYMENT
        state.customer_id = order.customer_id
        state.order_amount = order.total_amount
        await self.sagas.save(state)
        
        return state
    
    async def _authorize_payment(self, state: OrderSagaState) -> OrderSagaState:
        """Authorize payment."""
        result = await self.payments.authorize(
            customer_id=state.customer_id,
            amount=state.order_amount,
            idempotency_key=f"order:{state.order_id}:auth"
        )
        
        if not result.success:
            raise SagaStepError(f"Payment failed: {result.error}")
        
        state.payment_auth_id = result.authorization_id
        state.current_step = SagaStep.NOTIFY_RESTAURANT
        await self.sagas.save(state)
        
        return state
    
    async def _notify_restaurant(self, state: OrderSagaState) -> OrderSagaState:
        """Send order to restaurant."""
        order = await self.orders.get(state.order_id)
        
        await self.restaurants.send_order(
            restaurant_id=order.restaurant_id,
            order=order
        )
        
        state.current_step = SagaStep.AWAIT_CONFIRMATION
        await self.sagas.save(state)
        
        return state
    
    async def _assign_driver(self, state: OrderSagaState) -> OrderSagaState:
        """Assign driver to order."""
        order = await self.orders.get(state.order_id)
        restaurant = await self.restaurants.get(order.restaurant_id)
        
        result = await self.drivers.assign(
            order_id=state.order_id,
            restaurant_location=restaurant.location
        )
        
        if not result.success:
            raise SagaStepError(f"Driver assignment failed: {result.error}")
        
        state.driver_id = result.driver_id
        state.current_step = SagaStep.CAPTURE_PAYMENT  # payment is captured later, at pickup
        await self.sagas.save(state)
        
        return state

Deep Dive 5: Real-Time Location & Caching (Week 4)

Interviewer: "With 50K active drivers updating location every 3 seconds, how do you handle that load?"

You: "This is a perfect caching problem. We use Redis GEO for spatial queries and smart cache strategies."

# services/location_service.py
# Applies: Week 4 - Caching, Week 1 - Hot Keys

from datetime import datetime, timedelta
from typing import Optional, List, Tuple
import asyncio
import json


class LocationService:
    """
    High-frequency location tracking with caching.
    
    Scale: 50K drivers × 1 update/3sec = 17K updates/sec
    
    Strategy:
    - Redis GEO for current locations (hot data)
    - Batch writes to PostgreSQL for history (cold data)
    - Cache customer's driver location (read-heavy)
    """
    
    def __init__(self, redis_client, db_pool):
        self.redis = redis_client
        self.db = db_pool
        self._history_buffer = []
        self._buffer_lock = asyncio.Lock()
    
    async def update_location(
        self,
        driver_id: str,
        lat: float,
        lng: float,
        timestamp: datetime
    ):
        """
        Update driver location.
        
        Hot path: Redis GEO (immediate)
        Cold path: Batched PostgreSQL writes (async)
        """
        
        # Update Redis GEO (immediate, for spatial queries)
        await self.redis.geoadd(
            "driver_locations",
            lng, lat, driver_id
        )
        
        # Store current location with timestamp
        await self.redis.hset(
            f"driver:{driver_id}:location",
            mapping={
                "lat": str(lat),
                "lng": str(lng),
                "timestamp": timestamp.isoformat()
            }
        )
        
        # Buffer for batch history write
        async with self._buffer_lock:
            self._history_buffer.append({
                "driver_id": driver_id,
                "lat": lat,
                "lng": lng,
                "timestamp": timestamp
            })
            
            # Flush if buffer is large
            if len(self._history_buffer) >= 1000:
                await self._flush_history()
    
    async def get_driver_location(
        self,
        driver_id: str
    ) -> Optional[Tuple[float, float, datetime]]:
        """Get current location for a driver."""
        
        data = await self.redis.hgetall(f"driver:{driver_id}:location")
        
        if not data:
            return None
        
        return (
            float(data[b"lat"]),
            float(data[b"lng"]),
            datetime.fromisoformat(data[b"timestamp"].decode())
        )
    
    async def get_location_for_customer(
        self,
        order_id: str,
        driver_id: str
    ) -> Optional[dict]:
        """
        Get driver location for customer tracking.
        
        Uses cache to reduce Redis hits for popular orders.
        """
        
        cache_key = f"tracking:{order_id}:location"
        
        # Check cache first (5 second TTL)
        cached = await self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # Get fresh location
        location = await self.get_driver_location(driver_id)
        if not location:
            return None
        
        lat, lng, timestamp = location
        
        # Calculate ETA
        order = await self._get_order(order_id)
        eta = await self._calculate_eta(
            (lat, lng),
            order.delivery_address
        )
        
        result = {
            "lat": lat,
            "lng": lng,
            "timestamp": timestamp.isoformat(),
            "eta_minutes": eta
        }
        
        # Cache for 5 seconds
        await self.redis.setex(
            cache_key,
            5,
            json.dumps(result)
        )
        
        return result
    
    async def find_nearby_drivers(
        self,
        lat: float,
        lng: float,
        radius_km: float,
        limit: int = 50
    ) -> List[str]:
        """Find drivers within radius using Redis GEO."""
        
        results = await self.redis.georadius(
            "driver_locations",
            lng, lat,
            radius_km,
            unit="km",
            count=limit,
            sort="ASC"  # Closest first
        )
        
        return [r.decode() for r in results]
    
    async def _flush_history(self):
        """Batch write location history to PostgreSQL."""
        
        if not self._history_buffer:
            return
        
        buffer = self._history_buffer
        self._history_buffer = []
        
        # Batch insert
        await self.db.executemany("""
            INSERT INTO driver_location_history 
            (driver_id, lat, lng, recorded_at)
            VALUES ($1, $2, $3, $4)
        """, [
            (h["driver_id"], h["lat"], h["lng"], h["timestamp"])
            for h in buffer
        ])
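
One gap worth noting: the buffer above only flushes when it reaches 1,000 entries, so a quiet period could leave history rows stranded in memory. A timer-driven flush closes that gap; sketched here as a method one could add to LocationService, with an illustrative interval:

    async def run_periodic_flush(self, interval_seconds: float = 5.0):
        """Flush buffered history on a timer, regardless of buffer size."""
        while True:
            await asyncio.sleep(interval_seconds)
            async with self._buffer_lock:
                await self._flush_history()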

Deep Dive 6: Rate Limiting & Backpressure (Week 1 & 3)

Interviewer: "What if someone tries to abuse the system — placing hundreds of orders?"

You: "We implement rate limiting at multiple layers and handle backpressure gracefully."

# services/rate_limiter.py
# Applies: Week 1, Day 3 - Rate Limiting; Week 3, Day 3 - Backpressure

from datetime import datetime
from typing import Tuple


class MultiLayerRateLimiter:
    """
    Rate limiting at multiple layers.
    
    Layers:
    1. Per-user: 10 orders/hour
    2. Per-IP: 100 requests/minute  
    3. Per-restaurant: 500 orders/hour (prevent overwhelming kitchen)
    4. Global: System capacity protection
    """
    
    def __init__(self, redis_client):
        self.redis = redis_client
        
        self.limits = {
            "user_orders": {"count": 10, "window": 3600},      # 10/hour
            "ip_requests": {"count": 100, "window": 60},       # 100/min
            "restaurant_orders": {"count": 500, "window": 3600}, # 500/hour
            "global_orders": {"count": 1000, "window": 1},     # 1000/sec
        }
    
    async def check_order_allowed(
        self,
        user_id: str,
        ip_address: str,
        restaurant_id: str
    ) -> Tuple[bool, str]:
        """
        Check if order is allowed by all rate limits.
        
        Returns (allowed, reason).
        """
        
        # Check user limit
        if not await self._check_limit(f"ratelimit:user:{user_id}", "user_orders"):
            return (False, "Too many orders. Please try again later.")
        
        # Check IP limit
        if not await self._check_limit(f"ratelimit:ip:{ip_address}", "ip_requests"):
            return (False, "Too many requests. Please slow down.")
        
        # Check restaurant limit
        if not await self._check_limit(f"ratelimit:restaurant:{restaurant_id}", "restaurant_orders"):
            return (False, "Restaurant is very busy. Please try again soon.")
        
        # Check global limit
        if not await self._check_limit("ratelimit:global:orders", "global_orders"):
            return (False, "System is busy. Please try again in a moment.")
        
        return (True, "")
    
    async def _check_limit(self, key: str, limit_type: str) -> bool:
        """Check if under rate limit using sliding window."""
        
        config = self.limits[limit_type]
        now = datetime.utcnow().timestamp()
        window_start = now - config["window"]
        
        # Sliding window counter with Redis
        pipe = self.redis.pipeline()
        pipe.zremrangebyscore(key, 0, window_start)
        pipe.zcard(key)
        pipe.zadd(key, {str(now): now})
        pipe.expire(key, config["window"])
        
        results = await pipe.execute()
        current_count = results[1]
        
        return current_count < config["count"]


class BackpressureManager:
    """
    Handles system backpressure during overload.
    
    Applies: Week 3, Day 3 - Backpressure
    """
    
    def __init__(self, redis_client, kafka_admin):
        self.redis = redis_client
        self.kafka = kafka_admin
    
    async def get_system_pressure(self) -> float:
        """
        Calculate current system pressure (0-1).
        
        Based on:
        - Kafka consumer lag
        - Worker queue depth
        - Database connection usage
        """
        
        # Get Kafka lag
        lag = await self._get_kafka_lag()
        max_acceptable_lag = 100000
        lag_pressure = min(1.0, lag / max_acceptable_lag)
        
        # Get worker queue depth
        queue_depth = await self.redis.llen("order_processing_queue")
        max_queue = 10000
        queue_pressure = min(1.0, queue_depth / max_queue)
        
        # Weighted average
        return (lag_pressure * 0.6) + (queue_pressure * 0.4)
    
    async def should_accept_order(self) -> Tuple[bool, str]:
        """
        Determine if we should accept new orders.
        
        Implements graceful degradation:
        - pressure < 0.7: Accept all
        - pressure 0.7-0.9: Reject new customers
        - pressure > 0.9: Emergency mode
        """
        
        pressure = await self.get_system_pressure()
        
        if pressure < 0.7:
            return (True, "normal")
        
        if pressure < 0.9:
            return (True, "degraded")  # Maybe skip some features
        
        return (False, "System at capacity. Please try again shortly.")

Phase 5: Scaling and Edge Cases

Interviewer: "How would this system scale to 10x the current load?"

Scaling Strategy

You: "Let me analyze what breaks at 10x and how we'd address it."

SCALING TO 10X (50M ORDERS/DAY)

Current:        5M orders/day, 230/sec peak
Target:         50M orders/day, 2,300/sec peak

BOTTLENECK ANALYSIS

Component        │ Current Limit    │ 10x Solution
─────────────────┼──────────────────┼────────────────────────────
API Servers      │ 30 instances     │ 300 instances (horizontal)
PostgreSQL Write │ 5K writes/sec    │ Shard by region/restaurant
PostgreSQL Read  │ 30K reads/sec    │ More read replicas + cache
Redis            │ 100K ops/sec     │ Redis Cluster (10 shards)
Kafka            │ 100K msgs/sec    │ More partitions (320)
Location Updates │ 17K/sec          │ Batch + Redis Cluster

SHARDING STRATEGY

Orders: Shard by city/region
- NYC orders → Shard 1
- LA orders → Shard 2
- etc.

Drivers: Shard by operating region
Restaurants: Shard by city

Cross-shard queries minimized:
- Orders are local to city
- Drivers operate in one city
- Restaurants are in one city
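
A minimal sketch of the routing this implies: a stable city-to-shard mapping in front of the order database. The city table and shard count are illustrative:

# db/shard_router.py
# Sketch: route order traffic to the shard that owns the order's city

import hashlib

NUM_ORDER_SHARDS = 8                       # illustrative; grows as regions launch
CITY_TO_SHARD = {"nyc": 1, "la": 2}        # explicit assignments for big markets


def shard_for_city(city: str) -> int:
    """Stable mapping so every order for a city lands on one shard."""
    city = city.lower()
    if city in CITY_TO_SHARD:
        return CITY_TO_SHARD[city]
    digest = hashlib.md5(city.encode()).hexdigest()      # fallback: stable hash
    return int(digest, 16) % NUM_ORDER_SHARDS


class ShardRouter:
    def __init__(self, pools: dict):
        self.pools = pools                 # shard index -> connection pool

    def pool_for_city(self, city: str):
        return self.pools[shard_for_city(city)]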

Edge Cases

Interviewer: "What edge cases do you need to handle?"

EDGE CASES

1. DRIVER GOES OFFLINE MID-DELIVERY
   - Detect via missing location updates (>5 min)
   - Auto-reassign to new driver
   - Notify customer with new ETA

2. RESTAURANT CLOSES AFTER ORDER PLACED
   - Saga compensation: void payment, notify customer
   - Offer alternatives or credit

3. CUSTOMER CANCELS AFTER FOOD PREPARED
   - If before pickup: full refund
   - If after pickup: partial refund, driver keeps food
   - Track cancellation rate per customer

4. PAYMENT CAPTURE FAILS AFTER PICKUP
   - Food is already with driver — can't undo
   - Queue for retry with exponential backoff (see the sketch after this list)
   - After N failures: flag for manual review
   - Don't block delivery

5. DOUBLE ORDER (USER CLICKS TWICE)
   - Idempotency key prevents duplicate charges
   - Dedup window: 5 minutes

6. DRIVER AT WRONG LOCATION
   - Compare GPS with restaurant/customer address
   - Alert if >500m off
   - Allow manual override
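
For edge case 4, a minimal sketch of the retry loop, leaning on the idempotent capture from Deep Dive 2; attempt count, delays, and the review hook are illustrative:

# workers/capture_retry.py
# Sketch: retry payment capture with exponential backoff, then escalate

import asyncio


async def capture_with_backoff(payments, order_id, authorization_id, amount,
                               max_attempts: int = 6, base_delay: float = 2.0):
    """Safe to retry because capture() is idempotent per order_id."""
    for attempt in range(max_attempts):
        result = await payments.capture(order_id, authorization_id, amount)
        if result.success:
            return result
        await asyncio.sleep(base_delay * (2 ** attempt))   # 2s, 4s, 8s, ...

    # Still failing: never block the delivery, hand it to a human instead
    await flag_for_manual_review(order_id, reason="capture_failed")
    return result


async def flag_for_manual_review(order_id: str, reason: str):
    ...   # illustrative hook: push to a review queue / alert on-call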

Failure Scenarios

Failure            │ Detection         │ Impact                    │ Recovery
───────────────────┼───────────────────┼───────────────────────────┼────────────────────────────
PostgreSQL primary │ Health check      │ Write failures            │ Promote replica
Redis down         │ Connection errors │ Cache miss, no rate limit │ Fall back to DB
Kafka broker       │ Producer errors   │ Event delay               │ Buffer in memory, retry
Payment service    │ Timeouts          │ Can't place orders        │ Show "payment unavailable"
Location service   │ Missing updates   │ Stale ETAs                │ Show "last known"
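
The payment row above is where a circuit breaker earns its keep (the Week 2 summary below also lists one around the payment service); a minimal sketch:

# resilience/circuit_breaker.py
# Sketch: fail fast when the payment provider is unhealthy instead of piling up timeouts

import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    async def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("payment service unavailable")   # fail fast
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()      # trip the breaker
            raise
        self.failures = 0                  # a success closes the circuit again
        return result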

Phase 6: Monitoring and Operations

Interviewer: "How would you monitor this system in production?"

Key Metrics

BUSINESS METRICS
├── Orders placed/minute: Target 4,000, Alert if < 3,000
├── Order success rate: Target 98%, Alert if < 95%
├── Avg delivery time: Target 35 min, Alert if > 45 min
└── Driver utilization: Target 70%, Alert if < 50%

SYSTEM METRICS  
├── API latency p99: Target 100ms, Alert if > 500ms
├── Kafka consumer lag: Target < 1000, Alert if > 10,000
├── Database connections: Target < 80%, Alert if > 90%
└── Payment success rate: Target 99%, Alert if < 97%
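
These targets map straight onto instrumentation. A sketch assuming a Prometheus-style client (prometheus_client); the metric names are illustrative:

# observability/metrics.py
# Sketch: counters, histograms, and gauges behind the dashboard and alerts above

from prometheus_client import Counter, Gauge, Histogram

ORDERS_PLACED = Counter("orders_placed_total", "Orders accepted by the Order API")
ORDER_FAILURES = Counter("order_failures_total", "Failed orders by reason", ["reason"])
API_LATENCY = Histogram("order_api_latency_seconds", "Order API latency",
                        buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5))
KAFKA_LAG = Gauge("kafka_consumer_lag", "Consumer lag by topic", ["topic"])


def record_order(latency_seconds: float, failure_reason: str = None):
    API_LATENCY.observe(latency_seconds)
    if failure_reason:
        ORDER_FAILURES.labels(reason=failure_reason).inc()
    else:
        ORDERS_PLACED.inc()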

Monitoring Dashboard

┌─────────────────────────────────────────────────────────────────────────┐
│                    FOOD DELIVERY OPERATIONS                             │
│                                                                         │
│  ORDER HEALTH                                                           │
│  ├── Orders/min:          [████████░░] 3,847                            │
│  ├── Success rate:        [█████████░] 98.2%                            │
│  └── Avg delivery time:   [███████░░░] 32 min                           │
│                                                                         │
│  SYSTEM HEALTH                                                          │
│  ├── API p99 latency:     [███░░░░░░░] 87ms                             │
│  ├── Kafka lag:           [█░░░░░░░░░] 342                              │
│  └── DB connections:      [██████░░░░] 62%                              │
│                                                                         │
│  PAYMENTS                                                               │
│  ├── Auth success:        [█████████░] 99.1%                            │
│  ├── Capture success:     [█████████░] 99.8%                            │
│  └── Pending captures:    127                                           │
│                                                                         │
│  DRIVER FLEET (50,234 active)                                           │
│  ├── Available:           [████████░░] 18,421                           │
│  ├── On delivery:         [███████░░░] 31,813                           │
│  └── Avg orders/driver:   2.3                                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Alerting Strategy

CRITICAL (Page immediately):
- Order success rate < 90%
- Payment failures > 5%
- All drivers in city unavailable
- Database primary down

WARNING (Slack, investigate):
- Order success rate < 95%
- Kafka lag > 10,000
- Driver assignment > 60 sec avg
- Payment auth latency > 2s

INFO (Dashboard only):
- Peak traffic approaching
- New restaurant onboarding
- Scheduled maintenance window

Runbook: High Order Failure Rate

RUNBOOK: Order Success Rate Dropped

SYMPTOMS:
- Order success rate alert triggered
- Customer complaints in support queue

DIAGNOSIS:
1. Check by failure reason:
   SELECT reason, COUNT(*) FROM order_failures 
   WHERE created_at > NOW() - INTERVAL '1 hour'
   GROUP BY reason;

2. Check payment health:
   - Stripe status page
   - Payment auth latency metrics

3. Check restaurant connectivity:
   - WebSocket connection count
   - Restaurant app errors

RESOLUTION:
- If payment issue: Enable backup processor
- If restaurant app: Check for app store update issues
- If driver shortage: Expand driver radius
- If system overload: Enable degraded mode

Interview Conclusion

Interviewer: "Excellent work. You've covered a lot of ground. Let me ask one final question — if you had to build this from scratch, what would you build first?"

You: "I'd prioritize in this order:

  1. Core order flow with transactional outbox — this is the money path
  2. Payment integration with idempotency — can't lose money
  3. Restaurant notification — orders need to get to kitchens
  4. Basic driver assignment — start with simple nearest-driver
  5. Customer notifications — keep users informed

The saga orchestration, real-time tracking, and advanced driver matching can come in phase 2. The key is getting orders reliably from customers to restaurants first."

Interviewer: "Great prioritization. Any questions for me?"

You: "Yes — what's been the biggest operational challenge you've faced at scale? I'm curious what breaks that you don't expect."


Summary: Weeks 1-6 Concepts Applied

Week 1: Data at Scale

Concept         │ Application
────────────────┼───────────────────────────────────────────────
Partitioning    │ Orders sharded by region, drivers by city
Replication     │ PostgreSQL read replicas for order queries
Rate Limiting   │ Multi-layer: user, IP, restaurant, global
Hot Keys        │ Driver locations in Redis GEO

Week 2: Failure-First Design

Concept           │ Application
──────────────────┼─────────────────────────────────────────────
Timeouts          │ Restaurant confirmation timeout (5 min)
Idempotency       │ Payment auth/capture with idempotency keys
Circuit Breakers  │ Payment service circuit breaker
Retry Strategies  │ Exponential backoff for payment capture

Week 3: Messaging and Async

Concept              │ Application
─────────────────────┼──────────────────────────────────────────
Transactional Outbox │ Order creation → Kafka publishing
Consumer Groups      │ Order workers, notification workers
Backpressure         │ System pressure monitoring
Dead Letter Queue    │ Failed payment captures

Week 4: Caching

Concept            │ Application
───────────────────┼────────────────────────────────────────────
Cache-Aside        │ Driver location caching
Write-Behind       │ Location history batching
Cache Invalidation │ TTL-based for tracking cache
Multi-tier         │ Redis for hot, PostgreSQL for cold

Week 5: Consistency and Coordination

Concept            │ Application
───────────────────┼────────────────────────────────────────────
Saga Pattern       │ Order lifecycle orchestration
Compensation       │ Payment void, driver release on failure
Optimistic Locking │ Driver assignment race prevention
Distributed Locks  │ Idempotency key locking

Week 6: Notification Platform

Concept          │ Application
─────────────────┼──────────────────────────────────────────────
Multi-channel    │ Push + SMS for order updates
Priority         │ Critical (payment) vs normal (status)
User Preferences │ Notification opt-in/out
Real-time        │ WebSocket for restaurant orders

Self-Assessment Checklist

After studying this capstone, verify you can:

  • Design a transactional outbox for reliable event publishing
  • Implement idempotency for payment operations
  • Use optimistic locking to prevent race conditions
  • Design a saga with compensation for multi-step workflows
  • Implement rate limiting at multiple layers
  • Handle backpressure during system overload
  • Use Redis GEO for spatial queries
  • Design caching strategies for high-frequency updates
  • Estimate storage and traffic for a large-scale system
  • Create monitoring dashboards and alerting strategies
  • Analyze failure scenarios and recovery procedures
  • Prioritize features for MVP vs later phases

This capstone integrates all concepts from Weeks 1-6 of the System Design Mastery Series. Use this as a template for approaching similar interview problems involving transactional workflows, real-time updates, and distributed coordination.