Week 1-2 Capstone: The Ultimate System Design Interview
A Real-World Problem Covering Everything You've Learned
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions ā this is meant to be collaborative."
They write on the whiteboard:
────────────────────────────────────────────────────────────────────────────

  Design a Global Flash Sale System for "MegaMart"

  MegaMart is launching "Lightning Deals": flash sales where limited
  inventory items are sold at a 90% discount for exactly 10 minutes.

  - 50 million users globally
  - Flash sales happen every hour (random products)
  - Each sale: 10,000 units available
  - Users get 3 minutes to complete checkout after claiming
  - Must handle payment processing and inventory
  - Must notify users via email/push when deals go live

────────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial: never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, when you say 50 million users globally ā is that monthly active users, or do we expect all of them potentially accessing a flash sale?"
Interviewer: "Good question. 50M is MAU. For popular flash sales, we've seen up to 5 million users trying to access a single sale. The product team calls these 'hype drops' ā think iPhone launches or limited edition sneakers."
You: "That's helpful. For the 10,000 units ā do users select quantities, or is it one unit per user?"
Interviewer: "One unit per user per sale. We want to maximize the number of happy customers."
You: "What happens if someone claims an item but doesn't complete checkout within 3 minutes?"
Interviewer: "The item should go back to available inventory for others."
You: "For payment processing ā do you have an existing payment provider, or should I design that?"
Interviewer: "Assume we use Stripe. Treat it as an external API that might be slow or fail."
You: "Last question ā the notification when deals go live. Are users subscribed to specific products, or do we notify everyone?"
Interviewer: "Users can 'watch' products. When a flash sale includes a watched product, they should be notified. But we also have a general 'deals' notification channel for users who opted in. Could be millions of notifications per sale."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. FLASH SALE MANAGEMENT
- Create flash sales with specific products and inventory
- Sales start at scheduled times (hourly)
- Sales last exactly 10 minutes
- 10,000 units per sale
2. INVENTORY CLAIMING
- Users can claim one item per sale
- Claimed items reserved for 3 minutes
- Unclaimed items return to pool after timeout
- No overselling (exactly 10,000 successful orders max)
3. CHECKOUT FLOW
- Complete payment within 3 minutes of claim
- Process payment via Stripe
- Handle payment failures gracefully
- Create order on successful payment
4. NOTIFICATIONS
- Notify watchers when watched product goes on sale
- Notify deal subscribers when any sale starts
- Support email and push notifications
- Millions of notifications per sale
Non-Functional Requirements
1. SCALE
- Handle 5M concurrent users hitting a single sale
- Process 10K checkouts in 3-minute window
- Send millions of notifications within seconds
2. RELIABILITY
- No overselling (inventory consistency is critical)
- No double-charging (payment idempotency)
- No lost orders (durability)
3. AVAILABILITY
- System must work during sale windows
- Graceful degradation if components fail
4. LATENCY
- Claim response: <500ms at p99
- Checkout: <3s at p99
- Notification delivery: <30s for email, <5s for push
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Traffic Estimation
FLASH SALE TRAFFIC (per sale)
Users trying to access: 5,000,000
Sale duration: 10 minutes = 600 seconds
Average request rate: 5M / 600 = 8,333 requests/second
But traffic isn't uniform. The first 30 seconds see 80% of traffic:
Peak traffic: (5M × 0.8) / 30 = 133,333 requests/second
Let's round up for safety: 150,000 requests/second peak
Breakdown by operation:
├── Page views:     100,000 /sec  (viewing the sale)
├── Claim attempts:  40,000 /sec  (trying to claim)
├── Stock checks:    10,000 /sec  (AJAX refreshes)
└── Checkouts:           50 /sec  (successful claimers)
Inventory Operations
INVENTORY MATH
Total inventory: 10,000 units
Claim timeout: 3 minutes
Successful checkout rate: ~70% (estimate)
If all 10K claimed in first 30 seconds:
├── 7,000 complete checkout
├── 3,000 time out → return to pool
└── Pool refills trigger a second wave of claims
Maximum claim operations: ~15,000 (accounting for timeouts)
Checkout operations: ~10,000 (until inventory exhausted)
Claims per second (peak): 10,000 / 30 = 333 claims/second
Checkouts per second (peak): 10,000 / 180 = 56 checkouts/second
Notification Estimation
NOTIFICATION VOLUME
Product watchers: 500,000 (popular product)
Deal subscribers: 2,000,000
Overlap (watching + subscribed): ~100,000
Total notifications: 2,400,000
Delivery target: 30 seconds for all
Notification rate: 2.4M / 30 = 80,000 notifications/second
Push notification payload: ~500 bytes
Email payload: ~5 KB
Bandwidth:
├── Push: 80,000 × 500 bytes = 40 MB/sec
└── Email: 30,000 × 5 KB = 150 MB/sec
Storage Estimation
STORAGE REQUIREMENTS
Per sale:
├── Sale metadata: ~1 KB
├── Inventory records: 10,000 × 100 bytes = 1 MB
├── Claim records: 15,000 × 200 bytes = 3 MB
├── Order records: 10,000 × 500 bytes = 5 MB
└── Total per sale: ~10 MB
Sales per day: 24
Storage per day: ~240 MB
Storage per year: ~90 GB
Notification logs:
├── Per notification: 200 bytes
├── Per sale: 2.4M × 200 bytes = 480 MB
└── Per year: ~4 TB
Hot data (active sales): ~100 MB
Warm data (past week): ~2 GB
Cold data (archive): Compress and store in S3
Infrastructure Estimation
SERVER REQUIREMENTS
API Servers (150K req/sec):
├── Per-server capacity: 5,000 req/sec
├── Servers needed: 30 servers
└── With 2x headroom: 60 servers

Redis (inventory + claims):
├── Operations/sec: 50,000 (reads + writes)
├── Memory for hot data: ~1 GB per sale
└── Cluster: 3 primaries + 3 replicas

PostgreSQL (orders, users):
├── Write TPS: 500 (orders only)
├── Read TPS: 5,000
└── Single primary + 2 read replicas

Message Queue (notifications):
├── Messages/sec: 100,000
├── Kafka cluster: 6 brokers, 3 partitions
└── Consumer groups: Push, Email, Analytics
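These back-of-the-envelope numbers are easy to sanity-check in code. Below is a small Python sketch that simply restates the assumptions above as constants and reproduces the headline figures; it is illustrative, not part of the design.

# Sanity check of the estimates above; constants restate the stated assumptions.
USERS_PER_SALE   = 5_000_000
PEAK_FRACTION    = 0.8          # share of traffic in the first 30 seconds
PEAK_WINDOW_S    = 30
INVENTORY        = 10_000
CLAIM_TTL_S      = 180
NOTIFICATIONS    = 2_400_000
NOTIFY_SLA_S     = 30
PER_SERVER_RPS   = 5_000
ROUNDED_PEAK_RPS = 150_000      # 133K rounded up for safety

peak_rps        = USERS_PER_SALE * PEAK_FRACTION / PEAK_WINDOW_S   # ~133,333
claims_per_s    = INVENTORY / PEAK_WINDOW_S                        # ~333
checkouts_per_s = INVENTORY / CLAIM_TTL_S                          # ~56
notif_per_s     = NOTIFICATIONS / NOTIFY_SLA_S                     # 80,000
api_servers     = ROUNDED_PEAK_RPS / PER_SERVER_RPS                # 30, or 60 with 2x headroom

print(f"peak: {peak_rps:,.0f} req/s, claims: {claims_per_s:.0f}/s, "
      f"checkouts: {checkouts_per_s:.0f}/s, notifications: {notif_per_s:,.0f}/s, "
      f"API servers with 2x headroom: {int(api_servers) * 2}")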
Interviewer: "Good analysis. Those numbers seem reasonable. What concerns you most about the scale?"
You: "Three things concern me:
-
The 150K requests/second peak ā this is a thundering herd problem. We need aggressive caching and potentially a queue-based claim system.
-
The inventory consistency ā with 40K claim attempts per second on 10K items, we need atomic operations. Race conditions could cause overselling.
-
The notification delivery ā 80K notifications/second is achievable, but we need to pre-compute recipient lists before the sale starts, not at sale time."
Phase 3: High-Level Design (10 minutes)
You: "Let me sketch out the high-level architecture."
──────────────────────────────────────────────────────────────────────────────
                       MEGAMART FLASH SALE ARCHITECTURE
──────────────────────────────────────────────────────────────────────────────

  USERS (5M)
      │
      ▼
  CDN + WAF + Load Balancer (CloudFront + Shield + ALB)
      │
      ▼
  API LAYER
  ├── Sale Page Service     (read-heavy,  60 servers)
  ├── Claim API Service     (write-heavy, 20 servers)
  └── Checkout API Service  (critical,    20 servers)
      │
      ▼
  REDIS CLUSTER
  ├── Inventory Counter
  ├── Claims Store
  └── Rate Limit Cache
      │
      ▼
  DATA AND EXTERNAL SERVICES
  ├── PostgreSQL  (Orders DB, primary + 2 read replicas)
  ├── Kafka       (events / notifications, 6 brokers)
  └── Stripe      (payments, external API)
      │
      ▼
  KAFKA CONSUMERS
  ├── Push Workers
  ├── Email Workers
  └── Webhook Delivery

  SUPPORTING SERVICES
  ├── Sale Scheduler (cron)
  ├── Inventory Manager
  ├── Claim Expiry
  └── Monitoring (Grafana)
──────────────────────────────────────────────────────────────────────────────
Component Overview
You: "Let me walk through each component and its role:"
Traffic Layer
CDN + WAF + Load Balancer
├── CloudFront: Cache static assets (sale page HTML, JS, CSS)
├── AWS Shield: DDoS protection (critical during sale spikes)
├── WAF: Rate limiting, bot detection
└── ALB: Route to the appropriate service based on path
Benefits:
- CDN handles 80% of traffic (static content)
- WAF blocks abusive traffic before hitting servers
- Geographic distribution reduces latency
API Services
Sale Page Service (Read-Heavy)
├── Renders sale page with current inventory count
├── Heavy caching (1-second TTL on inventory count)
├── Stateless, horizontally scalable
└── Circuit breaker to Redis

Claim API Service (Write-Heavy)
├── Handles claim requests
├── Atomic inventory operations
├── Idempotent (claim key per user per sale)
└── Rate limited per user

Checkout API Service (Critical)
├── Processes payments through Stripe
├── Creates orders
├── Idempotent (prevents double-charge)
├── Timeout-aware (3-minute claim expiry)
└── Circuit breaker to Stripe
Data Layer
Redis Cluster
├── Inventory Counter: Atomic decrement with Lua scripts
├── Claims Store: user_id → claim_info with TTL
├── Rate Limit: Sliding window counters
└── Session Cache: User session data

PostgreSQL
├── Primary: Writes (orders, users)
├── Read Replicas: Read queries
├── Orders partitioned by date
└── Indexed on user_id, sale_id, order_status
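To make the partitioning and indexing above concrete, here is a sketch of what the orders schema could look like, run through asyncpg. Column names, index names, and the connection string are illustrative assumptions, not a confirmed schema.

# Illustrative orders schema: date-partitioned, indexed as described above.
import asyncio
import asyncpg

DDL = """
CREATE TABLE IF NOT EXISTS orders (
    id           TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    sale_id      TEXT NOT NULL,
    claim_id     TEXT NOT NULL,
    payment_id   TEXT NOT NULL,
    amount       BIGINT NOT NULL,
    order_status TEXT NOT NULL DEFAULT 'confirmed',
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- One partition per day; a maintenance job would create these ahead of time.
CREATE TABLE IF NOT EXISTS orders_2024_01_15
    PARTITION OF orders
    FOR VALUES FROM ('2024-01-15') TO ('2024-01-16');

CREATE INDEX IF NOT EXISTS idx_orders_user_id ON orders (user_id);
CREATE INDEX IF NOT EXISTS idx_orders_sale_id ON orders (sale_id);
CREATE INDEX IF NOT EXISTS idx_orders_status  ON orders (order_status);
"""

async def create_schema(dsn: str) -> None:
    conn = await asyncpg.connect(dsn)   # e.g. an assumed postgres://.../megamart DSN
    try:
        await conn.execute(DDL)
    finally:
        await conn.close()

# asyncio.run(create_schema("postgres://localhost/megamart"))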
Event Processing
Kafka
├── Topic: sale_events (sale start, sale end, inventory updates)
├── Topic: claim_events (claimed, expired, checkout_started)
├── Topic: notification_events (to_send, sent, failed)
└── Topic: order_events (created, paid, failed)

Consumer Groups:
├── NotificationWorkers: Send push/email
├── AnalyticsWorkers: Real-time dashboards
├── WebhookWorkers: Notify external systems
└── ClaimExpiryWorkers: Return expired claims to pool
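To make the consumer-group idea concrete, here is a minimal sketch of an analytics consumer on the claim_events topic using aiokafka. The topic and group names follow the design above; the broker address and handler body are assumptions.

# Minimal consumer-group sketch for claim_events, assuming aiokafka is available.
import asyncio
import json
from aiokafka import AIOKafkaConsumer

async def run_analytics_worker() -> None:
    consumer = AIOKafkaConsumer(
        "claim_events",
        bootstrap_servers="kafka:9092",   # assumed broker address
        group_id="analytics-workers",     # one consumer group per worker type
    )
    await consumer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            if event.get("event_type") == "claimed":
                # e.g. update a real-time dashboard counter here
                print(event["sale_id"], event["remaining_inventory"])
    finally:
        await consumer.stop()

# asyncio.run(run_analytics_worker())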
Interviewer: "This looks comprehensive. I'm curious about a few things. First, how do you prevent overselling? Walk me through the claim flow."
Phase 4: Deep Dive - Inventory and Claims (10 minutes)
You: "Great question. This is the most critical part of the system. Let me detail the claim flow."
The Claim Flow
USER CLICKS "CLAIM DEAL"
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 1: RATE LIMITING ā
ā ā
ā Check: Has user exceeded 10 claim attempts in last minute? ā
ā Implementation: Redis sliding window ā
ā ā
ā INCRBY rate_limit:{user_id}:{minute} 1 ā
ā EXPIRE rate_limit:{user_id}:{minute} 60 ā
ā ā
ā If count > 10: Return 429 Too Many Requests ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 2: IDEMPOTENCY CHECK ā
ā ā
ā Check: Has user already claimed in this sale? ā
ā Key: claim:{sale_id}:{user_id} ā
ā ā
ā If exists: Return existing claim (idempotent response) ā
ā ā
ā This prevents double-claiming and handles retries safely. ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 3: ATOMIC CLAIM (Lua Script in Redis) ā
ā ā
ā -- Lua script for atomic claim ā
ā local inventory_key = KEYS[1] -- inventory:{sale_id} ā
ā local claim_key = KEYS[2] -- claim:{sale_id}:{user_id} ā
ā local user_id = ARGV[1] ā
ā local claim_id = ARGV[2] ā
ā local ttl = ARGV[3] -- 180 seconds ā
ā ā
ā -- Check if already claimed ā
ā if redis.call('EXISTS', claim_key) == 1 then ā
ā return redis.call('GET', claim_key) -- Return existing claim ā
ā end ā
ā ā
ā -- Try to decrement inventory ā
ā local remaining = redis.call('DECR', inventory_key) ā
ā ā
ā if remaining < 0 then ā
ā -- No inventory, restore counter ā
ā redis.call('INCR', inventory_key) ā
ā return nil -- Sold out ā
ā end ā
ā ā
ā -- Success! Store claim with TTL ā
ā local claim_data = cjson.encode({ ā
ā claim_id = claim_id, ā
ā user_id = user_id, ā
ā claimed_at = redis.call('TIME')[1], ā
ā expires_at = redis.call('TIME')[1] + ttl ā
ā }) ā
ā ā
ā redis.call('SET', claim_key, claim_data, 'EX', ttl) ā
ā ā
ā -- Add to expiry tracking set ā
ā redis.call('ZADD', 'claims:expiry:' .. KEYS[1], ā
ā redis.call('TIME')[1] + ttl, claim_key) ā
ā ā
ā return claim_data ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 4: PUBLISH EVENT ā
ā ā
ā Kafka: claim_events ā
ā { ā
ā "event_type": "claimed", ā
ā "claim_id": "clm_abc123", ā
ā "sale_id": "sale_xyz", ā
ā "user_id": "usr_456", ā
ā "expires_at": "2024-01-15T10:03:00Z", ā
ā "remaining_inventory": 9523 ā
ā } ā
ā ā
ā Consumers: ā
ā - AnalyticsWorker: Update real-time dashboard ā
ā - WebsocketBroadcaster: Push inventory count to browsers ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 5: RETURN RESPONSE ā
ā ā
ā { ā
ā "status": "claimed", ā
ā "claim_id": "clm_abc123", ā
ā "expires_at": "2024-01-15T10:03:00Z", ā
ā "seconds_remaining": 180, ā
ā "checkout_url": "/checkout/clm_abc123" ā
ā } ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Why This Works
You: "This design prevents overselling through several mechanisms:"
OVERSELLING PREVENTION
1. ATOMIC OPERATIONS
- The Lua script runs atomically in Redis
- No race condition between check and decrement
- DECR is atomic, returns new value immediately
2. IDEMPOTENCY
- Checking existing claim BEFORE decrementing
- Same user retrying gets same response
- No double-claiming even under network issues
3. BOUNDED INVENTORY
- Counter starts at exactly 10,000
- Can only decrement, never go negative (we restore on <0)
- TTL ensures claims expire automatically
4. NO DATABASE IN HOT PATH
- Redis handles all claim logic
- PostgreSQL only involved at checkout (much lower rate)
- Removes database as bottleneck
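To show how the Claim API service might drive the atomic Lua script above, here is a minimal sketch using redis.asyncio. The key layout matches the script (inventory, claim, and expiry keys); the handler shape and helper names (ClaimService, CLAIM_LUA) are illustrative assumptions.

# Sketch of the Claim API hot path, assuming redis.asyncio and the Lua script above.
import json
import uuid
from typing import Optional
from redis.asyncio import Redis

CLAIM_TTL_SECONDS = 180
CLAIM_LUA = "..."   # the atomic claim script shown in Step 3

class ClaimService:
    def __init__(self, redis: Redis):
        self.redis = redis
        # register_script caches the script and EVALSHAs it on each call
        self.claim_script = redis.register_script(CLAIM_LUA)

    async def claim(self, sale_id: str, user_id: str) -> Optional[dict]:
        claim_id = f"clm_{uuid.uuid4().hex[:12]}"
        result = await self.claim_script(
            keys=[
                f"inventory:{sale_id}",
                f"claim:{sale_id}:{user_id}",
                f"claims:expiry:{sale_id}",
            ],
            args=[user_id, claim_id, CLAIM_TTL_SECONDS],
        )
        if result is None:
            return None               # sold out: the API layer returns an error to the client
        return json.loads(result)     # existing or newly created claim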
Claim Expiry Handling
You: "When a claim expires without checkout, we need to return inventory:"
import asyncio
import time

class ClaimExpiryWorker:
"""
Background worker that returns expired claims to inventory.
Runs continuously, processing expired claims.
"""
def __init__(self, redis_client):
self.redis = redis_client
async def run(self):
"""Main worker loop."""
while True:
await self.process_expired_claims()
await asyncio.sleep(1) # Check every second
async def process_expired_claims(self):
"""Find and process expired claims."""
now = int(time.time())
# Get all active sales
active_sales = await self.redis.smembers('active_sales')
for sale_id in active_sales:
expiry_key = f'claims:expiry:{sale_id}'
inventory_key = f'inventory:{sale_id}'
# Get claims that have expired
expired = await self.redis.zrangebyscore(
expiry_key,
min=0,
max=now
)
for claim_key in expired:
# Atomically return inventory and remove claim
await self.return_claim_to_inventory(
sale_id, claim_key, inventory_key, expiry_key
)
async def return_claim_to_inventory(
self, sale_id, claim_key, inventory_key, expiry_key
):
"""Return a single expired claim to inventory."""
# Lua script for atomic return
script = """
local claim_key = KEYS[1]
local inventory_key = KEYS[2]
local expiry_key = KEYS[3]
-- Check if claim still exists (might have been checked out)
if redis.call('EXISTS', claim_key) == 0 then
-- Claim was checked out, just remove from expiry set
redis.call('ZREM', expiry_key, claim_key)
return 0
end
-- Claim exists and expired, return to inventory
redis.call('DEL', claim_key)
redis.call('INCR', inventory_key)
redis.call('ZREM', expiry_key, claim_key)
return 1
"""
returned = await self.redis.eval(
script,
keys=[claim_key, inventory_key, expiry_key]
)
if returned:
# Publish event
await self.publish_claim_expired(sale_id, claim_key)
Interviewer: "Nice. What about the checkout flow? How do you handle payment failures and ensure orders aren't double-charged?"
Phase 5: Deep Dive - Checkout Flow (10 minutes)
You: "The checkout flow is where everything comes together ā timeouts, idempotency, circuit breakers. Let me walk through it."
Checkout Architecture
USER CLICKS "COMPLETE PURCHASE"
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 1: VALIDATE CLAIM ā
ā ā
ā Check claim exists and hasn't expired: ā
ā ā
ā claim_data = redis.get(f"claim:{sale_id}:{user_id}") ā
ā ā
ā if not claim_data: ā
ā return 410 Gone "Your claim has expired" ā
ā ā
ā if claim_data.checkout_started: ā
ā return 409 Conflict "Checkout already in progress" ā
ā ā
ā # Mark checkout started (prevent concurrent checkouts) ā
ā redis.hset(claim_key, "checkout_started", timestamp) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 2: IDEMPOTENCY CHECK ā
ā ā
ā Check if this checkout was already processed: ā
ā ā
ā Key: idempotency:{claim_id} ā
ā ā
ā existing = await idempotency_store.get(claim_id) ā
ā if existing: ā
ā if existing.status == "completed": ā
ā return existing.response # Return same successful response ā
ā if existing.status == "processing": ā
ā return 409 "Payment in progress" ā
ā if existing.status == "failed": ā
ā # Allow retry ā
ā pass ā
ā ā
ā # Record processing started ā
ā await idempotency_store.set(claim_id, { ā
ā status: "processing", ā
ā started_at: now ā
ā }, ttl=3600) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 3: PROCESS PAYMENT (with timeout + circuit breaker) ā
ā ā
ā # Check circuit breaker first (Day 3 pattern) ā
ā if stripe_circuit.is_open: ā
ā await idempotency_store.set(claim_id, {status: "failed"}) ā
ā return 503 "Payment service temporarily unavailable" ā
ā ā
ā # Call Stripe with timeout (Day 1 pattern) ā
ā try: ā
ā payment = await asyncio.wait_for( ā
ā stripe.charges.create( ā
ā amount=sale.discounted_price, ā
ā currency="usd", ā
ā customer=user.stripe_customer_id, ā
ā idempotency_key=f"claim_{claim_id}", # Day 2 pattern ā
ā metadata={"sale_id": sale_id, "claim_id": claim_id} ā
ā ), ā
ā timeout=10.0 # 10 second timeout ā
ā ) ā
ā stripe_circuit.record_success() ā
ā ā
ā except asyncio.TimeoutError: ā
ā stripe_circuit.record_failure() ā
ā # Don't fail yet - payment might have succeeded! ā
ā payment = await verify_payment_status(claim_id) ā
ā if not payment: ā
ā await idempotency_store.set(claim_id, {status: "failed"}) ā
ā return 504 "Payment timeout - please try again" ā
ā ā
ā except StripeError as e: ā
ā stripe_circuit.record_failure() ā
ā await idempotency_store.set(claim_id, {status: "failed", error: e}) ā
ā return 400 f"Payment failed: {e.message}" ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 4: CREATE ORDER (transactionally) ā
ā ā
ā async with database.transaction(): ā
ā # Create order record ā
ā order = await Order.create( ā
ā id=generate_order_id(), ā
ā user_id=user_id, ā
ā sale_id=sale_id, ā
ā claim_id=claim_id, ā
ā amount=payment.amount, ā
ā payment_id=payment.id, ā
ā status="confirmed" ā
ā ) ā
ā ā
ā # Record in idempotency store ā
ā await idempotency_store.set(claim_id, { ā
ā status: "completed", ā
ā order_id: order.id, ā
ā response: {order_id: order.id, status: "confirmed"} ā
ā }, ttl=86400 * 7) # Keep for 7 days ā
ā ā
ā # Delete claim (inventory already decremented) ā
ā await redis.delete(f"claim:{sale_id}:{user_id}") ā
ā await redis.zrem(f"claims:expiry:{sale_id}", ā
ā f"claim:{sale_id}:{user_id}") ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 5: PUBLISH EVENTS (async, non-blocking) ā
ā ā
ā # Publish to Kafka - don't wait for confirmation ā
ā await kafka.send_async("order_events", { ā
ā "event_type": "order.created", ā
ā "order_id": order.id, ā
ā "user_id": user_id, ā
ā "sale_id": sale_id, ā
ā "amount": payment.amount ā
ā }) ā
ā ā
ā # Queue confirmation email (Day 4 - webhook pattern) ā
ā await notification_queue.send({ ā
ā "type": "email", ā
ā "template": "order_confirmation", ā
ā "user_id": user_id, ā
ā "order_id": order.id ā
ā }) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 6: RETURN SUCCESS ā
ā ā
ā { ā
ā "status": "confirmed", ā
ā "order_id": "ord_xyz789", ā
ā "message": "Congratulations! Your order is confirmed.", ā
ā "receipt_url": "/orders/ord_xyz789/receipt" ā
ā } ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
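The flow above leans on an idempotency_store with simple get/set semantics. A minimal Redis-backed sketch is shown below; the key prefix and JSON encoding are assumptions. A production version would likely use SET NX for the initial "processing" record so two concurrent checkouts cannot both claim it.

# Minimal Redis-backed idempotency store, as assumed by the checkout flow above.
import json
from typing import Optional
from redis.asyncio import Redis

class IdempotencyStore:
    def __init__(self, redis: Redis, prefix: str = "idempotency"):
        self.redis = redis
        self.prefix = prefix

    def _key(self, claim_id: str) -> str:
        return f"{self.prefix}:{claim_id}"

    async def get(self, claim_id: str) -> Optional[dict]:
        raw = await self.redis.get(self._key(claim_id))
        return json.loads(raw) if raw else None

    async def set(self, claim_id: str, record: dict, ttl: int = 3600) -> None:
        # Overwrites any previous record; callers move it from
        # "processing" to "completed"/"failed" as the checkout progresses.
        await self.redis.set(self._key(claim_id), json.dumps(record), ex=ttl)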
Complete Checkout Code
import asyncio
import time
from datetime import datetime
from typing import Optional

class CheckoutService:
"""
Checkout service implementing all Week 1-2 patterns:
- Timeouts (Day 1)
- Idempotency (Day 2)
- Circuit Breakers (Day 3)
- Async Events (Day 4)
"""
def __init__(
self,
redis: Redis,
db: Database,
stripe_client: Stripe,
kafka: KafkaProducer,
idempotency_store: IdempotencyStore,
circuit_breaker: CircuitBreaker
):
self.redis = redis
self.db = db
self.stripe = stripe_client
self.kafka = kafka
self.idempotency = idempotency_store
self.circuit = circuit_breaker
# Timeout configuration (Day 1)
self.payment_timeout = 10.0
self.db_timeout = 5.0
self.total_timeout = 25.0 # Budget for entire checkout
async def checkout(
self,
user_id: str,
sale_id: str,
claim_id: str
) -> CheckoutResult:
"""
Process checkout with full reliability guarantees.
"""
# Start timeout budget (Day 1)
deadline = time.time() + self.total_timeout
# Step 1: Validate claim
claim = await self._validate_claim(user_id, sale_id, claim_id)
if not claim:
return CheckoutResult(
success=False,
error="Claim expired or invalid"
)
# Step 2: Idempotency check (Day 2)
existing = await self.idempotency.get(claim_id)
if existing:
if existing['status'] == 'completed':
return CheckoutResult(
success=True,
order_id=existing['order_id'],
idempotent=True
)
elif existing['status'] == 'processing':
return CheckoutResult(
success=False,
error="Checkout already in progress"
)
# Mark as processing
await self.idempotency.set(claim_id, {
'status': 'processing',
'started_at': time.time()
})
try:
# Step 3: Process payment (Day 1 timeout + Day 3 circuit breaker)
payment = await self._process_payment(
user_id, sale_id, claim_id, deadline
)
# Step 4: Create order (with remaining timeout budget)
remaining_time = deadline - time.time()
if remaining_time <= 0:
raise TimeoutError("Checkout timeout exceeded")
order = await asyncio.wait_for(
self._create_order(user_id, sale_id, claim_id, payment),
timeout=remaining_time
)
# Step 5: Finalize idempotency record
await self.idempotency.set(claim_id, {
'status': 'completed',
'order_id': order.id,
'completed_at': time.time()
}, ttl=86400 * 7)
# Step 6: Publish events (async, non-blocking) (Day 4)
asyncio.create_task(self._publish_order_events(order))
return CheckoutResult(
success=True,
order_id=order.id
)
except PaymentError as e:
await self.idempotency.set(claim_id, {
'status': 'failed',
'error': str(e)
})
return CheckoutResult(success=False, error=str(e))
except TimeoutError:
# Don't mark as failed - payment might have succeeded
# Let user retry, idempotency will handle it
return CheckoutResult(
success=False,
error="Request timeout - please try again"
)
async def _process_payment(
self,
user_id: str,
sale_id: str,
claim_id: str,
deadline: float
) -> PaymentResult:
"""Process payment with circuit breaker and timeout."""
# Check circuit breaker (Day 3)
if self.circuit.is_open():
raise PaymentError("Payment service temporarily unavailable")
# Calculate remaining timeout
remaining = deadline - time.time()
timeout = min(self.payment_timeout, remaining)
if timeout <= 0:
raise TimeoutError("No time remaining for payment")
try:
# Call Stripe with idempotency key (Day 2)
payment = await asyncio.wait_for(
self.stripe.charges.create(
amount=await self._get_sale_price(sale_id),
currency='usd',
customer=await self._get_stripe_customer(user_id),
idempotency_key=f"checkout_{claim_id}",
metadata={
'sale_id': sale_id,
'claim_id': claim_id,
'user_id': user_id
}
),
timeout=timeout
)
self.circuit.record_success()
return payment
except asyncio.TimeoutError:
self.circuit.record_failure()
# Payment might have succeeded - verify
payment = await self._verify_payment(claim_id)
if payment:
return payment
raise TimeoutError("Payment timeout")
except StripeError as e:
self.circuit.record_failure()
raise PaymentError(str(e))
async def _verify_payment(self, claim_id: str) -> Optional[PaymentResult]:
"""Verify if payment exists in Stripe (for timeout recovery)."""
try:
payments = await self.stripe.charges.list(
limit=1,
metadata={'claim_id': claim_id}
)
if payments.data:
return payments.data[0]
return None
        except Exception:
return None
async def _create_order(
self,
user_id: str,
sale_id: str,
claim_id: str,
payment: PaymentResult
) -> Order:
"""Create order in database."""
async with self.db.transaction():
order = await self.db.orders.create(
id=generate_order_id(),
user_id=user_id,
sale_id=sale_id,
claim_id=claim_id,
payment_id=payment.id,
amount=payment.amount,
status='confirmed',
created_at=datetime.utcnow()
)
# Clean up claim
await self.redis.delete(f"claim:{sale_id}:{user_id}")
return order
async def _publish_order_events(self, order: Order):
"""Publish events for order (non-blocking)."""
# Order created event
await self.kafka.send('order_events', {
'event_type': 'order.created',
'order_id': order.id,
'user_id': order.user_id,
'sale_id': order.sale_id,
'amount': order.amount,
'timestamp': datetime.utcnow().isoformat()
})
# Queue confirmation email
await self.kafka.send('notification_events', {
'type': 'email',
'template': 'order_confirmation',
'recipient_id': order.user_id,
'data': {
'order_id': order.id,
'amount': order.amount
}
})
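CheckoutService assumes a CircuitBreaker exposing is_open(), record_success(), and record_failure(). A minimal count-based sketch is below; the thresholds and the half-open behavior are illustrative assumptions, not the definitive implementation.

# Minimal circuit breaker matching the interface used by CheckoutService above.
import time
from typing import Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout      # seconds to stay open before probing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.reset_timeout:
            return False        # half-open: let one probe request through
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()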
Interviewer: "I like how you've integrated all the patterns. Now, let's talk about the notification system. You mentioned millions of notifications when a sale starts. How do you handle that?"
Phase 6: Deep Dive - Notification System (5 minutes)
You: "The notification system is where the webhook patterns from Day 4 come in. Let me explain."
Pre-computation Strategy
THE PROBLEM:
Sale starts at 10:00:00
Need to notify 2.4M users
Users expect to know within seconds
Computing recipients at 10:00:00 = disaster
THE SOLUTION:
Pre-compute recipient list before sale starts
Store in ready-to-send format
At 10:00:00, just fan out to workers
Notification Architecture
───────────────────────────────────────────────────────────────────────────
                          NOTIFICATION PIPELINE
───────────────────────────────────────────────────────────────────────────

T-10 MINUTES (pre-computation)
──────────────────────────────
    SELECT user_id, email, push_token, notification_preferences
    FROM users
    WHERE (user_id IN (SELECT user_id FROM product_watchers
                       WHERE product_id = 'xyz')
           OR subscribed_to_deals = true)
      AND notification_enabled = true

    Result: 2.4M rows
    Store in: Redis lists, partitioned by user_id hash

    notification_batch:{sale_id}:0  = [users 0-100k]
    notification_batch:{sale_id}:1  = [users 100k-200k]
    ...
    notification_batch:{sale_id}:23 = [users 2.3M-2.4M]

T=0 (sale starts)
─────────────────
    Scheduler publishes to Kafka:

    Topic: notification_triggers
    {
      "sale_id": "xyz",
      "batch_count": 24,
      "trigger_time": "2024-01-15T10:00:00Z"
    }

Workers process batches (in parallel)
─────────────────────────────────────
    24 workers, each handles 100k users

    Worker 0:
      - Read notification_batch:{sale_id}:0 from Redis
      - For each user:
          - Check preferences (email? push? both?)
          - Queue to the appropriate sender

    All workers run simultaneously = 2.4M queued in ~10 seconds

Sender pools (final delivery)
─────────────────────────────
    Push Notification Pool (20 workers)
    ├── Firebase Cloud Messaging for Android
    ├── APNs for iOS
    └── Rate: ~50,000/second total

    Email Pool (10 workers)
    ├── SendGrid / SES
    ├── Rate: ~10,000/second total
    └── Lower priority (30-second SLA, not 5-second)
───────────────────────────────────────────────────────────────────────────
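The scheduler later calls notifications.prepare_recipients(sale); here is a sketch of that pre-computation step, chunking recipients into the notification_batch:{sale_id}:{n} lists shown above. The batch size, query shape, and db.query helper (returning mapping-like rows, as used elsewhere in this design) are assumptions.

# Sketch of recipient pre-computation into Redis batches, as described above.
import json

BATCH_SIZE = 100_000

class NotificationService:
    def __init__(self, db, redis):
        self.db = db
        self.redis = redis

    async def prepare_recipients(self, sale) -> int:
        rows = await self.db.query("""
            SELECT u.id, u.email, u.push_token, u.push_enabled, u.email_enabled
            FROM users u
            WHERE (u.id IN (SELECT user_id FROM product_watchers
                            WHERE product_id = $1)
                   OR u.subscribed_to_deals = true)
              AND u.notification_enabled = true
        """, sale.product_id)

        batch_count = 0
        for start in range(0, len(rows), BATCH_SIZE):
            key = f"notification_batch:{sale.id}:{batch_count}"
            payloads = [json.dumps(dict(r)) for r in rows[start:start + BATCH_SIZE]]
            await self.redis.rpush(key, *payloads)
            await self.redis.expire(key, 3600)   # batches are only needed around sale time
            batch_count += 1

        await self.redis.set(f"notification_batches:{sale.id}", batch_count)
        return batch_count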
Notification Worker Implementation
import asyncio
import json
import logging
from typing import Dict

logger = logging.getLogger(__name__)

class NotificationWorker:
"""
Notification worker implementing Day 4 patterns:
- At-least-once delivery
- Retry with backoff
- Dead letter queue
- Circuit breaker for external services
"""
    def __init__(
        self,
        redis: Redis,
        kafka_consumer: KafkaConsumer,
        kafka_producer: KafkaProducer,
        push_service: PushNotificationService,
        email_service: EmailService,
        circuit_breakers: Dict[str, CircuitBreaker]
    ):
        self.redis = redis
        self.consumer = kafka_consumer
        self.kafka = kafka_producer
        self.push = push_service
        self.email = email_service
        self.circuits = circuit_breakers
async def process_batch(self, trigger: NotificationTrigger):
"""Process a notification batch for a sale."""
batch_key = f"notification_batch:{trigger.sale_id}:{trigger.batch_id}"
# Get users from pre-computed batch
users = await self.redis.lrange(batch_key, 0, -1)
for user_data in users:
user = json.loads(user_data)
try:
await self._send_notification(user, trigger)
except Exception as e:
# Log and continue - don't let one failure stop batch
logger.error(
"Notification failed",
user_id=user['id'],
error=str(e)
)
async def _send_notification(self, user: dict, trigger: NotificationTrigger):
"""Send notification to a single user."""
# Idempotency check (Day 2)
idem_key = f"notif_sent:{trigger.sale_id}:{user['id']}"
if await self.redis.exists(idem_key):
return # Already sent
# Send based on preferences
if user.get('push_enabled') and user.get('push_token'):
await self._send_push(user, trigger)
if user.get('email_enabled') and user.get('email'):
await self._queue_email(user, trigger)
# Mark as sent (idempotency)
await self.redis.set(idem_key, '1', ex=3600) # 1 hour TTL
async def _send_push(self, user: dict, trigger: NotificationTrigger):
"""Send push notification with circuit breaker."""
# Check circuit breaker (Day 3)
provider = 'fcm' if user['platform'] == 'android' else 'apns'
if self.circuits[provider].is_open():
# Queue for retry later
await self._queue_for_retry(user, trigger, 'push')
return
try:
await asyncio.wait_for(
self.push.send(
token=user['push_token'],
                    title="⚡ Lightning Deal Live!",
body=f"The {trigger.product_name} deal is live! Tap to claim.",
data={'sale_id': trigger.sale_id}
),
timeout=2.0 # Day 1: Timeout
)
self.circuits[provider].record_success()
except asyncio.TimeoutError:
self.circuits[provider].record_failure()
await self._queue_for_retry(user, trigger, 'push')
except Exception as e:
self.circuits[provider].record_failure()
logger.error("Push failed", error=str(e))
async def _queue_email(self, user: dict, trigger: NotificationTrigger):
"""Queue email for async sending (Day 4 pattern)."""
await self.kafka.send('email_queue', {
'recipient': user['email'],
'template': 'flash_sale_live',
'data': {
'user_name': user['name'],
'product_name': trigger.product_name,
'sale_url': f"https://megamart.com/sale/{trigger.sale_id}"
},
'idempotency_key': f"sale_email:{trigger.sale_id}:{user['id']}"
})
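_queue_for_retry is referenced above but not shown. One way to implement it is a delayed-retry sorted set that a separate drain loop pops when entries become due; the key names and backoff schedule below are assumptions.

# Sketch of the retry queue used by NotificationWorker above (names are assumptions).
import json
import time

RETRY_DELAYS = [5, 30, 120]   # seconds of backoff per attempt

class RetryQueueMixin:
    async def _queue_for_retry(self, user: dict, trigger, channel: str, attempt: int = 0):
        if attempt >= len(RETRY_DELAYS):
            # Give up: hand the message to a dead-letter queue for inspection
            await self.redis.rpush("notifications:dlq", json.dumps({
                "user_id": user["id"], "sale_id": trigger.sale_id, "channel": channel,
            }))
            return
        payload = json.dumps({
            "user": user, "sale_id": trigger.sale_id,
            "channel": channel, "attempt": attempt + 1,
        })
        # Score = when the retry becomes due; a drain loop pops due entries and re-sends.
        await self.redis.zadd(
            "notifications:retry",
            {payload: time.time() + RETRY_DELAYS[attempt]},
        )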
Interviewer: "Good. One more thing ā how does the flash sale scheduling work? You mentioned these happen every hour. Is that a cron job?"
Phase 7: Deep Dive - Sale Scheduling (5 minutes)
You: "Yes, this is where the distributed cron patterns from Day 5 come in. We need to ensure each sale starts exactly once, even across multiple servers."
Sale Scheduling Architecture
──────────────────────────────────────────────────────────────────────────
                       DISTRIBUTED SALE SCHEDULER
──────────────────────────────────────────────────────────────────────────

  ETCD COORDINATION LAYER
  ├── /elections/sale-scheduler/leader
  ├── /config/sales/upcoming
  └── /fencing/current_token
        │
        │  Only the leader runs
        ▼
  SALE SCHEDULER SERVICE

  Every minute:
    1. Check for sales starting in the next 10 minutes
    2. Pre-compute notification recipients
    3. Warm up Redis with sale data

  At the scheduled time:
    1. Verify fencing token
    2. Initialize inventory counter
    3. Mark sale as active
    4. Trigger notifications

  At sale end:
    1. Mark sale as ended
    2. Return unclaimed inventory
    3. Generate sale report
──────────────────────────────────────────────────────────────────────────
Sale Scheduler Implementation
import asyncio
import logging
import time
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

class FlashSaleScheduler:
"""
Distributed flash sale scheduler using Day 5 patterns:
- Leader election
- Fencing tokens
- Exactly-once scheduling
- Heartbeat monitoring
"""
def __init__(
self,
etcd_client,
redis: Redis,
db: Database,
kafka: KafkaProducer,
notification_service: NotificationService
):
self.etcd = etcd_client
self.redis = redis
self.db = db
self.kafka = kafka
self.notifications = notification_service
self.is_leader = False
self.fencing_token = 0
self.leader_election = LeaderElection(
etcd_client,
election_name="sale-scheduler"
)
async def start(self):
"""Start the scheduler."""
# Start leader election
asyncio.create_task(
self.leader_election.campaign(
on_elected=self._on_elected,
on_demoted=self._on_demoted
)
)
# Main scheduler loop
while True:
if self.is_leader:
await self._scheduler_tick()
await asyncio.sleep(1)
def _on_elected(self):
"""Called when this instance becomes leader."""
self.is_leader = True
self.fencing_token = int(time.time() * 1000)
logger.info("Became sale scheduler leader", token=self.fencing_token)
def _on_demoted(self):
"""Called when this instance loses leadership."""
self.is_leader = False
logger.info("Lost sale scheduler leadership")
async def _scheduler_tick(self):
"""Main scheduling loop tick."""
now = datetime.utcnow()
# Find sales that need action
await self._prepare_upcoming_sales(now)
await self._start_due_sales(now)
await self._end_expired_sales(now)
async def _prepare_upcoming_sales(self, now: datetime):
"""Prepare sales starting in next 10 minutes."""
upcoming = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time BETWEEN $1 AND $2
AND preparation_started = false
""", now, now + timedelta(minutes=10))
for sale in upcoming:
# Mark as preparing (idempotent)
await self.db.execute("""
UPDATE flash_sales
SET preparation_started = true,
preparation_token = $1
WHERE id = $2 AND preparation_started = false
""", self.fencing_token, sale.id)
# Pre-compute notifications (async)
asyncio.create_task(
self.notifications.prepare_recipients(sale)
)
# Warm up Redis
await self._warm_up_sale_data(sale)
async def _start_due_sales(self, now: datetime):
"""Start sales that are due."""
due_sales = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time <= $1
AND preparation_started = true
""", now)
for sale in due_sales:
await self._start_sale(sale)
async def _start_sale(self, sale):
"""
Start a single sale.
Uses fencing token to prevent double-start.
"""
# Atomically claim sale start with fencing token
claimed = await self.db.execute("""
UPDATE flash_sales
SET status = 'active',
started_at = NOW(),
started_by_token = $1
WHERE id = $2
AND status = 'scheduled'
AND (started_by_token IS NULL OR started_by_token < $1)
RETURNING id
""", self.fencing_token, sale.id)
if not claimed:
logger.warning("Could not claim sale start", sale_id=sale.id)
return
logger.info("Starting sale", sale_id=sale.id, token=self.fencing_token)
# Initialize inventory in Redis
await self.redis.set(
f"inventory:{sale.id}",
sale.total_inventory
)
# Add to active sales set
await self.redis.sadd('active_sales', sale.id)
# Trigger notifications
await self.kafka.send('notification_triggers', {
'type': 'sale_started',
'sale_id': sale.id,
'product_name': sale.product_name,
'fencing_token': self.fencing_token
})
# Schedule sale end
asyncio.create_task(
self._schedule_sale_end(sale.id, sale.duration_minutes)
)
async def _schedule_sale_end(self, sale_id: str, duration_minutes: int):
"""Schedule the end of a sale."""
await asyncio.sleep(duration_minutes * 60)
if self.is_leader: # Only end if still leader
await self._end_sale(sale_id)
async def _end_sale(self, sale_id: str):
"""End a sale and clean up."""
# Update database with fencing token
ended = await self.db.execute("""
UPDATE flash_sales
SET status = 'ended',
ended_at = NOW(),
ended_by_token = $1
WHERE id = $2
AND status = 'active'
RETURNING id
""", self.fencing_token, sale_id)
if not ended:
return
logger.info("Ending sale", sale_id=sale_id)
# Remove from active sales
await self.redis.srem('active_sales', sale_id)
# Return unclaimed inventory to report
remaining = await self.redis.get(f"inventory:{sale_id}")
# Publish sale ended event
await self.kafka.send('sale_events', {
'type': 'sale_ended',
'sale_id': sale_id,
'remaining_inventory': int(remaining) if remaining else 0,
'fencing_token': self.fencing_token
})
# Clean up Redis (keep for a while for debugging)
await self.redis.expire(f"inventory:{sale_id}", 3600)
async def _warm_up_sale_data(self, sale):
"""Pre-load sale data into Redis."""
await self.redis.hset(f"sale:{sale.id}", mapping={
'product_id': sale.product_id,
'product_name': sale.product_name,
'original_price': sale.original_price,
'sale_price': sale.sale_price,
'total_inventory': sale.total_inventory,
'start_time': sale.start_time.isoformat(),
'end_time': sale.end_time.isoformat()
})
# Set TTL beyond sale end
await self.redis.expire(
f"sale:{sale.id}",
sale.duration_minutes * 60 + 3600
)
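The scheduler assumes an etcd-backed LeaderElection helper. As a simplified stand-in that shows the same campaign/renew/demote shape, here is a Redis-lock-based sketch; this is not the etcd mechanism used above (which also provides fencing tokens and watch semantics), and the key name, TTL, and renewal cadence are assumptions.

# Simplified leader-election sketch using a Redis lock with TTL (illustration only).
import asyncio
import uuid

class LeaderElection:
    def __init__(self, redis, election_name: str, ttl: int = 10):
        self.redis = redis                      # assumes decode_responses=True
        self.key = f"elections:{election_name}:leader"
        self.instance_id = uuid.uuid4().hex
        self.ttl = ttl

    async def campaign(self, on_elected, on_demoted):
        is_leader = False
        while True:
            # SET NX succeeds only when there is no current leader; EX bounds the lease.
            acquired = await self.redis.set(self.key, self.instance_id,
                                            nx=True, ex=self.ttl)
            if acquired and not is_leader:
                is_leader = True
                on_elected()
            elif is_leader:
                # Renew the lease only if we still own it.
                owner = await self.redis.get(self.key)
                if owner == self.instance_id:
                    await self.redis.expire(self.key, self.ttl)
                else:
                    is_leader = False
                    on_demoted()
            await asyncio.sleep(self.ttl / 3)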
Phase 8: Monitoring and Observability (3 minutes)
Interviewer: "How would you monitor this system? What alerts would you set up?"
You: "Monitoring is critical for a flash sale system. Here's my approach:"
Key Metrics Dashboard
───────────────────────────────────────────────────────────────────────────
                     FLASH SALE REAL-TIME DASHBOARD
───────────────────────────────────────────────────────────────────────────

  CURRENT SALE: iPhone 15 Pro Lightning Deal

  INVENTORY              CLAIMS                 CHECKOUTS
  2,847 remaining        7,153 active           6,421 completed
  ▼ 450/min              ▲ 320/min              ▲ 280/min

  REQUESTS/SEC           ERROR RATE             P99 LATENCY
  45,231                 0.12%                  247ms
  peak: 150k             budget: 1%             budget: 500ms

  CIRCUIT BREAKERS
  ├── Stripe API       CLOSED   99.8% success
  ├── Redis Cluster    CLOSED   100% success
  ├── Push (FCM)       CLOSED   97.2% success
  ├── Push (APNs)      CLOSED   99.9% success
  └── Email (SES)      CLOSED   99.7% success

  NOTIFICATIONS
  ├── Total to send    2,400,000
  ├── Sent             2,387,421 (99.5%)
  ├── Failed           12,579 (0.5%)
  ├── Pending          0
  └── Time elapsed     18 seconds
───────────────────────────────────────────────────────────────────────────
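The dashboard and the alert rules below assume the services export these metrics. A minimal instrumentation sketch with prometheus_client is shown here; the metric names follow the alert rules, while the label sets, buckets, and scrape port are assumptions.

# Minimal metrics-exporter sketch matching the metric names in the alerts below.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

FLASH_SALE_INVENTORY = Gauge(
    "flash_sale_inventory", "Remaining inventory", ["sale_id"])
FLASH_SALE_ACTIVE = Gauge(
    "flash_sale_active", "1 while a sale is running", ["sale_id"])
CLAIM_LATENCY = Histogram(
    "claim_request_duration_seconds", "Claim endpoint latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2])
CHECKOUT_LATENCY = Histogram(
    "checkout_request_duration_seconds", "Checkout endpoint latency",
    buckets=[0.25, 0.5, 1, 2, 3, 5, 10])
HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["method", "path", "status"])
CIRCUIT_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=half-open, 2=open", ["name"])

def start_metrics_endpoint(port: int = 9102) -> None:
    # Prometheus scrapes this endpoint; the port is an assumption.
    start_http_server(port)

# Example usage inside the claim handler:
#   with CLAIM_LATENCY.time():
#       ...handle claim...
#   HTTP_REQUESTS.labels("POST", "/claim", "200").inc()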
Critical Alerts
# Prometheus Alert Rules
groups:
- name: flash_sale_critical
rules:
# INVENTORY ALERTS
- alert: InventoryOversold
expr: flash_sale_inventory < 0
for: 0s
labels:
severity: critical
annotations:
summary: "CRITICAL: Inventory oversold for sale {{ $labels.sale_id }}"
- alert: InventoryNotDecreasing
expr: rate(flash_sale_inventory[1m]) >= 0 AND flash_sale_active == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Inventory not moving during active sale"
# LATENCY ALERTS
- alert: ClaimLatencyHigh
expr: histogram_quantile(0.99, rate(claim_request_duration_seconds_bucket[1m])) > 0.5
for: 1m
labels:
severity: critical
annotations:
summary: "P99 claim latency {{ $value }}s exceeds 500ms"
- alert: CheckoutLatencyHigh
expr: histogram_quantile(0.99, rate(checkout_request_duration_seconds_bucket[1m])) > 3
for: 1m
labels:
severity: critical
annotations:
summary: "P99 checkout latency {{ $value }}s exceeds 3s"
# ERROR RATE ALERTS
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) > 0.01
for: 1m
labels:
severity: critical
annotations:
summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
# CIRCUIT BREAKER ALERTS
- alert: CircuitBreakerOpen
expr: circuit_breaker_state == 2 # 2 = OPEN
for: 0s
labels:
severity: critical
annotations:
summary: "Circuit breaker {{ $labels.name }} is OPEN"
# INFRASTRUCTURE ALERTS
- alert: RedisHighLatency
expr: redis_command_duration_seconds_p99 > 0.01
for: 1m
labels:
severity: warning
annotations:
summary: "Redis P99 latency {{ $value }}s exceeds 10ms"
- alert: KafkaLag
expr: kafka_consumer_lag > 10000
for: 2m
labels:
severity: warning
annotations:
summary: "Kafka consumer lag {{ $value }} exceeds 10k"
Phase 9: Wrap-Up and Extensions (2 minutes)
Interviewer: "We're almost out of time. Let's quickly discuss: what would you do differently at 10x the scale?"
You: "Great question. At 10x scale (50M concurrent users, 100K inventory), I'd make these changes:"
Scaling to 10x
CURRENT → 10x SCALE

TRAFFIC HANDLING:
├── Current: 150K req/sec peak
├── 10x: 1.5M req/sec peak
└── Solution:
    - Geographic distribution (multiple regions)
    - Regional inventory pools
    - Edge caching for sale pages
    - WebSocket for inventory updates (reduce polling)

INVENTORY MANAGEMENT:
├── Current: Single Redis cluster
├── 10x: Redis cluster can't handle atomic ops at this rate
└── Solution:
    - Shard inventory by region
    - Eventual consistency for the display count
    - Strong consistency only for actual claims
    - Consider CockroachDB for distributed transactions

NOTIFICATION DELIVERY:
├── Current: 2.4M in 30 seconds
├── 10x: 24M in 30 seconds
└── Solution:
    - Pre-send notifications (hint: "sale starting in 5 seconds")
    - Progressive rollout by user segment
    - More aggressive batching to FCM/APNs

DATABASE:
├── Current: PostgreSQL primary + 2 replicas
├── 10x: Single primary becomes a bottleneck
└── Solution:
    - Vitess or CockroachDB for horizontal scaling
    - Event sourcing for order creation
    - Read from cache, async write to DB
Alternative Approaches Considered
You: "I should also mention alternatives I considered but didn't choose:"
ALTERNATIVE: Queue-Based Claims
Instead of: Direct Redis DECR for claims
Alternative: Put all claim requests in a queue, process in order
Pros: Perfect ordering, simpler consistency
Cons: Higher latency, doesn't match "instant" UX requirement
Decision: Direct Redis better for our latency requirements
ALTERNATIVE: Reservation-Style Inventory
Instead of: Decrement on claim
Alternative: Decrement only on successful checkout
Pros: No need for claim expiry handling
Cons: Inventory shows as available but isn't, worse UX
Decision: Claim-then-checkout better for transparency
ALTERNATIVE: Distributed Lock per Item
Instead of: Atomic counter
Alternative: Lock each inventory unit individually
Pros: Fine-grained control
Cons: 10,000 locks = complexity nightmare
Decision: Atomic counter much simpler
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, handled the scale estimation well, and made good trade-off decisions. Any questions for me?"
You: "Thank you! I'd love to hear how MegaMart currently handles this, and what challenges you've faced in production."
Summary: Week 1-2 Concepts Applied
Week 1 Concepts (Foundations of Scale)
| Concept | Application in This Design |
|---|---|
| Horizontal vs Vertical Scaling | API services scale horizontally, Redis and PostgreSQL scale with clustering |
| Database Partitioning | Orders partitioned by date, notifications batched by user_id hash |
| Caching Strategies | CDN for static content, Redis for hot data, 1-second TTL for inventory count |
| Load Balancing | ALB across API servers, partition-aware Kafka consumers |
| Message Queues | Kafka for event-driven architecture, async notification processing |
Week 2 Concepts (Failure-First Design)
| Day | Concept | Application |
|---|---|---|
| Day 1 | Timeouts | Payment timeout (10s), checkout budget (25s), claim validation |
| Day 2 | Idempotency | Claim idempotency key, checkout idempotency, Stripe idempotency_key |
| Day 3 | Circuit Breakers | Stripe circuit, FCM/APNs circuits, Redis circuit |
| Day 4 | Webhooks | Notification delivery, at-least-once semantics, retry with backoff |
| Day 5 | Distributed Cron | Sale scheduling, leader election, fencing tokens |
Code Patterns Demonstrated
1. ATOMIC OPERATIONS
- Redis Lua scripts for claim
- Database transactions for orders
2. IDEMPOTENCY IMPLEMENTATION
- Check-before-execute pattern
- Idempotency store with TTL
3. CIRCUIT BREAKER INTEGRATION
- Check before external calls
- Record success/failure
- Fallback behavior
4. TIMEOUT BUDGETS
- Total budget for operation
- Remaining budget propagation
- Timeout on all external calls
5. LEADER ELECTION
- etcd-based election
- Fencing token validation
- Graceful failover
6. EVENT-DRIVEN ARCHITECTURE
- Kafka for event publishing
- Consumer groups for parallel processing
- At-least-once delivery
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Estimate traffic and storage requirements from business requirements
- Design a system that handles massive traffic spikes
- Implement atomic inventory operations without race conditions
- Integrate idempotency at multiple levels (claim, checkout, notifications)
- Apply circuit breakers to protect against external service failures
- Design a notification system that delivers millions of messages quickly
- Implement distributed scheduling with exactly-once semantics
- Set up meaningful monitoring and alerting
- Discuss trade-offs and alternatives clearly
- Handle follow-up questions about scaling and edge cases
This capstone problem integrates all concepts from Weeks 1-2 of the System Design Mastery Series. Use this as a template for approaching similar interview problems.