Week 4 Capstone: Design a Global E-Commerce Product Catalog System
A Real-World Problem Covering Everything You've Learned in Week 4
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions ā this is meant to be collaborative."
They write on the whiteboard:
Design a Global E-Commerce Product Catalog System

You're building the product catalog for a large e-commerce platform like Amazon or Shopify. The system serves product pages, search results, category listings, and personalized recommendations.

Key challenges:
- 50 million products across 10,000 categories
- 500 million daily page views globally
- Flash sales with 100x traffic spikes
- Price/inventory updates must reflect within seconds
- Personalized "Recently Viewed" and "Recommended For You"

Focus on the CACHING ARCHITECTURE
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes. I'm particularly interested in how you'd design the caching layers."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial: never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, what's the read-to-write ratio? How often do products change?"
Interviewer: "Reads vastly outnumber writes. Product metadata like descriptions change maybe once a week. But prices can change multiple times per day, and inventory changes constantly ā every purchase affects it."
You: "Interesting. So we have different freshness requirements for different data. For the flash sales ā what's the traffic pattern? How quickly does it spike?"
Interviewer: "Flash sales are announced in advance but start at a specific time. Traffic goes from normal to 100x in about 30 seconds. We had an incident last Black Friday where the site went down at midnight."
You: "That sounds like a thundering herd problem. For personalization ā the 'Recently Viewed' and recommendations ā are these per-user or can they be shared?"
Interviewer: "Per-user. Each user has their own browsing history and recommendations based on their behavior."
You: "Got it. How global is the user base? Do we need multi-region deployment?"
Interviewer: "Yes, we have users in North America, Europe, and Asia. Latency matters ā we want sub-200ms page loads globally."
You: "For the price changes, what's the acceptable staleness? Can a user see an old price for a few seconds?"
Interviewer: "For display purposes, a few seconds is acceptable. But at checkout, they MUST see the current price. We've had complaints about price mismatches."
You: "Last question ā are there any hot products that get disproportionate traffic? Like featured items or viral products?"
Interviewer: "Absolutely. Our 'Deal of the Day' gets about 10% of all traffic. And when influencers mention products, those can spike to millions of views in minutes."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. PRODUCT DISPLAY
- Product detail pages (title, description, images, specs)
- Product pricing (current price, discounts, sale prices)
- Product availability (inventory count, in-stock status)
- Product reviews and ratings (aggregated)
2. PRODUCT DISCOVERY
- Category browsing (hierarchical categories)
- Search results (filtered, sorted, paginated)
- Homepage featured products
3. PERSONALIZATION
- Recently viewed products (per user)
- Personalized recommendations (per user)
- "Customers also bought" (per product)
4. REAL-TIME UPDATES
- Flash sale price activation
- Inventory updates on purchase
- New product launches
Non-Functional Requirements
1. SCALE
- 50 million products
- 500 million page views/day (~6,000 req/sec average)
- 100x spike during flash sales (~600,000 req/sec peak)
- 100 million registered users
2. LATENCY
- Product page load: <200ms p99 (globally)
- Search results: <300ms p99
- Personalization: <100ms p99
3. FRESHNESS (Bounded Staleness)
- Product metadata: 1 hour acceptable
- Price: <5 seconds during normal, immediate during checkout
- Inventory: <30 seconds for display, real-time for purchase
4. AVAILABILITY
- 99.9% uptime
- Graceful degradation during failures
- No downtime during flash sales
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Traffic Estimation
PAGE VIEW TRAFFIC
Daily page views: 500 million
Seconds per day: 86,400
Average requests/sec: ~6,000 req/sec
Peak traffic (flash sale): 100x normal
Peak requests/sec: ~600,000 req/sec
Breakdown by page type:
├── Product detail pages: 60% → 3,600 req/sec (360K peak)
├── Category/search: 25% → 1,500 req/sec (150K peak)
├── Homepage: 10% → 600 req/sec (60K peak)
└── Personalization: 5% → 300 req/sec (30K peak)
Storage Estimation
PRODUCT DATA
Products: 50 million
Average product size:
├── Metadata (title, desc): 5 KB
├── Pricing data: 100 bytes
├── Inventory data: 100 bytes
├── Images (URLs only): 500 bytes
├── Reviews aggregate: 200 bytes
└── Total per product: ~6 KB
Total product data: 50M × 6 KB = 300 GB
PERSONALIZATION DATA
Users: 100 million
Recently viewed (20 items): 20 × 50 bytes = 1 KB per user
Recommendations (50 items): 50 × 50 bytes = 2.5 KB per user
Total per user: ~3.5 KB
Total personalization: 100M × 3.5 KB = 350 GB
Cache Sizing
CACHE REQUIREMENTS
Product cache (hot products):
├── Assume 20% of products are "hot" (viewed daily)
├── Hot products: 10 million
├── Size: 10M × 6 KB = 60 GB
└── Add overhead: ~80 GB Redis
Personalization cache:
├── Active users (daily): ~50 million
├── Size: 50M × 3.5 KB = 175 GB
└── Add overhead: ~200 GB Redis
Category/search results cache:
├── 10,000 categories × 50 variations = 500K entries
├── Size: 500K × 10 KB = 5 GB
└── Add overhead: ~10 GB Redis
TOTAL REDIS CLUSTER: ~300 GB
(Distributed across multiple nodes)
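These figures are quick to sanity-check. A few lines of arithmetic, with all constants taken from the estimates above:

# Back-of-the-Envelope Sanity Check
DAILY_PAGE_VIEWS = 500_000_000
SECONDS_PER_DAY = 86_400

avg_rps = DAILY_PAGE_VIEWS / SECONDS_PER_DAY        # ~5,787 -> round to 6,000
peak_rps = 100 * 6_000                              # 100x spike = 600,000 req/sec

product_cache_gb = 10_000_000 * 6 / 1_000_000       # 10M hot products x 6 KB = 60 GB
personalization_gb = 50_000_000 * 3.5 / 1_000_000   # 50M active users x 3.5 KB = 175 GB

# The critical insight: even a 1% miss rate at peak floods the database
db_qps_at_peak = peak_rps * 0.01                    # 6,000 queries/sec

print(f"avg={avg_rps:,.0f} rps, peak={peak_rps:,} rps, "
      f"miss load={db_qps_at_peak:,.0f} qps")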
Key Metrics Summary
ESTIMATION SUMMARY

TRAFFIC
├── Average: 6,000 req/sec
├── Peak (flash sale): 600,000 req/sec
└── Target cache hit ratio: >99% (to survive peak)

STORAGE
├── Product data: 300 GB (database)
├── Personalization: 350 GB (database)
└── Total cache: ~300 GB (Redis cluster)

INFRASTRUCTURE
├── Redis nodes: 10 nodes × 32 GB
├── CDN edge locations: 50+ global PoPs
└── API servers: 100 instances (auto-scaling)

CRITICAL INSIGHT:
At 600K req/sec peak, even a 1% cache miss rate means 6,000 DB queries/sec.
The database cannot handle that, so the cache hit rate must stay above 99%.
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture, focusing on the caching layers."
System Architecture
MULTI-TIER CACHING ARCHITECTURE

CLIENTS
Browser / Mobile App / Third-party
[Browser Cache Layer]
          │
          ▼
CDN LAYER
CloudFront / Fastly (50+ global edge locations)
• Static assets (images, JS, CSS): 1-year TTL
• Product pages (anonymous): 60s TTL
• Category pages: 60s TTL
• NOT cached: personalized content, prices
          │
          ▼
API GATEWAY
(Rate Limiting, Auth)
          │
          ▼
Product Service | Search Service | Personalization Service
          │
          ▼
REDIS CLUSTER
(Application Cache: ~300 GB across 10 nodes)
Product Cache:    product:{id} → full product data
Price Cache:      price:{id} → current price
Inventory:        inventory:{id} → stock count
Category Cache:   category:{id}:page:{n} → product list
User Feed:        user:{id}:recent → product IDs
Recommendations:  user:{id}:recs → product IDs
          │
          ▼
DATABASES
PostgreSQL (Products) | Elasticsearch (Search) | DynamoDB (User Data)

EVENT STREAM (Kafka)
price.updated | inventory.changed | product.modified
→ Triggers cache invalidation across all tiers
Component Breakdown
You: "Let me walk through each component and its caching role..."
1. Browser Cache Layer
Purpose: Cache static assets and reduce redundant requests
Strategy:
- Static assets: Cache-Control: public, max-age=31536000, immutable
- Product pages: Cache-Control: private, max-age=60, stale-while-revalidate=30
- Personalized: Cache-Control: private, no-cache, with an ETag for revalidation
2. CDN Layer
Purpose: Serve content from edge locations globally for low latency
Strategy:
- Product images: Long TTL (7 days), purge on update
- Anonymous product pages: Short TTL (60s)
- Search results: Short TTL (30s) with Vary on query params
- Personalized: Bypass CDN entirely
3. Application Cache (Redis)
Purpose: Fast access to frequently requested data
Strategy:
- Cache-aside pattern for most data
- Write-through for critical data (prices)
- Different TTLs based on data type
4. Event-Driven Invalidation
Purpose: Keep caches fresh when data changes
Strategy:
- Kafka events trigger invalidation
- Invalidate app cache → gateway → CDN (in that order; see the consumer sketch below)
- Different strategies for different data types
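The event-driven piece deserves a concrete sketch: a consumer loop that maps Kafka topics to the data types used by the invalidation service detailed in Deep Dive 2. The aiokafka client and broker address are assumptions for illustration; the topic names and consumer group match the diagram and runbook in this design.

# Kafka Consumer Wiring (sketch; assumes aiokafka)
import asyncio
import json
from aiokafka import AIOKafkaConsumer

TOPIC_TO_DATA_TYPE = {
    "price.updated": "product_price",
    "inventory.changed": "product_inventory",
    "product.modified": "product_metadata",
}

async def run_invalidation_consumer(invalidation_service):
    """Consume change events and trigger tier-ordered invalidation."""
    consumer = AIOKafkaConsumer(
        *TOPIC_TO_DATA_TYPE,
        bootstrap_servers="kafka:9092",  # placeholder address
        group_id="cache-invalidation",
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for msg in consumer:
            await invalidation_service.invalidate(
                data_type=TOPIC_TO_DATA_TYPE[msg.topic],
                entity_id=msg.value["product_id"],
            )
    finally:
        await consumer.stop()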
Data Flow: Product Page Request
You: "Let me trace through a typical product page request..."
PRODUCT PAGE REQUEST FLOW
User requests: /products/ABC123
1. BROWSER CHECK
├── If cached and fresh → return immediately
└── If stale → send a conditional request (If-None-Match)
2. CDN CHECK
├── If HIT → return from the edge (~20ms latency)
└── If MISS → forward to origin
3. API GATEWAY
├── Auth check (cached token validation)
├── Rate limit check
└── Route to Product Service
4. PRODUCT SERVICE
├── Check Redis: product:ABC123
├── If HIT → return cached product (~2ms)
└── If MISS → query PostgreSQL, cache the result
5. PRICE SERVICE (separate call)
├── Check Redis: price:ABC123
├── Short TTL (30s) + event-driven invalidation
└── Always fresh at checkout
6. ASSEMBLE RESPONSE
├── Combine product + price + inventory
├── Set appropriate cache headers
└── Return to user
Total latency (cache hit): <50ms
Total latency (cache miss): <200ms
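Step 6 above can be a simple parallel fan-out. A minimal sketch of the assembly; the three service objects are illustrative stand-ins for the services in the architecture diagram:

# Product Page Assembly (sketch; service objects are illustrative)
import asyncio

async def assemble_product_page(product_id: str) -> dict:
    """Fan out to the product, price, and inventory caches in parallel."""
    product, price, inventory = await asyncio.gather(
        product_service.get_product(product_id),      # Redis: product:{id}
        price_service.get_price(product_id),          # Redis: price:{id}, 30s TTL
        inventory_service.get_inventory(product_id),  # Redis: inventory:{id}
    )
    return {**product, "price": price, "inventory": inventory}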
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into some specific challenges. How would you handle the flash sale traffic spike?"
Deep Dive 1: Thundering Herd Protection (Week 4, Day 3)
You: "Flash sales are a classic thundering herd scenario. At midnight, thousands of users refresh simultaneously. If the cache expires or is empty, all those requests hit the database at once."
The Problem
FLASH SALE THUNDERING HERD

11:59:59 PM - Normal traffic
├── 6,000 req/sec
├── 99% cache hit
└── 60 DB queries/sec

12:00:00 AM - Flash sale starts
├── Cache expires on the "Deal of the Day" product
├── 100,000 users refresh simultaneously
├── ALL requests miss the cache
├── 100,000 DB queries in one second!
├── Database overwhelmed
└── Site goes down
This is what happened last Black Friday.
The Solution
You: "I'd implement multiple layers of thundering herd protection..."
# Thundering Herd Protection Implementation
# Applies: Week 4, Day 3
import asyncio
import json
import logging
import random
from dataclasses import dataclass
from typing import Optional, Dict, Any, Callable
from datetime import datetime, timedelta
logger = logging.getLogger(__name__)
@dataclass
class CacheEntry:
"""Cache entry with fresh and stale TTLs."""
value: Any
fresh_until: datetime
stale_until: datetime
version: int = 0
class ThunderingHerdProtectedCache:
"""
Cache with multiple thundering herd protections.
Protections applied:
1. Request coalescing - Duplicate requests share one fetch
2. Stale-while-revalidate - Serve stale, refresh async
3. Probabilistic early expiration - Stagger expiration times
4. Background refresh - Pre-refresh hot keys before expiry
"""
def __init__(self, redis_client, config: dict = None):
self.redis = redis_client
self.config = config or {}
# In-flight requests for coalescing
self._in_flight: Dict[str, asyncio.Future] = {}
# Hot keys for background refresh
self._hot_keys: set = set()
# Metrics
self._coalesced_count = 0
self._stale_served_count = 0
async def get(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int = 60,
stale_ttl: int = 300
) -> Any:
"""
Get value with thundering herd protection.
1. Check cache - if fresh, return immediately
2. If stale but valid, return stale and refresh async
3. If expired, coalesce requests and fetch once
"""
# Try to get from cache
entry = await self._get_entry(key)
now = datetime.utcnow()
# Case 1: Fresh cache hit
if entry and entry.fresh_until > now:
return entry.value
# Case 2: Stale but within stale window - serve stale, refresh async
if entry and entry.stale_until > now:
self._stale_served_count += 1
# Trigger async refresh (don't wait)
asyncio.create_task(
self._refresh_async(key, fetch_func, fresh_ttl, stale_ttl)
)
return entry.value
# Case 3: Expired or missing - fetch with coalescing
return await self._fetch_with_coalescing(
key, fetch_func, fresh_ttl, stale_ttl
)
async def _fetch_with_coalescing(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int,
stale_ttl: int
) -> Any:
"""
Fetch with request coalescing.
If multiple requests come in for the same key,
only one actually fetches - others wait for it.
"""
# Check if there's already a fetch in flight
if key in self._in_flight:
self._coalesced_count += 1
logger.debug(f"Coalescing request for {key}")
return await self._in_flight[key]
# Create future for this fetch
future = asyncio.get_event_loop().create_future()
self._in_flight[key] = future
try:
# Actually fetch the data
value = await fetch_func()
# Apply probabilistic early expiration
# Add jitter to prevent synchronized expiration
jitter = random.uniform(0.8, 1.0)
actual_fresh_ttl = int(fresh_ttl * jitter)
# Cache it
await self._set_entry(key, value, actual_fresh_ttl, stale_ttl)
# Complete the future
future.set_result(value)
return value
except Exception as e:
future.set_exception(e)
raise
finally:
del self._in_flight[key]
async def _refresh_async(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int,
stale_ttl: int
):
"""Background refresh without blocking."""
try:
# Don't refresh if already in flight
if key in self._in_flight:
return
value = await fetch_func()
jitter = random.uniform(0.8, 1.0)
actual_fresh_ttl = int(fresh_ttl * jitter)
await self._set_entry(key, value, actual_fresh_ttl, stale_ttl)
logger.debug(f"Background refresh completed for {key}")
except Exception as e:
logger.warning(f"Background refresh failed for {key}: {e}")
async def _get_entry(self, key: str) -> Optional[CacheEntry]:
"""Get cache entry with metadata."""
data = await self.redis.hgetall(f"cache:{key}")
if not data:
return None
return CacheEntry(
value=json.loads(data[b'value']),
fresh_until=datetime.fromisoformat(data[b'fresh_until'].decode()),
stale_until=datetime.fromisoformat(data[b'stale_until'].decode()),
version=int(data.get(b'version', 0))
)
async def _set_entry(
self,
key: str,
value: Any,
fresh_ttl: int,
stale_ttl: int
):
"""Set cache entry with metadata."""
now = datetime.utcnow()
entry_data = {
'value': json.dumps(value, default=str),
'fresh_until': (now + timedelta(seconds=fresh_ttl)).isoformat(),
'stale_until': (now + timedelta(seconds=fresh_ttl + stale_ttl)).isoformat(),
'version': str(int(now.timestamp()))
}
pipe = self.redis.pipeline()
pipe.hset(f"cache:{key}", mapping=entry_data)
pipe.expire(f"cache:{key}", fresh_ttl + stale_ttl)
await pipe.execute()
# =========================================================================
# Background Refresh for Hot Keys (Flash Sale Products)
# =========================================================================
def mark_hot(self, key: str):
"""Mark a key as hot for background refresh."""
self._hot_keys.add(key)
def unmark_hot(self, key: str):
"""Remove key from hot list."""
self._hot_keys.discard(key)
async def run_background_refresh(self, fetch_funcs: Dict[str, Callable]):
"""
Background job to keep hot keys fresh.
Run this continuously to ensure flash sale products
are ALWAYS in cache before they're requested.
"""
while True:
for key in list(self._hot_keys):
try:
entry = await self._get_entry(key)
# Refresh if within 20% of fresh TTL expiring
if entry:
now = datetime.utcnow()
time_to_stale = (entry.fresh_until - now).total_seconds()
if time_to_stale < 12: # Less than 12 seconds to stale
if key in fetch_funcs:
await self._refresh_async(
key, fetch_funcs[key], 60, 300
)
except Exception as e:
logger.error(f"Background refresh error for {key}: {e}")
await asyncio.sleep(5) # Check every 5 seconds
# =============================================================================
# Flash Sale Cache Warming
# =============================================================================
class FlashSaleCacheWarmer:
"""
Pre-warm cache before flash sales start.
Flash sales are scheduled in advance, so we know
which products will be hit. Warm them before midnight!
"""
def __init__(self, cache: ThunderingHerdProtectedCache, db_client):
self.cache = cache
self.db = db_client
async def warm_flash_sale(self, sale_id: str, start_time: datetime):
"""
Warm cache for upcoming flash sale.
Call this 5 minutes before sale starts.
"""
# Get flash sale products
products = await self.db.fetch(
"""
SELECT p.* FROM products p
JOIN flash_sale_items fsi ON p.id = fsi.product_id
WHERE fsi.sale_id = $1
""",
sale_id
)
logger.info(f"Warming cache for {len(products)} flash sale products")
for product in products:
product_id = product['id']
# Cache the product
await self.cache._set_entry(
f"product:{product_id}",
dict(product),
fresh_ttl=60,
stale_ttl=300
)
# Mark as hot for background refresh
self.cache.mark_hot(f"product:{product_id}")
logger.info(f"Flash sale cache warming complete")
async def cool_down_flash_sale(self, sale_id: str):
"""Remove products from hot list after sale ends."""
products = await self.db.fetch(
"SELECT product_id FROM flash_sale_items WHERE sale_id = $1",
sale_id
)
for product in products:
self.cache.unmark_hot(f"product:{product['product_id']}")
Edge Cases
Interviewer: "What if the background refresh fails?"
You: "If background refresh fails, the stale value is still served. Users see slightly old data (within stale window), but the system doesn't collapse. We'd alert on refresh failures and extend the stale TTL as a fallback."
Deep Dive 2: Multi-Strategy Cache Invalidation (Week 4, Day 2)
Interviewer: "You mentioned price changes need to reflect within seconds. How do you handle that across all cache layers?"
You: "This is a cache invalidation problem. Different data needs different strategies. Let me show how I'd handle it..."
The Problem
INVALIDATION CHALLENGE
A product has multiple cached data types:
├── Description: changes weekly; 1-hour staleness OK
├── Price: changes daily; <5-second staleness required
├── Inventory: changes constantly; <30-second staleness
└── Images: change rarely; 1-week staleness OK

One invalidation strategy doesn't fit all!

Additionally, data is cached at multiple tiers:
├── Browser cache
├── CDN (50+ edge locations)
├── API Gateway
└── Redis (application cache)

Tiers must be invalidated in the correct order!
The Solution
# Multi-Strategy Cache Invalidation
# Applies: Week 4, Day 2 (Invalidation) + Day 5 (Multi-Tier)
import asyncio
import logging
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum

logger = logging.getLogger(__name__)
class DataFreshness(Enum):
"""Data freshness requirements."""
REAL_TIME = "real_time" # <5 seconds, event-driven
NEAR_TIME = "near_time" # <30 seconds, event + short TTL
EVENTUAL = "eventual" # <1 hour, TTL only
STATIC = "static" # Days/weeks, TTL + purge on change
@dataclass
class InvalidationStrategy:
"""Strategy for a data type."""
freshness: DataFreshness
ttl: int
event_driven: bool
cdn_cache: bool
cdn_ttl: int = 0
# Define strategies for different data types
INVALIDATION_STRATEGIES = {
"product_metadata": InvalidationStrategy(
freshness=DataFreshness.EVENTUAL,
ttl=3600, # 1 hour Redis TTL
event_driven=False, # Just TTL-based
cdn_cache=True,
cdn_ttl=60 # 1 minute CDN
),
"product_price": InvalidationStrategy(
freshness=DataFreshness.REAL_TIME,
ttl=30, # 30 second Redis TTL (safety net)
event_driven=True, # Event-driven invalidation
cdn_cache=False # Don't cache prices at CDN
),
"product_inventory": InvalidationStrategy(
freshness=DataFreshness.NEAR_TIME,
ttl=30, # 30 second Redis TTL
event_driven=True, # Event-driven invalidation
cdn_cache=False # Don't cache inventory at CDN
),
"product_images": InvalidationStrategy(
freshness=DataFreshness.STATIC,
ttl=86400, # 24 hour Redis TTL
event_driven=True, # Purge on image change
cdn_cache=True,
cdn_ttl=604800 # 7 day CDN
),
"category_listing": InvalidationStrategy(
freshness=DataFreshness.EVENTUAL,
ttl=300, # 5 minute Redis TTL
event_driven=False, # Just TTL-based
cdn_cache=True,
cdn_ttl=60 # 1 minute CDN
),
}
class MultiStrategyInvalidationService:
"""
Invalidation service with data-type-aware strategies.
Different data types have different freshness requirements.
This service applies the right strategy for each type.
"""
def __init__(
self,
redis_client,
cdn_client,
event_consumer
):
self.redis = redis_client
self.cdn = cdn_client
self.event_consumer = event_consumer
self.strategies = INVALIDATION_STRATEGIES
async def invalidate(
self,
data_type: str,
entity_id: str,
cascade: bool = True
):
"""
Invalidate cache for an entity.
Args:
data_type: Type of data (determines strategy)
entity_id: ID of the entity
cascade: Whether to invalidate related caches
"""
strategy = self.strategies.get(data_type)
if not strategy:
logger.warning(f"Unknown data type: {data_type}")
return
# Only process if event-driven
if not strategy.event_driven:
logger.debug(f"Skipping invalidation for TTL-only type: {data_type}")
return
        # Invalidation order: app cache → CDN
        # (closest to the database first)
# 1. Application cache (Redis)
await self._invalidate_redis(data_type, entity_id)
# 2. CDN (if cached there)
if strategy.cdn_cache:
await self._invalidate_cdn(data_type, entity_id)
# 3. Cascade to related caches if needed
if cascade:
await self._cascade_invalidation(data_type, entity_id)
    async def _invalidate_redis(self, data_type: str, entity_id: str):
        """Invalidate the application cache."""
        keys = self._get_cache_keys(data_type, entity_id)
        for key in keys:
            if "*" in key:
                # DELETE takes exact keys; expand wildcard patterns via SCAN
                async for match in self.redis.scan_iter(match=key):
                    await self.redis.delete(match)
            else:
                await self.redis.delete(key)
        if keys:
            logger.info(f"Invalidated Redis keys: {keys}")
async def _invalidate_cdn(self, data_type: str, entity_id: str):
"""Invalidate CDN cache."""
urls = self._get_cdn_urls(data_type, entity_id)
for url in urls:
await self.cdn.purge_url(url)
logger.info(f"Purged CDN URLs: {urls}")
async def _cascade_invalidation(self, data_type: str, entity_id: str):
"""Invalidate related caches."""
        # Price change → invalidate the product page cache
if data_type == "product_price":
await self._invalidate_redis("product_page", entity_id)
        # Inventory change → invalidate category listings
if data_type == "product_inventory":
categories = await self._get_product_categories(entity_id)
for cat_id in categories:
await self._invalidate_redis("category_listing", cat_id)
def _get_cache_keys(self, data_type: str, entity_id: str) -> List[str]:
"""Get Redis keys for a data type and entity."""
key_patterns = {
"product_metadata": [f"product:{entity_id}:metadata"],
"product_price": [f"product:{entity_id}:price", f"price:{entity_id}"],
"product_inventory": [f"product:{entity_id}:inventory", f"inventory:{entity_id}"],
"product_images": [f"product:{entity_id}:images"],
"product_page": [f"product:{entity_id}", f"product_page:{entity_id}"],
"category_listing": [f"category:{entity_id}:*"],
}
return key_patterns.get(data_type, [])
def _get_cdn_urls(self, data_type: str, entity_id: str) -> List[str]:
"""Get CDN URLs to purge."""
url_patterns = {
"product_metadata": [f"/api/products/{entity_id}"],
"product_images": [f"/images/products/{entity_id}/*"],
"category_listing": [f"/api/categories/{entity_id}/*"],
}
return url_patterns.get(data_type, [])
async def _get_product_categories(self, product_id: str) -> List[str]:
"""Get categories a product belongs to."""
# Would query database
return []
# =============================================================================
# Event Handler for Real-Time Invalidation
# =============================================================================
class PriceChangeEventHandler:
"""
Handle price change events for real-time cache invalidation.
When a price changes, we must invalidate immediately.
"""
def __init__(self, invalidation_service: MultiStrategyInvalidationService):
self.invalidation = invalidation_service
async def handle(self, event: dict):
"""Handle price change event from Kafka."""
product_id = event['product_id']
old_price = event['old_price']
new_price = event['new_price']
logger.info(
f"Price change detected: product={product_id}, "
f"{old_price} ā {new_price}"
)
# Invalidate price cache across all tiers
await self.invalidation.invalidate(
data_type="product_price",
entity_id=product_id,
cascade=True
)
# If significant price drop (flash sale), warm the cache
if new_price < old_price * 0.5: # 50%+ discount
logger.info(f"Flash sale price detected for {product_id}")
# Mark as hot for background refresh
class InventoryChangeEventHandler:
"""Handle inventory changes."""
def __init__(self, invalidation_service: MultiStrategyInvalidationService):
self.invalidation = invalidation_service
async def handle(self, event: dict):
"""Handle inventory change event."""
product_id = event['product_id']
new_quantity = event['new_quantity']
# Only invalidate if stock status changed
        # (in stock → out of stock, or vice versa)
if event.get('stock_status_changed', False):
await self.invalidation.invalidate(
data_type="product_inventory",
entity_id=product_id,
cascade=True
)
The Safety Net Pattern
You: "I always combine event-driven invalidation with TTL as a safety net. Events can be lost or delayed. TTL ensures eventual consistency even if events fail."
# Safety Net Pattern: Event-Driven + TTL
async def cache_product_price(product_id: str, price: dict):
"""
Cache price with safety net TTL.
- Primary invalidation: Event-driven (real-time)
- Safety net: 30-second TTL (eventual consistency)
Even if the invalidation event is lost, price will
be refreshed within 30 seconds.
"""
await redis.setex(
f"price:{product_id}",
30, # Safety net TTL
json.dumps(price)
)
Deep Dive 3: Personalized Feed Caching (Week 4, Day 4)
Interviewer: "What about the personalized features? Recently Viewed and Recommendations are per-user. How do you cache those?"
You: "This is similar to social media feed caching. The challenge is that we can't pre-compute feeds for 100 million users. We need a hybrid approach."
The Problem
PERSONALIZATION CACHE CHALLENGE

100 million users, each with:
├── Recently Viewed: 20 products
└── Recommendations: 50 products

If we cache feeds for all users:
├── 100M users × 3.5 KB = 350 GB
└── Most users are inactive (wasted storage)

If we compute on demand:
├── Each view: query user history + run the ML model
├── 100ms+ latency
└── Won't scale at peak traffic
The Solution
You: "I'd use a hybrid caching strategy with activity-based tiering..."
# Personalized Feed Caching
# Applies: Week 4, Day 4
import json
import logging
from dataclasses import dataclass
from typing import List, Optional, Dict
from datetime import datetime, timedelta
from enum import Enum

logger = logging.getLogger(__name__)
class UserActivityTier(Enum):
"""User activity tiers for caching strategy."""
ACTIVE = "active" # Logged in today - full cache
RECENT = "recent" # Logged in this week - partial cache
DORMANT = "dormant" # Not logged in for 7+ days - compute on demand
@dataclass
class UserFeedConfig:
"""Configuration for user feed caching."""
recently_viewed_limit: int = 20
recommendations_limit: int = 50
active_user_ttl: int = 3600 # 1 hour for active users
recent_user_ttl: int = 86400 # 24 hours for recent users
dormant_threshold_days: int = 7
class PersonalizedFeedService:
"""
Personalized feed service with activity-based caching.
Strategy:
- Active users (today): Full cache, pre-computed
- Recent users (this week): Partial cache, refresh on access
- Dormant users (7+ days): Compute on demand, cache briefly
"""
def __init__(
self,
redis_client,
db_client,
recommendation_service,
config: UserFeedConfig = None
):
self.redis = redis_client
self.db = db_client
self.recs = recommendation_service
self.config = config or UserFeedConfig()
async def get_recently_viewed(
self,
user_id: str,
limit: int = 20
) -> List[dict]:
"""
Get user's recently viewed products.
Recently viewed is user-specific but simple:
- Just a list of product IDs with timestamps
- Easy to maintain incrementally
"""
cache_key = f"user:{user_id}:recently_viewed"
# Get from sorted set (most recent first)
product_ids = await self.redis.zrevrange(
cache_key, 0, limit - 1
)
if product_ids:
return await self._get_products_by_ids(product_ids)
# Cache miss - compute from database
return await self._compute_recently_viewed(user_id, limit)
async def record_view(self, user_id: str, product_id: str):
"""
Record a product view.
This is write-through: Write to cache AND database.
Cache is always up-to-date.
"""
now = datetime.utcnow().timestamp()
cache_key = f"user:{user_id}:recently_viewed"
pipe = self.redis.pipeline()
# Add to sorted set (score = timestamp)
pipe.zadd(cache_key, {product_id: now})
# Trim to limit (keep most recent 20)
pipe.zremrangebyrank(cache_key, 0, -self.config.recently_viewed_limit - 1)
# Set TTL based on user activity
tier = await self._get_user_tier(user_id)
ttl = self._get_ttl_for_tier(tier)
pipe.expire(cache_key, ttl)
await pipe.execute()
# Also persist to database (async)
await self._persist_view(user_id, product_id)
async def get_recommendations(
self,
user_id: str,
limit: int = 20
) -> List[dict]:
"""
Get personalized recommendations.
Recommendations are expensive to compute (ML model).
Strategy varies by user activity tier.
"""
tier = await self._get_user_tier(user_id)
if tier == UserActivityTier.ACTIVE:
# Active users: Check cache, compute if missing
return await self._get_active_user_recs(user_id, limit)
elif tier == UserActivityTier.RECENT:
# Recent users: Check cache, compute and cache if missing
return await self._get_recent_user_recs(user_id, limit)
else:
# Dormant users: Compute on demand, cache briefly
return await self._get_dormant_user_recs(user_id, limit)
async def _get_active_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for active users (cached)."""
cache_key = f"user:{user_id}:recommendations"
# Check cache
cached = await self.redis.get(cache_key)
if cached:
product_ids = json.loads(cached)[:limit]
return await self._get_products_by_ids(product_ids)
# Cache miss - should be rare for active users
# (Background job should have pre-computed)
return await self._compute_and_cache_recs(
user_id, limit, self.config.active_user_ttl
)
async def _get_recent_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for recent users."""
cache_key = f"user:{user_id}:recommendations"
cached = await self.redis.get(cache_key)
if cached:
product_ids = json.loads(cached)[:limit]
return await self._get_products_by_ids(product_ids)
# Cache miss - compute and cache
return await self._compute_and_cache_recs(
user_id, limit, self.config.recent_user_ttl
)
async def _get_dormant_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for dormant users."""
# Don't check cache for dormant users
# Their data would be stale anyway
# Compute fresh recommendations
recs = await self.recs.compute_recommendations(user_id, limit)
# Cache briefly (5 minutes) in case they browse around
cache_key = f"user:{user_id}:recommendations"
await self.redis.setex(
cache_key,
300,
json.dumps([r['product_id'] for r in recs])
)
return recs
async def _compute_and_cache_recs(
self,
user_id: str,
limit: int,
ttl: int
) -> List[dict]:
"""Compute recommendations and cache."""
recs = await self.recs.compute_recommendations(user_id, limit)
cache_key = f"user:{user_id}:recommendations"
await self.redis.setex(
cache_key,
ttl,
json.dumps([r['product_id'] for r in recs])
)
return recs
async def _get_user_tier(self, user_id: str) -> UserActivityTier:
"""Determine user's activity tier."""
cache_key = f"user:{user_id}:last_active"
last_active = await self.redis.get(cache_key)
if not last_active:
return UserActivityTier.DORMANT
last_active_dt = datetime.fromisoformat(last_active.decode())
days_inactive = (datetime.utcnow() - last_active_dt).days
if days_inactive == 0:
return UserActivityTier.ACTIVE
elif days_inactive < self.config.dormant_threshold_days:
return UserActivityTier.RECENT
else:
return UserActivityTier.DORMANT
def _get_ttl_for_tier(self, tier: UserActivityTier) -> int:
"""Get cache TTL based on user tier."""
ttls = {
UserActivityTier.ACTIVE: self.config.active_user_ttl,
UserActivityTier.RECENT: self.config.recent_user_ttl,
UserActivityTier.DORMANT: 300 # 5 minutes
}
return ttls.get(tier, 300)
async def _get_products_by_ids(self, product_ids: List[str]) -> List[dict]:
"""Fetch products by IDs with caching."""
# Would use the product cache from Deep Dive 1
pass
async def _compute_recently_viewed(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Compute recently viewed from database."""
pass
async def _persist_view(self, user_id: str, product_id: str):
"""Persist view to database (async)."""
pass
# =============================================================================
# Background Job: Pre-Compute Active User Recommendations
# =============================================================================
class RecommendationPreComputer:
"""
Background job to pre-compute recommendations for active users.
Run periodically (e.g., every hour) to ensure active users
always have cached recommendations.
"""
def __init__(
self,
feed_service: PersonalizedFeedService,
db_client
):
self.feed = feed_service
self.db = db_client
async def run(self):
"""Pre-compute recommendations for active users."""
# Get users active in last 24 hours
active_users = await self.db.fetch(
"""
SELECT user_id FROM user_sessions
WHERE last_active > NOW() - INTERVAL '24 hours'
"""
)
logger.info(f"Pre-computing recs for {len(active_users)} active users")
for user in active_users:
try:
await self.feed._compute_and_cache_recs(
user['user_id'],
limit=50,
ttl=3600
)
except Exception as e:
logger.warning(
f"Failed to pre-compute recs for {user['user_id']}: {e}"
)
logger.info("Recommendation pre-computation complete")
Deep Dive 4: Multi-Tier Cache Architecture (Week 4, Day 5)
Interviewer: "You mentioned different cache headers for different content types. Walk me through exactly what gets cached where."
You: "Let me detail the complete multi-tier strategy..."
The Cache Matrix
MULTI-TIER CACHE MATRIX
| Content Type | Browser | CDN | Gateway | Redis |
|---|---|---|---|---|
| Static assets (JS, CSS) | 1 year, immutable | 1 year, immutable | N/A | N/A |
| Product images | 1 week | 1 week, purge on change | N/A | N/A |
| Product page (anonymous) | 60s, swr=30s | 60s, purge on change | 30s | 5 min, event inv. |
| Product page (authenticated) | 0, private | NO (private) | 30s | 5 min, event inv. |
| Product price | 0, no-cache | NO (dynamic) | NO | 30s, event inv. |
| Inventory | 0, no-cache | NO (dynamic) | NO | 30s, event inv. |
| Category page (anonymous) | 60s, swr=30s | 60s | 30s | 5 min |
| Search results | 30s, Vary: q | 30s, Vary: q | NO | 1 min |
| Recently viewed (user-specific) | 0, private | NO | NO | 1 hour, write-thru |
| Recommendations (user-specific) | 0, private | NO | NO | varies by tier |
| Cart | 0, no-store | NO | NO | 30 min |
| Checkout | 0, no-store | NO | NO | NO (real-time) |
Key:
- swr = stale-while-revalidate
- event inv. = event-driven invalidation
- write-thru = write-through caching
Implementation
# Multi-Tier Cache Headers
# Applies: Week 4, Day 5
from fastapi import FastAPI, Response, Request, Depends
from enum import Enum
class ContentType(Enum):
STATIC_ASSET = "static_asset"
PRODUCT_IMAGE = "product_image"
PRODUCT_PAGE_ANON = "product_page_anon"
PRODUCT_PAGE_AUTH = "product_page_auth"
PRODUCT_PRICE = "product_price"
INVENTORY = "inventory"
CATEGORY_PAGE = "category_page"
SEARCH_RESULTS = "search_results"
PERSONALIZED = "personalized"
CART = "cart"
CHECKOUT = "checkout"
class CacheHeaderBuilder:
"""
Build appropriate cache headers for each content type.
Different content requires different caching strategies
across browser, CDN, and gateway layers.
"""
CACHE_POLICIES = {
ContentType.STATIC_ASSET: {
"cache_control": "public, max-age=31536000, immutable",
"cdn_cache": True,
"vary": ["Accept-Encoding"],
},
ContentType.PRODUCT_IMAGE: {
"cache_control": "public, max-age=604800", # 1 week
"cdn_cache": True,
"vary": ["Accept-Encoding"],
},
ContentType.PRODUCT_PAGE_ANON: {
"cache_control": "public, max-age=60, stale-while-revalidate=30",
"cdn_cache": True,
"vary": ["Accept-Encoding", "Accept-Language"],
},
ContentType.PRODUCT_PAGE_AUTH: {
"cache_control": "private, max-age=0, must-revalidate",
"cdn_cache": False,
"vary": ["Authorization", "Accept-Encoding"],
},
ContentType.PRODUCT_PRICE: {
"cache_control": "no-cache, no-store, must-revalidate",
"cdn_cache": False,
"vary": [],
},
ContentType.INVENTORY: {
"cache_control": "no-cache, no-store, must-revalidate",
"cdn_cache": False,
"vary": [],
},
ContentType.CATEGORY_PAGE: {
"cache_control": "public, max-age=60, stale-while-revalidate=30",
"cdn_cache": True,
"vary": ["Accept-Encoding", "Accept-Language"],
},
ContentType.SEARCH_RESULTS: {
"cache_control": "public, max-age=30",
"cdn_cache": True,
"vary": ["Accept-Encoding"], # Also varies on query string
},
ContentType.PERSONALIZED: {
"cache_control": "private, no-cache",
"cdn_cache": False,
"vary": ["Authorization"],
},
ContentType.CART: {
"cache_control": "private, no-store",
"cdn_cache": False,
"vary": [],
},
ContentType.CHECKOUT: {
"cache_control": "no-store",
"cdn_cache": False,
"vary": [],
},
}
@classmethod
def get_headers(
cls,
content_type: ContentType,
etag: str = None
) -> dict:
"""Get cache headers for content type."""
policy = cls.CACHE_POLICIES[content_type]
headers = {
"Cache-Control": policy["cache_control"],
}
if policy["vary"]:
headers["Vary"] = ", ".join(policy["vary"])
if etag:
headers["ETag"] = f'"{etag}"'
# Add CDN hints
if not policy["cdn_cache"]:
headers["CDN-Cache-Control"] = "no-store"
return headers
# FastAPI Integration
# (Assumes product_service, price_service, feed_service, and
#  get_current_user are constructed elsewhere in the application.)
app = FastAPI()
def get_auth_status(request: Request) -> bool:
"""Check if request is authenticated."""
return "Authorization" in request.headers
@app.get("/api/products/{product_id}")
async def get_product(
product_id: str,
response: Response,
authenticated: bool = Depends(get_auth_status)
):
"""Get product with appropriate cache headers."""
# Fetch product (from cache or DB)
product = await product_service.get_product(product_id)
# Determine content type based on auth status
content_type = (
ContentType.PRODUCT_PAGE_AUTH if authenticated
else ContentType.PRODUCT_PAGE_ANON
)
# Set cache headers
headers = CacheHeaderBuilder.get_headers(
content_type,
etag=str(product['updated_at'])
)
for header, value in headers.items():
response.headers[header] = value
return product
@app.get("/api/products/{product_id}/price")
async def get_price(product_id: str, response: Response):
"""
Get product price.
Never cached at CDN - always fresh from Redis/DB.
"""
price = await price_service.get_price(product_id)
headers = CacheHeaderBuilder.get_headers(ContentType.PRODUCT_PRICE)
for header, value in headers.items():
response.headers[header] = value
return price
@app.get("/api/me/recently-viewed")
async def get_recently_viewed(
response: Response,
user = Depends(get_current_user)
):
"""Get user's recently viewed - personalized, private."""
items = await feed_service.get_recently_viewed(user.id)
headers = CacheHeaderBuilder.get_headers(ContentType.PERSONALIZED)
for header, value in headers.items():
response.headers[header] = value
return items
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale to 10x the current load? What breaks first?"
Scaling Strategy
You: "Let me analyze the bottlenecks at 10x scale..."
SCALING ANALYSIS: 10X LOAD

Current → 10x:
├── 6K req/sec → 60K req/sec average
├── 600K req/sec → 6M req/sec peak
└── 300 GB cache → 1+ TB cache

BOTTLENECK ANALYSIS:

1. CDN (lowest risk)
├── Scales horizontally by design
├── More edge locations as needed
└── Cost scales linearly

2. Redis cluster (medium risk)
├── Current: 10 nodes × 32 GB = 320 GB
├── 10x: 30+ nodes × 64 GB = ~2 TB
├── Challenge: cross-slot operations
└── Solution: shard by product ID consistently (see the hash-tag sketch below)

3. Database (highest risk)
├── Even at a 99% hit rate, the 1% miss at 6M req/sec = 60K DB queries/sec
├── PostgreSQL won't handle this
└── Solutions:
    ├── Read replicas (10+)
    ├── Connection pooling (PgBouncer)
    └── Consider DynamoDB for product reads

4. Network (medium risk)
├── Internal traffic between services grows 10x
└── Solution: co-locate services, use a service mesh
Edge Cases
Interviewer: "What happens if Redis goes down during a flash sale?"
You: "That's our worst-case scenario. Here's how we handle it..."
EDGE CASE: REDIS CLUSTER FAILURE
Scenario:
Redis cluster partially fails during flash sale
600K req/sec hitting the system
Impact WITHOUT mitigation:
All requests hit the database → the database fails → the site goes down
MITIGATION STRATEGY:
1. CIRCUIT BREAKER
- Detect Redis failure quickly (< 1 second)
- Open circuit, stop trying Redis
- Serve degraded experience
2. GRACEFUL DEGRADATION
- Serve stale data from local memory cache (Guava/Caffeine)
- Show "temporarily unavailable" for personalization
- Block flash sale purchases temporarily (prevent oversell)
3. FALLBACK DATA
- Pre-compute "default" product data
- Store in local process memory
- Serve default when cache unavailable
4. AUTOMATIC RECOVERY
- Circuit breaker half-open after 30 seconds
- Test with single request
- Gradually restore traffic if successful
# Graceful Degradation Implementation
class ResilientProductService:
"""
Product service with graceful degradation.
    Falls back to local cache → default data → error
if Redis is unavailable.
"""
def __init__(self, redis_cache, local_cache, db_client):
self.redis = redis_cache
self.local = local_cache # In-memory (Guava-style)
self.db = db_client
self.circuit_breaker = CircuitBreaker(
failure_threshold=5,
recovery_timeout=30
)
async def get_product(self, product_id: str) -> dict:
"""Get product with fallback chain."""
# Try Redis (primary cache)
if self.circuit_breaker.is_closed():
try:
product = await self.redis.get(f"product:{product_id}")
if product:
# Also store in local cache for fallback
self.local.put(product_id, product)
return product
except Exception as e:
self.circuit_breaker.record_failure()
logger.warning(f"Redis failure: {e}")
# Try local in-memory cache (fallback)
product = self.local.get(product_id)
if product:
logger.info(f"Serving from local cache: {product_id}")
return product
# Try database (last resort during outage)
if self.circuit_breaker.is_open():
# Don't hammer DB during Redis outage
# Return default/error
return self._get_default_product(product_id)
# Normal cache miss - fetch from DB
product = await self.db.fetch_product(product_id)
if product:
await self.redis.set(f"product:{product_id}", product)
self.local.put(product_id, product)
return product
def _get_default_product(self, product_id: str) -> dict:
"""Return minimal product data during outage."""
return {
"id": product_id,
"title": "Product Temporarily Unavailable",
"price": None,
"inventory": None,
"_degraded": True
}
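The service above assumes a CircuitBreaker exposing is_closed(), is_open(), and record_failure(). A minimal time-based sketch of that interface, matching the constructor arguments used above:

# Minimal Circuit Breaker (sketch matching the interface used above)
import time
from typing import Optional

class CircuitBreaker:
    """Trips open after N consecutive failures; half-opens after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()  # trip the breaker

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self.recovery_timeout:
            # Half-open: reset and let a trial request through
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def is_closed(self) -> bool:
        return not self.is_open()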
Failure Scenarios
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Redis node failure | Health checks | Partial cache miss | Auto-failover to replica |
| Redis cluster failure | Circuit breaker | High DB load | Local cache + degradation |
| CDN outage | Synthetic monitoring | Higher origin load | Bypass CDN, scale origin |
| Database slow | Query latency alerts | Slow responses | Read replicas, cache more |
| Kafka consumer lag | Lag monitoring | Stale prices | Alert, scale consumers |
Phase 6: Monitoring and Operations
Interviewer: "How would you monitor this caching system in production?"
Key Metrics
You: "I'd track metrics at multiple levels..."
Cache Metrics
CACHE HEALTH DASHBOARD
REDIS CLUSTER
├── Hit ratio: 98.5% (target: >99%)
├── Latency p99: 2.1ms (target: <5ms)
├── Memory usage: 78% (alert at >85%)
├── Connections: 4,521 (max: 10K)
└── Evictions/sec: 12 (alert at >100)

CDN
├── Hit ratio: 94.2% (target: >90%)
├── Origin requests/sec: 358
├── Bandwidth saved: 89%
└── Purges pending: 12

INVALIDATION
├── Events processed/sec: 245
├── Consumer lag: 34 (alert at >1,000)
└── Failed invalidations: 0 (alert at >0)

THUNDERING HERD PROTECTION
├── Coalesced requests: 12,456 (requests saved)
├── Stale served: 2,341 (graceful degradation)
└── Background refreshes: 892 (proactive updates)
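The Redis hit ratio shown here can be derived from the server's own keyspace counters. A small sketch using redis-py:

# Redis Hit Ratio from INFO stats (sketch)
import redis

def redis_hit_ratio(client: redis.Redis) -> float:
    """Derive hit ratio from the server-side keyspace counters."""
    stats = client.info("stats")
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 1.0

# r = redis.Redis(host="localhost", port=6379)
# print(f"hit ratio: {redis_hit_ratio(r):.2%}")  # alert if this drops below 99%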
Alerting Strategy
CRITICAL (PagerDuty - Wake up):
- Cache hit ratio < 95% for 5 minutes
- Redis latency p99 > 50ms for 5 minutes
- Redis cluster node down
- Invalidation consumer lag > 10,000
- Circuit breaker opened
WARNING (Slack - Business hours):
- Cache hit ratio < 98% for 15 minutes
- Redis memory > 85%
- CDN origin requests > 1000/sec
- Eviction rate > 100/sec
INFO (Dashboard only):
- Cache key expiration patterns
- Hot key detection
- Invalidation event volume
Runbook: Cache Hit Ratio Drop
RUNBOOK: Cache Hit Ratio Below 95%
SYMPTOMS:
- Alert: "Redis cache hit ratio dropped to X%"
- Increased database latency
- Increased API response times
DIAGNOSIS:
1. Check for recent deployments:
> kubectl get deployments -n production --sort-by='.metadata.creationTimestamp'
2. Check cache key version (did we bump it accidentally?):
> redis-cli GET cache_version
3. Check for mass invalidation events:
> kafka-consumer-groups --describe --group cache-invalidation
4. Check hot keys (thundering herd?):
> redis-cli --hotkeys
5. Check memory pressure (evictions?):
> redis-cli INFO stats | grep evicted
RESOLUTION:
1. If deployment issue:
- Rollback deployment
- Investigate cache key changes
2. If thundering herd:
- Enable emergency background refresh
- Increase stale TTL temporarily
3. If memory pressure:
- Scale Redis cluster (add nodes)
- Review TTLs (reduce if possible)
4. If invalidation storm:
- Pause non-critical invalidation
- Investigate source of invalidations
ESCALATION:
- If not resolved in 15 minutes: Page on-call SRE
- If database impacted: Page database team
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of caching patterns, clear trade-off decisions, and practical production experience. Any questions for me?"
You: "Thank you! I'd love to hear how your team currently handles cache invalidation for pricing updates. Do you use event-driven invalidation, and if so, what message broker do you use?"
Interviewer: "We actually use a combination ā Kafka for inventory and pricing events, and simple TTL for product metadata. We've had some challenges with flash sales similar to what you described. Your thundering herd protection approach is something we should consider."
You: "That's great to hear. I'm also curious about your CDN setup ā do you use a single provider or multi-CDN?"
Summary: Week 4 Concepts Applied
Week 4 Concepts (Caching: Beyond "Just Add Redis")
| Day | Concept | Application in This Design |
|---|---|---|
| Day 1: Caching Patterns | Cache-aside, write-through | Cache-aside for products, write-through for recently viewed |
| Day 2: Invalidation | Event-driven, TTL safety net, multi-strategy | Different strategies per data type (price=event, metadata=TTL) |
| Day 3: Thundering Herd | Request coalescing, stale-while-revalidate, background refresh | Flash sale protection, hot key handling |
| Day 4: Feed Caching | Activity-based tiering, hybrid push/pull | Recently viewed (write-through), recommendations (tiered) |
| Day 5: Multi-Tier | Browser, CDN, Gateway, App cache layers | Complete cache matrix with appropriate headers per tier |
Code Patterns Demonstrated
1. THUNDERING HERD PROTECTION
- Request coalescing with asyncio.Future
- Stale-while-revalidate pattern
- Background refresh for hot keys
- Flash sale cache warming
2. MULTI-STRATEGY INVALIDATION
- Data-type-aware invalidation
- Event-driven for real-time data
- TTL safety net for all data
- Cascade invalidation (price → product page)
3. ACTIVITY-BASED PERSONALIZATION
- User tier classification (active/recent/dormant)
- Different TTLs per tier
- Background pre-computation for active users
4. MULTI-TIER CACHE HEADERS
- Content-type-aware headers
- CDN-Cache-Control for edge control
- Vary headers for correct caching
- Private vs public distinction
5. GRACEFUL DEGRADATION
- Circuit breaker pattern
- Local cache fallback
- Default data for outages
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Design a multi-tier caching architecture (browser → CDN → gateway → app → DB)
- Choose appropriate caching patterns for different data types
- Implement thundering herd protection with request coalescing
- Design event-driven cache invalidation with TTL safety nets
- Apply different invalidation strategies based on data freshness requirements
- Implement activity-based caching for personalized data
- Set correct HTTP cache headers for anonymous vs authenticated content
- Handle cache failures gracefully with degradation strategies
- Design cache warming strategies for predictable traffic spikes
- Monitor cache health and troubleshoot performance issues
- Estimate cache sizes and hit ratio requirements for a given load
This capstone integrates all concepts from Week 4 of the System Design Mastery Series: Caching, Beyond "Just Add Redis". Use this as a template for approaching e-commerce and high-traffic system design interviews.