Week 4 Capstone: Design a Global E-Commerce Product Catalog System
A Real-World Problem Covering Everything You've Learned in Week 4
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions ā this is meant to be collaborative."
They write on the whiteboard:
Design a Global E-Commerce Product Catalog System

You're building the product catalog for a large e-commerce platform like Amazon or Shopify. The system serves product pages, search results, category listings, and personalized recommendations.

Key challenges:
- 50 million products across 10,000 categories
- 500 million daily page views globally
- Flash sales with 100x traffic spikes
- Price/inventory updates must reflect within seconds
- Personalized "Recently Viewed" and "Recommended For You"

Focus on the CACHING ARCHITECTURE
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes. I'm particularly interested in how you'd design the caching layers."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial: never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, what's the read-to-write ratio? How often do products change?"
Interviewer: "Reads vastly outnumber writes. Product metadata like descriptions change maybe once a week. But prices can change multiple times per day, and inventory changes constantly ā every purchase affects it."
You: "Interesting. So we have different freshness requirements for different data. For the flash sales ā what's the traffic pattern? How quickly does it spike?"
Interviewer: "Flash sales are announced in advance but start at a specific time. Traffic goes from normal to 100x in about 30 seconds. We had an incident last Black Friday where the site went down at midnight."
You: "That sounds like a thundering herd problem. For personalization ā the 'Recently Viewed' and recommendations ā are these per-user or can they be shared?"
Interviewer: "Per-user. Each user has their own browsing history and recommendations based on their behavior."
You: "Got it. How global is the user base? Do we need multi-region deployment?"
Interviewer: "Yes, we have users in North America, Europe, and Asia. Latency matters ā we want sub-200ms page loads globally."
You: "For the price changes, what's the acceptable staleness? Can a user see an old price for a few seconds?"
Interviewer: "For display purposes, a few seconds is acceptable. But at checkout, they MUST see the current price. We've had complaints about price mismatches."
You: "Last question ā are there any hot products that get disproportionate traffic? Like featured items or viral products?"
Interviewer: "Absolutely. Our 'Deal of the Day' gets about 10% of all traffic. And when influencers mention products, those can spike to millions of views in minutes."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. PRODUCT DISPLAY
- Product detail pages (title, description, images, specs)
- Product pricing (current price, discounts, sale prices)
- Product availability (inventory count, in-stock status)
- Product reviews and ratings (aggregated)
2. PRODUCT DISCOVERY
- Category browsing (hierarchical categories)
- Search results (filtered, sorted, paginated)
- Homepage featured products
3. PERSONALIZATION
- Recently viewed products (per user)
- Personalized recommendations (per user)
- "Customers also bought" (per product)
4. REAL-TIME UPDATES
- Flash sale price activation
- Inventory updates on purchase
- New product launches
Non-Functional Requirements
1. SCALE
- 50 million products
- 500 million page views/day (~6,000 req/sec average)
- 100x spike during flash sales (~600,000 req/sec peak)
- 100 million registered users
2. LATENCY
- Product page load: <200ms p99 (globally)
- Search results: <300ms p99
- Personalization: <100ms p99
3. FRESHNESS (Bounded Staleness)
- Product metadata: 1 hour acceptable
- Price: <5 seconds during normal, immediate during checkout
- Inventory: <30 seconds for display, real-time for purchase
4. AVAILABILITY
- 99.9% uptime
- Graceful degradation during failures
- No downtime during flash sales
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Traffic Estimation
PAGE VIEW TRAFFIC
Daily page views: 500 million
Seconds per day: 86,400
Average requests/sec: ~6,000 req/sec
Peak traffic (flash sale): 100x normal
Peak requests/sec: ~600,000 req/sec
Breakdown by page type:
├── Product detail pages: 60% → 3,600 req/sec (360K peak)
├── Category/search: 25% → 1,500 req/sec (150K peak)
├── Homepage: 10% → 600 req/sec (60K peak)
└── Personalization: 5% → 300 req/sec (30K peak)
Storage Estimation
PRODUCT DATA
Products: 50 million
Average product size:
├── Metadata (title, desc): 5 KB
├── Pricing data: 100 bytes
├── Inventory data: 100 bytes
├── Images (URLs only): 500 bytes
├── Reviews aggregate: 200 bytes
└── Total per product: ~6 KB
Total product data: 50M × 6 KB = 300 GB
PERSONALIZATION DATA
Users: 100 million
Recently viewed (20 items): 20 × 50 bytes = 1 KB per user
Recommendations (50 items): 50 × 50 bytes = 2.5 KB per user
Total per user: ~3.5 KB
Total personalization: 100M × 3.5 KB = 350 GB
Cache Sizing
CACHE REQUIREMENTS
Product cache (hot products):
├── Assume 20% of products are "hot" (viewed daily)
├── Hot products: 10 million
├── Size: 10M × 6 KB = 60 GB
└── Add overhead: ~80 GB Redis
Personalization cache:
├── Active users (daily): ~50 million
├── Size: 50M × 3.5 KB = 175 GB
└── Add overhead: ~200 GB Redis
Category/search results cache:
├── 10,000 categories × 50 variations = 500K entries
├── Size: 500K × 10 KB = 5 GB
└── Add overhead: ~10 GB Redis
TOTAL REDIS CLUSTER: ~300 GB
(Distributed across multiple nodes)
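These figures are quick to sanity-check. A few lines of arithmetic, with all constants taken from the estimates above:

# Back-of-the-Envelope Sanity Check
DAILY_PAGE_VIEWS = 500_000_000
SECONDS_PER_DAY = 86_400

avg_rps = DAILY_PAGE_VIEWS / SECONDS_PER_DAY        # ~5,787 -> round to 6,000
peak_rps = 100 * 6_000                              # 100x spike = 600,000 req/sec

product_cache_gb = 10_000_000 * 6 / 1_000_000       # 10M hot products x 6 KB = 60 GB
personalization_gb = 50_000_000 * 3.5 / 1_000_000   # 50M active users x 3.5 KB = 175 GB

# The critical insight: even a 1% miss rate at peak floods the database
db_qps_at_peak = peak_rps * 0.01                    # 6,000 queries/sec

print(f"avg={avg_rps:,.0f} rps, peak={peak_rps:,} rps, "
      f"miss load={db_qps_at_peak:,.0f} qps")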
Key Metrics Summary
ESTIMATION SUMMARY

TRAFFIC
├── Average: 6,000 req/sec
├── Peak (flash sale): 600,000 req/sec
└── Target cache hit ratio: >99% (to survive peak)

STORAGE
├── Product data: 300 GB (database)
├── Personalization: 350 GB (database)
└── Total cache: ~300 GB (Redis cluster)

INFRASTRUCTURE
├── Redis nodes: 10 nodes × 32 GB
├── CDN edge locations: 50+ global PoPs
└── API servers: 100 instances (auto-scaling)

CRITICAL INSIGHT:
At 600K req/sec peak, even a 1% cache miss rate means 6,000 DB queries/sec.
The database cannot handle that, so the cache hit rate must stay above 99%.
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture, focusing on the caching layers."
System Architecture
MULTI-TIER CACHING ARCHITECTURE

CLIENTS
Browser / Mobile App / Third-party
[Browser Cache Layer]
          │
          ▼
CDN LAYER
CloudFront / Fastly (50+ global edge locations)
• Static assets (images, JS, CSS): 1-year TTL
• Product pages (anonymous): 60s TTL
• Category pages: 60s TTL
• NOT cached: personalized content, prices
          │
          ▼
API GATEWAY
(Rate Limiting, Auth)
          │
          ▼
Product Service | Search Service | Personalization Service
          │
          ▼
REDIS CLUSTER
(Application Cache: ~300 GB across 10 nodes)
Product Cache:    product:{id} → full product data
Price Cache:      price:{id} → current price
Inventory:        inventory:{id} → stock count
Category Cache:   category:{id}:page:{n} → product list
User Feed:        user:{id}:recent → product IDs
Recommendations:  user:{id}:recs → product IDs
          │
          ▼
DATABASES
PostgreSQL (Products) | Elasticsearch (Search) | DynamoDB (User Data)

EVENT STREAM (Kafka)
price.updated | inventory.changed | product.modified
→ Triggers cache invalidation across all tiers
Component Breakdown
You: "Let me walk through each component and its caching role..."
1. Browser Cache Layer
Purpose: Cache static assets and reduce redundant requests
Strategy:
- Static assets: Cache-Control: public, max-age=31536000, immutable
- Product pages: Cache-Control: private, max-age=60, stale-while-revalidate=30
- Personalized: Cache-Control: private, no-cache, with an ETag for revalidation
2. CDN Layer
Purpose: Serve content from edge locations globally for low latency
Strategy:
- Product images: Long TTL (7 days), purge on update
- Anonymous product pages: Short TTL (60s)
- Search results: Short TTL (30s) with Vary on query params
- Personalized: Bypass CDN entirely
3. Application Cache (Redis)
Purpose: Fast access to frequently requested data
Strategy:
- Cache-aside pattern for most data
- Write-through for critical data (prices)
- Different TTLs based on data type
4. Event-Driven Invalidation
Purpose: Keep caches fresh when data changes
Strategy:
- Kafka events trigger invalidation
- Invalidate app cache → gateway → CDN (in that order; see the consumer sketch below)
- Different strategies for different data types
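The event-driven piece deserves a concrete sketch: a consumer loop that maps Kafka topics to the data types used by the invalidation service detailed in Deep Dive 2. The aiokafka client and broker address are assumptions for illustration; the topic names and consumer group match the diagram and runbook in this design.

# Kafka Consumer Wiring (sketch; assumes aiokafka)
import asyncio
import json
from aiokafka import AIOKafkaConsumer

TOPIC_TO_DATA_TYPE = {
    "price.updated": "product_price",
    "inventory.changed": "product_inventory",
    "product.modified": "product_metadata",
}

async def run_invalidation_consumer(invalidation_service):
    """Consume change events and trigger tier-ordered invalidation."""
    consumer = AIOKafkaConsumer(
        *TOPIC_TO_DATA_TYPE,
        bootstrap_servers="kafka:9092",  # placeholder address
        group_id="cache-invalidation",
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for msg in consumer:
            await invalidation_service.invalidate(
                data_type=TOPIC_TO_DATA_TYPE[msg.topic],
                entity_id=msg.value["product_id"],
            )
    finally:
        await consumer.stop()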
Data Flow: Product Page Request
You: "Let me trace through a typical product page request..."
PRODUCT PAGE REQUEST FLOW
User requests: /products/ABC123
1. BROWSER CHECK
├── If cached and fresh → return immediately
└── If stale → send a conditional request (If-None-Match)
2. CDN CHECK
├── If HIT → return from the edge (~20ms latency)
└── If MISS → forward to origin
3. API GATEWAY
├── Auth check (cached token validation)
├── Rate limit check
└── Route to Product Service
4. PRODUCT SERVICE
├── Check Redis: product:ABC123
├── If HIT → return cached product (~2ms)
└── If MISS → query PostgreSQL, cache the result
5. PRICE SERVICE (separate call)
├── Check Redis: price:ABC123
├── Short TTL (30s) + event-driven invalidation
└── Always fresh at checkout
6. ASSEMBLE RESPONSE
├── Combine product + price + inventory
├── Set appropriate cache headers
└── Return to user
Total latency (cache hit): <50ms
Total latency (cache miss): <200ms
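Step 6 above can be a simple parallel fan-out. A minimal sketch of the assembly; the three service objects are illustrative stand-ins for the services in the architecture diagram:

# Product Page Assembly (sketch; service objects are illustrative)
import asyncio

async def assemble_product_page(product_id: str) -> dict:
    """Fan out to the product, price, and inventory caches in parallel."""
    product, price, inventory = await asyncio.gather(
        product_service.get_product(product_id),      # Redis: product:{id}
        price_service.get_price(product_id),          # Redis: price:{id}, 30s TTL
        inventory_service.get_inventory(product_id),  # Redis: inventory:{id}
    )
    return {**product, "price": price, "inventory": inventory}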
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into some specific challenges. How would you handle the flash sale traffic spike?"
Deep Dive 1: Thundering Herd Protection (Week 4, Day 3)
You: "Flash sales are a classic thundering herd scenario. At midnight, thousands of users refresh simultaneously. If the cache expires or is empty, all those requests hit the database at once."
The Problem
FLASH SALE THUNDERING HERD

11:59:59 PM - Normal traffic
├── 6,000 req/sec
├── 99% cache hit
└── 60 DB queries/sec

12:00:00 AM - Flash sale starts
├── Cache expires on the "Deal of the Day" product
├── 100,000 users refresh simultaneously
├── ALL requests miss the cache
├── 100,000 DB queries in one second!
├── Database overwhelmed
└── Site goes down
This is what happened last Black Friday.
The Solution
You: "I'd implement multiple layers of thundering herd protection..."
# Thundering Herd Protection Implementation
# Applies: Week 4, Day 3
import asyncio
import json
import logging
import random
from dataclasses import dataclass
from typing import Optional, Dict, Any, Callable
from datetime import datetime, timedelta
logger = logging.getLogger(__name__)
@dataclass
class CacheEntry:
"""Cache entry with fresh and stale TTLs."""
value: Any
fresh_until: datetime
stale_until: datetime
version: int = 0
class ThunderingHerdProtectedCache:
"""
Cache with multiple thundering herd protections.
Protections applied:
1. Request coalescing - Duplicate requests share one fetch
2. Stale-while-revalidate - Serve stale, refresh async
3. Probabilistic early expiration - Stagger expiration times
4. Background refresh - Pre-refresh hot keys before expiry
"""
def __init__(self, redis_client, config: dict = None):
self.redis = redis_client
self.config = config or {}
# In-flight requests for coalescing
self._in_flight: Dict[str, asyncio.Future] = {}
# Hot keys for background refresh
self._hot_keys: set = set()
# Metrics
self._coalesced_count = 0
self._stale_served_count = 0
async def get(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int = 60,
stale_ttl: int = 300
) -> Any:
"""
Get value with thundering herd protection.
1. Check cache - if fresh, return immediately
2. If stale but valid, return stale and refresh async
3. If expired, coalesce requests and fetch once
"""
# Try to get from cache
entry = await self._get_entry(key)
now = datetime.utcnow()
# Case 1: Fresh cache hit
if entry and entry.fresh_until > now:
return entry.value
# Case 2: Stale but within stale window - serve stale, refresh async
if entry and entry.stale_until > now:
self._stale_served_count += 1
# Trigger async refresh (don't wait)
asyncio.create_task(
self._refresh_async(key, fetch_func, fresh_ttl, stale_ttl)
)
return entry.value
# Case 3: Expired or missing - fetch with coalescing
return await self._fetch_with_coalescing(
key, fetch_func, fresh_ttl, stale_ttl
)
async def _fetch_with_coalescing(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int,
stale_ttl: int
) -> Any:
"""
Fetch with request coalescing.
If multiple requests come in for the same key,
only one actually fetches - others wait for it.
"""
# Check if there's already a fetch in flight
if key in self._in_flight:
self._coalesced_count += 1
logger.debug(f"Coalescing request for {key}")
return await self._in_flight[key]
# Create future for this fetch
future = asyncio.get_event_loop().create_future()
self._in_flight[key] = future
try:
# Actually fetch the data
value = await fetch_func()
# Apply probabilistic early expiration
# Add jitter to prevent synchronized expiration
jitter = random.uniform(0.8, 1.0)
actual_fresh_ttl = int(fresh_ttl * jitter)
# Cache it
await self._set_entry(key, value, actual_fresh_ttl, stale_ttl)
# Complete the future
future.set_result(value)
return value
except Exception as e:
future.set_exception(e)
raise
finally:
del self._in_flight[key]
async def _refresh_async(
self,
key: str,
fetch_func: Callable,
fresh_ttl: int,
stale_ttl: int
):
"""Background refresh without blocking."""
try:
# Don't refresh if already in flight
if key in self._in_flight:
return
value = await fetch_func()
jitter = random.uniform(0.8, 1.0)
actual_fresh_ttl = int(fresh_ttl * jitter)
await self._set_entry(key, value, actual_fresh_ttl, stale_ttl)
logger.debug(f"Background refresh completed for {key}")
except Exception as e:
logger.warning(f"Background refresh failed for {key}: {e}")
async def _get_entry(self, key: str) -> Optional[CacheEntry]:
"""Get cache entry with metadata."""
data = await self.redis.hgetall(f"cache:{key}")
if not data:
return None
return CacheEntry(
value=json.loads(data[b'value']),
fresh_until=datetime.fromisoformat(data[b'fresh_until'].decode()),
stale_until=datetime.fromisoformat(data[b'stale_until'].decode()),
version=int(data.get(b'version', 0))
)
async def _set_entry(
self,
key: str,
value: Any,
fresh_ttl: int,
stale_ttl: int
):
"""Set cache entry with metadata."""
now = datetime.utcnow()
entry_data = {
'value': json.dumps(value, default=str),
'fresh_until': (now + timedelta(seconds=fresh_ttl)).isoformat(),
'stale_until': (now + timedelta(seconds=fresh_ttl + stale_ttl)).isoformat(),
'version': str(int(now.timestamp()))
}
pipe = self.redis.pipeline()
pipe.hset(f"cache:{key}", mapping=entry_data)
pipe.expire(f"cache:{key}", fresh_ttl + stale_ttl)
await pipe.execute()
# =========================================================================
# Background Refresh for Hot Keys (Flash Sale Products)
# =========================================================================
def mark_hot(self, key: str):
"""Mark a key as hot for background refresh."""
self._hot_keys.add(key)
def unmark_hot(self, key: str):
"""Remove key from hot list."""
self._hot_keys.discard(key)
async def run_background_refresh(self, fetch_funcs: Dict[str, Callable]):
"""
Background job to keep hot keys fresh.
Run this continuously to ensure flash sale products
are ALWAYS in cache before they're requested.
"""
while True:
for key in list(self._hot_keys):
try:
entry = await self._get_entry(key)
# Refresh if within 20% of fresh TTL expiring
if entry:
now = datetime.utcnow()
time_to_stale = (entry.fresh_until - now).total_seconds()
if time_to_stale < 12: # Less than 12 seconds to stale
if key in fetch_funcs:
await self._refresh_async(
key, fetch_funcs[key], 60, 300
)
except Exception as e:
logger.error(f"Background refresh error for {key}: {e}")
await asyncio.sleep(5) # Check every 5 seconds
# =============================================================================
# Flash Sale Cache Warming
# =============================================================================
class FlashSaleCacheWarmer:
"""
Pre-warm cache before flash sales start.
Flash sales are scheduled in advance, so we know
which products will be hit. Warm them before midnight!
"""
def __init__(self, cache: ThunderingHerdProtectedCache, db_client):
self.cache = cache
self.db = db_client
async def warm_flash_sale(self, sale_id: str, start_time: datetime):
"""
Warm cache for upcoming flash sale.
Call this 5 minutes before sale starts.
"""
# Get flash sale products
products = await self.db.fetch(
"""
SELECT p.* FROM products p
JOIN flash_sale_items fsi ON p.id = fsi.product_id
WHERE fsi.sale_id = $1
""",
sale_id
)
logger.info(f"Warming cache for {len(products)} flash sale products")
for product in products:
product_id = product['id']
# Cache the product
await self.cache._set_entry(
f"product:{product_id}",
dict(product),
fresh_ttl=60,
stale_ttl=300
)
# Mark as hot for background refresh
self.cache.mark_hot(f"product:{product_id}")
logger.info(f"Flash sale cache warming complete")
async def cool_down_flash_sale(self, sale_id: str):
"""Remove products from hot list after sale ends."""
products = await self.db.fetch(
"SELECT product_id FROM flash_sale_items WHERE sale_id = $1",
sale_id
)
for product in products:
self.cache.unmark_hot(f"product:{product['product_id']}")
Edge Cases
Interviewer: "What if the background refresh fails?"
You: "If background refresh fails, the stale value is still served. Users see slightly old data (within stale window), but the system doesn't collapse. We'd alert on refresh failures and extend the stale TTL as a fallback."
Deep Dive 2: Multi-Strategy Cache Invalidation (Week 4, Day 2)
Interviewer: "You mentioned price changes need to reflect within seconds. How do you handle that across all cache layers?"
You: "This is a cache invalidation problem. Different data needs different strategies. Let me show how I'd handle it..."
The Problem
INVALIDATION CHALLENGE
A product has multiple cached data types:
├── Description: changes weekly; 1-hour staleness OK
├── Price: changes daily; <5-second staleness required
├── Inventory: changes constantly; <30-second staleness
└── Images: change rarely; 1-week staleness OK

One invalidation strategy doesn't fit all!

Additionally, data is cached at multiple tiers:
├── Browser cache
├── CDN (50+ edge locations)
├── API Gateway
└── Redis (application cache)

Tiers must be invalidated in the correct order!
The Solution
# Multi-Strategy Cache Invalidation
# Applies: Week 4, Day 2 (Invalidation) + Day 5 (Multi-Tier)
import asyncio
import logging
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum

logger = logging.getLogger(__name__)
class DataFreshness(Enum):
"""Data freshness requirements."""
REAL_TIME = "real_time" # <5 seconds, event-driven
NEAR_TIME = "near_time" # <30 seconds, event + short TTL
EVENTUAL = "eventual" # <1 hour, TTL only
STATIC = "static" # Days/weeks, TTL + purge on change
@dataclass
class InvalidationStrategy:
"""Strategy for a data type."""
freshness: DataFreshness
ttl: int
event_driven: bool
cdn_cache: bool
cdn_ttl: int = 0
# Define strategies for different data types
INVALIDATION_STRATEGIES = {
"product_metadata": InvalidationStrategy(
freshness=DataFreshness.EVENTUAL,
ttl=3600, # 1 hour Redis TTL
event_driven=False, # Just TTL-based
cdn_cache=True,
cdn_ttl=60 # 1 minute CDN
),
"product_price": InvalidationStrategy(
freshness=DataFreshness.REAL_TIME,
ttl=30, # 30 second Redis TTL (safety net)
event_driven=True, # Event-driven invalidation
cdn_cache=False # Don't cache prices at CDN
),
"product_inventory": InvalidationStrategy(
freshness=DataFreshness.NEAR_TIME,
ttl=30, # 30 second Redis TTL
event_driven=True, # Event-driven invalidation
cdn_cache=False # Don't cache inventory at CDN
),
"product_images": InvalidationStrategy(
freshness=DataFreshness.STATIC,
ttl=86400, # 24 hour Redis TTL
event_driven=True, # Purge on image change
cdn_cache=True,
cdn_ttl=604800 # 7 day CDN
),
"category_listing": InvalidationStrategy(
freshness=DataFreshness.EVENTUAL,
ttl=300, # 5 minute Redis TTL
event_driven=False, # Just TTL-based
cdn_cache=True,
cdn_ttl=60 # 1 minute CDN
),
}
class MultiStrategyInvalidationService:
"""
Invalidation service with data-type-aware strategies.
Different data types have different freshness requirements.
This service applies the right strategy for each type.
"""
def __init__(
self,
redis_client,
cdn_client,
event_consumer
):
self.redis = redis_client
self.cdn = cdn_client
self.event_consumer = event_consumer
self.strategies = INVALIDATION_STRATEGIES
async def invalidate(
self,
data_type: str,
entity_id: str,
cascade: bool = True
):
"""
Invalidate cache for an entity.
Args:
data_type: Type of data (determines strategy)
entity_id: ID of the entity
cascade: Whether to invalidate related caches
"""
strategy = self.strategies.get(data_type)
if not strategy:
logger.warning(f"Unknown data type: {data_type}")
return
# Only process if event-driven
if not strategy.event_driven:
logger.debug(f"Skipping invalidation for TTL-only type: {data_type}")
return
        # Invalidation order: app cache → CDN
        # (closest to the database first)
# 1. Application cache (Redis)
await self._invalidate_redis(data_type, entity_id)
# 2. CDN (if cached there)
if strategy.cdn_cache:
await self._invalidate_cdn(data_type, entity_id)
# 3. Cascade to related caches if needed
if cascade:
await self._cascade_invalidation(data_type, entity_id)
    async def _invalidate_redis(self, data_type: str, entity_id: str):
        """Invalidate the application cache."""
        keys = self._get_cache_keys(data_type, entity_id)
        for key in keys:
            if "*" in key:
                # DELETE takes exact keys; expand wildcard patterns via SCAN
                async for match in self.redis.scan_iter(match=key):
                    await self.redis.delete(match)
            else:
                await self.redis.delete(key)
        if keys:
            logger.info(f"Invalidated Redis keys: {keys}")
async def _invalidate_cdn(self, data_type: str, entity_id: str):
"""Invalidate CDN cache."""
urls = self._get_cdn_urls(data_type, entity_id)
for url in urls:
await self.cdn.purge_url(url)
logger.info(f"Purged CDN URLs: {urls}")
async def _cascade_invalidation(self, data_type: str, entity_id: str):
"""Invalidate related caches."""
        # Price change → invalidate the product page cache
if data_type == "product_price":
await self._invalidate_redis("product_page", entity_id)
        # Inventory change → invalidate category listings
if data_type == "product_inventory":
categories = await self._get_product_categories(entity_id)
for cat_id in categories:
await self._invalidate_redis("category_listing", cat_id)
def _get_cache_keys(self, data_type: str, entity_id: str) -> List[str]:
"""Get Redis keys for a data type and entity."""
key_patterns = {
"product_metadata": [f"product:{entity_id}:metadata"],
"product_price": [f"product:{entity_id}:price", f"price:{entity_id}"],
"product_inventory": [f"product:{entity_id}:inventory", f"inventory:{entity_id}"],
"product_images": [f"product:{entity_id}:images"],
"product_page": [f"product:{entity_id}", f"product_page:{entity_id}"],
"category_listing": [f"category:{entity_id}:*"],
}
return key_patterns.get(data_type, [])
def _get_cdn_urls(self, data_type: str, entity_id: str) -> List[str]:
"""Get CDN URLs to purge."""
url_patterns = {
"product_metadata": [f"/api/products/{entity_id}"],
"product_images": [f"/images/products/{entity_id}/*"],
"category_listing": [f"/api/categories/{entity_id}/*"],
}
return url_patterns.get(data_type, [])
async def _get_product_categories(self, product_id: str) -> List[str]:
"""Get categories a product belongs to."""
# Would query database
return []
# =============================================================================
# Event Handler for Real-Time Invalidation
# =============================================================================
class PriceChangeEventHandler:
"""
Handle price change events for real-time cache invalidation.
When a price changes, we must invalidate immediately.
"""
def __init__(self, invalidation_service: MultiStrategyInvalidationService):
self.invalidation = invalidation_service
async def handle(self, event: dict):
"""Handle price change event from Kafka."""
product_id = event['product_id']
old_price = event['old_price']
new_price = event['new_price']
logger.info(
f"Price change detected: product={product_id}, "
f"{old_price} ā {new_price}"
)
# Invalidate price cache across all tiers
await self.invalidation.invalidate(
data_type="product_price",
entity_id=product_id,
cascade=True
)
# If significant price drop (flash sale), warm the cache
if new_price < old_price * 0.5: # 50%+ discount
logger.info(f"Flash sale price detected for {product_id}")
# Mark as hot for background refresh
class InventoryChangeEventHandler:
"""Handle inventory changes."""
def __init__(self, invalidation_service: MultiStrategyInvalidationService):
self.invalidation = invalidation_service
async def handle(self, event: dict):
"""Handle inventory change event."""
product_id = event['product_id']
new_quantity = event['new_quantity']
# Only invalidate if stock status changed
        # (in stock → out of stock, or vice versa)
if event.get('stock_status_changed', False):
await self.invalidation.invalidate(
data_type="product_inventory",
entity_id=product_id,
cascade=True
)
The Safety Net Pattern
You: "I always combine event-driven invalidation with TTL as a safety net. Events can be lost or delayed. TTL ensures eventual consistency even if events fail."
# Safety Net Pattern: Event-Driven + TTL
async def cache_product_price(product_id: str, price: dict):
"""
Cache price with safety net TTL.
- Primary invalidation: Event-driven (real-time)
- Safety net: 30-second TTL (eventual consistency)
Even if the invalidation event is lost, price will
be refreshed within 30 seconds.
"""
await redis.setex(
f"price:{product_id}",
30, # Safety net TTL
json.dumps(price)
)
Deep Dive 3: Personalized Feed Caching (Week 4, Day 4)
Interviewer: "What about the personalized features? Recently Viewed and Recommendations are per-user. How do you cache those?"
You: "This is similar to social media feed caching. The challenge is that we can't pre-compute feeds for 100 million users. We need a hybrid approach."
The Problem
PERSONALIZATION CACHE CHALLENGE

100 million users, each with:
├── Recently Viewed: 20 products
└── Recommendations: 50 products

If we cache feeds for all users:
├── 100M users × 3.5 KB = 350 GB
└── Most users are inactive (wasted storage)

If we compute on demand:
├── Each view: query user history + run the ML model
├── 100ms+ latency
└── Won't scale at peak traffic
The Solution
You: "I'd use a hybrid caching strategy with activity-based tiering..."
# Personalized Feed Caching
# Applies: Week 4, Day 4
import json
import logging
from dataclasses import dataclass
from typing import List, Optional, Dict
from datetime import datetime, timedelta
from enum import Enum

logger = logging.getLogger(__name__)
class UserActivityTier(Enum):
"""User activity tiers for caching strategy."""
ACTIVE = "active" # Logged in today - full cache
RECENT = "recent" # Logged in this week - partial cache
DORMANT = "dormant" # Not logged in for 7+ days - compute on demand
@dataclass
class UserFeedConfig:
"""Configuration for user feed caching."""
recently_viewed_limit: int = 20
recommendations_limit: int = 50
active_user_ttl: int = 3600 # 1 hour for active users
recent_user_ttl: int = 86400 # 24 hours for recent users
dormant_threshold_days: int = 7
class PersonalizedFeedService:
"""
Personalized feed service with activity-based caching.
Strategy:
- Active users (today): Full cache, pre-computed
- Recent users (this week): Partial cache, refresh on access
- Dormant users (7+ days): Compute on demand, cache briefly
"""
def __init__(
self,
redis_client,
db_client,
recommendation_service,
config: UserFeedConfig = None
):
self.redis = redis_client
self.db = db_client
self.recs = recommendation_service
self.config = config or UserFeedConfig()
async def get_recently_viewed(
self,
user_id: str,
limit: int = 20
) -> List[dict]:
"""
Get user's recently viewed products.
Recently viewed is user-specific but simple:
- Just a list of product IDs with timestamps
- Easy to maintain incrementally
"""
cache_key = f"user:{user_id}:recently_viewed"
# Get from sorted set (most recent first)
product_ids = await self.redis.zrevrange(
cache_key, 0, limit - 1
)
if product_ids:
return await self._get_products_by_ids(product_ids)
# Cache miss - compute from database
return await self._compute_recently_viewed(user_id, limit)
async def record_view(self, user_id: str, product_id: str):
"""
Record a product view.
This is write-through: Write to cache AND database.
Cache is always up-to-date.
"""
now = datetime.utcnow().timestamp()
cache_key = f"user:{user_id}:recently_viewed"
pipe = self.redis.pipeline()
# Add to sorted set (score = timestamp)
pipe.zadd(cache_key, {product_id: now})
# Trim to limit (keep most recent 20)
pipe.zremrangebyrank(cache_key, 0, -self.config.recently_viewed_limit - 1)
# Set TTL based on user activity
tier = await self._get_user_tier(user_id)
ttl = self._get_ttl_for_tier(tier)
pipe.expire(cache_key, ttl)
await pipe.execute()
# Also persist to database (async)
await self._persist_view(user_id, product_id)
async def get_recommendations(
self,
user_id: str,
limit: int = 20
) -> List[dict]:
"""
Get personalized recommendations.
Recommendations are expensive to compute (ML model).
Strategy varies by user activity tier.
"""
tier = await self._get_user_tier(user_id)
if tier == UserActivityTier.ACTIVE:
# Active users: Check cache, compute if missing
return await self._get_active_user_recs(user_id, limit)
elif tier == UserActivityTier.RECENT:
# Recent users: Check cache, compute and cache if missing
return await self._get_recent_user_recs(user_id, limit)
else:
# Dormant users: Compute on demand, cache briefly
return await self._get_dormant_user_recs(user_id, limit)
async def _get_active_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for active users (cached)."""
cache_key = f"user:{user_id}:recommendations"
# Check cache
cached = await self.redis.get(cache_key)
if cached:
product_ids = json.loads(cached)[:limit]
return await self._get_products_by_ids(product_ids)
# Cache miss - should be rare for active users
# (Background job should have pre-computed)
return await self._compute_and_cache_recs(
user_id, limit, self.config.active_user_ttl
)
async def _get_recent_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for recent users."""
cache_key = f"user:{user_id}:recommendations"
cached = await self.redis.get(cache_key)
if cached:
product_ids = json.loads(cached)[:limit]
return await self._get_products_by_ids(product_ids)
# Cache miss - compute and cache
return await self._compute_and_cache_recs(
user_id, limit, self.config.recent_user_ttl
)
async def _get_dormant_user_recs(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Get recommendations for dormant users."""
# Don't check cache for dormant users
# Their data would be stale anyway
# Compute fresh recommendations
recs = await self.recs.compute_recommendations(user_id, limit)
# Cache briefly (5 minutes) in case they browse around
cache_key = f"user:{user_id}:recommendations"
await self.redis.setex(
cache_key,
300,
json.dumps([r['product_id'] for r in recs])
)
return recs
async def _compute_and_cache_recs(
self,
user_id: str,
limit: int,
ttl: int
) -> List[dict]:
"""Compute recommendations and cache."""
recs = await self.recs.compute_recommendations(user_id, limit)
cache_key = f"user:{user_id}:recommendations"
await self.redis.setex(
cache_key,
ttl,
json.dumps([r['product_id'] for r in recs])
)
return recs
async def _get_user_tier(self, user_id: str) -> UserActivityTier:
"""Determine user's activity tier."""
cache_key = f"user:{user_id}:last_active"
last_active = await self.redis.get(cache_key)
if not last_active:
return UserActivityTier.DORMANT
last_active_dt = datetime.fromisoformat(last_active.decode())
days_inactive = (datetime.utcnow() - last_active_dt).days
if days_inactive == 0:
return UserActivityTier.ACTIVE
elif days_inactive < self.config.dormant_threshold_days:
return UserActivityTier.RECENT
else:
return UserActivityTier.DORMANT
def _get_ttl_for_tier(self, tier: UserActivityTier) -> int:
"""Get cache TTL based on user tier."""
ttls = {
UserActivityTier.ACTIVE: self.config.active_user_ttl,
UserActivityTier.RECENT: self.config.recent_user_ttl,
UserActivityTier.DORMANT: 300 # 5 minutes
}
return ttls.get(tier, 300)
async def _get_products_by_ids(self, product_ids: List[str]) -> List[dict]:
"""Fetch products by IDs with caching."""
# Would use the product cache from Deep Dive 1
pass
async def _compute_recently_viewed(
self,
user_id: str,
limit: int
) -> List[dict]:
"""Compute recently viewed from database."""
pass
async def _persist_view(self, user_id: str, product_id: str):
"""Persist view to database (async)."""
pass
# =============================================================================
# Background Job: Pre-Compute Active User Recommendations
# =============================================================================
class RecommendationPreComputer:
"""
Background job to pre-compute recommendations for active users.
Run periodically (e.g., every hour) to ensure active users
always have cached recommendations.
"""
def __init__(
self,
feed_service: PersonalizedFeedService,
db_client
):
self.feed = feed_service
self.db = db_client
async def run(self):
"""Pre-compute recommendations for active users."""
# Get users active in last 24 hours
active_users = await self.db.fetch(
"""
SELECT user_id FROM user_sessions
WHERE last_active > NOW() - INTERVAL '24 hours'
"""
)
logger.info(f"Pre-computing recs for {len(active_users)} active users")
for user in active_users:
try:
await self.feed._compute_and_cache_recs(
user['user_id'],
limit=50,
ttl=3600
)
except Exception as e:
logger.warning(
f"Failed to pre-compute recs for {user['user_id']}: {e}"
)
logger.info("Recommendation pre-computation complete")
Deep Dive 4: Multi-Tier Cache Architecture (Week 4, Day 5)
Interviewer: "You mentioned different cache headers for different content types. Walk me through exactly what gets cached where."
You: "Let me detail the complete multi-tier strategy..."
The Cache Matrix
MULTI-TIER CACHE MATRIX
| Content Type | Browser | CDN | Gateway | Redis |
|---|---|---|---|---|
| Static assets (JS, CSS) | 1 year, immutable | 1 year, immutable | N/A | N/A |
| Product images | 1 week | 1 week, purge on change | N/A | N/A |
| Product page (anonymous) | 60s, swr=30s | 60s, purge on change | 30s | 5 min, event inv. |
| Product page (authenticated) | 0, private | NO (private) | 30s | 5 min, event inv. |
| Product price | 0, no-cache | NO (dynamic) | NO | 30s, event inv. |
| Inventory | 0, no-cache | NO (dynamic) | NO | 30s, event inv. |
| Category page (anonymous) | 60s, swr=30s | 60s | 30s | 5 min |
| Search results | 30s, Vary: q | 30s, Vary: q | NO | 1 min |
| Recently viewed (user-specific) | 0, private | NO | NO | 1 hour, write-thru |
| Recommendations (user-specific) | 0, private | NO | NO | varies by tier |
| Cart | 0, no-store | NO | NO | 30 min |
| Checkout | 0, no-store | NO | NO | NO (real-time) |
Key:
- swr = stale-while-revalidate
- event inv. = event-driven invalidation
- write-thru = write-through caching
Implementation
# Multi-Tier Cache Headers
# Applies: Week 4, Day 5
from fastapi import FastAPI, Response, Request, Depends
from enum import Enum
class ContentType(Enum):
STATIC_ASSET = "static_asset"
PRODUCT_IMAGE = "product_image"
PRODUCT_PAGE_ANON = "product_page_anon"
PRODUCT_PAGE_AUTH = "product_page_auth"
PRODUCT_PRICE = "product_price"
INVENTORY = "inventory"
CATEGORY_PAGE = "category_page"
SEARCH_RESULTS = "search_results"
PERSONALIZED = "personalized"
CART = "cart"
CHECKOUT = "checkout"
class CacheHeaderBuilder:
"""
Build appropriate cache headers for each content type.
Different content requires different caching strategies
across browser, CDN, and gateway layers.
"""
CACHE_POLICIES = {
ContentType.STATIC_ASSET: {
"cache_control": "public, max-age=31536000, immutable",
"cdn_cache": True,
"vary": ["Accept-Encoding"],
},
ContentType.PRODUCT_IMAGE: {
"cache_control": "public, max-age=604800", # 1 week
"cdn_cache": True,
"vary": ["Accept-Encoding"],
},
ContentType.PRODUCT_PAGE_ANON: {
"cache_control": "public, max-age=60, stale-while-revalidate=30",
"cdn_cache": True,
"vary": ["Accept-Encoding", "Accept-Language"],
},
ContentType.PRODUCT_PAGE_AUTH: {
"cache_control": "private, max-age=0, must-revalidate",
"cdn_cache": False,
"vary": ["Authorization", "Accept-Encoding"],
},
ContentType.PRODUCT_PRICE: {
"cache_control": "no-cache, no-store, must-revalidate",
"cdn_cache": False,
"vary": [],
},
ContentType.INVENTORY: {
"cache_control": "no-cache, no-store, must-revalidate",
"cdn_cache": False,
"vary": [],
},
ContentType.CATEGORY_PAGE: {
"cache_control": "public, max-age=60, stale-while-revalidate=30",
"cdn_cache": True,
"vary": ["Accept-Encoding", "Accept-Language"],
},
ContentType.SEARCH_RESULTS: {
"cache_control": "public, max-age=30",
"cdn_cache": True,
"vary": ["Accept-Encoding"], # Also varies on query string
},
ContentType.PERSONALIZED: {
"cache_control": "private, no-cache",
"cdn_cache": False,
"vary": ["Authorization"],
},
ContentType.CART: {
"cache_control": "private, no-store",
"cdn_cache": False,
"vary": [],
},
ContentType.CHECKOUT: {
"cache_control": "no-store",
"cdn_cache": False,
"vary": [],
},
}
@classmethod
def get_headers(
cls,
content_type: ContentType,
etag: str = None
) -> dict:
"""Get cache headers for content type."""
policy = cls.CACHE_POLICIES[content_type]
headers = {
"Cache-Control": policy["cache_control"],
}
if policy["vary"]:
headers["Vary"] = ", ".join(policy["vary"])
if etag:
headers["ETag"] = f'"{etag}"'
# Add CDN hints
if not policy["cdn_cache"]:
headers["CDN-Cache-Control"] = "no-store"
return headers
# FastAPI Integration
# (Assumes product_service, price_service, feed_service, and
#  get_current_user are constructed elsewhere in the application.)
app = FastAPI()
def get_auth_status(request: Request) -> bool:
"""Check if request is authenticated."""
return "Authorization" in request.headers
@app.get("/api/products/{product_id}")
async def get_product(
product_id: str,
response: Response,
authenticated: bool = Depends(get_auth_status)
):
"""Get product with appropriate cache headers."""
# Fetch product (from cache or DB)
product = await product_service.get_product(product_id)
# Determine content type based on auth status
content_type = (
ContentType.PRODUCT_PAGE_AUTH if authenticated
else ContentType.PRODUCT_PAGE_ANON
)
# Set cache headers
headers = CacheHeaderBuilder.get_headers(
content_type,
etag=str(product['updated_at'])
)
for header, value in headers.items():
response.headers[header] = value
return product
@app.get("/api/products/{product_id}/price")
async def get_price(product_id: str, response: Response):
"""
Get product price.
Never cached at CDN - always fresh from Redis/DB.
"""
price = await price_service.get_price(product_id)
headers = CacheHeaderBuilder.get_headers(ContentType.PRODUCT_PRICE)
for header, value in headers.items():
response.headers[header] = value
return price
@app.get("/api/me/recently-viewed")
async def get_recently_viewed(
response: Response,
user = Depends(get_current_user)
):
"""Get user's recently viewed - personalized, private."""
items = await feed_service.get_recently_viewed(user.id)
headers = CacheHeaderBuilder.get_headers(ContentType.PERSONALIZED)
for header, value in headers.items():
response.headers[header] = value
return items
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale to 10x the current load? What breaks first?"
Scaling Strategy
You: "Let me analyze the bottlenecks at 10x scale..."
SCALING ANALYSIS: 10X LOAD

Current → 10x:
├── 6K req/sec → 60K req/sec average
├── 600K req/sec → 6M req/sec peak
└── 300 GB cache → 1+ TB cache

BOTTLENECK ANALYSIS:

1. CDN (lowest risk)
├── Scales horizontally by design
├── More edge locations as needed
└── Cost scales linearly

2. Redis cluster (medium risk)
├── Current: 10 nodes × 32 GB = 320 GB
├── 10x: 30+ nodes × 64 GB = ~2 TB
├── Challenge: cross-slot operations
└── Solution: shard by product ID consistently (see the hash-tag sketch below)

3. Database (highest risk)
├── Even at a 99% hit rate, the 1% miss at 6M req/sec = 60K DB queries/sec
├── PostgreSQL won't handle this
└── Solutions:
    ├── Read replicas (10+)
    ├── Connection pooling (PgBouncer)
    └── Consider DynamoDB for product reads

4. Network (medium risk)
├── Internal traffic between services grows 10x
└── Solution: co-locate services, use a service mesh
Edge Cases
Interviewer: "What happens if Redis goes down during a flash sale?"
You: "That's our worst-case scenario. Here's how we handle it..."
EDGE CASE: REDIS CLUSTER FAILURE
Scenario:
Redis cluster partially fails during flash sale
600K req/sec hitting the system
Impact WITHOUT mitigation:
All requests hit the database → the database fails → the site goes down
MITIGATION STRATEGY:
1. CIRCUIT BREAKER
- Detect Redis failure quickly (< 1 second)
- Open circuit, stop trying Redis
- Serve degraded experience
2. GRACEFUL DEGRADATION
- Serve stale data from local memory cache (Guava/Caffeine)
- Show "temporarily unavailable" for personalization
- Block flash sale purchases temporarily (prevent oversell)
3. FALLBACK DATA
- Pre-compute "default" product data
- Store in local process memory
- Serve default when cache unavailable
4. AUTOMATIC RECOVERY
- Circuit breaker half-open after 30 seconds
- Test with single request
- Gradually restore traffic if successful
# Graceful Degradation Implementation
class ResilientProductService:
"""
Product service with graceful degradation.
    Falls back to local cache → default data → error
if Redis is unavailable.
"""
def __init__(self, redis_cache, local_cache, db_client):
self.redis = redis_cache
self.local = local_cache # In-memory (Guava-style)
self.db = db_client
self.circuit_breaker = CircuitBreaker(
failure_threshold=5,
recovery_timeout=30
)
async def get_product(self, product_id: str) -> dict:
"""Get product with fallback chain."""
# Try Redis (primary cache)
if self.circuit_breaker.is_closed():
try:
product = await self.redis.get(f"product:{product_id}")
if product:
# Also store in local cache for fallback
self.local.put(product_id, product)
return product
except Exception as e:
self.circuit_breaker.record_failure()
logger.warning(f"Redis failure: {e}")
# Try local in-memory cache (fallback)
product = self.local.get(product_id)
if product:
logger.info(f"Serving from local cache: {product_id}")
return product
# Try database (last resort during outage)
if self.circuit_breaker.is_open():
# Don't hammer DB during Redis outage
# Return default/error
return self._get_default_product(product_id)
# Normal cache miss - fetch from DB
product = await self.db.fetch_product(product_id)
if product:
await self.redis.set(f"product:{product_id}", product)
self.local.put(product_id, product)
return product
def _get_default_product(self, product_id: str) -> dict:
"""Return minimal product data during outage."""
return {
"id": product_id,
"title": "Product Temporarily Unavailable",
"price": None,
"inventory": None,
"_degraded": True
}
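The service above assumes a CircuitBreaker exposing is_closed(), is_open(), and record_failure(). A minimal time-based sketch of that interface, matching the constructor arguments used above:

# Minimal Circuit Breaker (sketch matching the interface used above)
import time
from typing import Optional

class CircuitBreaker:
    """Trips open after N consecutive failures; half-opens after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()  # trip the breaker

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self.recovery_timeout:
            # Half-open: reset and let a trial request through
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def is_closed(self) -> bool:
        return not self.is_open()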
Failure Scenarios
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Redis node failure | Health checks | Partial cache miss | Auto-failover to replica |
| Redis cluster failure | Circuit breaker | High DB load | Local cache + degradation |
| CDN outage | Synthetic monitoring | Higher origin load | Bypass CDN, scale origin |
| Database slow | Query latency alerts | Slow responses | Read replicas, cache more |
| Kafka consumer lag | Lag monitoring | Stale prices | Alert, scale consumers |
Phase 6: Monitoring and Operations
Interviewer: "How would you monitor this caching system in production?"
Key Metrics
You: "I'd track metrics at multiple levels..."
Cache Metrics
CACHE HEALTH DASHBOARD
REDIS CLUSTER
├── Hit ratio: 98.5% (target: >99%)
├── Latency p99: 2.1ms (target: <5ms)
├── Memory usage: 78% (alert at >85%)
├── Connections: 4,521 (max: 10K)
└── Evictions/sec: 12 (alert at >100)

CDN
├── Hit ratio: 94.2% (target: >90%)
├── Origin requests/sec: 358
├── Bandwidth saved: 89%
└── Purges pending: 12

INVALIDATION
├── Events processed/sec: 245
├── Consumer lag: 34 (alert at >1,000)
└── Failed invalidations: 0 (alert at >0)

THUNDERING HERD PROTECTION
├── Coalesced requests: 12,456 (requests saved)
├── Stale served: 2,341 (graceful degradation)
└── Background refreshes: 892 (proactive updates)
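The Redis hit ratio shown here can be derived from the server's own keyspace counters. A small sketch using redis-py:

# Redis Hit Ratio from INFO stats (sketch)
import redis

def redis_hit_ratio(client: redis.Redis) -> float:
    """Derive hit ratio from the server-side keyspace counters."""
    stats = client.info("stats")
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 1.0

# r = redis.Redis(host="localhost", port=6379)
# print(f"hit ratio: {redis_hit_ratio(r):.2%}")  # alert if this drops below 99%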
Alerting Strategy
CRITICAL (PagerDuty - Wake up):
- Cache hit ratio < 95% for 5 minutes
- Redis latency p99 > 50ms for 5 minutes
- Redis cluster node down
- Invalidation consumer lag > 10,000
- Circuit breaker opened
WARNING (Slack - Business hours):
- Cache hit ratio < 98% for 15 minutes
- Redis memory > 85%
- CDN origin requests > 1000/sec
- Eviction rate > 100/sec
INFO (Dashboard only):
- Cache key expiration patterns
- Hot key detection
- Invalidation event volume
Runbook: Cache Hit Ratio Drop
RUNBOOK: Cache Hit Ratio Below 95%
SYMPTOMS:
- Alert: "Redis cache hit ratio dropped to X%"
- Increased database latency
- Increased API response times
DIAGNOSIS:
1. Check for recent deployments:
> kubectl get deployments -n production --sort-by='.metadata.creationTimestamp'
2. Check cache key version (did we bump it accidentally?):
> redis-cli GET cache_version
3. Check for mass invalidation events:
> kafka-consumer-groups --describe --group cache-invalidation
4. Check hot keys (thundering herd?):
> redis-cli --hotkeys
5. Check memory pressure (evictions?):
> redis-cli INFO stats | grep evicted
RESOLUTION:
1. If deployment issue:
- Rollback deployment
- Investigate cache key changes
2. If thundering herd:
- Enable emergency background refresh
- Increase stale TTL temporarily
3. If memory pressure:
- Scale Redis cluster (add nodes)
- Review TTLs (reduce if possible)
4. If invalidation storm:
- Pause non-critical invalidation
- Investigate source of invalidations
ESCALATION:
- If not resolved in 15 minutes: Page on-call SRE
- If database impacted: Page database team
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of caching patterns, clear trade-off decisions, and practical production experience. Any questions for me?"
You: "Thank you! I'd love to hear how your team currently handles cache invalidation for pricing updates. Do you use event-driven invalidation, and if so, what message broker do you use?"
Interviewer: "We actually use a combination ā Kafka for inventory and pricing events, and simple TTL for product metadata. We've had some challenges with flash sales similar to what you described. Your thundering herd protection approach is something we should consider."
You: "That's great to hear. I'm also curious about your CDN setup ā do you use a single provider or multi-CDN?"
Summary: Week 4 Concepts Applied
Week 4 Concepts (Caching: Beyond "Just Add Redis")
| Day | Concept | Application in This Design |
|---|---|---|
| Day 1: Caching Patterns | Cache-aside, write-through | Cache-aside for products, write-through for recently viewed |
| Day 2: Invalidation | Event-driven, TTL safety net, multi-strategy | Different strategies per data type (price=event, metadata=TTL) |
| Day 3: Thundering Herd | Request coalescing, stale-while-revalidate, background refresh | Flash sale protection, hot key handling |
| Day 4: Feed Caching | Activity-based tiering, hybrid push/pull | Recently viewed (write-through), recommendations (tiered) |
| Day 5: Multi-Tier | Browser, CDN, Gateway, App cache layers | Complete cache matrix with appropriate headers per tier |
Code Patterns Demonstrated
1. THUNDERING HERD PROTECTION
- Request coalescing with asyncio.Future
- Stale-while-revalidate pattern
- Background refresh for hot keys
- Flash sale cache warming
2. MULTI-STRATEGY INVALIDATION
- Data-type-aware invalidation
- Event-driven for real-time data
- TTL safety net for all data
- Cascade invalidation (price → product page)
3. ACTIVITY-BASED PERSONALIZATION
- User tier classification (active/recent/dormant)
- Different TTLs per tier
- Background pre-computation for active users
4. MULTI-TIER CACHE HEADERS
- Content-type-aware headers
- CDN-Cache-Control for edge control
- Vary headers for correct caching
- Private vs public distinction
5. GRACEFUL DEGRADATION
- Circuit breaker pattern
- Local cache fallback
- Default data for outages
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Design a multi-tier caching architecture (browser → CDN → gateway → app → DB)
- Choose appropriate caching patterns for different data types
- Implement thundering herd protection with request coalescing
- Design event-driven cache invalidation with TTL safety nets
- Apply different invalidation strategies based on data freshness requirements
- Implement activity-based caching for personalized data
- Set correct HTTP cache headers for anonymous vs authenticated content
- Handle cache failures gracefully with degradation strategies
- Design cache warming strategies for predictable traffic spikes
- Monitor cache health and troubleshoot performance issues
- Estimate cache sizes and hit ratio requirements for a given load
This capstone integrates all concepts from Week 4 of the System Design Mastery Series: Caching, Beyond "Just Add Redis". Use this as a template for approaching e-commerce and high-traffic system design interviews.