Week 1-2 Capstone: The Ultimate System Design Interview
🎯 A Real-World Problem Covering Everything You've Learned
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions; this is meant to be collaborative."
They write on the whiteboard:
────────────────────────────────────────────────────────────────────────────

   Design a Global Flash Sale System for "MegaMart"

   MegaMart is launching "Lightning Deals": flash sales where limited
   inventory items are sold at a 90% discount for exactly 10 minutes.

   - 50 million users globally
   - Flash sales happen every hour (random products)
   - Each sale: 10,000 units available
   - Users get 3 minutes to complete checkout after claiming
   - Must handle payment processing and inventory
   - Must notify users via email/push when deals go live

────────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial: never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, when you say 50 million users globally, is that monthly active users, or do we expect all of them potentially accessing a flash sale?"
Interviewer: "Good question. 50M is MAU. For popular flash sales, we've seen up to 5 million users trying to access a single sale. The product team calls these 'hype drops': think iPhone launches or limited edition sneakers."
You: "That's helpful. For the 10,000 units: do users select quantities, or is it one unit per user?"
Interviewer: "One unit per user per sale. We want to maximize the number of happy customers."
You: "What happens if someone claims an item but doesn't complete checkout within 3 minutes?"
Interviewer: "The item should go back to available inventory for others."
You: "For payment processing β do you have an existing payment provider, or should I design that?"
Interviewer: "Assume we use Stripe. Treat it as an external API that might be slow or fail."
You: "Last question: the notification when deals go live. Are users subscribed to specific products, or do we notify everyone?"
Interviewer: "Users can 'watch' products. When a flash sale includes a watched product, they should be notified. But we also have a general 'deals' notification channel for users who opted in. Could be millions of notifications per sale."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. FLASH SALE MANAGEMENT
- Create flash sales with specific products and inventory
- Sales start at scheduled times (hourly)
- Sales last exactly 10 minutes
- 10,000 units per sale
2. INVENTORY CLAIMING
- Users can claim one item per sale
- Claimed items reserved for 3 minutes
- Unclaimed items return to pool after timeout
- No overselling (exactly 10,000 successful orders max)
3. CHECKOUT FLOW
- Complete payment within 3 minutes of claim
- Process payment via Stripe
- Handle payment failures gracefully
- Create order on successful payment
4. NOTIFICATIONS
- Notify watchers when watched product goes on sale
- Notify deal subscribers when any sale starts
- Support email and push notifications
- Millions of notifications per sale
Non-Functional Requirements
1. SCALE
- Handle 5M concurrent users hitting a single sale
- Process 10K checkouts in 3-minute window
- Send millions of notifications within seconds
2. RELIABILITY
- No overselling (inventory consistency is critical)
- No double-charging (payment idempotency)
- No lost orders (durability)
3. AVAILABILITY
- System must work during sale windows
- Graceful degradation if components fail
4. LATENCY
- Claim response: <500ms at p99
- Checkout: <3s at p99
- Notification delivery: <30s for email, <5s for push
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Traffic Estimation
FLASH SALE TRAFFIC (per sale)
Users trying to access: 5,000,000
Sale duration: 10 minutes = 600 seconds
Average request rate: 5M / 600 = 8,333 requests/second
But traffic isn't uniform. The first 30 seconds see 80% of traffic:
Peak traffic: (5M × 0.8) / 30 = 133,333 requests/second
Let's round up for safety: 150,000 requests/second peak
Breakdown by operation:
├── Page views:     100,000/sec (viewing the sale)
├── Claim attempts:  40,000/sec (trying to claim)
├── Stock checks:    10,000/sec (AJAX refreshes)
└── Checkouts:           50/sec (successful claimers)
Inventory Operations
INVENTORY MATH
Total inventory: 10,000 units
Claim timeout: 3 minutes
Successful checkout rate: ~70% (estimate)
If all 10K claimed in first 30 seconds:
├── 7,000 complete checkout
└── 3,000 time out → return to pool
    └── Pool refills trigger a second wave of claims
Maximum claim operations: ~15,000 (accounting for timeouts)
Checkout operations: ~10,000 (until inventory exhausted)
Claims per second (peak): 10,000 / 30 = 333 claims/second
Checkouts per second (peak): 10,000 / 180 = 56 checkouts/second
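The arithmetic above is worth sanity-checking in code. This is a small sketch that reproduces the estimates in this section; all inputs (5M users, 80% of traffic in the first 30 seconds, 10K units, 3-minute claim window, 2.4M notifications) come from the figures already stated, and the variable names are illustrative.

```python
# Back-of-the-envelope numbers from the estimation section, as code.

users = 5_000_000
peak_window_s = 30
peak_rps = int(users * 0.8 / peak_window_s)   # first-30-seconds request rate

inventory = 10_000
claims_per_s = inventory // 30                # if all units claimed in 30s
checkouts_per_s = inventory / 180             # spread over the 3-minute window

notifications = 2_400_000
notif_rate = notifications // 30              # 30-second delivery target

print(peak_rps, claims_per_s, round(checkouts_per_s), notif_rate)
```

Matching the text: ~133K requests/second at peak, ~333 claims/second, ~56 checkouts/second, 80K notifications/second.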
Notification Estimation
NOTIFICATION VOLUME
Product watchers: 500,000 (popular product)
Deal subscribers: 2,000,000
Overlap (watching + subscribed): ~100,000
Total notifications: 2,400,000
Delivery target: 30 seconds for all
Notification rate: 2.4M / 30 = 80,000 notifications/second
Push notification payload: ~500 bytes
Email payload: ~5 KB
Bandwidth:
├── Push:  80,000 × 500 bytes = 40 MB/sec
└── Email: 30,000 × 5 KB = 150 MB/sec
Storage Estimation
STORAGE REQUIREMENTS
Per sale:
├── Sale metadata:     ~1 KB
├── Inventory records: 10,000 × 100 bytes = 1 MB
├── Claim records:     15,000 × 200 bytes = 3 MB
├── Order records:     10,000 × 500 bytes = 5 MB
└── Total per sale:    ~10 MB
Sales per day: 24
Storage per day: ~240 MB
Storage per year: ~90 GB
Notification logs:
├── Per notification: 200 bytes
├── Per sale: 2.4M × 200 bytes = 480 MB
└── Per year: ~4 TB
Hot data (active sales): ~100 MB
Warm data (past week): ~2 GB
Cold data (archive): compress and store in S3
Infrastructure Estimation
SERVER REQUIREMENTS
API Servers (150K req/sec):
├── Per-server capacity: 5,000 req/sec
├── Servers needed: 30
└── With 2x headroom: 60 servers
Redis (inventory + claims):
├── Operations/sec: 50,000 (reads + writes)
├── Memory for hot data: ~1 GB per sale
└── Cluster: 3 primaries + 3 replicas
PostgreSQL (orders, users):
├── Write TPS: 500 (orders only)
├── Read TPS: 5,000
└── Single primary + 2 read replicas
Message Queue (notifications):
├── Messages/sec: 100,000
├── Kafka cluster: 6 brokers, 3 partitions
└── Consumer groups: Push, Email, Analytics
Interviewer: "Good analysis. Those numbers seem reasonable. What concerns you most about the scale?"
You: "Three things concern me:
1. The 150K requests/second peak: this is a thundering herd problem. We need aggressive caching and potentially a queue-based claim system.
2. Inventory consistency: with 40K claim attempts per second on 10K items, we need atomic operations. Race conditions could cause overselling.
3. Notification delivery: 80K notifications/second is achievable, but we need to pre-compute recipient lists before the sale starts, not at sale time."
Phase 3: High-Level Design (10 minutes)
You: "Let me sketch out the high-level architecture."
────────────────────────────────────────────────────────────────────────────
                     MEGAMART FLASH SALE ARCHITECTURE
────────────────────────────────────────────────────────────────────────────

                              USERS (5M)
                                  │
                                  ▼
                      CDN + WAF + Load Balancer
                     (CloudFront + Shield + ALB)
                                  │
            ┌─────────────────────┼─────────────────────┐
            ▼                     ▼                     ▼
    Sale Page Service        Claim API            Checkout API
    (Read Heavy)             (Write Heavy)        (Critical)
    60 servers               20 servers           20 servers
            │                     │                     │
            └─────────────────────┼─────────────────────┘
                                  ▼
                           REDIS CLUSTER
       [Inventory Counter]  [Claims Store]  [Rate Limit Cache]
                                  │
            ┌─────────────────────┼─────────────────────┐
            ▼                     ▼                     ▼
       PostgreSQL              Kafka                 Stripe
       (Orders DB)          (Events/Notif)         (Payments)
       Primary + 2 RR        6 brokers            External API
                                  │
                 ┌────────────────┼────────────────┐
                 ▼                ▼                ▼
           Push Workers     Email Workers    Webhook Delivery

  SUPPORTING SERVICES:
  Sale Scheduler (Cron) · Inventory Manager · Claim Expiry · Monitoring (Grafana)
────────────────────────────────────────────────────────────────────────────
Component Overview
You: "Let me walk through each component and its role:"
Traffic Layer
CDN + WAF + Load Balancer
├── CloudFront: caches static assets (sale page HTML, JS, CSS)
├── AWS Shield: DDoS protection (critical during sale spikes)
├── WAF: rate limiting, bot detection
└── ALB: routes to the appropriate service based on path
Benefits:
- CDN handles 80% of traffic (static content)
- WAF blocks abusive traffic before hitting servers
- Geographic distribution reduces latency
API Services
Sale Page Service (Read-Heavy)
├── Renders the sale page with the current inventory count
├── Heavy caching (1-second TTL on inventory count)
├── Stateless, horizontally scalable
└── Circuit breaker to Redis
Claim API Service (Write-Heavy)
├── Handles claim requests
├── Atomic inventory operations
├── Idempotent (claim key per user per sale)
└── Rate limited per user
Checkout API Service (Critical)
├── Processes payments through Stripe
├── Creates orders
├── Idempotent (prevents double-charge)
├── Timeout-aware (3-minute claim expiry)
└── Circuit breaker to Stripe
Data Layer
Redis Cluster
├── Inventory Counter: atomic decrement with Lua scripts
├── Claims Store: user_id → claim_info with TTL
├── Rate Limit: sliding-window counters
└── Session Cache: user session data
PostgreSQL
├── Primary: writes (orders, users)
├── Read Replicas: read queries
├── Partitioned by date for orders
└── Indexed on user_id, sale_id, order_status
Event Processing
Kafka
├── Topic: sale_events (sale start, sale end, inventory updates)
├── Topic: claim_events (claimed, expired, checkout_started)
├── Topic: notification_events (to_send, sent, failed)
└── Topic: order_events (created, paid, failed)
Consumer Groups:
├── NotificationWorkers: send push/email
├── AnalyticsWorkers: real-time dashboards
├── WebhookWorkers: notify external systems
└── ClaimExpiryWorkers: return expired claims to pool
Interviewer: "This looks comprehensive. I'm curious about a few things. First, how do you prevent overselling? Walk me through the claim flow."
Phase 4: Deep Dive - Inventory and Claims (10 minutes)
You: "Great question. This is the most critical part of the system. Let me detail the claim flow."
The Claim Flow
USER CLICKS "CLAIM DEAL"
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 1: RATE LIMITING

 Check: has the user exceeded 10 claim attempts in the last minute?
 Implementation: a Redis fixed-window counter (a cheap approximation of a
 sliding window, sufficient for abuse control here):

     INCR rate_limit:{user_id}:{minute}
     EXPIRE rate_limit:{user_id}:{minute} 60

 If count > 10: return 429 Too Many Requests
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 2: IDEMPOTENCY CHECK

 Check: has the user already claimed in this sale?
 Key: claim:{sale_id}:{user_id}

 If it exists: return the existing claim (idempotent response).

 This prevents double-claiming and handles retries safely.
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 3: ATOMIC CLAIM (Lua script in Redis)

 -- Atomic claim. All keys are declared so the script is cluster-safe:
 -- KEYS[1] = inventory:{sale_id}
 -- KEYS[2] = claim:{sale_id}:{user_id}
 -- KEYS[3] = claims:expiry:{sale_id}
 local inventory_key = KEYS[1]
 local claim_key     = KEYS[2]
 local expiry_key    = KEYS[3]
 local user_id  = ARGV[1]
 local claim_id = ARGV[2]
 local ttl      = tonumber(ARGV[3])  -- 180 seconds

 -- Check if already claimed
 if redis.call('EXISTS', claim_key) == 1 then
     return redis.call('GET', claim_key)  -- return existing claim
 end

 -- Try to decrement inventory
 local remaining = redis.call('DECR', inventory_key)
 if remaining < 0 then
     redis.call('INCR', inventory_key)  -- no inventory, restore counter
     return nil                         -- sold out
 end

 -- Success! Store the claim with a TTL
 local now = tonumber(redis.call('TIME')[1])
 local claim_data = cjson.encode({
     claim_id   = claim_id,
     user_id    = user_id,
     claimed_at = now,
     expires_at = now + ttl
 })
 redis.call('SET', claim_key, claim_data, 'EX', ttl)

 -- Add to the expiry tracking set
 redis.call('ZADD', expiry_key, now + ttl, claim_key)

 return claim_data
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 4: PUBLISH EVENT

 Kafka topic: claim_events
 {
     "event_type": "claimed",
     "claim_id": "clm_abc123",
     "sale_id": "sale_xyz",
     "user_id": "usr_456",
     "expires_at": "2024-01-15T10:03:00Z",
     "remaining_inventory": 9523
 }

 Consumers:
 - AnalyticsWorker: update the real-time dashboard
 - WebsocketBroadcaster: push the inventory count to browsers
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 5: RETURN RESPONSE

 {
     "status": "claimed",
     "claim_id": "clm_abc123",
     "expires_at": "2024-01-15T10:03:00Z",
     "seconds_remaining": 180,
     "checkout_url": "/checkout/clm_abc123"
 }
────────────────────────────────────────────────────────────────────────────
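The invariants this flow guarantees are easiest to see in a small model. The sketch below is a hedged, in-memory stand-in for the Lua script's logic: Redis executes Lua scripts atomically on a single thread, so a sequential Python model exercises the same invariants (idempotent claims, no overselling, units returned on expiry). The class and method names are illustrative, not part of the real system.

```python
import time

class InMemoryClaimStore:
    """Single-threaded model of the atomic claim script's invariants."""

    def __init__(self, inventory: int, ttl: int = 180):
        self.inventory = inventory
        self.ttl = ttl
        self.claims = {}  # user_id -> claim dict

    def claim(self, user_id: str, claim_id: str):
        # Idempotency: a retry returns the existing claim unchanged
        if user_id in self.claims:
            return self.claims[user_id]
        # Bounded inventory: the counter never goes below zero
        if self.inventory == 0:
            return None  # sold out
        self.inventory -= 1
        now = int(time.time())
        c = {"claim_id": claim_id, "user_id": user_id,
             "claimed_at": now, "expires_at": now + self.ttl}
        self.claims[user_id] = c
        return c

    def expire(self, user_id: str):
        # An expired claim returns its unit to the pool
        if self.claims.pop(user_id, None) is not None:
            self.inventory += 1

store = InMemoryClaimStore(inventory=2)
a = store.claim("u1", "c1")
b = store.claim("u1", "c1-retry")  # idempotent: same claim back
c = store.claim("u2", "c2")
d = store.claim("u3", "c3")        # sold out -> None
store.expire("u2")                 # unit returns to the pool
e = store.claim("u3", "c3")        # now succeeds
```

Running the scenario at the bottom shows the three properties the text calls out: the retry gets the identical claim, the third user is refused while inventory is exhausted, and an expiry makes a unit claimable again.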
Why This Works
You: "This design prevents overselling through several mechanisms:"
OVERSELLING PREVENTION
1. ATOMIC OPERATIONS
- The Lua script runs atomically in Redis
- No race condition between check and decrement
- DECR is atomic, returns new value immediately
2. IDEMPOTENCY
- Checking existing claim BEFORE decrementing
- Same user retrying gets same response
- No double-claiming even under network issues
3. BOUNDED INVENTORY
- Counter starts at exactly 10,000
- Can only decrement, never go negative (we restore on <0)
- TTL ensures claims expire automatically
4. NO DATABASE IN HOT PATH
- Redis handles all claim logic
- PostgreSQL only involved at checkout (much lower rate)
- Removes database as bottleneck
Claim Expiry Handling
You: "When a claim expires without checkout, we need to return inventory:"
import asyncio
import time

class ClaimExpiryWorker:
    """
    Background worker that returns expired claims to inventory.
    Runs continuously, processing expired claims.
    Assumes a redis-py asyncio client created with decode_responses=True.
    """

    def __init__(self, redis_client):
        self.redis = redis_client

    async def run(self):
        """Main worker loop."""
        while True:
            await self.process_expired_claims()
            await asyncio.sleep(1)  # check every second

    async def process_expired_claims(self):
        """Find and process expired claims."""
        now = int(time.time())

        # Get all active sales
        active_sales = await self.redis.smembers('active_sales')

        for sale_id in active_sales:
            expiry_key = f'claims:expiry:{sale_id}'
            inventory_key = f'inventory:{sale_id}'

            # Get claims whose reservation window has passed
            expired = await self.redis.zrangebyscore(expiry_key, 0, now)

            for claim_key in expired:
                # Atomically return inventory and remove the claim
                await self.return_claim_to_inventory(
                    sale_id, claim_key, inventory_key, expiry_key
                )

    async def return_claim_to_inventory(
        self, sale_id, claim_key, inventory_key, expiry_key
    ):
        """Return a single expired claim to inventory."""
        # Lua script for an atomic return
        script = """
        local claim_key = KEYS[1]
        local inventory_key = KEYS[2]
        local expiry_key = KEYS[3]

        -- Check if the claim still exists (it might have been checked out)
        if redis.call('EXISTS', claim_key) == 0 then
            -- Claim was checked out; just remove it from the expiry set
            redis.call('ZREM', expiry_key, claim_key)
            return 0
        end

        -- Claim exists and has expired: return it to inventory
        redis.call('DEL', claim_key)
        redis.call('INCR', inventory_key)
        redis.call('ZREM', expiry_key, claim_key)
        return 1
        """
        # redis-py signature: eval(script, numkeys, *keys_and_args)
        returned = await self.redis.eval(
            script, 3, claim_key, inventory_key, expiry_key
        )

        if returned:
            # Publish a claim_expired event (implementation not shown)
            await self.publish_claim_expired(sale_id, claim_key)
Interviewer: "Nice. What about the checkout flow? How do you handle payment failures and ensure orders aren't double-charged?"
Phase 5: Deep Dive - Checkout Flow (10 minutes)
You: "The checkout flow is where everything comes together: timeouts, idempotency, circuit breakers. Let me walk through it."
Checkout Architecture
USER CLICKS "COMPLETE PURCHASE"
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 1: VALIDATE CLAIM

 Check that the claim exists and hasn't expired:

     claim_data = redis.get(f"claim:{sale_id}:{user_id}")

     if not claim_data:
         return 410 Gone "Your claim has expired"

     if claim_data.checkout_started:
         return 409 Conflict "Checkout already in progress"

     # Mark checkout started (re-store the claim JSON with the flag set,
     # preserving its TTL) to prevent concurrent checkouts
     claim_data.checkout_started = timestamp
     redis.set(claim_key, claim_data, keepttl=True)
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 2: IDEMPOTENCY CHECK

 Check whether this checkout was already processed:

     Key: idempotency:{claim_id}

     existing = await idempotency_store.get(claim_id)
     if existing:
         if existing.status == "completed":
             return existing.response   # return the same successful response
         if existing.status == "processing":
             return 409 "Payment in progress"
         if existing.status == "failed":
             pass                       # allow retry

     # Record that processing started
     await idempotency_store.set(claim_id, {
         status: "processing",
         started_at: now
     }, ttl=3600)
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 3: PROCESS PAYMENT (with timeout + circuit breaker)

     # Check the circuit breaker first (Day 3 pattern)
     if stripe_circuit.is_open:
         await idempotency_store.set(claim_id, {status: "failed"})
         return 503 "Payment service temporarily unavailable"

     # Call Stripe with a timeout (Day 1 pattern)
     try:
         payment = await asyncio.wait_for(
             stripe.charges.create(
                 amount=sale.discounted_price,
                 currency="usd",
                 customer=user.stripe_customer_id,
                 idempotency_key=f"claim_{claim_id}",  # Day 2 pattern
                 metadata={"sale_id": sale_id, "claim_id": claim_id}
             ),
             timeout=10.0  # 10-second timeout
         )
         stripe_circuit.record_success()

     except asyncio.TimeoutError:
         stripe_circuit.record_failure()
         # Don't fail yet: the payment might have succeeded!
         payment = await verify_payment_status(claim_id)
         if not payment:
             await idempotency_store.set(claim_id, {status: "failed"})
             return 504 "Payment timeout - please try again"

     except StripeError as e:
         stripe_circuit.record_failure()
         await idempotency_store.set(claim_id, {status: "failed", error: e})
         return 400 f"Payment failed: {e.message}"
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 4: CREATE ORDER (transactionally)

     async with database.transaction():
         # Create the order record
         order = await Order.create(
             id=generate_order_id(),
             user_id=user_id,
             sale_id=sale_id,
             claim_id=claim_id,
             amount=payment.amount,
             payment_id=payment.id,
             status="confirmed"
         )

         # Record in the idempotency store
         await idempotency_store.set(claim_id, {
             status: "completed",
             order_id: order.id,
             response: {order_id: order.id, status: "confirmed"}
         }, ttl=86400 * 7)  # keep for 7 days

         # Delete the claim (inventory was already decremented)
         await redis.delete(f"claim:{sale_id}:{user_id}")
         await redis.zrem(f"claims:expiry:{sale_id}",
                          f"claim:{sale_id}:{user_id}")
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 5: PUBLISH EVENTS (async, non-blocking)

     # Publish to Kafka - don't wait for confirmation
     await kafka.send_async("order_events", {
         "event_type": "order.created",
         "order_id": order.id,
         "user_id": user_id,
         "sale_id": sale_id,
         "amount": payment.amount
     })

     # Queue the confirmation email (Day 4 - webhook pattern)
     await notification_queue.send({
         "type": "email",
         "template": "order_confirmation",
         "user_id": user_id,
         "order_id": order.id
     })
────────────────────────────────────────────────────────────────────────────
        │
        ▼
────────────────────────────────────────────────────────────────────────────
 STEP 6: RETURN SUCCESS

     {
         "status": "confirmed",
         "order_id": "ord_xyz789",
         "message": "Congratulations! Your order is confirmed.",
         "receipt_url": "/orders/ord_xyz789/receipt"
     }
────────────────────────────────────────────────────────────────────────────
Complete Checkout Code
import asyncio
import time
from datetime import datetime
from typing import Optional


class CheckoutService:
    """
    Checkout service implementing all Week 1-2 patterns:
    - Timeouts (Day 1)
    - Idempotency (Day 2)
    - Circuit Breakers (Day 3)
    - Async Events (Day 4)
    """

    def __init__(
        self,
        redis: Redis,
        db: Database,
        stripe_client: Stripe,
        kafka: KafkaProducer,
        idempotency_store: IdempotencyStore,
        circuit_breaker: CircuitBreaker
    ):
        self.redis = redis
        self.db = db
        self.stripe = stripe_client
        self.kafka = kafka
        self.idempotency = idempotency_store
        self.circuit = circuit_breaker

        # Timeout configuration (Day 1)
        self.payment_timeout = 10.0
        self.db_timeout = 5.0
        self.total_timeout = 25.0  # budget for the entire checkout

    async def checkout(
        self,
        user_id: str,
        sale_id: str,
        claim_id: str
    ) -> CheckoutResult:
        """Process checkout with full reliability guarantees."""
        # Start the timeout budget (Day 1)
        deadline = time.time() + self.total_timeout

        # Step 1: Validate the claim
        claim = await self._validate_claim(user_id, sale_id, claim_id)
        if not claim:
            return CheckoutResult(
                success=False,
                error="Claim expired or invalid"
            )

        # Step 2: Idempotency check (Day 2)
        existing = await self.idempotency.get(claim_id)
        if existing:
            if existing['status'] == 'completed':
                return CheckoutResult(
                    success=True,
                    order_id=existing['order_id'],
                    idempotent=True
                )
            elif existing['status'] == 'processing':
                return CheckoutResult(
                    success=False,
                    error="Checkout already in progress"
                )

        # Mark as processing
        await self.idempotency.set(claim_id, {
            'status': 'processing',
            'started_at': time.time()
        })

        try:
            # Step 3: Process payment (Day 1 timeout + Day 3 circuit breaker)
            payment = await self._process_payment(
                user_id, sale_id, claim_id, deadline
            )

            # Step 4: Create the order (with the remaining timeout budget)
            remaining_time = deadline - time.time()
            if remaining_time <= 0:
                raise TimeoutError("Checkout timeout exceeded")

            order = await asyncio.wait_for(
                self._create_order(user_id, sale_id, claim_id, payment),
                timeout=remaining_time
            )

            # Step 5: Finalize the idempotency record
            await self.idempotency.set(claim_id, {
                'status': 'completed',
                'order_id': order.id,
                'completed_at': time.time()
            }, ttl=86400 * 7)

            # Step 6: Publish events (async, non-blocking) (Day 4)
            asyncio.create_task(self._publish_order_events(order))

            return CheckoutResult(
                success=True,
                order_id=order.id
            )

        except PaymentError as e:
            await self.idempotency.set(claim_id, {
                'status': 'failed',
                'error': str(e)
            })
            return CheckoutResult(success=False, error=str(e))

        except TimeoutError:
            # Don't mark as failed: the payment might have succeeded.
            # Let the user retry; idempotency will handle it.
            return CheckoutResult(
                success=False,
                error="Request timeout - please try again"
            )

    async def _process_payment(
        self,
        user_id: str,
        sale_id: str,
        claim_id: str,
        deadline: float
    ) -> PaymentResult:
        """Process payment with a circuit breaker and timeout."""
        # Check the circuit breaker (Day 3)
        if self.circuit.is_open():
            raise PaymentError("Payment service temporarily unavailable")

        # Calculate the remaining timeout
        remaining = deadline - time.time()
        timeout = min(self.payment_timeout, remaining)
        if timeout <= 0:
            raise TimeoutError("No time remaining for payment")

        try:
            # Call Stripe with an idempotency key (Day 2)
            payment = await asyncio.wait_for(
                self.stripe.charges.create(
                    amount=await self._get_sale_price(sale_id),
                    currency='usd',
                    customer=await self._get_stripe_customer(user_id),
                    idempotency_key=f"checkout_{claim_id}",
                    metadata={
                        'sale_id': sale_id,
                        'claim_id': claim_id,
                        'user_id': user_id
                    }
                ),
                timeout=timeout
            )
            self.circuit.record_success()
            return payment

        except asyncio.TimeoutError:
            self.circuit.record_failure()
            # The payment might have succeeded - verify
            payment = await self._verify_payment(claim_id)
            if payment:
                return payment
            raise TimeoutError("Payment timeout")

        except StripeError as e:
            self.circuit.record_failure()
            raise PaymentError(str(e))

    async def _verify_payment(self, claim_id: str) -> Optional[PaymentResult]:
        """Verify whether the payment exists in Stripe (timeout recovery).

        Uses Stripe's Search API to match on our claim_id metadata
        (plain list() does not filter by metadata).
        """
        try:
            payments = await self.stripe.charges.search(
                query=f"metadata['claim_id']:'{claim_id}'",
                limit=1
            )
            if payments.data:
                return payments.data[0]
            return None
        except Exception:
            return None

    async def _create_order(
        self,
        user_id: str,
        sale_id: str,
        claim_id: str,
        payment: PaymentResult
    ) -> Order:
        """Create the order in the database."""
        async with self.db.transaction():
            order = await self.db.orders.create(
                id=generate_order_id(),
                user_id=user_id,
                sale_id=sale_id,
                claim_id=claim_id,
                payment_id=payment.id,
                amount=payment.amount,
                status='confirmed',
                created_at=datetime.utcnow()
            )

            # Clean up the claim
            await self.redis.delete(f"claim:{sale_id}:{user_id}")

            return order

    async def _publish_order_events(self, order: Order):
        """Publish events for the order (non-blocking)."""
        # Order created event
        await self.kafka.send('order_events', {
            'event_type': 'order.created',
            'order_id': order.id,
            'user_id': order.user_id,
            'sale_id': order.sale_id,
            'amount': order.amount,
            'timestamp': datetime.utcnow().isoformat()
        })

        # Queue the confirmation email
        await self.kafka.send('notification_events', {
            'type': 'email',
            'template': 'order_confirmation',
            'recipient_id': order.user_id,
            'data': {
                'order_id': order.id,
                'amount': order.amount
            }
        })
Interviewer: "I like how you've integrated all the patterns. Now, let's talk about the notification system. You mentioned millions of notifications when a sale starts. How do you handle that?"
Phase 6: Deep Dive - Notification System (5 minutes)
You: "The notification system is where the webhook patterns from Day 4 come in. Let me explain."
Pre-computation Strategy
THE PROBLEM:
Sale starts at 10:00:00
Need to notify 2.4M users
Users expect to know within seconds
Computing recipients at 10:00:00 = disaster
THE SOLUTION:
Pre-compute recipient list before sale starts
Store in ready-to-send format
At 10:00:00, just fan out to workers
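The pre-computation step above can be sketched in a few lines. This is a hedged illustration of the batching: it partitions a recipient list into fixed-size chunks keyed the way the pipeline below describes (`notification_batch:{sale_id}:{n}`). The function name, batch size, and key format are assumptions taken from this section, not a production API.

```python
def partition_recipients(sale_id: str, user_ids: list, batch_size: int = 100_000):
    """Return {batch_key: [user_ids]} ready to load into Redis lists."""
    batches = {}
    for start in range(0, len(user_ids), batch_size):
        key = f"notification_batch:{sale_id}:{start // batch_size}"
        batches[key] = user_ids[start:start + batch_size]
    return batches

# 2.4M users at 100k per batch -> 24 batches, each loadable with one RPUSH
batches = partition_recipients("sale_xyz", list(range(2_400_000)))
```

At T=0 the scheduler only has to publish a trigger with the batch count; each worker reads its own pre-built batch and never touches the users table.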
Notification Architecture
────────────────────────────────────────────────────────────────────────────
                         NOTIFICATION PIPELINE
────────────────────────────────────────────────────────────────────────────

 T-10 MINUTES (Pre-computation)

     SELECT user_id, email, push_token, notification_preferences
     FROM users
     WHERE (user_id IN (SELECT user_id FROM product_watchers
                        WHERE product_id = 'xyz')
            OR subscribed_to_deals = true)
       AND notification_enabled = true

     Result: 2.4M rows
     Store in: Redis lists (partitioned by user_id hash)

     notification_batch:{sale_id}:0  = [users 0-100k]
     notification_batch:{sale_id}:1  = [users 100k-200k]
     ...
     notification_batch:{sale_id}:23 = [users 2.3M-2.4M]

 T=0 (Sale starts)

     Scheduler publishes to Kafka:

     Topic: notification_triggers
     {
         "sale_id": "xyz",
         "batch_count": 24,
         "trigger_time": "2024-01-15T10:00:00Z"
     }

 Workers process batches (parallel)

     24 workers, each handling 100k users

     Worker 0:
     - Read notification_batch:{sale_id}:0 from Redis
     - For each user:
       - Check preferences (email? push? both?)
       - Queue to the appropriate sender

     All workers run simultaneously = 2.4M queued in ~10 seconds

 Sender pools (final delivery)

     Push Notification Pool (20 workers)
     ├── Firebase Cloud Messaging for Android
     ├── APNs for iOS
     └── Rate: ~50,000/second total

     Email Pool (10 workers)
     ├── SendGrid / SES
     ├── Rate: ~10,000/second total
     └── Lower priority (30-second SLA, not 5-second)
────────────────────────────────────────────────────────────────────────────
Notification Worker Implementation
class NotificationWorker:
"""
Notification worker implementing Day 4 patterns:
- At-least-once delivery
- Retry with backoff
- Dead letter queue
- Circuit breaker for external services
"""
def __init__(
self,
redis: Redis,
kafka_consumer: KafkaConsumer,
push_service: PushNotificationService,
email_service: EmailService,
circuit_breakers: Dict[str, CircuitBreaker]
):
self.redis = redis
self.consumer = kafka_consumer
self.push = push_service
self.email = email_service
self.circuits = circuit_breakers
async def process_batch(self, trigger: NotificationTrigger):
"""Process a notification batch for a sale."""
batch_key = f"notification_batch:{trigger.sale_id}:{trigger.batch_id}"
# Get users from pre-computed batch
users = await self.redis.lrange(batch_key, 0, -1)
for user_data in users:
user = json.loads(user_data)
try:
await self._send_notification(user, trigger)
except Exception as e:
# Log and continue - don't let one failure stop batch
logger.error(
"Notification failed",
user_id=user['id'],
error=str(e)
)
async def _send_notification(self, user: dict, trigger: NotificationTrigger):
"""Send notification to a single user."""
# Idempotency check (Day 2)
idem_key = f"notif_sent:{trigger.sale_id}:{user['id']}"
if await self.redis.exists(idem_key):
return # Already sent
# Send based on preferences
if user.get('push_enabled') and user.get('push_token'):
await self._send_push(user, trigger)
if user.get('email_enabled') and user.get('email'):
await self._queue_email(user, trigger)
# Mark as sent (idempotency)
await self.redis.set(idem_key, '1', ex=3600) # 1 hour TTL
async def _send_push(self, user: dict, trigger: NotificationTrigger):
"""Send push notification with circuit breaker."""
# Check circuit breaker (Day 3)
provider = 'fcm' if user['platform'] == 'android' else 'apns'
if self.circuits[provider].is_open():
# Queue for retry later
await self._queue_for_retry(user, trigger, 'push')
return
try:
await asyncio.wait_for(
self.push.send(
token=user['push_token'],
                    title="⚡ Lightning Deal Live!",
body=f"The {trigger.product_name} deal is live! Tap to claim.",
data={'sale_id': trigger.sale_id}
),
timeout=2.0 # Day 1: Timeout
)
self.circuits[provider].record_success()
except asyncio.TimeoutError:
self.circuits[provider].record_failure()
await self._queue_for_retry(user, trigger, 'push')
except Exception as e:
self.circuits[provider].record_failure()
logger.error("Push failed", error=str(e))
    async def _queue_for_retry(self, user: dict, trigger: NotificationTrigger, channel: str):
        """Park a failed send on a retry list; a backoff worker drains it,
        and entries that keep failing go to the dead letter queue."""
        await self.redis.rpush(
            f"notif_retry:{channel}:{trigger.sale_id}",
            json.dumps({'user': user, 'attempts': 1})
        )
async def _queue_email(self, user: dict, trigger: NotificationTrigger):
"""Queue email for async sending (Day 4 pattern)."""
await self.kafka.send('email_queue', {
'recipient': user['email'],
'template': 'flash_sale_live',
'data': {
'user_name': user['name'],
'product_name': trigger.product_name,
'sale_url': f"https://megamart.com/sale/{trigger.sale_id}"
},
'idempotency_key': f"sale_email:{trigger.sale_id}:{user['id']}"
})
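The worker above leans on a `CircuitBreaker` exposing `is_open()`, `record_success()`, and `record_failure()`. A minimal count-based sketch of such a breaker is shown below; the thresholds and the simplified half-open handling are assumptions, not the Day 3 production implementation:

```python
import time

class CircuitBreaker:
    """Minimal count-based circuit breaker (illustrative sketch).

    Trips open after `failure_threshold` consecutive failures; after
    `reset_timeout` seconds it lets traffic through again (the half-open
    state is collapsed into is_open() for brevity).
    """
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when breaker tripped

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return False  # cool-down elapsed: allow a trial request
        return True

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

With `failure_threshold=2`, two consecutive FCM timeouts would trip the breaker, and subsequent sends for that provider get queued for retry instead of stacking up against a dead endpoint.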
Interviewer: "Good. One more thing β how does the flash sale scheduling work? You mentioned these happen every hour. Is that a cron job?"
Phase 7: Deep Dive - Sale Scheduling (5 minutes)
You: "Yes, this is where the distributed cron patterns from Day 5 come in. We need to ensure each sale starts exactly once, even across multiple servers."
Sale Scheduling Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED SALE SCHEDULER                      │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    ETCD COORDINATION LAYER                     │  │
│  │                                                                │  │
│  │   /elections/sale-scheduler/leader                             │  │
│  │   /config/sales/upcoming                                       │  │
│  │   /fencing/current_token                                       │  │
│  │                                                                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                 │                                    │
│                                 │ Only leader runs                   │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                     SALE SCHEDULER SERVICE                     │  │
│  │                                                                │  │
│  │   Every minute:                                                │  │
│  │     1. Check for sales starting in next 10 minutes             │  │
│  │     2. Pre-compute notification recipients                     │  │
│  │     3. Warm up Redis with sale data                            │  │
│  │                                                                │  │
│  │   At scheduled time:                                           │  │
│  │     1. Verify fencing token                                    │  │
│  │     2. Initialize inventory counter                            │  │
│  │     3. Mark sale as active                                     │  │
│  │     4. Trigger notifications                                   │  │
│  │                                                                │  │
│  │   At sale end:                                                 │  │
│  │     1. Mark sale as ended                                      │  │
│  │     2. Return unclaimed inventory                              │  │
│  │     3. Generate sale report                                    │  │
│  │                                                                │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Sale Scheduler Implementation
class FlashSaleScheduler:
"""
Distributed flash sale scheduler using Day 5 patterns:
- Leader election
- Fencing tokens
- Exactly-once scheduling
- Heartbeat monitoring
"""
def __init__(
self,
etcd_client,
redis: Redis,
db: Database,
kafka: KafkaProducer,
notification_service: NotificationService
):
self.etcd = etcd_client
self.redis = redis
self.db = db
self.kafka = kafka
self.notifications = notification_service
self.is_leader = False
self.fencing_token = 0
self.leader_election = LeaderElection(
etcd_client,
election_name="sale-scheduler"
)
async def start(self):
"""Start the scheduler."""
# Start leader election
asyncio.create_task(
self.leader_election.campaign(
on_elected=self._on_elected,
on_demoted=self._on_demoted
)
)
# Main scheduler loop
while True:
if self.is_leader:
await self._scheduler_tick()
await asyncio.sleep(1)
def _on_elected(self):
"""Called when this instance becomes leader."""
self.is_leader = True
self.fencing_token = int(time.time() * 1000)
logger.info("Became sale scheduler leader", token=self.fencing_token)
def _on_demoted(self):
"""Called when this instance loses leadership."""
self.is_leader = False
logger.info("Lost sale scheduler leadership")
async def _scheduler_tick(self):
"""Main scheduling loop tick."""
now = datetime.utcnow()
# Find sales that need action
await self._prepare_upcoming_sales(now)
await self._start_due_sales(now)
await self._end_expired_sales(now)
async def _prepare_upcoming_sales(self, now: datetime):
"""Prepare sales starting in next 10 minutes."""
upcoming = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time BETWEEN $1 AND $2
AND preparation_started = false
""", now, now + timedelta(minutes=10))
for sale in upcoming:
# Mark as preparing (idempotent)
await self.db.execute("""
UPDATE flash_sales
SET preparation_started = true,
preparation_token = $1
WHERE id = $2 AND preparation_started = false
""", self.fencing_token, sale.id)
# Pre-compute notifications (async)
asyncio.create_task(
self.notifications.prepare_recipients(sale)
)
# Warm up Redis
await self._warm_up_sale_data(sale)
async def _start_due_sales(self, now: datetime):
"""Start sales that are due."""
due_sales = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time <= $1
AND preparation_started = true
""", now)
for sale in due_sales:
await self._start_sale(sale)
async def _start_sale(self, sale):
"""
Start a single sale.
Uses fencing token to prevent double-start.
"""
# Atomically claim sale start with fencing token
claimed = await self.db.execute("""
UPDATE flash_sales
SET status = 'active',
started_at = NOW(),
started_by_token = $1
WHERE id = $2
AND status = 'scheduled'
AND (started_by_token IS NULL OR started_by_token < $1)
RETURNING id
""", self.fencing_token, sale.id)
if not claimed:
logger.warning("Could not claim sale start", sale_id=sale.id)
return
logger.info("Starting sale", sale_id=sale.id, token=self.fencing_token)
# Initialize inventory in Redis
await self.redis.set(
f"inventory:{sale.id}",
sale.total_inventory
)
# Add to active sales set
await self.redis.sadd('active_sales', sale.id)
# Trigger notifications
await self.kafka.send('notification_triggers', {
'type': 'sale_started',
'sale_id': sale.id,
'product_name': sale.product_name,
'fencing_token': self.fencing_token
})
# Schedule sale end
asyncio.create_task(
self._schedule_sale_end(sale.id, sale.duration_minutes)
)
async def _schedule_sale_end(self, sale_id: str, duration_minutes: int):
"""Schedule the end of a sale."""
await asyncio.sleep(duration_minutes * 60)
if self.is_leader: # Only end if still leader
await self._end_sale(sale_id)
async def _end_sale(self, sale_id: str):
"""End a sale and clean up."""
# Update database with fencing token
ended = await self.db.execute("""
UPDATE flash_sales
SET status = 'ended',
ended_at = NOW(),
ended_by_token = $1
WHERE id = $2
AND status = 'active'
RETURNING id
""", self.fencing_token, sale_id)
if not ended:
return
logger.info("Ending sale", sale_id=sale_id)
# Remove from active sales
await self.redis.srem('active_sales', sale_id)
# Return unclaimed inventory to report
remaining = await self.redis.get(f"inventory:{sale_id}")
# Publish sale ended event
await self.kafka.send('sale_events', {
'type': 'sale_ended',
'sale_id': sale_id,
'remaining_inventory': int(remaining) if remaining else 0,
'fencing_token': self.fencing_token
})
# Clean up Redis (keep for a while for debugging)
await self.redis.expire(f"inventory:{sale_id}", 3600)
    async def _end_expired_sales(self, now: datetime):
        """Safety net for failover: end any active sale whose end_time has
        passed. Covers the case where the previous leader died and its
        in-process _schedule_sale_end task died with it."""
        expired = await self.db.query("""
            SELECT id FROM flash_sales
            WHERE status = 'active'
              AND end_time <= $1
        """, now)
        for sale in expired:
            await self._end_sale(sale.id)
async def _warm_up_sale_data(self, sale):
"""Pre-load sale data into Redis."""
await self.redis.hset(f"sale:{sale.id}", mapping={
'product_id': sale.product_id,
'product_name': sale.product_name,
'original_price': sale.original_price,
'sale_price': sale.sale_price,
'total_inventory': sale.total_inventory,
'start_time': sale.start_time.isoformat(),
'end_time': sale.end_time.isoformat()
})
# Set TTL beyond sale end
await self.redis.expire(
f"sale:{sale.id}",
sale.duration_minutes * 60 + 3600
)
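The heart of the fencing logic is the `started_by_token < $1` comparison in `_start_sale`: a write is accepted only if it carries a token newer than the last one the database saw. A toy in-memory register makes the behavior concrete (the real check runs inside the SQL `UPDATE`; the class name here is hypothetical):

```python
class FencedRegister:
    """Toy write register guarded by a fencing token, mirroring the
    started_by_token / ended_by_token columns: a write is accepted only
    if its token is strictly newer than the last accepted token."""
    def __init__(self):
        self.last_token = None
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if self.last_token is not None and token <= self.last_token:
            return False  # stale leader's delayed write: reject
        self.last_token = token
        self.value = value
        return True

reg = FencedRegister()
reg.write(100, 'sale started')   # current leader: accepted
reg.write(90, 'sale started')    # old leader's delayed write: rejected
reg.write(101, 'sale ended')     # newer leader: accepted
```

This is why a paused or partitioned ex-leader that wakes up and replays its queued writes cannot corrupt state: its token is older than the one the new leader already recorded.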
Phase 8: Monitoring and Observability (3 minutes)
Interviewer: "How would you monitor this system? What alerts would you set up?"
You: "Monitoring is critical for a flash sale system. Here's my approach:"
Key Metrics Dashboard
╔══════════════════════════════════════════════════════════════════════╗
║                    FLASH SALE REAL-TIME DASHBOARD                    ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  CURRENT SALE: iPhone 15 Pro Lightning Deal                          ║
║  ──────────────────────────────────────────────────────────────────  ║
║                                                                      ║
║  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐     ║
║  │    INVENTORY    │   │     CLAIMS      │   │    CHECKOUTS    │     ║
║  │                 │   │                 │   │                 │     ║
║  │      2,847      │   │      7,153      │   │      6,421      │     ║
║  │    remaining    │   │     active      │   │    completed    │     ║
║  │                 │   │                 │   │                 │     ║
║  │   ▼▼▼ 450/min   │   │   ▲▲▲ 320/min   │   │   ▲▲▲ 280/min   │     ║
║  └─────────────────┘   └─────────────────┘   └─────────────────┘     ║
║                                                                      ║
║  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐     ║
║  │  REQUESTS/SEC   │   │   ERROR RATE    │   │   P99 LATENCY   │     ║
║  │                 │   │                 │   │                 │     ║
║  │     45,231      │   │      0.12%      │   │      247ms      │     ║
║  │                 │   │                 │   │                 │     ║
║  │  [███░░░░░░░]   │   │  [█░░░░░░░░░]   │   │  [█████░░░░░]   │     ║
║  │   peak: 150k    │   │   budget: 1%    │   │  budget: 500ms  │     ║
║  └─────────────────┘   └─────────────────┘   └─────────────────┘     ║
║                                                                      ║
║  CIRCUIT BREAKERS                                                    ║
║  ──────────────────────────────────────────────────────────────────  ║
║  Stripe API      ████████████████████   CLOSED   99.8% success       ║
║  Redis Cluster   ████████████████████   CLOSED   100% success        ║
║  Push (FCM)      ████████████████████   CLOSED   97.2% success       ║
║  Push (APNs)     ████████████████████   CLOSED   99.9% success       ║
║  Email (SES)     ████████████████████   CLOSED   99.7% success       ║
║                                                                      ║
║  NOTIFICATIONS                                                       ║
║  ──────────────────────────────────────────────────────────────────  ║
║  Total to send     2,400,000                                         ║
║  Sent              2,387,421 (99.5%)                                 ║
║  Failed            12,579 (0.5%)                                     ║
║  Pending           0                                                 ║
║  Time elapsed      18 seconds                                        ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
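The dashboard's p99 and error-rate cells come from simple aggregations over raw request data. A sketch of the underlying arithmetic, using illustrative sample values (not the dashboard's real data; production systems compute p99 from histogram buckets rather than raw samples):

```python
def p99(latencies_ms: list) -> float:
    """Nearest-rank 99th percentile over a raw latency sample."""
    s = sorted(latencies_ms)
    rank = max(0, int(len(s) * 0.99) - 1)  # 0-based nearest-rank index
    return s[rank]

def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 for an empty window."""
    return errors / total if total else 0.0

# Illustrative window: 100 requests with latencies 1..100 ms, 12 errors
sample = list(range(1, 101))
print(p99(sample))                  # 99
print(error_rate(12, 10_000))       # 0.0012 -> the dashboard's "0.12%"
```

At real traffic volumes you never ship raw samples to a dashboard; you pre-bucket latencies per server (as the `histogram_quantile` alert rules do) and merge buckets centrally, trading a little accuracy for constant memory.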
Critical Alerts
# Prometheus Alert Rules
groups:
  - name: flash_sale_critical
    rules:
      # INVENTORY ALERTS
      - alert: InventoryOversold
        expr: flash_sale_inventory < 0
        for: 0s
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Inventory oversold for sale {{ $labels.sale_id }}"

      - alert: InventoryNotDecreasing
        expr: rate(flash_sale_inventory[1m]) >= 0 and flash_sale_active == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Inventory not moving during active sale"

      # LATENCY ALERTS
      - alert: ClaimLatencyHigh
        expr: histogram_quantile(0.99, rate(claim_request_duration_seconds_bucket[1m])) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "P99 claim latency {{ $value }}s exceeds 500ms"

      - alert: CheckoutLatencyHigh
        expr: histogram_quantile(0.99, rate(checkout_request_duration_seconds_bucket[1m])) > 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "P99 checkout latency {{ $value }}s exceeds 3s"

      # ERROR RATE ALERTS
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"

      # CIRCUIT BREAKER ALERTS
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2  # 2 = OPEN
        for: 0s
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"

      # INFRASTRUCTURE ALERTS
      - alert: RedisHighLatency
        expr: redis_command_duration_seconds_p99 > 0.01
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis P99 latency {{ $value }}s exceeds 10ms"

      - alert: KafkaLag
        expr: kafka_consumer_lag > 10000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag {{ $value }} exceeds 10k"
Phase 9: Wrap-Up and Extensions (2 minutes)
Interviewer: "We're almost out of time. Let's quickly discuss: what would you do differently at 10x the scale?"
You: "Great question. At 10x scale (50M concurrent users, 100K inventory), I'd make these changes:"
Scaling to 10x
CURRENT → 10x SCALE

TRAFFIC HANDLING:
├── Current: 150K req/sec peak
├── 10x: 1.5M req/sec peak
└── Solution:
    - Geographic distribution (multiple regions)
    - Regional inventory pools
    - Edge caching for sale pages
    - WebSocket for inventory updates (reduce polling)

INVENTORY MANAGEMENT:
├── Current: Single Redis cluster
├── 10x: Redis cluster can't handle atomic ops at this rate
└── Solution:
    - Shard inventory by region
    - Eventual consistency for display count
    - Strong consistency only for actual claims
    - Consider CockroachDB for distributed transactions

NOTIFICATION DELIVERY:
├── Current: 2.4M in 30 seconds
├── 10x: 24M in 30 seconds
└── Solution:
    - Pre-send notifications (hint: "sale starting in 5 seconds")
    - Progressive rollout by user segment
    - More aggressive batching to FCM/APNs

DATABASE:
├── Current: PostgreSQL primary + 2 replicas
├── 10x: Single primary becomes bottleneck
└── Solution:
    - Vitess or CockroachDB for horizontal scaling
    - Event sourcing for order creation
    - Read from cache, async write to DB
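"Shard inventory by region" deserves a sketch: each region gets its own counter, claims hit the local shard first, and only when the local shard is exhausted do they spill over to remote shards (paying cross-region latency). A pure-Python stand-in for per-region Redis counters; the class and allocation shape are assumptions for illustration:

```python
class ShardedInventory:
    """Per-region inventory counters with spill-over. Stand-in for one
    Redis DECR counter per region; single-threaded toy, not a real
    distributed implementation."""
    def __init__(self, allocation: dict):
        self.shards = dict(allocation)  # region -> remaining units

    def claim(self, region: str) -> bool:
        # Fast path: decrement the caller's local shard
        if self.shards.get(region, 0) > 0:
            self.shards[region] -= 1
            return True
        # Spill-over: steal a unit from any shard that still has stock
        # (slower: this is the cross-region hop)
        for other, left in self.shards.items():
            if left > 0:
                self.shards[other] -= 1
                return True
        return False  # globally sold out

inv = ShardedInventory({'us': 2, 'eu': 1})
inv.claim('us')   # local hit, us: 2 -> 1
inv.claim('us')   # local hit, us: 1 -> 0
inv.claim('us')   # local empty, spills over to eu
inv.claim('eu')   # everything empty: claim fails
```

The trade-off this buys at 10x: the hot path is a single-region atomic op, and the expensive cross-region coordination only happens near sell-out, exactly when a slightly slower "sorry, checking other regions" path is acceptable.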
Alternative Approaches Considered
You: "I should also mention alternatives I considered but didn't choose:"
ALTERNATIVE: Queue-Based Claims
Instead of: Direct Redis DECR for claims
Alternative: Put all claim requests in a queue, process in order
Pros: Perfect ordering, simpler consistency
Cons: Higher latency, doesn't match "instant" UX requirement
Decision: Direct Redis better for our latency requirements
ALTERNATIVE: Reservation-Style Inventory
Instead of: Decrement on claim
Alternative: Decrement only on successful checkout
Pros: No need for claim expiry handling
Cons: Inventory shows as available but isn't, worse UX
Decision: Claim-then-checkout better for transparency
ALTERNATIVE: Distributed Lock per Item
Instead of: Atomic counter
Alternative: Lock each inventory unit individually
Pros: Fine-grained control
Cons: 10,000 locks = complexity nightmare
Decision: Atomic counter much simpler
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, handled the scale estimation well, and made good trade-off decisions. Any questions for me?"
You: "Thank you! I'd love to hear how MegaMart currently handles this, and what challenges you've faced in production."
Summary: Week 1-2 Concepts Applied
Week 1 Concepts (Foundations of Scale)
| Concept | Application in This Design |
|---|---|
| Horizontal vs Vertical Scaling | API services scale horizontally, Redis and PostgreSQL scale with clustering |
| Database Partitioning | Orders partitioned by date, notifications batched by user_id hash |
| Caching Strategies | CDN for static content, Redis for hot data, 1-second TTL for inventory count |
| Load Balancing | ALB across API servers, partition-aware Kafka consumers |
| Message Queues | Kafka for event-driven architecture, async notification processing |
Week 2 Concepts (Failure-First Design)
| Day | Concept | Application |
|---|---|---|
| Day 1 | Timeouts | Payment timeout (10s), checkout budget (25s), claim validation |
| Day 2 | Idempotency | Claim idempotency key, checkout idempotency, Stripe idempotency_key |
| Day 3 | Circuit Breakers | Stripe circuit, FCM/APNs circuits, Redis circuit |
| Day 4 | Webhooks | Notification delivery, at-least-once semantics, retry with backoff |
| Day 5 | Distributed Cron | Sale scheduling, leader election, fencing tokens |
Code Patterns Demonstrated
1. ATOMIC OPERATIONS
- Redis Lua scripts for claim
- Database transactions for orders
2. IDEMPOTENCY IMPLEMENTATION
- Check-before-execute pattern
- Idempotency store with TTL
3. CIRCUIT BREAKER INTEGRATION
- Check before external calls
- Record success/failure
- Fallback behavior
4. TIMEOUT BUDGETS
- Total budget for operation
- Remaining budget propagation
- Timeout on all external calls
5. LEADER ELECTION
- etcd-based election
- Fencing token validation
- Graceful failover
6. EVENT-DRIVEN ARCHITECTURE
- Kafka for event publishing
- Consumer groups for parallel processing
- At-least-once delivery
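Pattern 1 ("Redis Lua scripts for claim") is worth seeing concretely. The exact production script isn't shown in the design above, so the Lua below is a hedged sketch with assumed key names, paired with a pure-Python model of its semantics for reasoning and testing:

```python
# Hypothetical claim script: KEYS[1] = inventory counter,
# KEYS[2] = set of users who already claimed, ARGV[1] = user_id.
# Runs atomically on the Redis server, so check-and-decrement can't race.
CLAIM_SCRIPT = """
if redis.call('SISMEMBER', KEYS[2], ARGV[1]) == 1 then
    return -1                          -- duplicate claim
end
local left = tonumber(redis.call('GET', KEYS[1]) or '0')
if left <= 0 then
    return -2                          -- sold out
end
redis.call('DECR', KEYS[1])
redis.call('SADD', KEYS[2], ARGV[1])
return left - 1                        -- units remaining after this claim
"""

def claim(state: dict, claimed: set, user_id: str) -> int:
    """Pure-Python model of CLAIM_SCRIPT's semantics (no Redis needed)."""
    if user_id in claimed:
        return -1                      # duplicate claim
    if state['inventory'] <= 0:
        return -2                      # sold out
    state['inventory'] -= 1
    claimed.add(user_id)
    return state['inventory']

state, claimed = {'inventory': 2}, set()
claim(state, claimed, 'u1')   # 1 unit left
claim(state, claimed, 'u1')   # -1: duplicate rejected (idempotent claim)
claim(state, claimed, 'u2')   # 0: last unit claimed
claim(state, claimed, 'u3')   # -2: sold out
```

Because the whole script executes atomically inside Redis, two users racing for the last unit cannot both pass the `left <= 0` check, which is the property a naive GET-then-DECR from the application would lose.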
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Estimate traffic and storage requirements from business requirements
- Design a system that handles massive traffic spikes
- Implement atomic inventory operations without race conditions
- Integrate idempotency at multiple levels (claim, checkout, notifications)
- Apply circuit breakers to protect against external service failures
- Design a notification system that delivers millions of messages quickly
- Implement distributed scheduling with exactly-once semantics
- Set up meaningful monitoring and alerting
- Discuss trade-offs and alternatives clearly
- Handle follow-up questions about scaling and edge cases
This capstone problem integrates all concepts from Weeks 1-2 of the System Design Mastery Series. Use this as a template for approaching similar interview problems.