Week 1-2 Capstone: The Ultimate System Design Interview
A Real-World Problem Covering Everything You've Learned
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions ā this is meant to be collaborative."
They write on the whiteboard:
────────────────────────────────────────────────────────────────────────────

  Design a Global Flash Sale System for "MegaMart"

  MegaMart is launching "Lightning Deals": flash sales where limited
  inventory items are sold at a 90% discount for exactly 10 minutes.

  - 50 million users globally
  - Flash sales happen every hour (random products)
  - Each sale: 10,000 units available
  - Users get 3 minutes to complete checkout after claiming
  - Must handle payment processing and inventory
  - Must notify users via email/push when deals go live

────────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions. This is crucial: never assume.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, when you say 50 million users globally ā is that monthly active users, or do we expect all of them potentially accessing a flash sale?"
Interviewer: "Good question. 50M is MAU. For popular flash sales, we've seen up to 5 million users trying to access a single sale. The product team calls these 'hype drops' ā think iPhone launches or limited edition sneakers."
You: "That's helpful. For the 10,000 units ā do users select quantities, or is it one unit per user?"
Interviewer: "One unit per user per sale. We want to maximize the number of happy customers."
You: "What happens if someone claims an item but doesn't complete checkout within 3 minutes?"
Interviewer: "The item should go back to available inventory for others."
You: "For payment processing ā do you have an existing payment provider, or should I design that?"
Interviewer: "Assume we use Stripe. Treat it as an external API that might be slow or fail."
You: "Last question ā the notification when deals go live. Are users subscribed to specific products, or do we notify everyone?"
Interviewer: "Users can 'watch' products. When a flash sale includes a watched product, they should be notified. But we also have a general 'deals' notification channel for users who opted in. Could be millions of notifications per sale."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. FLASH SALE MANAGEMENT
- Create flash sales with specific products and inventory
- Sales start at scheduled times (hourly)
- Sales last exactly 10 minutes
- 10,000 units per sale
2. INVENTORY CLAIMING
- Users can claim one item per sale
- Claimed items reserved for 3 minutes
- Unclaimed items return to pool after timeout
- No overselling (exactly 10,000 successful orders max)
3. CHECKOUT FLOW
- Complete payment within 3 minutes of claim
- Process payment via Stripe
- Handle payment failures gracefully
- Create order on successful payment
4. NOTIFICATIONS
- Notify watchers when watched product goes on sale
- Notify deal subscribers when any sale starts
- Support email and push notifications
- Millions of notifications per sale
Non-Functional Requirements
1. SCALE
- Handle 5M concurrent users hitting a single sale
- Process 10K checkouts in 3-minute window
- Send millions of notifications within seconds
2. RELIABILITY
- No overselling (inventory consistency is critical)
- No double-charging (payment idempotency)
- No lost orders (durability)
3. AVAILABILITY
- System must work during sale windows
- Graceful degradation if components fail
4. LATENCY
- Claim response: <500ms at p99
- Checkout: <3s at p99
- Notification delivery: <30s for email, <5s for push
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Traffic Estimation
FLASH SALE TRAFFIC (per sale)
Users trying to access: 5,000,000
Sale duration: 10 minutes = 600 seconds
Average request rate: 5M / 600 = 8,333 requests/second
But traffic isn't uniform. The first 30 seconds see 80% of traffic:
Peak traffic: (5M × 0.8) / 30 = 133,333 requests/second
Let's round up for safety: 150,000 requests/second peak
Breakdown by operation:
├── Page views:     100,000 /sec  (viewing the sale)
├── Claim attempts:  40,000 /sec  (trying to claim)
├── Stock checks:    10,000 /sec  (AJAX refreshes)
└── Checkouts:           50 /sec  (successful claimers)
Inventory Operations
INVENTORY MATH
Total inventory: 10,000 units
Claim timeout: 3 minutes
Successful checkout rate: ~70% (estimate)
If all 10K claimed in first 30 seconds:
├── 7,000 complete checkout
├── 3,000 time out → return to pool
└── Pool refills trigger a second wave of claims
Maximum claim operations: ~15,000 (accounting for timeouts)
Checkout operations: ~10,000 (until inventory exhausted)
Claims per second (peak): 10,000 / 30 = 333 claims/second
Checkouts per second (peak): 10,000 / 180 = 56 checkouts/second
Notification Estimation
NOTIFICATION VOLUME
Product watchers: 500,000 (popular product)
Deal subscribers: 2,000,000
Overlap (watching + subscribed): ~100,000
Total notifications: 2,400,000
Delivery target: 30 seconds for all
Notification rate: 2.4M / 30 = 80,000 notifications/second
Push notification payload: ~500 bytes
Email payload: ~5 KB
Bandwidth:
├── Push: 80,000 × 500 bytes = 40 MB/sec
└── Email: 30,000 × 5 KB = 150 MB/sec
Storage Estimation
STORAGE REQUIREMENTS
Per sale:
├── Sale metadata: ~1 KB
├── Inventory records: 10,000 × 100 bytes = 1 MB
├── Claim records: 15,000 × 200 bytes = 3 MB
├── Order records: 10,000 × 500 bytes = 5 MB
└── Total per sale: ~10 MB
Sales per day: 24
Storage per day: ~240 MB
Storage per year: ~90 GB
Notification logs:
├── Per notification: 200 bytes
├── Per sale: 2.4M × 200 bytes = 480 MB
└── Per year: ~4 TB
Hot data (active sales): ~100 MB
Warm data (past week): ~2 GB
Cold data (archive): Compress and store in S3
Infrastructure Estimation
SERVER REQUIREMENTS
API Servers (150K req/sec):
├── Per-server capacity: 5,000 req/sec
├── Servers needed: 30 servers
└── With 2x headroom: 60 servers

Redis (inventory + claims):
├── Operations/sec: 50,000 (reads + writes)
├── Memory for hot data: ~1 GB per sale
└── Cluster: 3 primaries + 3 replicas

PostgreSQL (orders, users):
├── Write TPS: 500 (orders only)
├── Read TPS: 5,000
└── Single primary + 2 read replicas

Message Queue (notifications):
├── Messages/sec: 100,000
├── Kafka cluster: 6 brokers, 3 partitions
└── Consumer groups: Push, Email, Analytics
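These back-of-the-envelope numbers are easy to sanity-check in code. Below is a small Python sketch that simply restates the assumptions above as constants and reproduces the headline figures; it is illustrative, not part of the design.

# Sanity check of the estimates above; constants restate the stated assumptions.
USERS_PER_SALE   = 5_000_000
PEAK_FRACTION    = 0.8          # share of traffic in the first 30 seconds
PEAK_WINDOW_S    = 30
INVENTORY        = 10_000
CLAIM_TTL_S      = 180
NOTIFICATIONS    = 2_400_000
NOTIFY_SLA_S     = 30
PER_SERVER_RPS   = 5_000
ROUNDED_PEAK_RPS = 150_000      # 133K rounded up for safety

peak_rps        = USERS_PER_SALE * PEAK_FRACTION / PEAK_WINDOW_S   # ~133,333
claims_per_s    = INVENTORY / PEAK_WINDOW_S                        # ~333
checkouts_per_s = INVENTORY / CLAIM_TTL_S                          # ~56
notif_per_s     = NOTIFICATIONS / NOTIFY_SLA_S                     # 80,000
api_servers     = ROUNDED_PEAK_RPS / PER_SERVER_RPS                # 30, or 60 with 2x headroom

print(f"peak: {peak_rps:,.0f} req/s, claims: {claims_per_s:.0f}/s, "
      f"checkouts: {checkouts_per_s:.0f}/s, notifications: {notif_per_s:,.0f}/s, "
      f"API servers with 2x headroom: {int(api_servers) * 2}")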
Interviewer: "Good analysis. Those numbers seem reasonable. What concerns you most about the scale?"
You: "Three things concern me:
-
The 150K requests/second peak ā this is a thundering herd problem. We need aggressive caching and potentially a queue-based claim system.
-
The inventory consistency ā with 40K claim attempts per second on 10K items, we need atomic operations. Race conditions could cause overselling.
-
The notification delivery ā 80K notifications/second is achievable, but we need to pre-compute recipient lists before the sale starts, not at sale time."
Phase 3: High-Level Design (10 minutes)
You: "Let me sketch out the high-level architecture."
──────────────────────────────────────────────────────────────────────────────
                       MEGAMART FLASH SALE ARCHITECTURE
──────────────────────────────────────────────────────────────────────────────

  USERS (5M)
      │
      ▼
  CDN + WAF + Load Balancer (CloudFront + Shield + ALB)
      │
      ▼
  API LAYER
  ├── Sale Page Service     (read-heavy,  60 servers)
  ├── Claim API Service     (write-heavy, 20 servers)
  └── Checkout API Service  (critical,    20 servers)
      │
      ▼
  REDIS CLUSTER
  ├── Inventory Counter
  ├── Claims Store
  └── Rate Limit Cache
      │
      ▼
  DATA AND EXTERNAL SERVICES
  ├── PostgreSQL  (Orders DB, primary + 2 read replicas)
  ├── Kafka       (events / notifications, 6 brokers)
  └── Stripe      (payments, external API)
      │
      ▼
  KAFKA CONSUMERS
  ├── Push Workers
  ├── Email Workers
  └── Webhook Delivery

  SUPPORTING SERVICES
  ├── Sale Scheduler (cron)
  ├── Inventory Manager
  ├── Claim Expiry
  └── Monitoring (Grafana)
──────────────────────────────────────────────────────────────────────────────
Component Overview
You: "Let me walk through each component and its role:"
Traffic Layer
CDN + WAF + Load Balancer
├── CloudFront: Cache static assets (sale page HTML, JS, CSS)
├── AWS Shield: DDoS protection (critical during sale spikes)
├── WAF: Rate limiting, bot detection
└── ALB: Route to the appropriate service based on path
Benefits:
- CDN handles 80% of traffic (static content)
- WAF blocks abusive traffic before hitting servers
- Geographic distribution reduces latency
API Services
Sale Page Service (Read-Heavy)
├── Renders sale page with current inventory count
├── Heavy caching (1-second TTL on inventory count)
├── Stateless, horizontally scalable
└── Circuit breaker to Redis

Claim API Service (Write-Heavy)
├── Handles claim requests
├── Atomic inventory operations
├── Idempotent (claim key per user per sale)
└── Rate limited per user

Checkout API Service (Critical)
├── Processes payments through Stripe
├── Creates orders
├── Idempotent (prevents double-charge)
├── Timeout-aware (3-minute claim expiry)
└── Circuit breaker to Stripe
Data Layer
Redis Cluster
├── Inventory Counter: Atomic decrement with Lua scripts
├── Claims Store: user_id → claim_info with TTL
├── Rate Limit: Sliding window counters
└── Session Cache: User session data

PostgreSQL
├── Primary: Writes (orders, users)
├── Read Replicas: Read queries
├── Orders partitioned by date
└── Indexed on user_id, sale_id, order_status
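To make the partitioning and indexing above concrete, here is a sketch of what the orders schema could look like, run through asyncpg. Column names, index names, and the connection string are illustrative assumptions, not a confirmed schema.

# Illustrative orders schema: date-partitioned, indexed as described above.
import asyncio
import asyncpg

DDL = """
CREATE TABLE IF NOT EXISTS orders (
    id           TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    sale_id      TEXT NOT NULL,
    claim_id     TEXT NOT NULL,
    payment_id   TEXT NOT NULL,
    amount       BIGINT NOT NULL,
    order_status TEXT NOT NULL DEFAULT 'confirmed',
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- One partition per day; a maintenance job would create these ahead of time.
CREATE TABLE IF NOT EXISTS orders_2024_01_15
    PARTITION OF orders
    FOR VALUES FROM ('2024-01-15') TO ('2024-01-16');

CREATE INDEX IF NOT EXISTS idx_orders_user_id ON orders (user_id);
CREATE INDEX IF NOT EXISTS idx_orders_sale_id ON orders (sale_id);
CREATE INDEX IF NOT EXISTS idx_orders_status  ON orders (order_status);
"""

async def create_schema(dsn: str) -> None:
    conn = await asyncpg.connect(dsn)   # e.g. an assumed postgres://.../megamart DSN
    try:
        await conn.execute(DDL)
    finally:
        await conn.close()

# asyncio.run(create_schema("postgres://localhost/megamart"))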
Event Processing
Kafka
├── Topic: sale_events (sale start, sale end, inventory updates)
├── Topic: claim_events (claimed, expired, checkout_started)
├── Topic: notification_events (to_send, sent, failed)
└── Topic: order_events (created, paid, failed)

Consumer Groups:
├── NotificationWorkers: Send push/email
├── AnalyticsWorkers: Real-time dashboards
├── WebhookWorkers: Notify external systems
└── ClaimExpiryWorkers: Return expired claims to pool
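To make the consumer-group idea concrete, here is a minimal sketch of an analytics consumer on the claim_events topic using aiokafka. The topic and group names follow the design above; the broker address and handler body are assumptions.

# Minimal consumer-group sketch for claim_events, assuming aiokafka is available.
import asyncio
import json
from aiokafka import AIOKafkaConsumer

async def run_analytics_worker() -> None:
    consumer = AIOKafkaConsumer(
        "claim_events",
        bootstrap_servers="kafka:9092",   # assumed broker address
        group_id="analytics-workers",     # one consumer group per worker type
    )
    await consumer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            if event.get("event_type") == "claimed":
                # e.g. update a real-time dashboard counter here
                print(event["sale_id"], event["remaining_inventory"])
    finally:
        await consumer.stop()

# asyncio.run(run_analytics_worker())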
Interviewer: "This looks comprehensive. I'm curious about a few things. First, how do you prevent overselling? Walk me through the claim flow."
Phase 4: Deep Dive - Inventory and Claims (10 minutes)
You: "Great question. This is the most critical part of the system. Let me detail the claim flow."
The Claim Flow
USER CLICKS "CLAIM DEAL"
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 1: RATE LIMITING ā
ā ā
ā Check: Has user exceeded 10 claim attempts in last minute? ā
ā Implementation: Redis sliding window ā
ā ā
ā INCRBY rate_limit:{user_id}:{minute} 1 ā
ā EXPIRE rate_limit:{user_id}:{minute} 60 ā
ā ā
ā If count > 10: Return 429 Too Many Requests ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 2: IDEMPOTENCY CHECK ā
ā ā
ā Check: Has user already claimed in this sale? ā
ā Key: claim:{sale_id}:{user_id} ā
ā ā
ā If exists: Return existing claim (idempotent response) ā
ā ā
ā This prevents double-claiming and handles retries safely. ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 3: ATOMIC CLAIM (Lua Script in Redis) ā
ā ā
ā -- Lua script for atomic claim ā
ā local inventory_key = KEYS[1] -- inventory:{sale_id} ā
ā local claim_key = KEYS[2] -- claim:{sale_id}:{user_id} ā
ā local user_id = ARGV[1] ā
ā local claim_id = ARGV[2] ā
ā local ttl = ARGV[3] -- 180 seconds ā
ā ā
ā -- Check if already claimed ā
ā if redis.call('EXISTS', claim_key) == 1 then ā
ā return redis.call('GET', claim_key) -- Return existing claim ā
ā end ā
ā ā
ā -- Try to decrement inventory ā
ā local remaining = redis.call('DECR', inventory_key) ā
ā ā
ā if remaining < 0 then ā
ā -- No inventory, restore counter ā
ā redis.call('INCR', inventory_key) ā
ā return nil -- Sold out ā
ā end ā
ā ā
ā -- Success! Store claim with TTL ā
ā local claim_data = cjson.encode({ ā
ā claim_id = claim_id, ā
ā user_id = user_id, ā
ā claimed_at = redis.call('TIME')[1], ā
ā expires_at = redis.call('TIME')[1] + ttl ā
ā }) ā
ā ā
ā redis.call('SET', claim_key, claim_data, 'EX', ttl) ā
ā ā
ā -- Add to expiry tracking set ā
ā redis.call('ZADD', 'claims:expiry:' .. KEYS[1], ā
ā redis.call('TIME')[1] + ttl, claim_key) ā
ā ā
ā return claim_data ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 4: PUBLISH EVENT ā
ā ā
ā Kafka: claim_events ā
ā { ā
ā "event_type": "claimed", ā
ā "claim_id": "clm_abc123", ā
ā "sale_id": "sale_xyz", ā
ā "user_id": "usr_456", ā
ā "expires_at": "2024-01-15T10:03:00Z", ā
ā "remaining_inventory": 9523 ā
ā } ā
ā ā
ā Consumers: ā
ā - AnalyticsWorker: Update real-time dashboard ā
ā - WebsocketBroadcaster: Push inventory count to browsers ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 5: RETURN RESPONSE ā
ā ā
ā { ā
ā "status": "claimed", ā
ā "claim_id": "clm_abc123", ā
ā "expires_at": "2024-01-15T10:03:00Z", ā
ā "seconds_remaining": 180, ā
ā "checkout_url": "/checkout/clm_abc123" ā
ā } ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Why This Works
You: "This design prevents overselling through several mechanisms:"
OVERSELLING PREVENTION
1. ATOMIC OPERATIONS
- The Lua script runs atomically in Redis
- No race condition between check and decrement
- DECR is atomic, returns new value immediately
2. IDEMPOTENCY
- Checking existing claim BEFORE decrementing
- Same user retrying gets same response
- No double-claiming even under network issues
3. BOUNDED INVENTORY
- Counter starts at exactly 10,000
- Can only decrement, never go negative (we restore on <0)
- TTL ensures claims expire automatically
4. NO DATABASE IN HOT PATH
- Redis handles all claim logic
- PostgreSQL only involved at checkout (much lower rate)
- Removes database as bottleneck
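To show how the Claim API service might drive the atomic Lua script above, here is a minimal sketch using redis.asyncio. The key layout matches the script (inventory, claim, and expiry keys); the handler shape and helper names (ClaimService, CLAIM_LUA) are illustrative assumptions.

# Sketch of the Claim API hot path, assuming redis.asyncio and the Lua script above.
import json
import uuid
from typing import Optional
from redis.asyncio import Redis

CLAIM_TTL_SECONDS = 180
CLAIM_LUA = "..."   # the atomic claim script shown in Step 3

class ClaimService:
    def __init__(self, redis: Redis):
        self.redis = redis
        # register_script caches the script and EVALSHAs it on each call
        self.claim_script = redis.register_script(CLAIM_LUA)

    async def claim(self, sale_id: str, user_id: str) -> Optional[dict]:
        claim_id = f"clm_{uuid.uuid4().hex[:12]}"
        result = await self.claim_script(
            keys=[
                f"inventory:{sale_id}",
                f"claim:{sale_id}:{user_id}",
                f"claims:expiry:{sale_id}",
            ],
            args=[user_id, claim_id, CLAIM_TTL_SECONDS],
        )
        if result is None:
            return None               # sold out: the API layer returns an error to the client
        return json.loads(result)     # existing or newly created claim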
Claim Expiry Handling
You: "When a claim expires without checkout, we need to return inventory:"
import asyncio
import time

class ClaimExpiryWorker:
"""
Background worker that returns expired claims to inventory.
Runs continuously, processing expired claims.
"""
def __init__(self, redis_client):
self.redis = redis_client
async def run(self):
"""Main worker loop."""
while True:
await self.process_expired_claims()
await asyncio.sleep(1) # Check every second
async def process_expired_claims(self):
"""Find and process expired claims."""
now = int(time.time())
# Get all active sales
active_sales = await self.redis.smembers('active_sales')
for sale_id in active_sales:
expiry_key = f'claims:expiry:{sale_id}'
inventory_key = f'inventory:{sale_id}'
# Get claims that have expired
expired = await self.redis.zrangebyscore(
expiry_key,
min=0,
max=now
)
for claim_key in expired:
# Atomically return inventory and remove claim
await self.return_claim_to_inventory(
sale_id, claim_key, inventory_key, expiry_key
)
async def return_claim_to_inventory(
self, sale_id, claim_key, inventory_key, expiry_key
):
"""Return a single expired claim to inventory."""
# Lua script for atomic return
script = """
local claim_key = KEYS[1]
local inventory_key = KEYS[2]
local expiry_key = KEYS[3]
-- Check if claim still exists (might have been checked out)
if redis.call('EXISTS', claim_key) == 0 then
-- Claim was checked out, just remove from expiry set
redis.call('ZREM', expiry_key, claim_key)
return 0
end
-- Claim exists and expired, return to inventory
redis.call('DEL', claim_key)
redis.call('INCR', inventory_key)
redis.call('ZREM', expiry_key, claim_key)
return 1
"""
returned = await self.redis.eval(
script,
keys=[claim_key, inventory_key, expiry_key]
)
if returned:
# Publish event
await self.publish_claim_expired(sale_id, claim_key)
Interviewer: "Nice. What about the checkout flow? How do you handle payment failures and ensure orders aren't double-charged?"
Phase 5: Deep Dive - Checkout Flow (10 minutes)
You: "The checkout flow is where everything comes together ā timeouts, idempotency, circuit breakers. Let me walk through it."
Checkout Architecture
USER CLICKS "COMPLETE PURCHASE"
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 1: VALIDATE CLAIM ā
ā ā
ā Check claim exists and hasn't expired: ā
ā ā
ā claim_data = redis.get(f"claim:{sale_id}:{user_id}") ā
ā ā
ā if not claim_data: ā
ā return 410 Gone "Your claim has expired" ā
ā ā
ā if claim_data.checkout_started: ā
ā return 409 Conflict "Checkout already in progress" ā
ā ā
ā # Mark checkout started (prevent concurrent checkouts) ā
ā redis.hset(claim_key, "checkout_started", timestamp) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 2: IDEMPOTENCY CHECK ā
ā ā
ā Check if this checkout was already processed: ā
ā ā
ā Key: idempotency:{claim_id} ā
ā ā
ā existing = await idempotency_store.get(claim_id) ā
ā if existing: ā
ā if existing.status == "completed": ā
ā return existing.response # Return same successful response ā
ā if existing.status == "processing": ā
ā return 409 "Payment in progress" ā
ā if existing.status == "failed": ā
ā # Allow retry ā
ā pass ā
ā ā
ā # Record processing started ā
ā await idempotency_store.set(claim_id, { ā
ā status: "processing", ā
ā started_at: now ā
ā }, ttl=3600) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 3: PROCESS PAYMENT (with timeout + circuit breaker) ā
ā ā
ā # Check circuit breaker first (Day 3 pattern) ā
ā if stripe_circuit.is_open: ā
ā await idempotency_store.set(claim_id, {status: "failed"}) ā
ā return 503 "Payment service temporarily unavailable" ā
ā ā
ā # Call Stripe with timeout (Day 1 pattern) ā
ā try: ā
ā payment = await asyncio.wait_for( ā
ā stripe.charges.create( ā
ā amount=sale.discounted_price, ā
ā currency="usd", ā
ā customer=user.stripe_customer_id, ā
ā idempotency_key=f"claim_{claim_id}", # Day 2 pattern ā
ā metadata={"sale_id": sale_id, "claim_id": claim_id} ā
ā ), ā
ā timeout=10.0 # 10 second timeout ā
ā ) ā
ā stripe_circuit.record_success() ā
ā ā
ā except asyncio.TimeoutError: ā
ā stripe_circuit.record_failure() ā
ā # Don't fail yet - payment might have succeeded! ā
ā payment = await verify_payment_status(claim_id) ā
ā if not payment: ā
ā await idempotency_store.set(claim_id, {status: "failed"}) ā
ā return 504 "Payment timeout - please try again" ā
ā ā
ā except StripeError as e: ā
ā stripe_circuit.record_failure() ā
ā await idempotency_store.set(claim_id, {status: "failed", error: e}) ā
ā return 400 f"Payment failed: {e.message}" ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 4: CREATE ORDER (transactionally) ā
ā ā
ā async with database.transaction(): ā
ā # Create order record ā
ā order = await Order.create( ā
ā id=generate_order_id(), ā
ā user_id=user_id, ā
ā sale_id=sale_id, ā
ā claim_id=claim_id, ā
ā amount=payment.amount, ā
ā payment_id=payment.id, ā
ā status="confirmed" ā
ā ) ā
ā ā
ā # Record in idempotency store ā
ā await idempotency_store.set(claim_id, { ā
ā status: "completed", ā
ā order_id: order.id, ā
ā response: {order_id: order.id, status: "confirmed"} ā
ā }, ttl=86400 * 7) # Keep for 7 days ā
ā ā
ā # Delete claim (inventory already decremented) ā
ā await redis.delete(f"claim:{sale_id}:{user_id}") ā
ā await redis.zrem(f"claims:expiry:{sale_id}", ā
ā f"claim:{sale_id}:{user_id}") ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 5: PUBLISH EVENTS (async, non-blocking) ā
ā ā
ā # Publish to Kafka - don't wait for confirmation ā
ā await kafka.send_async("order_events", { ā
ā "event_type": "order.created", ā
ā "order_id": order.id, ā
ā "user_id": user_id, ā
ā "sale_id": sale_id, ā
ā "amount": payment.amount ā
ā }) ā
ā ā
ā # Queue confirmation email (Day 4 - webhook pattern) ā
ā await notification_queue.send({ ā
ā "type": "email", ā
ā "template": "order_confirmation", ā
ā "user_id": user_id, ā
ā "order_id": order.id ā
ā }) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
ā¼
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā STEP 6: RETURN SUCCESS ā
ā ā
ā { ā
ā "status": "confirmed", ā
ā "order_id": "ord_xyz789", ā
ā "message": "Congratulations! Your order is confirmed.", ā
ā "receipt_url": "/orders/ord_xyz789/receipt" ā
ā } ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
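The flow above leans on an idempotency_store with simple get/set semantics. A minimal Redis-backed sketch is shown below; the key prefix and JSON encoding are assumptions. A production version would likely use SET NX for the initial "processing" record so two concurrent checkouts cannot both claim it.

# Minimal Redis-backed idempotency store, as assumed by the checkout flow above.
import json
from typing import Optional
from redis.asyncio import Redis

class IdempotencyStore:
    def __init__(self, redis: Redis, prefix: str = "idempotency"):
        self.redis = redis
        self.prefix = prefix

    def _key(self, claim_id: str) -> str:
        return f"{self.prefix}:{claim_id}"

    async def get(self, claim_id: str) -> Optional[dict]:
        raw = await self.redis.get(self._key(claim_id))
        return json.loads(raw) if raw else None

    async def set(self, claim_id: str, record: dict, ttl: int = 3600) -> None:
        # Overwrites any previous record; callers move it from
        # "processing" to "completed"/"failed" as the checkout progresses.
        await self.redis.set(self._key(claim_id), json.dumps(record), ex=ttl)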
Complete Checkout Code
import asyncio
import time
from datetime import datetime
from typing import Optional

class CheckoutService:
"""
Checkout service implementing all Week 1-2 patterns:
- Timeouts (Day 1)
- Idempotency (Day 2)
- Circuit Breakers (Day 3)
- Async Events (Day 4)
"""
def __init__(
self,
redis: Redis,
db: Database,
stripe_client: Stripe,
kafka: KafkaProducer,
idempotency_store: IdempotencyStore,
circuit_breaker: CircuitBreaker
):
self.redis = redis
self.db = db
self.stripe = stripe_client
self.kafka = kafka
self.idempotency = idempotency_store
self.circuit = circuit_breaker
# Timeout configuration (Day 1)
self.payment_timeout = 10.0
self.db_timeout = 5.0
self.total_timeout = 25.0 # Budget for entire checkout
async def checkout(
self,
user_id: str,
sale_id: str,
claim_id: str
) -> CheckoutResult:
"""
Process checkout with full reliability guarantees.
"""
# Start timeout budget (Day 1)
deadline = time.time() + self.total_timeout
# Step 1: Validate claim
claim = await self._validate_claim(user_id, sale_id, claim_id)
if not claim:
return CheckoutResult(
success=False,
error="Claim expired or invalid"
)
# Step 2: Idempotency check (Day 2)
existing = await self.idempotency.get(claim_id)
if existing:
if existing['status'] == 'completed':
return CheckoutResult(
success=True,
order_id=existing['order_id'],
idempotent=True
)
elif existing['status'] == 'processing':
return CheckoutResult(
success=False,
error="Checkout already in progress"
)
# Mark as processing
await self.idempotency.set(claim_id, {
'status': 'processing',
'started_at': time.time()
})
try:
# Step 3: Process payment (Day 1 timeout + Day 3 circuit breaker)
payment = await self._process_payment(
user_id, sale_id, claim_id, deadline
)
# Step 4: Create order (with remaining timeout budget)
remaining_time = deadline - time.time()
if remaining_time <= 0:
raise TimeoutError("Checkout timeout exceeded")
order = await asyncio.wait_for(
self._create_order(user_id, sale_id, claim_id, payment),
timeout=remaining_time
)
# Step 5: Finalize idempotency record
await self.idempotency.set(claim_id, {
'status': 'completed',
'order_id': order.id,
'completed_at': time.time()
}, ttl=86400 * 7)
# Step 6: Publish events (async, non-blocking) (Day 4)
asyncio.create_task(self._publish_order_events(order))
return CheckoutResult(
success=True,
order_id=order.id
)
except PaymentError as e:
await self.idempotency.set(claim_id, {
'status': 'failed',
'error': str(e)
})
return CheckoutResult(success=False, error=str(e))
except TimeoutError:
# Don't mark as failed - payment might have succeeded
# Let user retry, idempotency will handle it
return CheckoutResult(
success=False,
error="Request timeout - please try again"
)
async def _process_payment(
self,
user_id: str,
sale_id: str,
claim_id: str,
deadline: float
) -> PaymentResult:
"""Process payment with circuit breaker and timeout."""
# Check circuit breaker (Day 3)
if self.circuit.is_open():
raise PaymentError("Payment service temporarily unavailable")
# Calculate remaining timeout
remaining = deadline - time.time()
timeout = min(self.payment_timeout, remaining)
if timeout <= 0:
raise TimeoutError("No time remaining for payment")
try:
# Call Stripe with idempotency key (Day 2)
payment = await asyncio.wait_for(
self.stripe.charges.create(
amount=await self._get_sale_price(sale_id),
currency='usd',
customer=await self._get_stripe_customer(user_id),
idempotency_key=f"checkout_{claim_id}",
metadata={
'sale_id': sale_id,
'claim_id': claim_id,
'user_id': user_id
}
),
timeout=timeout
)
self.circuit.record_success()
return payment
except asyncio.TimeoutError:
self.circuit.record_failure()
# Payment might have succeeded - verify
payment = await self._verify_payment(claim_id)
if payment:
return payment
raise TimeoutError("Payment timeout")
except StripeError as e:
self.circuit.record_failure()
raise PaymentError(str(e))
async def _verify_payment(self, claim_id: str) -> Optional[PaymentResult]:
"""Verify if payment exists in Stripe (for timeout recovery)."""
try:
payments = await self.stripe.charges.list(
limit=1,
metadata={'claim_id': claim_id}
)
if payments.data:
return payments.data[0]
return None
        except Exception:
return None
async def _create_order(
self,
user_id: str,
sale_id: str,
claim_id: str,
payment: PaymentResult
) -> Order:
"""Create order in database."""
async with self.db.transaction():
order = await self.db.orders.create(
id=generate_order_id(),
user_id=user_id,
sale_id=sale_id,
claim_id=claim_id,
payment_id=payment.id,
amount=payment.amount,
status='confirmed',
created_at=datetime.utcnow()
)
# Clean up claim
await self.redis.delete(f"claim:{sale_id}:{user_id}")
return order
async def _publish_order_events(self, order: Order):
"""Publish events for order (non-blocking)."""
# Order created event
await self.kafka.send('order_events', {
'event_type': 'order.created',
'order_id': order.id,
'user_id': order.user_id,
'sale_id': order.sale_id,
'amount': order.amount,
'timestamp': datetime.utcnow().isoformat()
})
# Queue confirmation email
await self.kafka.send('notification_events', {
'type': 'email',
'template': 'order_confirmation',
'recipient_id': order.user_id,
'data': {
'order_id': order.id,
'amount': order.amount
}
})
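CheckoutService assumes a CircuitBreaker exposing is_open(), record_success(), and record_failure(). A minimal count-based sketch is below; the thresholds and the half-open behavior are illustrative assumptions, not the definitive implementation.

# Minimal circuit breaker matching the interface used by CheckoutService above.
import time
from typing import Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout      # seconds to stay open before probing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.reset_timeout:
            return False        # half-open: let one probe request through
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()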
Interviewer: "I like how you've integrated all the patterns. Now, let's talk about the notification system. You mentioned millions of notifications when a sale starts. How do you handle that?"
Phase 6: Deep Dive - Notification System (5 minutes)
You: "The notification system is where the webhook patterns from Day 4 come in. Let me explain."
Pre-computation Strategy
THE PROBLEM:
Sale starts at 10:00:00
Need to notify 2.4M users
Users expect to know within seconds
Computing recipients at 10:00:00 = disaster
THE SOLUTION:
Pre-compute recipient list before sale starts
Store in ready-to-send format
At 10:00:00, just fan out to workers
Notification Architecture
───────────────────────────────────────────────────────────────────────────
                          NOTIFICATION PIPELINE
───────────────────────────────────────────────────────────────────────────

T-10 MINUTES (pre-computation)
──────────────────────────────
    SELECT user_id, email, push_token, notification_preferences
    FROM users
    WHERE (user_id IN (SELECT user_id FROM product_watchers
                       WHERE product_id = 'xyz')
           OR subscribed_to_deals = true)
      AND notification_enabled = true

    Result: 2.4M rows
    Store in: Redis lists, partitioned by user_id hash

    notification_batch:{sale_id}:0  = [users 0-100k]
    notification_batch:{sale_id}:1  = [users 100k-200k]
    ...
    notification_batch:{sale_id}:23 = [users 2.3M-2.4M]

T=0 (sale starts)
─────────────────
    Scheduler publishes to Kafka:

    Topic: notification_triggers
    {
      "sale_id": "xyz",
      "batch_count": 24,
      "trigger_time": "2024-01-15T10:00:00Z"
    }

Workers process batches (in parallel)
─────────────────────────────────────
    24 workers, each handles 100k users

    Worker 0:
      - Read notification_batch:{sale_id}:0 from Redis
      - For each user:
          - Check preferences (email? push? both?)
          - Queue to the appropriate sender

    All workers run simultaneously = 2.4M queued in ~10 seconds

Sender pools (final delivery)
─────────────────────────────
    Push Notification Pool (20 workers)
    ├── Firebase Cloud Messaging for Android
    ├── APNs for iOS
    └── Rate: ~50,000/second total

    Email Pool (10 workers)
    ├── SendGrid / SES
    ├── Rate: ~10,000/second total
    └── Lower priority (30-second SLA, not 5-second)
───────────────────────────────────────────────────────────────────────────
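The scheduler later calls notifications.prepare_recipients(sale); here is a sketch of that pre-computation step, chunking recipients into the notification_batch:{sale_id}:{n} lists shown above. The batch size, query shape, and db.query helper (returning mapping-like rows, as used elsewhere in this design) are assumptions.

# Sketch of recipient pre-computation into Redis batches, as described above.
import json

BATCH_SIZE = 100_000

class NotificationService:
    def __init__(self, db, redis):
        self.db = db
        self.redis = redis

    async def prepare_recipients(self, sale) -> int:
        rows = await self.db.query("""
            SELECT u.id, u.email, u.push_token, u.push_enabled, u.email_enabled
            FROM users u
            WHERE (u.id IN (SELECT user_id FROM product_watchers
                            WHERE product_id = $1)
                   OR u.subscribed_to_deals = true)
              AND u.notification_enabled = true
        """, sale.product_id)

        batch_count = 0
        for start in range(0, len(rows), BATCH_SIZE):
            key = f"notification_batch:{sale.id}:{batch_count}"
            payloads = [json.dumps(dict(r)) for r in rows[start:start + BATCH_SIZE]]
            await self.redis.rpush(key, *payloads)
            await self.redis.expire(key, 3600)   # batches are only needed around sale time
            batch_count += 1

        await self.redis.set(f"notification_batches:{sale.id}", batch_count)
        return batch_count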
Notification Worker Implementation
import asyncio
import json
import logging
from typing import Dict

logger = logging.getLogger(__name__)

class NotificationWorker:
"""
Notification worker implementing Day 4 patterns:
- At-least-once delivery
- Retry with backoff
- Dead letter queue
- Circuit breaker for external services
"""
    def __init__(
        self,
        redis: Redis,
        kafka_consumer: KafkaConsumer,
        kafka_producer: KafkaProducer,
        push_service: PushNotificationService,
        email_service: EmailService,
        circuit_breakers: Dict[str, CircuitBreaker]
    ):
        self.redis = redis
        self.consumer = kafka_consumer
        self.kafka = kafka_producer
        self.push = push_service
        self.email = email_service
        self.circuits = circuit_breakers
async def process_batch(self, trigger: NotificationTrigger):
"""Process a notification batch for a sale."""
batch_key = f"notification_batch:{trigger.sale_id}:{trigger.batch_id}"
# Get users from pre-computed batch
users = await self.redis.lrange(batch_key, 0, -1)
for user_data in users:
user = json.loads(user_data)
try:
await self._send_notification(user, trigger)
except Exception as e:
# Log and continue - don't let one failure stop batch
logger.error(
"Notification failed",
user_id=user['id'],
error=str(e)
)
async def _send_notification(self, user: dict, trigger: NotificationTrigger):
"""Send notification to a single user."""
# Idempotency check (Day 2)
idem_key = f"notif_sent:{trigger.sale_id}:{user['id']}"
if await self.redis.exists(idem_key):
return # Already sent
# Send based on preferences
if user.get('push_enabled') and user.get('push_token'):
await self._send_push(user, trigger)
if user.get('email_enabled') and user.get('email'):
await self._queue_email(user, trigger)
# Mark as sent (idempotency)
await self.redis.set(idem_key, '1', ex=3600) # 1 hour TTL
async def _send_push(self, user: dict, trigger: NotificationTrigger):
"""Send push notification with circuit breaker."""
# Check circuit breaker (Day 3)
provider = 'fcm' if user['platform'] == 'android' else 'apns'
if self.circuits[provider].is_open():
# Queue for retry later
await self._queue_for_retry(user, trigger, 'push')
return
try:
await asyncio.wait_for(
self.push.send(
token=user['push_token'],
                    title="⚡ Lightning Deal Live!",
body=f"The {trigger.product_name} deal is live! Tap to claim.",
data={'sale_id': trigger.sale_id}
),
timeout=2.0 # Day 1: Timeout
)
self.circuits[provider].record_success()
except asyncio.TimeoutError:
self.circuits[provider].record_failure()
await self._queue_for_retry(user, trigger, 'push')
except Exception as e:
self.circuits[provider].record_failure()
logger.error("Push failed", error=str(e))
async def _queue_email(self, user: dict, trigger: NotificationTrigger):
"""Queue email for async sending (Day 4 pattern)."""
await self.kafka.send('email_queue', {
'recipient': user['email'],
'template': 'flash_sale_live',
'data': {
'user_name': user['name'],
'product_name': trigger.product_name,
'sale_url': f"https://megamart.com/sale/{trigger.sale_id}"
},
'idempotency_key': f"sale_email:{trigger.sale_id}:{user['id']}"
})
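_queue_for_retry is referenced above but not shown. One way to implement it is a delayed-retry sorted set that a separate drain loop pops when entries become due; the key names and backoff schedule below are assumptions.

# Sketch of the retry queue used by NotificationWorker above (names are assumptions).
import json
import time

RETRY_DELAYS = [5, 30, 120]   # seconds of backoff per attempt

class RetryQueueMixin:
    async def _queue_for_retry(self, user: dict, trigger, channel: str, attempt: int = 0):
        if attempt >= len(RETRY_DELAYS):
            # Give up: hand the message to a dead-letter queue for inspection
            await self.redis.rpush("notifications:dlq", json.dumps({
                "user_id": user["id"], "sale_id": trigger.sale_id, "channel": channel,
            }))
            return
        payload = json.dumps({
            "user": user, "sale_id": trigger.sale_id,
            "channel": channel, "attempt": attempt + 1,
        })
        # Score = when the retry becomes due; a drain loop pops due entries and re-sends.
        await self.redis.zadd(
            "notifications:retry",
            {payload: time.time() + RETRY_DELAYS[attempt]},
        )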
Interviewer: "Good. One more thing ā how does the flash sale scheduling work? You mentioned these happen every hour. Is that a cron job?"
Phase 7: Deep Dive - Sale Scheduling (5 minutes)
You: "Yes, this is where the distributed cron patterns from Day 5 come in. We need to ensure each sale starts exactly once, even across multiple servers."
Sale Scheduling Architecture
──────────────────────────────────────────────────────────────────────────
                       DISTRIBUTED SALE SCHEDULER
──────────────────────────────────────────────────────────────────────────

  ETCD COORDINATION LAYER
  ├── /elections/sale-scheduler/leader
  ├── /config/sales/upcoming
  └── /fencing/current_token
        │
        │  Only the leader runs
        ▼
  SALE SCHEDULER SERVICE

  Every minute:
    1. Check for sales starting in the next 10 minutes
    2. Pre-compute notification recipients
    3. Warm up Redis with sale data

  At the scheduled time:
    1. Verify fencing token
    2. Initialize inventory counter
    3. Mark sale as active
    4. Trigger notifications

  At sale end:
    1. Mark sale as ended
    2. Return unclaimed inventory
    3. Generate sale report
──────────────────────────────────────────────────────────────────────────
Sale Scheduler Implementation
import asyncio
import logging
import time
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

class FlashSaleScheduler:
"""
Distributed flash sale scheduler using Day 5 patterns:
- Leader election
- Fencing tokens
- Exactly-once scheduling
- Heartbeat monitoring
"""
def __init__(
self,
etcd_client,
redis: Redis,
db: Database,
kafka: KafkaProducer,
notification_service: NotificationService
):
self.etcd = etcd_client
self.redis = redis
self.db = db
self.kafka = kafka
self.notifications = notification_service
self.is_leader = False
self.fencing_token = 0
self.leader_election = LeaderElection(
etcd_client,
election_name="sale-scheduler"
)
async def start(self):
"""Start the scheduler."""
# Start leader election
asyncio.create_task(
self.leader_election.campaign(
on_elected=self._on_elected,
on_demoted=self._on_demoted
)
)
# Main scheduler loop
while True:
if self.is_leader:
await self._scheduler_tick()
await asyncio.sleep(1)
def _on_elected(self):
"""Called when this instance becomes leader."""
self.is_leader = True
self.fencing_token = int(time.time() * 1000)
logger.info("Became sale scheduler leader", token=self.fencing_token)
def _on_demoted(self):
"""Called when this instance loses leadership."""
self.is_leader = False
logger.info("Lost sale scheduler leadership")
async def _scheduler_tick(self):
"""Main scheduling loop tick."""
now = datetime.utcnow()
# Find sales that need action
await self._prepare_upcoming_sales(now)
await self._start_due_sales(now)
await self._end_expired_sales(now)
async def _prepare_upcoming_sales(self, now: datetime):
"""Prepare sales starting in next 10 minutes."""
upcoming = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time BETWEEN $1 AND $2
AND preparation_started = false
""", now, now + timedelta(minutes=10))
for sale in upcoming:
# Mark as preparing (idempotent)
await self.db.execute("""
UPDATE flash_sales
SET preparation_started = true,
preparation_token = $1
WHERE id = $2 AND preparation_started = false
""", self.fencing_token, sale.id)
# Pre-compute notifications (async)
asyncio.create_task(
self.notifications.prepare_recipients(sale)
)
# Warm up Redis
await self._warm_up_sale_data(sale)
async def _start_due_sales(self, now: datetime):
"""Start sales that are due."""
due_sales = await self.db.query("""
SELECT * FROM flash_sales
WHERE status = 'scheduled'
AND start_time <= $1
AND preparation_started = true
""", now)
for sale in due_sales:
await self._start_sale(sale)
async def _start_sale(self, sale):
"""
Start a single sale.
Uses fencing token to prevent double-start.
"""
# Atomically claim sale start with fencing token
claimed = await self.db.execute("""
UPDATE flash_sales
SET status = 'active',
started_at = NOW(),
started_by_token = $1
WHERE id = $2
AND status = 'scheduled'
AND (started_by_token IS NULL OR started_by_token < $1)
RETURNING id
""", self.fencing_token, sale.id)
if not claimed:
logger.warning("Could not claim sale start", sale_id=sale.id)
return
logger.info("Starting sale", sale_id=sale.id, token=self.fencing_token)
# Initialize inventory in Redis
await self.redis.set(
f"inventory:{sale.id}",
sale.total_inventory
)
# Add to active sales set
await self.redis.sadd('active_sales', sale.id)
# Trigger notifications
await self.kafka.send('notification_triggers', {
'type': 'sale_started',
'sale_id': sale.id,
'product_name': sale.product_name,
'fencing_token': self.fencing_token
})
# Schedule sale end
asyncio.create_task(
self._schedule_sale_end(sale.id, sale.duration_minutes)
)
async def _schedule_sale_end(self, sale_id: str, duration_minutes: int):
"""Schedule the end of a sale."""
await asyncio.sleep(duration_minutes * 60)
if self.is_leader: # Only end if still leader
await self._end_sale(sale_id)
async def _end_sale(self, sale_id: str):
"""End a sale and clean up."""
# Update database with fencing token
ended = await self.db.execute("""
UPDATE flash_sales
SET status = 'ended',
ended_at = NOW(),
ended_by_token = $1
WHERE id = $2
AND status = 'active'
RETURNING id
""", self.fencing_token, sale_id)
if not ended:
return
logger.info("Ending sale", sale_id=sale_id)
# Remove from active sales
await self.redis.srem('active_sales', sale_id)
# Return unclaimed inventory to report
remaining = await self.redis.get(f"inventory:{sale_id}")
# Publish sale ended event
await self.kafka.send('sale_events', {
'type': 'sale_ended',
'sale_id': sale_id,
'remaining_inventory': int(remaining) if remaining else 0,
'fencing_token': self.fencing_token
})
# Clean up Redis (keep for a while for debugging)
await self.redis.expire(f"inventory:{sale_id}", 3600)
async def _warm_up_sale_data(self, sale):
"""Pre-load sale data into Redis."""
await self.redis.hset(f"sale:{sale.id}", mapping={
'product_id': sale.product_id,
'product_name': sale.product_name,
'original_price': sale.original_price,
'sale_price': sale.sale_price,
'total_inventory': sale.total_inventory,
'start_time': sale.start_time.isoformat(),
'end_time': sale.end_time.isoformat()
})
# Set TTL beyond sale end
await self.redis.expire(
f"sale:{sale.id}",
sale.duration_minutes * 60 + 3600
)
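The scheduler assumes an etcd-backed LeaderElection helper. As a simplified stand-in that shows the same campaign/renew/demote shape, here is a Redis-lock-based sketch; this is not the etcd mechanism used above (which also provides fencing tokens and watch semantics), and the key name, TTL, and renewal cadence are assumptions.

# Simplified leader-election sketch using a Redis lock with TTL (illustration only).
import asyncio
import uuid

class LeaderElection:
    def __init__(self, redis, election_name: str, ttl: int = 10):
        self.redis = redis                      # assumes decode_responses=True
        self.key = f"elections:{election_name}:leader"
        self.instance_id = uuid.uuid4().hex
        self.ttl = ttl

    async def campaign(self, on_elected, on_demoted):
        is_leader = False
        while True:
            # SET NX succeeds only when there is no current leader; EX bounds the lease.
            acquired = await self.redis.set(self.key, self.instance_id,
                                            nx=True, ex=self.ttl)
            if acquired and not is_leader:
                is_leader = True
                on_elected()
            elif is_leader:
                # Renew the lease only if we still own it.
                owner = await self.redis.get(self.key)
                if owner == self.instance_id:
                    await self.redis.expire(self.key, self.ttl)
                else:
                    is_leader = False
                    on_demoted()
            await asyncio.sleep(self.ttl / 3)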
Phase 8: Monitoring and Observability (3 minutes)
Interviewer: "How would you monitor this system? What alerts would you set up?"
You: "Monitoring is critical for a flash sale system. Here's my approach:"
Key Metrics Dashboard
───────────────────────────────────────────────────────────────────────────
                     FLASH SALE REAL-TIME DASHBOARD
───────────────────────────────────────────────────────────────────────────

  CURRENT SALE: iPhone 15 Pro Lightning Deal

  INVENTORY              CLAIMS                 CHECKOUTS
  2,847 remaining        7,153 active           6,421 completed
  ▼ 450/min              ▲ 320/min              ▲ 280/min

  REQUESTS/SEC           ERROR RATE             P99 LATENCY
  45,231                 0.12%                  247ms
  peak: 150k             budget: 1%             budget: 500ms

  CIRCUIT BREAKERS
  ├── Stripe API       CLOSED   99.8% success
  ├── Redis Cluster    CLOSED   100% success
  ├── Push (FCM)       CLOSED   97.2% success
  ├── Push (APNs)      CLOSED   99.9% success
  └── Email (SES)      CLOSED   99.7% success

  NOTIFICATIONS
  ├── Total to send    2,400,000
  ├── Sent             2,387,421 (99.5%)
  ├── Failed           12,579 (0.5%)
  ├── Pending          0
  └── Time elapsed     18 seconds
───────────────────────────────────────────────────────────────────────────
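The dashboard and the alert rules below assume the services export these metrics. A minimal instrumentation sketch with prometheus_client is shown here; the metric names follow the alert rules, while the label sets, buckets, and scrape port are assumptions.

# Minimal metrics-exporter sketch matching the metric names in the alerts below.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

FLASH_SALE_INVENTORY = Gauge(
    "flash_sale_inventory", "Remaining inventory", ["sale_id"])
FLASH_SALE_ACTIVE = Gauge(
    "flash_sale_active", "1 while a sale is running", ["sale_id"])
CLAIM_LATENCY = Histogram(
    "claim_request_duration_seconds", "Claim endpoint latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2])
CHECKOUT_LATENCY = Histogram(
    "checkout_request_duration_seconds", "Checkout endpoint latency",
    buckets=[0.25, 0.5, 1, 2, 3, 5, 10])
HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["method", "path", "status"])
CIRCUIT_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=half-open, 2=open", ["name"])

def start_metrics_endpoint(port: int = 9102) -> None:
    # Prometheus scrapes this endpoint; the port is an assumption.
    start_http_server(port)

# Example usage inside the claim handler:
#   with CLAIM_LATENCY.time():
#       ...handle claim...
#   HTTP_REQUESTS.labels("POST", "/claim", "200").inc()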
Critical Alerts
# Prometheus Alert Rules
groups:
- name: flash_sale_critical
rules:
# INVENTORY ALERTS
- alert: InventoryOversold
expr: flash_sale_inventory < 0
for: 0s
labels:
severity: critical
annotations:
summary: "CRITICAL: Inventory oversold for sale {{ $labels.sale_id }}"
- alert: InventoryNotDecreasing
expr: rate(flash_sale_inventory[1m]) >= 0 AND flash_sale_active == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Inventory not moving during active sale"
# LATENCY ALERTS
- alert: ClaimLatencyHigh
expr: histogram_quantile(0.99, rate(claim_request_duration_seconds_bucket[1m])) > 0.5
for: 1m
labels:
severity: critical
annotations:
summary: "P99 claim latency {{ $value }}s exceeds 500ms"
- alert: CheckoutLatencyHigh
expr: histogram_quantile(0.99, rate(checkout_request_duration_seconds_bucket[1m])) > 3
for: 1m
labels:
severity: critical
annotations:
summary: "P99 checkout latency {{ $value }}s exceeds 3s"
# ERROR RATE ALERTS
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) > 0.01
for: 1m
labels:
severity: critical
annotations:
summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
# CIRCUIT BREAKER ALERTS
- alert: CircuitBreakerOpen
expr: circuit_breaker_state == 2 # 2 = OPEN
for: 0s
labels:
severity: critical
annotations:
summary: "Circuit breaker {{ $labels.name }} is OPEN"
# INFRASTRUCTURE ALERTS
- alert: RedisHighLatency
expr: redis_command_duration_seconds_p99 > 0.01
for: 1m
labels:
severity: warning
annotations:
summary: "Redis P99 latency {{ $value }}s exceeds 10ms"
- alert: KafkaLag
expr: kafka_consumer_lag > 10000
for: 2m
labels:
severity: warning
annotations:
summary: "Kafka consumer lag {{ $value }} exceeds 10k"
Phase 9: Wrap-Up and Extensions (2 minutes)
Interviewer: "We're almost out of time. Let's quickly discuss: what would you do differently at 10x the scale?"
You: "Great question. At 10x scale (50M concurrent users, 100K inventory), I'd make these changes:"
Scaling to 10x
CURRENT → 10x SCALE

TRAFFIC HANDLING:
├── Current: 150K req/sec peak
├── 10x: 1.5M req/sec peak
└── Solution:
    - Geographic distribution (multiple regions)
    - Regional inventory pools
    - Edge caching for sale pages
    - WebSocket for inventory updates (reduce polling)

INVENTORY MANAGEMENT:
├── Current: Single Redis cluster
├── 10x: Redis cluster can't handle atomic ops at this rate
└── Solution:
    - Shard inventory by region
    - Eventual consistency for the display count
    - Strong consistency only for actual claims
    - Consider CockroachDB for distributed transactions

NOTIFICATION DELIVERY:
├── Current: 2.4M in 30 seconds
├── 10x: 24M in 30 seconds
└── Solution:
    - Pre-send notifications (hint: "sale starting in 5 seconds")
    - Progressive rollout by user segment
    - More aggressive batching to FCM/APNs

DATABASE:
├── Current: PostgreSQL primary + 2 replicas
├── 10x: Single primary becomes a bottleneck
└── Solution:
    - Vitess or CockroachDB for horizontal scaling
    - Event sourcing for order creation
    - Read from cache, async write to DB
Alternative Approaches Considered
You: "I should also mention alternatives I considered but didn't choose:"
ALTERNATIVE: Queue-Based Claims
Instead of: Direct Redis DECR for claims
Alternative: Put all claim requests in a queue, process in order
Pros: Perfect ordering, simpler consistency
Cons: Higher latency, doesn't match "instant" UX requirement
Decision: Direct Redis better for our latency requirements
ALTERNATIVE: Reservation-Style Inventory
Instead of: Decrement on claim
Alternative: Decrement only on successful checkout
Pros: No need for claim expiry handling
Cons: Inventory shows as available but isn't, worse UX
Decision: Claim-then-checkout better for transparency
ALTERNATIVE: Distributed Lock per Item
Instead of: Atomic counter
Alternative: Lock each inventory unit individually
Pros: Fine-grained control
Cons: 10,000 locks = complexity nightmare
Decision: Atomic counter much simpler
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, handled the scale estimation well, and made good trade-off decisions. Any questions for me?"
You: "Thank you! I'd love to hear how MegaMart currently handles this, and what challenges you've faced in production."
Summary: Week 1-2 Concepts Applied
Week 1 Concepts (Foundations of Scale)
| Concept | Application in This Design |
|---|---|
| Horizontal vs Vertical Scaling | API services scale horizontally, Redis and PostgreSQL scale with clustering |
| Database Partitioning | Orders partitioned by date, notifications batched by user_id hash |
| Caching Strategies | CDN for static content, Redis for hot data, 1-second TTL for inventory count |
| Load Balancing | ALB across API servers, partition-aware Kafka consumers |
| Message Queues | Kafka for event-driven architecture, async notification processing |
Week 2 Concepts (Failure-First Design)
| Day | Concept | Application |
|---|---|---|
| Day 1 | Timeouts | Payment timeout (10s), checkout budget (25s), claim validation |
| Day 2 | Idempotency | Claim idempotency key, checkout idempotency, Stripe idempotency_key |
| Day 3 | Circuit Breakers | Stripe circuit, FCM/APNs circuits, Redis circuit |
| Day 4 | Webhooks | Notification delivery, at-least-once semantics, retry with backoff |
| Day 5 | Distributed Cron | Sale scheduling, leader election, fencing tokens |
Code Patterns Demonstrated
1. ATOMIC OPERATIONS
- Redis Lua scripts for claim
- Database transactions for orders
2. IDEMPOTENCY IMPLEMENTATION
- Check-before-execute pattern
- Idempotency store with TTL
3. CIRCUIT BREAKER INTEGRATION
- Check before external calls
- Record success/failure
- Fallback behavior
4. TIMEOUT BUDGETS
- Total budget for operation
- Remaining budget propagation
- Timeout on all external calls
5. LEADER ELECTION
- etcd-based election
- Fencing token validation
- Graceful failover
6. EVENT-DRIVEN ARCHITECTURE
- Kafka for event publishing
- Consumer groups for parallel processing
- At-least-once delivery
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Estimate traffic and storage requirements from business requirements
- Design a system that handles massive traffic spikes
- Implement atomic inventory operations without race conditions
- Integrate idempotency at multiple levels (claim, checkout, notifications)
- Apply circuit breakers to protect against external service failures
- Design a notification system that delivers millions of messages quickly
- Implement distributed scheduling with exactly-once semantics
- Set up meaningful monitoring and alerting
- Discuss trade-offs and alternatives clearly
- Handle follow-up questions about scaling and edge cases
This capstone problem integrates all concepts from Weeks 1-2 of the System Design Mastery Series. Use this as a template for approaching similar interview problems.