Week 10 Capstone: The Grand Finale
System Design Mastery Series: The Complete Journey
🎯 The Final Interview
Ten weeks. Fifty lessons. Thousands of concepts. One final test.
You've learned to think like a senior engineer. Now prove it.
The Interview Begins
You walk into the interview room for the final round. The whiteboard is blank. The interviewer smiles; this is the principal engineer, someone who's built systems at massive scale.
Interviewer: "Welcome. I've heard great things about your preparation. Today, we're going to design something ambitious: a system that will test everything you know about distributed systems, reliability, and operational excellence."
They turn to the whiteboard and write:
┌──────────────────────────────────────────────────────────────────────────┐
│
│   Design a Global Real-Time Bidding Platform
│
│   You're building the ad-tech infrastructure for a company that
│   serves 10 billion ad requests per day across 6 continents.
│
│   When a user loads a webpage, your system has 100ms to:
│     1. Receive the bid request
│     2. Query multiple demand-side platforms (DSPs)
│     3. Run an auction
│     4. Return the winning ad
│
│   Requirements:
│     • 100ms end-to-end latency (p99)
│     • 99.99% availability (52 minutes downtime/year)
│     • Support 500,000 requests per second at peak
│     • Real-time budget tracking (can't overspend)
│     • Fraud detection (block invalid traffic)
│     • Complete audit trail for billing
│
└──────────────────────────────────────────────────────────────────────────┘
Interviewer: "This is one of the most demanding system design problems in our industry. Take your time. Think out loud. I'm interested in your approach as much as your solution."
You take a deep breath. This is it: everything you've learned comes together now.
Phase 1: Requirements Clarification
Before touching the whiteboard, you pause. Week 1 taught you: never assume.
You: "Before I dive in, I'd like to clarify some requirements. This will help me make the right trade-offs."
Interviewer: "Of course. Good instinct."
Your Questions
You: "First, let me understand the traffic pattern. You mentioned 10 billion requests per day and 500K peak RPS. What's the average RPS, and how bursty is the traffic?"
Interviewer: "Average is about 120,000 RPS. Peak happens during major events β Super Bowl, World Cup β where we might see 4-5x normal traffic for a few hours. There's also daily seasonality: traffic is 2x higher during evening hours in each timezone."
You: "For the 100ms latency requirement β is that from when we receive the request to when we return the response? And what happens if we can't meet it?"
Interviewer: "Yes, end-to-end. If you miss the deadline, the ad exchange moves on without you β you lose that impression. Speed is revenue."
You: "How many DSPs do we need to query for each request?"
Interviewer: "Typically 5-10 DSPs per request, depending on the ad slot. Each DSP has their own latency characteristics β some respond in 20ms, some take 80ms."
You: "For budget tracking, what's the tolerance for overspend?"
Interviewer: "Advertisers set daily budgets. We absolutely cannot exceed budget by more than 0.1%. At our scale, even small percentage errors mean real money."
You: "What about geographic distribution of traffic?"
Interviewer: "Roughly: 40% Americas, 35% Europe, 20% Asia-Pacific, 5% rest of world. We need low latency everywhere."
You: "Last question: what data do we need for the audit trail, and how long must we retain it?"
Interviewer: "Every bid request, every response, every auction result. 7 years retention for compliance. This is billions of events per day."
Requirements Summary
You turn to the whiteboard and write:
Functional Requirements:

1. BID REQUEST HANDLING
   • Receive bid request from ad exchanges
   • Parse and validate request (user context, ad slot info)
   • Enrich with user data (if available, privacy-compliant)

2. DSP ORCHESTRATION
   • Fan out to 5-10 DSPs per request
   • Collect bids within timeout
   • Handle partial responses gracefully

3. AUCTION ENGINE
   • Run second-price auction
   • Apply business rules (frequency caps, brand safety)
   • Select winning bid

4. BUDGET MANAGEMENT
   • Track spend in real-time
   • Enforce daily/campaign budgets
   • Pace delivery throughout the day

5. FRAUD DETECTION
   • Detect invalid traffic (bots, click farms)
   • Block suspicious requests
   • Minimize false positives (lost revenue)

6. AUDIT & BILLING
   • Log every transaction
   • Support billing reconciliation
   • 7-year retention
Non-Functional Requirements:

LATENCY
   • End-to-end p99: < 100ms
   • DSP timeout: 70ms (must leave margin for the auction)
   • Internal processing: < 20ms

THROUGHPUT
   • Average: 120,000 RPS
   • Peak: 500,000 RPS
   • Must handle 5x spikes

AVAILABILITY
   • 99.99% uptime
   • 52 minutes/year max downtime
   • No single points of failure

CONSISTENCY
   • Budget: strong consistency (can't overspend)
   • Auction: eventual consistency OK
   • Audit: durable, ordered

DATA SCALE
   • 10 billion events/day
   • 7 years retention
   • ~200 PB total storage (derived in the estimation below)
Interviewer: "Excellent requirements gathering. You've identified the key constraints. Now show me the architecture."
Phase 2: Back-of-the-Envelope Estimation
You: "Let me validate these numbers and derive some infrastructure requirements."
Traffic Calculations
REQUESTS PER SECOND
Daily requests: 10,000,000,000
Seconds per day: 86,400
Average RPS: ~116,000 (matches interviewer's 120K)
Peak multiplier: ~4x
Peak RPS: ~500,000 ✓
Per-region peak (assuming 40% in Americas):
Americas peak: 200,000 RPS
Europe peak: 175,000 RPS
APAC peak: 100,000 RPS
Data Volume Calculations
STORAGE REQUIREMENTS
Per bid request (raw):
   • Request metadata: ~2 KB
   • DSP responses (10): ~5 KB
   • Auction result: ~1 KB
   • Total per request: ~8 KB

Daily storage:
   • 10B requests × 8 KB = 80 TB/day

Annual storage:
   • 80 TB × 365 = ~29 PB/year

7-year retention:
   • ~200 PB total
   • Need tiered storage (hot/warm/cold)
Infrastructure Estimates
COMPUTE REQUIREMENTS
Assuming each server handles 5,000 RPS at 70% utilization:
Peak servers needed: 500,000 / 5,000 = 100 servers
With 50% headroom: 150 servers globally
Per region:
   • Americas: 60 servers
   • Europe: 50 servers
   • APAC: 30 servers
   • Other: 10 servers
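These numbers are worth sanity-checking mechanically. A few lines of Python (values taken from the estimates above, using decimal units as the estimates do) keep the arithmetic honest:

# estimates/capacity_check.py
# Quick sanity check of the back-of-the-envelope numbers above.
DAILY_REQUESTS = 10_000_000_000
SECONDS_PER_DAY = 86_400
BYTES_PER_REQUEST = 8_000        # ~8 KB logged per request
PEAK_RPS = 500_000
RPS_PER_SERVER = 5_000           # assumed per-server capacity at 70% utilization

avg_rps = DAILY_REQUESTS / SECONDS_PER_DAY
daily_tb = DAILY_REQUESTS * BYTES_PER_REQUEST / 1e12
seven_year_pb = daily_tb * 365 * 7 / 1_000
servers = PEAK_RPS / RPS_PER_SERVER

print(f"Average RPS:   {avg_rps:,.0f}")           # ~115,741
print(f"Daily storage: {daily_tb:,.0f} TB")       # 80 TB
print(f"7-year total:  {seven_year_pb:,.0f} PB")  # ~204 PB (the ~200 PB above)
print(f"Peak servers:  {servers:,.0f} (+50% headroom = {servers * 1.5:,.0f})")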
Latency Budget
100ms LATENCY BUDGET

   Request     Parse &      DSP Fan-out     Auction     Response
   Receive     Validate     & Collect       Engine      Send
   [ 2ms ]     [ 5ms ]      [ 75ms ]        [ 10ms ]    [ 3ms ]
   ─────────────────────────────────────────────────────────────
                          95ms total
                          5ms buffer for variance

   DSP TIMEOUT: 70ms (must return by 80ms to leave margin)
Interviewer: "Good. The latency budget is particularly insightful. Now, show me the system."
Phase 3: High-Level Architecture
You: "Let me draw the complete architecture, then we'll dive deep into each component."
System Architecture
GLOBAL RTB PLATFORM

GLOBAL LAYER

    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │   Route53   │      │ CloudFront  │      │   Global    │
    │   GeoDNS    │─────▶│    (TLS)    │─────▶│   Config    │
    └─────────────┘      └─────────────┘      └─────────────┘
           │
           ├────────────────────┬────────────────────┐
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │   US-EAST   │      │   EU-WEST   │      │  AP-SOUTH   │
    │   REGION    │      │   REGION    │      │   REGION    │
    └─────────────┘      └─────────────┘      └─────────────┘

REGIONAL ARCHITECTURE
(each region identical)

EDGE LAYER

    NLB · NLB · NLB   (Network Load Balancers)
              │
              ▼
    ┌─────────────────────────────────┐
    │  RATE LIMITER                   │   (token bucket per region)
    │  + FRAUD FILTER                 │   (ML-based filtering)
    └─────────────────────────────────┘
              │
              ▼
BIDDING LAYER

    BID ORCHESTRATOR FLEET:   Orch 1 · Orch 2 · Orch 3 · ... · Orch N
              │
              ├───────────────────────────┐
              ▼                           ▼
    ┌────────────────────┐      ┌────────────────────┐
    │    DSP GATEWAY     │      │   AUCTION ENGINE   │
    │                    │      │                    │
    │  Fan-out to DSPs   │      │  Second-price      │
    │  Timeout mgmt      │      │  Business rules    │
    │  Circuit breaker   │      │  Winner select     │
    └────────────────────┘      └────────────────────┘
              │
              ▼
DATA LAYER

    Redis Cluster (cache) · Budget Store (strong consistency)
    User Store (cache)    · Config Store (replicated)
              │
              ▼
ASYNC PIPELINE

    Kafka Events ──▶ Flink Streaming ──▶ S3 Parquet ──▶ Redshift/Presto
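One edge-layer component worth making concrete is the rate limiter noted above. A minimal per-key token bucket sketch (class and parameter names are illustrative assumptions, not a fixed API):

# edge/rate_limiter.py
# Per-key token bucket used at the edge (minimal sketch).
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s        # refill rate in tokens/second
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False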
Request Flow
You: "Let me trace a single request through the system..."
BID REQUEST LIFECYCLE (Target: 100ms)

0ms ──────────────────────────────────────────────────────────── 100ms

① REQUEST ARRIVES (0-2ms)
   • Publisher's webpage loads
   • Ad exchange sends bid request
   • GeoDNS routes to nearest region
   • NLB receives request
        │
        ▼
② EDGE PROCESSING (2-7ms)
   • Rate limiter checks quota
   • Fraud filter runs ML model (pre-trained, <1ms inference)
   • Request validated and parsed
   • Enriched with cached user data
        │
        ▼
③ DSP FAN-OUT (7-77ms)
   • Orchestrator selects relevant DSPs (5-10)
   • Parallel requests to all DSPs
   • 70ms timeout (hard deadline)
   • Collect responses as they arrive
   • Circuit breakers protect against slow DSPs
        │
        ▼
④ AUCTION (77-92ms)
   • Run second-price auction on collected bids
   • Apply frequency caps, brand safety rules
   • Check budget (CRITICAL: strong consistency)
   • Reserve budget atomically
   • Select winner
        │
        ▼
⑤ RESPONSE (92-97ms)
   • Format winning ad response
   • Sign response for verification
   • Send to ad exchange
        │
        ▼
⑥ ASYNC LOGGING (fire-and-forget)
   • Publish to Kafka (non-blocking)
   • Event contains: request, all bids, winner, timing
   • Downstream: billing, analytics, fraud ML training

DONE: ~95ms average, ~100ms p99
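The auction step itself is small but central, so here is a minimal second-price sketch. The `Bid` type, floor price, and one-cent increment are illustrative assumptions; real business rules (frequency caps, brand safety, budget check) would run before winner selection:

# auction/second_price.py
# Minimal second-price auction over collected DSP bids (illustrative sketch).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Bid:
    dsp_id: str
    price: float      # what the DSP is willing to pay
    ad_markup: str

def run_auction(bids: List[Bid], floor_price: float = 0.10) -> Optional[Tuple[Bid, float]]:
    """Return (winner, clearing_price), or None if no bid meets the floor.

    Second-price rule: the winner pays the second-highest bid
    (or the floor), plus a minimal increment.
    """
    eligible = sorted(
        (b for b in bids if b.price >= floor_price),
        key=lambda b: b.price,
        reverse=True,
    )
    if not eligible:
        return None
    winner = eligible[0]
    runner_up = eligible[1].price if len(eligible) > 1 else floor_price
    clearing_price = min(winner.price, runner_up + 0.01)
    return winner, clearing_price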
Interviewer: "Good overview. Let's dive into the hard parts. Tell me about budget management β how do you guarantee we never overspend?"
Phase 4: Deep Dives
Deep Dive 1: Real-Time Budget Management
Week 1 concepts: Partitioning, consistency. Week 5 concepts: Distributed transactions.
You: "Budget management is one of the hardest problems here. Let me explain why, and then the solution."
The Challenge:
WHY BUDGET IS HARD

Scenario:
   • Advertiser has $10,000 daily budget
   • Current spend: $9,990
   • We have 500,000 RPS globally
   • 1,000 requests for this advertiser arrive simultaneously
   • Each bid is $20

If each server checks "is there budget?" independently:
   • All 1,000 see "$10 remaining"
   • All 1,000 approve the bid
   • Actual spend: $9,990 + (1,000 × $20) = $29,990
   • OVERSPEND: $19,990 (200% over budget!)
This is a classic distributed consistency problem.
The Solution:
BUDGET ARCHITECTURE

We use a hierarchical budget allocation system:

                    ┌───────────────────┐
                    │   GLOBAL BUDGET   │
                    │    COORDINATOR    │
                    │ (source of truth) │
                    └─────────┬─────────┘
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
   ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
   │    US-EAST     │ │    EU-WEST     │ │    AP-SOUTH    │
   │ REGION BUDGET  │ │ REGION BUDGET  │ │ REGION BUDGET  │
   │   ALLOCATOR    │ │   ALLOCATOR    │ │   ALLOCATOR    │
   │                │ │                │ │                │
   │  Allocation:   │ │  Allocation:   │ │  Allocation:   │
   │    $4,000      │ │    $3,500      │ │    $2,000      │
   └───────┬────────┘ └───────┬────────┘ └───────┬────────┘
           │                  │                  │
           ▼                  ▼                  ▼
      ┌─────────┐        ┌─────────┐        ┌─────────┐
      │ Server  │ ...    │ Server  │ ...    │ Server  │ ...
      │  Pod    │        │  Pod    │        │  Pod    │
      │ Local:  │        │ Local:  │        │ Local:  │
      │  $100   │        │  $100   │        │  $100   │
      └─────────┘        └─────────┘        └─────────┘

HOW IT WORKS:

1. Global coordinator divides daily budget across regions
   (based on expected traffic distribution)
2. Region allocators divide among server pods
   (each pod gets a "budget lease")
3. Pods spend locally until lease exhausted
   (no network call needed for each bid)
4. When lease runs low, request more from region allocator
   (background refresh, not in critical path)
5. If region exhausted, request rebalancing from global
   (shift unused budget from other regions)
You: "Let me show the implementation..."
# budget/distributed_budget.py
"""
Hierarchical budget management for real-time bidding.
Provides strong consistency guarantees while minimizing
latency impact on the critical bidding path.
"""
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
from datetime import datetime, timedelta
import asyncio
import logging
logger = logging.getLogger(__name__)
@dataclass
class BudgetLease:
"""A local budget allocation."""
advertiser_id: str
campaign_id: str
allocated_amount: float
spent_amount: float
expires_at: datetime
@property
def remaining(self) -> float:
return self.allocated_amount - self.spent_amount
@property
def is_expired(self) -> bool:
return datetime.utcnow() > self.expires_at
@property
def utilization(self) -> float:
if self.allocated_amount == 0:
return 1.0
return self.spent_amount / self.allocated_amount
class LocalBudgetManager:
"""
Manages budget leases for a single server pod.
Design principles:
- Spend decisions are LOCAL (no network in hot path)
- Background refresh keeps leases topped up
- Conservative: rather reject bids than overspend
"""
def __init__(
self,
region_allocator,
pod_id: str,
refresh_threshold: float = 0.7, # Refresh at 70% utilization
lease_duration: timedelta = timedelta(seconds=30)
):
self.region = region_allocator
self.pod_id = pod_id
self.refresh_threshold = refresh_threshold
self.lease_duration = lease_duration
self.leases: Dict[str, BudgetLease] = {}
self._refresh_lock = asyncio.Lock()
async def can_spend(
self,
advertiser_id: str,
campaign_id: str,
amount: float
) -> Tuple[bool, str]:
"""
Check if we can spend amount. MUST BE FAST (<1ms).
Returns (can_spend, reason)
"""
key = f"{advertiser_id}:{campaign_id}"
lease = self.leases.get(key)
# No lease? Try to get one (but don't block hot path)
if lease is None:
# Trigger background refresh
asyncio.create_task(self._refresh_lease(advertiser_id, campaign_id))
return False, "no_budget_lease"
# Lease expired?
if lease.is_expired:
asyncio.create_task(self._refresh_lease(advertiser_id, campaign_id))
return False, "lease_expired"
# Not enough remaining?
if lease.remaining < amount:
asyncio.create_task(self._refresh_lease(advertiser_id, campaign_id))
return False, "insufficient_budget"
# Check if we need background refresh
if lease.utilization > self.refresh_threshold:
asyncio.create_task(self._refresh_lease(advertiser_id, campaign_id))
return True, "approved"
async def record_spend(
self,
advertiser_id: str,
campaign_id: str,
amount: float
) -> bool:
"""
Record spend against local lease.
Call this AFTER winning auction, BEFORE responding.
"""
key = f"{advertiser_id}:{campaign_id}"
lease = self.leases.get(key)
if lease is None or lease.remaining < amount:
# This shouldn't happen if can_spend was called
logger.warning(f"Spend rejected: {key}, amount={amount}")
return False
lease.spent_amount += amount
# Async report to region (for reconciliation)
asyncio.create_task(
self.region.report_spend(advertiser_id, campaign_id, amount)
)
return True
async def _refresh_lease(
self,
advertiser_id: str,
campaign_id: str
):
"""Background refresh of budget lease."""
key = f"{advertiser_id}:{campaign_id}"
async with self._refresh_lock:
try:
# Request new allocation from region
allocation = await self.region.request_allocation(
advertiser_id=advertiser_id,
campaign_id=campaign_id,
pod_id=self.pod_id,
requested_amount=self._calculate_request_amount(key)
)
if allocation.granted_amount > 0:
self.leases[key] = BudgetLease(
advertiser_id=advertiser_id,
campaign_id=campaign_id,
allocated_amount=allocation.granted_amount,
spent_amount=0,
expires_at=datetime.utcnow() + self.lease_duration
)
except Exception as e:
logger.error(f"Failed to refresh lease {key}: {e}")
    def _calculate_request_amount(self, key: str) -> float:
        """Size the next lease request from recent burn (simple heuristic)."""
        previous = self.leases.get(key)
        if previous is None:
            return 100.0  # conservative default for a cold start
        # Ask for enough to cover the lease duration at the observed rate
        return max(previous.spent_amount, 100.0)


@dataclass
class AllocationResult:
    """Budget granted to a pod by the region allocator."""
    granted_amount: float
    advertiser_id: str
    campaign_id: str


class RegionBudgetAllocator:
"""
Manages budget allocation for a region.
Coordinates between:
- Global coordinator (gets region's share)
- Local pods (distributes to pods)
"""
def __init__(
self,
global_coordinator,
region_id: str,
redis_cluster
):
self.global_coord = global_coordinator
self.region_id = region_id
self.redis = redis_cluster
async def request_allocation(
self,
advertiser_id: str,
campaign_id: str,
pod_id: str,
requested_amount: float
) -> 'AllocationResult':
"""
Allocate budget to a pod.
Uses Redis for atomic operations within region.
"""
key = f"budget:{advertiser_id}:{campaign_id}:{self.region_id}"
# Lua script for atomic check-and-decrement
lua_script = """
local available = tonumber(redis.call('GET', KEYS[1]) or '0')
local requested = tonumber(ARGV[1])
if available <= 0 then
return 0
end
local granted = math.min(available, requested)
redis.call('DECRBY', KEYS[1], granted)
return granted
"""
granted = await self.redis.eval(
lua_script,
keys=[key],
args=[requested_amount]
)
if granted == 0:
# Try to get more from global
await self._request_global_refill(advertiser_id, campaign_id)
# Retry once
granted = await self.redis.eval(
lua_script,
keys=[key],
args=[requested_amount]
)
return AllocationResult(
granted_amount=granted,
advertiser_id=advertiser_id,
campaign_id=campaign_id
)
Interviewer: "Good. What happens during a region failure? How do you prevent double-spending?"
You: "Great question. We use budget reservations with expiration..."
FAILURE HANDLING

Scenario: US-EAST region goes down

1. IMMEDIATE
   • Region had $4,000 allocated
   • That budget is "locked" (not available to other regions)
   • Global coordinator detects failure (health check fails)

2. AFTER 30 SECONDS (lease expiration)
   • All pod leases expire
   • Region's allocation returned to global pool
   • Other regions can now claim it

3. RECONCILIATION
   • When US-EAST recovers
   • Reports actual spend from before failure
   • Kafka events provide source of truth
   • Any discrepancy flagged for review

WHY THIS IS SAFE:
   • Worst case: we UNDER-spend during failure (region can't bid)
   • Never over-spend (leases expire, budget returns)
   • Accept temporarily reduced efficiency over a correctness violation
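The reconciliation step can be sketched directly. This assumes auction-result events shaped like the BidEvent schema shown later in this lesson; the event source and the region's reporting API are placeholders:

# budget/reconciliation.py
# Post-recovery reconciliation sketch: the Kafka event log is the source of
# truth; any disagreement with the region's reported spend is flagged.
from collections import defaultdict

def reconcile(events, reported_spend, tolerance=0.001):
    """
    events: iterable of dicts from the bid-events topic (auction results)
    reported_spend: {(advertiser_id, campaign_id): amount} from the region
    Returns discrepancies exceeding the tolerance, for manual review.
    """
    actual = defaultdict(float)
    for e in events:
        if e["event_type"] == "auction_result" and e["is_winner"]:
            actual[(e["advertiser_id"], e["campaign_id"])] += e["win_price"]

    discrepancies = []
    for key, kafka_total in actual.items():
        reported = reported_spend.get(key, 0.0)
        if abs(kafka_total - reported) > tolerance * max(kafka_total, 1.0):
            discrepancies.append((key, reported, kafka_total))
    return discrepancies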
Deep Dive 2: DSP Fan-Out with Timeouts
Week 2 concepts: Timeouts, circuit breakers, failure handling.
You: "The DSP fan-out is where we battle latency. Let me show how we handle partial failures gracefully."
# bidding/dsp_gateway.py
"""
DSP Gateway: Manages communication with demand-side platforms.
Challenges:
- Must query 5-10 DSPs in parallel
- Each DSP has different latency characteristics
- Total budget: 70ms (must leave margin for auction)
- Partial responses are OK (bid with what we have)
"""
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Set
from datetime import datetime
import asyncio
import logging
logger = logging.getLogger(__name__)
@dataclass
class DSPConfig:
"""Configuration for a DSP."""
dsp_id: str
endpoint: str
timeout_ms: int = 70
# Circuit breaker settings
failure_threshold: int = 5 # Failures before opening
success_threshold: int = 3 # Successes before closing
half_open_timeout_s: int = 30 # Time before retrying
# Performance tracking
p99_latency_ms: float = 50
error_rate: float = 0.01
@dataclass
class DSPBid:
"""A bid from a DSP."""
dsp_id: str
bid_id: str
price: float
ad_markup: str
latency_ms: float
received_at: datetime
@dataclass
class FanOutResult:
"""Results from DSP fan-out."""
bids: List[DSPBid]
timeouts: List[str] # DSP IDs that timed out
errors: List[str] # DSP IDs that errored
circuit_open: List[str] # DSP IDs with open circuits
total_latency_ms: float
class CircuitBreaker:
"""
Circuit breaker for a single DSP.
States:
- CLOSED: Normal operation, requests flow through
- OPEN: DSP is unhealthy, fail fast (don't even try)
- HALF_OPEN: Testing if DSP recovered
"""
def __init__(self, config: DSPConfig):
self.config = config
self.state = "CLOSED"
self.failure_count = 0
self.success_count = 0
self.last_failure_time: Optional[datetime] = None
def can_execute(self) -> bool:
if self.state == "CLOSED":
return True
if self.state == "OPEN":
# Check if we should try half-open
if self.last_failure_time:
elapsed = (datetime.utcnow() - self.last_failure_time).total_seconds()
if elapsed > self.config.half_open_timeout_s:
self.state = "HALF_OPEN"
return True
return False
if self.state == "HALF_OPEN":
return True
return False
def record_success(self):
if self.state == "HALF_OPEN":
self.success_count += 1
if self.success_count >= self.config.success_threshold:
self.state = "CLOSED"
self.failure_count = 0
self.success_count = 0
else:
self.failure_count = 0
def record_failure(self):
self.failure_count += 1
self.last_failure_time = datetime.utcnow()
if self.state == "HALF_OPEN":
self.state = "OPEN"
self.success_count = 0
elif self.failure_count >= self.config.failure_threshold:
self.state = "OPEN"
class DSPMetrics:
    """Minimal fan-out metrics recorder (sketch; use StatsD/Prometheus in prod)."""

    def record_fanout(self, result: FanOutResult) -> None:
        logger.info(
            "fanout bids=%d timeouts=%d errors=%d circuit_open=%d latency=%.1fms",
            len(result.bids), len(result.timeouts), len(result.errors),
            len(result.circuit_open), result.total_latency_ms,
        )


class DSPGateway:
"""
Gateway for querying DSPs with timeout and failure handling.
"""
def __init__(
self,
http_client,
dsp_configs: List[DSPConfig],
global_timeout_ms: int = 70
):
self.http = http_client
self.dsps = {cfg.dsp_id: cfg for cfg in dsp_configs}
self.global_timeout_ms = global_timeout_ms
# Circuit breakers per DSP
self.breakers = {
cfg.dsp_id: CircuitBreaker(cfg)
for cfg in dsp_configs
}
# Metrics
self.metrics = DSPMetrics()
async def fan_out(
self,
request: 'BidRequest',
target_dsps: List[str]
) -> FanOutResult:
"""
Query multiple DSPs in parallel with timeout.
Key design decisions:
1. Start all requests immediately (parallel)
2. Use global timeout (not per-DSP)
3. Collect responses as they arrive
4. Return whatever we have when timeout hits
"""
start_time = datetime.utcnow()
result = FanOutResult(
bids=[],
timeouts=[],
errors=[],
circuit_open=[],
total_latency_ms=0
)
# Filter out DSPs with open circuits
active_dsps = []
for dsp_id in target_dsps:
breaker = self.breakers.get(dsp_id)
if breaker and breaker.can_execute():
active_dsps.append(dsp_id)
else:
result.circuit_open.append(dsp_id)
if not active_dsps:
return result
# Create tasks for all DSPs
tasks = {
dsp_id: asyncio.create_task(
self._query_dsp(dsp_id, request)
)
for dsp_id in active_dsps
}
# Wait with timeout, collecting results as they arrive
pending = set(tasks.values())
deadline = self.global_timeout_ms / 1000
try:
while pending:
# Calculate remaining time
elapsed = (datetime.utcnow() - start_time).total_seconds()
remaining = deadline - elapsed
if remaining <= 0:
# Timeout! Cancel remaining and break
for task in pending:
task.cancel()
break
# Wait for next completion or timeout
done, pending = await asyncio.wait(
pending,
timeout=remaining,
return_when=asyncio.FIRST_COMPLETED
)
# Process completed tasks
for task in done:
dsp_id = self._get_dsp_id_for_task(task, tasks)
try:
bid = task.result()
if bid:
result.bids.append(bid)
self.breakers[dsp_id].record_success()
except asyncio.TimeoutError:
result.timeouts.append(dsp_id)
self.breakers[dsp_id].record_failure()
except Exception as e:
result.errors.append(dsp_id)
self.breakers[dsp_id].record_failure()
logger.warning(f"DSP {dsp_id} error: {e}")
finally:
# Cancel any still pending
for task in pending:
task.cancel()
dsp_id = self._get_dsp_id_for_task(task, tasks)
result.timeouts.append(dsp_id)
result.total_latency_ms = (
datetime.utcnow() - start_time
).total_seconds() * 1000
# Log metrics
self.metrics.record_fanout(result)
return result
async def _query_dsp(
self,
dsp_id: str,
request: 'BidRequest'
) -> Optional[DSPBid]:
"""Query a single DSP."""
config = self.dsps[dsp_id]
start = datetime.utcnow()
try:
response = await self.http.post(
config.endpoint,
json=request.to_dsp_format(),
timeout=config.timeout_ms / 1000
)
latency = (datetime.utcnow() - start).total_seconds() * 1000
if response.status == 204:
# No bid
return None
bid_data = response.json()
return DSPBid(
dsp_id=dsp_id,
bid_id=bid_data['id'],
price=bid_data['price'],
ad_markup=bid_data['adm'],
latency_ms=latency,
received_at=datetime.utcnow()
)
except asyncio.TimeoutError:
raise
except Exception as e:
logger.error(f"DSP {dsp_id} request failed: {e}")
            raise

    def _get_dsp_id_for_task(
        self,
        task: 'asyncio.Task',
        tasks: Dict[str, 'asyncio.Task']
    ) -> str:
        """Reverse-map a completed task back to its DSP ID."""
        for dsp_id, t in tasks.items():
            if t is task:
                return dsp_id
        return "unknown"
Interviewer: "What if all DSPs are slow one day? How do you prevent cascading failures?"
You: "We implement backpressure at the edge. If we're falling behind, we shed load gracefully..."
BACKPRESSURE AND LOAD SHEDDING

When the system is overwhelmed:

1. DETECT OVERLOAD
   • Request queue depth > threshold
   • p99 latency > 80ms (approaching the limit)
   • CPU utilization > 80%

2. SHED LOAD GRACEFULLY
   • Return "no bid" immediately for low-value requests
   • Prioritize high-value ad slots
   • Random shedding to stay fair across publishers

3. ADAPTIVE TIMEOUT
   • If we're fast, use the full 70ms
   • If we're slow, reduce to 50ms (faster failures = more capacity)

4. DSP PRIORITIZATION
   • During overload, only query the top 3 DSPs (by historical performance)
   • Skip consistently slow DSPs
This is Week 3's backpressure pattern applied to real-time bidding!
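A sketch of the shedding decision itself, with illustrative thresholds; the value-weighted keep probability is one reasonable policy, not the only one:

# edge/load_shedding.py
# Overload detection and graceful shedding (illustrative thresholds).
import random

class LoadShedder:
    def __init__(self, latency_limit_ms: float = 80.0, queue_limit: int = 1000):
        self.latency_limit_ms = latency_limit_ms
        self.queue_limit = queue_limit

    def overload_factor(self, p99_latency_ms: float, queue_depth: int) -> float:
        """0.0 = healthy, 1.0 = fully overloaded."""
        latency_pressure = p99_latency_ms / self.latency_limit_ms
        queue_pressure = queue_depth / self.queue_limit
        return max(0.0, min(1.0, max(latency_pressure, queue_pressure) - 1.0))

    def should_shed(self, request_value: float, p99_latency_ms: float,
                    queue_depth: int) -> bool:
        """Shed low-value requests first; randomness keeps shedding fair."""
        factor = self.overload_factor(p99_latency_ms, queue_depth)
        if factor <= 0:
            return False
        # Higher-value slots survive longer as overload grows
        keep_probability = (1.0 - factor) + factor * min(request_value, 1.0)
        return random.random() > keep_probability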
Deep Dive 3: Audit Trail at Scale
Week 3 concepts: Event streaming, transactional outbox. Week 8 concepts: Analytics pipeline.
You: "Every bid request must be logged for billing and compliance. That's 10 billion events per day. Let me show the pipeline."
AUDIT TRAIL ARCHITECTURE

Requirements:
   • Every event captured (no loss)
   • Ordered per advertiser (for billing accuracy)
   • 7-year retention
   • Queryable for disputes

CRITICAL PATH (sync, ~5ms)

   Bid Orchestrator
   ┌─────────────────────────────────────────────────────────┐
   │  1. Process request
   │  2. Run auction
   │  3. Write to OUTBOX TABLE (same transaction as budget)
   │  4. Return response
   └─────────────────────────────────────────────────────────┘
                  │
                  │  (Transactional Outbox Pattern)
                  ▼
ASYNC PATH (background)

   ┌──────────────────┐      ┌──────────────────┐
   │  Outbox Poller   │─────▶│      Kafka       │
   │  (Debezium CDC)  │      │   bid-events     │
   └──────────────────┘      │  partitioned by  │
                             │  advertiser_id   │
                             └────────┬─────────┘
                                      │
              ┌───────────────────────┼───────────────────────┐
              ▼                       ▼                       ▼
      ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
      │   Billing    │        │  Analytics   │        │   Fraud ML   │
      │   Consumer   │        │   Consumer   │        │   Training   │
      └──────┬───────┘        └──────┬───────┘        └──────┬───────┘
             ▼                       ▼                       ▼
      ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
      │   Billing    │        │      S3      │        │   Feature    │
      │   Database   │        │  Data Lake   │        │    Store     │
      └──────────────┘        └──────┬───────┘        └──────────────┘
                                     ▼
                              ┌──────────────┐
                              │   Glacier    │
                              │  (7 years)   │
                              └──────────────┘
# audit/event_pipeline.py
"""
Audit event schema and processing.
Every bid event follows this schema for consistency
across billing, analytics, and compliance.
"""
from dataclasses import dataclass, asdict
from typing import List, Dict, Optional, Any
from datetime import datetime
import json
@dataclass
class BidEvent:
"""
Immutable record of a bid transaction.
This is the source of truth for:
- Billing reconciliation
- Analytics dashboards
- Fraud investigation
- Compliance audits
"""
# Identity
event_id: str # UUID, globally unique
event_type: str # "bid_request", "bid_response", "auction_result"
event_time: datetime # When this happened (event time, not processing time)
# Request context
request_id: str # Correlates all events for one request
exchange_id: str # Which ad exchange
publisher_id: str # Which publisher
# Advertiser context
advertiser_id: str # For partitioning and billing
campaign_id: str
# Auction details
bid_price: Optional[float] = None
win_price: Optional[float] = None # Second price (what they actually pay)
is_winner: bool = False
# DSP details
dsp_id: Optional[str] = None
dsp_latency_ms: Optional[float] = None
# User context (privacy-compliant)
user_id_hash: Optional[str] = None # Hashed, not raw
geo_country: Optional[str] = None
device_type: Optional[str] = None
# Technical metadata
region: str = ""
pod_id: str = ""
processing_time_ms: float = 0
def to_json(self) -> str:
"""Serialize for Kafka."""
data = asdict(self)
data['event_time'] = self.event_time.isoformat()
return json.dumps(data)
@property
def partition_key(self) -> str:
"""Kafka partition key ensures ordering per advertiser."""
return self.advertiser_id
# Kafka topic configuration
KAFKA_TOPIC_CONFIG = {
'name': 'bid-events',
'partitions': 256, # High partition count for parallelism
'replication_factor': 3, # Durability
'retention_ms': 7 * 24 * 60 * 60 * 1000, # 7 days hot
'cleanup_policy': 'delete',
    # Compression settings for efficiency
'compression_type': 'zstd', # Best compression ratio
'max_message_bytes': 1048576, # 1MB max
}
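Publishing might then look like the sketch below, here using the common confluent-kafka client (the producer config values are illustrative). Keying by advertiser_id is what provides the per-advertiser ordering:

# audit/publisher.py
# Publishing bid events to Kafka, keyed by advertiser for ordering (sketch).
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",                 # wait for all in-sync replicas (durability)
    "compression.type": "zstd",
    "linger.ms": 5,                # small batching window, off the hot path
})

def publish_bid_event(event: "BidEvent") -> None:
    producer.produce(
        topic="bid-events",
        key=event.partition_key.encode(),   # advertiser_id -> stable partition
        value=event.to_json().encode(),
        on_delivery=lambda err, msg: err and logger.error("delivery failed: %s", err),
    )
    producer.poll(0)  # serve delivery callbacks without blocking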
Interviewer: "7 years of data at 10 billion events/day. How do you make that queryable for billing disputes?"
You: "Tiered storage with intelligent partitioning..."
DATA TIERING STRATEGY

HOT TIER (0-7 days): Kafka + Elasticsearch
   • Fast queries for recent data
   • Used for real-time dashboards
   • ~70 billion events

WARM TIER (7-90 days): S3 Parquet + Presto
   • Columnar format for analytical queries
   • Partitioned by date + advertiser_id
   • ~830 billion events

COLD TIER (90 days - 7 years): S3 Glacier
   • Compressed archives
   • Queryable with Athena (slower)
   • Restored on-demand for disputes
   • ~25+ trillion events total

QUERY ROUTING:
   • "Last 24 hours" → Elasticsearch (sub-second)
   • "Last month"    → Presto on S3 (seconds)
   • "2 years ago"   → Glacier restore + Athena (minutes to hours)
Deep Dive 4: Fraud Detection in Real-Time
Week 9 concepts: Security, ML inference at scale.
You: "We need to detect and block invalid traffic in the critical path, without adding latency."
FRAUD DETECTION ARCHITECTURE

The challenge:
   • Must decide in <5ms (part of the edge processing budget)
   • 500K RPS of inference
   • Accuracy matters (false positives = lost revenue)

SOLUTION: Tiered Detection

TIER 1: RULE-BASED BLOCKING (<0.5ms)
   • Known bad IPs (blocklist)
   • Known data center ranges
   • Rate limiting per IP/user
   • Impossible geography (user in 2 countries in 1 second)

   Blocks: ~5% of traffic (obvious bots)

TIER 2: ML SCORING (<2ms)
   • Pre-computed features (cached in Redis)
   • Lightweight model (gradient boosting)
   • Returns fraud probability (0-1)

   High confidence fraud (>0.95):  Block
   Medium confidence (0.5-0.95):   Flag, bid lower
   Low confidence (<0.5):          Allow

   Blocks: ~2% of traffic

TIER 3: ASYNC DEEP ANALYSIS (non-blocking)
   • Complex neural network
   • Analyzes patterns over time
   • Updates blocklists and ML model features
   • Runs on the event stream (not the real-time path)

   Identifies: sophisticated attacks, click farms, etc.
# fraud/realtime_detector.py
"""
Real-time fraud detection in the bidding pipeline.
Constraint: Must complete in <5ms including all tiers.
"""
from dataclasses import dataclass
from typing import Tuple
import numpy as np
@dataclass
class FraudSignals:
"""Features extracted for fraud detection."""
ip_address: str
user_agent: str
device_id: str
geo_country: str
request_rate_1m: int # Requests from this IP in last minute
request_rate_1h: int # Requests from this IP in last hour
user_age_days: int # How long we've seen this user
session_duration: float # Current session length
click_rate: float # Historical click-through rate
class RealTimeFraudDetector:
"""
Tiered fraud detection for real-time bidding.
"""
def __init__(
self,
blocklist_store,
feature_store,
ml_model,
rate_limiter
):
self.blocklist = blocklist_store # Redis set
self.features = feature_store # Redis hash
self.model = ml_model # Pre-loaded XGBoost
self.limiter = rate_limiter
async def evaluate(
self,
signals: FraudSignals
) -> Tuple[str, float]:
"""
Evaluate fraud risk.
Returns: (decision, confidence)
- decision: "block", "flag", "allow"
- confidence: 0.0 to 1.0
"""
# TIER 1: Rule-based checks (<0.5ms)
tier1_result = await self._tier1_rules(signals)
if tier1_result == "block":
return "block", 0.99
# TIER 2: ML scoring (<2ms)
fraud_score = await self._tier2_ml_score(signals)
if fraud_score > 0.95:
return "block", fraud_score
elif fraud_score > 0.5:
return "flag", fraud_score
else:
return "allow", 1 - fraud_score
async def _tier1_rules(self, signals: FraudSignals) -> str:
"""
Fast rule-based checks.
Rules are simple, but catch obvious fraud instantly.
"""
# Check IP blocklist
if await self.blocklist.contains(signals.ip_address):
return "block"
# Check rate limits
if signals.request_rate_1m > 100: # >100 requests/minute from one IP
return "block"
# Check for data center IP ranges
if self._is_datacenter_ip(signals.ip_address):
return "block"
return "allow"
async def _tier2_ml_score(self, signals: FraudSignals) -> float:
"""
ML-based fraud scoring.
Uses pre-computed features and lightweight model.
"""
# Get historical features from Redis
        historical = await self.features.get(signals.device_id) or {}  # may be absent for new devices
# Build feature vector
features = np.array([
signals.request_rate_1m,
signals.request_rate_1h,
signals.user_age_days,
signals.session_duration,
signals.click_rate,
historical.get('conversion_rate', 0),
historical.get('avg_session_length', 0),
historical.get('fraud_score_history', 0),
])
# Score with pre-loaded model (XGBoost inference is fast)
score = self.model.predict_proba(features.reshape(1, -1))[0][1]
return score
Phase 5: Scaling and Edge Cases
Interviewer: "Impressive coverage. Now let's stress test your design. What happens during the Super Bowl when traffic goes 5x?"
You: "We handle this with predictive scaling and graceful degradation..."
Super Bowl Scale Event
HANDLING A 5X TRAFFIC SPIKE

BEFORE THE EVENT (24 hours prior):
   • Pre-scale all regions to 3x capacity
   • Warm up caches with likely ad content
   • Alert the on-call team
   • Verify runbooks

DURING THE EVENT (see the sketch after this list):
   • Auto-scaling handles 3x → 5x if needed
   • DSP fan-out reduced (query top 5 only, not 10)
   • Longer latency tolerance (relax to 120ms if needed)
   • Non-critical features disabled (detailed logging reduced)

IF WE'RE STILL OVERWHELMED:
   • Load shedding kicks in
   • Lower-value requests get a "no bid" response
   • High-value requests prioritized
   • Status page updated if significant

METRICS TO WATCH:
   • Request queue depth
   • p99 latency (should stay <100ms)
   • Budget accuracy (must stay correct)
   • Revenue per second (business metric)
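The degradation steps above can be encoded as an ordered ladder so operators and auto-scaling act consistently; a sketch with illustrative levels and thresholds:

# ops/degradation.py
# Ordered degradation ladder for traffic spikes (illustrative thresholds).
from dataclasses import dataclass

@dataclass
class DegradationLevel:
    name: str
    max_dsps: int          # how many DSPs to fan out to
    dsp_timeout_ms: int    # tighter timeouts free capacity faster
    detailed_logging: bool

LADDER = [
    DegradationLevel("normal",   max_dsps=10, dsp_timeout_ms=70, detailed_logging=True),
    DegradationLevel("elevated", max_dsps=5,  dsp_timeout_ms=60, detailed_logging=True),
    DegradationLevel("critical", max_dsps=3,  dsp_timeout_ms=50, detailed_logging=False),
]

def pick_level(cpu_utilization: float, p99_latency_ms: float) -> DegradationLevel:
    """Step down the ladder as pressure rises; budget accuracy is never relaxed."""
    if cpu_utilization > 0.9 or p99_latency_ms > 95:
        return LADDER[2]
    if cpu_utilization > 0.8 or p99_latency_ms > 80:
        return LADDER[1]
    return LADDER[0]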
Edge Cases
You: "Let me walk through critical edge cases..."
EDGE CASE 1: DSP Goes Rogue (Returns in 500ms)

Problem:  One DSP starts responding very slowly
Impact:   Slows down all auctions that include it
Solution:
   • Circuit breaker opens after 5 consecutive timeouts
   • DSP excluded from future requests for 30s
   • Automatic recovery when it improves
   • Alert fires for ops awareness

EDGE CASE 2: Clock Skew Between Regions

Problem:  Different regions see a different "now"
Impact:   Budget leases might overlap or gap
Solution:
   • Use logical time (Lamport clocks) for critical operations
   • NTP synchronization on all servers
   • Lease expiration has a buffer (30s lease, 25s working time)
   • Daily reconciliation catches any drift

EDGE CASE 3: Network Partition Between Regions

Problem:  US-EAST can't talk to the global coordinator
Impact:   Can't refresh budget allocations
Solution:
   • Each region operates independently during the partition
   • Conservative: stop bidding when the lease expires
   • Partition detection via gossip protocol
   • Automatic recovery when connectivity is restored

EDGE CASE 4: Flash Crash of DSP Responses

Problem:  All DSPs suddenly return no bids
Impact:   No ads to serve, revenue drops
Detection:
   • Sudden drop in bid rate (compared to baseline)
   • All DSPs affected vs. one DSP affected
Response:
   • If one DSP: circuit breaker handles it
   • If all DSPs: alert; may be an exchange issue
   • Fallback: serve house ads if configured
Phase 6: Monitoring and Operations
You: "Let me define the SLOs and observability strategy for this system..."
SLOs and SLIs
SERVICE LEVEL OBJECTIVES

SLO 1: LATENCY
   SLI: Percentage of requests completing within 100ms
   Target: 99.9% of requests < 100ms
   Measurement: Histogram of response times at the load balancer

   Error budget: 0.1% of requests can exceed 100ms
   Per day at 10B requests: 10 million slow requests allowed

SLO 2: AVAILABILITY
   SLI: Percentage of requests that receive a valid response
   Target: 99.99% availability
   Measurement: (Successful responses) / (Total requests)

   Error budget: 0.01% of requests can fail
   Per day at 10B requests: 1 million failures allowed

SLO 3: BUDGET ACCURACY
   SLI: Percentage of campaigns within 0.1% of budget
   Target: 99.9% of campaigns never exceed budget by >0.1%
   Measurement: (Actual spend - Budget) / Budget

   This is a HARD requirement (contractual obligation)

SLO 4: BID RATE
   SLI: Percentage of requests where we submit a bid
   Target: >80% bid rate (when we have budget)
   Measurement: (Bids submitted) / (Requests with budget)

   Lower bid rate = lost revenue opportunity
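Error-budget math is simple enough to embed directly in alerting; a sketch of a burn-rate check for the latency SLO (the example thresholds are illustrative):

# ops/error_budget.py
# Error-budget burn rate for the latency SLO (99.9% of requests < 100ms).
SLO_TARGET = 0.999

def burn_rate(slow_requests: int, total_requests: int) -> float:
    """Burn rate 1.0 = spending budget exactly as fast as allowed.

    >1.0 means the budget runs out before the window ends; a sustained
    fast-burn rate over a short window is what pages the on-call.
    """
    if total_requests == 0:
        return 0.0
    error_ratio = slow_requests / total_requests
    allowed_ratio = 1.0 - SLO_TARGET          # 0.1%
    return error_ratio / allowed_ratio

# Example: 0.5% slow over the last hour -> burn rate 5.0 (page-worthy)
assert abs(burn_rate(50_000, 10_000_000) - 5.0) < 1e-9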
Key Dashboards
OPERATIONAL DASHBOARDS

DASHBOARD 1: REAL-TIME BIDDING HEALTH
├── Requests per second (by region)
├── Bid rate (should be >80%)
├── Latency p50/p99 (should be <100ms)
├── Error rate (should be <0.01%)
├── DSP response times (per DSP)
└── Circuit breaker states

DASHBOARD 2: BUSINESS METRICS
├── Revenue per second
├── Spend rate (vs budget)
├── Win rate (bids won / bids submitted)
├── Average win price
└── Top advertisers by spend

DASHBOARD 3: FRAUD OVERVIEW
├── Block rate (Tier 1 rules)
├── Flag rate (Tier 2 ML)
├── False positive estimates
├── Top blocked IPs/patterns
└── Model drift indicators

DASHBOARD 4: INFRASTRUCTURE
├── CPU/Memory utilization by service
├── Kafka lag (should be near zero)
├── Redis memory usage
├── Database connection pool status
└── Regional traffic distribution
Interview Conclusion
Interviewer: "This has been an exceptional session. You've demonstrated deep understanding across all dimensions β from low-level latency optimization to high-level architectural decisions. A few rapid-fire questions:"
Interviewer: "Why not use gRPC instead of HTTP for DSP communication?"
You: "Good question. gRPC would give us better performance with HTTP/2 multiplexing and protobuf serialization. The trade-off is that DSPs would need to support gRPC, which many don't. For the DSPs we control, we could use gRPC internally. For external DSPs, we'd use HTTP with connection pooling and keep-alive."
Interviewer: "If you had to cut latency from 100ms to 50ms, where would you start?"
You: "I'd focus on the DSP fan-out since that's 70ms of our budget. Options:
- Reduce DSP count from 10 to 5 (fewer bids but faster)
- Use predictive models to guess which DSPs will bid (skip unlikely ones)
- Tighter timeouts (50ms instead of 70ms)
- Edge caching of DSP bid models for popular requests"
Interviewer: "What's the biggest risk in this system?"
You: "Budget accuracy. Unlike latency or availability where degradation is bad but recoverable, budget violations are contractual breaches. If we overspend, advertisers don't pay β we eat the cost. I'd invest heavily in testing the budget system, including chaos engineering specifically for the budget path."
Interviewer: "Excellent. Any questions for me?"
You: "Yes β at your actual scale, what was the hardest problem to solve? And how do you handle the cold start problem for new advertisers with no historical data for fraud detection?"
The interviewer smiles.
Interviewer: "Great questions β and exactly what I'd expect from a senior candidate. Welcome to the team."
Summary: Ten Weeks of Learning Applied
CONCEPTS APPLIED IN THIS DESIGN

WEEK 1: DATA AT SCALE
├── Partitioning: Budget by advertiser, events by advertiser
├── Replication: Multi-region for availability
└── Consistency: Strong for budget, eventual for analytics

WEEK 2: FAILURE-FIRST DESIGN
├── Timeouts: 70ms DSP timeout with budget for auction
├── Circuit breakers: Per-DSP failure isolation
└── Idempotency: Event IDs for exactly-once processing

WEEK 3: MESSAGING & ASYNC
├── Transactional outbox: Guaranteed event capture
├── Kafka: Event streaming for audit trail
└── Backpressure: Load shedding during overload

WEEK 4: CACHING
├── User data caching: Redis for enrichment
├── DSP response caching: Where appropriate
└── Feature caching: For real-time ML inference

WEEK 5: CONSISTENCY & COORDINATION
├── Distributed budget: Hierarchical allocation
├── Lease-based consistency: Local spending with global truth
└── Conflict resolution: Conservative (under-spend > over-spend)

WEEK 6: NOTIFICATION PLATFORM
├── Multi-channel delivery: Applied to DSP fan-out
├── Priority queuing: High-value requests first
└── Dead letter handling: Failed DSP requests tracked

WEEK 7: SEARCH SYSTEM
├── Indexing: Not directly, but similar patterns for user lookup
└── Query optimization: Fast feature lookup for ML

WEEK 8: ANALYTICS PIPELINE
├── Event ingestion: 10B events/day capture
├── Tiered storage: Hot/warm/cold for 7-year retention
└── Late-arriving data: Reconciliation processes

WEEK 9: MULTI-TENANCY & SECURITY
├── Tenant isolation: Per-advertiser budget isolation
├── Fraud detection: ML-based traffic validation
└── Audit compliance: Complete transaction history

WEEK 10: PRODUCTION READINESS
├── SLOs: Latency, availability, budget accuracy
├── Observability: Metrics, logs, traces for debugging
├── Deployment: Canary with automatic rollback
├── Capacity: Predictive scaling for events
└── Incidents: Runbooks for common failure modes
Self-Assessment Checklist
After studying this capstone, you should be able to:
Requirements & Estimation:
- Clarify requirements before designing
- Calculate traffic, storage, and infrastructure needs
- Create latency budgets for time-sensitive systems
Architecture:
- Design multi-region systems for global availability
- Create appropriate separation of concerns
- Handle real-time and batch processing in one system
Distributed Systems:
- Implement strong consistency where required (budget)
- Use eventual consistency appropriately (analytics)
- Design hierarchical systems for coordination at scale
Performance:
- Optimize for latency-critical paths
- Implement parallel processing with timeouts
- Use caching effectively at multiple layers
Reliability:
- Implement circuit breakers and bulkheads
- Design for graceful degradation
- Handle edge cases and failure modes
Operations:
- Define meaningful SLOs
- Design observability into the system
- Plan for capacity and scaling events
- Create runbooks for incident response
🎉 Congratulations!
┌──────────────────────────────────────────────────────────────────────┐
│
│                    SYSTEM DESIGN MASTERY COMPLETE
│
│                               10 Weeks
│                              50 Lessons
│                             10 Capstones
│                          1 Complete Toolkit
│
│   ──────────────────────────────────────────────────────────────
│
│   You've learned to:
│
│   ✓ Design systems that scale to millions of users
│   ✓ Build for failure from the start
│   ✓ Make informed trade-offs between consistency and availability
│   ✓ Implement production-ready patterns
│   ✓ Operate systems with confidence
│   ✓ Communicate designs clearly in interviews
│
│   ──────────────────────────────────────────────────────────────
│
│   Remember:
│
│   "The goal isn't to memorize solutions;
│    it's to build intuition for trade-offs.
│
│    When someone asks 'how would you design X?',
│    your first instinct should be:
│    'What are the constraints? What can we sacrifice?'
│
│    That's senior engineering thinking."
│
│   ──────────────────────────────────────────────────────────────
│
│                       Go build something great.
│
└──────────────────────────────────────────────────────────────────────┘
End of Week 10 Capstone
End of System Design Mastery: 10-Week Intensive Program