Week 10 — Day 4: Capacity Planning
System Design Mastery Series — Production Readiness and Operational Excellence
Preface
You can define health, see health, and maintain health through change.
But what happens when Black Friday hits and traffic goes 10x?
THE CAPACITY NIGHTMARE
Friday, 11:47 PM. Your phone explodes.
ALERT: API latency p99 = 15 seconds (SLO: 200ms)
ALERT: Database connections exhausted
ALERT: Error rate 45% (SLO: 0.1%)
ALERT: Memory usage 98%
ALERT: Customer complaints spiking
You open your laptop. The dashboard is all red.
What happened?
├── Marketing launched a viral campaign
├── Traffic went 5x normal
├── Nobody told engineering
├── System wasn't ready
└── You're now firefighting at midnight
This didn't have to happen.
WITH CAPACITY PLANNING:
├── You know your system's limits
├── You forecast future demand
├── You scale BEFORE you need to
├── Marketing campaign? Pre-scaled 24 hours ago
└── You're sleeping peacefully at midnight
This is capacity planning.
Today, we learn to answer: "How much traffic can my system handle, and when will I need more?"
Part I: Foundations
Chapter 1: What Is Capacity Planning?
1.1 The Capacity Planning Process
CAPACITY PLANNING CYCLE
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ CAPACITY PLANNING │
│ │
│ ┌──────────────┐ │
│ │ MEASURE │ ◄─── What can we handle NOW? │
│ │ Current │ Current load, headroom, limits │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ FORECAST │ ◄─── What will we NEED? │
│ │ Demand │ Growth trends, events, seasonality │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ IDENTIFY │ ◄─── What breaks FIRST? │
│ │ Bottlenecks│ Database, memory, CPU, network │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ PLAN │ ◄─── What do we DO about it? │
│ │ Scaling │ Scale up, scale out, optimize │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ VALIDATE │ ◄─── Does it WORK? │
│ │ (Test) │ Load testing, chaos engineering │
│ └──────┬───────┘ │
│ │ │
│ └──────────────► Back to MEASURE │
│ │
└────────────────────────────────────────────────────────────────────────┘
1.2 Key Metrics for Capacity
CAPACITY METRICS BY LAYER
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ APPLICATION LAYER │
│ ───────────────── │
│ ├── Requests per second (RPS) │
│ ├── Active connections │
│ ├── Request queue depth │
│ ├── Response time (p50, p99) │
│ └── Error rate │
│ │
│ Question: How many requests can we handle before latency degrades? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ COMPUTE LAYER │
│ ───────────── │
│ ├── CPU utilization (per core) │
│ ├── Memory usage and available │
│ ├── Thread/goroutine count │
│ ├── Garbage collection frequency │
│ └── Context switches │
│ │
│ Question: Are we CPU-bound or memory-bound? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ DATABASE LAYER │
│ ────────────── │
│ ├── Queries per second │
│ ├── Connection pool usage │
│ ├── Query latency (p50, p99) │
│ ├── Replication lag │
│ ├── Disk I/O (reads/writes) │
│ └── Buffer cache hit rate │
│ │
│ Question: Is the database the bottleneck? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ CACHE LAYER │
│ ─────────── │
│ ├── Hit rate │
│ ├── Miss rate │
│ ├── Memory usage │
│ ├── Eviction rate │
│ └── Operations per second │
│ │
│ Question: Is caching effective? Are we evicting too much? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ NETWORK LAYER │
│ ───────────── │
│ ├── Bandwidth utilization │
│ ├── Packet loss │
│ ├── Connection errors │
│ └── DNS lookup time │
│ │
│ Question: Is the network a bottleneck? │
│ │
└────────────────────────────────────────────────────────────────────────┘
1.3 The Concept of Headroom
HEADROOM: THE SAFETY BUFFER
Headroom = (Maximum Capacity - Current Usage) / Maximum Capacity × 100%
Example:
├── Database can handle: 3,000 QPS
├── Current usage: 2,000 QPS
├── Headroom: (3000 - 2000) / 3000 = 33%
Why headroom matters:
├── Traffic is bursty (spikes above average)
├── Dependent services might fail (retry storms)
├── GC pauses, network blips happen
├── Need room to absorb unexpected load
└── Operating at 100% = already failing
RECOMMENDED HEADROOM:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Headroom % │ Status │ Action │
│ ────────────┼─────────────┼────────────────────────────────────── │
│ > 50% │ Healthy │ Monitor, plan for growth │
│ 30-50% │ Adequate │ Start planning scaling │
│ 20-30% │ Warning │ Actively prepare scaling │
│ 10-20% │ Critical │ Scale NOW │
│ < 10% │ Emergency │ You're probably already degraded │
│ │
└─────────────────────────────────────────────────────────────────────┘
The rule of thumb: Maintain at least 30% headroom on critical resources.
Chapter 2: Measuring Current Capacity
2.1 Establishing Baselines
# capacity/baseline.py
"""
Capacity baseline measurement and tracking.
Establishes what "normal" looks like for your system.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime, timedelta
from enum import Enum
import statistics
@dataclass
class ResourceMetrics:
"""Metrics for a single resource."""
name: str
current_value: float
max_capacity: float
unit: str
@property
def utilization(self) -> float:
"""Current utilization percentage."""
return (self.current_value / self.max_capacity) * 100 if self.max_capacity > 0 else 0
@property
def headroom(self) -> float:
"""Available headroom percentage."""
return 100 - self.utilization
@property
def status(self) -> str:
"""Health status based on headroom."""
if self.headroom > 50:
return "healthy"
elif self.headroom > 30:
return "adequate"
elif self.headroom > 20:
return "warning"
elif self.headroom > 10:
return "critical"
else:
return "emergency"
@dataclass
class CapacityBaseline:
"""Baseline capacity measurements for a service."""
service_name: str
measured_at: datetime
# Traffic metrics
avg_rps: float
peak_rps: float
p50_latency_ms: float
p99_latency_ms: float
# Resource metrics
resources: Dict[str, ResourceMetrics] = field(default_factory=dict)
# Derived limits
max_sustainable_rps: float = 0 # RPS before latency degrades
breaking_point_rps: float = 0 # RPS when errors start
class CapacityMeasurer:
"""
Measures and tracks system capacity.
"""
def __init__(self, metrics_client, config):
self.metrics = metrics_client
self.config = config
async def measure_baseline(
self,
service_name: str,
window: str = "7d"
) -> CapacityBaseline:
"""
Measure baseline capacity from historical data.
"""
# Traffic metrics
avg_rps = await self.metrics.query(
f'avg(rate(http_requests_total{{service="{service_name}"}}[5m]))',
window=window
)
peak_rps = await self.metrics.query(
f'max(rate(http_requests_total{{service="{service_name}"}}[5m]))',
window=window
)
p50_latency = await self.metrics.query(
f'avg(histogram_quantile(0.5, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m])))',
window=window
) * 1000 # Convert to ms
p99_latency = await self.metrics.query(
f'avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m])))',
window=window
) * 1000
# Resource metrics
resources = {}
# CPU
cpu_usage = await self.metrics.query(
f'avg(rate(container_cpu_usage_seconds_total{{service="{service_name}"}}[5m]))',
window=window
)
cpu_limit = await self.metrics.query(
f'avg(container_spec_cpu_quota{{service="{service_name}"}} / container_spec_cpu_period{{service="{service_name}"}})',
window=window
)
resources['cpu'] = ResourceMetrics(
name='CPU',
current_value=cpu_usage,
max_capacity=cpu_limit,
unit='cores'
)
# Memory
memory_usage = await self.metrics.query(
f'avg(container_memory_usage_bytes{{service="{service_name}"}})',
window=window
) / (1024 ** 3) # Convert to GB
memory_limit = await self.metrics.query(
f'avg(container_spec_memory_limit_bytes{{service="{service_name}"}})',
window=window
) / (1024 ** 3)
resources['memory'] = ResourceMetrics(
name='Memory',
current_value=memory_usage,
max_capacity=memory_limit,
unit='GB'
)
# Database connections
db_connections = await self.metrics.query(
f'avg(db_connections_active{{service="{service_name}"}})',
window=window
)
db_max_connections = await self.metrics.query(
f'avg(db_connections_max{{service="{service_name}"}})',
window=window
)
resources['db_connections'] = ResourceMetrics(
name='Database Connections',
current_value=db_connections,
max_capacity=db_max_connections,
unit='connections'
)
return CapacityBaseline(
service_name=service_name,
measured_at=datetime.utcnow(),
avg_rps=avg_rps,
peak_rps=peak_rps,
p50_latency_ms=p50_latency,
p99_latency_ms=p99_latency,
resources=resources
)
async def generate_capacity_report(
self,
service_name: str
) -> str:
"""
Generate a human-readable capacity report.
"""
baseline = await self.measure_baseline(service_name)
report = f"""
CAPACITY REPORT: {service_name}
Generated: {baseline.measured_at.isoformat()}
{'=' * 60}
TRAFFIC METRICS
---------------
Average RPS: {baseline.avg_rps:.1f}
Peak RPS: {baseline.peak_rps:.1f}
P50 Latency: {baseline.p50_latency_ms:.1f}ms
P99 Latency: {baseline.p99_latency_ms:.1f}ms
RESOURCE UTILIZATION
--------------------
"""
for name, resource in baseline.resources.items():
status_emoji = {
'healthy': '🟢',
'adequate': '🟡',
'warning': '🟠',
'critical': '🔴',
'emergency': '⚫'
}.get(resource.status, '⚪')
report += f"""
{resource.name}:
Current: {resource.current_value:.2f} {resource.unit}
Maximum: {resource.max_capacity:.2f} {resource.unit}
Usage: {resource.utilization:.1f}%
Headroom: {resource.headroom:.1f}%
Status: {status_emoji} {resource.status.upper()}
"""
# Identify bottleneck
bottleneck = min(
baseline.resources.values(),
key=lambda r: r.headroom
)
report += f"""
{'=' * 60}
BOTTLENECK ANALYSIS
-------------------
Primary bottleneck: {bottleneck.name}
- Only {bottleneck.headroom:.1f}% headroom remaining
- Will likely be first to saturate under load
RECOMMENDATIONS
---------------
"""
for name, resource in baseline.resources.items():
if resource.headroom < 30:
report += f" ⚠️ {resource.name}: Plan scaling (headroom < 30%)\n"
elif resource.headroom < 50:
report += f" 📋 {resource.name}: Monitor closely\n"
return report
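To tie this together, here is a minimal usage sketch. The metrics_client and config objects are placeholders for whatever Prometheus-style client and settings object you already have; "checkout-api" is an example service name.
# Hypothetical wiring: metrics_client and config are stand-ins for your own setup.
import asyncio

async def main():
    measurer = CapacityMeasurer(metrics_client, config)
    report = await measurer.generate_capacity_report("checkout-api")
    print(report)

asyncio.run(main())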
2.2 Understanding Your Limits
FINDING YOUR BREAKING POINT
You need to know:
1. At what load does latency start degrading?
2. At what load do errors start appearing?
3. At what load does the system fall over completely?
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ LATENCY vs LOAD CURVE │
│ │
│ Latency │
│ (ms) │
│ │ ╱ │
│ 5000│ ╱ │
│ │ ╱ │
│ │ ╱ Breaking │
│ 1000│ ╱ Point │
│ │ ╱ │
│ │ ╱╱╱ │
│ 500│ ╱╱╱ Degradation │
│ │ ╱╱╱ Begins │
│ │ ╱╱╱╱ │
│ 200├──────────────────────────╱──────────────────── SLO Target │
│ │ ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱ │
│ 100│╱╱╱╱╱╱ │
│ └────────────┬────────────┬────────────┬────────────┬──────────► │
│ 500 1000 1500 2000 RPS │
│ │ │ │
│ │ └── Breaking point │
│ └── Sustainable max (with SLO) │
│ │
│ Key points: │
│ ├── Up to 1000 RPS: Latency stable, within SLO │
│ ├── 1000-1500 RPS: Latency increasing but acceptable │
│ ├── 1500+ RPS: SLO violated, degradation visible │
│ └── 1800+ RPS: System failing, errors appearing │
│ │
│   YOUR OPERATING CAPACITY = 1,000 RPS (~30% headroom below the 1,500 RPS sustainable max)   │
│ │
└────────────────────────────────────────────────────────────────────────┘
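One way to turn a stress-test sweep into these numbers: find the highest tested RPS whose p99 still meets the SLO, then set the operating target roughly 30% below it. A minimal sketch; the sample sweep is illustrative and mirrors the curve above.
def sustainable_max_rps(sweep, slo_p99_ms):
    """sweep: list of (rps, p99_ms) pairs from a ramped load test."""
    within_slo = [rps for rps, p99 in sweep if p99 <= slo_p99_ms]
    return max(within_slo) if within_slo else 0

# Illustrative data points matching the curve above (SLO: p99 < 200ms)
sweep = [(500, 110), (1000, 140), (1500, 195), (1800, 600), (2000, 5000)]
max_sustainable = sustainable_max_rps(sweep, slo_p99_ms=200)   # 1500 RPS
operating_target = max_sustainable * 0.7                       # ~1000 RPS with 30% headroom
print(max_sustainable, round(operating_target))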
Chapter 3: Forecasting Demand
3.1 Growth Modeling
# capacity/forecasting.py
"""
Demand forecasting for capacity planning.
Predicts future load based on historical trends and known events.
"""
from dataclasses import dataclass
from typing import List, Optional, Tuple
from datetime import datetime, timedelta
import math
@dataclass
class TrafficForecast:
"""Forecast for future traffic."""
date: datetime
predicted_rps: float
confidence_low: float
confidence_high: float
factors: List[str] # What's driving this prediction
@dataclass
class GrowthModel:
"""Model for traffic growth."""
base_rps: float
monthly_growth_rate: float # e.g., 0.10 for 10% MoM
seasonality: dict # Month -> multiplier
day_of_week_pattern: dict # Day -> multiplier
hour_of_day_pattern: dict # Hour -> multiplier
class DemandForecaster:
"""
Forecasts future demand based on historical patterns.
"""
def __init__(self, metrics_client):
self.metrics = metrics_client
async def build_growth_model(
self,
service_name: str,
lookback_months: int = 6
) -> GrowthModel:
"""
Build growth model from historical data.
"""
# Get monthly averages
monthly_data = []
for i in range(lookback_months, 0, -1):
start = datetime.utcnow() - timedelta(days=30 * i)
end = start + timedelta(days=30)
avg_rps = await self.metrics.query_range(
f'avg(rate(http_requests_total{{service="{service_name}"}}[1h]))',
start=start,
end=end
)
monthly_data.append(avg_rps.mean())
# Calculate growth rate
if len(monthly_data) >= 2:
# Compound monthly growth rate
start_value = monthly_data[0]
end_value = monthly_data[-1]
months = len(monthly_data) - 1
if start_value > 0:
growth_rate = (end_value / start_value) ** (1 / months) - 1
else:
growth_rate = 0
else:
growth_rate = 0
# Get seasonality (month of year patterns)
seasonality = await self._calculate_seasonality(service_name)
# Get day-of-week patterns
dow_pattern = await self._calculate_dow_pattern(service_name)
# Get hour-of-day patterns
hod_pattern = await self._calculate_hod_pattern(service_name)
return GrowthModel(
base_rps=monthly_data[-1] if monthly_data else 0,
monthly_growth_rate=growth_rate,
seasonality=seasonality,
day_of_week_pattern=dow_pattern,
hour_of_day_pattern=hod_pattern
)
def forecast(
self,
model: GrowthModel,
target_date: datetime,
include_events: List[dict] = None
) -> TrafficForecast:
"""
Forecast traffic for a target date.
"""
now = datetime.utcnow()
months_ahead = (target_date - now).days / 30
# Base growth projection
base_projection = model.base_rps * ((1 + model.monthly_growth_rate) ** months_ahead)
# Apply seasonality
month = target_date.month
seasonality_factor = model.seasonality.get(month, 1.0)
# Apply day-of-week pattern
dow = target_date.weekday()
dow_factor = model.day_of_week_pattern.get(dow, 1.0)
        # Apply hour-of-day pattern for the peak hour (default to 1.0 if no pattern data)
        peak_hour_factor = max(model.hour_of_day_pattern.values(), default=1.0)
# Calculate predicted RPS
predicted_rps = base_projection * seasonality_factor * dow_factor * peak_hour_factor
# Apply event multipliers
factors = ["organic_growth", f"seasonality_{month}", f"dow_{dow}"]
if include_events:
for event in include_events:
if event['date'].date() == target_date.date():
predicted_rps *= event['traffic_multiplier']
factors.append(event['name'])
# Confidence intervals (simplified)
confidence_low = predicted_rps * 0.8
confidence_high = predicted_rps * 1.4
return TrafficForecast(
date=target_date,
predicted_rps=predicted_rps,
confidence_low=confidence_low,
confidence_high=confidence_high,
factors=factors
)
def forecast_range(
self,
model: GrowthModel,
start_date: datetime,
end_date: datetime,
known_events: List[dict] = None
) -> List[TrafficForecast]:
"""
Forecast traffic for a date range.
"""
forecasts = []
current = start_date
while current <= end_date:
forecast = self.forecast(model, current, known_events)
forecasts.append(forecast)
current += timedelta(days=1)
return forecasts
# =============================================================================
# EXAMPLE: PLANNING FOR KNOWN EVENTS
# =============================================================================
"""
Known events that affect traffic:
BLACK_FRIDAY = {
'name': 'Black Friday',
'date': datetime(2024, 11, 29),
'traffic_multiplier': 5.0, # 5x normal traffic
'duration_hours': 24
}
PRODUCT_LAUNCH = {
'name': 'Product Launch',
'date': datetime(2024, 10, 15),
'traffic_multiplier': 3.0,
'duration_hours': 48
}
MARKETING_CAMPAIGN = {
'name': 'TV Ad Campaign',
'date': datetime(2024, 9, 1),
'traffic_multiplier': 2.0,
'duration_hours': 168 # 1 week
}
"""
3.2 Capacity Timeline
CAPACITY TIMELINE PLANNING
Current: January
Traffic: 1,000 RPS average, 1,500 peak
Capacity: 2,000 RPS (sustainable)
Headroom: 33%
Forecast:
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ Month │ Projected Peak │ Capacity │ Headroom │ Action Needed │
│ ─────────┼────────────────┼──────────┼──────────┼──────────────────── │
│ January │ 1,500 RPS │ 2,000 │ 25% │ 🟡 Monitor │
│ February │ 1,650 RPS │ 2,000 │ 17% │ 🟠 Plan scaling │
│ March │ 1,815 RPS │ 2,000 │ 9% │ 🔴 SCALE NOW │
│ April │ 2,000 RPS │ 2,500 │ 20% │ After scaling │
│ May │ 2,200 RPS │ 2,500 │ 12% │ 🟠 Plan again │
│ ... │
│ November │ 4,400 RPS │ ??? │ ??? │ Black Friday! │
│ (Black Friday) × 5 = 22,000 RPS! │
│ │
└────────────────────────────────────────────────────────────────────────┘
ACTION TIMELINE:
├── January: Establish baseline, monitor growth
├── February: Begin scaling project (lead time: 4-6 weeks)
├── March: Complete scaling to 2,500 RPS
├── Q2: Plan for 4,000 RPS capacity (Q3 target)
├── Q3: Begin Black Friday planning
├── October: Pre-scale for Black Friday
└── November: Execute Black Friday capacity plan
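The arithmetic behind a timeline like this is simple compound growth. A small sketch that answers "how many months until my projected peak eats through the headroom buffer", using the numbers above:
import math

def months_until_headroom_breached(peak_rps, capacity_rps, monthly_growth, min_headroom=0.30):
    """Months until the projected peak exceeds capacity minus the headroom buffer."""
    trigger_rps = capacity_rps * (1 - min_headroom)   # e.g. 70% of capacity for 30% headroom
    if peak_rps >= trigger_rps:
        return 0.0
    return math.log(trigger_rps / peak_rps) / math.log(1 + monthly_growth)

# 1,500 RPS peak today, 2,000 RPS capacity, 10% month-over-month growth
print(months_until_headroom_breached(1500, 2000, 0.10, min_headroom=0.10))  # ~1.9 months to <10% headroom (March)
print(months_until_headroom_breached(1500, 2000, 0.10))                     # 0.0 -> already below 30% headroom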
Chapter 4: Identifying Bottlenecks
4.1 Bottleneck Analysis
FINDING THE BOTTLENECK
The bottleneck is the component that limits system throughput.
It's the FIRST thing to saturate under load.
Common bottlenecks:
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. DATABASE │
│ ─────────── │
│ Symptoms: │
│ ├── Slow query times │
│ ├── Connection pool exhaustion │
│ ├── Lock contention │
│ └── High CPU on database server │
│ │
│ Fixes: │
│ ├── Add read replicas │
│ ├── Add caching layer │
│ ├── Optimize queries │
│ ├── Connection pooling (PgBouncer) │
│ └── Shard the database │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. APPLICATION CPU │
│ ────────────────── │
│ Symptoms: │
│ ├── High CPU utilization across pods │
│ ├── Latency increases with load │
│ └── Adding pods helps │
│ │
│ Fixes: │
│ ├── Scale horizontally (add pods) │
│ ├── Optimize hot code paths │
│ ├── Use faster serialization │
│ └── Enable async processing │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. MEMORY │
│ ───────── │
│ Symptoms: │
│ ├── OOM kills │
│ ├── Excessive GC pauses │
│ └── Swap usage │
│ │
│ Fixes: │
│ ├── Increase memory limits │
│ ├── Fix memory leaks │
│ ├── Tune GC settings │
│ └── Stream instead of buffering │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. EXTERNAL DEPENDENCIES │
│ ──────────────────────── │
│ Symptoms: │
│ ├── Timeouts to specific service │
│ ├── Queue buildup for async calls │
│ └── Rate limiting from external APIs │
│ │
│ Fixes: │
│ ├── Add caching │
│ ├── Implement circuit breakers │
│ ├── Negotiate higher rate limits │
│ └── Add fallback providers │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 5. NETWORK │
│ ───────── │
│ Symptoms: │
│ ├── High bandwidth utilization │
│ ├── Connection timeouts │
│ └── Packet loss │
│ │
│ Fixes: │
│ ├── Compress responses │
│ ├── Use CDN for static content │
│ ├── Upgrade network capacity │
│ └── Optimize payload sizes │
│ │
└────────────────────────────────────────────────────────────────────────┘
4.2 Little's Law
LITTLE'S LAW: THE FUNDAMENTAL EQUATION
L = λ × W
Where:
├── L = Average number of items in system (queue + being processed)
├── λ = Arrival rate (requests per second)
└── W = Average time in system (seconds)
EXAMPLE:
Your API:
├── Receives 100 requests/second (λ = 100)
├── Average response time is 200ms (W = 0.2s)
├── Therefore: L = 100 × 0.2 = 20 concurrent requests
If you have 10 pods with 2 workers each = 20 workers total
You're AT CAPACITY!
What happens if traffic increases to 150 RPS?
├── L = 150 × 0.2 = 30 concurrent requests needed
├── You only have 20 workers
├── Requests queue up
├── Queue adds to W (wait time)
├── W increases → L increases → more queuing
├── SPIRAL INTO FAILURE
USING LITTLE'S LAW FOR CAPACITY:
To handle 500 RPS with 200ms response time:
├── L = 500 × 0.2 = 100 concurrent requests
├── Need: 100 workers (or threads/connections)
├── With 4 workers per pod: 100/4 = 25 pods
├── Add 30% headroom: ~33 pods
This is your capacity plan!
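The same arithmetic as a small helper, so the "add 30% headroom" step isn't done by hand. A sketch; workers per pod is whatever your runtime actually provides.
import math

def pods_needed(rps, avg_latency_s, workers_per_pod, headroom=0.30):
    """Little's Law sizing: L = lambda x W concurrent requests, plus a headroom buffer."""
    concurrent = rps * avg_latency_s                    # L = λ × W
    base_pods = math.ceil(concurrent / workers_per_pod)
    return math.ceil(base_pods * (1 + headroom))

print(pods_needed(500, 0.2, workers_per_pod=4))         # 33 pods, matching the plan above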
4.3 Queueing Theory Basics
QUEUEING THEORY FOR ENGINEERS
When utilization approaches 100%, latency goes to infinity.
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ Wait Time │
│ │ │
│ │ ▲ │
│ │ ╱│ │
│ │ ╱ │ │
│ │ ╱╱ │ │
│ │ ╱╱╱ │ │
│ │ ╱╱╱╱ │ │
│ │ ╱╱╱╱╱ │ │
│ │ ╱╱╱╱╱╱ │ Danger │
│ │ ╱╱╱╱╱╱╱ │ Zone │
│ │ ╱╱╱╱╱╱╱╱ │ │
│ │ ╱╱╱╱╱╱╱╱ │ │
│ ├──╱╱╱╱╱───────────────────────────────────────────┬┴────────► │
│ 0% 20% 40% 60% 80% 100% │
│ Utilization │
│ │
│ Key insight: │
│    ├── At 50% utilization: ~2x the unloaded response time, still manageable        │
│    ├── At 80% utilization: ~5x the unloaded response time                          │
│    ├── At 90% utilization: ~10x the unloaded response time                         │
│ └── At 100%: Infinite queue, system fails │
│ │
│ RULE: Never operate above 70-80% utilization on critical resources │
│ │
└────────────────────────────────────────────────────────────────────────┘
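The shape of that curve falls out of even the simplest queueing model. Under an M/M/1 queue, time in system relative to an unloaded server is 1/(1 - utilization); real systems differ, but the hockey stick is the same. A quick sketch:
def mm1_slowdown(utilization):
    """Relative response time vs. an idle M/M/1 server: W/W0 = 1 / (1 - rho)."""
    return 1.0 / (1.0 - utilization)

for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {mm1_slowdown(rho):.0f}x the unloaded response time")
# 50% -> 2x, 80% -> 5x, 90% -> 10x, 99% -> 100x: latency explodes as you approach saturation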
Part II: Implementation
Chapter 5: Load Testing
5.1 Load Testing Strategy
LOAD TESTING TYPES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. SMOKE TEST │
│ ───────────── │
│ Minimal load to verify system works. │
│ ├── Load: Very low (1-10 users) │
│ ├── Duration: Minutes │
│ └── Goal: Verify test setup, basic functionality │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. LOAD TEST │
│ ──────────── │
│ Expected load to verify normal operation. │
│ ├── Load: Normal production traffic │
│ ├── Duration: 30-60 minutes │
│ └── Goal: Verify system handles normal load │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. STRESS TEST │
│ ───────────── │
│ Beyond normal load to find breaking point. │
│ ├── Load: Gradually increase until failure │
│ ├── Duration: Until break or max reached │
│ └── Goal: Find system limits │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. SPIKE TEST │
│ ──────────── │
│ Sudden traffic spike to test elasticity. │
│ ├── Load: Sudden jump (e.g., 2x → 10x → 2x) │
│ ├── Duration: Short spikes (minutes) │
│ └── Goal: Verify auto-scaling, graceful degradation │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 5. SOAK TEST │
│ ─────────── │
│ Extended duration to find memory leaks, drift. │
│ ├── Load: Normal to moderate │
│ ├── Duration: Hours to days │
│ └── Goal: Find issues that emerge over time │
│ │
└────────────────────────────────────────────────────────────────────────┘
5.2 Load Test Implementation
# capacity/load_testing.py
"""
Load testing orchestration and analysis.
Uses k6, Locust, or similar tools under the hood.
"""
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio
import subprocess
import json
class LoadPattern(Enum):
CONSTANT = "constant"
RAMP_UP = "ramp_up"
SPIKE = "spike"
STRESS = "stress"
@dataclass
class LoadTestConfig:
"""Configuration for a load test."""
name: str
target_url: str
# Load profile
pattern: LoadPattern
virtual_users: int
ramp_up_time_seconds: int = 60
duration_seconds: int = 300
# For spike tests
spike_users: int = 0
spike_duration_seconds: int = 60
# Thresholds
max_latency_p99_ms: int = 500
max_error_rate: float = 0.01
# Scenarios
scenarios: List[dict] = field(default_factory=list)
@dataclass
class LoadTestResult:
"""Results from a load test."""
config: LoadTestConfig
started_at: datetime
completed_at: datetime
# Summary metrics
total_requests: int
successful_requests: int
failed_requests: int
# Latency
latency_p50_ms: float
latency_p95_ms: float
latency_p99_ms: float
latency_max_ms: float
# Throughput
requests_per_second: float
# Thresholds
passed: bool
threshold_violations: List[str]
# Raw data for analysis
timeseries_data: List[dict] = field(default_factory=list)
class LoadTestRunner:
"""
Runs and analyzes load tests.
"""
def __init__(self, metrics_client, alerter):
self.metrics = metrics_client
self.alerter = alerter
async def run_test(
self,
config: LoadTestConfig,
notify_on_failure: bool = True
) -> LoadTestResult:
"""
Run a load test and analyze results.
"""
started_at = datetime.utcnow()
# Generate k6 script
script = self._generate_k6_script(config)
# Run k6
result_json = await self._execute_k6(script, config)
# Parse results
result = self._parse_results(config, result_json, started_at)
# Check thresholds
result.passed, result.threshold_violations = self._check_thresholds(
config, result
)
# Notify if failed
if not result.passed and notify_on_failure:
await self.alerter.send_warning(
f"Load test '{config.name}' failed: {result.threshold_violations}"
)
# Store results
await self._store_results(result)
return result
def _generate_k6_script(self, config: LoadTestConfig) -> str:
"""Generate k6 test script."""
if config.pattern == LoadPattern.CONSTANT:
stages = f"""
stages: [
{{ duration: '{config.ramp_up_time_seconds}s', target: {config.virtual_users} }},
{{ duration: '{config.duration_seconds}s', target: {config.virtual_users} }},
{{ duration: '30s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.RAMP_UP:
stages = f"""
stages: [
{{ duration: '{config.duration_seconds}s', target: {config.virtual_users} }},
{{ duration: '60s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.SPIKE:
stages = f"""
stages: [
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '{config.spike_duration_seconds}s', target: {config.spike_users} }},
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '30s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.STRESS:
# Gradually increase until we find the breaking point
stages = f"""
stages: [
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '120s', target: {config.virtual_users * 2} }},
{{ duration: '120s', target: {config.virtual_users * 4} }},
{{ duration: '120s', target: {config.virtual_users * 8} }},
{{ duration: '60s', target: 0 }},
]
"""
script = f"""
import http from 'k6/http';
import {{ check, sleep }} from 'k6';
import {{ Rate, Trend }} from 'k6/metrics';
const errorRate = new Rate('errors');
const latency = new Trend('latency');
export let options = {{
{stages},
thresholds: {{
'http_req_duration': ['p(99)<{config.max_latency_p99_ms}'],
'errors': ['rate<{config.max_error_rate}'],
}},
}};
export default function() {{
const res = http.get('{config.target_url}');
const success = check(res, {{
'status is 200': (r) => r.status === 200,
'latency < 500ms': (r) => r.timings.duration < 500,
}});
errorRate.add(!success);
latency.add(res.timings.duration);
sleep(1);
}}
"""
return script
async def _execute_k6(self, script: str, config: LoadTestConfig) -> dict:
"""Execute k6 and return JSON results."""
# Write script to temp file
script_path = f'/tmp/k6_script_{config.name}.js'
with open(script_path, 'w') as f:
f.write(script)
# Run k6 with JSON output
result = subprocess.run(
['k6', 'run', '--out', 'json=/tmp/k6_output.json', script_path],
capture_output=True,
text=True
)
# Parse JSON output
with open('/tmp/k6_output.json', 'r') as f:
return json.load(f)
def _check_thresholds(
self,
config: LoadTestConfig,
result: LoadTestResult
) -> tuple[bool, List[str]]:
"""Check if results meet thresholds."""
violations = []
if result.latency_p99_ms > config.max_latency_p99_ms:
violations.append(
f"P99 latency {result.latency_p99_ms}ms > threshold {config.max_latency_p99_ms}ms"
)
        error_rate = result.failed_requests / max(result.total_requests, 1)  # avoid division by zero on an empty run
if error_rate > config.max_error_rate:
violations.append(
f"Error rate {error_rate:.2%} > threshold {config.max_error_rate:.2%}"
)
return len(violations) == 0, violations
# =============================================================================
# EXAMPLE LOAD TEST SCENARIOS
# =============================================================================
API_LOAD_TEST = LoadTestConfig(
name="api_baseline",
target_url="https://api.example.com/health",
pattern=LoadPattern.CONSTANT,
virtual_users=100,
duration_seconds=300,
max_latency_p99_ms=200,
max_error_rate=0.01
)
BLACK_FRIDAY_TEST = LoadTestConfig(
name="black_friday_simulation",
target_url="https://api.example.com/checkout",
pattern=LoadPattern.SPIKE,
virtual_users=500,
spike_users=5000,
spike_duration_seconds=120,
duration_seconds=600,
max_latency_p99_ms=1000,
max_error_rate=0.05
)
STRESS_TEST = LoadTestConfig(
name="find_breaking_point",
target_url="https://api.example.com/api",
pattern=LoadPattern.STRESS,
virtual_users=100,
duration_seconds=600,
max_latency_p99_ms=5000,
max_error_rate=0.10
)
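A usage sketch for the runner above. metrics_client and alerter are placeholders for whatever clients you already pass around; the result fields come from the LoadTestResult dataclass defined earlier.
# Hypothetical wiring: metrics_client and alerter are stand-ins for your own clients.
import asyncio

async def main():
    runner = LoadTestRunner(metrics_client, alerter)
    result = await runner.run_test(API_LOAD_TEST)
    print(f"{result.config.name}: p99={result.latency_p99_ms:.0f}ms, "
          f"rps={result.requests_per_second:.0f}, passed={result.passed}")

asyncio.run(main())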
Chapter 6: Scaling Strategies
6.1 Vertical vs Horizontal Scaling
SCALING STRATEGIES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ VERTICAL SCALING (Scale Up) │
│ ─────────────────────────── │
│ │
│ Before: After: │
│ ┌───────┐ ┌───────────┐ │
│ │ 4 CPU │ ──► │ 16 CPU │ │
│ │ 8 GB │ │ 64 GB │ │
│ └───────┘ └───────────┘ │
│ │
│ Pros: │
│ ├── Simple (no code changes) │
│ ├── No distributed complexity │
│ └── Good for databases │
│ │
│ Cons: │
│ ├── Limited by hardware │
│ ├── Single point of failure │
│ ├── Expensive at high end │
│ └── Usually requires downtime │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ HORIZONTAL SCALING (Scale Out) │
│ ────────────────────────────── │
│ │
│ Before: After: │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Pod │ ──► │ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod 4 │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ Pros: │
│ ├── Nearly unlimited scaling │
│ ├── No single point of failure │
│ ├── Can scale incrementally │
│ └── Often cheaper at scale │
│ │
│ Cons: │
│ ├── Distributed complexity │
│ ├── Stateless requirement │
│ ├── Load balancing needed │
│ └── Data consistency challenges │
│ │
└────────────────────────────────────────────────────────────────────────┘
WHEN TO USE WHICH:
Vertical (Scale Up):
├── Databases (primary)
├── Single-threaded workloads
├── When simplicity matters most
└── As a quick fix while planning horizontal
Horizontal (Scale Out):
├── Stateless web services
├── Read replicas
├── Worker pools
└── For long-term scalability
6.2 Auto-Scaling Configuration
# capacity/autoscaling.py
"""
Auto-scaling configuration and policies.
Defines when and how to scale automatically.
"""
from dataclasses import dataclass
from typing import List, Optional
from enum import Enum
class ScalingMetric(Enum):
CPU = "cpu"
MEMORY = "memory"
RPS = "requests_per_second"
QUEUE_DEPTH = "queue_depth"
LATENCY = "latency_p99"
CUSTOM = "custom"
@dataclass
class ScalingPolicy:
"""Defines when to scale."""
name: str
metric: ScalingMetric
# Thresholds
scale_up_threshold: float
scale_down_threshold: float
# Behavior
scale_up_cooldown_seconds: int = 300
scale_down_cooldown_seconds: int = 600
# Limits
min_replicas: int = 2
max_replicas: int = 100
# Aggressiveness
scale_up_by: int = 2 # Add 2 pods at a time
scale_down_by: int = 1 # Remove 1 pod at a time
@dataclass
class AutoScalingConfig:
"""Complete auto-scaling configuration."""
deployment_name: str
namespace: str
# Policies (can have multiple)
policies: List[ScalingPolicy]
# Predictive scaling (for known events)
scheduled_scaling: List[dict] = None
# Protection
disable_scale_down_during_deployment: bool = True
# =============================================================================
# EXAMPLE CONFIGURATIONS
# =============================================================================
# Web API: Scale on CPU and latency
WEB_API_SCALING = AutoScalingConfig(
deployment_name="api-server",
namespace="production",
policies=[
# Primary: CPU-based scaling
ScalingPolicy(
name="cpu-scaling",
metric=ScalingMetric.CPU,
scale_up_threshold=70, # Scale up when CPU > 70%
scale_down_threshold=30, # Scale down when CPU < 30%
min_replicas=3,
max_replicas=50,
scale_up_by=3,
scale_down_by=1,
scale_down_cooldown_seconds=900 # 15 min cooldown
),
# Secondary: Latency-based scaling
ScalingPolicy(
name="latency-scaling",
metric=ScalingMetric.LATENCY,
scale_up_threshold=200, # Scale up when p99 > 200ms
scale_down_threshold=50, # Scale down when p99 < 50ms
min_replicas=3,
max_replicas=50,
scale_up_by=5, # More aggressive for latency
scale_up_cooldown_seconds=60 # React faster to latency
),
],
scheduled_scaling=[
# Pre-scale for known traffic patterns
{
"name": "morning_ramp",
"schedule": "0 8 * * 1-5", # 8 AM weekdays
"min_replicas": 10
},
{
"name": "evening_scale_down",
"schedule": "0 22 * * *", # 10 PM daily
"min_replicas": 3
},
{
"name": "black_friday",
"schedule": "0 0 * 11 5", # Thanksgiving week
"min_replicas": 50,
"max_replicas": 200
},
]
)
# Worker pool: Scale on queue depth
WORKER_SCALING = AutoScalingConfig(
deployment_name="worker-pool",
namespace="production",
policies=[
ScalingPolicy(
name="queue-scaling",
metric=ScalingMetric.QUEUE_DEPTH,
scale_up_threshold=1000, # Scale up when queue > 1000
scale_down_threshold=100, # Scale down when queue < 100
min_replicas=2,
max_replicas=20,
scale_up_by=2,
scale_up_cooldown_seconds=60 # React quickly to queue buildup
),
]
)
def generate_kubernetes_hpa(config: AutoScalingConfig) -> dict:
"""Generate Kubernetes HPA manifest from config."""
hpa = {
"apiVersion": "autoscaling/v2",
"kind": "HorizontalPodAutoscaler",
"metadata": {
"name": f"{config.deployment_name}-hpa",
"namespace": config.namespace
},
"spec": {
"scaleTargetRef": {
"apiVersion": "apps/v1",
"kind": "Deployment",
"name": config.deployment_name
},
"minReplicas": min(p.min_replicas for p in config.policies),
"maxReplicas": max(p.max_replicas for p in config.policies),
"metrics": [],
"behavior": {
"scaleUp": {
"stabilizationWindowSeconds": 60,
"policies": [
{
"type": "Pods",
"value": max(p.scale_up_by for p in config.policies),
"periodSeconds": 60
}
]
},
"scaleDown": {
"stabilizationWindowSeconds": 300,
"policies": [
{
"type": "Pods",
"value": 1,
"periodSeconds": 120
}
]
}
}
}
}
# Add metrics
for policy in config.policies:
if policy.metric == ScalingMetric.CPU:
hpa["spec"]["metrics"].append({
"type": "Resource",
"resource": {
"name": "cpu",
"target": {
"type": "Utilization",
"averageUtilization": int(policy.scale_up_threshold)
}
}
})
elif policy.metric == ScalingMetric.MEMORY:
hpa["spec"]["metrics"].append({
"type": "Resource",
"resource": {
"name": "memory",
"target": {
"type": "Utilization",
"averageUtilization": int(policy.scale_up_threshold)
}
}
})
return hpa
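To render the manifest, something like the following works, assuming PyYAML is installed. Note that scheduled_scaling is not expressible in a plain HPA; it needs CronJobs, KEDA, or a custom controller layered on top.
# Render and inspect the HPA manifest for the web API config above.
import yaml  # assumes PyYAML is available

if __name__ == "__main__":
    manifest = generate_kubernetes_hpa(WEB_API_SCALING)
    print(yaml.safe_dump(manifest, sort_keys=False))
    # kubectl apply -f <rendered-file> to install; scheduled_scaling entries
    # still need a separate mechanism (CronJob, KEDA, etc.).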
Part III: Real-World Application
Chapter 7: Case Studies
7.1 How Netflix Handles Scale
NETFLIX CAPACITY PLANNING
Scale:
├── 200M+ subscribers
├── 15% of global internet traffic
├── Hundreds of microservices
├── Runs primarily on AWS
Key Practices:
1. CHAOS ENGINEERING
├── Chaos Monkey: Randomly kills instances
├── Chaos Kong: Simulates region failure
├── Gremlin: Controlled fault injection
└── Result: Confidence in resilience
2. PREDICTIVE SCALING
├── Know when shows release
├── Pre-scale for "Squid Game" launch
├── Analyze viewing patterns
└── Scale regions before prime time
3. CAPACITY TESTING
├── Regular load tests in production
├── "Red/Black" pushes to test capacity
├── Test at 2x expected peak
└── Find bottlenecks before customers do
4. AUTO-SCALING
├── Titus: Their container platform
├── Aggressive auto-scaling policies
├── Scale based on request latency
└── Can scale from 0 to thousands of instances
5. REGIONAL DISTRIBUTION
├── Content in multiple AWS regions
├── CDN (Open Connect) for streaming
├── Failover between regions
└── No single region is critical
Lesson: At Netflix scale, capacity planning is a full-time job.
They have dedicated teams for capacity analysis.
7.2 How Stripe Handles Scale
STRIPE CAPACITY PLANNING
Context:
├── Processes billions in payments
├── Every request is high-value
├── Zero tolerance for downtime
├── Must scale for Black Friday, Cyber Monday
Key Practices:
1. METERED OVERPROVISIONING
├── Always maintain 50%+ headroom
├── Cost of overprovisioning < cost of downtime
├── For payments, reliability > efficiency
└── "We'd rather waste capacity than drop transactions"
2. LOAD SHEDDING
├── Prioritize payment completion over analytics
├── Graceful degradation under load
├── Non-critical features turn off automatically
└── Payment path is always protected
3. REGULAR GAME DAYS
├── Simulate Black Friday in staging
├── Test at 3x expected peak
├── Practice runbooks
└── Find weaknesses before they matter
4. MULTI-REGION
├── Active-active in multiple regions
├── Can shift traffic between regions
├── Used for capacity expansion
└── Also provides disaster recovery
5. CAPACITY COMMITTEE
├── Weekly capacity review
├── Cross-functional (eng + ops + finance)
├── Review utilization, forecasts, plans
└── Make scaling decisions together
Lesson: For critical systems, overprovision.
The cost of downtime far exceeds the cost of extra capacity.
Chapter 8: Common Mistakes
CAPACITY PLANNING ANTI-PATTERNS
❌ MISTAKE 1: Waiting for Problems
Wrong:
"We'll scale when we see issues"
Problem:
├── Scaling takes time (minutes to hours)
├── By the time you see issues, users are affected
└── Reactive = outages
Right:
├── Know your limits from load testing
├── Track headroom continuously
├── Scale BEFORE you need to
❌ MISTAKE 2: Only Measuring Averages
Wrong:
"Average CPU is only 30%, we're fine!"
Problem:
├── Peak usage is what matters
├── Spikes cause outages, not averages
└── P99 utilization might be 95%
Right:
├── Track P99 utilization
├── Watch peak-to-average ratio
└── Plan for peaks, not averages
❌ MISTAKE 3: Ignoring Dependencies
Wrong:
"We can handle 10K RPS"
Problem:
├── But can your database?
├── Can your cache?
├── Can your third-party APIs?
└── You're only as fast as your slowest dependency
Right:
├── Test the entire stack
├── Know limits of all dependencies
└── Plan capacity for the bottleneck
❌ MISTAKE 4: Not Testing Production Config
Wrong:
"Load tests pass in staging!"
Problem:
├── Staging != production
├── Different data sizes
├── Different network topology
└── Different concurrency patterns
Right:
├── Load test in production (carefully)
├── Or use production-like staging
└── Mirror production traffic for testing
❌ MISTAKE 5: Ignoring Auto-Scaling Lag
Wrong:
"Auto-scaling will handle it"
Problem:
├── Auto-scaling takes 3-5 minutes
├── Spike hits before scale-up completes
└── Brief outage during scaling
Right:
├── Pre-scale for known events
├── Maintain baseline capacity for spikes
└── Auto-scaling for sustained increases
Part IV: Interview Preparation
Chapter 9: Interview Tips
9.1 Capacity Discussion Framework
DISCUSSING CAPACITY IN INTERVIEWS
When asked "How would you ensure this system handles scale?":
1. ESTIMATE THE LOAD
"First, let me understand the scale:
- How many users?
- What's the request pattern?
- What's the peak vs average ratio?
Let me calculate the expected RPS..."
2. IDENTIFY BOTTLENECKS
"The likely bottlenecks are:
- Database: Will it handle this QPS?
- Memory: How much state per user?
- Network: What's the payload size?
Let me work through each..."
3. DESIGN FOR SCALE
"To handle this scale, I'd:
- Add caching to reduce DB load
- Shard the database by user_id
- Use connection pooling
- Deploy across multiple regions"
4. PLAN FOR GROWTH
"To stay ahead of growth:
- Maintain 30% headroom on all resources
- Auto-scale based on CPU and latency
- Load test regularly
- Pre-scale for known events"
5. HANDLE SPIKES
"For unexpected spikes:
- Rate limiting protects the system
- Load shedding drops non-critical work
- Circuit breakers prevent cascade
- Graceful degradation keeps core working"
9.2 Key Phrases
CAPACITY KEY PHRASES
On Measurement:
"I'd establish a baseline by load testing to find
where latency starts degrading. That's my sustainable
capacity. Then I'd ensure we maintain at least 30%
headroom below that limit."
On Bottlenecks:
"The bottleneck is usually the database. At 1000 RPS
with 10ms average query time, we need 10 concurrent
connections minimum. I'd use connection pooling
and add read replicas if needed."
On Little's Law:
"Using Little's Law: if we have 500 RPS and 100ms
response time, we need 50 concurrent handlers.
With 4 workers per pod, that's about 13 pods.
Add 30% headroom: 17 pods minimum."
On Auto-Scaling:
"Auto-scaling reacts to load, but it's not instant.
For predictable events like Black Friday, I'd
pre-scale the night before. Auto-scaling handles
the unexpected variations."
On Load Testing:
"I'd run regular load tests at 2x expected peak.
This finds bottlenecks before customers do.
The goal isn't just to pass, but to understand
where the system breaks."
Chapter 10: Practice Problems
Problem 1: E-Commerce Peak Planning
Scenario: Your e-commerce site normally handles 500 RPS. Black Friday is coming with expected 10x traffic.
Questions:
- How would you prepare for 5,000 RPS?
- What needs to scale and by how much?
- What's your testing strategy?
Answers:
Preparation (6-8 weeks before):
- Load test current system to find breaking point
- Identify bottlenecks (probably database)
- Plan and implement scaling
- Schedule pre-scaling for the night before
What to scale:
- Web tier: 10x pods (with auto-scaling headroom)
- Database: Add read replicas, warm connection pools
- Cache: 10x cache size to handle increased working set
- CDN: Pre-warm with popular content
Testing:
- Load test at 10x in staging
- Shadow traffic test in production
- Have runbooks for manual scaling
- War room staffed during peak
Problem 2: Database Scaling
Scenario: Your database handles 2,000 QPS. You're at 1,500 QPS and growing 10% monthly.
Questions:
- When will you hit capacity?
- What are your options?
- How would you buy time?
Answers:
Time to capacity:
- Month 1: 1,500 QPS (75% utilized)
- Month 2: 1,650 QPS (82.5%)
- Month 3: 1,815 QPS (91%)
- Month 4: 2,000 QPS (100%) - CRITICAL
- Action needed by month 2
Options:
- Vertical: Upgrade to larger instance (quick fix)
- Read replicas: Offload read traffic (medium effort)
- Caching: Reduce queries (high impact)
- Sharding: Long-term solution (high effort)
Buying time:
- Add aggressive caching (1-2 weeks)
- Optimize slow queries (ongoing)
- Add read replica (2-4 weeks)
- Plan sharding for 6-month horizon
Chapter 11: Sample Interview Dialogue
Interviewer: "You're designing a food delivery app. How do you plan for capacity during dinner rush?"
You: "Great question. Let me think through the load characteristics first.
Understanding the pattern:"
Traffic Pattern:
├── Normal: 1,000 orders/minute
├── Dinner rush (6-8 PM): 5x = 5,000 orders/minute
├── Friday dinner: 7x = 7,000 orders/minute
├── Super Bowl Sunday: 20x = 20,000 orders/minute
Each order involves:
├── 5-10 API calls at checkout (cart, payment, confirmation)
├── 1 database write (order creation)
├── 2-3 external calls (payment, restaurant, driver)
└── Plus browsing before checkout: roughly 50-100 API calls per order overall
Peak API load: 7,000 × 100 = 700,000 API calls/minute = ~12,000 RPS
Interviewer: "How would you handle that?"
You: "I'd design with the peak in mind and auto-scale for variations.
For the API tier:"
├── Using Little's Law with ~100ms average response time: 12,000 RPS × 0.1s = 1,200 concurrent requests
├── With 20 workers per pod: 1,200/20 = 60 pods minimum
├── Add 50% headroom: 90 pods
├── Auto-scale between 20 pods (baseline) and 150 pods (max)
"For the database:"
├── 7,000 writes/minute = 117 writes/second
├── Reads are probably 10x writes = 1,170 reads/second
├── Use read replicas for menu/restaurant queries
├── Shard orders table by region for write scaling
├── Cache frequently accessed data (menus, restaurant info)
Interviewer: "What about Super Bowl Sunday at 20x?"
You: "For exceptional events, I wouldn't rely on auto-scaling alone.
Pre-event preparation:"
1. Week before: Load test at 25x (leave margin)
2. Night before: Pre-scale to 50% of expected peak
3. During event: Auto-scale handles variations
4. Have war room staffed
5. Pre-warm caches with popular restaurants
If we still hit limits:
├── Graceful degradation: Disable non-essential features
├── Rate limiting: Queue orders instead of rejecting
├── Geographic limiting: Prioritize areas we can serve
Interviewer: "Good systematic approach. How do you know if your planning worked?"
You: "I'd set up clear success metrics:
During the event:
- Order success rate > 99%
- Checkout latency < 3 seconds
- No capacity alerts firing
- Auto-scaling responded within 3 minutes
After the event:
- Review peak utilization (should be < 70%)
- Identify any bottlenecks that appeared
- Cost analysis: did we over/under provision?
- Update forecasting model with actual data
This becomes input for next year's planning."
Summary
┌────────────────────────────────────────────────────────────────────────┐
│ DAY 4 KEY TAKEAWAYS │
│ │
│ CAPACITY PLANNING CYCLE: │
│ ├── Measure: Know your current capacity and utilization │
│ ├── Forecast: Predict future demand │
│ ├── Identify: Find bottlenecks before they hit │
│ ├── Plan: Decide how and when to scale │
│ └── Validate: Test your assumptions │
│ │
│ KEY METRICS: │
│ ├── Headroom = (Max Capacity - Current) / Max Capacity │
│ ├── Maintain at least 30% headroom │
│ ├── Track P99 utilization, not just averages │
│ └── Little's Law: L = λ × W │
│ │
│ LOAD TESTING: │
│ ├── Smoke: Verify basic functionality │
│ ├── Load: Test normal production traffic │
│ ├── Stress: Find the breaking point │
│ ├── Spike: Test sudden traffic jumps │
│ └── Soak: Find issues that emerge over time │
│ │
│ SCALING STRATEGIES: │
│ ├── Vertical: Bigger machines (simple, limited) │
│ ├── Horizontal: More machines (complex, unlimited) │
│ ├── Auto-scaling: React to load (not instant!) │
│ └── Pre-scaling: Prepare for known events │
│ │
│ COMMON BOTTLENECKS: │
│ ├── Database (connections, queries) │
│ ├── Application CPU │
│ ├── Memory (especially caches) │
│ ├── External dependencies │
│ └── Network bandwidth │
│ │
│ KEY INSIGHT: │
│ The best time to scale is BEFORE you need to. │
│ Know your limits. Watch your headroom. Plan ahead. │
│ │
└────────────────────────────────────────────────────────────────────────┘
Further Reading
Books:
- "The Art of Capacity Planning" by John Allspaw
- "Release It!" by Michael Nygard - Capacity patterns
Tools:
- k6: Modern load testing
- Locust: Python-based load testing
- Grafana + Prometheus: Capacity monitoring
- AWS Auto Scaling / Kubernetes HPA
Articles:
- Netflix: "Lessons from Building Observability Tools at Netflix"
- Google SRE Book: Chapter on Capacity Planning
- Stripe: "Scaling Your API with Rate Limiters"
End of Day 4: Capacity Planning
Tomorrow: Day 5 — Incident Management. Despite all your planning, things will still break. When they do, how do you respond? How do you learn? Tomorrow, we cover the human side of operations.
You now have the full operational toolkit: Define health. See health. Maintain health through change. Ensure health under load. Tomorrow: what to do when health fails.