Week 10 — Day 4: Capacity Planning
System Design Mastery Series — Production Readiness and Operational Excellence
Preface
You can define health, see health, and maintain health through change.
But what happens when Black Friday hits and traffic goes 10x?
THE CAPACITY NIGHTMARE
Friday, 11:47 PM. Your phone explodes.
ALERT: API latency p99 = 15 seconds (SLO: 200ms)
ALERT: Database connections exhausted
ALERT: Error rate 45% (SLO: 0.1%)
ALERT: Memory usage 98%
ALERT: Customer complaints spiking
You open your laptop. The dashboard is all red.
What happened?
├── Marketing launched a viral campaign
├── Traffic went 5x normal
├── Nobody told engineering
├── System wasn't ready
└── You're now firefighting at midnight
This didn't have to happen.
WITH CAPACITY PLANNING:
├── You know your system's limits
├── You forecast future demand
├── You scale BEFORE you need to
├── Marketing campaign? Pre-scaled 24 hours ago
└── You're sleeping peacefully at midnight
This is capacity planning.
Today, we learn to answer: "How much traffic can my system handle, and when will I need more?"
Part I: Foundations
Chapter 1: What Is Capacity Planning?
1.1 The Capacity Planning Process
CAPACITY PLANNING CYCLE
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ CAPACITY PLANNING │
│ │
│ ┌──────────────┐ │
│ │ MEASURE │ ◄─── What can we handle NOW? │
│ │ Current │ Current load, headroom, limits │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ FORECAST │ ◄─── What will we NEED? │
│ │ Demand │ Growth trends, events, seasonality │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ IDENTIFY │ ◄─── What breaks FIRST? │
│ │ Bottlenecks│ Database, memory, CPU, network │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ PLAN │ ◄─── What do we DO about it? │
│ │ Scaling │ Scale up, scale out, optimize │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ VALIDATE │ ◄─── Does it WORK? │
│ │ (Test) │ Load testing, chaos engineering │
│ └──────┬───────┘ │
│ │ │
│ └──────────────► Back to MEASURE │
│ │
└────────────────────────────────────────────────────────────────────────┘
1.2 Key Metrics for Capacity
CAPACITY METRICS BY LAYER
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ APPLICATION LAYER │
│ ───────────────── │
│ ├── Requests per second (RPS) │
│ ├── Active connections │
│ ├── Request queue depth │
│ ├── Response time (p50, p99) │
│ └── Error rate │
│ │
│ Question: How many requests can we handle before latency degrades? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ COMPUTE LAYER │
│ ───────────── │
│ ├── CPU utilization (per core) │
│ ├── Memory usage and available │
│ ├── Thread/goroutine count │
│ ├── Garbage collection frequency │
│ └── Context switches │
│ │
│ Question: Are we CPU-bound or memory-bound? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ DATABASE LAYER │
│ ────────────── │
│ ├── Queries per second │
│ ├── Connection pool usage │
│ ├── Query latency (p50, p99) │
│ ├── Replication lag │
│ ├── Disk I/O (reads/writes) │
│ └── Buffer cache hit rate │
│ │
│ Question: Is the database the bottleneck? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ CACHE LAYER │
│ ─────────── │
│ ├── Hit rate │
│ ├── Miss rate │
│ ├── Memory usage │
│ ├── Eviction rate │
│ └── Operations per second │
│ │
│ Question: Is caching effective? Are we evicting too much? │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ NETWORK LAYER │
│ ───────────── │
│ ├── Bandwidth utilization │
│ ├── Packet loss │
│ ├── Connection errors │
│ └── DNS lookup time │
│ │
│ Question: Is the network a bottleneck? │
│ │
└────────────────────────────────────────────────────────────────────────┘
1.3 The Concept of Headroom
HEADROOM: THE SAFETY BUFFER
Headroom = (Maximum Capacity - Current Usage) / Maximum Capacity × 100%
Example:
├── Database can handle: 3,000 QPS
├── Current usage: 2,000 QPS
├── Headroom: (3000 - 2000) / 3000 = 33%
Why headroom matters:
├── Traffic is bursty (spikes above average)
├── Dependent services might fail (retry storms)
├── GC pauses, network blips happen
├── Need room to absorb unexpected load
└── Operating at 100% = already failing
RECOMMENDED HEADROOM:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Headroom % │ Status │ Action │
│ ────────────┼─────────────┼────────────────────────────────────── │
│ > 50% │ Healthy │ Monitor, plan for growth │
│ 30-50% │ Adequate │ Start planning scaling │
│ 20-30% │ Warning │ Actively prepare scaling │
│ 10-20% │ Critical │ Scale NOW │
│ < 10% │ Emergency │ You're probably already degraded │
│ │
└─────────────────────────────────────────────────────────────────────┘
The rule of thumb: Maintain at least 30% headroom on critical resources.
Chapter 2: Measuring Current Capacity
2.1 Establishing Baselines
# capacity/baseline.py
"""
Capacity baseline measurement and tracking.
Establishes what "normal" looks like for your system.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime, timedelta
from enum import Enum
import statistics
@dataclass
class ResourceMetrics:
"""Metrics for a single resource."""
name: str
current_value: float
max_capacity: float
unit: str
@property
def utilization(self) -> float:
"""Current utilization percentage."""
return (self.current_value / self.max_capacity) * 100 if self.max_capacity > 0 else 0
@property
def headroom(self) -> float:
"""Available headroom percentage."""
return 100 - self.utilization
@property
def status(self) -> str:
"""Health status based on headroom."""
if self.headroom > 50:
return "healthy"
elif self.headroom > 30:
return "adequate"
elif self.headroom > 20:
return "warning"
elif self.headroom > 10:
return "critical"
else:
return "emergency"
@dataclass
class CapacityBaseline:
"""Baseline capacity measurements for a service."""
service_name: str
measured_at: datetime
# Traffic metrics
avg_rps: float
peak_rps: float
p50_latency_ms: float
p99_latency_ms: float
# Resource metrics
resources: Dict[str, ResourceMetrics] = field(default_factory=dict)
# Derived limits
max_sustainable_rps: float = 0 # RPS before latency degrades
breaking_point_rps: float = 0 # RPS when errors start
class CapacityMeasurer:
"""
Measures and tracks system capacity.
"""
def __init__(self, metrics_client, config):
self.metrics = metrics_client
self.config = config
async def measure_baseline(
self,
service_name: str,
window: str = "7d"
) -> CapacityBaseline:
"""
Measure baseline capacity from historical data.
"""
# Traffic metrics
avg_rps = await self.metrics.query(
f'avg(rate(http_requests_total{{service="{service_name}"}}[5m]))',
window=window
)
peak_rps = await self.metrics.query(
f'max(rate(http_requests_total{{service="{service_name}"}}[5m]))',
window=window
)
p50_latency = await self.metrics.query(
f'avg(histogram_quantile(0.5, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m])))',
window=window
) * 1000 # Convert to ms
p99_latency = await self.metrics.query(
f'avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{{service="{service_name}"}}[5m])))',
window=window
) * 1000
# Resource metrics
resources = {}
# CPU
cpu_usage = await self.metrics.query(
f'avg(rate(container_cpu_usage_seconds_total{{service="{service_name}"}}[5m]))',
window=window
)
cpu_limit = await self.metrics.query(
f'avg(container_spec_cpu_quota{{service="{service_name}"}} / container_spec_cpu_period{{service="{service_name}"}})',
window=window
)
resources['cpu'] = ResourceMetrics(
name='CPU',
current_value=cpu_usage,
max_capacity=cpu_limit,
unit='cores'
)
# Memory
memory_usage = await self.metrics.query(
f'avg(container_memory_usage_bytes{{service="{service_name}"}})',
window=window
) / (1024 ** 3) # Convert to GB
memory_limit = await self.metrics.query(
f'avg(container_spec_memory_limit_bytes{{service="{service_name}"}})',
window=window
) / (1024 ** 3)
resources['memory'] = ResourceMetrics(
name='Memory',
current_value=memory_usage,
max_capacity=memory_limit,
unit='GB'
)
# Database connections
db_connections = await self.metrics.query(
f'avg(db_connections_active{{service="{service_name}"}})',
window=window
)
db_max_connections = await self.metrics.query(
f'avg(db_connections_max{{service="{service_name}"}})',
window=window
)
resources['db_connections'] = ResourceMetrics(
name='Database Connections',
current_value=db_connections,
max_capacity=db_max_connections,
unit='connections'
)
return CapacityBaseline(
service_name=service_name,
measured_at=datetime.utcnow(),
avg_rps=avg_rps,
peak_rps=peak_rps,
p50_latency_ms=p50_latency,
p99_latency_ms=p99_latency,
resources=resources
)
async def generate_capacity_report(
self,
service_name: str
) -> str:
"""
Generate a human-readable capacity report.
"""
baseline = await self.measure_baseline(service_name)
report = f"""
CAPACITY REPORT: {service_name}
Generated: {baseline.measured_at.isoformat()}
{'=' * 60}
TRAFFIC METRICS
---------------
Average RPS: {baseline.avg_rps:.1f}
Peak RPS: {baseline.peak_rps:.1f}
P50 Latency: {baseline.p50_latency_ms:.1f}ms
P99 Latency: {baseline.p99_latency_ms:.1f}ms
RESOURCE UTILIZATION
--------------------
"""
for name, resource in baseline.resources.items():
status_emoji = {
'healthy': '🟢',
'adequate': '🟡',
'warning': '🟠',
'critical': '🔴',
'emergency': '⚫'
}.get(resource.status, '⚪')
report += f"""
{resource.name}:
Current: {resource.current_value:.2f} {resource.unit}
Maximum: {resource.max_capacity:.2f} {resource.unit}
Usage: {resource.utilization:.1f}%
Headroom: {resource.headroom:.1f}%
Status: {status_emoji} {resource.status.upper()}
"""
# Identify bottleneck
bottleneck = min(
baseline.resources.values(),
key=lambda r: r.headroom
)
report += f"""
{'=' * 60}
BOTTLENECK ANALYSIS
-------------------
Primary bottleneck: {bottleneck.name}
- Only {bottleneck.headroom:.1f}% headroom remaining
- Will likely be first to saturate under load
RECOMMENDATIONS
---------------
"""
for name, resource in baseline.resources.items():
if resource.headroom < 30:
report += f" ⚠️ {resource.name}: Plan scaling (headroom < 30%)\n"
elif resource.headroom < 50:
report += f" 📋 {resource.name}: Monitor closely\n"
return report
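To tie this together, here is a minimal usage sketch. The metrics_client and config objects are placeholders for whatever Prometheus-style client and settings object you already have; "checkout-api" is an example service name.
# Hypothetical wiring: metrics_client and config are stand-ins for your own setup.
import asyncio

async def main():
    measurer = CapacityMeasurer(metrics_client, config)
    report = await measurer.generate_capacity_report("checkout-api")
    print(report)

asyncio.run(main())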
2.2 Understanding Your Limits
FINDING YOUR BREAKING POINT
You need to know:
1. At what load does latency start degrading?
2. At what load do errors start appearing?
3. At what load does the system fall over completely?
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ LATENCY vs LOAD CURVE │
│ │
│ Latency │
│ (ms) │
│ │ ╱ │
│ 5000│ ╱ │
│ │ ╱ │
│ │ ╱ Breaking │
│ 1000│ ╱ Point │
│ │ ╱ │
│ │ ╱╱╱ │
│ 500│ ╱╱╱ Degradation │
│ │ ╱╱╱ Begins │
│ │ ╱╱╱╱ │
│ 200├──────────────────────────╱──────────────────── SLO Target │
│ │ ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱ │
│ 100│╱╱╱╱╱╱ │
│ └────────────┬────────────┬────────────┬────────────┬──────────► │
│ 500 1000 1500 2000 RPS │
│ │ │ │
│ │ └── Breaking point │
│ └── Sustainable max (with SLO) │
│ │
│ Key points: │
│ ├── Up to 1000 RPS: Latency stable, within SLO │
│ ├── 1000-1500 RPS: Latency increasing but acceptable │
│ ├── 1500+ RPS: SLO violated, degradation visible │
│ └── 1800+ RPS: System failing, errors appearing │
│ │
│   YOUR OPERATING CAPACITY = 1,000 RPS (~30% headroom below the 1,500 RPS sustainable max)   │
│ │
└────────────────────────────────────────────────────────────────────────┘
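One way to turn a stress-test sweep into these numbers: find the highest tested RPS whose p99 still meets the SLO, then set the operating target roughly 30% below it. A minimal sketch; the sample sweep is illustrative and mirrors the curve above.
def sustainable_max_rps(sweep, slo_p99_ms):
    """sweep: list of (rps, p99_ms) pairs from a ramped load test."""
    within_slo = [rps for rps, p99 in sweep if p99 <= slo_p99_ms]
    return max(within_slo) if within_slo else 0

# Illustrative data points matching the curve above (SLO: p99 < 200ms)
sweep = [(500, 110), (1000, 140), (1500, 195), (1800, 600), (2000, 5000)]
max_sustainable = sustainable_max_rps(sweep, slo_p99_ms=200)   # 1500 RPS
operating_target = max_sustainable * 0.7                       # ~1000 RPS with 30% headroom
print(max_sustainable, round(operating_target))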
Chapter 3: Forecasting Demand
3.1 Growth Modeling
# capacity/forecasting.py
"""
Demand forecasting for capacity planning.
Predicts future load based on historical trends and known events.
"""
from dataclasses import dataclass
from typing import List, Optional, Tuple
from datetime import datetime, timedelta
import math
@dataclass
class TrafficForecast:
"""Forecast for future traffic."""
date: datetime
predicted_rps: float
confidence_low: float
confidence_high: float
factors: List[str] # What's driving this prediction
@dataclass
class GrowthModel:
"""Model for traffic growth."""
base_rps: float
monthly_growth_rate: float # e.g., 0.10 for 10% MoM
seasonality: dict # Month -> multiplier
day_of_week_pattern: dict # Day -> multiplier
hour_of_day_pattern: dict # Hour -> multiplier
class DemandForecaster:
"""
Forecasts future demand based on historical patterns.
"""
def __init__(self, metrics_client):
self.metrics = metrics_client
async def build_growth_model(
self,
service_name: str,
lookback_months: int = 6
) -> GrowthModel:
"""
Build growth model from historical data.
"""
# Get monthly averages
monthly_data = []
for i in range(lookback_months, 0, -1):
start = datetime.utcnow() - timedelta(days=30 * i)
end = start + timedelta(days=30)
avg_rps = await self.metrics.query_range(
f'avg(rate(http_requests_total{{service="{service_name}"}}[1h]))',
start=start,
end=end
)
monthly_data.append(avg_rps.mean())
# Calculate growth rate
if len(monthly_data) >= 2:
# Compound monthly growth rate
start_value = monthly_data[0]
end_value = monthly_data[-1]
months = len(monthly_data) - 1
if start_value > 0:
growth_rate = (end_value / start_value) ** (1 / months) - 1
else:
growth_rate = 0
else:
growth_rate = 0
# Get seasonality (month of year patterns)
seasonality = await self._calculate_seasonality(service_name)
# Get day-of-week patterns
dow_pattern = await self._calculate_dow_pattern(service_name)
# Get hour-of-day patterns
hod_pattern = await self._calculate_hod_pattern(service_name)
return GrowthModel(
base_rps=monthly_data[-1] if monthly_data else 0,
monthly_growth_rate=growth_rate,
seasonality=seasonality,
day_of_week_pattern=dow_pattern,
hour_of_day_pattern=hod_pattern
)
def forecast(
self,
model: GrowthModel,
target_date: datetime,
include_events: List[dict] = None
) -> TrafficForecast:
"""
Forecast traffic for a target date.
"""
now = datetime.utcnow()
months_ahead = (target_date - now).days / 30
# Base growth projection
base_projection = model.base_rps * ((1 + model.monthly_growth_rate) ** months_ahead)
# Apply seasonality
month = target_date.month
seasonality_factor = model.seasonality.get(month, 1.0)
# Apply day-of-week pattern
dow = target_date.weekday()
dow_factor = model.day_of_week_pattern.get(dow, 1.0)
        # Apply hour-of-day pattern for the peak hour (default to 1.0 if no pattern data)
        peak_hour_factor = max(model.hour_of_day_pattern.values(), default=1.0)
# Calculate predicted RPS
predicted_rps = base_projection * seasonality_factor * dow_factor * peak_hour_factor
# Apply event multipliers
factors = ["organic_growth", f"seasonality_{month}", f"dow_{dow}"]
if include_events:
for event in include_events:
if event['date'].date() == target_date.date():
predicted_rps *= event['traffic_multiplier']
factors.append(event['name'])
# Confidence intervals (simplified)
confidence_low = predicted_rps * 0.8
confidence_high = predicted_rps * 1.4
return TrafficForecast(
date=target_date,
predicted_rps=predicted_rps,
confidence_low=confidence_low,
confidence_high=confidence_high,
factors=factors
)
def forecast_range(
self,
model: GrowthModel,
start_date: datetime,
end_date: datetime,
known_events: List[dict] = None
) -> List[TrafficForecast]:
"""
Forecast traffic for a date range.
"""
forecasts = []
current = start_date
while current <= end_date:
forecast = self.forecast(model, current, known_events)
forecasts.append(forecast)
current += timedelta(days=1)
return forecasts
# =============================================================================
# EXAMPLE: PLANNING FOR KNOWN EVENTS
# =============================================================================
"""
Known events that affect traffic:
BLACK_FRIDAY = {
'name': 'Black Friday',
'date': datetime(2024, 11, 29),
'traffic_multiplier': 5.0, # 5x normal traffic
'duration_hours': 24
}
PRODUCT_LAUNCH = {
'name': 'Product Launch',
'date': datetime(2024, 10, 15),
'traffic_multiplier': 3.0,
'duration_hours': 48
}
MARKETING_CAMPAIGN = {
'name': 'TV Ad Campaign',
'date': datetime(2024, 9, 1),
'traffic_multiplier': 2.0,
'duration_hours': 168 # 1 week
}
"""
3.2 Capacity Timeline
CAPACITY TIMELINE PLANNING
Current: January
Traffic: 1,000 RPS average, 1,500 peak
Capacity: 2,000 RPS (sustainable)
Headroom: 33%
Forecast:
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ Month │ Projected Peak │ Capacity │ Headroom │ Action Needed │
│ ─────────┼────────────────┼──────────┼──────────┼──────────────────── │
│ January │ 1,500 RPS │ 2,000 │ 25% │ 🟡 Monitor │
│ February │ 1,650 RPS │ 2,000 │ 17% │ 🟠 Plan scaling │
│ March │ 1,815 RPS │ 2,000 │ 9% │ 🔴 SCALE NOW │
│ April │ 2,000 RPS │ 2,500 │ 20% │ After scaling │
│ May │ 2,200 RPS │ 2,500 │ 12% │ 🟠 Plan again │
│ ... │
│ November │ 4,400 RPS │ ??? │ ??? │ Black Friday! │
│ (Black Friday) × 5 = 22,000 RPS! │
│ │
└────────────────────────────────────────────────────────────────────────┘
ACTION TIMELINE:
├── January: Establish baseline, monitor growth
├── February: Begin scaling project (lead time: 4-6 weeks)
├── March: Complete scaling to 2,500 RPS
├── Q2: Plan for 4,000 RPS capacity (Q3 target)
├── Q3: Begin Black Friday planning
├── October: Pre-scale for Black Friday
└── November: Execute Black Friday capacity plan
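The arithmetic behind a timeline like this is simple compound growth. A small sketch that answers "how many months until my projected peak eats through the headroom buffer", using the numbers above:
import math

def months_until_headroom_breached(peak_rps, capacity_rps, monthly_growth, min_headroom=0.30):
    """Months until the projected peak exceeds capacity minus the headroom buffer."""
    trigger_rps = capacity_rps * (1 - min_headroom)   # e.g. 70% of capacity for 30% headroom
    if peak_rps >= trigger_rps:
        return 0.0
    return math.log(trigger_rps / peak_rps) / math.log(1 + monthly_growth)

# 1,500 RPS peak today, 2,000 RPS capacity, 10% month-over-month growth
print(months_until_headroom_breached(1500, 2000, 0.10, min_headroom=0.10))  # ~1.9 months to <10% headroom (March)
print(months_until_headroom_breached(1500, 2000, 0.10))                     # 0.0 -> already below 30% headroom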
Chapter 4: Identifying Bottlenecks
4.1 Bottleneck Analysis
FINDING THE BOTTLENECK
The bottleneck is the component that limits system throughput.
It's the FIRST thing to saturate under load.
Common bottlenecks:
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. DATABASE │
│ ─────────── │
│ Symptoms: │
│ ├── Slow query times │
│ ├── Connection pool exhaustion │
│ ├── Lock contention │
│ └── High CPU on database server │
│ │
│ Fixes: │
│ ├── Add read replicas │
│ ├── Add caching layer │
│ ├── Optimize queries │
│ ├── Connection pooling (PgBouncer) │
│ └── Shard the database │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. APPLICATION CPU │
│ ────────────────── │
│ Symptoms: │
│ ├── High CPU utilization across pods │
│ ├── Latency increases with load │
│ └── Adding pods helps │
│ │
│ Fixes: │
│ ├── Scale horizontally (add pods) │
│ ├── Optimize hot code paths │
│ ├── Use faster serialization │
│ └── Enable async processing │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. MEMORY │
│ ───────── │
│ Symptoms: │
│ ├── OOM kills │
│ ├── Excessive GC pauses │
│ └── Swap usage │
│ │
│ Fixes: │
│ ├── Increase memory limits │
│ ├── Fix memory leaks │
│ ├── Tune GC settings │
│ └── Stream instead of buffering │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. EXTERNAL DEPENDENCIES │
│ ──────────────────────── │
│ Symptoms: │
│ ├── Timeouts to specific service │
│ ├── Queue buildup for async calls │
│ └── Rate limiting from external APIs │
│ │
│ Fixes: │
│ ├── Add caching │
│ ├── Implement circuit breakers │
│ ├── Negotiate higher rate limits │
│ └── Add fallback providers │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 5. NETWORK │
│ ───────── │
│ Symptoms: │
│ ├── High bandwidth utilization │
│ ├── Connection timeouts │
│ └── Packet loss │
│ │
│ Fixes: │
│ ├── Compress responses │
│ ├── Use CDN for static content │
│ ├── Upgrade network capacity │
│ └── Optimize payload sizes │
│ │
└────────────────────────────────────────────────────────────────────────┘
4.2 Little's Law
LITTLE'S LAW: THE FUNDAMENTAL EQUATION
L = λ × W
Where:
├── L = Average number of items in system (queue + being processed)
├── λ = Arrival rate (requests per second)
└── W = Average time in system (seconds)
EXAMPLE:
Your API:
├── Receives 100 requests/second (λ = 100)
├── Average response time is 200ms (W = 0.2s)
├── Therefore: L = 100 × 0.2 = 20 concurrent requests
If you have 10 pods with 2 workers each = 20 workers total
You're AT CAPACITY!
What happens if traffic increases to 150 RPS?
├── L = 150 × 0.2 = 30 concurrent requests needed
├── You only have 20 workers
├── Requests queue up
├── Queue adds to W (wait time)
├── W increases → L increases → more queuing
├── SPIRAL INTO FAILURE
USING LITTLE'S LAW FOR CAPACITY:
To handle 500 RPS with 200ms response time:
├── L = 500 × 0.2 = 100 concurrent requests
├── Need: 100 workers (or threads/connections)
├── With 4 workers per pod: 100/4 = 25 pods
├── Add 30% headroom: ~33 pods
This is your capacity plan!
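The same arithmetic as a small helper, so the "add 30% headroom" step isn't done by hand. A sketch; workers per pod is whatever your runtime actually provides.
import math

def pods_needed(rps, avg_latency_s, workers_per_pod, headroom=0.30):
    """Little's Law sizing: L = lambda x W concurrent requests, plus a headroom buffer."""
    concurrent = rps * avg_latency_s                    # L = λ × W
    base_pods = math.ceil(concurrent / workers_per_pod)
    return math.ceil(base_pods * (1 + headroom))

print(pods_needed(500, 0.2, workers_per_pod=4))         # 33 pods, matching the plan above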
4.3 Queueing Theory Basics
QUEUEING THEORY FOR ENGINEERS
When utilization approaches 100%, latency goes to infinity.
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ Wait Time │
│ │ │
│ │ ▲ │
│ │ ╱│ │
│ │ ╱ │ │
│ │ ╱╱ │ │
│ │ ╱╱╱ │ │
│ │ ╱╱╱╱ │ │
│ │ ╱╱╱╱╱ │ │
│ │ ╱╱╱╱╱╱ │ Danger │
│ │ ╱╱╱╱╱╱╱ │ Zone │
│ │ ╱╱╱╱╱╱╱╱ │ │
│ │ ╱╱╱╱╱╱╱╱ │ │
│ ├──╱╱╱╱╱───────────────────────────────────────────┬┴────────► │
│ 0% 20% 40% 60% 80% 100% │
│ Utilization │
│ │
│ Key insight: │
│    ├── At 50% utilization: ~2x the unloaded response time, still manageable        │
│    ├── At 80% utilization: ~5x the unloaded response time                          │
│    ├── At 90% utilization: ~10x the unloaded response time                         │
│ └── At 100%: Infinite queue, system fails │
│ │
│ RULE: Never operate above 70-80% utilization on critical resources │
│ │
└────────────────────────────────────────────────────────────────────────┘
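The shape of that curve falls out of even the simplest queueing model. Under an M/M/1 queue, time in system relative to an unloaded server is 1/(1 - utilization); real systems differ, but the hockey stick is the same. A quick sketch:
def mm1_slowdown(utilization):
    """Relative response time vs. an idle M/M/1 server: W/W0 = 1 / (1 - rho)."""
    return 1.0 / (1.0 - utilization)

for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {mm1_slowdown(rho):.0f}x the unloaded response time")
# 50% -> 2x, 80% -> 5x, 90% -> 10x, 99% -> 100x: latency explodes as you approach saturation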
Part II: Implementation
Chapter 5: Load Testing
5.1 Load Testing Strategy
LOAD TESTING TYPES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. SMOKE TEST │
│ ───────────── │
│ Minimal load to verify system works. │
│ ├── Load: Very low (1-10 users) │
│ ├── Duration: Minutes │
│ └── Goal: Verify test setup, basic functionality │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. LOAD TEST │
│ ──────────── │
│ Expected load to verify normal operation. │
│ ├── Load: Normal production traffic │
│ ├── Duration: 30-60 minutes │
│ └── Goal: Verify system handles normal load │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. STRESS TEST │
│ ───────────── │
│ Beyond normal load to find breaking point. │
│ ├── Load: Gradually increase until failure │
│ ├── Duration: Until break or max reached │
│ └── Goal: Find system limits │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. SPIKE TEST │
│ ──────────── │
│ Sudden traffic spike to test elasticity. │
│ ├── Load: Sudden jump (e.g., 2x → 10x → 2x) │
│ ├── Duration: Short spikes (minutes) │
│ └── Goal: Verify auto-scaling, graceful degradation │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 5. SOAK TEST │
│ ─────────── │
│ Extended duration to find memory leaks, drift. │
│ ├── Load: Normal to moderate │
│ ├── Duration: Hours to days │
│ └── Goal: Find issues that emerge over time │
│ │
└────────────────────────────────────────────────────────────────────────┘
5.2 Load Test Implementation
# capacity/load_testing.py
"""
Load testing orchestration and analysis.
Uses k6, Locust, or similar tools under the hood.
"""
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio
import subprocess
import json
class LoadPattern(Enum):
CONSTANT = "constant"
RAMP_UP = "ramp_up"
SPIKE = "spike"
STRESS = "stress"
@dataclass
class LoadTestConfig:
"""Configuration for a load test."""
name: str
target_url: str
# Load profile
pattern: LoadPattern
virtual_users: int
ramp_up_time_seconds: int = 60
duration_seconds: int = 300
# For spike tests
spike_users: int = 0
spike_duration_seconds: int = 60
# Thresholds
max_latency_p99_ms: int = 500
max_error_rate: float = 0.01
# Scenarios
scenarios: List[dict] = field(default_factory=list)
@dataclass
class LoadTestResult:
"""Results from a load test."""
config: LoadTestConfig
started_at: datetime
completed_at: datetime
# Summary metrics
total_requests: int
successful_requests: int
failed_requests: int
# Latency
latency_p50_ms: float
latency_p95_ms: float
latency_p99_ms: float
latency_max_ms: float
# Throughput
requests_per_second: float
# Thresholds
passed: bool
threshold_violations: List[str]
# Raw data for analysis
timeseries_data: List[dict] = field(default_factory=list)
class LoadTestRunner:
"""
Runs and analyzes load tests.
"""
def __init__(self, metrics_client, alerter):
self.metrics = metrics_client
self.alerter = alerter
async def run_test(
self,
config: LoadTestConfig,
notify_on_failure: bool = True
) -> LoadTestResult:
"""
Run a load test and analyze results.
"""
started_at = datetime.utcnow()
# Generate k6 script
script = self._generate_k6_script(config)
# Run k6
result_json = await self._execute_k6(script, config)
# Parse results
result = self._parse_results(config, result_json, started_at)
# Check thresholds
result.passed, result.threshold_violations = self._check_thresholds(
config, result
)
# Notify if failed
if not result.passed and notify_on_failure:
await self.alerter.send_warning(
f"Load test '{config.name}' failed: {result.threshold_violations}"
)
# Store results
await self._store_results(result)
return result
def _generate_k6_script(self, config: LoadTestConfig) -> str:
"""Generate k6 test script."""
if config.pattern == LoadPattern.CONSTANT:
stages = f"""
stages: [
{{ duration: '{config.ramp_up_time_seconds}s', target: {config.virtual_users} }},
{{ duration: '{config.duration_seconds}s', target: {config.virtual_users} }},
{{ duration: '30s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.RAMP_UP:
stages = f"""
stages: [
{{ duration: '{config.duration_seconds}s', target: {config.virtual_users} }},
{{ duration: '60s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.SPIKE:
stages = f"""
stages: [
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '{config.spike_duration_seconds}s', target: {config.spike_users} }},
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '30s', target: 0 }},
]
"""
elif config.pattern == LoadPattern.STRESS:
# Gradually increase until we find the breaking point
stages = f"""
stages: [
{{ duration: '60s', target: {config.virtual_users} }},
{{ duration: '120s', target: {config.virtual_users * 2} }},
{{ duration: '120s', target: {config.virtual_users * 4} }},
{{ duration: '120s', target: {config.virtual_users * 8} }},
{{ duration: '60s', target: 0 }},
]
"""
script = f"""
import http from 'k6/http';
import {{ check, sleep }} from 'k6';
import {{ Rate, Trend }} from 'k6/metrics';
const errorRate = new Rate('errors');
const latency = new Trend('latency');
export let options = {{
{stages},
thresholds: {{
'http_req_duration': ['p(99)<{config.max_latency_p99_ms}'],
'errors': ['rate<{config.max_error_rate}'],
}},
}};
export default function() {{
const res = http.get('{config.target_url}');
const success = check(res, {{
'status is 200': (r) => r.status === 200,
'latency < 500ms': (r) => r.timings.duration < 500,
}});
errorRate.add(!success);
latency.add(res.timings.duration);
sleep(1);
}}
"""
return script
async def _execute_k6(self, script: str, config: LoadTestConfig) -> dict:
"""Execute k6 and return JSON results."""
# Write script to temp file
script_path = f'/tmp/k6_script_{config.name}.js'
with open(script_path, 'w') as f:
f.write(script)
# Run k6 with JSON output
result = subprocess.run(
['k6', 'run', '--out', 'json=/tmp/k6_output.json', script_path],
capture_output=True,
text=True
)
# Parse JSON output
with open('/tmp/k6_output.json', 'r') as f:
return json.load(f)
def _check_thresholds(
self,
config: LoadTestConfig,
result: LoadTestResult
) -> tuple[bool, List[str]]:
"""Check if results meet thresholds."""
violations = []
if result.latency_p99_ms > config.max_latency_p99_ms:
violations.append(
f"P99 latency {result.latency_p99_ms}ms > threshold {config.max_latency_p99_ms}ms"
)
        error_rate = result.failed_requests / max(result.total_requests, 1)  # avoid division by zero on an empty run
if error_rate > config.max_error_rate:
violations.append(
f"Error rate {error_rate:.2%} > threshold {config.max_error_rate:.2%}"
)
return len(violations) == 0, violations
# =============================================================================
# EXAMPLE LOAD TEST SCENARIOS
# =============================================================================
API_LOAD_TEST = LoadTestConfig(
name="api_baseline",
target_url="https://api.example.com/health",
pattern=LoadPattern.CONSTANT,
virtual_users=100,
duration_seconds=300,
max_latency_p99_ms=200,
max_error_rate=0.01
)
BLACK_FRIDAY_TEST = LoadTestConfig(
name="black_friday_simulation",
target_url="https://api.example.com/checkout",
pattern=LoadPattern.SPIKE,
virtual_users=500,
spike_users=5000,
spike_duration_seconds=120,
duration_seconds=600,
max_latency_p99_ms=1000,
max_error_rate=0.05
)
STRESS_TEST = LoadTestConfig(
name="find_breaking_point",
target_url="https://api.example.com/api",
pattern=LoadPattern.STRESS,
virtual_users=100,
duration_seconds=600,
max_latency_p99_ms=5000,
max_error_rate=0.10
)
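A usage sketch for the runner above. metrics_client and alerter are placeholders for whatever clients you already pass around; the result fields come from the LoadTestResult dataclass defined earlier.
# Hypothetical wiring: metrics_client and alerter are stand-ins for your own clients.
import asyncio

async def main():
    runner = LoadTestRunner(metrics_client, alerter)
    result = await runner.run_test(API_LOAD_TEST)
    print(f"{result.config.name}: p99={result.latency_p99_ms:.0f}ms, "
          f"rps={result.requests_per_second:.0f}, passed={result.passed}")

asyncio.run(main())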
Chapter 6: Scaling Strategies
6.1 Vertical vs Horizontal Scaling
SCALING STRATEGIES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ VERTICAL SCALING (Scale Up) │
│ ─────────────────────────── │
│ │
│ Before: After: │
│ ┌───────┐ ┌───────────┐ │
│ │ 4 CPU │ ──► │ 16 CPU │ │
│ │ 8 GB │ │ 64 GB │ │
│ └───────┘ └───────────┘ │
│ │
│ Pros: │
│ ├── Simple (no code changes) │
│ ├── No distributed complexity │
│ └── Good for databases │
│ │
│ Cons: │
│ ├── Limited by hardware │
│ ├── Single point of failure │
│ ├── Expensive at high end │
│ └── Usually requires downtime │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ HORIZONTAL SCALING (Scale Out) │
│ ────────────────────────────── │
│ │
│ Before: After: │
│ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Pod │ ──► │ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod 4 │ │
│ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ │
│ │
│ Pros: │
│ ├── Nearly unlimited scaling │
│ ├── No single point of failure │
│ ├── Can scale incrementally │
│ └── Often cheaper at scale │
│ │
│ Cons: │
│ ├── Distributed complexity │
│ ├── Stateless requirement │
│ ├── Load balancing needed │
│ └── Data consistency challenges │
│ │
└────────────────────────────────────────────────────────────────────────┘
WHEN TO USE WHICH:
Vertical (Scale Up):
├── Databases (primary)
├── Single-threaded workloads
├── When simplicity matters most
└── As a quick fix while planning horizontal
Horizontal (Scale Out):
├── Stateless web services
├── Read replicas
├── Worker pools
└── For long-term scalability
6.2 Auto-Scaling Configuration
# capacity/autoscaling.py
"""
Auto-scaling configuration and policies.
Defines when and how to scale automatically.
"""
from dataclasses import dataclass
from typing import List, Optional
from enum import Enum
class ScalingMetric(Enum):
CPU = "cpu"
MEMORY = "memory"
RPS = "requests_per_second"
QUEUE_DEPTH = "queue_depth"
LATENCY = "latency_p99"
CUSTOM = "custom"
@dataclass
class ScalingPolicy:
"""Defines when to scale."""
name: str
metric: ScalingMetric
# Thresholds
scale_up_threshold: float
scale_down_threshold: float
# Behavior
scale_up_cooldown_seconds: int = 300
scale_down_cooldown_seconds: int = 600
# Limits
min_replicas: int = 2
max_replicas: int = 100
# Aggressiveness
scale_up_by: int = 2 # Add 2 pods at a time
scale_down_by: int = 1 # Remove 1 pod at a time
@dataclass
class AutoScalingConfig:
"""Complete auto-scaling configuration."""
deployment_name: str
namespace: str
# Policies (can have multiple)
policies: List[ScalingPolicy]
# Predictive scaling (for known events)
scheduled_scaling: List[dict] = None
# Protection
disable_scale_down_during_deployment: bool = True
# =============================================================================
# EXAMPLE CONFIGURATIONS
# =============================================================================
# Web API: Scale on CPU and latency
WEB_API_SCALING = AutoScalingConfig(
deployment_name="api-server",
namespace="production",
policies=[
# Primary: CPU-based scaling
ScalingPolicy(
name="cpu-scaling",
metric=ScalingMetric.CPU,
scale_up_threshold=70, # Scale up when CPU > 70%
scale_down_threshold=30, # Scale down when CPU < 30%
min_replicas=3,
max_replicas=50,
scale_up_by=3,
scale_down_by=1,
scale_down_cooldown_seconds=900 # 15 min cooldown
),
# Secondary: Latency-based scaling
ScalingPolicy(
name="latency-scaling",
metric=ScalingMetric.LATENCY,
scale_up_threshold=200, # Scale up when p99 > 200ms
scale_down_threshold=50, # Scale down when p99 < 50ms
min_replicas=3,
max_replicas=50,
scale_up_by=5, # More aggressive for latency
scale_up_cooldown_seconds=60 # React faster to latency
),
],
scheduled_scaling=[
# Pre-scale for known traffic patterns
{
"name": "morning_ramp",
"schedule": "0 8 * * 1-5", # 8 AM weekdays
"min_replicas": 10
},
{
"name": "evening_scale_down",
"schedule": "0 22 * * *", # 10 PM daily
"min_replicas": 3
},
{
"name": "black_friday",
"schedule": "0 0 * 11 5", # Thanksgiving week
"min_replicas": 50,
"max_replicas": 200
},
]
)
# Worker pool: Scale on queue depth
WORKER_SCALING = AutoScalingConfig(
deployment_name="worker-pool",
namespace="production",
policies=[
ScalingPolicy(
name="queue-scaling",
metric=ScalingMetric.QUEUE_DEPTH,
scale_up_threshold=1000, # Scale up when queue > 1000
scale_down_threshold=100, # Scale down when queue < 100
min_replicas=2,
max_replicas=20,
scale_up_by=2,
scale_up_cooldown_seconds=60 # React quickly to queue buildup
),
]
)
def generate_kubernetes_hpa(config: AutoScalingConfig) -> dict:
"""Generate Kubernetes HPA manifest from config."""
hpa = {
"apiVersion": "autoscaling/v2",
"kind": "HorizontalPodAutoscaler",
"metadata": {
"name": f"{config.deployment_name}-hpa",
"namespace": config.namespace
},
"spec": {
"scaleTargetRef": {
"apiVersion": "apps/v1",
"kind": "Deployment",
"name": config.deployment_name
},
"minReplicas": min(p.min_replicas for p in config.policies),
"maxReplicas": max(p.max_replicas for p in config.policies),
"metrics": [],
"behavior": {
"scaleUp": {
"stabilizationWindowSeconds": 60,
"policies": [
{
"type": "Pods",
"value": max(p.scale_up_by for p in config.policies),
"periodSeconds": 60
}
]
},
"scaleDown": {
"stabilizationWindowSeconds": 300,
"policies": [
{
"type": "Pods",
"value": 1,
"periodSeconds": 120
}
]
}
}
}
}
# Add metrics
for policy in config.policies:
if policy.metric == ScalingMetric.CPU:
hpa["spec"]["metrics"].append({
"type": "Resource",
"resource": {
"name": "cpu",
"target": {
"type": "Utilization",
"averageUtilization": int(policy.scale_up_threshold)
}
}
})
elif policy.metric == ScalingMetric.MEMORY:
hpa["spec"]["metrics"].append({
"type": "Resource",
"resource": {
"name": "memory",
"target": {
"type": "Utilization",
"averageUtilization": int(policy.scale_up_threshold)
}
}
})
return hpa
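To render the manifest, something like the following works, assuming PyYAML is installed. Note that scheduled_scaling is not expressible in a plain HPA; it needs CronJobs, KEDA, or a custom controller layered on top.
# Render and inspect the HPA manifest for the web API config above.
import yaml  # assumes PyYAML is available

if __name__ == "__main__":
    manifest = generate_kubernetes_hpa(WEB_API_SCALING)
    print(yaml.safe_dump(manifest, sort_keys=False))
    # kubectl apply -f <rendered-file> to install; scheduled_scaling entries
    # still need a separate mechanism (CronJob, KEDA, etc.).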
Part III: Real-World Application
Chapter 7: Case Studies
7.1 How Netflix Handles Scale
NETFLIX CAPACITY PLANNING
Scale:
├── 200M+ subscribers
├── 15% of global internet traffic
├── Hundreds of microservices
├── Runs primarily on AWS
Key Practices:
1. CHAOS ENGINEERING
├── Chaos Monkey: Randomly kills instances
├── Chaos Kong: Simulates region failure
├── Gremlin: Controlled fault injection
└── Result: Confidence in resilience
2. PREDICTIVE SCALING
├── Know when shows release
├── Pre-scale for "Squid Game" launch
├── Analyze viewing patterns
└── Scale regions before prime time
3. CAPACITY TESTING
├── Regular load tests in production
├── "Red/Black" pushes to test capacity
├── Test at 2x expected peak
└── Find bottlenecks before customers do
4. AUTO-SCALING
├── Titus: Their container platform
├── Aggressive auto-scaling policies
├── Scale based on request latency
└── Can scale from 0 to thousands of instances
5. REGIONAL DISTRIBUTION
├── Content in multiple AWS regions
├── CDN (Open Connect) for streaming
├── Failover between regions
└── No single region is critical
Lesson: At Netflix scale, capacity planning is a full-time job.
They have dedicated teams for capacity analysis.
7.2 How Stripe Handles Scale
STRIPE CAPACITY PLANNING
Context:
├── Processes billions in payments
├── Every request is high-value
├── Zero tolerance for downtime
├── Must scale for Black Friday, Cyber Monday
Key Practices:
1. METERED OVERPROVISIONING
├── Always maintain 50%+ headroom
├── Cost of overprovisioning < cost of downtime
├── For payments, reliability > efficiency
└── "We'd rather waste capacity than drop transactions"
2. LOAD SHEDDING
├── Prioritize payment completion over analytics
├── Graceful degradation under load
├── Non-critical features turn off automatically
└── Payment path is always protected
3. REGULAR GAME DAYS
├── Simulate Black Friday in staging
├── Test at 3x expected peak
├── Practice runbooks
└── Find weaknesses before they matter
4. MULTI-REGION
├── Active-active in multiple regions
├── Can shift traffic between regions
├── Used for capacity expansion
└── Also provides disaster recovery
5. CAPACITY COMMITTEE
├── Weekly capacity review
├── Cross-functional (eng + ops + finance)
├── Review utilization, forecasts, plans
└── Make scaling decisions together
Lesson: For critical systems, overprovision.
The cost of downtime far exceeds the cost of extra capacity.
Chapter 8: Common Mistakes
CAPACITY PLANNING ANTI-PATTERNS
❌ MISTAKE 1: Waiting for Problems
Wrong:
"We'll scale when we see issues"
Problem:
├── Scaling takes time (minutes to hours)
├── By the time you see issues, users are affected
└── Reactive = outages
Right:
├── Know your limits from load testing
├── Track headroom continuously
├── Scale BEFORE you need to
❌ MISTAKE 2: Only Measuring Averages
Wrong:
"Average CPU is only 30%, we're fine!"
Problem:
├── Peak usage is what matters
├── Spikes cause outages, not averages
└── P99 utilization might be 95%
Right:
├── Track P99 utilization
├── Watch peak-to-average ratio
└── Plan for peaks, not averages
❌ MISTAKE 3: Ignoring Dependencies
Wrong:
"We can handle 10K RPS"
Problem:
├── But can your database?
├── Can your cache?
├── Can your third-party APIs?
└── You're only as fast as your slowest dependency
Right:
├── Test the entire stack
├── Know limits of all dependencies
└── Plan capacity for the bottleneck
❌ MISTAKE 4: Not Testing Production Config
Wrong:
"Load tests pass in staging!"
Problem:
├── Staging != production
├── Different data sizes
├── Different network topology
└── Different concurrency patterns
Right:
├── Load test in production (carefully)
├── Or use production-like staging
└── Mirror production traffic for testing
❌ MISTAKE 5: Ignoring Auto-Scaling Lag
Wrong:
"Auto-scaling will handle it"
Problem:
├── Auto-scaling takes 3-5 minutes
├── Spike hits before scale-up completes
└── Brief outage during scaling
Right:
├── Pre-scale for known events
├── Maintain baseline capacity for spikes
└── Auto-scaling for sustained increases
Part IV: Interview Preparation
Chapter 9: Interview Tips
9.1 Capacity Discussion Framework
DISCUSSING CAPACITY IN INTERVIEWS
When asked "How would you ensure this system handles scale?":
1. ESTIMATE THE LOAD
"First, let me understand the scale:
- How many users?
- What's the request pattern?
- What's the peak vs average ratio?
Let me calculate the expected RPS..."
2. IDENTIFY BOTTLENECKS
"The likely bottlenecks are:
- Database: Will it handle this QPS?
- Memory: How much state per user?
- Network: What's the payload size?
Let me work through each..."
3. DESIGN FOR SCALE
"To handle this scale, I'd:
- Add caching to reduce DB load
- Shard the database by user_id
- Use connection pooling
- Deploy across multiple regions"
4. PLAN FOR GROWTH
"To stay ahead of growth:
- Maintain 30% headroom on all resources
- Auto-scale based on CPU and latency
- Load test regularly
- Pre-scale for known events"
5. HANDLE SPIKES
"For unexpected spikes:
- Rate limiting protects the system
- Load shedding drops non-critical work
- Circuit breakers prevent cascade
- Graceful degradation keeps core working"
9.2 Key Phrases
CAPACITY KEY PHRASES
On Measurement:
"I'd establish a baseline by load testing to find
where latency starts degrading. That's my sustainable
capacity. Then I'd ensure we maintain at least 30%
headroom below that limit."
On Bottlenecks:
"The bottleneck is usually the database. At 1000 RPS
with 10ms average query time, we need 10 concurrent
connections minimum. I'd use connection pooling
and add read replicas if needed."
On Little's Law:
"Using Little's Law: if we have 500 RPS and 100ms
response time, we need 50 concurrent handlers.
With 4 workers per pod, that's about 13 pods.
Add 30% headroom: 17 pods minimum."
On Auto-Scaling:
"Auto-scaling reacts to load, but it's not instant.
For predictable events like Black Friday, I'd
pre-scale the night before. Auto-scaling handles
the unexpected variations."
On Load Testing:
"I'd run regular load tests at 2x expected peak.
This finds bottlenecks before customers do.
The goal isn't just to pass, but to understand
where the system breaks."
Chapter 10: Practice Problems
Problem 1: E-Commerce Peak Planning
Scenario: Your e-commerce site normally handles 500 RPS. Black Friday is coming with expected 10x traffic.
Questions:
- How would you prepare for 5,000 RPS?
- What needs to scale and by how much?
- What's your testing strategy?
Answers:
Preparation (6-8 weeks before):
- Load test current system to find breaking point
- Identify bottlenecks (probably database)
- Plan and implement scaling
- Schedule pre-scaling for the night before
What to scale:
- Web tier: 10x pods (with auto-scaling headroom)
- Database: Add read replicas, warm connection pools
- Cache: 10x cache size to handle increased working set
- CDN: Pre-warm with popular content
Testing:
- Load test at 10x in staging
- Shadow traffic test in production
- Have runbooks for manual scaling
- War room staffed during peak
Problem 2: Database Scaling
Scenario: Your database handles 2,000 QPS. You're at 1,500 QPS and growing 10% monthly.
Questions:
- When will you hit capacity?
- What are your options?
- How would you buy time?
Answers:
Time to capacity:
- Month 1: 1,500 QPS (75% utilized)
- Month 2: 1,650 QPS (82.5%)
- Month 3: 1,815 QPS (91%)
- Month 4: 2,000 QPS (100%) - CRITICAL
- Action needed by month 2
Options:
- Vertical: Upgrade to larger instance (quick fix)
- Read replicas: Offload read traffic (medium effort)
- Caching: Reduce queries (high impact)
- Sharding: Long-term solution (high effort)
Buying time:
- Add aggressive caching (1-2 weeks)
- Optimize slow queries (ongoing)
- Add read replica (2-4 weeks)
- Plan sharding for 6-month horizon
Chapter 11: Sample Interview Dialogue
Interviewer: "You're designing a food delivery app. How do you plan for capacity during dinner rush?"
You: "Great question. Let me think through the load characteristics first.
Understanding the pattern:"
Traffic Pattern:
├── Normal: 1,000 orders/minute
├── Dinner rush (6-8 PM): 5x = 5,000 orders/minute
├── Friday dinner: 7x = 7,000 orders/minute
├── Super Bowl Sunday: 20x = 20,000 orders/minute
Each order involves:
├── 5-10 API calls at checkout (cart, payment, confirmation)
├── 1 database write (order creation)
├── 2-3 external calls (payment, restaurant, driver)
└── Plus browsing before checkout: roughly 50-100 API calls per order overall
Peak API load: 7,000 × 100 = 700,000 API calls/minute = ~12,000 RPS
Interviewer: "How would you handle that?"
You: "I'd design with the peak in mind and auto-scale for variations.
For the API tier:"
├── Using Little's Law with ~100ms average response time: 12,000 RPS × 0.1s = 1,200 concurrent requests
├── With 20 workers per pod: 1,200/20 = 60 pods minimum
├── Add 50% headroom: 90 pods
├── Auto-scale between 20 pods (baseline) and 150 pods (max)
"For the database:"
├── 7,000 writes/minute = 117 writes/second
├── Reads are probably 10x writes = 1,170 reads/second
├── Use read replicas for menu/restaurant queries
├── Shard orders table by region for write scaling
├── Cache frequently accessed data (menus, restaurant info)
Interviewer: "What about Super Bowl Sunday at 20x?"
You: "For exceptional events, I wouldn't rely on auto-scaling alone.
Pre-event preparation:"
1. Week before: Load test at 25x (leave margin)
2. Night before: Pre-scale to 50% of expected peak
3. During event: Auto-scale handles variations
4. Have war room staffed
5. Pre-warm caches with popular restaurants
If we still hit limits:
├── Graceful degradation: Disable non-essential features
├── Rate limiting: Queue orders instead of rejecting
├── Geographic limiting: Prioritize areas we can serve
Interviewer: "Good systematic approach. How do you know if your planning worked?"
You: "I'd set up clear success metrics:
During the event:
- Order success rate > 99%
- Checkout latency < 3 seconds
- No capacity alerts firing
- Auto-scaling responded within 3 minutes
After the event:
- Review peak utilization (should be < 70%)
- Identify any bottlenecks that appeared
- Cost analysis: did we over/under provision?
- Update forecasting model with actual data
This becomes input for next year's planning."
Summary
┌────────────────────────────────────────────────────────────────────────┐
│ DAY 4 KEY TAKEAWAYS │
│ │
│ CAPACITY PLANNING CYCLE: │
│ ├── Measure: Know your current capacity and utilization │
│ ├── Forecast: Predict future demand │
│ ├── Identify: Find bottlenecks before they hit │
│ ├── Plan: Decide how and when to scale │
│ └── Validate: Test your assumptions │
│ │
│ KEY METRICS: │
│ ├── Headroom = (Max Capacity - Current) / Max Capacity │
│ ├── Maintain at least 30% headroom │
│ ├── Track P99 utilization, not just averages │
│ └── Little's Law: L = λ × W │
│ │
│ LOAD TESTING: │
│ ├── Smoke: Verify basic functionality │
│ ├── Load: Test normal production traffic │
│ ├── Stress: Find the breaking point │
│ ├── Spike: Test sudden traffic jumps │
│ └── Soak: Find issues that emerge over time │
│ │
│ SCALING STRATEGIES: │
│ ├── Vertical: Bigger machines (simple, limited) │
│ ├── Horizontal: More machines (complex, unlimited) │
│ ├── Auto-scaling: React to load (not instant!) │
│ └── Pre-scaling: Prepare for known events │
│ │
│ COMMON BOTTLENECKS: │
│ ├── Database (connections, queries) │
│ ├── Application CPU │
│ ├── Memory (especially caches) │
│ ├── External dependencies │
│ └── Network bandwidth │
│ │
│ KEY INSIGHT: │
│ The best time to scale is BEFORE you need to. │
│ Know your limits. Watch your headroom. Plan ahead. │
│ │
└────────────────────────────────────────────────────────────────────────┘
Further Reading
Books:
- "The Art of Capacity Planning" by John Allspaw
- "Release It!" by Michael Nygard - Capacity patterns
Tools:
- k6: Modern load testing
- Locust: Python-based load testing
- Grafana + Prometheus: Capacity monitoring
- AWS Auto Scaling / Kubernetes HPA
Articles:
- Netflix: "Lessons from Building Observability Tools at Netflix"
- Google SRE Book: Chapter on Capacity Planning
- Stripe: "Scaling Your API with Rate Limiters"
End of Day 4: Capacity Planning
Tomorrow: Day 5 — Incident Management. Despite all your planning, things will still break. When they do, how do you respond? How do you learn? Tomorrow, we cover the human side of operations.
You now have the full operational toolkit: Define health. See health. Maintain health through change. Ensure health under load. Tomorrow: what to do when health fails.