Himanshu Kukreja

Week 10 — Day 1: SLIs, SLOs, and SLAs

System Design Mastery Series — Production Readiness and Operational Excellence


Preface

You've come a long way.

YOUR JOURNEY

Week 1:  You learned how data scales — partitioning, replication, hot keys
Week 2:  You learned how systems fail — timeouts, retries, idempotency
Week 3:  You learned async patterns — queues, streams, backpressure
Week 4:  You learned caching — invalidation, thundering herds, consistency
Week 5:  You learned coordination — consensus, sagas, conflict resolution
Week 6:  You designed a notification platform — your first complete system
Week 7:  You designed a search system — indexing, relevance, resilience
Week 8:  You designed an analytics pipeline — streaming, batch, correctness
Week 9:  You learned enterprise concerns — multi-tenancy, compliance, security

Now, Week 10: You learn to OPERATE what you build.

Here's the thing nobody tells junior engineers:

THE UNCOMFORTABLE TRUTH

Building the system is maybe 30% of the work.
Operating the system is the other 70%.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  WHAT JUNIOR ENGINEERS THINK:                                          │
│                                                                        │
│  Requirements → Design → Code → Test → Deploy → Done! ✓                │
│                                                                        │
│  WHAT SENIOR ENGINEERS KNOW:                                           │
│                                                                        │
│  Requirements → Design → Code → Test → Deploy → Monitor → Alert →      │
│  Debug → Fix → Deploy → Monitor → Scale → Monitor → Incident →         │
│  Respond → Fix → Postmortem → Improve → Monitor → Capacity Plan →      │
│  Scale → Monitor → New Feature → Deploy → Monitor → ...forever         │
│                                                                        │
│  The system is never "done." It's either running or it's dead.         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

This week, we close the loop. We learn to answer the question that matters most:

"Is my system healthy?"

And it starts with defining what "healthy" even means.


Part I: Foundations

Chapter 1: The Language of Reliability

1.1 Why We Need Precise Language

Imagine this conversation:

Product Manager: "Is the API reliable?"
Engineer: "Yeah, it's pretty reliable."
PM: "What does 'pretty reliable' mean?"
Engineer: "It's usually up."
PM: "Usually? We just lost a $500K customer who said the API was down."
Engineer: "It was only down for 10 minutes!"
PM: "They said it was down for an hour."
Engineer: "Well, it depends on what you mean by 'down'..."

This conversation happens because we lack precise language for reliability.

Enter SLIs, SLOs, and SLAs.

1.2 The Three Pillars

THE RELIABILITY LANGUAGE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  SLI (Service Level Indicator)                                         │
│  ─────────────────────────────                                         │
│  WHAT: A quantitative measure of service behavior                      │
│  EXAMPLE: "The proportion of requests that complete in < 200ms"        │
│  ANSWER TO: "What should we measure?"                                  │
│                                                                        │
│                              │                                         │
│                              ▼                                         │
│                                                                        │
│  SLO (Service Level Objective)                                         │
│  ─────────────────────────────                                         │
│  WHAT: A target value for an SLI                                       │
│  EXAMPLE: "99.9% of requests should complete in < 200ms"               │
│  ANSWER TO: "How reliable should we be?"                               │
│                                                                        │
│                              │                                         │
│                              ▼                                         │
│                                                                        │
│  SLA (Service Level Agreement)                                         │
│  ─────────────────────────────                                         │
│  WHAT: A contract with consequences for missing targets                │
│  EXAMPLE: "If availability drops below 99.5%, customer gets credit"    │
│  ANSWER TO: "What happens if we fail?"                                 │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Relationship:
├── SLI = The measurement
├── SLO = The target (internal)
└── SLA = The promise (external, with consequences)

Rule of thumb:
├── SLO should be stricter than SLA
├── You should hit SLO 99%+ of the time
└── You should almost never violate SLA

1.3 A Concrete Example

Let's make this real with our notification platform from Week 6:

NOTIFICATION PLATFORM RELIABILITY

SLI (What we measure):
──────────────────────
1. Availability: Proportion of API requests that succeed (non-5xx)
2. Latency: Proportion of notification sends completing in < 500ms
3. Delivery: Proportion of notifications delivered within 30 seconds
4. Correctness: Proportion of notifications delivered to correct recipient

SLO (What we target):
─────────────────────
1. Availability: 99.9% of requests succeed over 30 days
2. Latency: 99% of sends complete in < 500ms over 30 days
3. Delivery: 95% of notifications delivered within 30 seconds
4. Correctness: 99.99% delivered to correct recipient (this is critical!)

SLA (What we promise customers):
───────────────────────────────
"The Notification API will be available 99.5% of the time,
measured monthly. If we fail to meet this target, affected
customers will receive a 10% service credit for that month."

Notice:
├── SLO availability (99.9%) > SLA availability (99.5%)
├── We have internal buffer to fix issues before violating SLA
└── Correctness has no SLA because we CANNOT get it wrong
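
To make the distinction concrete in code, here is a minimal sketch of how these SLI/SLO/SLA definitions could be captured as plain data. The field names and schema are illustrative assumptions, not a standard format.

# slo/definitions.py (illustrative sketch, not a standard schema)

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ServiceLevel:
    """One SLI, its internal SLO target, and an optional external SLA target."""
    sli: str                       # what we measure
    slo_target: float              # internal target, in percent
    sla_target: Optional[float]    # external promise, in percent (None = no SLA)
    window_days: int = 30

NOTIFICATION_PLATFORM = [
    ServiceLevel("API requests succeeding (non-5xx)",       99.9,  99.5),
    ServiceLevel("Sends completing in < 500ms",             99.0,  None),
    ServiceLevel("Notifications delivered within 30s",      95.0,  None),
    ServiceLevel("Notifications sent to the correct user",  99.99, None),
]

# Rule of thumb encoded as a check: every SLO must be stricter than its SLA.
for level in NOTIFICATION_PLATFORM:
    assert level.sla_target is None or level.slo_target > level.sla_target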

Chapter 2: Choosing Good SLIs

2.1 The Four Golden Signals

Google's SRE book popularized these four signals that apply to most services:

THE FOUR GOLDEN SIGNALS

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. LATENCY                                                            │
│     The time it takes to serve a request                               │
│     ├── Measure: p50, p90, p99, p99.9 latencies                        │
│     ├── Distinguish: Successful vs failed request latency              │
│     └── Why: Users perceive slow as broken                             │
│                                                                        │
│  2. TRAFFIC                                                            │
│     How much demand is being placed on the system                      │
│     ├── Measure: Requests per second, queries per second               │
│     ├── Context: By endpoint, by customer, by region                   │
│     └── Why: Capacity planning, anomaly detection                      │
│                                                                        │
│  3. ERRORS                                                             │
│     The rate of failed requests                                        │
│     ├── Measure: 5xx rate, 4xx rate (different meanings)               │
│     ├── Include: Explicit errors AND implicit (wrong data returned)    │
│     └── Why: Direct measure of user impact                             │
│                                                                        │
│  4. SATURATION                                                         │
│     How "full" the service is                                          │
│     ├── Measure: CPU, memory, disk, queue depth, connections           │
│     ├── Predict: When will we hit limits?                              │
│     └── Why: Leading indicator of future problems                      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
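
The four signals map naturally onto standard metric types. Below is a minimal, hypothetical instrumentation sketch using the Python prometheus_client library; the metric names, labels, and endpoints are assumptions for illustration, not a prescribed scheme.

# metrics/golden_signals.py (illustrative sketch using prometheus_client)

from prometheus_client import Counter, Histogram, Gauge

# 1. Latency: histogram of request durations (seconds), per endpoint and status
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["endpoint", "status"]
)

# 2. Traffic: request counter, per endpoint
REQUESTS_TOTAL = Counter("http_requests_total", "Total HTTP requests", ["endpoint"])

# 3. Errors: failed requests, split by class (5xx and 4xx mean different things)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "Failed HTTP requests", ["endpoint", "error_class"]
)

# 4. Saturation: how "full" we are, e.g. queue depth
QUEUE_DEPTH = Gauge("work_queue_depth", "Items waiting in the work queue")

def record_request(endpoint: str, status: int, duration_s: float) -> None:
    """Record one request against latency, traffic, and errors (saturation is sampled elsewhere)."""
    REQUESTS_TOTAL.labels(endpoint).inc()
    REQUEST_LATENCY.labels(endpoint, str(status)).observe(duration_s)
    if status >= 500:
        REQUEST_ERRORS.labels(endpoint, "5xx").inc()
    elif status >= 400:
        REQUEST_ERRORS.labels(endpoint, "4xx").inc()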

2.2 SLI Specification Template

A good SLI is precisely specified:

SLI SPECIFICATION TEMPLATE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  SLI Name: API Request Latency                                         │
│                                                                        │
│  Description:                                                          │
│    The proportion of valid API requests that complete within           │
│    the latency threshold, measured at the load balancer.               │
│                                                                        │
│  Good events:                                                          │
│    HTTP requests to /api/* endpoints that:                             │
│    - Return 2xx or 4xx status code (successful or client error)        │
│    - Complete within 200ms                                             │
│                                                                        │
│  Valid events:                                                         │
│    All HTTP requests to /api/* endpoints, excluding:                   │
│    - Health check endpoints (/api/health)                              │
│    - Internal service-to-service calls                                 │
│    - Requests during planned maintenance windows                       │
│                                                                        │
│  Measurement:                                                          │
│    SLI = (count of good events / count of valid events) × 100%         │
│                                                                        │
│  Data source:                                                          │
│    Load balancer access logs, aggregated every 5 minutes               │
│                                                                        │
│  Rationale:                                                            │
│    - Load balancer gives client-perspective latency                    │
│    - Excludes health checks (not user-facing)                          │
│    - Includes 4xx because they're still "handled correctly"            │
│    - 200ms threshold based on user research showing satisfaction cliff │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
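
Mechanically, the template reduces to a single ratio. Here is a sketch of the measurement step, assuming the filtering of health checks, internal calls, and maintenance windows has already happened upstream when counting events:

def compute_sli(good_events: int, valid_events: int) -> float:
    """SLI = (count of good events / count of valid events) × 100%, per the template above.

    good_events: requests returning 2xx/4xx that completed within 200ms
    valid_events: all /api/* requests, excluding health checks, internal
                  service-to-service calls, and planned maintenance windows
    """
    if valid_events == 0:
        return 100.0  # no valid traffic in the window: treat as meeting the SLI
    return (good_events / valid_events) * 100.0

# Example: one 5-minute aggregation window from load balancer logs
print(compute_sli(good_events=99_850, valid_events=100_000))  # 99.85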

2.3 Common SLIs by Service Type

SLIs BY SERVICE TYPE

API SERVICE:
├── Availability: % requests returning non-5xx
├── Latency: % requests completing < threshold (p50, p99)
└── Throughput: Requests/second (for capacity SLOs)

DATA PIPELINE:
├── Freshness: % time data is < N minutes old
├── Correctness: % records processed without error
├── Completeness: % expected records actually present
└── Throughput: Records/second processed

STORAGE SYSTEM:
├── Availability: % time system accepts reads/writes
├── Durability: % data retrievable after write acknowledged
├── Latency: % operations completing < threshold
└── Throughput: Operations/second

BATCH JOB:
├── Success rate: % job runs completing successfully
├── Freshness: % time since last successful run < threshold
└── Runtime: % runs completing within expected duration

REAL-TIME SYSTEM (like our notification platform):
├── Availability: % requests accepted
├── End-to-end latency: % events processed within time window
├── Delivery rate: % events successfully delivered
└── Ordering: % events delivered in correct order (if required)
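
For services that aren't request-shaped, the same good-events / valid-events framing still applies; the trick is deciding what counts as an event. A sketch of a data-pipeline freshness SLI, where each sampled minute is one event and a "good" minute is one where the newest processed record is recent enough (names and the 5-minute threshold are assumptions):

from datetime import datetime, timedelta
from typing import List

def freshness_sli(
    latest_processed_at: List[datetime],
    sampled_at: List[datetime],
    max_staleness: timedelta = timedelta(minutes=5),
) -> float:
    """% of sampled minutes where pipeline data was fresher than max_staleness."""
    good = sum(
        1
        for latest, now in zip(latest_processed_at, sampled_at)
        if now - latest <= max_staleness
    )
    return good / len(sampled_at) * 100.0 if sampled_at else 100.0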

Chapter 3: Setting SLOs

3.1 The SLO Setting Framework

Setting SLOs is both art and science:

SLO SETTING FRAMEWORK

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  STEP 1: Understand user expectations                                  │
│  ─────────────────────────────────────                                 │
│  ├── What do users actually need?                                      │
│  ├── What would make them leave for a competitor?                      │
│  ├── What's the user journey? Where are they most sensitive?           │
│  └── Talk to customer support — what do users complain about?          │
│                                                                        │
│  STEP 2: Look at historical data                                       │
│  ──────────────────────────────                                        │
│  ├── What's our current performance?                                   │
│  ├── What were we achieving when users were happy?                     │
│  ├── What were we achieving when they complained?                      │
│  └── Don't set SLO at current performance — leave room for drift       │
│                                                                        │
│  STEP 3: Consider business constraints                                 │
│  ─────────────────────────────────────                                 │
│  ├── What's the cost of achieving higher reliability?                  │
│  ├── What's the cost of NOT achieving reliability (churn, reputation)? │
│  ├── What do competitors offer?                                        │
│  └── What does the SLA require? (SLO must exceed SLA)                  │
│                                                                        │
│  STEP 4: Start conservative, adjust with data                          │
│  ────────────────────────────────────────────                          │
│  ├── Set initial SLO slightly below current performance                │
│  ├── Measure for 1-2 quarters                                          │
│  ├── Tighten if we're easily meeting it AND users expect better        │
│  └── Loosen if we're constantly missing AND users are still happy      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

3.2 The Reliability vs Velocity Trade-off

Here's the insight that separates good SREs from everyone else:

THE FUNDAMENTAL TRADE-OFF

100% reliability is:
├── Impossible (hardware fails, networks fail, humans err)
├── Infinitely expensive (diminishing returns)
└── Actually harmful (no room to ship new features)

Consider:

99.9% availability = 8.76 hours downtime/year
99.99% availability = 52.6 minutes downtime/year
99.999% availability = 5.26 minutes downtime/year
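
These downtime figures fall straight out of the arithmetic; a few lines of Python reproduce the table:

def allowed_downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1 - availability_pct / 100) * 365 * 24

for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} minutes)")
# 99.9%   -> 8.76 h/year (525.6 minutes)
# 99.99%  -> 0.88 h/year (52.6 minutes)
# 99.999% -> 0.09 h/year (5.3 minutes)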

To go from 99.9% to 99.99%:
├── Cost: Maybe 10x infrastructure spending
├── Effort: Dedicated reliability team
├── Velocity: Much slower feature development
└── Question: Do users actually need it?

Most users can't tell the difference between 99.9% and 99.99%.
But they CAN tell the difference between "ships features monthly"
and "ships features quarterly."

THE GOAL IS NOT MAXIMUM RELIABILITY.
THE GOAL IS APPROPRIATE RELIABILITY.

3.3 SLO Documentation

SLO DOCUMENT TEMPLATE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  SERVICE: Notification Platform API                                    │
│  OWNER: Platform Team                                                  │
│  LAST UPDATED: 2024-01-15                                              │
│  REVIEW CADENCE: Quarterly                                             │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  SLO 1: AVAILABILITY                                                   │
│  ────────────────────                                                  │
│  Objective: 99.9% of API requests return non-5xx responses             │
│  Window: Rolling 30 days                                               │
│  Measurement: Load balancer logs, 5-minute aggregation                 │
│                                                                        │
│  Error budget: 0.1% × 30 days = 43.2 minutes of total downtime         │
│                                                                        │
│  Rationale: Users expect high availability for notifications.          │
│  Historical performance shows we achieve 99.95% typically.             │
│  99.9% gives buffer for incidents while meeting user expectations.     │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  SLO 2: LATENCY                                                        │
│  ──────────────                                                        │
│  Objective: 99% of API requests complete within 200ms (p99 < 200ms)    │
│  Window: Rolling 30 days                                               │
│  Measurement: Application metrics, 1-minute aggregation                │
│                                                                        │
│  Rationale: API is called from user-facing applications.               │
│  User research shows satisfaction drops sharply above 200ms.           │
│  99% (not 99.9%) because some slow requests are acceptable.            │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  SLO 3: DELIVERY                                                       │
│  ────────────────                                                      │
│  Objective: 95% of notifications delivered within 30 seconds           │
│  Window: Rolling 30 days                                               │
│  Measurement: End-to-end tracking from accept to delivery callback     │
│                                                                        │
│  Rationale: Notifications are often time-sensitive.                    │
│  95% accounts for third-party provider variability.                    │
│  30 seconds is acceptable for most notification use cases.             │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  DEPENDENCIES:                                                         │
│  ├── Email provider SLA: 99.5% availability                            │
│  ├── SMS provider SLA: 99.9% availability                              │
│  ├── Push notification service: 99.95% availability                    │
│  └── Our SLO cannot exceed weakest dependency                          │
│                                                                        │
│  EXCLUDED FROM SLOs:                                                   │
│  ├── Planned maintenance windows (announced 72 hours ahead)            │
│  ├── Customer-caused issues (invalid API keys, malformed requests)     │
│  └── Force majeure events                                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
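
Teams often keep this document alongside a machine-readable version that monitoring code can consume. A minimal sketch of such a config, with field names chosen to line up with the calculator and dashboard code later in this series (the exact schema and query strings are assumptions):

# slo/config.py (illustrative schema for machine-readable SLO definitions)

from dataclasses import dataclass, field
from typing import List

@dataclass
class SLODefinition:
    name: str
    description: str
    sli_query: str          # metrics query returning the SLI as a percentage
    target: float           # e.g. 99.9
    window_days: int = 30

@dataclass
class SLOConfig:
    service: str
    owner: str
    slos: List[SLODefinition] = field(default_factory=list)

    def get_slo(self, name: str) -> SLODefinition:
        return next(slo for slo in self.slos if slo.name == name)

NOTIFICATION_API_SLOS = SLOConfig(
    service="notification-api",
    owner="Platform Team",
    slos=[
        SLODefinition("availability", "Non-5xx responses",
                      'sli:availability{service="notification-api"}', 99.9),
        SLODefinition("latency", "Requests under 200ms",
                      'sli:latency_under_200ms{service="notification-api"}', 99.0),
        SLODefinition("delivery", "Delivered within 30 seconds",
                      'sli:delivered_within_30s{service="notification-api"}', 95.0),
    ],
)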

Part II: Implementation

Chapter 4: Error Budgets

4.1 What Is an Error Budget?

The error budget is the most powerful concept in SRE:

ERROR BUDGET EXPLAINED

If your SLO is 99.9% availability over 30 days:
├── You're promising to be "good" 99.9% of the time
├── That means you're ALLOWED to be "bad" 0.1% of the time
└── 0.1% of 30 days = 43.2 minutes

This 43.2 minutes is your ERROR BUDGET.

┌───────────────────────────────────────────────────────────────────────┐
│                                                                       │
│  MONTH START                                              MONTH END   │
│  │                                                             │      │
│  │  Error Budget: 43.2 minutes                                 │      │
│  │  ════════════════════════════════════════════════════════   │      │
│  │                                                             │      │
│  │  Day 5: 10-minute outage                                    │      │
│  │  ══════════════════════════════════════════                 │      │
│  │  Remaining: 33.2 minutes                                    │      │
│  │                                                             │      │
│  │  Day 12: Slow deployment, 15 minutes degraded               │      │
│  │  ════════════════════════════                               │      │
│  │  Remaining: 18.2 minutes                                    │      │
│  │                                                             │      │
│  │  Day 20: Third-party outage, 20 minutes                     │      │
│  │  WARNING: Only 0 minutes remaining!                         │      │
│  │  ════                                                       │      │
│  │  Remaining: -1.8 minutes (BUDGET EXCEEDED)                  │      │
│  │                                                             │      │
│  │  Action: Feature freeze until budget recovers               │      │
│  │                                                             │      │
└───────────────────────────────────────────────────────────────────────┘

4.2 Error Budget Policies

What happens when the budget runs out?

# error_budget/policy.py

"""
Error budget policy implementation.

When error budget is depleted, we shift focus from features to reliability.
"""

from dataclasses import dataclass
from enum import Enum
from datetime import datetime, timedelta
from typing import List, Optional
import logging

logger = logging.getLogger(__name__)


class BudgetStatus(Enum):
    """Error budget health status."""
    HEALTHY = "healthy"          # > 50% budget remaining
    WARNING = "warning"          # 20-50% remaining
    CRITICAL = "critical"        # 0-20% remaining
    EXHAUSTED = "exhausted"      # Budget depleted


@dataclass
class ErrorBudget:
    """Error budget state for an SLO."""
    slo_name: str
    window_days: int
    target_percentage: float  # e.g., 99.9
    
    # Current state
    current_percentage: float
    budget_remaining_minutes: float
    budget_consumed_minutes: float
    
    # Burn rate
    burn_rate_1h: float   # How fast we're consuming (1 = normal)
    burn_rate_24h: float  # Smoothed over 24 hours
    
    @property
    def total_budget_minutes(self) -> float:
        """Total error budget in minutes."""
        error_percentage = 100 - self.target_percentage
        return (error_percentage / 100) * self.window_days * 24 * 60
    
    @property
    def budget_remaining_percentage(self) -> float:
        """Percentage of budget remaining."""
        return (self.budget_remaining_minutes / self.total_budget_minutes) * 100
    
    @property
    def status(self) -> BudgetStatus:
        """Current budget status."""
        remaining_pct = self.budget_remaining_percentage
        
        if remaining_pct <= 0:
            return BudgetStatus.EXHAUSTED
        elif remaining_pct <= 20:
            return BudgetStatus.CRITICAL
        elif remaining_pct <= 50:
            return BudgetStatus.WARNING
        else:
            return BudgetStatus.HEALTHY
    
    @property
    def projected_exhaustion(self) -> Optional[datetime]:
        """When the budget will be exhausted at the current burn rate."""
        if self.burn_rate_24h <= 1.0:
            return None  # Not on track to exhaust within the window
        
        # At burn rate B we consume B × (total budget / window length)
        # budget-minutes per wall-clock minute of elapsed time.
        window_minutes = self.window_days * 24 * 60
        consumed_per_minute = self.burn_rate_24h * (self.total_budget_minutes / window_minutes)
        minutes_until_exhausted = self.budget_remaining_minutes / consumed_per_minute
        return datetime.utcnow() + timedelta(minutes=minutes_until_exhausted)


class ErrorBudgetPolicy:
    """
    Defines actions based on error budget status.
    
    Philosophy:
    - Budget healthy: Ship features freely
    - Budget warning: Increase caution
    - Budget critical: Focus on reliability
    - Budget exhausted: Feature freeze
    """
    
    def __init__(self, config):
        self.config = config
    
    def get_allowed_actions(self, budget: ErrorBudget) -> dict:
        """
        Get allowed actions based on budget status.
        """
        status = budget.status
        
        if status == BudgetStatus.HEALTHY:
            return {
                "feature_development": True,
                "risky_deployments": True,
                "experiments": True,
                "on_call_escalation": "normal",
                "deployment_approval": "team_lead",
                "rollback_threshold": "5xx_rate > 5%",
                "message": "Budget healthy. Ship features!"
            }
        
        elif status == BudgetStatus.WARNING:
            return {
                "feature_development": True,
                "risky_deployments": False,  # No risky changes
                "experiments": True,
                "on_call_escalation": "normal",
                "deployment_approval": "team_lead",
                "rollback_threshold": "5xx_rate > 2%",  # Lower threshold
                "message": "Budget warning. Exercise caution with deployments."
            }
        
        elif status == BudgetStatus.CRITICAL:
            return {
                "feature_development": False,  # Pause features
                "risky_deployments": False,
                "experiments": False,
                "on_call_escalation": "heightened",  # More eyes on alerts
                "deployment_approval": "engineering_manager",
                "rollback_threshold": "5xx_rate > 1%",  # Very sensitive
                "message": "Budget critical. Focus on reliability only."
            }
        
        else:  # EXHAUSTED
            return {
                "feature_development": False,
                "risky_deployments": False,
                "experiments": False,
                "on_call_escalation": "incident_mode",
                "deployment_approval": "director",
                "rollback_threshold": "any_degradation",
                "message": "Budget exhausted. Full reliability focus until recovered."
            }
    
    def should_alert(self, budget: ErrorBudget) -> List[dict]:
        """
        Determine what alerts should fire based on budget state.
        """
        alerts = []
        
        # Budget status alerts
        if budget.status == BudgetStatus.EXHAUSTED:
            alerts.append({
                "severity": "critical",
                "title": f"Error budget exhausted: {budget.slo_name}",
                "message": f"SLO {budget.slo_name} has exhausted its error budget. "
                          f"Feature development should halt.",
                "route": "engineering_leadership"
            })
        
        elif budget.status == BudgetStatus.CRITICAL:
            alerts.append({
                "severity": "warning",
                "title": f"Error budget critical: {budget.slo_name}",
                "message": f"Only {budget.budget_remaining_percentage:.1f}% budget remaining.",
                "route": "team_lead"
            })
        
        # Burn rate alerts (predicting future problems)
        if budget.burn_rate_1h > 14.4:  # ~2% of a 30-day budget consumed per hour
            alerts.append({
                "severity": "critical",
                "title": f"High burn rate: {budget.slo_name}",
                "message": f"Consuming error budget at {budget.burn_rate_1h:.1f}x the "
                          f"sustainable rate (~2% of the monthly budget per hour).",
                "route": "on_call"
            })
        
        elif budget.burn_rate_1h > 6:  # ~5% of the budget consumed over 6 hours
            alerts.append({
                "severity": "warning",
                "title": f"Elevated burn rate: {budget.slo_name}",
                "message": f"Burn rate: {budget.burn_rate_1h:.1f}x normal.",
                "route": "on_call"
            })
        
        return alerts


class ErrorBudgetCalculator:
    """
    Calculates error budget from SLI measurements.
    """
    
    def __init__(self, metrics_client):
        self.metrics = metrics_client
    
    async def calculate_budget(
        self,
        slo_name: str,
        sli_query: str,
        target: float,
        window_days: int = 30
    ) -> ErrorBudget:
        """
        Calculate current error budget state.
        """
        # Get SLI value over window
        sli_value = await self.metrics.query(
            sli_query,
            window=f"{window_days}d"
        )
        
        # Calculate budget
        target_error_rate = 100 - target
        actual_error_rate = 100 - sli_value
        
        total_budget_minutes = (target_error_rate / 100) * window_days * 24 * 60
        consumed_minutes = (actual_error_rate / 100) * window_days * 24 * 60
        remaining_minutes = total_budget_minutes - consumed_minutes
        
        # Calculate burn rates
        sli_1h = await self.metrics.query(sli_query, window="1h")
        sli_24h = await self.metrics.query(sli_query, window="24h")
        
        # Burn rate = observed error rate / error rate allowed by the target
        actual_error_rate_1h = 100 - sli_1h
        burn_rate_1h = actual_error_rate_1h / target_error_rate if target_error_rate > 0 else 0
        
        actual_error_rate_24h = 100 - sli_24h
        burn_rate_24h = actual_error_rate_24h / target_error_rate if target_error_rate > 0 else 0
        
        return ErrorBudget(
            slo_name=slo_name,
            window_days=window_days,
            target_percentage=target,
            current_percentage=sli_value,
            budget_remaining_minutes=max(0, remaining_minutes),
            budget_consumed_minutes=consumed_minutes,
            burn_rate_1h=burn_rate_1h,
            burn_rate_24h=burn_rate_24h
        )
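
Wiring the calculator and policy together might look like the sketch below. The import path follows the file name used above; the metrics client is whatever async client your stack provides (it only needs an awaitable query(query, window=...) method), and the SLI query string is an assumption.

# error_budget/example.py (illustrative wiring; metrics client is a stand-in)

import asyncio

from error_budget.policy import ErrorBudgetCalculator, ErrorBudgetPolicy


async def report_budget(metrics) -> None:
    """metrics: any client exposing `await metrics.query(query, window=...)`."""
    calculator = ErrorBudgetCalculator(metrics)
    policy = ErrorBudgetPolicy(config={})

    budget = await calculator.calculate_budget(
        slo_name="notification-api-availability",
        sli_query='sli:availability{service="notification-api"}',  # assumed query
        target=99.9,
        window_days=30,
    )

    print(budget.status.value, f"{budget.budget_remaining_percentage:.1f}% budget remaining")
    print(policy.get_allowed_actions(budget)["message"])
    for alert in policy.should_alert(budget):
        print(f"[{alert['severity']}] {alert['title']}")

# asyncio.run(report_budget(metrics_client))  # requires a real metrics client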

4.3 Burn Rate Alerting

Traditional alerting (alert when error rate > X%) is noisy. Burn rate alerting is smarter:

BURN RATE ALERTING

Traditional alert:
  "Alert if error rate > 0.1%"
  
  Problem: 
  - Fires on every tiny spike
  - Doesn't consider budget
  - No urgency context

Burn rate alert:
  "Alert if consuming budget 14x faster than normal for 1 hour"
  
  Better because:
  - Considers SLO context
  - Predicts future impact
  - Different urgency levels

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  BURN RATE ALERT THRESHOLDS (Google's recommendation)                  │
│                                                                        │
│  Burn Rate │ Window  │ Budget Consumed │ Action                        │
│  ──────────┼─────────┼─────────────────┼─────────────────────────────  │
│  14.4x     │ 1 hour  │ 2% in 1 hour    │ PAGE immediately              │
│  6x        │ 6 hours │ 5% in 6 hours   │ PAGE during business hours    │
│  3x        │ 1 day   │ 10% in 1 day    │ Create ticket                 │
│  1x        │ 3 days  │ 10% in 3 days   │ Review in weekly meeting      │
│                                                                        │
│  Why these numbers?                                                    │
│  - 14.4x burn rate = 2% of a 30-day budget gone in 1 hour              │
│  - 6x burn rate = 5% of the budget gone in 6 hours                     │
│  - You want to catch problems before they become SLO violations        │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

# error_budget/alerting.py

"""
Multi-window, multi-burn-rate alerting.
"""

from dataclasses import dataclass
from typing import List


@dataclass
class BurnRateAlert:
    """A burn rate alert configuration."""
    name: str
    burn_rate: float
    short_window: str
    long_window: str
    severity: str
    action: str


class BurnRateAlerting:
    """
    Implements multi-window burn rate alerting.
    """
    
    # Alert configurations
    ALERTS = [
        BurnRateAlert(
            name="budget_burn_critical",
            burn_rate=14.4,
            short_window="5m",
            long_window="1h",
            severity="critical",
            action="page"
        ),
        BurnRateAlert(
            name="budget_burn_high",
            burn_rate=6.0,
            short_window="30m",
            long_window="6h",
            severity="warning",
            action="page_business_hours"
        ),
        BurnRateAlert(
            name="budget_burn_medium",
            burn_rate=3.0,
            short_window="2h",
            long_window="1d",
            severity="info",
            action="ticket"
        ),
        BurnRateAlert(
            name="budget_burn_low",
            burn_rate=1.0,
            short_window="6h",
            long_window="3d",
            severity="info",
            action="review"
        ),
    ]
    
    def generate_prometheus_rules(
        self,
        slo_name: str,
        error_ratio_metric: str,
        target: float
    ) -> str:
        """
        Generate Prometheus alerting rules for an SLO.
        """
        error_budget_ratio = 1 - (target / 100)  # e.g., 0.001 for 99.9%
        
        rules = []
        
        for alert in self.ALERTS:
            # Multi-window condition: both windows must exceed the burn-rate
            # threshold. error_ratio_metric is assumed to be an instantaneous
            # error-ratio gauge (errors / requests), so avg_over_time() gives
            # the error ratio over each window.
            rule = f"""
- alert: {slo_name}_{alert.name}
  expr: |
    (
      avg_over_time({error_ratio_metric}[{alert.short_window}]) > {error_budget_ratio * alert.burn_rate}
      and
      avg_over_time({error_ratio_metric}[{alert.long_window}]) > {error_budget_ratio * alert.burn_rate}
    )
  for: 2m
  labels:
    severity: {alert.severity}
    slo: {slo_name}
  annotations:
    summary: "SLO {slo_name} burning error budget at {alert.burn_rate}x rate"
    description: "Error budget burn rate is {alert.burn_rate}x normal over {alert.long_window}"
    runbook: "https://wiki/runbooks/{slo_name}"
    action: "{alert.action}"
"""
            rules.append(rule)
        
        return "\n".join(rules)
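
As a usage sketch, assuming an error-ratio recording rule named slo:error_ratio exists (that name is an assumption):

# Illustrative usage; the recording-rule metric name is an assumption
alerting = BurnRateAlerting()
print(alerting.generate_prometheus_rules(
    slo_name="notification_api_availability",
    error_ratio_metric='slo:error_ratio{service="notification-api"}',
    target=99.9,
))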

Chapter 5: SLO-Based Decision Making

5.1 Using SLOs for Prioritization

SLOs DRIVE DECISIONS

SCENARIO 1: Feature vs Reliability
─────────────────────────────────
Product: "We need to ship feature X this sprint"
SRE: "Error budget is at 30%, one more incident and we're in freeze"

Decision framework:
├── Budget > 50%: Ship the feature
├── Budget 20-50%: Ship if low-risk, otherwise defer
├── Budget < 20%: Defer feature, focus on reliability
└── Budget exhausted: Feature freeze until recovered

SCENARIO 2: Technical Debt
─────────────────────────────────
Engineer: "We should refactor the authentication service"
Manager: "What's the business case?"

SLO answer:
├── "Auth service latency SLI is 95.2%, target is 99%"
├── "We're consistently missing SLO"
├── "Refactor should improve by ~3% based on profiling"
└── "This directly impacts our SLO compliance"

SCENARIO 3: On-Call Load
─────────────────────────────────
Team: "We're getting paged too much"
Manager: "Are these valid alerts?"

Analysis:
├── Calculate: What % of alerts were SLO-impacting?
├── If < 50% alerts were SLO-impacting: Alerts are noisy, tune them
├── If > 50%: Real reliability problem, invest in fixes
└── If 100% but meeting SLO: Good alerts, consider loosening SLO
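
The Scenario 1 framework above is simple enough to encode directly, which makes the policy auditable instead of a judgment call in a meeting. A minimal sketch, with thresholds mirroring the list above:

def can_ship_feature(budget_remaining_pct: float, low_risk: bool) -> bool:
    """Scenario 1 decision framework: gate feature launches on error budget."""
    if budget_remaining_pct > 50:
        return True           # Budget healthy: ship the feature
    if budget_remaining_pct > 20:
        return low_risk       # Ship only if low-risk, otherwise defer
    return False              # Critical or exhausted: reliability work first

assert can_ship_feature(62.0, low_risk=False) is True
assert can_ship_feature(30.0, low_risk=False) is False
assert can_ship_feature(30.0, low_risk=True) is True
assert can_ship_feature(10.0, low_risk=True) is False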

5.2 Communicating SLOs to Stakeholders

TRANSLATING SLOs FOR DIFFERENT AUDIENCES

FOR ENGINEERS:
─────────────
"Our p99 latency SLI is currently 185ms against a 200ms target.
Error budget is at 62%. Burn rate is 0.8x."

FOR PRODUCT MANAGERS:
────────────────────
"We're meeting our reliability targets. We have room to take
some risks with the new feature launch. If something goes wrong,
we have about 2 days of buffer before customers would notice."

FOR EXECUTIVES:
──────────────
"System reliability is GREEN. We're on track to meet our
99.9% availability commitment to customers. No risk to
the enterprise SLA."

FOR CUSTOMERS:
─────────────
"Current system status: Operational.
This month's availability: 99.94%.
Committed availability: 99.5%.
All systems performing within normal parameters."

Part III: Real-World Application

Chapter 6: Case Studies

6.1 How Google Uses SLOs

GOOGLE'S SLO PRACTICE

Scale:
├── Thousands of services
├── Each service has SLOs
├── Error budgets tracked automatically
└── Policies enforced by tooling

Key practices:

1. USER-CENTRIC SLIs
   ├── SLIs measured from user perspective
   ├── Not server metrics, but client experience
   └── "What does the user see?" not "What does the server do?"

2. SLO CULTURE
   ├── Every service owner defines SLOs
   ├── SLOs reviewed quarterly
   ├── Missing SLO = reliability investment
   └── Exceeding SLO = maybe too conservative

3. ERROR BUDGET POLICIES
   ├── Written, agreed-upon policies
   ├── "If budget exhausted, then X happens"
   ├── Not just guidelines — enforced
   └── Development velocity tied to reliability

4. GRADUAL ROLLOUTS
   ├── Canary deployments standard
   ├── Automated rollback if SLI drops
   ├── Budget consumed = slower rollouts
   └── Reliability gates on every deploy

Lesson: SLOs only work if you actually use them for decisions.

6.2 How Netflix Uses SLOs

NETFLIX'S APPROACH

Context:
├── Streaming service with millions of concurrent users
├── User experience is everything
└── "If it buffers, we lose them"

Key SLIs:

1. AVAILABILITY
   └── % of play requests that succeed

2. STARTUP TIME
   ├── Time from pressing play to first frame
   └── Target: < 3 seconds for most users

3. REBUFFER RATE
   ├── % of viewing time spent buffering
   └── Target: < 0.1% rebuffer ratio

4. VIDEO QUALITY
   ├── % of time delivering HD+ quality
   └── Target varies by device/network

Error budget usage:

├── HIGH BUDGET: Roll out new encoding algorithms
├── MEDIUM BUDGET: Standard deployments
├── LOW BUDGET: Only critical fixes
└── EXHAUSTED: All-hands reliability focus

Regional SLOs:
├── Different targets for different regions
├── Emerging markets: Lower targets (network variability)
├── Developed markets: Higher targets (user expectations)
└── Allows appropriate investment per region

Lesson: SLOs should match user expectations, which vary by context.

Chapter 7: Common Mistakes

7.1 SLO Anti-Patterns

SLO ANTI-PATTERNS

❌ MISTAKE 1: Too Many SLOs

Wrong:
  - API availability SLO
  - API latency p50 SLO
  - API latency p95 SLO
  - API latency p99 SLO
  - API latency p99.9 SLO
  - Database availability SLO
  - Database latency SLO
  - Cache hit rate SLO
  - ... 50 more SLOs

Problem:
  - Can't focus on what matters
  - Conflicting signals
  - Alert fatigue

Right:
  - 3-5 SLOs per service
  - Focus on user-facing impact
  - Internal metrics are monitoring, not SLOs


❌ MISTAKE 2: SLOs That Don't Reflect Users

Wrong:
  SLO: "Server CPU utilization < 70%"

Problem:
  - CPU could be at 50% while users experience errors
  - Not measuring what users experience

Right:
  SLO: "99.9% of user requests succeed"
  Supporting metric (not SLO): "CPU utilization"


❌ MISTAKE 3: SLO = Current Performance

Wrong:
  "Our current p99 latency is 150ms, so our SLO should be 150ms"

Problem:
  - No room for normal variation
  - Constant SLO violations
  - Alert fatigue

Right:
  "Our current p99 is 150ms, so our SLO should be 200ms"
  - Gives headroom for normal variation
  - Alerts only on meaningful degradation


❌ MISTAKE 4: No Error Budget Policy

Wrong:
  "We have SLOs, but when we miss them, nothing happens"

Problem:
  - SLOs become meaningless
  - No behavior change
  - Reliability never improves

Right:
  Written policy:
  "When error budget < 20%, feature development pauses
   until reliability work brings budget above 50%"


❌ MISTAKE 5: SLOs Without Data

Wrong:
  "Our SLO is 99.99% availability"
  "How do you measure that?"
  "We don't have monitoring for that yet"

Problem:
  - Can't know if you're meeting SLO
  - Can't calculate error budget
  - SLO is fiction

Right:
  First: Set up measurement
  Then: Observe current performance
  Then: Set SLO based on data

Part IV: Interview Preparation

Chapter 8: Interview Tips

8.1 SLO Discussion Framework

DISCUSSING SLOs IN INTERVIEWS

When asked "How would you ensure this system is reliable?":

1. START WITH USER IMPACT
   "First, I'd identify what reliability means to users.
   For a payment system, that's successful transactions
   and acceptable latency. For a notification system,
   it's delivery rate and timeliness."

2. DEFINE SPECIFIC SLIs
   "I'd measure:
   - Availability: Successful requests / total requests
   - Latency: p99 response time
   - Correctness: For payments, transactions that complete correctly"

3. SET TARGETS WITH RATIONALE
   "For a payment system, I'd target 99.99% correctness —
   incorrect charges are unacceptable. For availability,
   99.9% might be appropriate, giving us about 8 hours
   of error budget per year."

4. EXPLAIN ERROR BUDGETS
   "The error budget lets us balance reliability and velocity.
   With 99.9% availability SLO, we have 43 minutes per month
   of downtime budget. If we're burning it fast, we slow down
   on features and focus on reliability."

5. CONNECT TO ALERTING
   "I'd alert on burn rate rather than raw errors.
   If we're consuming budget 10x faster than normal,
   that's worth investigating even if we haven't violated
   the SLO yet."

8.2 Key Phrases

SLO KEY PHRASES

On Defining Reliability:
"I define SLIs based on user experience, not server metrics.
What the user sees matters more than server CPU. For an API,
that's success rate and latency measured at the load balancer."

On Setting Targets:
"I set SLOs tighter than what I'd put in an SLA.
If my SLA promises 99.5%, my SLO is 99.9%.
This gives us buffer to catch and fix issues before
customers are impacted enough to trigger SLA penalties."

On Error Budgets:
"Error budgets turn reliability into a feature trade-off.
When budget is healthy, we ship fast. When it's depleted,
we focus on reliability. This prevents both over-engineering
reliability and neglecting it."

On Alerting:
"I use burn rate alerting instead of threshold alerting.
A 1% error rate for 1 minute might be noise.
But consuming a week's error budget in an hour? That's real.
Multi-window burn rate catches real problems, not noise."

Chapter 9: Practice Problems

Problem 1: E-Commerce Checkout

Scenario: You're defining SLOs for an e-commerce checkout service.

Questions:

  1. What SLIs would you measure?
  2. What targets would you set?
  3. How would you handle dependency failures (payment processor)?

Hints:
  • SLIs: Checkout success rate, latency, payment accuracy
  • Consider: Different targets for browsing vs checkout (checkout is more critical)
  • Dependency: Your SLO can't exceed the payment processor's SLA
  • Error budget: Very small for payment correctness

Problem 2: Real-Time Chat Application

Scenario: You're defining SLOs for a Slack-like chat application.

Questions:

  1. What makes "reliability" different for real-time chat?
  2. How would you handle message delivery guarantees?
  3. What's an appropriate latency target?

Hints:
  • Real-time = latency matters more than for typical APIs
  • Message delivery: Different SLOs for "accepted" vs "delivered"
  • Latency: Users notice > 200ms in chat
  • Consider: Read vs write latency (writes are more critical)

Chapter 10: Sample Interview Dialogue

Interviewer: "You've designed a notification system. How do you know if it's reliable?"

You: "Great question. Reliability for a notification system has several dimensions. Let me define specific SLOs.

First, the SLIs I'd measure:"

1. Availability: % of send requests that succeed (non-5xx)
2. Acceptance latency: Time to accept and queue a notification
3. Delivery latency: Time from acceptance to actual delivery
4. Delivery rate: % of accepted notifications actually delivered

Interviewer: "How do you set the targets?"

You: "I'd base targets on user expectations and business needs:

For availability, I'd target 99.9%. Notifications are important but not mission-critical like payments. 99.9% gives us about 43 minutes of downtime budget per month.

For acceptance latency, I'd target p99 < 200ms. The API should feel instant.

For delivery latency, I'd target 95% delivered within 30 seconds. This accounts for third-party provider variability while meeting user expectations for timeliness.

For delivery rate, I'd target 99.5%. Some delivery failures are expected (invalid tokens, users who have blocked notifications), but we should successfully deliver almost everything."

Interviewer: "What happens if you miss the SLO?"

You: "That's where error budgets come in. Let me explain our policy:

If we're above 50% budget, we operate normally — ship features, run experiments.

At 20-50% budget, we increase caution — no risky deployments, lower rollback thresholds.

At below 20%, we enter reliability focus — pause feature work, all engineering on reliability.

At budget exhausted, we're in freeze mode — only critical fixes, all hands on improving reliability.

This creates a self-balancing system. When we're reliable, we ship fast. When we're not, we automatically shift focus."

Interviewer: "How do you alert on this?"

You: "I use burn rate alerting instead of raw thresholds.

A 0.5% error rate for 5 minutes might be noise from a network blip. But if I'm consuming my monthly error budget 14x faster than normal for an hour, that's a real problem.

I'd set up multi-window alerts:

  • 14x burn rate for 1 hour → page immediately
  • 6x burn rate for 6 hours → page during business hours
  • 3x burn rate for 1 day → create a ticket

This catches real issues while avoiding alert fatigue."


Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 1 KEY TAKEAWAYS                                 │
│                                                                        │
│  THE LANGUAGE OF RELIABILITY:                                          │
│  ├── SLI: What we measure (the metric)                                 │
│  ├── SLO: What we target (internal goal)                               │
│  ├── SLA: What we promise (external contract)                          │
│  └── Error Budget: How much failure we allow                           │
│                                                                        │
│  GOOD SLIs:                                                            │
│  ├── Measure user experience, not server health                        │
│  ├── Four golden signals: Latency, Traffic, Errors, Saturation         │
│  ├── Precisely specified (good events / valid events)                  │
│  └── Measured from user's perspective (load balancer, not server)      │
│                                                                        │
│  SETTING SLOs:                                                         │
│  ├── Based on user expectations, not current performance               │
│  ├── 3-5 SLOs per service (not 50)                                     │
│  ├── Leave headroom (don't set at current performance)                 │
│  └── Review and adjust quarterly                                       │
│                                                                        │
│  ERROR BUDGETS:                                                        │
│  ├── Balance reliability and velocity                                  │
│  ├── Written policies for budget states                                │
│  ├── Budget exhausted = feature freeze                                 │
│  └── Burn rate alerting > threshold alerting                           │
│                                                                        │
│  KEY INSIGHT:                                                          │
│  100% reliability is impossible, infinitely expensive, and harmful.    │
│  The goal is APPROPRIATE reliability — reliable enough for users,      │
│  with enough room to ship features and innovate.                       │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Chapter 11: Implementing SLO Dashboards

11.1 The SLO Dashboard

Every team needs visibility into their SLO status:

# slo/dashboard.py

"""
SLO Dashboard data provider.

Provides all the data needed for an SLO status dashboard.
"""

from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime


@dataclass
class SLOStatus:
    """Current status of an SLO."""
    name: str
    description: str
    target: float
    current_value: float
    window_days: int
    
    # Budget state
    budget_total_minutes: float
    budget_consumed_minutes: float
    budget_remaining_minutes: float
    budget_remaining_percentage: float
    
    # Burn rate
    burn_rate_1h: float
    burn_rate_6h: float
    burn_rate_24h: float
    
    # Trend
    trend_direction: str  # "improving", "stable", "degrading"
    projected_end_of_window: float  # Projected SLI at window end
    
    # Status
    status: str  # "healthy", "warning", "critical", "violated"
    status_changed_at: Optional[datetime]


class SLODashboardService:
    """
    Provides data for SLO dashboards.
    """
    
    def __init__(self, metrics_client, slo_config):
        self.metrics = metrics_client
        self.config = slo_config
    
    async def get_all_slo_status(self) -> List[SLOStatus]:
        """Get status of all configured SLOs."""
        statuses = []
        
        for slo in self.config.slos:
            status = await self.get_slo_status(slo)
            statuses.append(status)
        
        return statuses
    
    async def get_slo_status(self, slo_config) -> SLOStatus:
        """Get detailed status for a single SLO."""
        
        # Query current SLI value
        current_value = await self.metrics.query(
            slo_config.sli_query,
            window=f"{slo_config.window_days}d"
        )
        
        # Calculate error budget
        target = slo_config.target
        error_rate_allowed = 100 - target
        error_rate_actual = 100 - current_value
        
        budget_total = (error_rate_allowed / 100) * slo_config.window_days * 24 * 60
        budget_consumed = (error_rate_actual / 100) * slo_config.window_days * 24 * 60
        budget_remaining = max(0, budget_total - budget_consumed)
        budget_remaining_pct = (budget_remaining / budget_total) * 100 if budget_total > 0 else 0
        
        # Calculate burn rates at different windows
        burn_rate_1h = await self._calculate_burn_rate(slo_config, "1h")
        burn_rate_6h = await self._calculate_burn_rate(slo_config, "6h")
        burn_rate_24h = await self._calculate_burn_rate(slo_config, "24h")
        
        # Determine trend
        trend = self._calculate_trend(burn_rate_1h, burn_rate_6h, burn_rate_24h)
        
        # Project end-of-window value
        projected = self._project_eow_value(current_value, burn_rate_24h, slo_config)
        
        # Determine status
        status = self._determine_status(budget_remaining_pct, burn_rate_1h)
        
        return SLOStatus(
            name=slo_config.name,
            description=slo_config.description,
            target=target,
            current_value=current_value,
            window_days=slo_config.window_days,
            budget_total_minutes=budget_total,
            budget_consumed_minutes=budget_consumed,
            budget_remaining_minutes=budget_remaining,
            budget_remaining_percentage=budget_remaining_pct,
            burn_rate_1h=burn_rate_1h,
            burn_rate_6h=burn_rate_6h,
            burn_rate_24h=burn_rate_24h,
            trend_direction=trend,
            projected_end_of_window=projected,
            status=status,
            status_changed_at=None  # Would track in database
        )
    
    async def _calculate_burn_rate(self, slo_config, window: str) -> float:
        """Calculate burn rate for a time window."""
        value = await self.metrics.query(slo_config.sli_query, window=window)
        
        error_rate_allowed = 100 - slo_config.target
        error_rate_actual = 100 - value
        
        if error_rate_allowed == 0:
            return 0
        
        return error_rate_actual / error_rate_allowed
    
    def _calculate_trend(
        self, 
        burn_1h: float, 
        burn_6h: float, 
        burn_24h: float
    ) -> str:
        """Determine trend from burn rates."""
        # If recent burn rate is lower than longer-term, improving
        if burn_1h < burn_6h < burn_24h:
            return "improving"
        elif burn_1h > burn_6h > burn_24h:
            return "degrading"
        else:
            return "stable"
    
    def _project_eow_value(
        self,
        current: float,
        burn_rate_24h: float,
        slo_config
    ) -> float:
        """Project SLI value at end of window."""
        # Simplified projection based on current burn rate
        days_remaining = slo_config.window_days / 2  # Rough middle of window
        
        if burn_rate_24h <= 1:
            return current  # Stable or improving
        
        # Project degradation
        daily_degradation = (100 - current) * (burn_rate_24h - 1) / slo_config.window_days
        projected = current - (daily_degradation * days_remaining)
        
        return max(0, projected)
    
    def _determine_status(
        self,
        budget_remaining_pct: float,
        burn_rate_1h: float
    ) -> str:
        """Determine SLO status."""
        # Violated if budget exhausted
        if budget_remaining_pct <= 0:
            return "violated"
        
        # Critical if budget very low OR burning very fast
        if budget_remaining_pct < 20 or burn_rate_1h > 10:
            return "critical"
        
        # Warning if budget getting low OR elevated burn rate
        if budget_remaining_pct < 50 or burn_rate_1h > 5:
            return "warning"
        
        return "healthy"
    
    async def get_slo_history(
        self,
        slo_name: str,
        days: int = 30
    ) -> List[Dict]:
        """Get historical SLO performance."""
        # Query historical data points
        data_points = await self.metrics.query_range(
            self.config.get_slo(slo_name).sli_query,
            start=f"-{days}d",
            end="now",
            step="1h"
        )
        
        return [
            {
                "timestamp": dp.timestamp,
                "value": dp.value,
                "target": self.config.get_slo(slo_name).target
            }
            for dp in data_points
        ]
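
As with the error budget calculator, the metrics client and SLO config here are stand-ins for whatever your stack provides. A small usage sketch:

# Illustrative usage of the dashboard provider
import asyncio

async def print_slo_summary(dashboard: SLODashboardService) -> None:
    """Print a one-line summary per SLO, e.g. for a CLI or chat-ops bot."""
    for s in await dashboard.get_all_slo_status():
        print(
            f"{s.name}: {s.current_value:.3f}% (target {s.target}%), "
            f"status={s.status}, budget left={s.budget_remaining_percentage:.0f}%, "
            f"trend={s.trend_direction}"
        )

# dashboard = SLODashboardService(metrics_client, slo_config)
# asyncio.run(print_slo_summary(dashboard))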

11.2 Grafana Dashboard Definition

{
  "dashboard": {
    "title": "Service SLO Dashboard",
    "panels": [
      {
        "title": "SLO Status Overview",
        "type": "stat",
        "gridPos": {"x": 0, "y": 0, "w": 24, "h": 4},
        "targets": [
          {
            "expr": "slo:current_value{service=\"api\"}",
            "legendFormat": "{{slo_name}}"
          }
        ],
        "options": {
          "colorMode": "background",
          "graphMode": "none",
          "justifyMode": "center",
          "textMode": "value_and_name"
        },
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "red", "value": 99},
                {"color": "yellow", "value": 99.5},
                {"color": "green", "value": 99.9}
              ]
            },
            "unit": "percent"
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": {"x": 0, "y": 4, "w": 8, "h": 6},
        "targets": [
          {
            "expr": "slo:error_budget_remaining_ratio{service=\"api\"} * 100"
          }
        ],
        "options": {
          "showThresholdLabels": true,
          "showThresholdMarkers": true
        },
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "orange", "value": 20},
                {"color": "yellow", "value": 50},
                {"color": "green", "value": 75}
              ]
            },
            "unit": "percent"
          }
        }
      },
      {
        "title": "Burn Rate (1h, 6h, 24h)",
        "type": "timeseries",
        "gridPos": {"x": 8, "y": 4, "w": 16, "h": 6},
        "targets": [
          {
            "expr": "slo:burn_rate_1h{service=\"api\"}",
            "legendFormat": "1h burn rate"
          },
          {
            "expr": "slo:burn_rate_6h{service=\"api\"}",
            "legendFormat": "6h burn rate"
          },
          {
            "expr": "slo:burn_rate_24h{service=\"api\"}",
            "legendFormat": "24h burn rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "thresholdsStyle": {
                "mode": "line"
              }
            },
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "orange", "value": 6},
                {"color": "red", "value": 14.4}
              ]
            }
          }
        }
      },
      {
        "title": "SLI Over Time",
        "type": "timeseries",
        "gridPos": {"x": 0, "y": 10, "w": 24, "h": 8},
        "targets": [
          {
            "expr": "slo:sli_value{service=\"api\"}",
            "legendFormat": "{{slo_name}}"
          },
          {
            "expr": "slo:target{service=\"api\"}",
            "legendFormat": "Target"
          }
        ]
      }
    ]
  }
}
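
The burn-rate thresholds in that panel (1, 6, and 14.4) are worth pausing on: a burn rate of exactly 1 consumes the whole error budget over the SLO window and no faster, while higher rates exhaust it proportionally sooner. A quick arithmetic sketch, assuming a 30-day window and a full budget:

# How long a fresh error budget lasts at a given burn rate,
# assuming a 30-day SLO window (illustrative arithmetic only).
WINDOW_DAYS = 30

def days_to_exhaustion(burn_rate: float) -> float:
    """At burn rate N, a full budget lasts window / N days."""
    return WINDOW_DAYS / burn_rate

for rate in (1, 6, 14.4):
    print(f"burn rate {rate:>5}: budget exhausted in ~{days_to_exhaustion(rate):.1f} days")

# burn rate     1: budget exhausted in ~30.0 days  (exactly sustainable)
# burn rate     6: budget exhausted in ~5.0 days   (warning threshold)
# burn rate  14.4: budget exhausted in ~2.1 days   (page someone now)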

Chapter 12: SLOs Across the Organization

12.1 Service Dependency SLOs

When your service depends on others, how do you set SLOs?

SERVICE DEPENDENCY CHAIN

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  Your Service                                                          │
│  SLO: 99.9% availability                                               │
│       │                                                                │
│       ├──────────────────────┐                                         │
│       │                      │                                         │
│       ▼                      ▼                                         │
│  Database                Payment Provider                              │
│  SLO: 99.99%             SLA: 99.95%                                   │
│       │                      │                                         │
│       ▼                      ▼                                         │
│  Storage                 Bank API                                      │
│  SLA: 99.99%             SLA: 99.9%                                    │
│                                                                        │
│  YOUR MAXIMUM ACHIEVABLE AVAILABILITY:                                 │
│  = 99.99% × 99.95% × 99.99% × 99.9%                                    │
│  = 99.83% (theoretical max)                                            │
│                                                                        │
│  LESSON: You cannot be more reliable than your least reliable          │
│  critical dependency!                                                  │
│                                                                        │
│  SOLUTIONS:                                                            │
│  1. Only depend on services with SLOs >= your SLO                      │
│  2. Build redundancy (multiple payment providers)                      │
│  3. Graceful degradation (queue payments if provider down)             │
│  4. Adjust your SLO to reflect dependencies                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
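
The 99.83% in the diagram is simply the product of the serial dependencies' availabilities. A minimal sketch of that calculation, using the numbers from the diagram:

# Composite availability of serial dependencies is the product of the
# individual availabilities (numbers taken from the diagram above).
dependencies = {
    "database": 0.9999,
    "payment_provider": 0.9995,
    "storage": 0.9999,
    "bank_api": 0.999,
}

composite = 1.0
for availability in dependencies.values():
    composite *= availability

print(f"Theoretical max availability: {composite:.2%}")  # ~99.83%
# Your own code adds failure modes on top of this, so without redundancy
# or graceful degradation a 99.9% SLO is simply not achievable here.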

12.2 Tiered SLOs

Different customers may have different SLOs:

# slo/tiered.py

"""
Tiered SLO management for different customer segments.
"""

from dataclasses import dataclass
from datetime import datetime
from typing import Dict
from enum import Enum


class CustomerTier(Enum):
    """Customer tier levels."""
    FREE = "free"
    STANDARD = "standard"
    PROFESSIONAL = "professional"
    ENTERPRISE = "enterprise"


@dataclass
class TieredSLO:
    """SLO with different targets per tier."""
    sli_name: str
    description: str
    
    # Targets by tier
    targets: Dict[CustomerTier, float]
    
    # Whether a higher SLI value is better (False for latency, where lower is better)
    higher_is_better: bool = True
    
    # Whether to enforce (vs aspirational)
    enforced: bool = True


# Example tiered SLOs for our notification platform
NOTIFICATION_SLOS = [
    TieredSLO(
        sli_name="api_availability",
        description="API request success rate",
        targets={
            CustomerTier.FREE: 99.0,        # 7.3 hours/month downtime OK
            CustomerTier.STANDARD: 99.5,    # 3.6 hours/month
            CustomerTier.PROFESSIONAL: 99.9, # 43 minutes/month
            CustomerTier.ENTERPRISE: 99.99,  # 4.3 minutes/month
        }
    ),
    TieredSLO(
        sli_name="api_latency_p99",
        description="99th percentile API latency (ms)",
        higher_is_better=False,  # lower latency is better
        targets={
            CustomerTier.FREE: 1000,         # 1 second
            CustomerTier.STANDARD: 500,      # 500ms
            CustomerTier.PROFESSIONAL: 200,  # 200ms
            CustomerTier.ENTERPRISE: 100,    # 100ms
        }
    ),
    TieredSLO(
        sli_name="delivery_rate",
        description="Notification delivery success rate",
        targets={
            CustomerTier.FREE: 95.0,
            CustomerTier.STANDARD: 98.0,
            CustomerTier.PROFESSIONAL: 99.0,
            CustomerTier.ENTERPRISE: 99.9,
        }
    ),
]


class TieredSLOService:
    """
    Manages tiered SLOs across customer segments.
    """
    
    def __init__(self, metrics_client, customer_service):
        self.metrics = metrics_client
        self.customers = customer_service
    
    async def check_customer_slo(
        self,
        customer_id: str,
        slo_name: str
    ) -> Dict:
        """Check SLO status for a specific customer."""
        
        # Get customer tier
        customer = await self.customers.get(customer_id)
        tier = CustomerTier(customer.tier)
        
        # Find SLO config
        slo_config = next(s for s in NOTIFICATION_SLOS if s.sli_name == slo_name)
        target = slo_config.targets[tier]
        
        # Query SLI for this customer
        current_value = await self.metrics.query(
            f'sli:{slo_name}{{customer_id="{customer_id}"}}',
            window="30d"
        )
        
        # Latency-style SLIs invert the comparison: lower is better
        if slo_config.higher_is_better:
            meeting_slo = current_value >= target
            gap = current_value - target
        else:
            meeting_slo = current_value <= target
            gap = target - current_value
        
        return {
            "customer_id": customer_id,
            "tier": tier.value,
            "slo_name": slo_name,
            "target": target,
            "current_value": current_value,
            "meeting_slo": meeting_slo,
            "gap": gap  # positive means margin, negative means violation
        }
    
    async def get_slo_compliance_report(self) -> Dict:
        """Generate compliance report across all tiers."""
        report = {
            "generated_at": datetime.utcnow().isoformat(),
            "by_tier": {},
            "by_slo": {}
        }
        
        for tier in CustomerTier:
            tier_customers = await self.customers.get_by_tier(tier.value)
            
            compliance = {
                "total_customers": len(tier_customers),
                "meeting_all_slos": 0,
                "violating_slos": 0
            }
            
            for customer in tier_customers:
                meeting_all = True
                
                for slo in NOTIFICATION_SLOS:
                    result = await self.check_customer_slo(
                        customer.id, 
                        slo.sli_name
                    )
                    if not result["meeting_slo"]:
                        meeting_all = False
                        break
                
                if meeting_all:
                    compliance["meeting_all_slos"] += 1
                else:
                    compliance["violating_slos"] += 1
            
            report["by_tier"][tier.value] = compliance
        
        return report
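
A hypothetical usage sketch — the metrics_client and customer_service objects, the customer ID, and the output handling are all illustrative assumptions, not part of the platform's real API:

# Hypothetical usage sketch. metrics_client and customer_service are
# assumed to be constructed elsewhere; IDs and output are illustrative.
import asyncio

async def main():
    service = TieredSLOService(metrics_client, customer_service)

    # Check one customer's availability SLO against their tier's target
    result = await service.check_customer_slo("cust_123", "api_availability")
    if not result["meeting_slo"]:
        print(f"{result['customer_id']} ({result['tier']}) missing "
              f"{result['slo_name']}: {result['current_value']} vs target {result['target']}")

    # Roll compliance up across every tier
    report = await service.get_slo_compliance_report()
    print(report["by_tier"])

# asyncio.run(main())  # uncomment once real clients are wired in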

12.3 SLO Review Process

QUARTERLY SLO REVIEW PROCESS

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  WEEK 1: DATA COLLECTION                                               │
│  ├── Pull last quarter's SLI data                                      │
│  ├── Calculate SLO compliance rate                                     │
│  ├── Identify patterns in violations                                   │
│  └── Gather customer feedback                                          │
│                                                                        │
│  WEEK 2: ANALYSIS                                                      │
│  ├── Were SLOs too tight? (Constantly missing)                         │
│  ├── Were SLOs too loose? (Always met, but users unhappy)              │
│  ├── Did dependencies impact our SLOs?                                 │
│  └── What incidents consumed error budget?                             │
│                                                                        │
│  WEEK 3: PROPOSAL                                                      │
│  ├── Propose adjustments to targets                                    │
│  ├── Propose new SLOs if gaps identified                               │
│  ├── Propose deprecating unused SLOs                                   │
│  └── Document rationale for changes                                    │
│                                                                        │
│  WEEK 4: REVIEW & APPROVAL                                             │
│  ├── Review with stakeholders (product, engineering, support)          │
│  ├── Align on changes                                                  │
│  ├── Update documentation                                              │
│  └── Communicate changes to team                                       │
│                                                                        │
│  KEY QUESTIONS TO ASK:                                                 │
│  ├── Did we miss SLO? Why? Was it preventable?                         │
│  ├── Did we meet SLO but users were unhappy? Target too loose?         │
│  ├── Did we exceed SLO by large margin? Over-invested in reliability?  │
│  ├── Did error budget policy work? Did we actually freeze?             │
│  └── Are our SLIs still measuring the right things?                    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
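
"Calculate SLO compliance rate" in week 1 usually boils down to: over the quarter, what fraction of measurement periods met the target? A minimal sketch, assuming per-day SLI values are already available (the sample numbers are made up):

# Minimal sketch: compliance rate from per-day SLI values (made-up sample;
# a real quarter would have ~90 entries, e.g. from get_slo_history).
daily_sli = [99.95, 99.97, 99.80, 99.99, 99.91, 99.99, 99.96]
target = 99.9

days_meeting = sum(1 for value in daily_sli if value >= target)
compliance_rate = days_meeting / len(daily_sli) * 100

print(f"Met the {target}% target on {days_meeting}/{len(daily_sli)} days "
      f"({compliance_rate:.1f}% compliance)")
# Sample above: 6/7 days -> 85.7% compliance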

Chapter 13: Your Journey So Far

As we begin this final week, let's reflect on how far you've come:

YOUR SYSTEM DESIGN EVOLUTION

WEEK 1: You learned WHY systems fail at scale
└── Hot keys, replication lag, partition strategies
    Now you know: Distribution is hard, but necessary

WEEK 2: You learned HOW to handle failures
└── Timeouts, retries, idempotency, circuit breakers
    Now you know: Assume failure, design for recovery

WEEK 3: You learned ASYNC patterns
└── Queues, streams, backpressure, dead letters
    Now you know: Not everything needs to be synchronous

WEEK 4: You learned CACHING deeply
└── Invalidation, thundering herds, consistency
    Now you know: Caching is easy to add, hard to get right

WEEK 5: You learned COORDINATION
└── Consensus, sagas, conflict resolution
    Now you know: Distributed transactions are expensive

WEEK 6-8: You DESIGNED complete systems
└── Notifications, Search, Analytics
    Now you know: Real systems combine many patterns

WEEK 9: You learned ENTERPRISE concerns
└── Multi-tenancy, compliance, security
    Now you know: Business requirements shape architecture

WEEK 10 (NOW): You're learning to OPERATE systems
└── SLOs tell you if your system is healthy
    After this week: You'll own systems, not just build them

This week is the capstone of your entire journey. Everything you've learned comes together when you operate systems in production. SLOs are how you know if all that careful design is actually working.


Further Reading

Books:

  • "Site Reliability Engineering" by Google — Chapters on SLIs/SLOs/SLAs
  • "The Site Reliability Workbook" — Practical SLO implementation
  • "Implementing Service Level Objectives" by Alex Hidalgo

Tools:

  • Prometheus + Grafana for SLI measurement
  • Sloth — SLO generator for Prometheus
  • Nobl9 — SLO management platform
  • Google Cloud SLO Monitoring

End of Day 1: SLIs, SLOs, and SLAs

Tomorrow: Day 2 — Observability. You've defined what "healthy" means. Now you need to SEE whether your system is healthy. Metrics, logs, traces — the three pillars that let you understand what's happening inside your production systems.


You're in the final week. You've learned to build systems. Now you're learning to own them. This is what makes a senior engineer.