Week 10 Preview: Production Readiness and Operational Excellence

System Design Mastery Series — The Final Week


Welcome to Week 10

You've designed the systems. Now you need to run them.

THE PRODUCTION REALITY

Weeks 1-9: You learned to design systems
├── Data at scale
├── Failure handling
├── Messaging patterns
├── Caching strategies
├── Consistency models
├── Complete system designs
├── Analytics pipelines
├── Multi-tenancy
└── Security architecture

Week 10: You learn to OPERATE systems
├── How do you know it's working?
├── How do you deploy without breaking things?
├── How do you handle 10x traffic?
├── What happens when it breaks at 3am?
└── How do you prevent it from breaking again?

The difference between a junior and senior engineer:
├── Junior: "I built it, it works on my machine"
├── Mid: "I built it, it passed the tests"
└── Senior: "I built it, I can operate it, I can debug it,
             I know when it's healthy, and I have a plan
             for when it breaks"

This week transforms you from someone who builds systems to someone who owns systems in production.


Week 10 Theme: Production Readiness

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                    PRODUCTION READINESS PILLARS                        │
│                                                                        │
│                         ┌─────────────┐                                │
│                         │   RELIABLE  │                                │
│                         │   SYSTEMS   │                                │
│                         └──────┬──────┘                                │
│                                │                                       │
│         ┌──────────────────────┼──────────────────────┐                │
│         │                      │                      │                │
│         ▼                      ▼                      ▼                │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐           │
│  │    SLOs     │       │ Observa-    │       │  Incident   │           │
│  │   Define    │       │ bility      │       │  Response   │           │
│  │  Reliability│       │   See       │       │   Handle    │           │
│  │   Targets   │       │  Problems   │       │  Problems   │           │
│  └──────┬──────┘       └──────┬──────┘       └──────┬──────┘           │
│         │                      │                      │                │
│         ▼                      ▼                      ▼                │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐           │
│  │ Deployment  │       │  Capacity   │       │ Postmortem  │           │
│  │  Strategies │       │  Planning   │       │  & Learning │           │
│  │   Deploy    │       │   Scale     │       │   Improve   │           │
│  │   Safely    │       │  Correctly  │       │   Always    │           │
│  └─────────────┘       └─────────────┘       └─────────────┘           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Daily Breakdown

Day 1: SLIs, SLOs, and SLAs — Defining Reliability

WHAT YOU'LL LEARN

The Language of Reliability:
├── SLI (Service Level Indicator): What we measure
├── SLO (Service Level Objective): What we target
├── SLA (Service Level Agreement): What we promise
└── Error Budget: How much unreliability we allow

Example Progression:
┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  METRIC:     Response latency                                          │
│                    │                                                   │
│                    ▼                                                   │
│  SLI:        p99 latency of successful GET requests                    │
│              Measured at load balancer over 5-minute windows           │
│                    │                                                   │
│                    ▼                                                   │
│  SLO:        p99 latency < 200ms for 99.9% of 5-minute windows         │
│              Over a rolling 30-day period                              │
│                    │                                                   │
│                    ▼                                                   │
│  SLA:        "API requests will complete in under 500ms                │
│               99% of the time, or customer receives credit"            │
│                    │                                                   │
│                    ▼                                                   │
│  ERROR       30 days × 24 hours × 60 minutes = 43,200 minutes          │
│  BUDGET:     0.1% error budget = 43.2 minutes of violations allowed    │
│              If we exceed this, we stop features and fix reliability   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
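
The error-budget arithmetic above fits in a few lines of Python. A minimal sketch using the example numbers from the box; the function name and the 30 minutes of "budget already consumed" are illustrative assumptions:

# Minimal sketch of the error-budget arithmetic above (illustrative numbers).

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of SLO violation allowed over the rolling window."""
    total_minutes = window_days * 24 * 60        # 30 days -> 43,200 minutes
    return total_minutes * (1 - slo_target)      # 0.1% of 43,200 -> 43.2 minutes

budget = error_budget_minutes(slo_target=0.999)  # 43.2
bad_minutes_so_far = 30.0                        # hypothetical measurement
remaining = budget - bad_minutes_so_far
print(f"Budget {budget:.1f} min, remaining {remaining:.1f} min "
      f"({remaining / budget:.0%} of budget left)")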

Key Topics:
├── Choosing the right SLIs (latency, availability, throughput, correctness)
├── Setting realistic SLOs (not too tight, not too loose)
├── Error budget policies (what happens when budget depleted)
├── SLO-based alerting (alert on budget burn rate, not every error; sketched after this list)
└── Communicating reliability to stakeholders
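
The burn-rate idea is easier to see in code. A sketch of the common multi-window pattern; the 14.4 threshold corresponds to burning roughly 2% of a 30-day budget in one hour, and the window sizes and event counts here are illustrative assumptions:

# Sketch of multi-window burn-rate alerting (all numbers are illustrative).
# Burn rate = observed error ratio / error ratio the SLO allows.
# A sustained burn rate of 1.0 uses up exactly the whole budget over the SLO window.

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    observed = bad_events / max(total_events, 1)
    allowed = 1 - slo_target                     # 0.001 for a 99.9% SLO
    return observed / allowed

def should_page(fast_window: float, slow_window: float) -> bool:
    # Require both a short and a long window to burn fast; this filters out blips.
    return fast_window > 14.4 and slow_window > 14.4   # ~2% of a 30-day budget per hour

fast = burn_rate(bad_events=90, total_events=5_000, slo_target=0.999)    # 5-minute window
slow = burn_rate(bad_events=900, total_events=60_000, slo_target=0.999)  # 1-hour window
print(should_page(fast, slow))   # True: page the on-call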

Why This Matters: Without SLOs, you don't know if your system is "healthy enough." You'll either over-invest in reliability or under-invest until customers complain.


Day 2: Observability — Seeing Into Production

WHAT YOU'LL LEARN

The Three Pillars of Observability:
├── Metrics: Aggregated numerical data (counters, gauges, histograms)
├── Logs: Discrete events with context
└── Traces: Request flow across services

┌────────────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY STACK                                 │
│                                                                        │
│  USER REQUEST                                                          │
│       │                                                                │
│       ▼                                                                │
│  ┌─────────┐    Trace ID: abc-123                                      │
│  │   API   │────────────────────────────────────────┐                  │
│  │ Gateway │                                        │                  │
│  └────┬────┘                                        │                  │
│       │                                             │                  │
│       ▼                                             ▼                  │
│  ┌─────────┐    Span: api-gateway → user-service   TRACE               │
│  │  User   │◄──────────────────────────────────────                    │
│  │ Service │    Duration: 45ms                                         │
│  └────┬────┘                                                           │
│       │                                                                │
│       ▼                                             ▼                  │
│  ┌─────────┐    Span: user-service → postgres      METRICS             │
│  │Postgres │◄──────────────────────────────────────                    │
│  │         │    Duration: 12ms                     request_count: +1   │
│  └─────────┘                                       latency_p99: 45ms   │
│                                                    error_count: 0      │
│                                             ▼                          │
│  Throughout this flow:                      LOGS                       │
│  {"level":"info","trace_id":"abc-123",     [timestamp] INFO            │
│   "message":"User fetched","user_id":123}  Request processed           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
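
To make the span concept in the diagram concrete, here is a hand-rolled sketch. A real service would use a tracing library such as OpenTelemetry; this hypothetical context manager only shows the shape of the data: one trace ID shared by nested, timed spans.

# Hand-rolled illustration of the span concept from the diagram (not a real tracer).
import contextlib
import time
import uuid

@contextlib.contextmanager
def span(trace_id: str, name: str):
    start = time.monotonic()
    try:
        yield
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        # In production this record would be exported to a tracing backend.
        print({"trace_id": trace_id, "span": name, "duration_ms": round(duration_ms, 1)})

trace_id = str(uuid.uuid4())                       # minted once, at the API gateway
with span(trace_id, "api-gateway -> user-service"):
    with span(trace_id, "user-service -> postgres"):
        time.sleep(0.012)                          # stands in for the 12ms query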

Key Topics:
├── Structured logging (JSON, correlation IDs; see the sketch after this list)
├── Metric types (counters, gauges, histograms, summaries)
├── Distributed tracing (OpenTelemetry, trace context propagation)
├── Dashboard design (USE method, RED method, golden signals)
├── Alert design (symptoms vs causes, runbook links)
└── Debug workflows (from alert to root cause)
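
A minimal structured-logging sketch using only the Python standard library. The fields mirror the log line in the diagram; the formatter class and logger name are just placeholders:

# Minimal structured (JSON) logging with a correlation/trace ID, stdlib only.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # set via `extra`
            "user_id": getattr(record, "user_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("user-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("User fetched", extra={"trace_id": "abc-123", "user_id": 123})
# -> {"level": "info", "message": "User fetched", "trace_id": "abc-123", "user_id": 123}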

Why This Matters: You can't fix what you can't see. Observability is how you turn "something is slow" into "this specific database query on this specific pod is slow because of this index."


Day 3: Deployment Strategies — Shipping Without Breaking

WHAT YOU'LL LEARN

Deployment Evolution:
├── Big Bang: Deploy everything at once (risky)
├── Rolling: Replace instances gradually
├── Blue-Green: Switch between identical environments
├── Canary: Route small percentage to new version
└── Feature Flags: Deploy code, activate later

┌────────────────────────────────────────────────────────────────────────┐
│                    CANARY DEPLOYMENT                                   │
│                                                                        │
│  Stage 1: Deploy to 1% of traffic                                      │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                           Load Balancer                         │   │
│  │                                │                                │   │
│  │              ┌─────────────────┴─────────────────┐              │   │
│  │              │                                   │              │   │
│  │              ▼ 99%                               ▼ 1%           │   │
│  │      ┌───────────────┐                   ┌───────────────┐      │   │
│  │      │   v1.2.3      │                   │   v1.2.4      │      │   │
│  │      │  (current)    │                   │   (canary)    │      │   │
│  │      │  10 pods      │                   │   1 pod       │      │   │
│  │      └───────────────┘                   └───────────────┘      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                        │
│  Monitor for 15 minutes:                                               │
│  ├── Error rate: canary vs baseline                                    │
│  ├── Latency: canary vs baseline                                       │
│  └── Business metrics: conversion, etc.                                │
│                                                                        │
│  If healthy: Promote to 10% → 50% → 100%                               │
│  If unhealthy: Automatic rollback to v1.2.3                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
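
The promote-or-rollback decision shown in the box boils down to comparing canary metrics against the baseline. A sketch of that comparison; the tolerance thresholds and metric values are illustrative assumptions, not recommendations:

# Sketch of the canary promote/rollback decision (thresholds are assumptions).

def canary_is_healthy(baseline: dict, canary: dict) -> bool:
    # Allow the canary a small tolerance over the baseline before failing it.
    error_ok = canary["error_rate"] <= baseline["error_rate"] * 1.5 + 0.001
    latency_ok = canary["p99_ms"] <= baseline["p99_ms"] * 1.2
    return error_ok and latency_ok

baseline = {"error_rate": 0.002, "p99_ms": 180}    # v1.2.3, 99% of traffic
canary   = {"error_rate": 0.003, "p99_ms": 195}    # v1.2.4, 1% of traffic

if canary_is_healthy(baseline, canary):
    print("Promote: 1% -> 10%")
else:
    print("Rollback to v1.2.3")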

Key Topics:
├── Deployment strategies comparison
├── Database migrations (backward compatible, expand-contract)
├── Feature flags (gradual rollout, kill switches; rollout sketch after this list)
├── Rollback strategies (fast rollback, data rollback)
├── CI/CD pipeline design
└── Change management (deployment windows, freezes)
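
Gradual rollout with a feature flag is often implemented by hashing the user into a stable bucket. A sketch of that idea, not tied to any particular flag system; the flag name, user ID, and percentage are made up:

# Sketch of a percentage-based feature flag: hash the user into a stable bucket.
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # stable 0-99 bucket per user+flag
    return bucket < rollout_percent

# The same user always lands in the same bucket, so raising the percentage
# only ever adds users; nobody flips back and forth between versions.
print(flag_enabled("new-checkout", "user-42", rollout_percent=10))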

Why This Matters: The fastest way to cause an outage is a bad deployment. Safe deployment practices are the difference between "we deploy 50 times a day" and "we're scared to deploy."


Day 4: Capacity Planning — Scaling Before You Need To

WHAT YOU'LL LEARN

Capacity Planning Process:
├── Understand current capacity
├── Forecast future demand
├── Identify bottlenecks
├── Plan scaling actions
└── Validate with load testing

┌───────────────────────────────────────────────────────────────────────┐
│                    CAPACITY PLANNING WORKFLOW                         │
│                                                                       │
│  1. MEASURE CURRENT STATE                                             │
│     ┌────────────────────────────────────────────────────────────┐    │
│     │  Component      │ Current Load │ Max Capacity │ Headroom   │    │
│     │  ──────────────────────────────────────────────────────────│    │
│     │  API Servers    │ 400 req/s    │ 600 req/s    │ 33%        │    │
│     │  PostgreSQL     │ 2000 QPS     │ 3000 QPS     │ 33%        │    │
│     │  Redis          │ 50% memory   │ 100%         │ 50%        │    │
│     │  Elasticsearch  │ 200 search/s │ 400 search/s │ 50%        │    │
│     └────────────────────────────────────────────────────────────┘    │
│                                                                       │
│  2. FORECAST DEMAND                                                   │
│     ├── Historical growth: 10% month-over-month                       │
│     ├── Planned events: Black Friday (5x normal traffic)              │
│     ├── New features: Video upload (high bandwidth)                   │
│     └── Seasonality: Q4 is 2x Q1                                      │
│                                                                       │
│  3. IDENTIFY BOTTLENECKS                                              │
│     At 2x traffic, what breaks first?                                 │
│     ├── API servers: Scale horizontally ✓                             │
│     ├── PostgreSQL: Need read replicas or sharding                    │
│     ├── Redis: Add memory ✓                                           │
│     └── Elasticsearch: BOTTLENECK - needs cluster expansion           │
│                                                                       │
│  4. PLAN ACTIONS                                                      │
│     ├── Q3: Expand Elasticsearch cluster (4 weeks lead time)          │
│     ├── Q4: Add PostgreSQL read replica                               │
│     └── Black Friday: Pre-scale API servers 48 hours before           │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
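
Steps 1 and 2 of the workflow can be turned into a quick forecast: at a known growth rate, how long until each component hits its ceiling? A back-of-the-envelope sketch using the illustrative numbers from the table above:

# Back-of-the-envelope capacity forecast using the illustrative numbers above.
import math

def months_until_exhausted(current: float, maximum: float, monthly_growth: float) -> float:
    """Months until load reaches max capacity at compound growth."""
    if current >= maximum:
        return 0.0
    return math.log(maximum / current) / math.log(1 + monthly_growth)

components = {
    "API servers":   (400, 600),       # req/s
    "PostgreSQL":    (2000, 3000),     # QPS
    "Elasticsearch": (200, 400),       # searches/s
}
for name, (load, cap) in components.items():
    months = months_until_exhausted(load, cap, monthly_growth=0.10)
    print(f"{name:14s} hits capacity in ~{months:.1f} months at 10% MoM growth")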

Key Topics:
├── Load testing (tools, realistic traffic patterns)
├── Bottleneck identification (Little's Law, queueing theory basics; see the sketch after this list)
├── Vertical vs horizontal scaling decisions
├── Auto-scaling configuration (metrics, thresholds, cooldowns)
├── Cost optimization (right-sizing, reserved capacity)
└── Chaos engineering (verify resilience under load)
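
Little's Law (concurrency = arrival rate × time in system) is the quickest way to sanity-check sizing before a load test confirms it. A sketch with illustrative numbers; the latency and workers-per-pod figures are assumptions:

# Little's Law: concurrency (L) = arrival rate (lambda) x time in system (W).
# Handy for sizing worker and connection pools ahead of a load test.
import math

arrival_rate = 400          # requests per second (from the table above)
avg_latency_s = 0.120       # average time a request spends in the system (illustrative)

concurrent_requests = arrival_rate * avg_latency_s      # L = 48 requests in flight
workers_per_pod = 8                                     # assumption for this sketch
pods_needed = math.ceil(concurrent_requests / workers_per_pod)

print(f"~{concurrent_requests:.0f} concurrent requests -> at least {pods_needed} pods")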

Why This Matters: Scaling reactively means outages. Scaling proactively means you're ready when traffic spikes. Capacity planning turns "we might need more servers" into "we need 3 more nodes by October 15."


Day 5: Incident Management — When Things Go Wrong

WHAT YOU'LL LEARN

Incident Lifecycle:
├── Detection: How do we know there's a problem?
├── Response: Who gets paged, what do they do?
├── Mitigation: Stop the bleeding
├── Resolution: Fix the root cause
└── Learning: Prevent recurrence

┌────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE FLOW                              │
│                                                                        │
│  DETECTION                                                             │
│  ────────                                                              │
│  Alert fires: "Error rate > 5% for 5 minutes"                          │
│       │                                                                │
│       ▼                                                                │
│  TRIAGE (5 minutes)                                                    │
│  ──────                                                                │
│  On-call engineer:                                                     │
│  ├── Acknowledge alert                                                 │
│  ├── Assess severity (SEV1/SEV2/SEV3)                                  │
│  ├── Check dashboard for scope                                         │
│  └── Decide: Handle alone or escalate?                                 │
│       │                                                                │
│       ▼                                                                │
│  INCIDENT DECLARED (SEV1)                                              │
│  ────────────────────────                                              │
│  Roles assigned:                                                       │
│  ├── Incident Commander: Coordinates response                          │
│  ├── Technical Lead: Drives investigation                              │
│  ├── Communications: Updates stakeholders                              │
│  └── Scribe: Documents timeline                                        │
│       │                                                                │
│       ▼                                                                │
│  MITIGATION                                                            │
│  ──────────                                                            │
│  ├── Rollback deployment?                                              │
│  ├── Scale up resources?                                               │
│  ├── Disable feature flag?                                             │
│  ├── Fail over to backup?                                              │
│  └── Goal: Restore service, not find root cause                        │
│       │                                                                │
│       ▼                                                                │
│  RESOLUTION                                                            │
│  ──────────                                                            │
│  ├── Root cause identified                                             │
│  ├── Permanent fix deployed                                            │
│  └── Incident closed                                                   │
│       │                                                                │
│       ▼                                                                │
│  POSTMORTEM (within 48 hours)                                          │
│  ─────────                                                             │
│  ├── Timeline reconstruction                                           │
│  ├── Root cause analysis (5 Whys)                                      │
│  ├── Impact assessment                                                 │
│  ├── Action items with owners and deadlines                            │
│  └── Blameless discussion                                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
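
The triage step in the flow above is a judgment call, but teams usually codify rough rules. A sketch of such a rule; the thresholds and severity definitions here are assumptions, since every organization defines its own:

# Sketch of the triage decision from the flow above (thresholds are assumptions).

def assess_severity(error_rate: float, customers_affected_pct: float, data_loss: bool) -> str:
    if data_loss or customers_affected_pct > 50:
        return "SEV1"       # declare an incident, assign roles, page broadly
    if error_rate > 0.05 or customers_affected_pct > 10:
        return "SEV2"       # on-call drives it, escalate if mitigation stalls
    return "SEV3"           # track it, fix during working hours

severity = assess_severity(error_rate=0.07, customers_affected_pct=60, data_loss=False)
print(severity)             # SEV1 -> IC, technical lead, comms, and scribe assigned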

Key Topics:
├── Severity levels and escalation
├── On-call rotations and handoffs
├── Runbooks (structured troubleshooting guides)
├── Communication during incidents (status pages, stakeholder updates)
├── Blameless postmortems (5 Whys, contributing factors)
└── Action item tracking and follow-through

Why This Matters: Incidents will happen. The difference between a minor blip and a catastrophic outage is how quickly and effectively you respond. Good incident management is a competitive advantage.


Week 10 Capstone Preview

Design: Production Readiness Review

THE CAPSTONE SCENARIO

You're joining a company as a Staff Engineer. Your first task:
Lead a Production Readiness Review (PRR) for a new service
before it launches.

The service: Real-time Notification Platform
├── Sends push notifications, emails, SMS
├── 10M notifications/day expected
├── Must be highly available (notifications are time-sensitive)
└── Integrates with 5 third-party providers

Your job:
├── Define SLIs and SLOs for the service
├── Design the observability strategy
├── Review the deployment pipeline
├── Create capacity plan for launch
├── Write incident response runbooks
└── Present the PRR to leadership

This capstone tests whether you can:
├── Think like an SRE, not just a developer
├── Anticipate failure modes before they happen
├── Design for operability, not just functionality
└── Communicate reliability to non-technical stakeholders

What Makes Week 10 Different

SHIFTING PERSPECTIVE

Weeks 1-9: "How do I build this?"
Week 10:   "How do I know this is working?"
           "How do I ship changes safely?"
           "How do I handle problems at 3am?"
           "How do I prevent future problems?"

This week changes HOW you think about systems:

BEFORE WEEK 10:
├── "The system is done when the feature works"
├── "Tests pass, ship it"
├── "We'll figure out monitoring later"
├── "If it breaks, we'll debug it then"
└── "Incidents are unpredictable"

AFTER WEEK 10:
├── "The system is done when I can operate it"
├── "Tests pass, metrics are in, runbook written, ship it"
├── "Observability is a requirement, not an afterthought"
├── "I've thought through failure modes in advance"
└── "Incidents are managed, learnings are captured"

Connections to Previous Weeks

BUILDING ON YOUR FOUNDATION

Week 10 operationalizes everything you've learned:

WEEK 1 (Data at Scale):
└── Day 4: SLOs for database performance
    └── "99th percentile query latency < 50ms"

WEEK 2 (Failure Handling):
└── Day 5: Incident response for cascading failures
    └── "Circuit breaker opened, runbook step 3"

WEEK 3 (Messaging):
└── Day 2: Monitoring queue depth and lag
    └── "Consumer lag > 10K messages triggers alert"

WEEK 4 (Caching):
└── Day 3: Capacity planning for cache tier
    └── "Cache hit rate drop = scale trigger"

WEEK 6 (Notification Platform):
└── Capstone: Complete PRR for the system you designed
    └── "How would you operate this in production?"

WEEK 9 (Multi-Tenancy):
└── Day 2: Per-tenant SLOs and monitoring
    └── "Enterprise tenant SLO: 99.99%, Standard: 99.9%"

Practical Skills You'll Gain

AFTER WEEK 10, YOU CAN:

1. DEFINE RELIABILITY
   ├── Write precise SLIs that measure what matters
   ├── Set SLOs that balance reliability and velocity
   ├── Calculate and manage error budgets
   └── Translate technical SLOs to business language

2. BUILD OBSERVABILITY
   ├── Instrument code with metrics and traces
   ├── Design dashboards that answer questions
   ├── Write alerts that are actionable, not noisy
   └── Debug production issues systematically

3. DEPLOY SAFELY
   ├── Choose the right deployment strategy
   ├── Design backward-compatible migrations
   ├── Use feature flags effectively
   └── Roll back quickly when needed

4. PLAN CAPACITY
   ├── Run meaningful load tests
   ├── Identify bottlenecks before they hit
   ├── Right-size infrastructure
   └── Prepare for traffic spikes

5. MANAGE INCIDENTS
   ├── Respond effectively under pressure
   ├── Communicate clearly during outages
   ├── Run blameless postmortems
   └── Turn incidents into improvements

The Production Mindset

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  THE PRODUCTION MINDSET                                                │
│                                                                        │
│  "How do I build this?"                                                │
│                │                                                       │
│                ▼                                                       │
│  "How do I build this SO THAT..."                                      │
│                │                                                       │
│       ┌────────┴────────┬────────────────┬────────────────┐            │
│       │                 │                │                │            │
│       ▼                 ▼                ▼                ▼            │
│  ...I can tell     ...I can deploy  ...I can scale   ...I can fix      │
│  when it's         changes safely   it when needed   it when broken    │
│  healthy           at any time      without panic    without panic     │
│                                                                        │
│  This is what separates SENIOR from JUNIOR engineers.                  │
│  This is what interviewers look for at Staff+ levels.                  │
│  This is what this week teaches you.                                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Daily Schedule

Day        Topic                          Key Deliverable
───────────────────────────────────────────────────────────────────────────────────────
Day 1      SLIs, SLOs, SLAs               Define SLOs for a real system
Day 2      Observability                  Design metrics, logs, traces strategy
Day 3      Deployment Strategies          Create deployment runbook
Day 4      Capacity Planning              Build capacity model with load test plan
Day 5      Incident Management            Write incident runbook and postmortem template
Capstone   Production Readiness Review    Complete PRR for notification platform

Preparing for Week 10

MINDSET SHIFT

This week asks different questions than previous weeks:

INSTEAD OF:              ASK:
─────────────────────────────────────────────────────
"What database?"      →  "What SLO for database latency?"
"How to scale?"       →  "When do we need to scale?"
"Handle failures"     →  "How do we detect failures?"
"Build feature X"     →  "How do we safely ship feature X?"
"Fix the bug"         →  "How do we prevent this class of bug?"

BRING YOUR SYSTEMS

Think about the systems you designed in weeks 6-9:
├── Notification Platform (Week 6)
├── Search System (Week 7)
├── Analytics Pipeline (Week 8)
└── Multi-Tenant SaaS (Week 9)

For each one, consider:
├── What would the SLOs be?
├── What metrics would you monitor?
├── How would you deploy changes?
├── What's the capacity limit?
└── What's the incident response plan?

Summary

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                    WEEK 10 AT A GLANCE                                 │
│                                                                        │
│  THEME: Production Readiness and Operational Excellence                │
│                                                                        │
│  THE QUESTION: "Can I operate this system in production?"              │
│                                                                        │
│  DAILY TOPICS:                                                         │
│  ├── Day 1: SLIs, SLOs, SLAs — Define what "healthy" means             │
│  ├── Day 2: Observability — See into your system                       │
│  ├── Day 3: Deployment — Ship changes safely                           │
│  ├── Day 4: Capacity — Scale before you need to                        │
│  └── Day 5: Incidents — Handle problems effectively                    │
│                                                                        │
│  CAPSTONE: Production Readiness Review                                 │
│  └── Complete operational review of notification platform              │
│                                                                        │
│  KEY TRANSFORMATION:                                                   │
│  ├── From: "I built it"                                                │
│  └── To:   "I can operate it"                                          │
│                                                                        │
│  THIS COMPLETES YOUR JOURNEY:                                          │
│  ├── Week 1-2: Foundations                                             │
│  ├── Week 3-5: Building blocks                                         │
│  ├── Week 6-8: Complete systems                                        │
│  ├── Week 9: Enterprise requirements                                   │
│  └── Week 10: Production excellence                                    │
│                                                                        │
│  After this week, you'll think like a Staff Engineer.                  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Week 10 begins tomorrow with Day 1: SLIs, SLOs, and SLAs.

Get ready to transform from a builder into an operator.